Rex-Omni is a 3B-parameter Multimodal Large Language Model (MLLM) that redefines object detection and a wide range of other visual perception tasks as a simple next-token prediction problem.

```bash
git clone https://github.com/IDEA-Research/Rex-Omni.git
cd Rex-Omni
conda create -n rexomni python=3.10 -y
conda activate rexomni
pip install torch==2.7.0 torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
```
Test Installation
```bash
CUDA_VISIBLE_DEVICES=1 python tutorials/detection_example/detection_example.py
```
If the installation is successful, you will find a visualization of the detection results at `tutorials/detection_example/test_images/cafe_visualize.jpg`.
Below is a minimal example showing how to run object detection with the `rex_omni` package.
```python
from PIL import Image
from rex_omni import RexOmniWrapper, RexOmniVisualize

# 1) Initialize the wrapper (model loads internally)
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",  # HF repo or local path
    backend="transformers",  # or "vllm" for high-throughput inference
    # Inference/generation controls (applied across backends)
    max_tokens=2048,
    temperature=0.0,
    top_p=0.05,
    top_k=1,
    repetition_penalty=1.05,
)

# If you are using the AWQ-quantized version of Rex-Omni, initialize the wrapper like this instead:
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni-AWQ",
    backend="vllm",
    quantization="awq",
    max_tokens=2048,
    temperature=0.0,
    top_p=0.05,
    top_k=1,
    repetition_penalty=1.05,
)

# 2) Prepare input
image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
categories = [
    "man", "woman", "yellow flower", "sofa", "robot-shaped light",
    "blanket", "microwave", "laptop", "cup", "white chair", "lamp",
]

# 3) Run detection
results = rex.inference(images=image, task="detection", categories=categories)
result = results[0]

# 4) Visualize
vis = RexOmniVisualize(
    image=image,
    predictions=result["extracted_predictions"],
    font_size=20,
    draw_width=5,
    show_labels=True,
)
vis.save("tutorials/detection_example/test_images/cafe_visualize.jpg")
```
Key arguments and return format:

- `backend`: `"transformers"` or `"vllm"`; the latter requires the `vllm` package and a compatible environment.
- Additional keyword arguments are forwarded to the chosen backend: for `transformers`, `torch_dtype`, `attn_implementation`, `device_map`, `trust_remote_code`, etc.; for `vllm`, `tokenizer_mode`, `limit_mm_per_prompt`, `max_model_len`, `gpu_memory_utilization`, `tensor_parallel_size`, `trust_remote_code`, etc.
- `images`: a single `PIL.Image.Image` or a list of images for batch inference.
- `task`: one of `"detection"`, `"pointing"`, `"visual_prompting"`, `"keypoint"`, `"ocr_box"`, `"ocr_polygon"`, `"gui_grounding"`, `"gui_pointing"`.
- `categories`: category names such as `["person", "cup"]`. Used to build task prompts.

`inference` returns a list of dictionaries (one per input image). Each dictionary includes `extracted_predictions` in one of the following formats:

- boxes: `{category: [{"type": "box", "coords": [x0, y0, x1, y1]}, ...], ...}`
- points: `{category: [{"type": "point", "coords": [x0, y0]}, ...], ...}`
- polygons: `{category: [{"type": "polygon", "coords": [x0, y0, ...]}, ...], ...}`
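For example, here is a minimal sketch of walking over one image's predictions, continuing from the quickstart above and assuming only the return format documented in this list:

```python
# Walk over one image's predictions; every entry carries a "type"
# ("box", "point", or "polygon") and a flat "coords" list.
result = results[0]  # `results` from the quickstart above
for category, preds in result["extracted_predictions"].items():
    for pred in preds:
        if pred["type"] == "box":
            x0, y0, x1, y1 = pred["coords"]
            print(f"{category}: box ({x0:.1f}, {y0:.1f}) -> ({x1:.1f}, {y1:.1f})")
        else:
            print(f"{category}: {pred['type']} with {len(pred['coords'])} coordinate values")
```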
backend="vllm" and tune gpu_memory_utilization and tensor_parallel_size according to your GPUs.We provide comprehensive tutorials for each supported task. Each tutorial includes both standalone Python scripts and interactive Jupyter notebooks.
| Task | Applications | Demo | Python Example | Notebook |
|---|---|---|---|---|
| Detection | object detection | ![]() | code | notebook |
| | object referring | ![]() | code | notebook |
| | gui grounding | ![]() | code | notebook |
| | layout grounding | ![]() | code | notebook |
| Pointing | object pointing | ![]() | code | notebook |
| | gui pointing | ![]() | code | notebook |
| | affordance pointing | ![]() | code | notebook |
| Visual prompting | visual prompting | ![]() | code | notebook |
| OCR | ocr word box | ![]() | code | notebook |
| | ocr textline box | ![]() | code | notebook |
| | ocr polygon | ![]() | code | notebook |
| Keypointing | person keypointing | ![]() | code | notebook |
| | animal keypointing | ![]() | code | notebook |
| Other | batch inference (see the sketch below) | | code | |
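As referenced in the table, here is a minimal batch-inference sketch. It assumes only what the API notes above state, namely that `images` accepts a list of `PIL.Image.Image` and that one result dictionary comes back per image; the file names are placeholders.

```python
# Minimal batch-inference sketch: pass a list of PIL images in one call and
# get back one result dictionary per image (see the API notes above).
from PIL import Image
from rex_omni import RexOmniWrapper

rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="vllm",  # vllm suits batched, high-throughput workloads (see Tips above)
    max_tokens=2048,
    temperature=0.0,
    top_p=0.05,
    top_k=1,
    repetition_penalty=1.05,
)

paths = ["image_0.jpg", "image_1.jpg"]  # placeholder file names
images = [Image.open(p).convert("RGB") for p in paths]

results = rex.inference(images=images, task="detection", categories=["person", "cup"])
for path, result in zip(paths, results):
    print(path, list(result["extracted_predictions"].keys()))
```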
Rex-Omni's unified detection framework enables seamless integration with other vision models.
| Application | Description | Demo | Documentation |
|---|---|---|---|
| Rex-Omni + SAM | Combine language-driven detection with pixel-perfect segmentation: Rex-Omni detects objects → SAM generates precise masks. See the first sketch below. | ![]() | README |
| Grounding Data Engine | Automatically generate phrase grounding annotations from image captions using spaCy and Rex-Omni. See the second sketch below. | ![]() | README |
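Below is a minimal sketch of the Rex-Omni + SAM combination. It assumes the segment-anything package is installed and a SAM ViT-H checkpoint has been downloaded (the checkpoint path is a placeholder); the linked README documents the integration shipped with this repo.

```python
# Rex-Omni proposes language-conditioned boxes; SAM turns each box into a mask.
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry
from rex_omni import RexOmniWrapper

rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="transformers",  # add generation settings as in the quickstart if needed
)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # placeholder path
predictor = SamPredictor(sam)

image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
predictor.set_image(np.array(image))

# 1) Detect objects with Rex-Omni
result = rex.inference(images=image, task="detection", categories=["cup", "laptop"])[0]

# 2) Prompt SAM with each detected box to get a mask
for category, preds in result["extracted_predictions"].items():
    for pred in preds:
        if pred["type"] != "box":
            continue
        masks, scores, _ = predictor.predict(
            box=np.array(pred["coords"]),  # [x0, y0, x1, y1]
            multimask_output=False,
        )
        print(category, masks.shape, float(scores[0]))
```

And a minimal sketch of the grounding-data-engine idea, assuming spaCy and its `en_core_web_sm` model are installed; the caption and phrase handling here are simplified, see the linked README for the full engine.

```python
# Pull noun phrases out of a caption with spaCy and let Rex-Omni ground them.
import spacy
from PIL import Image
from rex_omni import RexOmniWrapper

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed
caption = "a man and a woman sitting on a sofa next to a lamp"
phrases = sorted({chunk.text for chunk in nlp(caption).noun_chunks})

rex = RexOmniWrapper(model_path="IDEA-Research/Rex-Omni", backend="transformers")
image = Image.open("tutorials/detection_example/test_images/cafe.jpg").convert("RGB")
result = rex.inference(images=image, task="detection", categories=phrases)[0]

print(result["extracted_predictions"])  # phrase -> grounded boxes
```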

We provide an interactive Gradio demo that allows you to test all Rex-Omni capabilities through a web interface.
```bash
# Launch the demo
CUDA_VISIBLE_DEVICES=0 python app.py --model_path IDEA-Research/Rex-Omni

# With custom settings
CUDA_VISIBLE_DEVICES=0 python app.py \
    --model_path IDEA-Research/Rex-Omni \
    --backend vllm \
    --server_name 0.0.0.0 \
    --server_port 7890
```
- `--model_path`: Model path or HuggingFace repo ID (default: `"IDEA-Research/Rex-Omni"`)
- `--backend`: Backend to use, `"transformers"` or `"vllm"` (default: `"transformers"`)
- `--server_name`: Server host address (default: `"192.168.81.138"`)
- `--server_port`: Server port (default: `5211`)
- `--temperature`: Sampling temperature (default: `0.0`)
- `--top_p`: Nucleus sampling parameter (default: `0.05`)
- `--max_tokens`: Maximum tokens to generate (default: `2048`)

Please refer to Evaluation for more details.
Please refer to Fine-tuning Rex-Omni for more details.
Rex-Omni is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. This model is based on Qwen, which is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Rex-Omni builds on a series of prior works; if you're interested, take a look.
```bibtex
@misc{jiang2025detectpointprediction,
  title={Detect Anything via Next Point Prediction},
  author={Qing Jiang and Junan Huo and Xingyu Chen and Yuda Xiong and Zhaoyang Zeng and Yihao Chen and Tianhe Ren and Junzhi Yu and Lei Zhang},
  year={2025},
  eprint={2510.12798},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.12798},
}
```