Rex-Thinker: AI Model Achieves Human-Like Object Referring with Step-by-Step Reasoning

A new artificial intelligence model called “Rex-Thinker” can identify objects in images based on natural language descriptions, mimicking how humans approach the task through a step-by-step reasoning process. This research, outlined in a paper submitted to arXiv on June 4, 2025, addresses a significant challenge in computer vision: enabling AI to not only “see” but also “understand” and justify its predictions.

The Problem: Lack of Explainability and Hallucinations

Current object referring systems typically operate as “black boxes,” directly predicting bounding box coordinates with little visibility into how they arrive at a result. This lack of explainability makes it difficult to understand why a model made a particular prediction. These systems are also prone to “hallucination”: confidently predicting a box even when no object in the image actually matches the description, which undermines their reliability in real-world applications.

Rex-Thinker: A Chain-of-Thought Approach

Rex-Thinker tackles these issues by adopting a Chain-of-Thought (CoT) reasoning approach. Given an image and a textual description, Rex-Thinker first identifies candidate object instances corresponding to the referred category using an open-vocabulary object detector. For example, if asked to find “the person wearing a blue shirt,” Rex-Thinker first locates all people in the image. It then performs step-by-step reasoning for each candidate to verify whether it matches the description.
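A minimal sketch of this retrieve-then-verify pipeline, assuming a generic open-vocabulary detector and a reasoning model behind two hypothetical placeholder functions (detect_candidates and verify_candidate are illustrative names, not the paper’s actual API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    """A candidate bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

def detect_candidates(image, category: str) -> List[Box]:
    """Hypothetical wrapper around an open-vocabulary detector: given the
    referred category (e.g. "person"), return a box for every instance."""
    raise NotImplementedError  # plug in any open-vocabulary detector here

def verify_candidate(image, box: Box, expression: str) -> bool:
    """Hypothetical call to the reasoning model: does this candidate
    satisfy the full referring expression?"""
    raise NotImplementedError

def refer(image, category: str, expression: str) -> List[Box]:
    # Step 1: retrieve every candidate instance of the referred category.
    candidates = detect_candidates(image, category)
    # Step 2: reason about each candidate and keep only the matches.
    return [box for box in candidates if verify_candidate(image, box, expression)]
```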

Here’s how it breaks down:

  1. Planning: Rex-Thinker decomposes the referring expression into subgoals. For instance, “the person sitting on the turtle” might be broken down into identifying turtles and then checking for a person on each turtle.
  2. Action: The model evaluates each candidate based on the plan. For the turtle example, each person in the image is assessed: “Person 3: A bearded figure wearing a red hat and red clothes. He is sitting on the green turtle. ✅”
  3. Summarization: The model aggregates these individual assessments to form the final prediction.

This process is grounded in specific regions of the image through “box hints,” enabling users to trace back each reasoning step to visual evidence. The model’s reasoning steps are enclosed within <think>...</think> blocks, while the final prediction is within <answer>...</answer> blocks.
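Because the reasoning trace and the final prediction live in separate tagged blocks, a downstream application can parse them mechanically. Here is a small sketch, assuming (purely for illustration) that the answer block lists matching candidates as JSON; the paper’s exact answer schema may differ:

```python
import json
import re

def parse_output(text: str):
    """Split a Rex-Thinker-style response into its reasoning trace and its
    final answer. The JSON answer format is an illustrative assumption."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else ""
    boxes = json.loads(answer.group(1)) if answer else []
    return reasoning, boxes

reasoning, boxes = parse_output(
    "<think>Person 3 is sitting on the green turtle. ✅</think>"
    '<answer>[{"label": "person 3", "box": [412, 88, 610, 340]}]</answer>'
)
```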

HumanRef-CoT: A New Dataset for Grounded Reasoning

To facilitate this CoT approach, the researchers created HumanRef-CoT, a large-scale dataset of referring expressions annotated with step-by-step reasoning traces. The traces were generated by prompting GPT-4o (OpenAI’s multimodal model) to reason about images in the HumanRef dataset, following the structured planning, action, and summarization format.
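A rough sketch of how such traces could be collected, assuming the standard OpenAI Python client; the prompt wording below is illustrative and not the authors’ actual annotation prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    'The image contains numbered candidate boxes for the category "{category}". '
    'Referring expression: "{expression}". Reason step by step in three stages: '
    "Planning, Action (check each candidate one by one), and Summarization. "
    "Finish with the indices of the matching boxes."
)

def generate_trace(image_url: str, category: str, expression: str) -> str:
    """Ask GPT-4o for a structured planning/action/summarization trace."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": PROMPT_TEMPLATE.format(category=category,
                                                expression=expression)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```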

Two-Stage Training for Accuracy and Generalization

Rex-Thinker is trained in two stages:

  1. Supervised Fine-Tuning (SFT): This phase teaches the model to perform structured reasoning in the defined CoT format, improving interpretability.
  2. Group Relative Policy Optimization (GRPO): This reinforcement learning phase improves accuracy and generalization by encouraging the model to explore alternative reasoning paths and selectively reinforcing responses that achieve higher task-level rewards; a toy sketch of the group-relative reward idea follows this list.
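Here is that toy illustration of the group-relative reward idea behind GRPO, assuming a simple IoU-based task reward; the actual training loop, reward design, and policy update in the paper are considerably more involved:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def group_relative_advantages(sampled_boxes, gt_box):
    """Score a group of responses sampled from the same prompt with a
    task-level reward (here: IoU with the ground-truth box) and normalize
    within the group, so above-average responses get positive advantages."""
    rewards = np.array([iou(b, gt_box) for b in sampled_boxes])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-9)

# Three sampled predictions for one referring expression: two good, one bad.
adv = group_relative_advantages(
    sampled_boxes=[[10, 10, 50, 50], [12, 11, 49, 52], [200, 200, 240, 240]],
    gt_box=[11, 10, 50, 51],
)
print(adv)  # the off-target prediction gets a clearly negative advantage
```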

Results: Improved Accuracy, Interpretability, and Generalization

Experiments demonstrate that Rex-Thinker surpasses standard baselines in both precision and interpretability. On the HumanRef benchmark, it achieves state-of-the-art results while significantly reducing hallucinations, particularly in rejection cases where no matching object exists in the image. The model also shows strong zero-shot generalization to out-of-domain benchmarks such as RefCOCOg, grounding referring expressions in domains and object categories not covered by its training data.

The Future of Grounded AI

Rex-Thinker is a significant step towards more grounded and trustworthy AI systems. By making its reasoning process transparent and verifiable, it helps users understand and audit the model’s decisions. The authors note that modeling complex relationships among multiple objects remains a weakness and a natural direction for future work. This research offers valuable insights for applications that demand high reliability, from robotics to human-computer interaction.