
Vision Language Models Can Now Reason Through Complex Problems

Large language models (LLMs) are getting better at reasoning, and new research shows that vision-language models (VLMs) can now solve complex multimodal problems using a “chain-of-thought” (CoT) approach. This involves breaking a problem down into smaller steps and reasoning through them one at a time, which makes it easier for the model to understand the task and arrive at the correct answer.
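To make this concrete, here is a rough sketch of the difference between direct-answer prompting and CoT prompting. The prompt wording is a hypothetical illustration, not taken from the paper.

```python
# Illustrative only: the same question posed as a direct-answer prompt
# versus a chain-of-thought prompt. Wording is a hypothetical example.

question = "How many food items are shown in the bar graph?"

direct_prompt = f"{question}\nAnswer with a single number."

cot_prompt = (
    f"{question}\n"
    "Let's think step by step: list each bar in the graph, "
    "then add up the count and state the total."
)

# A direct-answer model emits only the final number; a CoT-trained model
# first emits the enumeration and then the answer, making each step checkable.
print(direct_prompt)
print(cot_prompt)
```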

One challenge in training these models to reason is that most existing datasets are dominated by short answers, which provide little insight into the reasoning process. This makes it difficult for models to generalize their reasoning abilities to new tasks that require more detailed responses.

To overcome this challenge, the researchers have proposed a two-fold approach. First, they distilled rationales from GPT-4o, a powerful multimodal model, to enrich the training data for the target VLM. Second, they applied reinforcement learning to further calibrate the reasoning quality: positive and negative pairs of reasoning chains generated by the model itself are constructed, and the Direct Preference Optimization (DPO) algorithm is used to refine the model’s reasoning abilities.
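The DPO stage can be sketched in a few lines of PyTorch. The objective below is the standard DPO loss as it is commonly implemented; the β value is an illustrative assumption, not a detail from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on (chosen, rejected) pairs of reasoning chains.

    Each argument is the summed log-probability of a complete reasoning
    chain under the trainable policy or a frozen reference copy of the model.
    """
    # Implicit reward of each chain: how much more likely the policy
    # makes it compared to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred chain's reward above the dispreferred chain's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice, each chain is sampled from the model itself; one plausible labeling rule is to mark a chain as positive when its final answer matches the gold label and negative otherwise, though the paper’s exact pairing construction may differ.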

For example, consider the task of counting the number of food items shown in a bar graph. Humans would typically enumerate the bars one by one and then calculate the total. However, most existing datasets simply provide the short answer “14” without explicitly stating the enumeration steps.
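To see what the rationale-distillation stage buys, here is a hypothetical before-and-after view of a single training record; the field names and rationale text are invented for illustration.

```python
# Hypothetical record layouts (field names invented for this sketch).

# Before: a short-answer record gives the model nothing to imitate
# beyond the final number.
short_answer_record = {
    "image": "bar_graph.png",
    "question": "How many food items are shown in the bar graph?",
    "answer": "14",
}

# After: the same record enriched with a distilled rationale that walks
# through the enumeration before committing to the answer.
enriched_record = {
    **short_answer_record,
    "rationale": (
        "Count the bars one by one: 1, 2, 3, ... up to the last bar, "
        "which is the 14th. The graph therefore shows 14 food items."
    ),
}
```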

By using the proposed approach, researchers were able to train VLMs that can not only provide the correct answer but also generate a detailed chain of thought that explains the reasoning process. This makes the models more interpretable and trustworthy, as it allows users to understand how they arrived at their conclusions.

The researchers evaluated their approach on a variety of tasks, including common sense reasoning, chart interpretation, document information localization, real-world text extraction, scientific reasoning, and mathematical reasoning. They found that their method significantly improved the CoT reasoning abilities of VLMs across all tasks, leading to better generalization and performance on direct answer prediction as well.

This work highlights the importance of incorporating detailed rationales into training data and of leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs. It paves the way for more robust and interpretable multimodal models that can better understand and reason about the world.