KnowRL: Teaching Large Language Models to Think Factually
Large Language Models (LLMs) are becoming increasingly adept at complex reasoning. However, a significant problem known as “hallucination” persists, where these models confidently generate incorrect information. This issue is particularly prevalent in “slow-thinking” LLMs, which generate long, explicit chains of reasoning before answering. While Reinforcement Learning (RL) has shown promise in improving LLM reasoning, traditional RL methods typically reward only the final outcome and ignore the factual accuracy of the intermediate reasoning steps. The result can be models whose reasoning is logically coherent but factually wrong.
To combat this, researchers have introduced KnowRL, a novel framework that integrates a “factuality reward” into the RL training process. This reward, derived from verifying factual claims against an external knowledge base, guides the LLM to reason factually and recognize its own knowledge boundaries.
How KnowRL Works:
KnowRL employs a two-stage approach. First, it constructs a high-quality dataset of factual questions, which is used for an initial Supervised Fine-Tuning (SFT) phase that establishes a solid factual foundation. The core of KnowRL lies in the subsequent RL stage, which uses a specialized reward function combining three components (a minimal code sketch of how they might fit together follows the list):
- Format Reward: Ensures the model adheres to a structured thinking-and-answering format (e.g., <think>...</think><answer>...</answer>).
- Correct Reward: Directly assesses the accuracy of the final answer, typically judged by an LLM such as GPT-4o-mini.
- Fact Reward: This is the crucial component. It breaks down the model’s thinking process into atomic facts and verifies each fact against a knowledge base (like Wikipedia). The more facts supported by the knowledge base, the higher the reward.
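To make the combination concrete, here is a minimal Python sketch of how these three components could be wired into a single scalar reward. The function names, weights, judge prompt, and the injected `fact_score` callable are illustrative assumptions; the summary does not specify the paper’s exact formulation.

```python
import re
from typing import Callable


def extract_tag(response: str, tag: str) -> str:
    """Return the text inside <tag>...</tag>, or an empty string if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    return match.group(1).strip() if match else ""


def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template."""
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0


def correct_reward(answer: str, reference: str, judge: Callable[[str], str]) -> float:
    """Ask an LLM judge (e.g. GPT-4o-mini) whether the final answer matches the reference."""
    prompt = (
        f"Answer: {answer}\nReference: {reference}\n"
        "Does the answer state the same fact as the reference? Reply yes or no."
    )
    return 1.0 if judge(prompt).strip().lower().startswith("yes") else 0.0


def knowrl_reward(
    response: str,
    reference: str,
    judge: Callable[[str], str],
    fact_score: Callable[[str], float],
    w_format: float = 1.0,   # weights are illustrative, not the paper's values
    w_correct: float = 1.0,
    w_fact: float = 1.0,
) -> float:
    """Combine the format, correctness, and factuality signals into one scalar reward."""
    thinking = extract_tag(response, "think")
    answer = extract_tag(response, "answer")
    return (
        w_format * format_reward(response)
        + w_correct * correct_reward(answer, reference, judge)
        + w_fact * fact_score(thinking)
    )
```

During RL, a reward like `knowrl_reward` would be evaluated on each sampled rollout, with `fact_score` supplied by a factuality check along the lines of the one sketched after the next paragraph.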
By rewarding adherence to facts at each reasoning step, KnowRL encourages the model to build reliable reasoning chains and avoid “fabricating” information.
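The fact reward itself can be sketched as a fraction-of-supported-atomic-facts score. In the sketch below, `extract_facts` and `is_supported` are stand-ins for what would in practice be an LLM-based claim splitter and a retrieval-plus-entailment check against Wikipedia; the toy wiring at the bottom is purely illustrative.

```python
from typing import Callable, List


def fact_reward(
    thinking: str,
    extract_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Fraction of atomic facts in the reasoning trace that the knowledge base supports."""
    facts = extract_facts(thinking)
    if not facts:
        return 0.0  # no checkable claims, no factuality credit
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


if __name__ == "__main__":
    # Toy wiring only: real pipelines would split claims with an LLM and verify
    # them against retrieved Wikipedia passages.
    toy_kb = ["The Eiffel Tower is in Paris and was completed in 1889."]

    def naive_extract(text: str) -> List[str]:
        return [s.strip() for s in text.split(".") if s.strip()]

    def naive_support(fact: str) -> bool:
        return any(fact in entry for entry in toy_kb)

    trace = "The Eiffel Tower is in Paris. It was completed in 1889."
    print(fact_reward(trace, naive_extract, naive_support))  # 0.5: substring matching misses the coreferent claim
```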
Key Findings and Examples:
The paper demonstrates that KnowRL significantly mitigates hallucinations across various datasets. For instance, models trained with KnowRL achieved leading accuracy on hallucination benchmarks like TruthfulQA and SimpleQA. Crucially, these models either maintained or even improved their reasoning capabilities on datasets like GPQA and AIME 2025, which test general and mathematical reasoning, respectively.
A concrete example of KnowRL’s impact can be seen in how it handles a question like: “Which military commander was the son of Simon Bolivar Buckner?” A typical slow-thinking model might conflate Simon Bolivar Buckner with Simón Bolívar, the South American revolutionary, and fabricate a plausible-sounding descendant. A KnowRL-trained model, by contrast, is pushed to anchor each step in verifiable facts: identify Simon Bolivar Buckner as a Confederate general, recall that his son shared his name, and confirm against the knowledge base that Simon Bolivar Buckner Jr. was a U.S. Army general. Because each of these atomic facts is supported by the knowledge base, the fact reward reinforces the chain, yielding a more accurate and factually grounded answer.
The research also highlights the importance of a “cold-start” SFT phase before RL training for factual tasks. Without this initial alignment, RL training alone struggles to establish proper factual reasoning. Furthermore, the study suggests that while KnowRL is effective, overtraining during the initial SFT phase can negatively impact subsequent RL performance.
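As a rough illustration of that finding, the sketch below lays out the implied training schedule: a short cold-start SFT pass followed by the reward-driven RL stage. Everything here (class and function names, epoch and step budgets, the `sft_step`/`rl_step` callables) is a hypothetical scaffold, not the paper’s training code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class KnowRLSchedule:
    """Budget for the two training stages (values are illustrative)."""
    sft_epochs: int = 1     # keep the cold start short: over-training SFT reportedly hurts later RL
    rl_steps: int = 1000    # hypothetical budget for the reward-driven stage


def train_knowrl(
    model,
    sft_step: Callable[[object, object], None],
    rl_step: Callable[[object, List[str]], None],
    factual_qa: Iterable,
    rl_prompts: Iterable[List[str]],
    schedule: Optional[KnowRLSchedule] = None,
):
    """Cold-start SFT on curated factual QA, then RL with the KnowRL reward."""
    schedule = schedule or KnowRLSchedule()

    # Stage 1: without this alignment pass, RL alone reportedly struggles to
    # establish factual reasoning.
    for _ in range(schedule.sft_epochs):
        for batch in factual_qa:
            sft_step(model, batch)

    # Stage 2: optimize the combined format / correctness / factuality reward.
    for _, prompt_batch in zip(range(schedule.rl_steps), rl_prompts):
        rl_step(model, prompt_batch)

    return model
```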
In conclusion, KnowRL offers a promising approach to address the critical issue of hallucination in LLMs by directly supervising the reasoning process with factual grounding, paving the way for more reliable and trustworthy AI systems.