Entropy Minimization Improves Reasoning in Large Language Models Without Labeled Data

Researchers at the University of Illinois Urbana-Champaign have discovered that a simple technique called entropy minimization (EM) can significantly enhance the reasoning capabilities of large language models (LLMs) on complex tasks like math, physics, and coding, without the need for labeled training data. This surprising finding challenges the conventional reliance on extensive datasets and complex training procedures for improving LLM performance.

The research explores three distinct EM-based approaches:

  1. EM Fine-Tuning (EM-FT): Analogous to instruction fine-tuning, except that instead of labeled data, EM-FT minimizes token-level entropy on the LLM’s own generated outputs, conditioned on the input prompts. This is like giving the model “unsupervised homework” to refine its own responses. For example, imagine an LLM is given a mathematical equation without the answer. EM-FT would encourage the model to become more confident in its predicted answer, even without knowing the true solution. (A minimal sketch of this objective appears after the list.)

  2. EM Reinforcement Learning (EM-RL): Uses reinforcement learning in which the only reward is the negative entropy of the model’s output, so the model is incentivized to produce more confident, less uncertain answers. Imagine the LLM is trying to write a program to solve a problem. EM-RL rewards the model for producing a confident, low-uncertainty program, even though there is no external signal telling it whether the program is correct. (A sketch of this reward appears after the list.)

  3. EM Inference (EM-INF): Adjusts the LLM’s output logits (the raw scores that the softmax turns into probabilities) at inference time to reduce entropy, requiring no training data or parameter updates. This method is particularly effective for complex tasks with high uncertainty. Suppose an LLM is asked a difficult physics question. EM-INF nudges the model toward one particular answer by sharpening its next-token distribution, without changing the underlying model. (A sketch of this adjustment appears after the list.)
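To make EM-FT concrete, the sketch below shows one way the token-level entropy objective could be implemented with a Hugging Face causal LM. The model name, prompt, learning rate, and generation settings are placeholders chosen for illustration, not the authors’ actual training configuration.

```python
# Illustrative EM-FT step: minimize token-level entropy on a self-generated
# completion of an unlabeled prompt. All hyperparameters are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # assumption: any causal LM could be used here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

prompt = "Solve for x: 3x + 5 = 20."  # unlabeled: no reference answer provided
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# 1) Let the model generate its own continuation (no labels involved).
with torch.no_grad():
    sequences = model.generate(**inputs, max_new_tokens=128, do_sample=True)

# 2) Recompute logits with gradients and minimize per-token entropy,
#    restricted to the generated continuation.
logits = model(sequences).logits[:, :-1, :]           # position t predicts token t+1
log_probs = F.log_softmax(logits, dim=-1)
entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # entropy at each position

loss = entropy[:, prompt_len - 1:].mean()             # EM-FT objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
```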
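For EM-RL, the core idea is that the reward signal itself is the negative entropy of the model’s own output. The sketch below pairs that reward with a plain REINFORCE-style update for readability; the paper evaluates stronger RL recipes, so treat the function names and the update rule here as simplifications rather than the authors’ implementation.

```python
# Illustrative EM-RL update: reward = negative mean token entropy of a sampled
# sequence, optimized with a simple REINFORCE step. `model`, `optimizer`, and
# `prompt_len` are assumed to be set up as in the EM-FT sketch above.
import torch
import torch.nn.functional as F

def negative_entropy_reward(model, sequences, prompt_len):
    """One scalar reward per sampled sequence: higher when the model is more confident."""
    with torch.no_grad():
        logits = model(sequences).logits[:, :-1, :]
        log_probs = F.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
        return -entropy[:, prompt_len - 1:].mean(dim=-1)

def reinforce_step(model, optimizer, sequences, prompt_len):
    rewards = negative_entropy_reward(model, sequences, prompt_len)
    advantages = rewards - rewards.mean()          # simple mean baseline over the batch

    logits = model(sequences).logits[:, :-1, :]
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
    seq_log_prob = token_log_probs[:, prompt_len - 1:].sum(dim=-1)

    loss = -(advantages * seq_log_prob).mean()     # REINFORCE: reinforce low-entropy samples
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```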
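EM-INF operates purely at decoding time. One way to realize the idea, shown below, is to take a few gradient steps on the next-token logits themselves so that their softmax distribution has lower entropy, while the model weights stay frozen. The step size, the number of steps, and the absence of any constraint on how far the distribution may move are simplifying assumptions of this sketch, not details taken from the paper.

```python
# Illustrative EM-INF adjustment at a single decoding step: optimize the logits
# (not the model weights) to reduce the entropy of the next-token distribution.
import torch
import torch.nn.functional as F

def entropy_minimized_logits(next_token_logits, steps=10, lr=0.1):
    """Return sharpened logits whose softmax has lower entropy than the input."""
    z = next_token_logits.detach().clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        log_probs = F.log_softmax(z, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return z.detach()

# Usage inside a decoding loop (next_token_logits comes from the model's forward pass):
# adjusted = entropy_minimized_logits(next_token_logits)
# next_token = adjusted.argmax(dim=-1)   # or sample from softmax(adjusted)
```

Because only the logits are updated, this kind of adjustment adds a small per-step cost at inference and leaves the underlying LLM untouched, which is what makes it attractive when no training data or compute budget for fine-tuning is available.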

The results are compelling. On the Qwen-7B model, EM-RL achieved performance comparable to, or even exceeding, state-of-the-art reinforcement learning baselines like GRPO and RLOO, which are trained on 60,000 labeled examples. Further, EM-INF allowed the Qwen-32B model to match or surpass the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the demanding SciCode benchmark for scientific coding.

The key is that by reducing uncertainty, LLMs that already possess some latent reasoning ability can be nudged towards more accurate and confident answers. For example, the authors found that Qwen-2.5 models improved more with EM on math tasks than Llama-3.1-8B did, suggesting the Qwen models have more initial mathematical reasoning capability to leverage.

The authors caution that EM is most effective when model confidence correlates with correctness. It is less suitable for tasks like aligning with human values, where confidence alone is not a reliable indicator of quality. Also, EM relies on the pre-trained model already having some capability in the task; it is not a substitute for pre-training itself. The researchers emphasize the importance of using EM as a baseline when assessing the true impact of new algorithms for post-pretraining and inference-time scaling.

These findings open avenues for improving LLMs in resource-constrained environments and for tasks where labeled data is scarce, and suggest that minimizing entropy is a surprisingly effective method for eliciting latent reasoning capabilities.