Multi-Agent LLMs Show Promise in Solving Complex Reasoning Problems
A new paper from researchers at the University of Oxford and other institutions details a novel training method for large language models (LLMs) that significantly boosts their performance on complex reasoning tasks. The technique, dubbed Multi-Agent LLM Training (MALT), leverages a multi-agent system in which specialized LLMs collaborate to solve problems. This collaborative approach contrasts with the typical single-model setup, where one LLM generates an answer that must then be evaluated and, if necessary, corrected by a human.
The core of MALT lies in its three specialized LLM agents: a generator, a verifier, and a refiner. The generator proposes an initial solution to a problem; the verifier critically evaluates the proposed solution, identifying potential flaws in reasoning or inaccuracies; and the refiner integrates the feedback from the verifier to produce a refined, improved solution. This iterative process mirrors human collaboration, where specialists with different skills work together to achieve a common goal.
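As a rough sketch (not the authors' implementation), this pipeline amounts to three role-specialized model calls chained together. The `call_llm` function below is a hypothetical placeholder for whatever model API is actually used:

```python
# Minimal sketch of a generate -> verify -> refine pipeline.
# `call_llm` is a hypothetical stand-in for a real model call
# (e.g. a fine-tuned model endpoint); here it only echoes its inputs.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    # Placeholder: replace with an actual model call.
    return f"[{system_prompt[:20]}...] response to: {user_prompt[:40]}..."

def solve(question: str) -> str:
    draft = call_llm(
        "You are a generator. Solve the problem step by step.",
        question,
    )
    critique = call_llm(
        "You are a verifier. Check the solution for reasoning or arithmetic errors.",
        f"Problem: {question}\nProposed solution: {draft}",
    )
    refined = call_llm(
        "You are a refiner. Use the critique to produce a corrected final solution.",
        f"Problem: {question}\nDraft: {draft}\nCritique: {critique}",
    )
    return refined

if __name__ == "__main__":
    print(solve("If a train travels 60 miles in 1.5 hours, what is its average speed?"))
```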
To train these agents jointly, the researchers developed a trajectory-expansion-based synthetic data generation process. This process creates a large number of hypothetical problem-solving trajectories—sequences of interactions between the generator, verifier, and refiner—some correct and some incorrect. Instead of relying on human-labeled data, MALT uses a value iteration-based credit assignment strategy to determine which agent introduced errors in each trajectory. This allows the system to automatically attribute credit or blame for the outcome to individual agents.
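To make the idea concrete, here is a minimal sketch, under simplifying assumptions, of how branching trajectories and value-based credit assignment could be organized: each generator output is expanded into several verifier outputs, each of those into several refiner outputs, leaf answers are scored for correctness, and values are propagated back up as the fraction of correct descendants. The class names, structure, and threshold below are illustrative assumptions, not the paper's specification.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    role: str                    # "generator", "verifier", or "refiner"
    text: str                    # the agent's output at this step
    children: list = field(default_factory=list)
    value: float = 0.0           # fraction of correct leaf outcomes below this node

def propagate_values(node: Node, is_correct) -> float:
    """Assign each node the mean correctness of the trajectories passing through it."""
    if not node.children:        # refiner leaf: score the final answer directly
        node.value = 1.0 if is_correct(node.text) else 0.0
    else:
        node.value = mean(propagate_values(c, is_correct) for c in node.children)
    return node.value

def attribute_blame(node: Node, threshold: float = 0.5) -> list:
    """Collect (role, text) pairs whose subtree mostly leads to wrong answers."""
    blamed = []
    if node.value < threshold:
        blamed.append((node.role, node.text))
    for child in node.children:
        blamed.extend(attribute_blame(child, threshold))
    return blamed

# Toy usage: one generator output, two verifier branches, two refiner leaves each.
root = Node("generator", "draft solution", children=[
    Node("verifier", "critique A", children=[
        Node("refiner", "answer: 42"), Node("refiner", "answer: 41")]),
    Node("verifier", "critique B", children=[
        Node("refiner", "answer: 42"), Node("refiner", "answer: 42")]),
])
propagate_values(root, is_correct=lambda text: text.endswith("42"))
print(attribute_blame(root))   # steps whose branches usually end in wrong answers
```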
The researchers evaluated MALT on three widely used benchmarks: MATH (mathematical reasoning), GSM8K (grade-school mathematics), and CSQA (commonsense reasoning). Using Llama 3.1 8B models, MALT achieved relative improvements of 14.14%, 7.12%, and 9.40% over a single-model baseline on MATH, GSM8K, and CSQA, respectively, indicating a substantial gain over using the same model without the multi-agent training setup.
For instance, consider a word problem from GSM8K. A single LLM might misinterpret the problem or slip in its arithmetic. In contrast, MALT's three agents collaborate as follows (a hypothetical trace is sketched after the list):

- Generator: proposes a solution based on a direct interpretation of the problem, possibly making a mistake in its arithmetic or logical reasoning.
- Verifier: identifies the error in the generator's solution, pointing out the incorrect step or flawed logic.
- Refiner: uses the verifier's feedback to correct the flaw and produces a refined solution that is more likely to be accurate.
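A purely hypothetical trace on an invented word problem (none of this text comes from the paper or the benchmark) might look like this:

```python
# Invented illustration of the three roles on a simple word problem.
problem = "A shop sells pencils at 3 for $2. How much do 12 pencils cost?"

generator_output = (
    "12 / 3 = 4 groups of pencils. Each group costs $2, "
    "so the total is 4 + 2 = $6."            # slip: added instead of multiplied
)
verifier_output = (
    "The grouping step is correct, but the last step adds instead of multiplies: "
    "4 groups at $2 each is 4 * 2 = $8."
)
refiner_output = (
    "Using the critique: 12 / 3 = 4 groups; 4 * $2 = $8. Final answer: $8."
)

for role, text in [("Generator", generator_output),
                   ("Verifier", verifier_output),
                   ("Refiner", refiner_output)]:
    print(f"{role}: {text}")
```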
The synthetic data generation process provides a wealth of training data for all three agents, including both successful and unsuccessful solution paths. MALT’s credit assignment method ensures that each agent learns from its mistakes and from the successes of its collaborators. This feedback loop leads to continuous improvement in the team’s ability to solve complex reasoning problems.
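One plausible way to turn value-annotated trajectories into per-agent training data, for example keeping high-value outputs for supervised fine-tuning and pairing high- and low-value outputs from the same context as preference pairs, is sketched below. The field names and threshold are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict

# Each record is one agent step from an expanded trajectory tree, annotated with
# the value estimated by the credit-assignment pass (fraction of correct outcomes below it).
steps = [
    {"role": "generator", "context": "Q1", "output": "draft A", "value": 0.75},
    {"role": "verifier",  "context": "Q1|draft A", "output": "critique A1", "value": 1.0},
    {"role": "verifier",  "context": "Q1|draft A", "output": "critique A2", "value": 0.0},
    {"role": "refiner",   "context": "Q1|draft A|critique A1", "output": "ans 8", "value": 1.0},
]

sft_data = defaultdict(list)          # role -> high-value outputs for supervised fine-tuning
preference_pairs = defaultdict(list)  # role -> (context, chosen, rejected) triples

by_context = defaultdict(list)
for s in steps:
    by_context[(s["role"], s["context"])].append(s)

for (role, context), group in by_context.items():
    good = [s for s in group if s["value"] >= 0.5]   # assumed threshold, not from the paper
    bad = [s for s in group if s["value"] < 0.5]
    for g in good:
        sft_data[role].append((context, g["output"]))
        for b in bad:
            preference_pairs[role].append((context, g["output"], b["output"]))

print(dict(sft_data))
print(dict(preference_pairs))
```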
While MALT is an early step, the results suggest a promising avenue for improving LLMs' reasoning abilities. The researchers acknowledge limitations, such as the need for further investigation of hyperparameter choices and of how biased data is handled. Nevertheless, MALT's strong empirical performance marks meaningful progress toward building more capable and reliable autonomous systems. Future work could explore further refinements to the training methodology, the inclusion of more agents with specialized roles, and the application of MALT to a wider range of problems.