Making, Not Taking: New Fusion Method Outperforms Best-of-N for Large Language Model Outputs
San Francisco, CA – October 2, 2025 – A new research paper from Cohere Labs proposes a novel approach to generating outputs from large language models (LLMs) that moves beyond simply selecting the “best” single response. The method, dubbed “Fusion-of-N” (FUSION), actively synthesizes information from multiple generated candidates into a superior final output. This contrasts with the traditional “Best-of-N” (BON) approach, which discards all but the single highest-scoring generation.
The paper, titled “Making, not Taking, the Best of N,” argues that the BON method, while effective, is inherently wasteful. By discarding diverse and potentially valuable information from multiple generated options, BON limits the overall quality and utility of LLM outputs. FUSION, on the other hand, treats multiple LLM generations as collaborators, integrating their complementary strengths into a single, more robust and informative result.
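To make the contrast concrete, here is a minimal sketch of the two strategies. The names are hypothetical stand-ins: `llm` is any callable mapping a prompt string to a response string, and `score` is an assumed reward-model scorer; the paper does not tie the idea to any particular API.

```python
# Minimal sketch contrasting Best-of-N selection with Fusion-of-N synthesis.
# `llm` and `score` are hypothetical stand-ins, not names from the paper.

def best_of_n(prompt, llm, score, n=8):
    """Sample n candidates and keep only the highest-scoring one."""
    candidates = [llm(prompt) for _ in range(n)]
    return max(candidates, key=score)  # the other n-1 drafts are thrown away

def fusion_of_n(prompt, llm, n=8):
    """Sample n candidates, then ask an LLM 'fusor' to synthesize them."""
    candidates = [llm(prompt) for _ in range(n)]
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    fusor_prompt = (
        f"Task: {prompt}\n\n"
        f"Here are {n} candidate responses.\n\n{numbered}\n\n"
        "Combine the complementary strengths of these candidates into one "
        "response that is better than any of them individually."
    )
    return llm(fusor_prompt)  # every candidate can contribute to the output
```

The difference is confined to the final step: BON reduces the candidate pool to an argmax over a scalar score, while FUSION hands the whole pool back to a model that can mix and match.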
Imagine asking an LLM to write a creative story. With BON, you might get several story drafts, and the system picks the one that a judge deems “best.” However, one draft might have a brilliant plot twist, while another has incredibly vivid descriptions. BON would discard the excellent descriptions. FUSION, however, would analyze all these drafts and intelligently combine the strongest elements – the compelling plot from one and the evocative imagery from another – to create an even better story.
The researchers evaluated FUSION in two key scenarios:
- Test-time Scaling: This involves generating multiple responses from a single LLM at the time of inference and then aggregating them. FUSION consistently outperformed BON in this setting, demonstrating substantial improvements in tasks like open-ended generation and machine translation across 11 languages. For instance, in one test, FUSION increased win rates by up to 10.8% compared to BON.
- Synthetic Data Generation: Here, FUSION is used to create higher-quality training data for fine-tuning smaller LLMs (a sketch of this pipeline follows the list). The study found that models fine-tuned on FUSION-generated data showed significant downstream improvements, surpassing those trained on BON-generated data: the quality of the training material carries through to the student model's performance.
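The synthetic-data setting can be summarized in a short sketch. Everything here is illustrative: `teacher_llm` and `fusor_llm` are assumed callables that map a prompt to a response string, and the JSONL output is a common fine-tuning convention, not something the paper mandates.

```python
import json

def fuse(prompt, candidates, fusor_llm):
    """Ask a 'fusor' LLM to merge several candidate responses into one."""
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    return fusor_llm(
        f"Task: {prompt}\n\nCandidate responses:\n\n{numbered}\n\n"
        "Synthesize a single response that combines their strengths."
    )

def build_fused_dataset(prompts, teacher_llm, fusor_llm, n=5,
                        out_path="fused_sft.jsonl"):
    """Write (prompt, fused completion) pairs for fine-tuning a student model."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            candidates = [teacher_llm(prompt) for _ in range(n)]
            record = {
                "prompt": prompt,
                "completion": fuse(prompt, candidates, fusor_llm),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```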
The paper highlights that FUSION’s strengths lie in its ability to leverage diversity. It shows robustness even when working with smaller or less capable “teacher” models that generate the initial candidates. This “polylithic” understanding of quality, acknowledging that different parts of a generation can have varying degrees of merit, is crucial for generating complex and nuanced outputs.
“We should shift how we think about evaluating and utilizing LLM generations from a monolithic measure of quality, to embracing their polylithic nature,” the authors state. “This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.”
FUSION is presented as a simple yet powerful alternative to BON, requiring only access to a capable LLM to act as a “fusor” or judge. The research suggests that FUSION represents a more effective and sustainable paradigm for leveraging the collective capabilities of modern LLMs.
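To underline how little plumbing this requires, here is one way the `fusion_of_n` sketch above might be wired to an OpenAI-compatible chat endpoint. The client setup, model name, and temperature are placeholders; the paper does not prescribe a provider.

```python
# Hedged wiring example, reusing the fusion_of_n sketch above. The model
# name and client are placeholders, not details from the paper; any
# OpenAI-compatible chat endpoint would work the same way.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm(prompt: str) -> str:
    """One chat-completion call; the same callable serves as the fusor."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # nonzero temperature keeps the N candidates diverse
    )
    return response.choices[0].message.content

story = fusion_of_n("Write a short story about a lighthouse keeper.", llm, n=4)
print(story)
```

Any endpoint that accepts a prompt and returns text will do; the “fusor” is just one more call to the same interface.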