
FACTORY: A New Benchmark Challenges AI's Grasp of Factual Accuracy

In the rapidly evolving landscape of artificial intelligence, ensuring that large language models (LLMs) generate factually accurate information is paramount. However, evaluating this “factuality,” especially in long-form responses, has proven to be a significant hurdle. A new research paper introduces FACTORY, a comprehensive, human-verified prompt set designed to rigorously test the factual accuracy of these advanced AI systems.

Existing benchmarks often rely on automatically generated prompts, which can be too simplistic or contain inherent flaws like ambiguity or time-sensitivity. This can lead to LLMs achieving deceptively high scores, masking underlying issues. FACTORY aims to overcome these limitations by providing a more challenging and reliable evaluation framework.

The research team, from FAIR at Meta, developed FACTORY through a “model-in-the-loop” approach. They began with broad topics from Wikipedia, using LLMs to generate a diverse range of prompts. Crucially, they then employed LLMs to filter out easier prompts, ensuring a high level of difficulty. Human annotators then meticulously refined these prompts, making sure they were fact-seeking, answerable, unambiguous, and not time-sensitive or unsafe.
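To make the pipeline concrete, here is a minimal Python sketch of the generate-then-filter stage described above. The function names (`build_prompt_pool`, `generate`, `is_easy`) are hypothetical stand-ins, not the paper's actual code, and the human-annotation stage (checking that prompts are fact-seeking, answerable, unambiguous, and not time-sensitive or unsafe) is assumed to happen downstream and is not modeled here.

```python
from typing import Callable, List

def build_prompt_pool(
    topics: List[str],
    generate: Callable[[str], str],
    is_easy: Callable[[str], bool],
) -> List[str]:
    """Model-in-the-loop prompt construction, roughly as described:
    1) draft a fact-seeking prompt per Wikipedia topic with an LLM,
    2) use an LLM-based filter to discard prompts the model answers easily.
    Human review of the surviving prompts happens after this step."""
    pool: List[str] = []
    for topic in topics:
        # Ask the generator model for a candidate fact-seeking prompt.
        draft = generate(
            f"Write one challenging, fact-seeking question about: {topic}"
        )
        # Keep only prompts the filtering model finds hard.
        if not is_easy(draft):
            pool.append(draft)
    return pool

# Toy run with stand-in stubs; a real setup would call an LLM API.
if __name__ == "__main__":
    prompts = build_prompt_pool(
        topics=["UK housing law", "Antarctic exploration"],
        generate=lambda p: f"[model-drafted prompt for: {p}]",
        is_easy=lambda prompt: False,  # keep everything in this toy run
    )
    print(prompts)
```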

A key innovation in FACTORY is its “hard” split, a subset of prompts that significantly challenge current state-of-the-art LLMs. In tests, responses to these “hard” prompts contained factual inaccuracies in approximately 40% of claims, a stark contrast to the roughly 10% found in responses to prompts from older benchmarks. This indicates that FACTORY exposes genuine weaknesses in how LLMs handle complex, real-world information.

For instance, FACTORY includes prompts that require models to synthesize information from multiple sources or reason about nuanced historical events. One example is a prompt about the legal framework for tenant rights in the UK, which demands more than retrieving a single fact. By contrast, older benchmarks contain prompts like “Who is Emilia Chico?”, where the answer may be obscure or simply unavailable, making the prompt effectively unanswerable. Another older-benchmark prompt asked about the “latest 50 kernel versions,” which quickly becomes outdated and thus time-sensitive, undermining its evaluation value. FACTORY avoids these pitfalls by focusing on verifiable, enduring factual information.

Benchmarking six leading LLMs confirmed FACTORY’s rigor: while these models achieved around 90% factual precision on existing benchmarks, their precision dropped to approximately 60% on FACTORY’s hard subset. This sharp decline shows how effectively FACTORY pushes the boundaries of LLM factuality evaluation.
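The factual-precision numbers above are claim-level: each long-form response is decomposed into atomic claims, each claim is verified, and precision is the supported fraction. Below is a minimal sketch of that metric, in the spirit of claim-decomposition evaluators; `extract_claims` and `is_supported` are hypothetical placeholders for an LLM-based claim extractor and a retrieval-backed verifier, not the paper's exact pipeline.

```python
from typing import Callable, List

def factual_precision(
    responses: List[str],
    extract_claims: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Claim-level factual precision: decompose each response into
    atomic claims, verify each one, and report supported / total."""
    total = supported = 0
    for response in responses:
        for claim in extract_claims(response):
            total += 1
            supported += int(is_supported(claim))
    return supported / total if total else 0.0

# Toy usage with naive stand-ins: split on periods, and mark claims
# containing "1966" as supported. A real pipeline would use an LLM
# extractor and an evidence-backed verifier instead.
resp = ["The law passed in 1966. It repealed an earlier act."]
print(factual_precision(
    resp,
    extract_claims=lambda r: [c.strip() for c in r.split(".") if c.strip()],
    is_supported=lambda c: "1966" in c,
))  # prints 0.5: one of two claims verified
```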

The research also analyzes the error types that plague older benchmarks: unanswerable prompts, hallucinated prompts (auto-generated questions built on plausible but false premises), and time sensitivity. FACTORY’s human verification and its longer, more detailed prompts, which require reasoning across facts, mitigate these problems. The study suggests that LLMs struggle with FACTORY for two reasons: the need for “long-tailed” knowledge (less common but still important facts) and the complex reasoning required to synthesize that information.

In conclusion, FACTORY represents a significant advance in evaluating the factual accuracy of LLMs, particularly for long-form content. By offering a challenging, human-verified prompt set, it provides a more reliable measure of model capabilities and sets a higher bar for building trustworthy, knowledgeable AI systems.