AI Papers Reader

Personalized digests of latest AI research

"SimpleQA Verified" Benchmark Sets New Standard for LLM Factuality Evaluation

A new benchmark, “SimpleQA Verified,” has been introduced to more accurately assess the factual accuracy of Large Language Models (LLMs). Developed by researchers at Google, this benchmark aims to address significant limitations found in previous evaluation datasets, offering a more reliable and challenging way to measure an LLM’s ability to recall factual information stored within its parameters.

LLMs are increasingly relied upon for factual information, but concerns about “hallucinations” – instances where models generate incorrect or fabricated information – persist. While many benchmarks test an LLM’s ability to reason over provided context or retrieve information from external sources, “SimpleQA Verified” focuses specifically on the model’s internal knowledge base. This parametric factuality is crucial for applications where external tools are not available or practical.

The original “SimpleQA” benchmark, released by OpenAI, was a step forward in evaluating parametric factuality. However, the researchers behind “SimpleQA Verified” identified several issues: noisy and inaccurate labels, a bias towards specific topics, and redundant questions. These problems could lead to LLMs “overfitting” to the benchmark’s quirks rather than demonstrating genuine factual recall.

To overcome these limitations, “SimpleQA Verified” underwent a rigorous multi-stage curation process. This involved:

  • De-duplication: Removing highly similar questions, both semantically and through keyword analysis, to ensure each prompt presents a unique challenge. For instance, instead of multiple questions about the founding dates of various Colombian municipalities, only one representative question was kept (a minimal similarity-based sketch follows this list).
  • Topic and Answer Type Balancing: Re-balancing the distribution of topics and answer types (e.g., dates, places, numbers) to avoid unfairly penalizing models that might be weaker in certain areas.
  • Source Reconciliation: Verifying ground-truth answers by cross-referencing multiple sources and resolving conflicts between them. For example, if sources gave slightly different dates for an event, the benchmark would either reconcile them or, for numeric answers, define an acceptable margin of error.
  • Publisher Preferences: Respecting web publishers’ robots.txt files, which dictate how their content can be accessed, by removing questions linked to sites that opt out of having their content used for AI training (see the robots.txt sketch after the list).
  • Increasing Benchmark Headroom: Filtering out questions that are too easily answered by current frontier models, ensuring the benchmark remains challenging for state-of-the-art systems.
  • Manual Review and Metadata Enrichment: A final manual review checked the quality of URLs and the precision of dates, and added metadata flagging reasoning or multi-step questions.
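
To make the de-duplication step concrete, here is a minimal sketch that keeps only one question from each cluster of near-duplicates. It uses simple keyword (Jaccard) overlap as a stand-in for the combined semantic and keyword analysis described in the paper; the threshold, tokenization, and example prompts are illustrative choices, not the authors’ settings.

```python
def tokens(text: str) -> set[str]:
    """Lowercase the question and keep alphanumeric word tokens only."""
    return set("".join(c if c.isalnum() else " " for c in text.lower()).split())

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(questions: list[str], threshold: float = 0.6) -> list[str]:
    """Keep a question only if it is not too similar to any already-kept question."""
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for question in questions:
        t = tokens(question)
        if all(jaccard(t, seen) < threshold for seen in kept_tokens):
            kept.append(question)
            kept_tokens.append(t)
    return kept

if __name__ == "__main__":
    prompts = [
        "In what year was the municipality of Ramiriquí, Boyacá, Colombia, founded?",
        "In what year was the municipality of Sora, Boyacá, Colombia, founded?",
        "Who composed the score for the 1994 film The Lion King?",
    ]
    # The two near-identical municipality questions collapse to a single representative.
    print(deduplicate(prompts))
```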

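The publisher-preference filter can be approximated with Python’s standard robots.txt parser. The sketch below assumes that opting out is signalled by disallowing an AI-training user agent such as “Google-Extended” for a question’s source URL; the exact user agents and opt-out rules checked by the authors are assumptions here.

```python
from urllib.parse import urlparse, urlunparse
from urllib.robotparser import RobotFileParser

AI_TRAINING_AGENT = "Google-Extended"  # illustrative; substitute the agent(s) of interest

def allowed_for_ai_training(source_url: str, agent: str = AI_TRAINING_AGENT) -> bool:
    """Return True if the site's robots.txt does not block `agent` for this URL."""
    parts = urlparse(source_url)
    robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
    parser = RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()  # fetch and parse the live robots.txt
    except OSError:
        return True  # unreachable robots.txt: treat as no stated preference
    return parser.can_fetch(agent, source_url)

# Usage: keep only questions whose supporting URL permits the AI-training agent.
# questions = [q for q in questions if allowed_for_ai_training(q["source_url"])]
```
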
The result is a dataset of 1,000 carefully curated prompts. On this new benchmark, Google’s Gemini 2.5 Pro achieved a state-of-the-art F1-score of 55.6, outperforming other leading models including GPT-5. The researchers have made “SimpleQA Verified,” along with its evaluation code and a public leaderboard, available to the research community. This release aims to provide a more precise tool for tracking genuine progress in LLM factuality and encourage the development of more trustworthy AI.
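
For readers who want to see how such a benchmark can be scored, the sketch below assumes the original SimpleQA convention: each response is graded correct, incorrect, or not attempted, and the reported F1 is the harmonic mean of overall accuracy and accuracy over attempted questions. It also includes a relative-tolerance check of the kind the curation process defines for numeric answers. The exact grading rules and tolerance used in SimpleQA Verified are assumptions here.

```python
from enum import Enum

class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_ATTEMPTED = "not_attempted"

def numeric_match(predicted: float, target: float, rel_tol: float = 0.01) -> bool:
    """Accept a numeric answer within a relative tolerance of the ground truth (assumed 1%)."""
    return abs(predicted - target) <= rel_tol * abs(target)

def f1_score(grades: list[Grade]) -> float:
    """Harmonic mean of overall accuracy and accuracy over attempted questions."""
    total = len(grades)
    correct = sum(g is Grade.CORRECT for g in grades)
    attempted = sum(g is not Grade.NOT_ATTEMPTED for g in grades)
    overall = correct / total if total else 0.0
    given_attempted = correct / attempted if attempted else 0.0
    if overall + given_attempted == 0.0:
        return 0.0
    return 2 * overall * given_attempted / (overall + given_attempted)

if __name__ == "__main__":
    # Toy distribution of grades, not a real model's results.
    grades = [Grade.CORRECT] * 6 + [Grade.INCORRECT] * 3 + [Grade.NOT_ATTEMPTED] * 1
    print(f"F1 = {100 * f1_score(grades):.1f}")  # ~63.2 for this toy split
```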