The 99% Discount on AI Evaluation: How Researchers Predict Complex Agent Performance for Pennies

Evaluating modern artificial intelligence has become a millionaire’s game. As large language models (LLMs) transition from simple chatbots into active “agents” capable of browsing the web, fixing software repositories, and using complex tools, testing them has become painfully slow and prohibitively expensive.

Running a single comprehensive evaluation on agentic benchmarks like SWE-Bench (which tests AI on real-world GitHub issues) can take days, require specialized sandboxed environments, and run up thousands of dollars in API bills. Now, researchers from Carnegie Mellon University and Salesforce AI Research have proposed a clever workaround: a framework called PACE (Proxy for Agentic Capability Evaluation) that predicts complex agent performance at roughly one-hundredth of the cost.

The Interview Analogy

To understand how PACE works, imagine hiring a software engineer. The most thorough way to evaluate a candidate is to have them shadow your team for a month, working on real code in your production environment. This is highly accurate, but it is also incredibly slow and expensive.

Instead, most companies use a proxy: a technical interview assessing core competencies like coding, algorithmic reasoning, and planning. If a candidate excels at these atomic skills, you can reasonably predict they will succeed on the job.

PACE applies this exact philosophy to AI. Rather than forcing an LLM to spend hours trying to autonomously resolve a complex repository bug, PACE measures how the model performs on a carefully selected “exam paper” of small, static, and cheap-to-run questions.

Sifting for the Perfect Exam

The challenge lies in choosing the right questions. Out of tens of thousands of potential testing instances across dozens of existing, inexpensive benchmarks, which ones actually predict how well a model will perform as a full-fledged agent?

PACE solves this using a two-step mathematical filter. First, it scores “target relevance”: it looks for cheap questions on which models’ scores correlate strongly with their performance on the ultimate, expensive benchmark. Second, it uses Singular Value Decomposition (SVD) to measure “geometric importance,” so that the selected questions cover a diverse range of foundational skills rather than testing the same skill repeatedly.
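To make the two-step filter concrete, here is a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made purely for illustration: the function name select_proxy_items, the use of absolute Pearson correlation for target relevance, the use of SVD leverage scores as a stand-in for geometric importance, and the choice to combine the two by multiplying them. The article does not specify any of these.

```python
import numpy as np

def select_proxy_items(scores, target, n_select, rank=3):
    """Hypothetical two-step item selection (illustrative only).

    scores: (n_models, n_items) per-question results on cheap benchmarks
    target: (n_models,) scores on the expensive agentic benchmark
    Returns (selected item indices, relevance, leverage).
    """
    scores = np.asarray(scores, dtype=float)
    target = np.asarray(target, dtype=float)

    # Center each question's column and the target across models.
    centered = scores - scores.mean(axis=0)
    t = target - target.mean()

    # Step 1 (assumed form): target relevance as |Pearson correlation|
    # between each question's column and the target scores.
    denom = np.linalg.norm(centered, axis=0) * np.linalg.norm(t)
    relevance = np.zeros(scores.shape[1])
    valid = denom > 0  # constant columns carry no signal; leave them at 0
    relevance[valid] = np.abs(centered[:, valid].T @ t) / denom[valid]

    # Step 2 (assumed form): geometric importance as leverage scores,
    # i.e. how much each question loads on the top singular directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(rank, len(s))
    leverage = np.sum(vt[:k] ** 2, axis=0)

    # Assumed combination rule: multiply the two scores, keep the top items.
    combined = relevance * leverage
    selected = np.argsort(-combined)[:n_select]
    return selected, relevance, leverage
```

Called with a models-by-questions score matrix and the models' known agentic scores, the function returns the indices of the questions that would make up the proxy exam. A question that every model answers identically gets zero relevance under this sketch and is never preferred over an informative one.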

The results are remarkably tailored. For example, when predicting performance on GAIA, a benchmark testing general-purpose web-browsing assistants, PACE automatically selects heavily from IFEval (a benchmark for following precise formatting instructions) and PlanBench (which tests multi-step planning). This aligns with intuition: a web assistant’s success hinges on strictly following user constraints and planning search steps. Conversely, to predict SWE-Bench success, PACE prioritizes LiveCodeBench to evaluate raw programming capability.

Cutting the Bill

When evaluated across 14 frontier AI models, a proxy benchmark of just 100 questions selected by PACE predicted complex agent scores with an average prediction error of under 4%. Given a pair of models, it also identified the stronger one with 85% accuracy.
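The two figures quoted above can be read as a score-prediction error and a pairwise ranking accuracy. The sketch below shows one plausible way to compute each. The exact definitions are an assumption (mean absolute error in score points, and the share of model pairs ordered correctly), and the function names are hypothetical; the article does not spell out the formulas.

```python
from itertools import combinations

def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and true benchmark scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def pairwise_accuracy(predicted, actual):
    """Share of model pairs whose predicted ordering matches the true one."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue  # no true winner for this pair, so skip it
        total += 1
        # Same sign of the two differences means the same model wins.
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")
```

Under these definitions, both functions take two equal-length lists of scores, one entry per model: the proxy's predictions and the scores measured on the full agentic benchmark.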

Crucially, it achieved this predictive power at less than 1% of the cost of running the full agentic evaluation. By slashing testing budgets, PACE democratizes the field, allowing resource-constrained researchers and startup developers to rapidly iterate on advanced AI agents without breaking the bank.
