Prompt-to-Leaderboard: Personalizing LLM Evaluations and Routing
Large Language Models (LLMs) are typically evaluated using aggregated metrics which obscure variations in performance across different prompts and users. A new study from UC Berkeley introduces Prompt-to-Leaderboard (P2L), a method that addresses this by creating leaderboards specific to individual prompts or sets of prompts. This approach allows for a more nuanced understanding of LLM performance and enables personalized experiences.
The core idea behind P2L is to train an LLM to take natural language prompts as input and output a vector of Bradley-Terry coefficients. These coefficients are then used to predict human preference votes, effectively generating a prompt-dependent leaderboard. This enables several applications:
- Task-Specific Evaluation: Determine which models are best suited for specific tasks. For instance, the general-purpose leaderboard might not accurately reflect performance on SQL queries, which may only represent a small fraction of overall submissions. P2L can isolate and rank models based solely on their performance on SQL-related prompts.
- Optimal Query Routing: Direct queries to the most appropriate model based on the prompt. If a user asks a math question, the system can use P2L to determine which model excels in mathematics and route the query accordingly.
- Personalization: Create leaderboards tailored to individual users or enterprises based on their prompt history. A user who frequently asks creative writing prompts would see a leaderboard highlighting models strong in that area.
- Automated Strength and Weakness Analysis: Identify specific strengths and weaknesses of different models. For example, a model might excel at generating creative text but struggle with arithmetic operations.
The researchers trained a P2L model on human preference feedback from Chatbot Arena, a platform where users vote on the responses of different LLMs. Their findings suggest that P2L better captures the nuances of language model performance than a traditional averaged leaderboard. The researchers also found evidence of power law scaling, similar to that observed in LLMs themselves, in P2Lās ability to produce prompt-specific evaluations.
As a demonstration of P2Lās utility, the researchers implemented a prompt routing strategy based on P2L on the Chatbot Arena. The P2L-powered router achieved the #1 spot on the Chatbot Arena leaderboard during a testing period, outperforming the previous top model by a significant margin. This highlights P2Lās potential to dynamically select the best model for a given prompt, leading to improved overall performance.
The team has made their code publicly available, paving the way for further research and development in personalized LLM evaluation and routing. The P2L approach represents a significant step towards a more granular and adaptive understanding of LLM capabilities, promising more efficient and effective use of these powerful technologies.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.