HackerRank-ASTRA: A New Benchmark for Evaluating Large Language Models in Multi-File Software Projects

Large language models (LLMs) are rapidly changing software development, enabling tasks like code generation and bug fixing. However, evaluating their real-world effectiveness remains challenging. Existing benchmarks often focus on isolated coding problems or specific libraries, neglecting the complexities of real-world, multi-file software projects. A new benchmark, HackerRank-ASTRA, tackles this gap by evaluating LLMs on 65 diverse, multi-file, project-based coding problems drawn from HackerRank’s proprietary library. A research paper published on arXiv details the benchmark and its initial findings.

Real-World Scenarios, Rigorous Evaluation

HackerRank-ASTRA’s strength lies in its focus on realistic, project-based scenarios. Problems aren’t isolated functions; they span multiple files and mimic tasks like new feature development in widely used web frameworks such as React.js, Angular.js, Node.js, and Django. Each problem comes with an average of 12 source code files and a problem statement (around 718 characters on average), requiring models to demonstrate understanding of context and dependencies.
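To give a concrete sense of what “multi-file” means here, the sketch below shows one way such a problem instance might be represented. The file names, framework, and field names are illustrative assumptions, not drawn from the benchmark itself.

```python
# Hypothetical representation of a single ASTRA-style problem (illustrative only;
# the benchmark's actual schema and contents are not reproduced here).
problem = {
    "skill": "React Frontend Development",
    "problem_statement": "Add a filter so users can view only completed tasks ...",  # ~718 chars on average
    "files": {
        "package.json": "...",
        "src/index.js": "...",
        "src/App.js": "...",                   # entry point the model must understand
        "src/components/TaskList.jsx": "...",  # component that likely needs changes
        "src/components/TaskItem.jsx": "...",
        "src/api/tasks.js": "...",             # dependency providing the data layer
        # ... roughly a dozen source files per problem on average
    },
    "hidden_tests": "test cases run against the model's patched project",
}
```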

The evaluation goes beyond simple accuracy measures. The benchmark employs a robust approach that runs each model 32 times per problem, creating a distribution of results to assess not just correctness but also consistency. Metrics include the following (a sketch of how they might be computed appears after the list):

  • Mean Score: The average percentage of test cases passed across the 32 runs for each problem, aggregated across all problems.
  • Mean Pass@1: The percentage of runs where the model achieved a perfect score (all test cases passed) on a given problem, aggregated across all problems.
  • Consistency (Median Standard Deviation): Measures the variability in model performance across the 32 runs for each problem, providing insights into its reliability. A lower median standard deviation reflects higher consistency.
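To make the aggregation concrete, here is a minimal sketch of how these three metrics could be computed from a matrix of per-run scores. The array shape and the placeholder data are assumptions for illustration; this is not the paper’s evaluation code.

```python
import numpy as np

# Hypothetical per-run scores: scores[p, r] is the fraction of test cases the
# model passed on problem p in run r (65 problems x 32 runs, as in the paper).
# Random placeholder data stands in for real evaluation results.
rng = np.random.default_rng(0)
scores = rng.random((65, 32))

# Mean Score: average test-case pass rate per problem, then averaged over problems.
mean_score = scores.mean(axis=1).mean()

# Mean Pass@1: fraction of runs with a perfect score on each problem,
# averaged over all problems.
mean_pass_at_1 = (scores == 1.0).mean(axis=1).mean()

# Consistency: median, across problems, of the per-problem standard deviation
# over the 32 runs; lower values indicate more consistent behaviour.
consistency = np.median(scores.std(axis=1))

print(f"Mean Score:     {mean_score:.3f}")
print(f"Mean Pass@1:    {mean_pass_at_1:.3f}")
print(f"Median Std Dev: {consistency:.3f}")
```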

Initial Findings and Model Performance

The initial evaluation of HackerRank-ASTRA (version 1) included five state-of-the-art LLMs: o1, o1-preview, Claude-3.5-Sonnet-1022, GPT-4o-0513, and Gemini-1.5-pro. The top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) achieved comparable average scores (around 75%), showing no statistically significant difference in overall performance. However, Claude-3.5-Sonnet-1022 stood out with remarkably high consistency (standard deviation of 0.0497), demonstrating better reliability for real-world application.

Skill-Based Analysis and Formatting Impacts

The benchmark categorized problems into 10 primary skill domains (e.g., React Frontend Development, Node.js Backend Development) and 34 sub-skill categories (e.g., API Integration, Data Filtering). Analysis revealed that the best-performing model varied depending on the specific skills tested. For example, while o1, o1-preview, and Claude-3.5-Sonnet-1022 performed similarly on commonly occurring skills, Claude-3.5-Sonnet-1022 excelled on problems involving less frequently tested skills.

Interestingly, the study also highlights a significant impact of output format: having models return their multi-file solutions in XML consistently outperformed JSON, suggesting potential biases in LLM training data or inherent strengths of XML for structuring multi-file code generation.
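As a rough illustration of what the two output formats look like for multi-file generation, the strings below sketch an XML-style and a JSON-style response. The tag and field names are assumptions; the paper’s exact response schema is not reproduced here.

```python
# Illustrative only: hypothetical XML- and JSON-formatted model responses for a
# two-file solution. The actual schema used by HackerRank-ASTRA may differ.
xml_response = """
<solution>
  <file path="src/App.js">
    <content>// full updated file contents</content>
  </file>
  <file path="src/components/TaskList.jsx">
    <content>// full updated file contents</content>
  </file>
</solution>
"""

json_response = """
{
  "files": [
    {"path": "src/App.js", "content": "// full updated file contents"},
    {"path": "src/components/TaskList.jsx", "content": "// full updated file contents"}
  ]
}
"""
```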

Limitations and Future Directions

The current version of HackerRank-ASTRA focuses primarily on front-end development. Future iterations plan to incorporate backend tasks and a broader range of technologies. The researchers also aim to add agent-based approaches that let LLMs iteratively refine their solutions, expand the benchmark to more LLMs, and open it to community contributions.

Conclusion

HackerRank-ASTRA provides a much-needed, rigorous benchmark for evaluating LLMs in realistic multi-file software projects. The findings highlight the importance of assessing both correctness and consistency, and underscore the need for further research in improving LLMs’ reliability and handling of real-world complexities. The open-sourcing of the benchmark’s questions promises wider collaboration and faster progress in this crucial area of AI development.