LLMs as Data Scientists: A New Benchmark to Evaluate their Feature Engineering Abilities
đź“„ Full Paper
đź’¬ Ask
Large language models (LLMs) are rapidly becoming more sophisticated, capable of carrying out complex tasks that require deep domain knowledge and reasoning abilities. While benchmarks evaluating LLMs are increasingly common, many focus on isolated skills such as language understanding or code generation. This leaves a gap in our understanding of how well LLMs can integrate these abilities to solve real-world problems.
A new research paper, “Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists,” published on arXiv, proposes a novel benchmark called FeatEng designed to assess LLMs’ feature engineering capabilities. Feature engineering is a crucial but often tedious task for data scientists: the process of transforming raw data into features that are more informative and better suited to machine learning models.
FeatEng’s design focuses on evaluating how well LLMs can perform this critical task. The benchmark draws on real-world datasets from Kaggle competitions: each model is given a description of a dataset and its columns, and is tasked with generating Python code that transforms the raw data into new features, with the ultimate aim of improving the performance of a downstream machine learning model.
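To make this concrete, here is a minimal, illustrative sketch of what such an evaluation loop could look like. The column names, the `transform` function, the toy data, and the choice of a random forest as the downstream model are assumptions made for illustration, not details taken from the benchmark itself; in spirit, though, the idea is that better feature code yields a better score for a fixed downstream model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical feature-engineering function of the kind an LLM might be asked
# to produce: it receives a raw DataFrame and returns a transformed copy.
def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Illustrative domain-informed features for a housing-style dataset
    # (column names are invented, not taken from the benchmark).
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["age_bucket"] = pd.cut(out["house_age"], bins=[0, 10, 30, 100], labels=False)
    return out

# Toy raw data standing in for a Kaggle-style dataset.
raw = pd.DataFrame({
    "total_rooms": [6, 8, 5, 10, 7, 9],
    "households":  [2, 3, 1, 4, 2, 3],
    "house_age":   [5, 25, 40, 12, 8, 33],
    "price_high":  [0, 1, 1, 0, 0, 1],  # binary target
})

X = transform(raw).drop(columns="price_high")
y = raw["price_high"]

# Score the transformed features with a fixed downstream model; better
# feature code should translate into a higher cross-validated score.
model = RandomForestClassifier(n_estimators=50, random_state=0)
print(cross_val_score(model, X, y, cv=2).mean())
```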
The paper highlights four key qualities that any meaningful LLM benchmark should embody:
- Practical Usability: The tasks should be grounded in real-world problems, ensuring that any improvements in performance translate into tangible benefits.
- Application of World Knowledge: The benchmark should require LLMs to utilize both domain-specific and general world knowledge, demonstrating their ability to reason and make decisions based on a broader understanding of the world.
- Complex Skill Integration: The benchmark should assess how well LLMs can seamlessly combine various competencies, such as language understanding, code generation, and reasoning, to solve complex problems.
- Resistance to Exploitation: The benchmark should be robust against strategies that exploit dataset-specific biases or rely on memorization, ensuring that improvements in performance genuinely reflect the model’s capabilities.
By focusing on feature engineering, FeatEng challenges LLMs in a way that none of the existing benchmarks do, providing a more robust and practical assessment of their problem-solving abilities. The benchmark is carefully designed to align with the four key qualities mentioned above, making it a more reliable and meaningful way to evaluate LLMs’ overall capabilities.
The paper also presents a detailed analysis of how different feature-engineering strategies influence the performance of the various LLMs tested. The results show that models that successfully exploit domain knowledge and apply advanced feature-engineering techniques generally perform better. Interestingly, the paper also found that models often struggled to generate code that adhered to the tasks’ specific instructions.
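As a rough illustration of what “exploiting domain knowledge” can mean in practice, consider a hypothetical medical dataset: a model that knows clinical conventions can combine raw height and weight columns into BMI, a single feature that is typically more informative to a downstream model than either column alone. The column names and values below are invented for illustration and are not drawn from the benchmark.

```python
import pandas as pd

# Invented example: deriving BMI from raw height and weight columns,
# the kind of domain-informed feature a knowledgeable model might add.
patients = pd.DataFrame({
    "height_cm": [170, 160, 182],
    "weight_kg": [70, 85, 77],
})
patients["bmi"] = patients["weight_kg"] / (patients["height_cm"] / 100) ** 2
print(patients)
```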
The authors acknowledge that FeatEng has limitations, notably the lack of a human baseline for comparison. However, they believe that the benchmark provides a valuable framework for researchers and practitioners to evaluate and understand the capabilities of LLMs in real-world settings.
As LLMs continue to evolve, FeatEng serves as a crucial tool for assessing their ability to not only generate text but also to tackle complex, real-world problems that require deep domain knowledge and reasoning abilities. By pushing the boundaries of LLM evaluation, this new benchmark offers a glimpse into a future where LLMs play a more active and instrumental role in solving complex data science problems.