AI Papers Reader

Personalized digests of latest AI research


LLMs as Data Scientists: A New Benchmark to Evaluate their Feature Engineering Abilities

Large language models (LLMs) are rapidly becoming more sophisticated, capable of carrying out complex tasks that require deep domain knowledge and reasoning abilities. While benchmarks evaluating LLMs are increasingly common, many focus on isolated skills like language understanding or code generation. This leaves a gap in understanding how LLMs can integrate these abilities to solve real-world problems.

A new research paper, “Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists,” posted on arXiv, proposes a benchmark called FeatEng designed to assess LLMs’ feature-engineering capabilities. Feature engineering is a crucial but often laborious task for data scientists: transforming raw data into features that are more informative and better suited to machine learning models.
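
To make the task concrete, the sketch below shows the kind of hand-written feature engineering the benchmark asks models to automate. The dataset, column names, and derived features are illustrative assumptions, not examples taken from the paper.

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hand-crafted feature engineering on a hypothetical housing table.
    Column names are illustrative, not taken from the FeatEng datasets."""
    out = df.copy()
    # Ratio features are often more predictive than either raw column alone.
    out["price_per_sqft"] = out["sale_price"] / out["living_area"].clip(lower=1)
    # Turn two year columns into an age at time of sale.
    out["house_age"] = out["year_sold"] - out["year_built"]
    # Bucket a skewed numeric column into coarse, model-friendly categories.
    out["income_bracket"] = pd.cut(
        out["household_income"],
        bins=[0, 30_000, 60_000, 120_000, float("inf")],
        labels=["low", "mid", "high", "very_high"],
    )
    # Replace a high-cardinality text column with a simple frequency encoding.
    out["city_freq"] = out["city"].map(out["city"].value_counts(normalize=True))
    return out
```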

FeatEng’s design focuses on evaluating how well LLMs can perform this critical task. The benchmark challenges LLMs with real-world datasets from Kaggle competitions, presenting each model with a description of the dataset and its features. The model is then asked to generate Python code that transforms the raw data into new features, with the goal of improving the performance of a downstream machine learning model.
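
The evaluation loop can be pictured roughly as follows. This is a simplified sketch rather than the paper’s actual harness: it assumes the generated code exposes a transform(df) function returning numeric features, and it uses a generic scikit-learn gradient-boosting model as the downstream learner.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

def score_generated_code(generated_source: str,
                         train: pd.DataFrame,
                         test: pd.DataFrame,
                         target: str) -> float:
    """Run LLM-generated feature-engineering code, then score a downstream model.
    Simplified sketch: assumes the generated source defines transform(df)
    and that transform() returns purely numeric features."""
    namespace: dict = {}
    exec(generated_source, namespace)  # a real harness would sandbox this step
    transform = namespace["transform"]

    X_train = transform(train.drop(columns=[target]))
    X_test = transform(test.drop(columns=[target]))

    # Any downstream learner works for the sketch; its score is compared against
    # the same model trained on the untransformed columns.
    model = GradientBoostingClassifier()
    model.fit(X_train, train[target])
    return accuracy_score(test[target], model.predict(X_test))
```

The point of this setup is that it rewards only transformations that actually improve the downstream model’s score relative to training on the untransformed data, not code that merely runs.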

The paper highlights four key qualities that any meaningful LLM benchmark should embody and uses them as design criteria for FeatEng.

By focusing on feature engineering, FeatEng challenges LLMs in a way that existing benchmarks do not, providing a more practical assessment of their problem-solving abilities. Because it is built around the four qualities above, the authors argue it offers a more reliable and meaningful measure of LLMs’ overall capabilities.

The paper also analyzes how different feature-engineering strategies influence the performance of the LLMs tested. Models that successfully exploit domain knowledge and apply advanced feature-engineering techniques generally perform better. Interestingly, models often struggled to generate code that adhered to the task’s specific instructions.

The authors acknowledge that FeatEng has limitations, notably the lack of a human baseline for comparison. However, they believe that the benchmark provides a valuable framework for researchers and practitioners to evaluate and understand the capabilities of LLMs in real-world settings.

As LLMs continue to evolve, FeatEng serves as a useful tool for assessing their ability not only to generate text but also to tackle complex, real-world problems that demand domain knowledge and multi-step reasoning. By pushing the boundaries of LLM evaluation, the benchmark offers a glimpse of a future in which LLMs play a more active, instrumental role in solving data science problems.