AI Papers Reader

Personalized digests of latest AI research

View on GitHub

Programming Data Refinement: Lifting Pre-Training Quality Like Experts at Scale

Large language models (LLMs) are revolutionizing the way we interact with technology. Their remarkable capabilities in tasks such as creative writing, complex reasoning, and code generation are fueled by massive, high-quality pre-training corpora. However, the vast amount of data available on the internet is often noisy and unrefined, requiring extensive cleaning and quality enhancement before being used for pre-training.

Traditional approaches to data refinement rely heavily on human expertise and manual adjustments, limiting their ability to address the unique characteristics of each individual example. This paper introduces Programming Every Example (PROX), a novel framework that leverages the power of LLMs to automate data refinement.

PROX treats data refinement as a programming task. It first trains a small language model (less than 1B parameters) to perform data refining tasks by fine-tuning it on seed data. This model then generates programs for each example in the pre-training corpus, determining the appropriate operations such as filtering, string normalization, and noisy line removal. These programs are then executed by a pre-defined executor, producing a refined corpus ready for pre-training.

The key innovation of PROX is its ability to automatically generate and execute fine-grained operations for each individual example at scale. This allows PROX to adapt to the unique characteristics of every example, achieving a level of flexibility and granularity that is impractical for human experts.

For example, PROX can automatically remove navigation bars, meaningless URL links, and random code gibberish from a document. In a mathematical domain, PROX can identify and remove less relevant content such as functional buttons, thereby improving the quality and relevance of the pre-training data for downstream tasks.

PROX demonstrates significant improvements over both raw data and data curated by other selection methods across diverse pre-training corpora and model sizes. It outperforms state-of-the-art data selection methods by over 2% on 10 downstream benchmarks and achieves consistent improvements in domain-specific continual pre-training. Moreover, it significantly boosts pre-training efficiency, achieving similar performance with up to 20× less computing power.

PROX is a promising path towards more efficient and accessible development of LLMs. By automating data refinement, it can help researchers and developers overcome the limitations of traditional approaches, freeing up valuable resources for more creative and impactful research. The researchers are open-sourcing PROX with > 100B corpus, models, and implementation details, paving the way for broader adoption and further innovation.