Arctic-SnowCoder: A New Code Model That Shows High-Quality Data Is Crucial
Researchers at Snowflake AI Research have developed Arctic-SnowCoder, a new 1.3B-parameter code model that outperforms similarly sized models despite being trained on significantly less data. The key to its success? A novel three-phase pretraining strategy that prioritizes data quality over quantity.
Arctic-SnowCoder is trained on a total of 555 billion tokens, a fraction of the multi-trillion-token corpora used for comparable models like StarCoder2-3B. Despite this, it achieves state-of-the-art performance among similarly sized models on BigCodeBench, a benchmark of challenging programming tasks designed to mimic real-world scenarios. The authors credit this result to their meticulous approach to data quality.
Phase 1: General Pretraining
Arctic-SnowCoder’s journey begins with general pretraining on 500 billion tokens of raw code data drawn from sources such as The Stack v1 and GitHub crawls. This data is preprocessed with basic filtering and decontamination, so the corpus is reasonably clean and does not overlap with the evaluation benchmarks.
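To make the preprocessing step concrete, here is a minimal sketch of the kind of basic heuristic filtering such pipelines typically apply. The specific thresholds and heuristics are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of basic quality filtering for raw code files.
# The thresholds and heuristics here are illustrative only, not the
# authors' actual preprocessing pipeline.

def passes_basic_filters(code: str, max_line_len: int = 1000,
                         max_avg_line_len: int = 100,
                         min_alpha_fraction: float = 0.25) -> bool:
    """Return True if a code file passes simple length/content heuristics."""
    lines = code.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_len:
        return False  # likely minified or generated code
    if sum(len(line) for line in lines) / len(lines) > max_avg_line_len:
        return False  # unusually long lines on average
    alpha = sum(ch.isalpha() for ch in code)
    if alpha / max(len(code), 1) < min_alpha_fraction:
        return False  # mostly data or binary-like content
    return True


raw_files = ["def add(a, b):\n    return a + b\n", "\x00\x01\x02" * 500]
clean_files = [f for f in raw_files if passes_basic_filters(f)]
print(len(clean_files))  # -> 1
```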
Phase 2: Continued Pretraining with High-Quality Data
In the second phase, the model is further trained on 50 billion tokens of high-quality code data. These tokens are selected using a BERT-based quality annotator trained on positive examples drawn from high-quality code files and from the instruction datasets of Magicoder and StarCoder2-Instruct. The annotator learns to distinguish good code from random pretraining files, so continued pretraining focuses on the best available examples.
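Here is a minimal sketch of how such a quality annotator could be applied at scale. The model name and score threshold are hypothetical, and the paper trains its own classifier rather than using an off-the-shelf one; this only illustrates the scoring-and-filtering pattern.

```python
# Minimal sketch of scoring code files with a BERT-style quality classifier,
# in the spirit of the phase-two annotator. The model name and threshold are
# hypothetical placeholders; the classifier is assumed to have two labels
# (0 = random/low quality, 1 = high quality).

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "my-org/code-quality-bert"  # hypothetical fine-tuned classifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def quality_score(code: str) -> float:
    """Return the probability that a code file is 'high quality'."""
    inputs = tokenizer(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Keep only files whose score clears an (illustrative) threshold.
candidate_files = ["def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)\n"]
high_quality = [f for f in candidate_files if quality_score(f) > 0.9]
```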
Phase 3: Enhanced Pretraining with Synthetic Data
The final phase involves enhanced pretraining on 5 billion tokens of synthetic data generated by Llama-3.1-70B, a powerful language model. This synthetic data is created using the high-quality data from phase two as seeds, leveraging an approach inspired by Magicoder to generate high-quality and problem-solving oriented code files.
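The sketch below illustrates this style of seeded synthetic generation. The prompt wording, the local OpenAI-compatible endpoint, and the sampling parameters are assumptions for illustration rather than the authors' exact setup.

```python
# Minimal sketch of Magicoder-style synthetic generation seeded with a
# high-quality code file from phase two. The prompt template and the local
# OpenAI-compatible serving endpoint are assumptions for illustration.

from openai import OpenAI

# Assumed local inference server exposing an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPT_TEMPLATE = """You are given a code snippet as inspiration:

{seed}

Write a new, self-contained, high-quality code file that solves a related
but distinct programming problem. Include clear docstrings and comments."""

def generate_synthetic_file(seed_code: str) -> str:
    """Generate one synthetic training document from a seed code file."""
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(seed=seed_code)}],
        temperature=0.8,
        max_tokens=2048,
    )
    return response.choices[0].message.content

seed = "def binary_search(xs, target):\n    ...\n"
synthetic_doc = generate_synthetic_file(seed)
```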
The Power of Data Quality
Arctic-SnowCoder’s success highlights the importance of data quality for code models. The authors found that simply using more data isn’t enough to achieve optimal performance. By carefully curating their training data and prioritizing quality over quantity, they were able to train a smaller model that outperforms larger models on a challenging benchmark.
Beyond the Benchmark
The paper’s findings have implications for future research on code models. The authors argue that what makes data “high quality” is, in practice, its alignment with the distribution of downstream applications: the data a model is trained on should closely reflect the kinds of tasks it will be asked to perform in real-world scenarios.
By demonstrating the power of data quality, Arctic-SnowCoder sheds light on a critical aspect of code model development. As the field continues to evolve, it’s likely that the focus on data quality will only grow stronger, leading to more robust and capable code models in the future.