AI Papers Reader

Personalized digests of latest AI research


COIG-P: A New Dataset Aims to Align Chinese LLMs with Human Values

Researchers at Multimodal Art Projection (M-A-P) have released COIG-P, a large-scale, high-quality Chinese preference dataset designed to improve the alignment of Chinese Large Language Models (LLMs) with human values. The team argues that existing Chinese datasets are limited in size and scope and lack rigorous data validation. To address this, they developed an LLM-based annotation pipeline that generates the dataset without human intervention.

The Problem: Existing methods for aligning LLMs to human preferences often rely on manually labeled datasets. This approach is slow, expensive, and difficult to scale. Chinese LLMs, in particular, lack robust data resources compared to their English counterparts. Many of the available datasets are also derived from single sources, leading to potential biases.

The Solution: COIG-P (Chinese Open Instruction Generalist - Preference) contains over one million Chinese preference pairs. The process starts with collecting and filtering 92,000 high-quality Chinese queries spanning six diverse domains:

  • Chat: Casual conversations. Example Query: “How’s the weather today?”
  • Code: Programming-related questions. Example Query: “Write a Python function to calculate the Fibonacci sequence.”
  • Math: Mathematical problems. Example Query: “Solve the following equation: x^2 + 2x + 1 = 0.”
  • Logic: Logical reasoning puzzles. Example Query: “If all A are B, and some B are C, are all A necessarily C?”
  • Novel: Novel continuation prompts. Example Query: “The old house stood on a hill overlooking the town. A cold wind whistled through its broken windows…”
  • Role: Role-playing scenarios. Example Query: “You are a detective investigating a murder. How do you interrogate the prime suspect?”

For each query, the researchers use 15 different LLMs to generate multiple candidate responses, and a subset of 8 LLMs then scores them. Pairs whose scores differ by at least a minimum margin are kept, with the higher-scoring response labeled “chosen” and the lower-scoring one “rejected.” The result is a dataset where each entry contains a query and a pair of responses, one preferred over the other.
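
To make the pairing step concrete, here is a minimal sketch of how score-margin filtering could be implemented. The function name and margin value are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the chosen/rejected pairing step described above.
# MIN_MARGIN and build_preference_pairs are illustrative, not from the paper.
from itertools import combinations

MIN_MARGIN = 2.0  # minimum score gap required to keep a pair (assumed value)

def build_preference_pairs(query, responses, scores):
    """Given candidate responses and their LLM-judge scores for one query,
    keep only pairs whose score gap meets the margin."""
    pairs = []
    for (resp_a, s_a), (resp_b, s_b) in combinations(zip(responses, scores), 2):
        if abs(s_a - s_b) >= MIN_MARGIN:
            chosen, rejected = (resp_a, resp_b) if s_a > s_b else (resp_b, resp_a)
            pairs.append({"prompt": query, "chosen": chosen, "rejected": rejected})
    return pairs
```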

To further improve the efficiency of LLM scoring, the team trained a Chinese Reward Model (CRM) and created a Chinese Reward Benchmark (CRBench). This CRM significantly reduces the overhead associated with using LLMs for scoring while maintaining annotation quality comparable to that of GPT-4o.
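
To illustrate how a reward model like the CRM is typically queried, here is a minimal sketch assuming a standard sequence-classification head that outputs a single scalar score. The checkpoint path is a placeholder, not the released CRM.

```python
# Sketch of scoring a (query, response) pair with a scalar reward head.
# The model path below is a placeholder, not the actual CRM checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/chinese-reward-model")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/chinese-reward-model", num_labels=1
)
model.eval()

def reward_score(query: str, response: str) -> float:
    # Encode the query and response as a text pair and return the scalar logit.
    inputs = tokenizer(query, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return logits.squeeze().item()
```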

The Impact: Experiments show that fine-tuning existing LLMs with COIG-P leads to significant performance improvements. For example, fine-tuning Qwen2/2.5 and Infinity-Instruct-3M-0625 series models shows performance gains ranging from 2% to 12%. The CRM achieves comparable results to closed-source models like GPT-4o in scoring and pairing. This could enable more accessible and cost-effective development of Chinese LLMs aligned with human values.
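
The article does not spell out the training objective, but preference pairs like those in COIG-P are commonly used with Direct Preference Optimization (DPO). The sketch below shows a DPO-style loss under that assumption; the beta value and the precomputed sequence log-probabilities are illustrative inputs, not the paper's exact recipe.

```python
# Illustrative DPO-style preference loss over chosen/rejected pairs.
# Inputs are sequence-level log-probabilities (tensors, summed over tokens).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of the policy relative to the frozen reference model.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Encourage the policy to prefer the chosen response over the rejected one.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```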

Looking Ahead: The release of COIG-P and CRBench is poised to stimulate advancements in Chinese LLMs. The data and code are available at https://github.com/multimodal-art-projection/COIG-P, inviting collaboration and further development from the broader research community.