AI Papers Reader

Personalized digests of latest AI research


The "Politeness" Problem: Why Aligned AI Fails to Predict Real Human Behavior

In the burgeoning field of computational social science, researchers are increasingly using large language models (LLMs) as “silicon subjects” to simulate human behavior. These models, dubbed homo silicus, are used to predict how voters might react to a policy or how consumers might choose between products. However, a provocative new study from the Technion – Israel Institute of Technology reveals a fundamental flaw in this approach: the very “alignment” process that makes AI safe and helpful also makes it a poor mirror of humanity.

The paper, titled “Alignment Makes Language Models Normative, Not Descriptive,” argues that post-training techniques like Reinforcement Learning from Human Feedback (RLHF) induce a “normative bias.” In short, aligned models learn how humans should act—being cooperative, fair, and rational—rather than how humans actually act—being spiteful, erratic, or prone to retaliation.

The 10-to-1 Performance Gap

To test this, the researchers compared 120 model pairs (each matching a “base” unaligned version with its “aligned” counterpart) against more than 10,000 real-world human decisions in strategic games such as bargaining, negotiation, and persuasion.
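The paper’s full evaluation pipeline is not reproduced here, but the core comparison can be sketched as follows — a hypothetical scorer that checks, for each recorded human decision, which model assigned higher probability to the move the human actually made (the function name, data schema, and numbers are all illustrative, not the paper’s code):

```python
from collections import Counter

def compare_predictions(decisions):
    """Tally which model better predicts each recorded human move.

    `decisions` is a list of dicts holding the move a human made and the
    probability each model assigned to that move (illustrative schema).
    """
    tally = Counter()
    for d in decisions:
        if d["p_base"] > d["p_aligned"]:
            tally["base"] += 1
        elif d["p_aligned"] > d["p_base"]:
            tally["aligned"] += 1
        else:
            tally["tie"] += 1
    return tally

# Toy data: three human decisions scored by both model variants
decisions = [
    {"move": "reject", "p_base": 0.62, "p_aligned": 0.18},
    {"move": "counter", "p_base": 0.41, "p_aligned": 0.35},
    {"move": "accept", "p_base": 0.30, "p_aligned": 0.55},
]
print(compare_predictions(decisions))  # Counter({'base': 2, 'aligned': 1})
```

Aggregated over thousands of such decisions, a tally like this is what yields the headline win/loss ratio between base and aligned models.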

The results were lopsided. In complex, multi-round games where human interaction history matters, the raw “base” models outperformed their aligned versions in predicting human moves by a staggering ratio of nearly 10 to 1. Across different model families like Llama, Gemma, and Qwen, the unaligned versions were consistently better at capturing the messy reality of human strategy.

Normative vs. Descriptive: A Tale of Two Behaviors

To understand why, the researchers point to the distinction between normative theory (how people ought to act) and descriptive accounts (how they do act).

Consider a bargaining game where two people must divide a sum of money. A normative solution—the kind an aligned AI endorses—might be a fair 50/50 split. However, a descriptive account of human behavior reveals that if one player feels insulted by a low initial offer, they might “irrationally” reject a deal entirely just to punish their opponent.
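The gap between the two accounts can be made concrete with a toy ultimatum-game responder (a minimal sketch, assuming a 100-unit pot and an invented 30% spite threshold — not the paper’s model): a purely normative agent accepts any positive offer, while a descriptive agent rejects offers it perceives as insultingly low.

```python
def normative_responder(offer, total=100):
    """A 'textbook' rational agent: any positive amount beats nothing."""
    return offer > 0

def descriptive_responder(offer, total=100, spite_threshold=0.3):
    """A more human-like agent: rejects offers below ~30% of the pot
    to punish the proposer, even at a cost to itself."""
    return offer >= spite_threshold * total

# A lowball offer of 10 out of 100:
print(normative_responder(10))    # True  -- takes the money
print(descriptive_responder(10))  # False -- rejects out of spite
```

An aligned model, in effect, keeps predicting the first function’s behavior in situations where real humans follow the second.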

The study found that aligned models are excellent at predicting the “textbook” move. They dominated base models when predicting behavior in one-shot games and simple lotteries, where humans tend to follow rational, predictable patterns. But as soon as the games became social and multi-round—introducing dynamics like reciprocity, bluffing, and revenge—the aligned models’ “politeness” became a liability.

The Round One Reversal

One of the paper’s most striking findings involves the “round one” effect. In the very first round of a strategic game, before any history has developed, aligned models actually predict human choices quite well. At this stage, humans often act according to standard norms.

However, as the game progresses and a “paper trail” of interaction develops, human behavior shifts. If a player is betrayed in round two, they might spend round three retaliating. Base models, trained on the raw, unwashed internet, “recognize” this darker side of human nature. Aligned models, having been “tilted” toward socially approved responses by human evaluators, effectively lose sight of these “tail” behaviors.
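This reversal is easy to surface once predictions are grouped by round. The sketch below computes per-round accuracy from a toy log of (round, correct?) pairs; the schema and the numbers are invented purely to illustrate the shape of the effect, not drawn from the paper’s data.

```python
from collections import defaultdict

def accuracy_by_round(records):
    """Group prediction hits by game round and return per-round accuracy.

    `records` is a list of (round_number, was_prediction_correct) pairs
    (an illustrative schema, not the paper's data format).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rnd, correct in records:
        totals[rnd] += 1
        hits[rnd] += int(correct)
    return {rnd: hits[rnd] / totals[rnd] for rnd in sorted(totals)}

# Invented log for an aligned model: strong in round one, weaker once
# retaliation dynamics appear in later rounds.
aligned_log = [(1, True), (1, True), (2, True), (2, False),
               (3, False), (3, False)]
print(accuracy_by_round(aligned_log))  # {1: 1.0, 2: 0.5, 3: 0.0}
```

A declining curve like this for the aligned model — against a flat or rising one for its base counterpart — is the signature of the round-one effect the paper describes.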

The Future of “Silicon Subjects”

This research presents a “fundamental trade-off” for the AI industry. If you want a chatbot to be a helpful assistant, alignment is essential. But if you want a model to serve as a proxy for human behavior in a simulation, alignment may be the very thing that breaks the experiment.

For social scientists and economists, the message is clear: when using AI to model the real world, the most “polite” model in the room is likely the least realistic. As we move toward a world of AI-driven social simulations, we must decide whether we want our models to reflect the humans we hope to be, or the humans we actually are.