AI Papers Reader

Personalized digests of latest AI research

AI Personality Judgments Are Often Just "Prejudice" in Disguise, New Research Warns

Multimodal Large Language Models (MLLMs) are increasingly being deployed as virtual HR interviewers, mental health screeners, and digital companions. These systems claim to read our personalities using the gold-standard “Big Five” psychological traits—openness, conscientiousness, extraversion, agreeableness, and neuroticism. But are they genuinely understanding human behavior, or are they just making shallow, stereotypical assumptions?

A team of researchers, primarily from the University of Tokyo and Shanda AI Research, has revealed that AI’s apparent social intelligence is largely a facade. In a new paper, they expose a striking “Prejudice Gap”: across 27 state-of-the-art AI models, 51% of correct personality assessments were entirely ungrounded in actual visual or auditory evidence. In other words, the models got the right answer for the wrong reason.

To understand this, imagine an AI interviewing a job candidate. The candidate smiles, and the AI immediately rates them as highly “agreeable.” This is prejudice: a judgment built on a superficial correlation. Real perception requires grounding. For instance, in a video of a quiet speaker, a truly perceptive observer might notice the speaker’s gaze drift down and to the left at a specific timestamp while speaking calmly. This subtle micro-behavior is a textbook indicator of internally directed cognitive processing, justifying a rating of low extraversion. While many advanced models can guess “low extraversion” correctly, most fail to point out the actual gaze shift that supports it.

To expose this flaw, the researchers built a rigorous new benchmark called MM-OCEAN. It features 1,104 short videos paired with 5,320 expert-validated multiple-choice questions. The benchmark tests three tiers of cognitive depth: rating a personality trait, explaining the reasoning behind that rating, and grounding that reasoning in physical evidence (such as pinpointing an exact timestamp or identifying a specific body gesture).

The results were sobering. When forced to prove why they made a personality judgment, the models floundered. Across the models tested, the average “Holistic-Grounding Rate” (the percentage of times a model got the rating, reasoning, and visual grounding all correct) ranged from a dismal 0% to a maximum of just 33.5%.
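To make the two headline metrics concrete, here is a minimal Python sketch of how they could be computed from per-item results. It is an illustration under stated assumptions, not the paper's scoring code: the `ItemResult` record and its field names are hypothetical, and the "Prejudice Gap" is assumed here to mean the share of correctly rated items whose grounding answer was wrong, which may differ from the authors' exact definition.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Outcome for one benchmark item (hypothetical record, not the paper's schema)."""
    rating_correct: bool      # tier 1: trait rating matches the reference
    reasoning_correct: bool   # tier 2: explanation question answered correctly
    grounding_correct: bool   # tier 3: timestamp or gesture question answered correctly


def holistic_grounding_rate(results):
    """Fraction of all items where rating, reasoning, and grounding are all correct."""
    if not results:
        return 0.0
    hits = sum(
        r.rating_correct and r.reasoning_correct and r.grounding_correct
        for r in results
    )
    return hits / len(results)


def prejudice_gap(results):
    """Fraction of correctly rated items whose grounding was wrong (assumed definition)."""
    rated = [r for r in results if r.rating_correct]
    if not rated:
        return 0.0
    ungrounded = sum(not r.grounding_correct for r in rated)
    return ungrounded / len(rated)


# Toy data, invented purely for illustration.
results = [
    ItemResult(True, True, True),     # right answer, right reason, right evidence
    ItemResult(True, True, False),    # right answer, evidence not located
    ItemResult(True, False, False),   # right answer for the wrong reason
    ItemResult(False, True, True),    # evidence found, rating wrong
]

hgr = holistic_grounding_rate(results)
gap = prejudice_gap(results)
print(f"Holistic-Grounding Rate: {hgr:.2f}")
print(f"Prejudice Gap: {gap:.2f}")
```

In this toy set, only one of the four items is correct on all three tiers, while two of the three correct ratings lack correct grounding. The example shows why a model can look accurate on ratings alone and still score poorly once evidence is required.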

Google’s Gemini 3 Flash topped the leaderboard, but even this cutting-edge model fell victim to the Prejudice Gap. The researchers also identified distinct model archetypes. “Confident Raters,” like Meta’s Llama-4-Maverick, score highly on initial personality ratings but perform terribly when asked to locate the visual cues supporting their decisions. Conversely, “Cautious Reasoners” struggle to commit to a rating but excel at tracking specific physical movements.

This research arrives at a crucial moment. Under the European Union’s new AI Act, personality-based hiring and educational tools are classified as “high-risk,” legally requiring an explainable evidence trail for every automated decision. The findings of MM-OCEAN suggest that current commercial AIs are far from ready for these high-stakes roles. Until AI can ground its social instincts in genuine, observable perception, we are merely trusting automated prejudice.