Medical AI Falls Short in Doctor-Patient Roleplay Simulation
Large language models (LLMs) routinely ace standardized medical board exams, but a new study reveals a stark disconnect between passing a written test and successfully treating a patient in a dynamic clinical setting.
Researchers from Shanghai Jiao Tong University and the Shanghai Artificial Intelligence Laboratory have introduced MedSP1000, a highly interactive benchmark designed to evaluate AI models as active clinical agents. Rather than asking static, multiple-choice questions, MedSP1000 repurposes 1,638 peer-reviewed “standardized patient” (SP) teaching cases, originally developed to train human doctors, into closed-loop, text-based simulations.
In these digital clinics, an assessed AI model plays the role of the doctor, interacting step-by-step with a patient agent and an environmental controller that simulates nurses, lab work, and physical examinations. The AI’s decisions are then scored against 24,602 expert-defined rubrics.
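The loop described above can be illustrated with a minimal Python sketch. It is not the benchmark's actual code: the `Rubric` class, the `ASK:`/`ORDER:`/`END` action format, the scripted agents, and the keyword-based rubric checks are all hypothetical stand-ins chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Rubric:
    description: str
    # Hypothetical check: True if the doctor's actions satisfy this rubric.
    is_met: Callable[[Sequence[str]], bool]


def run_encounter(doctor, patient, environment, max_turns=10):
    """Run a closed-loop, text-based encounter and return the doctor's actions."""
    transcript = []
    observation = "Patient enters the clinic."
    for _ in range(max_turns):
        action = doctor(observation)
        transcript.append(action)
        if action == "END":
            break
        # Questions go to the patient agent; everything else (orders, exams)
        # goes to the environment controller.
        if action.startswith("ASK:"):
            observation = patient(action)
        else:
            observation = environment(action)
    return transcript


def completion_rate(transcript, rubrics):
    """Fraction of rubrics satisfied by the transcript."""
    met = sum(1 for rubric in rubrics if rubric.is_met(transcript))
    return met / len(rubrics)


def make_scripted_doctor(actions):
    """Toy doctor that replays a fixed list of actions, then ends the encounter."""
    remaining = iter(actions)
    return lambda observation: next(remaining, "END")


doctor = make_scripted_doctor(
    ["ASK: When did symptoms start?", "ORDER: CT head", "END"]
)
patient = lambda question: "About an hour ago."
environment = lambda order: f"Result for {order}: pending"

rubrics = [
    Rubric("Asks about symptom onset",
           lambda t: any("symptoms start" in a for a in t)),
    Rubric("Orders brain imaging",
           lambda t: any("CT head" in a for a in t)),
    Rubric("Documents consent",
           lambda t: any("consent" in a.lower() for a in t)),
]

transcript = run_encounter(doctor, patient, environment)
score = completion_rate(transcript, rubrics)
print(f"Completed {score:.1%} of rubrics")
```

In this toy run the scripted doctor satisfies two of the three rubrics and never documents consent, the same kind of gap the study reports between getting the main decisions right and completing every required step.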
The results reveal that even the most advanced AI models struggle when clinical cases evolve over time. The highest-performing model, GPT-5.5, completed only 60.4% of the expert-defined clinical requirements. Surprisingly, specialized medical models fared even worse: the top medical model, Baichuan-M3, scored just 40.0%, trailing GPT-5.5 by over 20 percentage points. The researchers suggest that specialized medical AIs may be heavily overfitted to static question-answering, leaving them ill-equipped for sequential, multi-turn clinical reasoning.
The study highlighted several subtle but critical failure modes. For instance, in an acute ischemic stroke simulation, GPT-5.5 successfully initiated the emergency protocol, ordered brain imaging, and correctly decided to administer clot-busting therapy. Yet, it failed on fine-grained safety checks: it ordered a 20 mg dose of the blood pressure drug labetalol instead of the guideline-mandated 10 mg, and forgot to document the patient’s explicit consent.
In another case involving prenatal nutritional counseling, GPT-5.5 took a flawless, highly detailed dietary history of a pregnant patient eating mercury-exposed fish. However, when it came to providing guidance, the AI failed to state the recommended weekly serving limit and left the patient’s direct questions about local fish safety completely unanswered. It collected the necessary data but failed to turn it into actionable advice.
Furthermore, the researchers found that simply giving the AI more “thinking time” or using collaborative multi-agent strategies did not solve the problem and sometimes introduced new errors. In a pediatric intensive care simulation, a team of five virtual AI specialists debated a two-year-old’s treatment. At a critical juncture, three specialists mistakenly voted to end the encounter early under the assumption that the patient was stable. This premature termination left vital resuscitation steps, such as administering a fluid bolus and taking a bedside glucose reading, entirely uncompleted.
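The voting failure can be made concrete with a short Python sketch. The vote labels and function names are hypothetical, and the guarded variant is only an illustration of why a bare majority is fragile here; it is not a fix proposed or tested in the study.

```python
from collections import Counter


def majority_decision(votes):
    """Return the option backed by a strict majority of votes, or None."""
    option, count = Counter(votes).most_common(1)[0]
    return option if count > len(votes) / 2 else None


def should_end_encounter(votes, pending_critical_steps):
    """Hypothetical safeguard: never end while critical steps are outstanding."""
    if pending_critical_steps:
        return False
    return majority_decision(votes) == "end"


# Three of five specialists vote to end, as in the pediatric ICU case.
votes = ["end", "end", "end", "continue", "continue"]
pending = ["fluid bolus", "bedside glucose"]

naive_end = majority_decision(votes) == "end"
guarded_end = should_end_encounter(votes, pending)
print(naive_end, guarded_end)
```

With a bare majority rule the three "end" votes close the encounter, while the hypothetical checklist guard keeps it open until the outstanding steps are done.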
Ultimately, MedSP1000 demonstrates that safe medical practice requires more than just clinical knowledge; it demands meticulous execution and active communication. The researchers conclude that current LLMs are not yet reliable enough to operate autonomously and must remain under strict human supervision as assistive tools.