A LARGE UK study finds that large language models (LLMs) acting as medical assistants do not reliably help the public identify health conditions or decide when to seek care, highlighting critical gaps between technical performance and real-world safety.
Why AI Medical Assistants Attract Attention
Interest in AI medical assistants has surged as healthcare systems face workforce shortages and rising demand for accessible advice. Large language models now achieve near-perfect scores on medical licensing-style exams, fuelling expectations that they could support patients outside clinical settings. However, translating expert-level knowledge into safe, understandable guidance for non-specialists remains uncertain. To address this, researchers examined whether LLMs as medical assistants genuinely improve how members of the public interpret symptoms and choose appropriate actions, compared with using their usual information sources.
Testing LLMs as Medical Assistants with Real Users
In a randomised, preregistered study, 1,298 UK adults were asked to respond to one of ten medical scenarios developed by doctors. Participants identified possible underlying conditions and selected a recommended course of action on a five-point scale, ranging from staying at home to calling an ambulance. They were randomly assigned to receive help from GPT-4o, Llama 3, or Command R+, or to use any sources they preferred (the control group).
When tested alone, the models performed strongly, correctly identifying the relevant conditions in 94.9% of cases and the correct disposition in 56.3% on average. When participants used the same tools, however, performance dropped sharply: users identified relevant conditions in fewer than 34.5% of cases and chose the correct disposition in fewer than 44.2%, no better than the control group. Despite interacting freely with the models, participants often provided incomplete information or misunderstood responses. The study found no meaningful improvement in decision-making, even though the underlying models were capable of producing correct answers.
Implications for Clinical Practice and Deployment
The findings raise important concerns about deploying AI medical assistants directly to the public. High benchmark scores and simulated patient tests did not predict how poorly humans would perform when interacting with these systems. For clinical practice, this suggests that unsupervised use could fail to improve safety and may create false reassurance. The authors argue that future development must prioritise human-centred design, clearer communication, and rigorous user testing with diverse populations. Before LLMs as medical assistants are used at scale, healthcare systems will need evidence that they improve understanding and decision-making, not just technical accuracy.
Reference
Bean A et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med. 2026;DOI:10.1038/s41591-025-04074-y.