Collaborative AI Exams Set Accuracy Record in Medicine - EMJ

Collaborative AI Passes US Medical Exams with High Accuracy

A COUNCIL of AI models, working together in a structured dialogue, has set a new record in passing U.S. medical exams, achieving up to 97% accuracy on questions spanning all three steps of the United States Medical Licensing Examination (USMLE). In this multi-agent “council” approach, five GPT-4 models iteratively deliberated, discussed, and self-corrected their answers, outperforming any single-instance AI.

AI Council: Redefining Model Collaboration

Past studies showed that large language models (LLMs) could pass medical licensing exams, but their responses to the same question varied and some contained errors or hallucinations. By building a council of AI models, researchers harnessed collective reasoning, with a facilitator algorithm prompting the models to deliberate, summarise responses, and refine answers. The council's consensus answers were correct in 97%, 93%, and 94% of cases for Step 1, Step 2 CK, and Step 3, respectively – significantly higher than both previous AI models and single-agent performance.
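
The facilitated loop described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's implementation: the agents here are stand-in functions rather than GPT-4 calls, the facilitator's "summary" is reduced to a simple tally of answers, and the names `run_council`, `make_agent`, and `max_rounds`, along with the majority-vote fallback, are all assumptions introduced for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical agent type: receives the question and the facilitator's tally
# from the previous round (None in the first round) and returns an answer.
Agent = Callable[[str, Optional[Counter]], str]


def make_agent(initial_answer: str) -> Agent:
    """Stand-in agent: gives its own answer first, then adopts the most
    common answer reported by the facilitator. A real agent would be an LLM."""
    def agent(question: str, tally: Optional[Counter]) -> str:
        if tally is None:
            return initial_answer
        return tally.most_common(1)[0][0]
    return agent


def run_council(question: str, agents: List[Agent],
                max_rounds: int = 5) -> Tuple[str, int, bool]:
    """Run rounds until all agents agree or max_rounds is reached.

    Returns (answer, rounds_used, unanimous)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    tally: Optional[Counter] = None
    for round_no in range(1, max_rounds + 1):
        answers = [agent(question, tally) for agent in agents]
        if len(set(answers)) == 1:
            return answers[0], round_no, True
        # Facilitator step: summarise this round for the next one.
        tally = Counter(answers)
    # Assumed fallback: take the majority answer if no consensus emerges.
    return tally.most_common(1)[0][0], max_rounds, False


council = [make_agent(choice) for choice in "BBBAC"]
print(run_council("Example question", council))  # ('B', 2, True)
```

In this toy run the five agents disagree in the first round, see the tally, and converge in the second. Real deliberation would involve free-text reasoning rather than a bare vote count.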

Results and Strengths of the AI Council

When initial responses didn’t agree, the council engaged in debate, reaching the right answer 83% of the time and correcting more than half of the answers that an initial majority vote had got wrong. Deliberation improved the odds of converting an incorrect answer into a correct one by a factor of five. This process reduced semantic entropy, meaning answer variability decreased as consensus emerged. The findings reveal that what was previously seen as unpredictable model behaviour can be channelled as a strength – using dialogue to self-correct and adapt reasoning.
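
To make the idea of falling answer variability concrete, the sketch below computes Shannon entropy over the answer choices given by a group of agents. This is a simplified stand-in, assumed for illustration: semantic entropy as used in the literature groups free-text responses by meaning, whereas here each multiple-choice letter is treated as its own group. The function name `answer_entropy` and the example answers are hypothetical and are not data from the study.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the distribution of answers.

    0.0 means every agent gave the same answer; higher values mean
    the answers are more spread out."""
    if not answers:
        raise ValueError("answers must not be empty")
    total = len(answers)
    counts = Counter(answers)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


before = answer_entropy(["B", "B", "B", "A", "C"])  # agents disagree
after = answer_entropy(["B", "B", "B", "B", "B"])   # consensus reached
print(before > after, after)  # True 0.0
```

A drop in this value from one round to the next is one simple way to see consensus emerging.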

Implications: Next Steps for Collaborative AI

While not yet tested in real clinical settings, the collaborative council approach could make medical AI safer and more reliable for healthcare. The study suggests that future tools in clinical education and patient care should embrace varied AI perspectives, unlocking new possibilities by leveraging teamwork rather than demanding consistency from a single model.

Reference

Shaikh Y et al. Collaborative intelligence in AI: evaluating the performance of a council of AIs on the USMLE. PLOS Digital Health. 2025;DOI:10.1371/journal.pdig.0000787.

Each article is made available under the terms of the Creative Commons Attribution-Non Commercial 4.0 License.
