A NEW study has shown that large language models (LLMs) have strong potential to evaluate adherence to medical research reporting standards, with GPT-4 variants performing best at checking randomised controlled trials (RCTs) on artificial intelligence interventions against CONSORT-AI guidelines.
Why LLMs Matter in Medical Reporting
Chatbots powered by LLMs have previously been used to evaluate whether RCT abstracts adhered to CONSORT-Abstract guidelines. However, their role in reviewing full AI-based intervention studies against the CONSORT-AI framework has remained underexplored. By automating assessments, LLMs could reduce the burden on reviewers and help improve transparency and consistency in how AI interventions are reported, though concerns remain about their precision in handling highly complex criteria.
Study Design and Key Findings
This cross-sectional study analysed 41 RCTs on AI interventions published in JAMA Network Open. Six different LLMs were used, with all queries submitted via an application programming interface set to a temperature of 0 to ensure consistent outputs. Results showed that gpt-4-0125-preview achieved the highest Overall Consistency Score (OCS), with an author-reported rate of 86.5% (95% CI 82.5%–90.5%) and a researcher-verified rate of 81.6% (95% CI 77.6%–85.6%). Close behind was gpt-4-1106-preview, with respective scores of 80.3% and 78.0%. The weakest performance came from gpt-3.5-turbo-0125, which scored around 62%–63%. Of the 11 CONSORT-AI items, Item 2, which requires stating inclusion and exclusion criteria at the level of the input data, consistently performed worst, with an average OCS of 48.8%, while Items 1, 5, 8, and 9 exceeded 80% consistency across all models.
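To illustrate the kind of setup the study describes, the sketch below shows a single temperature-0 query to a chat model via the OpenAI Python client. It is a minimal sketch only, assuming that client library: the prompt wording, the assess_item helper, and the yes/no answer format are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch (not the study's pipeline): query a chat model at
# temperature 0 so repeated runs yield consistent assessments.
# Prompt wording and helper name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def assess_item(trial_text: str, consort_ai_item: str,
                model: str = "gpt-4-0125-preview") -> str:
    """Ask the model whether a trial report addresses one CONSORT-AI item."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # the study fixed temperature at 0 for consistent outputs
        messages=[
            {"role": "system",
             "content": "You check randomised controlled trial reports "
                        "against CONSORT-AI reporting items."},
            {"role": "user",
             "content": f"CONSORT-AI item: {consort_ai_item}\n\n"
                        f"Trial report:\n{trial_text}\n\n"
                        "Does the report address this item? Answer yes or no, "
                        "then quote the supporting passage."},
        ],
    )
    return response.choices[0].message.content
```

Fixing the temperature at 0 makes the model's output (near-)deterministic, which is what allows consistency rates across items and trials to be compared meaningfully between models.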
Implications for Future Use of LLMs
The findings suggest that LLMs, especially GPT-4 variants, can be powerful tools for assessing adherence to research reporting guidelines. However, they are not yet capable of fully autonomous evaluation and require human oversight to resolve nuanced or ambiguous cases. Integrating these tools with expert judgement could streamline guideline adherence checks and improve the overall quality of AI trial reporting.
Reference
Luo X et al. Using large language models to assess the consistency of randomized controlled trials on AI interventions with CONSORT-AI: cross-sectional survey. J Med Internet Res. 2025;27:e72412.