A NEW study has shown that large language models (LLMs) have strong potential to evaluate adherence to medical research reporting standards, with GPT-4 variants performing best at checking randomised controlled trials (RCTs) on artificial intelligence interventions against CONSORT-AI guidelines.
Why LLMs Matter in Medical Reporting
Chatbots powered by LLMs have previously been used to evaluate whether RCT abstracts adhered to CONSORT-Abstract guidelines. However, their role in reviewing full AI-based intervention studies against the CONSORT-AI framework has remained underexplored. By automating assessments, LLMs could reduce the burden on reviewers and help improve transparency and consistency in how AI interventions are reported, though concerns remain about their precision in handling highly complex criteria.
Study Design and Key Findings
This cross-sectional study analysed 41 RCTs on AI interventions published in JAMA Network Open. Six different LLMs were used, with all queries submitted via an application programming interface set to a temperature of 0 to ensure consistent outputs. Results showed that gpt-4-0125-preview achieved the highest Overall Consistency Score (OCS), with an author-reported rate of 86.5% (95% CI 82.5%–90.5%) and a researcher-verified rate of 81.6% (95% CI 77.6%–85.6%). Close behind was gpt-4-1106-preview, with respective scores of 80.3% and 78.0%. The weakest performance came from gpt-3.5-turbo-0125, which scored around 62%–63%. Of the 11 CONSORT-AI items, Item 2, which requires stating inclusion and exclusion criteria at the level of the input data, consistently performed worst, with an average OCS of 48.8%, while Items 1, 5, 8, and 9 exceeded 80% consistency across all models.
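To illustrate the kind of setup the study describes, the sketch below shows a single temperature-0 query to a chat model via the OpenAI Python client. It is a minimal sketch only, assuming that client library: the prompt wording, the assess_item helper, and the yes/no answer format are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch (not the study's pipeline): query a chat model at
# temperature 0 so repeated runs yield consistent assessments.
# Prompt wording and helper name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def assess_item(trial_text: str, consort_ai_item: str,
                model: str = "gpt-4-0125-preview") -> str:
    """Ask the model whether a trial report addresses one CONSORT-AI item."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # the study fixed temperature at 0 for consistent outputs
        messages=[
            {"role": "system",
             "content": "You check randomised controlled trial reports "
                        "against CONSORT-AI reporting items."},
            {"role": "user",
             "content": f"CONSORT-AI item: {consort_ai_item}\n\n"
                        f"Trial report:\n{trial_text}\n\n"
                        "Does the report address this item? Answer yes or no, "
                        "then quote the supporting passage."},
        ],
    )
    return response.choices[0].message.content
```

Fixing the temperature at 0 makes the model's output (near-)deterministic, which is what allows consistency rates across items and trials to be compared meaningfully between models.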
Implications for Future Use of LLMs
The findings suggest that LLMs, especially GPT-4 variants, can be powerful tools for assessing adherence to research reporting guidelines. However, they are not yet capable of fully autonomous evaluation and require human oversight to resolve nuanced or ambiguous cases. Integrating these tools with expert judgement could streamline guideline adherence checks and improve the overall quality of AI trial reporting.
Reference
Luo X et al. Using large language models to assess the consistency of randomized controlled trials on AI interventions with CONSORT-AI: cross-sectional survey. J Med Internet Res. 2025;27:e72412.