GPT-4 Learning Models in Dermatology Rated Poor

Overall, GPT-4 learning models in dermatology demonstrated substandard information quality in evaluation, yet rarely offered harmful advice.

GPT-4 Learning Models in Dermatology Compared with UpToDate

Researchers assessed whether GPT-4 outputs could function as reliable clinician-facing references for common dermatologic conditions. Using a standardized prompt, the team asked two GPT-4 models, ChatGPT 4 and Copilot, to generate summaries and treatment recommendations for 33 dermatologic diagnoses. They then compared those outputs with matched UpToDate excerpts, using the DISCERN instrument to score medical information quality and a board-certified dermatologist to judge how closely each model’s treatment recommendations aligned with the benchmark.

Medical Quality Scores Were Lower for GPT-4 Outputs

Artificial intelligence tools can produce confident clinical text, but the study found clear gaps in information quality. On the DISCERN scale, UpToDate content was rated “fair” on average, while both GPT-4 models were rated “poor.” The investigators reported mean DISCERN scores of 3.08 for UpToDate versus 2.28 for ChatGPT 4 and 2.31 for Copilot, based on independent scoring by two authors. Readability and word count were also analyzed to better understand how these tools present clinical information to end users.

Treatment Concordance Varied by Model

When treatments were evaluated against UpToDate, ChatGPT 4 showed higher average concordance than Copilot: on average, 64.89% of ChatGPT 4 recommendations aligned with the benchmark, compared with 31.38% for Copilot, a difference the authors reported as statistically significant. Despite the lower overall quality ratings, the authors noted that the GPT-4 models generated relatively few recommendations considered harmful.

What This Means for Clinical Utility

The findings suggest that parameter choices and query structure may meaningfully influence output quality, and that performance can vary across large language models even within the same model generation. The authors conclude that GPT-4 learning models may be most appropriate as time-saving adjuncts used alongside the judgement of a board-certified dermatologist, rather than as standalone references, particularly in settings where access to dermatologic expertise is limited.

Reference: Naik A et al. Implementing GPT-4 Learning Models in Dermatology: An Assessment of Medical Quality and Utility. Skin Research and Technology. 2026;32(2):e70331.

Each article is made available under the terms of the Creative Commons Attribution-NonCommercial 4.0 License.
