In this evaluation, GPT-4 learning models in dermatology demonstrated substandard information quality overall, yet rarely offered harmful advice.
GPT-4 Learning Models in Dermatology Compared with UpToDate
Researchers assessed whether GPT-4 outputs could function as reliable clinician-facing references for common dermatologic conditions. Using a standardized prompt, the team asked two GPT-4 models, ChatGPT 4 and Copilot, to generate summaries and treatment recommendations for 33 dermatologic diagnoses. They then compared those outputs with matched UpToDate excerpts, scoring medical information quality with the DISCERN instrument and asking a board-certified dermatologist to judge how closely each model’s treatment recommendations aligned with the benchmark.
Medical Quality Scores Were Lower for GPT-4 Outputs
Artificial intelligence tools can produce confident clinical text, but the study found clear gaps in information quality. On the DISCERN scale, UpToDate content was rated “fair” on average, while both GPT-4 models were rated “poor.” The investigators reported mean DISCERN scores of 3.08 for UpToDate, versus 2.28 for ChatGPT 4 and 2.31 for Copilot, based on independent scoring by two authors. Readability and word count were also analyzed to better understand how these tools present clinical information to end users.
Treatment Concordance Varied by Model
When treatment recommendations were evaluated against UpToDate, ChatGPT 4 showed higher average concordance than Copilot: a mean of 64.89% of ChatGPT 4 recommendations aligned with the benchmark, versus 31.38% for Copilot, a difference the authors reported as statistically significant. Despite the lower overall quality ratings, the authors noted that both GPT-4 models generated relatively few recommendations considered harmful.
What This Means for Clinical Utility
The findings suggest that parameter choices and query structure may meaningfully influence output quality, and that performance can vary across large language models even within the same model generation. The authors conclude that GPT-4 learning models may be most appropriate as time-saving adjuncts used alongside board-certified dermatologist judgment, rather than as standalone references, particularly in settings where access to dermatologic expertise is limited.
Reference: Naik A et al. Implementing GPT-4 Learning Models in Dermatology: An Assessment of Medical Quality and Utility. Skin Research and Technology. 2026;32(2):e70331.