MEDICAL question answering accuracy improved while inference time and token use fell substantially with a new intelligent routing framework designed to direct queries to the most appropriate large language model mode.
Why Medical Question Answering Needs Adaptive Routing
As large language models become more widely used in clinical and healthcare settings, selecting the most appropriate reasoning mode remains a significant challenge. Medical question answering tasks vary considerably in complexity, ranging from routine medication adjustments to rare disease diagnosis. These differences place varying demands on model reasoning capabilities.
Researchers developed SynapseRoute, an intelligent routing framework that dynamically assigns questions to either a high-reasoning “thinking” mode or a faster, lower-cost “non-thinking” mode. While the thinking mode can support more complex inference, it is associated with greater computational cost, longer response times and a higher risk of hallucination. In contrast, the non-thinking mode offers faster and less expensive responses but may struggle with more demanding reasoning tasks.
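The routing idea can be illustrated with a minimal Python sketch. It is not the SynapseRoute implementation: the function names, the keyword-based classifier and the stand-in models are all hypothetical, chosen only to show how a question might be dispatched to one of two modes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedAnswer:
    mode: str    # "thinking" or "non-thinking"
    answer: str


def route_question(
    question: str,
    needs_thinking: Callable[[str], bool],
    thinking_model: Callable[[str], str],
    fast_model: Callable[[str], str],
) -> RoutedAnswer:
    """Send the question to the mode chosen by the classifier."""
    if needs_thinking(question):
        return RoutedAnswer("thinking", thinking_model(question))
    return RoutedAnswer("non-thinking", fast_model(question))


# Hypothetical stand-ins: a real router would be a trained classifier,
# and the two models would be calls to an actual dual-mode LLM.
def keyword_classifier(question: str) -> bool:
    hard_terms = ("rare", "differential", "syndrome")
    return any(term in question.lower() for term in hard_terms)


def toy_thinking_model(question: str) -> str:
    return "slow, step-by-step answer"


def toy_fast_model(question: str) -> str:
    return "quick answer"


routine = route_question(
    "What is the usual adult dose of paracetamol?",
    keyword_classifier, toy_thinking_model, toy_fast_model,
)
complex_case = route_question(
    "What is the differential diagnosis for this rare presentation?",
    keyword_classifier, toy_thinking_model, toy_fast_model,
)
print(routine.mode, "|", complex_case.mode)
```

Passing the classifier in as a function keeps the dispatch logic independent of how routing decisions are made, so a trained model could replace the keyword rule without changing the rest of the code.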
Evaluating Medical Question Answering Performance
The study assessed questions from four medical datasets: USMLE, MedMCQA, PubMedQA and CareQA. Questions were automatically labelled as requiring either thinking or non-thinking modes using the model Qwen3-30B-a3b.
Multiple routing approaches were tested, including traditional machine learning methods and fine-tuned large language models. Performance was measured using standard evaluation metrics and a newly developed Accuracy–Inference–Token index. This metric integrates accuracy, latency and token cost while allowing adjustable weighting according to deployment priorities.
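The exact formula for the Accuracy–Inference–Token index is not given here, so the Python sketch below is an assumed form only: a weighted average of accuracy and the fractional savings in time and tokens relative to a baseline. The function name, parameters and default weights are hypothetical and are meant to show how adjustable weighting could work, not to reproduce the published metric.

```python
def ait_index(
    accuracy: float,
    rel_time: float,
    rel_tokens: float,
    w_acc: float = 1.0,
    w_time: float = 1.0,
    w_tok: float = 1.0,
) -> float:
    """Assumed composite score (not the published definition).

    accuracy   : fraction of questions answered correctly (0 to 1)
    rel_time   : inference time divided by the baseline's time
    rel_tokens : token use divided by the baseline's token use
    w_*        : non-negative weights reflecting deployment priorities
    """
    if min(w_acc, w_time, w_tok) < 0:
        raise ValueError("weights must be non-negative")
    total = w_acc + w_time + w_tok
    if total == 0:
        raise ValueError("at least one weight must be positive")
    time_saving = 1.0 - rel_time
    token_saving = 1.0 - rel_tokens
    return (w_acc * accuracy + w_time * time_saving + w_tok * token_saving) / total


# Thinking-only baseline: no savings relative to itself.
baseline = ait_index(accuracy=0.8272, rel_time=1.0, rel_tokens=1.0)

# Routed system, using the figures reported in the study:
# 36.8% less inference time and 39.66% fewer tokens.
routed = ait_index(accuracy=0.8390, rel_time=1 - 0.368, rel_tokens=1 - 0.3966)

# An accuracy-only weighting reduces the score to plain accuracy.
accuracy_only = ait_index(0.8390, 0.632, 0.6034, w_acc=1.0, w_time=0.0, w_tok=0.0)

print(routed > baseline)
```

Under this assumed form, raising `w_time` or `w_tok` favours cheaper and faster configurations, while a deployment that prioritises correctness would raise `w_acc`.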
The data showed that approximately 58% of medical questions could be answered correctly using the non-thinking mode alone.
Accuracy Improvements with Lower Resource Use
Among the routing methods evaluated, logistic regression delivered the strongest performance, achieving an area under the curve of 0.82.
Compared with relying exclusively on the thinking mode, SynapseRoute improved overall accuracy (0.8390 versus 0.8272). At the same time, the framework reduced inference time by 36.8% and lowered token consumption by 39.66%.
The findings suggest that dynamic routing can improve the practicality and cost efficiency of dual-mode large language model systems for medical question answering. By matching question complexity to the appropriate reasoning mode, SynapseRoute may enable more efficient deployment of artificial intelligence systems while maintaining or improving performance. The researchers also propose that the Accuracy–Inference–Token index could provide a practical method for evaluating performance and resource trade-offs in future adaptive medical question answering systems.
Reference
Zhang W et al. SynapseRoute: an auto-route switching framework on dual-state large language model for medical question answering. BMJ Innovations. 2026; doi: 10.1136/bmjinnov-2025-001529.
Featured image: MohamadFaizal on Adobe Stock






