Clinical accuracy and applications of large language models in pediatric orthopedics: a systematic review

J Pediatr Orthop B. 2026 Jun 23. doi: 10.1097/BPB.0000000000001368. Online ahead of print.

ABSTRACT

This systematic review evaluated the accuracy, reliability, and clinical applicability of artificial intelligence and large language models (LLMs) in pediatric orthopedics, comparing their performance against established clinical guidelines and assessing their utility for patient education and clinical decision support. A search of PubMed and ScienceDirect (2020-2025) identified 2624 articles using the keywords ‘ChatGPT’, ‘Gemini’, ‘Claude’ and ‘orthopedic pediatrics’. After screening and refinement using Preferred Reporting Items for Systematic Reviews and Meta-Analyses 2020 guidelines, 15 studies met inclusion criteria. Studies evaluated ChatGPT, Google Gemini, Meta AI, Microsoft Copilot, and Claude across multiple pediatric orthopedic conditions, including developmental dysplasia of the hip, slipped capital femoral epiphysis, and scoliosis. Heterogeneity was assessed using Cochran’s Q and I² statistics, and publication bias was evaluated using funnel plots and Egger’s test. LLM accuracy ranged from 44.3% to 93% (pooled: 74.1%). Reproducibility was moderate, with ChatGPT demonstrating a Spearman coefficient of 0.55 for complex queries. Regional expert consensus scores varied significantly (Europe: 80, North America: 65; P = 0.034; Fleiss’ kappa = 0.113). Up to 33% of responses to guideline-based questions were rated neutral or inaccurate. Reading complexity was elevated (Flesch-Kincaid grade: 12.7), exceeding the recommended sixth-grade level. In parent surveys, 82% of respondents trusted artificial intelligence as a supplementary tool under professional oversight. Minimal statistical heterogeneity was observed (I² = 0.00%), though publication bias was detected (Egger’s test P = 0.0001). LLMs show potential for education and triage but are limited by inconsistency in complex scenarios, elevated reading complexity, and significant regional variability in expert assessments.
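For readers unfamiliar with the heterogeneity statistics named above, the following is a minimal sketch of how Cochran's Q and I² are conventionally computed from per-study estimates with inverse-variance weights. It is not the authors' analysis code. The function name `cochran_q_i2`, the binomial variance approximation, and the sample accuracies and question counts are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def cochran_q_i2(
    effects: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (pooled estimate, Cochran's Q, I² in percent).

    Uses fixed-effect inverse-variance weights w_i = 1 / v_i.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I² = (Q - df) / Q, truncated at zero when Q does not exceed df.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical accuracies and question counts, for illustration only
# (these are NOT values from the reviewed studies).
accuracies = [0.72, 0.75, 0.78]
n_questions = [50, 40, 60]
# Binomial approximation to the variance of a proportion (an assumption).
variances = [p * (1 - p) / n for p, n in zip(accuracies, n_questions)]
pooled, q, i2 = cochran_q_i2(accuracies, variances)
print(f"pooled={pooled:.3f}, Q={q:.2f}, I2={i2:.1f}%")
```

Because I² is truncated at zero whenever Q is no larger than its degrees of freedom, a reported value of 0.00% means the observed spread did not exceed what sampling error alone would predict. It does not mean the studies were identical.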
These tools should be used as educational supplements under professional medical supervision rather than for independent clinical decision-making. Broader clinical application requires domain-specific tuning, standardized evaluation, and readability optimization.
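The readability figure cited in the abstract is a Flesch-Kincaid grade level. The sketch below applies the standard formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, as one way such a score could be checked for an LLM response. The helper names and the vowel-group syllable heuristic are assumptions for illustration; validated readability tools count syllables more carefully, and the abstract does not state which tool the included studies used.

```python
import re


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade-level formula from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per vowel group."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def grade_for_text(text: str) -> float:
    """Approximate grade level of free text using the heuristic above."""
    words = re.findall(r"[A-Za-z']+", text)
    # Treat each run of terminal punctuation as one sentence boundary.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return flesch_kincaid_grade(len(words), sentences, syllables)


# Hypothetical LLM answer, for illustration only.
sample = "Your child may need a brace. The doctor will check the hip again soon."
print(round(grade_for_text(sample), 1))
```

A score near 6 corresponds to the sixth-grade target mentioned in the abstract. Scores around 12 or higher indicate text written at roughly a high-school graduate reading level.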

LEVEL OF EVIDENCE: Level V, systematic review.

PMID:42322047 | DOI:10.1097/BPB.0000000000001368

By Nevin Manimala