J Med Internet Res. 2026 Jun 16;28:e92852. doi: 10.2196/92852.
ABSTRACT
BACKGROUND: Public health medical education is increasingly important in the low-resource, high-altitude Xizang Autonomous Region (Tibet). Traditional authoritative textbooks do not meet modern needs for accessibility and interactivity, whereas general large language models (LLMs) may hallucinate in specialized medical domains. Developing specialized LLMs for low-resource regions is also expensive and difficult.
OBJECTIVE: This study aimed to explore a novel approach to high-altitude public health medical education in the low-resource Xizang Autonomous Region that integrates modern LLMs with authoritative textbooks, combining a comprehensive multidimensional benchmark evaluation with retrieval-augmented generation (RAG) technology.
METHODS: We conducted a 2-stage cross-sectional comparative evaluation study to benchmark publicly available LLMs and evaluate the added value of textbook-augmented retrieval under standardized generation settings and blinded expert assessment. First, 4 publicly available LLMs (GPT-5.2 [OpenAI], Gemini 3.0 Pro [Google], DeepSeek R1 [DeepSeek], and Tencent HY 2.0 [Tencent]) were benchmarked using an 80-question benchmark on high-altitude public health medicine developed by authoritative medical specialists. Each question was asked 3 times, yielding 960 outputs (4 models × 80 questions × 3 repetitions); first responses (n=320) were scored under blinded conditions by 2 independent 8-member physician panels. The evaluation combined clinically weighted multidimensional first-response scores (including comprehensiveness, accuracy, clarity, and relevance) with a composite consistency metric (including semantic similarity and algorithmic similarity). Second, 4 widely used authoritative textbooks on high-altitude public health medicine (Ward, Milledge and West’s High Altitude Medicine and Physiology; High Altitude Medicine: A Case-Based Approach; High Altitude Medicine; and High Altitude Medical Protection) were deployed as the external knowledge base for the evaluation-optimized model. Statistical analyses included Spearman ρ, Cronbach α, intraclass correlation coefficients, Friedman tests with Dunn multiple comparisons, and paired Wilcoxon signed-rank tests. The significance threshold was set at α=.05.
RESULTS: DeepSeek R1 achieved the highest weighted score (5.61/10.00), ahead of GPT-5.2 (5.51/10.00), Gemini 3.0 Pro (5.39/10.00), and Tencent HY 2.0 (4.71/10.00), and was therefore selected as the optimal base model. The deployed retrieval-augmented model, HPHME-Xplus-RAG, which integrates the authoritative textbooks with DeepSeek R1, achieved significantly higher multidimensional scores than baseline DeepSeek R1 (median 8.00 [IQR 7.88-8.00] vs median 7.63 [IQR 7.38-7.88]; P<.001; r_rb=0.68, indicating a large effect).
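The effect size r_rb reported above accompanies a paired Wilcoxon signed-rank test. The abstract does not state which formula was used; the sketch below assumes the matched-pairs rank-biserial correlation (signed-rank sums divided by their total, with zero differences dropped and tied magnitudes given average ranks). The function name and the example scores are hypothetical and are not study data.

```python
def rank_biserial(baseline, augmented):
    """Matched-pairs rank-biserial correlation (assumed definition).

    Returns (W_plus - W_minus) / (W_plus + W_minus), where W_plus and
    W_minus are the sums of ranks of positive and negative differences.
    """
    # Paired differences; zero differences are discarded, as in the
    # classic Wilcoxon signed-rank procedure.
    diffs = [a - b for a, b in zip(augmented, baseline) if a != b]
    if not diffs:
        return 0.0

    # Rank the absolute differences, assigning average ranks to ties.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (w_plus - w_minus) / (w_plus + w_minus)


# Hypothetical per-question scores, for illustration only.
baseline_scores = [7.0, 7.5, 7.25, 8.0]
augmented_scores = [8.0, 8.0, 7.0, 8.0]
effect = rank_biserial(baseline_scores, augmented_scores)
```

Under this definition, a value of 1 means every nonzero difference favors the augmented model, and 0 means the signed ranks balance out.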
CONCLUSIONS: Integrating authoritative textbooks with an evaluation-optimized general LLM through an RAG framework showed strong performance for medical education in the low-resource Xizang Autonomous Region. Unlike prior studies that mainly evaluated general LLMs or used clinical guidelines to build RAG systems for diagnosis and treatment, this study used authoritative textbooks for the broader, guideline-scarce field of public health medical education. This work provides a replicable workflow (domain-authoritative knowledge + RAG + model optimization and evaluation) for low-resource settings, with practical implications for medical instructors and students, hospitals, and public health services seeking cost-effective, convenient, and trustworthy educational support.
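The retrieval step of an RAG workflow can be illustrated with a minimal sketch. The abstract does not describe how HPHME-Xplus-RAG indexes or retrieves textbook content, so everything below is an assumption: a bag-of-words cosine similarity stands in for whatever retriever the authors used, the passages are placeholder sentences rather than textbook quotations, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that grounds the model in the retrieved excerpts."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the textbook excerpts below.\n"
        f"{context}\n"
        f"Question: {question}"
    )


# Placeholder passages standing in for a textbook knowledge base.
passages = [
    "Acute mountain sickness often presents with headache after rapid ascent.",
    "Hand washing reduces transmission of gastrointestinal infections.",
    "Gradual ascent lowers the risk of acute mountain sickness.",
]
question = "How can acute mountain sickness be prevented during ascent?"
prompt = build_prompt(question, passages)
```

The resulting prompt would then be sent to the base LLM. A production system would more plausibly use dense embeddings and chunked textbook text, but the grounding pattern is the same.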
PMID:42302307 | DOI:10.2196/92852