
Leveraging Large Language Models to Generate Multiple-Choice Questions for Ophthalmology Education

JAMA Ophthalmol. 2025 Oct 16. doi: 10.1001/jamaophthalmol.2025.3622. Online ahead of print.

ABSTRACT

IMPORTANCE: Multiple-choice questions (MCQs) are an integral component of ophthalmology residency training evaluation and board certification; however, high-quality questions are difficult and time-consuming to draft.

OBJECTIVE: To evaluate whether general-domain large language models (LLMs), particularly OpenAI’s Generative Pre-trained Transformer 4 (GPT-4), can reliably generate high-quality, novel, and readable MCQs comparable to those of a committee of experienced examination writers.

DESIGN, SETTING, AND PARTICIPANTS: This survey study, conducted from September 2024 to April 2025, assessed LLM performance in generating MCQs based on the American Academy of Ophthalmology (AAO) Basic and Clinical Science Course (BCSC), compared with questions written by a committee of human experts. Ten expert ophthalmologists, who were masked to the generation source, independently evaluated MCQs using a 10-point Likert scale (1 = extremely poor; 10 = criterion standard quality) across 5 criteria: appropriateness, clarity and specificity, relevance, discriminative power, and suitability for trainees.

INTERVENTION: Relevant BCSC content and AAO question-writing guidelines were input into GPT-4o via Microsoft’s Azure OpenAI Service, and structured prompts were used to generate MCQs.
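
The abstract does not give the study's actual prompts or deployment settings, but a minimal sketch of MCQ generation through Azure OpenAI's chat completions API might look as follows; the deployment name, environment variables, and prompt wording are illustrative placeholders.

```python
# Hedged sketch: generate one board-style MCQ from supplied source text via
# Azure OpenAI. Deployment name, endpoint, and prompt text are assumptions.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def generate_mcq(bcsc_excerpt: str, guidelines: str) -> str:
    """Ask the model for one MCQ grounded in the supplied BCSC excerpt."""
    response = client.chat.completions.create(
        model="gpt-4o",  # Azure deployment name (placeholder)
        messages=[
            {"role": "system",
             "content": "You write ophthalmology board-style multiple-choice "
                        "questions that follow these question-writing "
                        "guidelines:\n" + guidelines},
            {"role": "user",
             "content": "Source material:\n" + bcsc_excerpt +
                        "\n\nWrite one MCQ with 4 answer options, indicate "
                        "the correct answer, and give a brief explanation."},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```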

MAIN OUTCOMES AND MEASURES: The primary outcomes were median rating scores, with statistical comparisons performed using the bootstrapping method; string similarity scores based on Levenshtein distance (0-100, with 100 indicating identical content) between LLM-generated MCQs and the entire BCSC question bank; the Flesch Reading Ease metric for readability; and the intraclass correlation coefficient (ICC) for interrater agreement.
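
For illustration, the two text-based metrics named above could be computed as in the sketch below, assuming the rapidfuzz and textstat packages; the exact tooling used in the study is not stated in the abstract.

```python
# Hedged sketch of the similarity and readability metrics; example inputs and
# the choice of libraries are assumptions, not study specifics.
from rapidfuzz.distance import Levenshtein
import textstat

def max_similarity(llm_question: str, question_bank: list[str]) -> float:
    """Levenshtein-based similarity (0-100) to the closest bank question."""
    return max(
        Levenshtein.normalized_similarity(llm_question, q) * 100
        for q in question_bank
    )

def readability(text: str) -> float:
    """Flesch Reading Ease score; higher values indicate easier reading."""
    return textstat.flesch_reading_ease(text)

# Under this scoring, a value below 60 would correspond to the abstract's
# "limited or no resemblance" interpretation.
```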

RESULTS: The 10 graders had between 1 and 28 years of clinical experience in ophthalmology (median [IQR] experience, 6 years [3-15 years]). Questions generated by GPT-4 and a committee of experts received median scores of 9 and 9 in combined scores, appropriateness, clarity and specificity, and relevance (difference, 0; 95% CI, 0-0; P > .99); 8 and 9 in discriminative power (difference, 1; 95% CI, -1 to 1; P = .52); and 8 and 8 in suitability for trainees (difference, 0; 95% CI, -1 to 0; P > .99), respectively. Nearly 95% of LLM-MCQs had similarity scores less than 60, indicating most LLM-MCQs had limited or no resemblance to existing content. Interrater reliability was moderate (ICC, 0.63; P < .001), and mean (SD) readability scores were similar across sources (37.14 [22.54] vs 42.60 [22.84]; P > .99).
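
The abstract does not describe the exact resampling scheme, but a simple percentile bootstrap for the difference in median ratings between sources, as reported above, could be sketched as follows; the iteration count and independent resampling of the two groups are assumptions.

```python
# Hedged sketch: percentile bootstrap CI for the difference in median ratings.
import numpy as np

def bootstrap_median_diff(llm_scores, expert_scores, n_boot=10_000, seed=0):
    """Return the observed median difference and a percentile 95% CI."""
    rng = np.random.default_rng(seed)
    llm = np.asarray(llm_scores, dtype=float)
    exp = np.asarray(expert_scores, dtype=float)
    observed = np.median(exp) - np.median(llm)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        llm_resample = rng.choice(llm, size=llm.size, replace=True)
        exp_resample = rng.choice(exp, size=exp.size, replace=True)
        diffs[i] = np.median(exp_resample) - np.median(llm_resample)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return observed, (low, high)
```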

CONCLUSIONS AND RELEVANCE: In this survey study, the results indicate that an LLM could be used to develop ophthalmology board-style MCQs and expand examination banks to further support ophthalmology residency training. Although most questions had low similarity scores, the quality, novelty, and readability of LLM-generated questions warrant further assessment.

PMID:41100119 | DOI:10.1001/jamaophthalmol.2025.3622
