Am J Clin Pathol. 2026 Jan 5;165(2):aqaf153. doi: 10.1093/ajcp/aqaf153.
ABSTRACT
OBJECTIVE: To evaluate the ability of 4 artificial intelligence large language models (LLMs) to create items that align with the item writing standards of the American Board of Pathology (ABPath) for continuing certification.
METHODS: An informatics item writing application was developed and used with prompts based on the ABPath item writing standards. Uniform prompts were used for all LLMs evaluated, with the content of the generated items tailored to the expertise of the reviewing subject matter experts (SMEs). The SMEs were blinded to the identity of the LLM that generated each item. The 14 SMEs graded 4 written items and 4 practical items, with 1 item in each set of 4 generated by each of the LLMs. The 19 questions used for grading concentrated on item anatomy (ie, item structure), accuracy, relevance, and level of item difficulty, giving a maximum possible score of 266 (14 SMEs × 19 questions) per LLM for each item type.
RESULTS: The overall scores for the 4 LLMs on the written items were as follows: Claude, 229 of 266 (86.1%); ChatGPT, 212 of 266 (79.7%); Llama, 175 of 266 (65.8%); and Titan, 162 of 266 (60.9%). The overall scores for the 4 LLMs on the practical items were as follows: Claude, 247 of 266 (92.9%); ChatGPT, 216 of 266 (81.2%); Llama, 175 of 266 (65.8%); and Titan, 151 of 266 (56.8%). Statistically significant differences were observed among the LLMs.
CONCLUSIONS: Based on SME scoring, we observed significant differences in the ability of the 4 LLMs evaluated to draft items consistent with the ABPath guidelines. It is important to assess the available LLMs to determine which model best meets the user's needs for the proposed task rather than to assume equivalence among them.
PMID:41722024 | DOI:10.1093/ajcp/aqaf153