
Susceptibility of Assessment Types to AI-Generated Content in Digital Health and Health Information Management Education: Quasi-Experimental Pilot Study

JMIR Med Educ. 2026 Mar 30;12:e82988. doi: 10.2196/82988.

ABSTRACT

BACKGROUND: Generative artificial intelligence (AI) tools, such as ChatGPT, are increasingly used in higher education and have raised significant concerns about assessment validity and academic integrity. In Digital Health and Health Information Management (DIGHIM) programs, assessments are designed to evaluate a mix of technical skills, contextual reasoning, and professional judgment that underpin medical and health practice. Understanding how generative AI performs across different assessment types is, therefore, critical to identifying which formats are most susceptible to AI-generated content and how assessments may be redesigned to remain authentic and educationally meaningful.

OBJECTIVE: This study aimed to evaluate ChatGPT’s performance across diverse assessment types in DIGHIM education by examining how task complexity influences AI-generated output quality, and to develop recommendations for ethical and effective AI integration in assessments.

METHODS: A pilot quasi-experimental design compared ChatGPT-generated responses with deidentified student submissions across 5 assessment types: digital health solution design, business case analysis, reflective assessment, SQL health database programming, and a health classification quiz. For each task, multiple AI submissions were produced using different prompting strategies, including rubric integration and the use of different ChatGPT models (GPT-4 and o1 Preview). Blinded academic markers evaluated all AI-generated submissions and previously submitted deidentified student assessments against standard rubrics, and descriptive statistics were used to compare performance.

RESULTS: ChatGPT’s performance varied considerably across assessment types. It achieved its highest accuracy scores in objective, rule-based tasks such as multiple-choice quiz items in health classification (mean 88%, SD 0%) and produced well-structured, coherent responses for reflective assessments (mean 69%, SD 12.8%), though these often lacked personalization and nuanced industry context. In descriptive analytical tasks, such as digital health business cases and solution designs, ChatGPT produced logically structured work with reasonable use of evidence but failed to provide deep contextualization, domain-specific insights, or visual elements expected in DIGHIM practice. Technical assessments revealed the greatest limitations: SQL programming tasks averaged 42% (SD 17.2%) with persistent schema errors, incomplete queries, and weak interpretation of health data outputs, while scenario-based clinical coding scored just 7% (SD 0%), reflecting a lack of precision in applying ICD-10-AM (International Classification of Diseases, Tenth Revision, Australian Modification) rules and coding conventions. Structured prompting and rubric integration improved results, particularly in descriptive and reflective tasks (up to 80%), but the advanced o1 Preview model did not consistently outperform earlier versions.

CONCLUSIONS: While ChatGPT performs well in structured, rule-based, and reflective tasks, it remains limited in technical accuracy, contextual reasoning, and applied DIGHIM competencies. To support academic integrity and workforce readiness, assessment design should prioritize critical thinking, ethical reasoning, and scenario-based problem-solving aligned with health care practice. Using AI as a tool for critique and refinement, rather than a substitute for student work, may help educators prepare learners for responsible AI use in medical and health professional education.

PMID:41911020 | DOI:10.2196/82988

By Nevin Manimala
