
Feasibility of AI-powered assessment scoring: Can large language models replace human raters?

Clin Neuropsychol. 2025 Sep 1:1-14. doi: 10.1080/13854046.2025.2552289. Online ahead of print.

ABSTRACT

Objective: To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. The performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e., word lists, numeric tables, and drawing responses).

Method: Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired-samples t-tests, and descriptive statistics evaluated interrater reliability, accuracy, and speed.

Results: Before the public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g., CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-0.853), with small average scoring discrepancies per test (CVLT-II = 1.05, SDMT = 0.05, BVMT-R = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by both human raters and completed scoring of each BICAMS protocol in under 9 min. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g., ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g., SDMT = 6.79).

Conclusions: ChatGPT-4.5 demonstrated accuracy comparable to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency and reducing human error.
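The reliability analysis above rests on intraclass correlation coefficients. As an illustration only (not the authors' code, and the abstract does not state which ICC form was used), a minimal two-way mixed, single-measure ICC, i.e., ICC(3,1), for comparing an LLM's scores against a human rater's could be sketched in pure Python as:

```python
# Illustrative sketch, assuming the ICC(3,1) "consistency" form; the study may
# have used a different ICC variant (e.g., absolute agreement).

def icc_3_1(ratings):
    """ratings: one tuple of scores per subject, one entry per rater."""
    n = len(ratings)     # number of subjects (e.g., BICAMS protocols)
    k = len(ratings[0])  # number of raters (e.g., LLM and human)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_error = ss_total - ss_rows - ss_cols                 # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical scores for four protocols from two raters (not study data):
print(icc_3_1([(10, 11), (12, 12), (14, 15), (16, 15)]))
```

Identical score columns yield an ICC of 1.0, matching the perfect SDMT agreement reported pre-release; a negative ICC, as seen for BVMT-R Trial 3 post-release, indicates rater disagreement exceeding between-subject variability.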

PMID:40889122 | DOI:10.1080/13854046.2025.2552289

By Nevin Manimala
