Int J Med Inform. 2025 Nov 20;207:106190. doi: 10.1016/j.ijmedinf.2025.106190. Online ahead of print.
ABSTRACT
BACKGROUND: Despite the increasing use of AI tools like ChatGPT, Claude, and Gemini in scientific writing, concerns remain about their ability to generate accurate, high-quality, and consistent abstracts for research publications. The reliability of AI-generated abstracts in dental research is questionable when compared to human-written counterparts. This study aimed to develop a framework for evaluating AI-generated abstracts and compare the performance of ChatGPT, Claude, and Gemini against human-written abstracts in dental research.
METHODS: The DAISY framework was developed to evaluate AI-generated abstracts across five domains: Data accuracy (D), Abstract quality (A), Integrity and consistency (I), Syntax and fluency (S), and Yield of human likelihood (Y). Reliability of the framework was assessed using Cohen's kappa (κ = 0.85) and Pearson's correlation coefficient (0.92) for inter- and intra-expert reliability, respectively, and was found to be satisfactory. This study adopted a comparative observational design. Eight research articles belonging to structured (n = 4) and unstructured (n = 4) categories were selected from reputable journals. Researchers trained in scientific writing wrote abstracts for these articles, while AI-generated abstracts were obtained using specific prompts. Ten dental experts evaluated the abstracts using the framework. Statistical analysis was performed using ANOVA and Tukey's post-hoc test.
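For illustration only, the sketch below shows one way the group comparison described above (one-way ANOVA followed by Tukey's post-hoc test) could be run in Python with scipy and statsmodels. The group names, simulated scores, and significance level are assumptions for demonstration and do not reproduce the study's data or analysis code.

# Hypothetical sketch: one-way ANOVA followed by Tukey's HSD across abstract sources.
# All scores below are simulated placeholders, not study data.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Simulated DAISY-domain scores for the four abstract sources (illustrative only).
groups = {
    "Human":   rng.normal(90, 5, 10),
    "ChatGPT": rng.normal(70, 5, 10),
    "Gemini":  rng.normal(60, 5, 10),
    "Claude":  rng.normal(55, 5, 10),
}

# One-way ANOVA across the four groups.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

# Tukey's post-hoc test for pairwise comparisons at alpha = 0.05.
scores = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))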
RESULTS: Human-written abstracts consistently outperformed AI-generated ones across all DAISY framework domains. Among the AI tools, ChatGPT scored highest in all DAISY framework domains, followed by Gemini and Claude. Human-written abstracts achieved the highest human likelihood score (90.25 ± 4.68), while AI-generated abstracts scored below 50%, with Gemini scoring the lowest (3.25 ± 1.75). The differences between the groups were statistically significant (P ≤ 0.05).
CONCLUSION: The DAISY framework proved reliable for evaluating AI-generated abstracts. While ChatGPT performed better than other AI tools, none matched the quality of human-written abstracts. This indicates that AI tools, though valuable, remain limited in producing credible scientific writing in dental research.
PMID:41285065 | DOI:10.1016/j.ijmedinf.2025.106190