J Chem Inf Model. 2025 Dec 1. doi: 10.1021/acs.jcim.5c01941. Online ahead of print.
ABSTRACT
Large language models (LLMs) have the potential to serve as collaborative assistants in scientific research, but adapting them to specialized domains is difficult because it requires integrating domain-specific knowledge. We propose Materials Dual-Source Knowledge Retrieval-Augmented Generation (MDSK-RAG), a retrieval-augmented generation (RAG) framework that enables domain specialization of LLMs for materials development under fully offline (no-Internet) operation to preserve data confidentiality. The framework unifies two complementary knowledge sources, experimental CSV data (practical knowledge) and scientific PDF literature (theoretical insights), by converting tabular records into template-based text, retrieving relevant passages from each source, summarizing them with a local LLM, and merging the summaries with the user query before generation. As a case study, we applied the framework to metal-sulfide photocatalysts using 740 in-house experimental records and 20 scientific PDFs. We evaluated the framework on a benchmark of 14 expert-defined questions, using two-sided Wilcoxon signed-rank tests for paired comparisons. Models with fewer than 10 billion parameters ran on a laptop, larger models ran on a dedicated local server, and the cloud-based LLM (GPT-4o) was accessed through its cloud service. For practical deployment, gemma-2-9b-it (<10 billion parameters) was chosen as the primary local model; we additionally tested Qwen2.5-7B-Instruct and the larger gemma-2-27b-it to assess the effects of model choice and scale. For gemma-2-9b-it, the framework raised the median cosine similarity to expert reference answers from 0.63 to 0.71, an absolute increase of 0.08 (a relative gain of 12.70%; Wilcoxon signed-rank W = 14.0, two-sided p = 1.34 × 10⁻²), and improved the median expert 5-point rating from 2 to 3, an absolute increase of 1 point (a relative gain of 50.00%; Wilcoxon signed-rank W = 3.5, two-sided p = 7.00 × 10⁻³). For reasoning-type questions, incomplete context retrieved by MDSK-RAG sometimes disrupted the model’s reasoning and led to incorrect conclusions, indicating room for improvement. Comparable, statistically significant gains in cosine similarity to the expert reference answers were observed for the other local models (Qwen2.5-7B-Instruct and gemma-2-27b-it) with versus without the framework. Moreover, gemma-2-9b-it with the framework outperformed the cloud-based GPT-4o. In this case study, the framework effectively incorporated practical experimental knowledge and theoretical literature into local LLM responses, improving accuracy on domain-specific queries. The framework presented here offers a practical and extensible route to adapting local LLMs to domain-specific scientific research.
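The dual-source flow described in the abstract (CSV rows rendered as template text, per-source retrieval, per-source summarization, and a merged final prompt) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the helper names (to_template_text, retrieve, local_llm), the prompt wording, and the toy bag-of-words retriever that stands in for whichever retriever and local model the paper actually uses.

import math, re
from collections import Counter

def to_template_text(row):
    # Hypothetical template: one sentence per experimental CSV record.
    return (f"Sample {row['id']}: catalyst {row['catalyst']} prepared at "
            f"{row['temp_C']} degC gave a rate of {row['rate']} umol/h.")

def bow(text):
    # Toy bag-of-words vector; a real system would use dense embeddings.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cos(u, v):
    dot = sum(c * v[t] for t, c in u.items())
    n = (math.sqrt(sum(c * c for c in u.values()))
         * math.sqrt(sum(c * c for c in v.values())))
    return dot / n if n else 0.0

def retrieve(query, passages, k=3):
    q = bow(query)
    return sorted(passages, key=lambda p: cos(q, bow(p)), reverse=True)[:k]

def local_llm(prompt):
    # Placeholder for an offline call to a local model such as gemma-2-9b-it.
    raise NotImplementedError

def mdsk_rag_answer(query, csv_rows, pdf_passages):
    practical = retrieve(query, [to_template_text(r) for r in csv_rows])
    theoretical = retrieve(query, pdf_passages)
    s_exp = local_llm(f"Summarize for the query '{query}':\n" + "\n".join(practical))
    s_lit = local_llm(f"Summarize for the query '{query}':\n" + "\n".join(theoretical))
    # Merge both summaries with the user query before the final generation step.
    return local_llm(f"Question: {query}\nExperimental summary: {s_exp}\n"
                     f"Literature summary: {s_lit}\nAnswer using both sources.")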
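The paired evaluation can be sketched the same way: embed each model answer and its expert reference, take the cosine similarity, and compare the with- and without-framework scores over the 14 questions with a two-sided Wilcoxon signed-rank test (scipy.stats.wilcoxon). The embedding step is left abstract here, since the abstract does not name the embedding model.

import numpy as np
from scipy.stats import wilcoxon

def cosine_sim(a, b):
    # Cosine similarity between two answer-embedding vectors.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def paired_wilcoxon(sims_without, sims_with):
    # sims_without[i] / sims_with[i]: cosine similarity of the i-th answer
    # (without / with MDSK-RAG) to its expert reference answer.
    return wilcoxon(sims_without, sims_with, alternative="two-sided")

On the paper's data, this comparison gave W = 14.0 and p = 1.34 × 10⁻² for gemma-2-9b-it over the 14 benchmark questions, as reported above.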
PMID:41325550 | DOI:10.1021/acs.jcim.5c01941