J Gastrointestin Liver Dis. 2026 Jun 27;35(2):181-188. doi: 10.15403/jgld-6740.
ABSTRACT
BACKGROUND AND AIMS: Inflammatory bowel disease (IBD) case reports provide rich longitudinal insights but have rarely been analyzed using quantitative text-mining approaches. This study applied unsupervised machine learning to PubMed-indexed IBD case reports to identify long-term thematic structures spanning 60 years and evaluate whether major historical milestones in IBD care can be reconstructed from biomedical texts.
METHODS: Case reports indexed under the keyword “inflammatory bowel disease” were retrieved from PubMed (1960-2025). Titles, keywords, and abstracts were concatenated and preprocessed before term frequency-inverse document frequency (TF-IDF) vectorization. Non-negative matrix factorization (NMF) was applied to extract latent topics, followed by KMeans clustering using the optimal topic number selected by silhouette evaluation (2-15 topics). Cluster characteristics were summarized using report counts and TF-IDF statistics. Top discriminative keywords were used to assign data-driven topic labels. All analyses were performed in Python 3.10.5 (PyCharm 2022.1.3) using pandas, numpy, scikit-learn, matplotlib, and seaborn.
RESULTS: A total of 18,458 case reports were analyzed. Across all time periods, two highly stable clusters consistently emerged, corresponding to Crohn’s disease and ulcerative colitis. Early decades (1960-1989) emphasized pathology- and complication-focused descriptions. Reports from the 1990s showed increasing terminology related to diagnosis and emerging therapies. From 2000 onward, infliximab-related and treatment-focused terms predominated, paralleling the rise of biologics. After 2010, clusters reflected diversified therapeutic strategies, including attention to extraintestinal manifestations and biologic or small-molecule therapies.
CONCLUSIONS: Unsupervised machine learning successfully reconstructed important historical changes in IBD management, demonstrating that a large case report text corpus captures the evolution of clinical concepts and treatment paradigms over 60 years.
PMID:42365648 | DOI:10.15403/jgld-6740