Paired-Sample and Pathway-Anchored MLOps Framework for Robust Transcriptomic Machine Learning in Small Cohorts: Model Classification Study

JMIR Bioinform Biotechnol. 2025 Oct 8;6:e80735. doi: 10.2196/80735.

ABSTRACT

BACKGROUND: Approximately 90% of the 65,000 human diseases are infrequent, collectively affecting ~400 million people, substantially limiting cohort accrual. This low prevalence constrains the development of robust transcriptome-based machine learning (ML) classifiers. Standard data-driven classifiers typically require cohorts of more than 100 participants per group to achieve clinical accuracy while managing high-dimensional input (~25,000 transcripts). These requirements are infeasible for microcohorts of ~20 individuals, where overfitting becomes pervasive.

OBJECTIVE: To overcome these constraints, we developed a classification method that integrates three enabling strategies: (i) paired-sample transcriptome dynamics, (ii) N-of-1 pathway-based analytics, and (iii) reproducible machine learning operations (MLOps) for continuous model refinement.

METHODS: Unlike ML approaches that rely on a single transcriptome per subject, within-subject paired-sample designs (such as pre- versus post-treatment or diseased versus adjacent-normal tissue) effectively control for intraindividual variability under isogenic conditions and for within-subject environmental exposures (eg, smoking history or other medications), improve signal-to-noise ratios, and, when preprocessed as single-subject studies (N-of-1), can achieve statistical power comparable with that obtained in animal models. Pathway-level N-of-1 analytics further reduce each sample's high-dimensional profile to ~4000 biologically interpretable features, annotated with effect sizes, dispersion, and significance. Complementary MLOps practices (automated versioning, continuous monitoring, and adaptive hyperparameter tuning) improve model reproducibility and generalization.
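To make the paired-sample and pathway-level reduction concrete, the following is a minimal sketch and not the authors' N-of-1 pathway method. It assumes each subject contributes two expression profiles (baseline and follow-up), computes a per-gene log2 fold change, and summarizes each gene set by a mean effect size and a dispersion. The function names, the pseudocount, the `min_genes` cutoff, and the toy gene and pathway identifiers are all hypothetical; significance testing is omitted.

```python
import math
import statistics


def paired_log2_fold_changes(baseline, followup, pseudocount=1.0):
    """Per-gene log2 fold change between two samples from the same subject.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return {
        gene: math.log2((followup[gene] + pseudocount) / (baseline[gene] + pseudocount))
        for gene in baseline
        if gene in followup
    }


def pathway_features(log_fc, gene_sets, min_genes=2):
    """Collapse gene-level changes into per-pathway effect size and dispersion."""
    features = {}
    for pathway, genes in gene_sets.items():
        values = [log_fc[g] for g in genes if g in log_fc]
        if len(values) < min_genes:
            continue  # too few measured genes to summarize this pathway
        features[pathway] = {
            "effect_size": statistics.mean(values),
            "dispersion": statistics.stdev(values),
            "n_genes": len(values),
        }
    return features


# Toy, made-up expression values for one hypothetical subject
baseline = {"G1": 3.0, "G2": 7.0, "G3": 15.0, "G4": 1.0}
followup = {"G1": 7.0, "G2": 15.0, "G3": 15.0, "G4": 0.0}
gene_sets = {"pathway_A": ["G1", "G2"], "pathway_B": ["G3", "G4"], "pathway_C": ["G9"]}

log_fc = paired_log2_fold_changes(baseline, followup)
subject_features = pathway_features(log_fc, gene_sets)
```

In this sketch each subject yields one feature vector indexed by pathway rather than by transcript, which is the dimensionality reduction the abstract describes (from ~25,000 transcripts to ~4000 pathway features).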

RESULTS: In two case studies of distinct diseases (human rhinovirus [HRV] infection versus matched healthy controls [n=16 training; n=3 test] and breast cancer tissues harboring TP53 or PIK3CA mutations versus adjacent normal tissue [n=27 training; n=9 test]), this approach achieved 90% precision and recall on an unseen breast cancer test set and 92% precision with 90% recall in rhinovirus fivefold cross-validation. Incorporating paired-sample dynamics boosted precision by up to 12% and recall by 13% in breast cancer and by 5% each in HRV. MLOps workflows yielded an additional ~14.5% accuracy improvement compared with traditional pipelines. Moreover, our method identified 42 critical gene sets (pathways) for rhinovirus response and 21 for breast cancer mutation status, selected as the most important features (by mean decrease in impurity) of the best-performing model; retroactive ablation of the top 20 features reduced accuracy by ~25%.
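For readers less familiar with the reported metrics, here is a small sketch of how precision, recall, and a fivefold split can be computed. It is illustrative only: the labels and predictions are made up, the helper names are hypothetical, and the fold assignment is a simple deterministic round-robin within each class rather than whatever cross-validation procedure the study used.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, dealing out each class round-robin."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


# Made-up labels and predictions, not data from the study
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
folds = stratified_folds(y_true, k=5)
```

With cohorts this small, each held-out fold contains only a few samples, so per-fold metrics move in large steps; pooling predictions across folds before computing precision and recall is one common way to stabilize the estimate.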

CONCLUSIONS: These proof-of-concept results suggest that integrating intrasubject dynamics, pathway-level feature reduction grounded in prior biological knowledge (eg, N-of-1 pathway analytics), and reproducible MLOps workflows can overcome cohort size limitations in infrequent diseases, offering a scalable, interpretable solution for high-dimensional transcriptomic classification. Future work will extend these advances across various therapeutic and small-cohort designs.

PMID:41342203 | DOI:10.2196/80735

By Nevin Manimala
