A machine-learning approach for nonalcoholic steatohepatitis susceptibility estimation

Indian J Gastroenterol. 2022 Nov 11. doi: 10.1007/s12664-022-01263-2. Online ahead of print.

ABSTRACT

BACKGROUND: Nonalcoholic steatohepatitis (NASH), a severe form of nonalcoholic fatty liver disease, can lead to advanced liver damage and has become an increasingly prominent health problem worldwide. Predictive models for early identification of high-risk individuals could help guide preventive and interventional measures. Traditional epidemiological models, which are based on statistical analysis, have limited predictive power. In the current study, a novel machine-learning approach was developed for predicting individual NASH susceptibility using candidate single nucleotide polymorphisms (SNPs).

METHODS: A total of 245 NASH patients and 120 healthy individuals were included in the study. Genotypes of SNPs in candidate genes, namely two SNPs in the cytochrome P450 family 2 subfamily E member 1 (CYP2E1) gene (rs6413432, rs3813867), two SNPs in the glucokinase regulator (GCKR) gene (rs780094, rs1260326), and the rs738409 SNP in patatin-like phospholipase domain-containing 3 (PNPLA3), together with gender, were used to develop models for identifying at-risk individuals. To predict an individual’s susceptibility to NASH, nine machine-learning models were constructed. These models combined three classification algorithms, k-nearest neighbor (KNN), multi-layer perceptron (MLP), and random forest (RF), with two feature-selection methods, Chi-square and support vector machine recursive feature elimination (SVM-RFE), or with all features as input. All nine models were trained on 80% of the data from both the NASH patients and the healthy controls and then tested on the remaining 20% of both groups. The models’ performance was compared in terms of accuracy, precision, sensitivity, and F measure.
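
The model grid described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, random synthetic genotypes coded 0/1/2 in place of the study data, a hypothetical choice of k = 3 selected features, and default-like hyperparameters that the abstract does not report. The third input setting is assumed to be "all features", as the results suggest. Scores on random data carry no meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the study data (assumption): 5 SNPs coded as
# 0/1/2 copies of an allele, plus gender coded 0/1.
n_cases, n_controls = 245, 120
n_total = n_cases + n_controls
X = np.column_stack([
    rng.integers(0, 3, size=(n_total, 5)),
    rng.integers(0, 2, size=n_total),
])
y = np.array([1] * n_cases + [0] * n_controls)  # 1 = NASH, 0 = control

# 80/20 split, stratified so both groups are split in the same proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

K = 3  # hypothetical number of features to keep; not reported in the abstract

# Factories so that every pipeline gets fresh, unfitted components.
selectors = {
    "all": lambda: "passthrough",
    "chi2": lambda: SelectKBest(chi2, k=K),
    "svm_rfe": lambda: RFE(SVC(kernel="linear"), n_features_to_select=K),
}
classifiers = {
    "knn": lambda: KNeighborsClassifier(n_neighbors=5),
    "mlp": lambda: MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "rf": lambda: RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for sel_name, make_selector in selectors.items():
    for clf_name, make_classifier in classifiers.items():
        pipe = Pipeline([("select", make_selector()), ("clf", make_classifier())])
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        results[(sel_name, clf_name)] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "sensitivity": recall_score(y_test, y_pred, zero_division=0),
            "f_measure": f1_score(y_test, y_pred, zero_division=0),
        }

for (sel_name, clf_name), scores in results.items():
    print(sel_name, clf_name, {k: round(v, 2) for k, v in scores.items()})
```

Wrapping the selector and the classifier in one pipeline means that feature selection is fitted on the training portion only, so the held-out 20% does not influence which features are kept.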

RESULTS: Among the nine machine-learning models, the KNN classifier with all features as input showed the highest performance, with an F measure of 86% and an accuracy of 79%.
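
As context for reading an F measure alongside an accuracy figure, the sketch below computes the four reported metrics from confusion-matrix counts. It assumes NASH is the positive class and that the F measure is the usual harmonic mean of precision and sensitivity; the abstract does not state either. The counts passed in are not the study's results: they describe a hypothetical trivial classifier that labels every one of the 245 patients and 120 controls as NASH, which shows how the F measure can exceed accuracy when cases outnumber controls.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, sensitivity and F measure from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F measure = 2 * precision * sensitivity / (precision + sensitivity),
    # written here in its equivalent count form.
    denom = 2 * tp + fp + fn
    f_measure = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Hypothetical "everyone is NASH" baseline on the study's group sizes.
baseline = binary_metrics(tp=245, fp=120, fn=0, tn=0)
print({name: round(value, 3) for name, value in baseline.items()})
```

Under these assumptions the trivial baseline reaches an accuracy of about 0.67 and an F measure of about 0.80, which is a useful reference point when comparing models on imbalanced groups.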

CONCLUSIONS: Machine learning based on genomic variation may be applicable for estimating an individual’s susceptibility to developing NASH among high-risk groups with a high degree of accuracy, precision, and sensitivity.

PMID:36367682 | DOI:10.1007/s12664-022-01263-2

By Nevin Manimala