Bioinformatics. 2025 Jan 29:btaf034. doi: 10.1093/bioinformatics/btaf034. Online ahead of print.
ABSTRACT
MOTIVATION: Accurate prediction of O-GlcNAcylation sites is crucial for understanding disease mechanisms and developing effective treatments. Previous machine learning models relied primarily on primary or secondary protein structure and related properties, which are limited in their ability to capture the spatial interactions of neighboring amino acids. This study introduces local environmental features, a novel approach that incorporates three-dimensional spatial information and significantly improves model performance by considering the spatial context around the target site. Additionally, we use sparse recurrent neural networks both to capture the sequential nature of proteins and, as an explainable machine learning model, to identify key factors influencing O-GlcNAcylation.
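To illustrate why three-dimensional context can add information beyond sequence position, the following is a minimal sketch, not the authors' feature definition. It assumes one representative coordinate per residue and a hypothetical 10 Å cutoff, and the function name `local_environment` is introduced here for illustration only. It counts the residue types that lie within the cutoff of a target site.

```python
import math
from collections import Counter


def local_environment(residues, coords, target, radius=10.0):
    """Count residue types within `radius` of the target residue.

    Assumption: `coords` holds one (x, y, z) point per residue, and the
    cutoff is an illustrative value, not one taken from the paper.
    """
    center = coords[target]
    counts = Counter()
    for i, (res, xyz) in enumerate(zip(residues, coords)):
        # Skip the target itself; keep anything inside the sphere.
        if i != target and math.dist(center, xyz) <= radius:
            counts[res] += 1
    return counts


# Toy chain: residue 4 is far from the target in sequence but close in space,
# while residue 3 is closer in sequence but far away in space.
residues = ["S", "K", "P", "T", "K"]
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (30.0, 0.0, 0.0),
    (0.0, 5.0, 0.0),
]
env = local_environment(residues, coords, target=0, radius=10.0)
```

In this toy case the spatial neighborhood includes the distant-in-sequence lysine and excludes the threonine, which a fixed sequence window around the site would not distinguish in the same way.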
RESULTS: Our findings demonstrate the effectiveness of the proposed features, with the model achieving an F1 score of 28.3%. They also demonstrate its feature selection capability: the model using only the top 20% of features achieved the highest F1 score, 32.02%, a 1.4-fold improvement over existing post-translational modification (PTM) models. Statistical analysis of the top 20 features confirmed their consistency with the literature. This method not only boosts prediction accuracy but also paves the way for further research into understanding and targeting O-GlcNAcylation.
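For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for binary labels and shows one simple way to keep the top 20% of features by an importance score. The feature names and importance values are hypothetical, and the ranking rule is an assumption for illustration, not the selection procedure used in the study.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks a modified site."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are both zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def top_fraction(importances, fraction=0.2):
    """Return the names of the highest-scoring `fraction` of features."""
    k = max(1, round(len(importances) * fraction))
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_score(y_true, y_pred)

# Hypothetical importance scores for ten placeholder features.
importances = {
    "f0": 0.02, "f1": 0.10, "f2": 0.05, "f3": 0.40, "f4": 0.01,
    "f5": 0.03, "f6": 0.08, "f7": 0.25, "f8": 0.04, "f9": 0.02,
}
selected = top_fraction(importances, fraction=0.2)
```

With ten features and a fraction of 0.2, two features are retained; a model would then be retrained on that subset and its F1 score compared with the full-feature model.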
AVAILABILITY: All code, data, and features used in this study are available in the GitHub repository: https://github.com/pseokyoung/o-glcnac-prediction.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PMID:39878910 | DOI:10.1093/bioinformatics/btaf034