Bioinformatics. 2026 Jan 14:btag014. doi: 10.1093/bioinformatics/btag014. Online ahead of print.
ABSTRACT
MOTIVATION: Sequence motif identification is crucial for understanding molecular recognition, particularly in immune responses involving peptide binding to MHC class I molecules for antigen presentation to T cells. Traditionally, MHC class I binding motifs are assumed to be contiguous and span nine amino acids. However, structural evidence suggests that binding may involve non-adjacent residues, challenging the assumptions of existing methods.
RESULTS: We propose GAMMA (Gap-Aware Motif Mining Algorithm), a probabilistic framework for identifying non-contiguous motifs under incomplete labeling. GAMMA uses Bayesian inference with Markov chain Monte Carlo (MCMC) sampling to jointly estimate motif parameters, binding locations, and the relative spacing between binding positions. In extensive simulations and in applications to real MHC class I peptide datasets, GAMMA outperforms existing motif discovery tools such as GLAM2 at localizing binding residues and identifying the underlying motifs. Notably, our results suggest that the true number of binding residues may be eight, fewer than the commonly assumed nine. In addition, for longer peptides, the model captures increased flexibility in the central region, consistent with structural observations that peptides may bulge in the middle.
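As an illustration of what jointly inferring binding location and spacing can look like, the following minimal Python sketch computes a posterior over placements of a two-block gapped motif on a single peptide. It is not the GAMMA implementation: the two-block layout, the uniform prior over placements, the uniform background, and the names `peaked_column`, `placement_posterior`, `left_len`, and `max_gap` are all assumptions made for this example. In a sampler of the kind described above, a conditional distribution like this one could be used to draw the latent placement for each peptide while the motif parameters are updated in separate steps.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peaked_column(preferred, peak=0.81):
    """Hypothetical motif column: probability `peak` on one residue, rest spread evenly."""
    other = (1.0 - peak) / (len(AMINO_ACIDS) - 1)
    return {aa: (peak if aa == preferred else other) for aa in AMINO_ACIDS}


def placement_posterior(peptide, pwm, left_len, max_gap):
    """Posterior over (start, gap) placements of a two-block motif.

    pwm      : list of per-position residue distributions (one dict per binding position)
    left_len : number of binding positions before the gap (assumed layout)
    max_gap  : largest number of unbound residues allowed between the two blocks
    Assumes a uniform prior over valid placements and a uniform background.
    """
    right_len = len(pwm) - left_len
    background = 1.0 / len(AMINO_ACIDS)
    log_scores = {}
    for start, gap in product(range(len(peptide)), range(max_gap + 1)):
        end = start + left_len + gap + right_len
        if end > len(peptide):
            continue  # placement would run past the end of the peptide
        bound = list(range(start, start + left_len)) + list(
            range(start + left_len + gap, end)
        )
        # The log-likelihood ratio against background is proportional to the
        # full likelihood, because unbound residues contribute the same
        # background factor under every placement.
        log_scores[(start, gap)] = sum(
            math.log(pwm[j][peptide[p]] / background) for j, p in enumerate(bound)
        )
    if not log_scores:
        return {}
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {key: math.exp(val - top) for key, val in log_scores.items()}
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


# Toy motif with four binding positions: "AL", then a gap, then "KV".
toy_pwm = [peaked_column(aa) for aa in "ALKV"]
posterior = placement_posterior("ALGGKV", toy_pwm, left_len=2, max_gap=3)
best = max(posterior, key=posterior.get)
print(best)  # (0, 2): start at the first residue, two unbound residues in the gap
```

The toy peptide has six valid placements, and the posterior mass concentrates on the one that skips the two central residues, which is the kind of non-contiguous binding pattern a contiguous nine-residue model cannot represent.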
AVAILABILITY: The raw data and source code are available on GitHub (https://github.com/RanLIUaca/GAMMAmotif).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PMID:41537246 | DOI:10.1093/bioinformatics/btag014