Improving Protein Secondary Structure Prediction by Deep Language Models and Transformer Networks

Methods Mol Biol. 2025;2867:43-53. doi: 10.1007/978-1-0716-4196-5_3.

ABSTRACT

Protein secondary structure prediction is useful for many applications. It can be considered a language translation problem, that is, translating a sequence of 20 different amino acids into a sequence of secondary structure symbols (e.g., alpha helix, beta strand, and coil). Here, we develop a novel protein secondary structure predictor called TransPross, based on the transformer network and attention mechanism widely used in natural language processing, to directly extract the evolutionary information from the protein language (i.e., the raw multiple sequence alignment [MSA] of a protein) to predict the secondary structure. The method differs from traditional methods, which first generate an MSA and then calculate expert-curated statistical profiles from the MSA as input. The attention mechanism used by TransPross can effectively capture long-range residue-residue interactions in protein sequences to predict secondary structures. Benchmarked on several datasets, TransPross outperforms the state-of-the-art methods. Moreover, our experiment shows that the prediction accuracy of TransPross positively correlates with the depth of MSAs, and that it achieves an average prediction accuracy (i.e., Q3 score) above 80% for hard targets with few homologous sequences in their MSAs. TransPross is freely available at https://github.com/BioinfoMachineLearning/TransPro .
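
The Q3 score mentioned above is conventionally the percentage of residues whose three-state label (helix, strand, or coil) is predicted correctly. A minimal sketch of that calculation follows, assuming one-letter labels H, E, and C and equal-length strings. The function name `q3_score` and the example sequences are hypothetical and are not taken from the TransPross code.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of positions where the predicted three-state
    label matches the observed one.

    Assumption: both strings use one character per residue, e.g.
    H (helix), E (strand), C (coil), and are aligned position by position.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    # Count positions where the two label strings agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical example: 17 residues, two of which are mislabeled.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(f"Q3 = {q3_score(predicted, observed):.1f}%")
```

A per-dataset Q3 is usually reported either as the mean of per-protein scores or as the fraction of correct residues pooled over all proteins. The abstract does not state which convention is used, so this sketch covers only the single-protein case.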

PMID:39576574 | DOI:10.1007/978-1-0716-4196-5_3

By Nevin Manimala