SARITA: a large language model for generating the S1 subunit of the SARS-CoV-2 spike protein

Brief Bioinform. 2025 Jul 2;26(4):bbaf384. doi: 10.1093/bib/bbaf384.

ABSTRACT

BACKGROUND: The COVID-19 pandemic caused over 776 million infections and 7 million deaths globally between December 2019 and November 2024. Since the emergence of the original Wuhan strain, SARS-CoV-2 has evolved into multiple variants (including Alpha, Delta, and Omicron), primarily through mutations in the Spike glycoprotein. The S1 subunit, which binds the human angiotensin-converting enzyme 2 (ACE2) receptor, mutates frequently and plays a key role in infectivity and immune escape, while the more conserved S2 subunit mediates membrane fusion. Anticipating future mutations is essential for guiding vaccine design and therapeutic strategies. Generative Large Language Models (LLMs) have shown promise in protein sequence modeling due to their capacity to produce realistic and functional synthetic sequences. Here, we introduce SARITA, a GPT-3-based LLM with up to 1.2 billion parameters, obtained by fine-tuning the protein model RITA via continual learning on 107 017 high-quality SARS-CoV-2 Spike sequences (collected up to March 1st 2021) to generate high-quality synthetic SARS-CoV-2 Spike S1 subunits.

RESULTS: SARITA generates realistic, full-length synthetic S1 subunits starting from a 14-amino-acid prompt. When evaluated on unseen sequences collected between March 2021 and November 2023, including major Variants of Concern (VOCs) such as Delta and Omicron and Variants of Interest such as Iota, SARITA outperforms baseline and state-of-the-art LLMs in terms of sequence quality, biological plausibility, and similarity to real-world viral evolution. SARITA generates high-quality sequences in over 97% of cases, with a markedly lower False Mutation Rate and higher similarity scores (PAM30, Levenshtein distance) than alternative approaches. It also accurately reproduces key mutations characteristic of future variants, such as L212I, R158L, T95P, and E406K, which were not present in the training data but emerged later in VOCs like Omicron and Delta. Structure-based analysis confirms the functional plausibility of these substitutions, with ΔΔG values within experimentally supported thresholds for ACE2 and antibody binding. Furthermore, SARITA anticipates immune-evasive mutations and accurately captures the positional and statistical distribution of mutations found in variants that emerged after March 1st 2021, highlighting its potential as a predictive tool for viral evolution.
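
As an illustration of two quantities mentioned above, the following minimal Python sketch computes a Levenshtein distance between two sequences and lists point substitutions in the notation used for mutations such as L212I (reference residue, position, new residue). It is not the authors' evaluation pipeline: the function names, the 1-based `offset` parameter, and the toy sequences are assumptions made for this example, and the substitution helper assumes the two sequences are already aligned and of equal length.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def substitutions(reference: str, candidate: str, offset: int = 1) -> list[str]:
    """List point substitutions as '<ref residue><position><new residue>'.

    Assumption: both sequences are pre-aligned and equal in length, so
    insertions and deletions are out of scope for this helper.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref}{pos}{new}"
        for pos, (ref, new) in enumerate(zip(reference, candidate), start=offset)
        if ref != new
    ]


if __name__ == "__main__":
    # Toy sequences for illustration only, not real Spike S1 fragments.
    reference = "MKTLLVAG"
    generated = "MKTILVAG"
    print(levenshtein(reference, generated))
    print(substitutions(reference, generated))
```

Real S1 comparisons would normally require a proper alignment step first, since generated sequences can contain insertions and deletions that shift residue positions.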

CONCLUSION: These results indicate SARITA's potential to predict future SARS-CoV-2 S1 evolution, which could aid the development of adaptable vaccines and treatments.

PMID:40755284 | DOI:10.1093/bib/bbaf384

By Nevin Manimala