Bioinformatics. 2024 Dec 23:btae744. doi: 10.1093/bioinformatics/btae744. Online ahead of print.
ABSTRACT
MOTIVATION: Nanopore sequencing represents a significant advancement in genomics, enabling direct long-read DNA sequencing at the single-molecule level. Accurate simulation of nanopore sequencing signals from nucleotide sequences is crucial for method development and for complementing experimental data. Most existing approaches rely on predefined statistical models, which may not adequately capture the properties of experimental signal data. Furthermore, these simulators were developed for earlier versions of nanopore chemistry, which limits their applicability and adaptability to the latest flow cell data.
RESULTS: To enhance the quality of artificial signals, we introduce seq2squiggle, a novel transformer-based, non-autoregressive model designed to generate nanopore sequencing signals from nucleotide sequences. Unlike existing simulators that rely on static k-mer models, our approach learns sequential contextual information from segmented signal data. We benchmark seq2squiggle against state-of-the-art simulators on real experimental R9.4.1 and R10.4.1 data, evaluating signal similarity, basecalling accuracy, and variant detection rates. Seq2squiggle consistently outperforms existing tools across multiple datasets, demonstrating superior similarity to real data and offering a robust solution for simulating nanopore sequencing signals with the latest flow cell generation.
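ILLUSTRATION: The static k-mer approach that the abstract contrasts with seq2squiggle can be sketched roughly as follows. This is a minimal toy, not the method of seq2squiggle or of any specific existing simulator: the table values are random placeholders, and the names build_toy_kmer_table and simulate_signal, the k-mer length of 3, and the fixed number of samples per k-mer are all assumptions made for illustration. It shows why such a model is "static": each k-mer always maps to the same current distribution, regardless of the surrounding sequence.

```python
import random
from itertools import product

K = 3  # assumed toy k-mer length; real pore models use longer k-mers


def build_toy_kmer_table(k=K, seed=0):
    """Hypothetical lookup table: k-mer -> (mean current, std dev).

    The values are random placeholders, not measured pore-model parameters.
    """
    rng = random.Random(seed)
    return {
        "".join(p): (rng.uniform(60.0, 120.0), rng.uniform(1.0, 3.0))
        for p in product("ACGT", repeat=k)
    }


def simulate_signal(seq, table, k=K, samples_per_kmer=10, seed=0):
    """Emit Gaussian samples for each k-mer in seq using the static table.

    A fixed dwell time (samples_per_kmer) is a simplifying assumption.
    """
    rng = random.Random(seed)
    signal = []
    for i in range(len(seq) - k + 1):
        mean, std = table[seq[i:i + k]]
        # The same k-mer always draws from the same distribution,
        # no matter which bases surround it.
        signal.extend(rng.gauss(mean, std) for _ in range(samples_per_kmer))
    return signal


table = build_toy_kmer_table()
signal = simulate_signal("ACGTACGTAC", table)
```

In this sketch, the k-mer ACG occurs twice in the input and is sampled from an identical distribution both times. A model that learns sequential context, as the abstract describes, is not bound to that fixed mapping.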
AVAILABILITY: seq2squiggle is freely available on GitHub at github.com/ZKI-PH-ImageAnalysis/seq2squiggle.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
PMID:39710838 | DOI:10.1093/bioinformatics/btae744