BMC Bioinformatics. 2023 May 2;24(1):180. doi: 10.1186/s12859-023-05305-0.
ABSTRACT
BACKGROUND: Large-scale multi-ethnic DNA sequencing data is increasingly available owing to decreasing cost of modern sequencing technologies. Inference of the population structure with such sequencing data is fundamentally important. However, the ultra-dimensionality and complicated linkage disequilibrium patterns across the whole genome make it challenging to infer population structure using traditional principal component analysis based methods and software.
RESULTS: We present the ERStruct Python Package, which enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, our package achieves significant improvements in the speed of matrix operations for large-scale data. Additionally, our package features adaptive data splitting capabilities to facilitate computation on GPUs with limited memory.
CONCLUSION: Our Python package ERStruct is an efficient and user-friendly tool for estimating the number of top informative principal components that capture population structure from whole genome sequencing data.
PMID:37131141 | DOI:10.1186/s12859-023-05305-0