
ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data

BMC Bioinformatics. 2023 May 2;24(1):180. doi: 10.1186/s12859-023-05305-0.

ABSTRACT

BACKGROUND: Large-scale multi-ethnic DNA sequencing data are increasingly available owing to the decreasing cost of modern sequencing technologies. Inferring population structure from such sequencing data is fundamentally important. However, the ultra-dimensionality and complicated linkage disequilibrium patterns across the whole genome make it challenging to infer population structure using traditional methods and software based on principal component analysis.

RESULTS: We present the ERStruct Python package, which enables the inference of population structure from whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, the package substantially speeds up matrix operations on large-scale data. It also provides adaptive data splitting, so that computation remains feasible on GPUs with limited memory.
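
To make the data-splitting idea in the RESULTS paragraph concrete, here is a minimal sketch. It is not ERStruct's implementation or API. It assumes a samples-by-markers genotype matrix and runs on the CPU with plain NumPy; the names `chunk_width`, `gram_matrix_in_chunks`, and `memory_budget_bytes` are hypothetical and chosen only for illustration. The sketch picks a block width from a memory budget and accumulates the centered Gram matrix one block of markers at a time, so only one block has to be resident in (device) memory at once.

```python
import numpy as np


def chunk_width(n_samples, memory_budget_bytes, itemsize=8):
    """Hypothetical helper: number of marker columns per block such that an
    n_samples x width float64 block fits in the given memory budget."""
    width = memory_budget_bytes // (n_samples * itemsize)
    return max(1, int(width))  # always process at least one column


def gram_matrix_in_chunks(genotypes, memory_budget_bytes):
    """Hypothetical helper: accumulate G = Xc @ Xc.T over column blocks,
    where Xc is the genotype matrix with each marker column mean-centered."""
    n_samples, n_markers = genotypes.shape
    width = chunk_width(n_samples, memory_budget_bytes)
    gram = np.zeros((n_samples, n_samples), dtype=np.float64)
    for start in range(0, n_markers, width):
        # np.array copies, so the caller's matrix is never modified.
        block = np.array(genotypes[:, start:start + width], dtype=np.float64)
        # Column means depend only on that column, so per-block centering
        # matches centering the full matrix.
        block -= block.mean(axis=0, keepdims=True)
        gram += block @ block.T
    return gram


# Toy usage on simulated genotype counts (0, 1, or 2 per marker).
rng = np.random.default_rng(0)
toy_genotypes = rng.integers(0, 3, size=(20, 1000))
toy_gram = gram_matrix_in_chunks(toy_genotypes, memory_budget_bytes=20 * 8 * 64)
# Eigenvalues of the Gram matrix, largest first, as input to any
# downstream rule for choosing the number of principal components.
toy_eigenvalues = np.linalg.eigvalsh(toy_gram)[::-1]
```

Because the Gram matrix has size samples by samples, its memory footprint does not grow with the number of markers; only the block width has to adapt to the available memory. A GPU array library could follow the same accumulation pattern, but that is an assumption and is not shown here.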

CONCLUSION: Our Python package ERStruct is an efficient and user-friendly tool for estimating the number of top informative principal components that capture population structure from whole genome sequencing data.

PMID:37131141 | DOI:10.1186/s12859-023-05305-0

By Nevin Manimala
