Genome Biol. 2022 Aug 1;23(1):166. doi: 10.1186/s13059-022-02722-x.
BACKGROUND: Individual and environmental health outcomes are frequently linked to changes in the diversity of associated microbial communities. Thus, deriving health indicators based on microbiome diversity measures is essential. While microbiome data generated using high-throughput 16S rRNA marker gene surveys are appealing for this purpose, 16S surveys also generate a plethora of spurious microbial taxa.
RESULTS: When this artificial inflation in the observed number of taxa is ignored, we find that changes in the abundance of detected taxa confound current methods for inferring differences in richness. Experimental evidence, theory-guided exploratory data analyses, and existing literature support the conclusion that most sub-genus discoveries are spurious artifacts of clustering 16S sequencing reads. We proceed to model a 16S survey’s systematic patterns of sub-genus taxa generation as a function of genus abundance to derive a robust control for false taxa accumulation. These controls unlock classical regression approaches for highly flexible differential richness inference at various levels of the surveyed microbial assemblage: from sample groups to specific taxa collections. The proposed methodology for differential richness inference is available through an R package, Prokounter.
CONCLUSIONS: False species discoveries bias richness estimation and confound differential richness inference. In the case of 16S microbiome surveys, supporting evidence indicate that most sub-genus taxa are spurious. Based on this finding, a flexible method is proposed and is shown to overcome the confounding problem noted with current approaches for differential richness inference. Package availability: https://github.com/mskb01/prokounter.