The strain-level genetic complexity of the gut microbiota poses two major challenges for analyzing microbiome sequencing datasets: high dimensionality, in which the number of variables (bacterial strains or genes) far exceeds the number of samples; and high sparsity, in which most variables are zero in most samples. Current data-mining methods in the microbiome field rely on taxa or functional gene groups such as pathways to collapse variables and reduce dimensionality. This strategy has two fundamental flaws: 1) it requires prior knowledge. Novel sequences with no close neighbors in reference databases are labeled taxonomically unclassified or functionally unknown and are excluded from further analysis; 2) members of the same taxon, or genes in the same pathway, do not all behave in the same way. Lumping them together produces spurious variables that cannot serve as robust microbiome signatures of health or disease. Ecologically, gut bacteria do not exist in isolation but as functional groups called “guilds”: groups of ecosystem members that exploit the same class of resources in a similar way. Members of the same guild may come from widely different taxa but show co-abundant or co-occurring behavior. We propose using “guilds of bacterial strains” as the aggregation method for reducing dimensionality and sparsity in microbiome-wide association studies, with the aim of identifying key functional gut bacteria that may causally contribute to human health and disease.
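To make the idea of guild-based aggregation concrete, here is a minimal sketch in Python. It is not the method used in the talk: the abstract does not specify a correlation measure, a threshold, or a clustering procedure, so everything below is an assumption chosen for illustration. Strains are linked when the Pearson correlation of their abundances across samples is at least a hypothetical cutoff of 0.8, linked strains are merged into guilds, and each guild's abundance is the sum of its members. The strain names and counts are invented toy data.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no co-abundance signal
    return cov / (vx * vy) ** 0.5


def find_guilds(abundance, threshold=0.8):
    """Group strains whose profiles correlate at or above `threshold`.

    `abundance` maps strain name -> list of per-sample abundances.
    The threshold is an illustrative assumption, not a recommended value.
    Strains are merged transitively (single linkage) with a union-find.
    """
    strains = list(abundance)
    parent = {s: s for s in strains}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for a, b in combinations(strains, 2):
        if pearson(abundance[a], abundance[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in strains:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())


def aggregate(abundance, guilds):
    """Sum member abundances per sample, giving one variable per guild."""
    return [
        [sum(vals) for vals in zip(*(abundance[s] for s in guild))]
        for guild in guilds
    ]


# Toy data: four hypothetical strains measured in four samples.
abundance = {
    "strain_A": [0, 2, 4, 6],
    "strain_B": [0, 1, 2, 3],
    "strain_C": [5, 3, 1, 0],
    "strain_D": [6, 4, 2, 0],
}

guilds = find_guilds(abundance)
totals = aggregate(abundance, guilds)
print(guilds)
print(totals)
```

In this toy case the four strain-level variables collapse into two guild-level variables without consulting any taxonomy or pathway database, which is the property the abstract argues for. A real analysis would need a correlation measure and clustering procedure suited to sparse, compositional data.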