A statistical model for describing and simulating microbial community profiles

Many methods have been developed for statistical analysis of microbial community profiles, but due to the complex nature of typical microbiome measurements (e.g. sparsity, zero-inflation, non-independence, and compositionality) and of the associated underlying biology, it is difficult to compare or evaluate such methods within a single systematic framework. To address this challenge, we developed SparseDOSSA (Sparse Data Observations for the Simulation of Synthetic Abundances): a statistical model of microbial ecological population structure, which can be used to parameterize real-world microbial community profiles and to simulate new, realistic profiles of known structure for methods evaluation. Specifically, SparseDOSSA’s model captures marginal microbial feature abundances as a zero-inflated log-normal distribution, with additional model components for absolute cell counts and the sequence read generation process, microbe-microbe, and microbe-environment interactions. Together, these allow fully known covariance structure between synthetic features (i.e. “taxa”) or between features and “phenotypes” to be simulated for method benchmarking. Here, we demonstrate SparseDOSSA’s performance for 1) accurately modeling human-associated microbial population profiles; 2) generating synthetic communities with controlled population and ecological structures; 3) spiking-in true positive synthetic associations to benchmark analysis methods; and 4) recapitulating an end-to-end mouse microbiome feeding experiment. Together, these represent the most common analysis types in assessment of real microbial community environmental and epidemiological statistics, thus demonstrating SparseDOSSA’s utility as a general-purpose aid for modeling communities and evaluating quantitative methods. An open-source implementation is available at http://huttenhower.sph.harvard.edu/sparsedossa2.


Introduction
Microbial community research has increasingly benefited from study designs inspired by molecular epidemiology, particularly with the goal of associating features of the human microbiome with health and disease [1]. This has enabled discoveries ranging from overall ecological dysbiosis in gut community structure during inflammatory bowel disease (IBD) [2] to specific microbial species, strains, and gene families linked to colorectal cancer (CRC) [3].
However, in almost all cases, existing statistical methods for genetic, transcriptional, metabolomic, or other molecular epidemiology cannot be accurately applied directly to microbiome measurements, due to their unique measurement error, noise, zero-inflation, compositional, and non-independence properties [4,5]. This has led to inaccuracy issues in the literature, such as confounding, uncorrected population structure, batch effects, and a high rate of false positives [6][7][8][9]. There is thus an unmet need for statistical frameworks capable of capturing all aspects of microbiome epidemiology, both for the sake of accurately parameterizing and testing real community profiles, and for "reversing" parameterized models to simulate controlled, synthetic microbiomes for accurate methodology evaluation.
Transcriptional biomarker discovery has a similar history, in which early statistics to associate gene expression patterns with human phenotypes were met with challenges of noise, dimensionality, and test appropriateness [10]. This led to some of the first models for gene expression integrating features of underlying transcriptional biology, different assay platforms, and measurement noise [11]. These were in turn also "reversed" to provide simulated expression data for methods evaluation under guaranteed, controlled circumstances [12], permitting some of the first truly quantitative transcriptional epidemiology and comparative methods evaluation [13].
Models of microbial community structure are similarly important, and both their biological structure and measurement technologies are quite distinct from those for other sources of short-read sequence generation [14]. Microbial community profiles can be derived roughly equivalently from either amplicon (e.g. 16S rRNA gene) or metagenomic shotgun sequencing, and they consist of the (typically compositional) counts or proportions of taxa, genes, pathways, or other features derived from the source sequencing data. Like other types of molecular epidemiology profiles, they are typically a) high-dimensional (number of features equivalent to or surpassing sample size) [1] and b) require both feature-feature and sample-sample biological interactions (i.e. correlations or population structure) to be accounted for [15].
Additionally, microbiome data possess further unique properties that prohibit direct application of models from other molecular epidemiology research. They are considerably more sparse, i.e. zero-inflated, both due to low sequencing depth and biological absence [1]. As a result, in different settings, either biological presence/absence of microbial features or their abundances can be linked to phenotypes [16]. Microbial abundances from sequencing are also near-universally available only on a relative (compositional) scale, thus constrained to sum up to a constant. The combination of general high-dimensional statistical challenges with those unique to ecological profiles have impeded the development of a single, universal model of microbial feature structure.
As such, most previous strategies for modeling or simulating microbial community profiles (typically for methods evaluation) have been relatively simple [5]. Here, we will use "features" and "profiles" to refer to the quantification of taxa or other entities (e.g. genes or pathways) as counts or relative abundances from microbial community sequencing. McMurdie and Holmes [5] adopted deterministic mixing and multinomial sampling for simulating microbial taxa count observations; it thus does not allow for interaction between microbial features, nor does it model biological (as opposed to technical) absences. Similarly, Thorsen et al. [17] adopted random resampling of realworld data for simulating "new" microbial features and samples, indirectly violating compositionality and, again, excluding possible feature-feature interactions. Only recently, metaSPARSim [18] adopted a formal statistical model specifically for simulation of 16S rRNA gene amplicon-sequenced microbial observations (here abbreviated 16S), namely, the gammamultivariate hypergeometric (gamma-MHG) distribution. However, the gamma-MHG model, itself an over-dispersed version of the multinomial model, still does not allow for biological absences or feature-feature interactions. Additionally, the model's sampling implementation requires iteration over read depth for a given sample, which induces impractically high computation burdens to achieve realistic sequencing depths [1]. None of these frameworks formally capture microbial covariation with real or simulated covariates. In addition to other uses of such models, this is perhaps the most important aspect needed for benchmarking applications, where it enables estimation of power, false discovery rates, and effect sizes for microbiome epidemiology.
To address these gaps, we present SparseDOSSA (Sparse Data Observations for the Simulation of Synthetic Abundance), a statistical model that can be used to capture and, in turn, simulate realistic microbial community profiles. Motivated by the biological and technical data generation mechanisms and properties of microbial abundance observations, SparseDOSSA has model layers for a) zero-inflated marginal microbial abundances, b) penalized estimation of highdimensional feature-feature interactions, c) enforced normalization to address compositionality, and d) spiking-in of controlled microbe-microbe and microbe-environment covariation for benchmarking. We demonstrate through validations that the current implementation version, SparseDOSSA 2, accurately captures microbial community population and ecological structures across different environments, host phenotypes, and sequencing technologies, and is capable of recapitulating comparable, realistic synthetic profiles. We also show example applications in microbiome study design power analysis and in recapitulating a complex end-to-end mouse

A statistical model for microbial community profiles
SparseDOSSA is a hierarchical model for microbial count and relative abundance profiles ( Fig.   1), with components specifically accommodating the major distributional characteristics of such data, namely zero-inflation, compositionality (and thus sequencing depth), feature-feature nonindependence, feature-environment interactions, and high-dimensionality. Briefly (Fig. 1A, details in Methods), the model a) specifies zero-inflated log-normal marginal distributions for each microbial feature to allow for both biological and technical absences, b) imposes distributions on the "absolute", i.e. pre-normalized, microbial abundances to satisfy compositionality (similar to models such as the Dirichlet [19] or gamma-MGH [18]), c) models feature-feature correlations through a multivariate Gaussian copula [20], and d) adopts a penalized fitting procedure to address high-dimensionality [21]. Conditional on feature relative abundances and total read depth, count observations are modeled with a standard multinomial sampling procedure, and per-sample read depth is modelled with a log-normal distribution. For implementation, we adopted a penalized Expectation-Maximization procedure for model fitting, and we have evaluated and provided options for cross-validated selection of the optimal penalization parameter (Methods).  hierarchical model to capture the generation mechanism of microbial sequencing counts, including components for "hidden" absolute abundances, sequencing depth (and thus compositional relative abundances), zero inflation, and feature-feature and feature-environment interactions. B) SparseDOSSA can be fitted to varied microbial community types using cross-validation procedures by users; the software also provides pre-trained models are provided for human microbiome template datasets. This allows for C) simulation of either null or "true positive" association spiked-in synthetic datasets, to facilitate microbiome benchmarking or power analysis studies.

SparseDOSSA accurately recapitulates real-world microbial community structures
We validated SparseDOSSA's ability to accurately capture realistic microbial community feature profiles by quantifying its performance across a variety of real-world datasets (Fig. 2, Supplemental Table 1). The studies used include: 1,2) taxonomic profiles from shotgun sequenced metagenomes of healthy human stool and posterior fornix samples from the HMP1-II, hereafter referred to as "Stool" and "Vaginal" [22], 3) shotgun sequenced stool metagenomes of inflammatory bowel disease (IBD) patients from the HMP2 Inflammatory Bowel Disease Multiomics Database (IBDMDB, abbreviated as "IBD") [2], and 4) 16S rRNA gene sequenced murine distal gut communities after diet perturbation [23]. By evaluating the model in different cohorts, we established its robustness under different community phenotypes, habitats (i.e. body sites), overall ecological structures, and sequencing technologies (Supplemental Table 1).
SparseDOSSA 2 captured community parameters and re-simulated microbial profiles with overall community structures that accurately reflected those of the original, real-world ecologies, better than alternative methods ( Fig. 2A-B), across all human datasets (murine study results reported in separate section). Overall, simulated communities yielded the same patterns of global betadiversity as were contained within each modeled dataset ( Fig. 2A). This was quantitatively compared against alternative models (Dirichlet-multinomial, DMM [19] and gamma-MGH, namely metaSPARSim [18]) with the PERMANOVA 2 statistic [24] (Methods). We calculated ecological Bray-Curtis dissimilarities between real-world microbial profiles and those simulated by each evaluated method. We then quantified the total variability in the combined dissimilarities that could be attributed to real-world versus simulation difference, expressed as the PERMANOVA 2 .
Smaller 2 s thus indicate less deviation of the simulated community structures from the real-world target and better performance of the model.
Across almost all evaluated community types, SparseDOSSA 2 generated significantly smaller 2 statistics over 25 simulation iterations than existing methods (Mann-Whitney tests p < 0.05), indicating better fit to and recapitulation of the targeted communities (Fig. 2B, testing results in Supplemental Table 2). Notably, this was consistent in both the human gut (Stool, IBD), where community structure forms continuous "gradients" of microbial composition [9], and the human vaginal environment (Vaginal), where communities are often characterized by a few discrete types with dominant species [25]. Only for the Stool dataset did SparseDOSSA 2 slightly underperform when compared to metaSPARSim in terms of 2 statistic, while still outperforming with respect to per-feature distributions (Fig. 2D). Additionally, metaSPARSim's simulation procedure can take as much as ~10x longer than SparseDOSSA 2 (Supplemental Fig. 1), which is prohibitive for realistic data sizes (especially for benchmarking or power-analysis efforts requiring multiple simulations per parameter configuration, or for Monte-Carlo calculations). We thus conclude that, when evaluated for overall community structures, SparseDOSSA is capable of capturing microbial feature profiles that closely resemble those of real-world microbiomes.
The SparseDOSSA model also provided the best recapitulations of individual features' relative abundances ( Fig. 2C-D). For representative features in each environment, the empirical cumulative distribution function (CDF) curves of samples show that SparseDOSSA 2 simulated abundances closely resemble those of the real-world data (Fig. 2C). Quantitatively, for each set of microbial features, we measured the difference of distributions between re-simulated and realworld (modeled) relative abundances with the Kolmogorov-Smirnov test statistic (K-S, see Methods). The resulting K-S statistic provides a distance between the distribution of each feature's relative abundances across simulated vs. modeled real-world communities. Smaller K-S statistics thus indicate better performance of the model. SparseDOSSA 2 better approximated the targeted real-world per-feature distributions than existing methods across all evaluated datasets ( Fig. 2D), reaching statistical significance in each case (Supplemental Table 3, Mann-Whitney tests p < 0.05). In addition to simulating existing microbial features, SparseDOSSA 2 also provides the functionality to simulate new features that resemble the targeted environment's ecological characteristics (Methods) and was validated to generate "Stool-like", "Vaginal-like", or "IBD-like" new features in terms of prevalence, abundance, and variability for each of the tested datasets (Supplemental Fig. 2). Thus both in overall community structure modeling and in perfeature models, SparseDOSSA 2 was able to accurately capture and re-simulate realistic microbial observations better than alternative approaches.  Table 2). 2 compared against randomly split original real-world data are included as baseline controls. C) Representative features from each environment are similarly distributed between real-world and SparseDOSSA 2 simulated samples, as shown in empirical cumulative distribution functions (CDFs) of log-10 relative abundances (with pseudo value 1e-6 to visually represent zeros). D) Per-feature Kolmogorov-Smirnov statistics quantify that SparseDOSSA 2 outperforms existing methods in simulating realistic feature-level relative abundance distributions (pvalues are significant and included in Supplemental Table 3).

SparseDOSSA captures covariation among microbes and with real or simulated "phenotypes"
Once the SparseDOSSA model is fit to a real-world microbial community profile, the "reversed" version of the model can be used not only to simulate similar, controlled ecologies, but to spike them with known feature-feature or feature-covariate associations (i.e. metadata "spike-ins"). This is implemented by first capturing the "null" state of targeted real-world studies as described above and by subsequently modifying the fit model parameters to induce artificial associations.
Compared to existing spiking-in paradigms [5,17,18], the model includes two important improvements (Methods). First, SparseDOSSA can model a wide variety of covariates -discrete, continuous, or any combination thereof -with multivariable linear modelling, and can thus accommodate simulations of realistic microbiome population study designs with multiple phenotypes, exposures, or confounders [1]. Second, associations with both non-zero (abundance) and zero-inflated (prevalence) components of microbial features can be captured, along with clearly defined effect sizes (fold change or odds ratio, see Methods). This enables rigorous evaluations of, for example, differential abundance testing methods for their statistical performance (e.g. power or false positive rates).
Based on models fit to the Stool and Vaginal communities, SparseDOSSA 2 accurately introduced associations for control "phenotypes" in a new, simulated population ( Fig. 3, Supplemental Fig.   3). Specifically, for the Stool dataset, we introduced a binary covariate (similar to e.g. a case / control contrast) with non-zero effects on 16 (5% of the total 332) microbial features' abundances ( Fig. 3A) and prevalences (Fig. 3B). Features were selected to ensure the highest effective sample size (Methods). For simulated associations of the "phenotype" with feature abundances, log fold change of non-zero relative abundances largely agreed with the target effect sizes within 95% confidence levels (Fig. 3A). Prevalence log odds ratios were also as targeted (Fig. 3B), with effects in relative abundances mostly agreeing with the prescribed effect sizes. Similar abundance and prevalence results were consistently reproduced in the Vaginal environment (Supplemental In addition to modeling associations between microbial features and external covariates, SparseDOSSA can also model community ecological interactions (i.e. correlations between microbial features, or feature-feature "spike-ins", Fig. 3C). This is captured by extending the feature-covariate spiking process above to synthetically associate multiple features with the same hidden covariate (Methods). First, a null model fit to the Stool community contains no true featurefeature associations, only those that manifest spuriously due to compositionality (Fig. 3C).
Starting with this, we modified the model to induce increasingly large feature-feature "ecological" interactions (Supplemental Fig. 4). SparseDOSSA 2 produced both only and exactly the expected true feature-feature associations among absolute abundance components, and the correct induced compositional correlations after simulating the sequencing assay process (Fig.   3C). These results support SparseDOSSA's ability to modify baseline, null community structures by the introduction of interactions among features or with controlled covariates, which together enable the evaluation of a wide range of statistical approaches to microbiome analysis [15,26,27].

Modeling environment-specific benchmarking and power estimation
Since most microbiome analysis methods make simplifying assumptions that may or may not be suited to particular ecologies, SparseDOSSA's flexible model enables power and accuracy estimation in a habitat-specific manner (Fig. 4). Specifically, by spiking only a limited set of known feature-phenotype associations into an otherwise guaranteed-null model, differential abundance methods can be compared directly to each other in a controlled setting (more details in [28]), enabling targeted method benchmarking (Fig. 4A) or power analysis (Fig. 4B).
To demonstrate SparseDOSSA's use for benchmark comparison of microbial community statistical tests, we again simulated synthetic datasets based on the Stool profiles with "phenotypic" associations spiked-in for 5% of features at varying effect sizes (as in Fig. 3A).
Multiple replicates of the same parameter set were performed to provide performance metric mean and standard errors (Methods). Using the resulting gold standards, the performances of three different association tests -limmaVOOM [29], ANCOM [30], and MaAsLin 2 [28] -were similar for power, but false discovery rates varied strikingly (Fig. 4A). Notably, the MaAsLin 2 generalized linear model showed good FDR control at small to moderate effect sizes. At higher effect sizes, non spiked-in ("null") features are also called by MaAsLin 2 as differentially abundant.
Interestingly, this is because SparseDOSSA's spike-in effects are imposed on features' simulated absolute abundances (Methods), and high effect size spike-ins thus also induce relative abundance change in null features due to compositionality. This highlights the important difference between true differential abundance effects corresponding to microbes' biological variation, versus changes post normalization that are driven by other features.
In contrast, ANCOM [30] was designed to account for compositionality and draw inference about hidden absolute abundances; it successfully and maintained FDR under moderate to strong effect sizes. Arguably as a result, however, its performance suffered for small to null effects, presumably because in such cases it is difficult to distinguish between "driver" microbial features with true absolute effects versus those with changes in their relative abundances due to compositionality.
To demonstrate SparseDOSSA's use for power analysis during microbial community study design, we focused targeted simulation datasets with spiked-in effects on a feature modeled on Escherichia coli, as a microbe commonly associated with dysbiosis in the human gut [2]. Using this approach, SparseDOSSA 2 can be used to estimate each association method's expected power for similar biomarkers and populations. In this example, MaAsLin 2 has high power to detect a two-fold abundance change in "E. coli" for a sample size of at least ~500 individuals, but greatly reduced power for smaller fold-changes (Fig. 4B). Since model power for differential abundance testing in sparse, compositional data is extremely difficult to determine parametrically, SparseDOSSA thus provides a way to do so by simulation tailored to any community type or feature of interest.

SparseDOSSA reproduces an end-to-end diet-microbiome analysis
In many cases, SparseDOSSA thus captures the properties of microbial community ecologies well enough to reproduce surprisingly specific aspects of their membership and distributions, which we next demonstrated by reproducing an end-to-end example from a longitudinal, interventional diet study investigating the effects of diet on the murine gut microbiome [23]  After re-simulating communities based on these models fits, ordinations of SparseDOSSA 2 results closely mimic the originally observed clustering structure of dietary effects, and even the longitudinal effects of time under treatment (Fig. 5A). For quantitative differential abundance effects, based on the observed difference of raw diet (TRF) samples when compared against cooked/free-fed (TCF) and cooked/restricted (TCR) within the tuber diet group [23], we additionally applied SparseDOSSA 2's spike-in procedure to simulate a ~2x fold increase in the abundance of Bacteroidetes OTUs in TCF/TCR when compared to TCF samples, and a ~2x fold decrease in Firmicutes OTUs (Methods). Consequently, the simulated samples displayed similar differential outcomes in community diversity as measured by Shannon index, as well as Firmicutes/Bacteroidetes ratios, as seen in the original study ( Fig. 5B-C).
Interestingly, even though per-feature absolute abundances are theoretically unidentifiable in the SparseDOSSA model (Methods), we note the spiking-procedure recapitulated the decreased total cell counts in TRF (Fig. 5D) that [23] also observed via quantitative PCR. The difference between chow versus tuber diets, on the other hand, is completely attributable to SparseDOSSA's framework, as one can arbitrarily modify the average absolute abundances of our fit to the chow or tuber diet samples, but still yield exactly the same relative abundance profiles. These realworld application results highlight SparseDOSSA's adaptability to community phenotypes and treatment effects, as well as demonstrate its performance for amplicon sequence datasets and microbial communities associated with non-human hosts. SparseDOSSA model was also able to model and synthetically replicate changes in "Bacteroidetes" and "Firmicutes" phyla in response to raw vs. cooked diets, including B) overall community alpha-diversity (Shannon index), C) the resulting "Firmicutes" vs. "Bacteroidetes" ratio, and D) overall whole-community effective biomass. TRF = raw tuber (free-fd); TCF = cooked tuber (free-fed); TCR = cooked tuber (restricted ration).

Discussion
Here SparseDOSSA 2 itself optionally spikes-in any such structure). The last assumption 4) that sequencing depths within study are themselves log-normal typically has minimal impact on model fitting or usage. The third assumption holds reasonably well even when any correlation structure originally present is weak or rare relative to overall microbial variance, or affects only a small proportion of features, similar to the assumption of "few differential transcripts" used in most RNAseq models [31]. Second, inasmuch as the read count of each feature depends on its own observed mean, variance, and sparsity, SparseDOSSA 2's simulated data will replicate the marginal distribution of the originating template community. This guarantee on the null distribution of subsequently generated communities allows correlation structure (with samples or among features) to be optionally added in isolation for evaluation of microbial community analysis methods. The first assumption is most approximate -it is generally true for ecologically diverse communities, which empirically follow power-law or log-normal behaviors (with a few abundant organisms and a long tail representing the increasingly rare biosphere). However, as discussed above, its violation leads to small residual systematic biases (<0.5%) in communities where tails of rare organisms are more truncated than expected.
Perhaps the greatest strength of the model is its application in simulating microbial community profiles, which we have emphasized and validated here. Most previous methods for associating microbial features with covariates [30,[32][33][34] or with each other [15,27,35,36] have relied on heterogeneous, one-off models not necessarily reflective of any one "real" microbial community type, or of the diversity of ecological configurations observed in the wild (e.g. the human gut vs.
vaginal microbiome vs. soil). By providing a model that can accurately capture many different community types, remove any existing structure through null distributions, and re-introduce known, controlled structure (microbial or covariate), we hope to provide a convenient, unified framework with which statistical methods can be validated specifically for their environments of interest (e.g. human microbiome epidemiology vs. environmental ecological interactions). In addition to this application, while not emphasized here, the model's parameterization can be used to directly inspect or compare microbial communities. For example, the estimates of absence probabilities for important microbes of interest in specific human populations (e.g. Prevotella in the Westernized vs. non-Westernized gut [37]), or the relationships between vs. mean logabundance across microbes (i.e. prevalence vs. abundance) are directly informative as to their neutral dispersal vs. selection [38]. To some degree this is evident from the murine feeding example above, but most such applications remain to be demonstrated in broader "real-world" datasets.
Relatedly, SparseDOSSA successfully reproduced reported dietary effects on the mouse gut microbiome [23], without assuming such differences a priori (Fig. 5). By fitting our model on microbial observations of separate treatment groups and time points, we allowed SparseDOSSA to adapt to each subset independently, but without assumptions on the existence or magnitude of differences between them. The emergent reproduction of differentiation by diet in the resulting synthetic communities and features (Fig. 5A) exemplifies SparseDOSSA's utility in capturing environment-or treatment-specific dynamics of real-world microbial communities. In parallel, by introducing effects within each dietary group, SparseDOSSA's per-feature spike-in procedure was able to reproduce structural microbial community changes such as overall diversity and wholephylum abundance trade-offs. Together, this end-to-end real-world case study highlights SparseDOSSA's two key functionalities while also testing a non-human, amplicon-sequenced application context: generating realistic microbial community profiles that closely mimic the targeted environment, and introducing covariate spiked-in microbial perturbations to simulate treatment effects.
With respect to this second use case (covariate effects spike-in), existing simulation models often adopt the simplistic approach of modifying the abundances of taxa in the null community to introduce known associations [5,18]. SparseDOSSA, in comparison, utilizes rigorous perturbation models to explicitly specify the marginal means of taxa as functions of chosen covariates. This a) enables much more flexible applications such as the inclusion of confounders or random effects (by incorporating them as covariates), and b) yields spiked-in datasets that are strictly compatible with the standard assumptions of (generalized) linear models. Alternatively, differentiation between simple binary (case-versus-control) contrasts could be achieved with our current model by training SparseDOSSA separately on the two corresponding population subsets, given that each was sufficiently large to serve as a template.
Our modeling and simulation procedure for generating feature-feature correlations is, in turn, directly based off the feature-covariate model and comparatively more restrictive; we expect to explore more rigorous and flexible approaches in future work, since any one "correct" way to model ecological associations in absolute vs. relative abundance space is not clear a priori [35].
Another related area for future work is in the specific model used for absolute abundances, which are not well-understood from currently available data; our current assumption holds if the total biomass of "typical" communities does not change under "typical" circumstances, but this is obviously quite qualitative. Direct measurements of microbial biomass in some environments such as the human gut have sometimes shown this within approximately one fold change [39,40], but not in all cases, and certainly not during extreme perturbations such as antibiotics [41].
Thus the SparseDOSSA model simultaneously provides a conceptual framework with which to capture key aspects of microbial ecologies and their members, a simulation system for benchmarking statistical methods that assess correlation structure in microbial community profiles, and a set of marginal parameters for each community and community type of lower dimensionality and potentially reduced noise relative to raw data. The last, while again not yet explored, could allow sample metadata covariates to be more accurately tested for association with microbial features, or tested for association with microbial community features indirectly (e.g. via their prevalence or mean when present). In addition to the areas discussed above, future expansions of the model might include longitudinal structure or other interdependencies among samples (i.e. population substructure), as well as diversifying the application areas for the model (e.g. for power calculations during study design). As currently implemented, SparseDOSSA 2 provides an endto-end system that enables reproducible and efficient validation of quantitative methods applied to microbial community taxonomic profiles, allowing fair comparisons to be made between different methods or studies to establish a consistent baseline for statistical validation.

The SparseDOSSA model
SparseDOSSA uses the following data generation mechanism to parameterize microbial community profiles: a) environments/samples contain microbes with absolute abundances , b) these are normalized to relative abundances , which c) can be measured via sequenced counts . As detailed in Fig.1A, our model specification for these components is: • For the unobserved absolute abundances = ( 1 , 2 ,⋯, ), we specify a Gaussian copula model [20] with zero-inflated log normal marginal distributions. Specifically, this involves assuming hidden multivariate Gaussian variables = ( 1 , 2 ,⋯, ) for the microbial features and a mapping of these variables to the corresponding absolute abundances ( 1 , 2 ,⋯, ): ○ ∼ (0, -1 ). That is, each is a standard (0, 1) variable and their correlation matrix is -1 .
○ Each is mapped to such that follows a zero-inflated log-normal distribution, parameterized by absence probability ( ) and mean and variability of non-zero log abundances ( , 2 ): Where is the standard normal cumulative density function and is the cumulative density function of the zero-inflated log-normal distribution, parameterized by , , 2 .
It follows from our model specification that, marginally, follows the prescribed zero-inflated log-normal distribution exactly: Jointly, correlations between microbial features' absolute abundances are characterized through the copula parameter . The benefit of adopting a copula model is to separate the parameterization and estimation of a joint distribution into its marginal and correlation components; this is illustrated in the model fitting subsection below.
• For a given sample with sequencing depth , its per-feature read counts ( 1 , 2 ,⋯, ) are assumed to follow a multinomial distribution with individual features' probabilities given by .
• Lastly, we assume the sequencing depth across samples follows a log-normal distribution.

Model likelihood
It is helpful to clarify the likelihood of our model given its parameterization. First, we derive , the likelihood for the unobserved absolute abundances . The likelihood of observed data, as we show later, is an integration of . For illustration purposes, we first note the special case where are not zero-inflated. That is, = 0 for all 's. In this case, we have that: Where is as defined above: = -1 ( ( )) and ( ⋅ ) is the standard normal density function.
The equality follows by noting that the second term (the product) is the Jacobian of the mapping → : = -1 ( ( )). When one or more 's are zero-inflated, the mapping → is not one-toone, and the right hand side of the equality requires integration over 's that map to zero-valued 's: To derive the likelihood for relative abundances , we note that , jointly with the total absolute abundance ( : = ∑ ), forms a one-to-one mapping with the absolute abundances ( = ). Thus, the density function for , , can be obtained through integration of Σ , , which is simply multiplied by the Jacobian of the transformation: Lastly, for the observed microbial count data , the proper likelihood is: Where | ( ) is the multinomial likelihood for microbial counts given their relative abundances.
In practice, to simplify computation, during model fitting we replace this likelihood with Where is the multinomial MLE for given observed , i.e., = ∑ , is a normalizing constant not involving the parameters. The approximation is acceptable because with modern sequencing depth [1], | ( | ) (as function of ) is highly concentrated around . The right-hand side of (2) is what we aim to maximize for estimation of our model's parameters, 1 , 1 , 1 2 ,⋯, , , 2 , and .

Identifiability
It is important to note that likelihood (1)   ,⋯, 2 are identifiable, given 1 ,⋯, and . One can note that when = 1 for all 's, our likelihood degenerates to that in [42] with explicit analytical forms.
• is not identifiable. Again, consider the special case that = 1 and = 1, the form of is explicit and involves -1 ′ 1 ′ 1 , which is a multiple-to-one mapping from . The issue of non-identifiable correlation matrices for microbiome abundance data has been noted and addressed in many previous works; refer to {25950956; 28489411; 29140991; doi.org/10.1080/01621459.2018.1442340} for a partial list. We adopt the technique used in many of these previous works, namely 1 penalization on , to simultaneously address the identifiability issue as well as high-dimensionality for generic estimation of large covariance matrices [21].

Model fitting
Given our model specification and its (non-)identifiability, we propose to minimize the following Subject to the constraint for as specified above: ∑ = 0. As such, > 0 is a penalizing tuning parameter, which we choose with cross-validation in practice. can be either existing relative abundance estimations or, as specified above, normalized from count observations ( = ∑ ).
As specified in (1), the likelihood function involves integration over and is not analytically tractable. Numerically, we propose the following penalized expectation-maximization algorithm [43] for model fitting: can be solved with standard graphical lasso [21].

Generating synthetic microbial observations and simulating new features
Given that our model is fully parametric, synthetic microbial observations, including (hidden) absolute abundances, normalized relative abundances, and sequencing counts, can be generated following the same specifications as described above. To provide model parameters, the user can adopt one of the pre-trained sets included with the software or use the SparseDOSSA 2 training procedure to estimate parameters from any microbial template dataset suited for their simulation case.
Users may also be interested in generating "new" microbial features from the same ecological environment. For this, SparseDOSSA additionally models the per-feature parameters ( , , 2 ) with a three-dimensional non-parametric distribution . That is, across features, ( , , 2 ) ∼ .
Given a set of SparseDOSSA fitted results ( 1 , 1 , 2 1 ,…, , , 2 ), can be estimated with a threedimensional normal kernel density estimator [44]. The estimated can then be used to simulate new microbial features that follow the ecological characteristics (i.e., prevalence, abundance, and variability) of the fitted environment (Supplemental Fig. 2).

Association spike-in
SparseDOSSA adopts linear and generalized linear models for flexible spiking-in in both microbial features' non-zero abundances and prevalences, based on covariates. Let be the vector of covariate(s) for sample and be the targeted corresponding effect sizes (coefficients). To spike in associations between feature 's abundance and covariates , we modify the feature's nonzero mean log absolute abundance parameter across samples. Specifically, the post spike-in mean log abundance is modified as For the -th sample, can be generated with ( , , 2 ) instead of the original ( , , 2 ). This dictates that 's are associated with in their mean non-zero log abundances. As are specified on the log scale, , by definition, corresponds to log fold changes.
The prevalence spike-in similarly is specified via the logistic model; we modify the presence probability parameter (1 -) across samples: And generate 's correspondingly. This introduces an association between the covariates and feature 's prevalence, with corresponding to log odds ratios of the feature being present.
The multivariate linear modelling approach for specifying the association effects for both abundance and prevalence allows us to flexibly incorporate different variable types (e.g. binary, continuous, etc.) and study designs (e.g. existence of confounders).
We note that, importantly, our spiking-in procedure is performed on the absolute abundances, , which induces differential effects in relative abundances (Fig. 3A-B). The main benefit of this approach is that both the spiked-in microbial features and the "null" (i.e. non-spiked features) are clearly defined. The alternative -specifying effects for -is conceptually difficult. As is compositional (sums to 1), prescribing enrichment effects (higher abundance or prevalence) for some microbial features must by definition lead to depletion effects for certain other features. This renders it difficult to clearly define and separate the set of "true positive" spiked-in microbial features and the set of null features. SparseDOSSA's definition of effects for absolute abundances in its spike-in procedure align with recent efforts to rigorously characterize microbial differential abundance effects under the constraint of compositionality [30,45]. Empirically, we note that prescribed log fold changes or odds ratios for often lead to similar effect sizes in the relative abundances for the spiked-in feature 's ( Fig. 3A-B).
Lastly, we note that the spiking-in procedure with metadata variables can be used to simulate association effects between pairs of microbial features (Fig. 3C). Specifically, we first simulate a hidden covariate with standard normal distribution. For a pair of features 1 , and 2 , to enforce positive correlations between the two absolute abundances 1 and 2 , we simulate for them to be associated with in the same direction: can be viewed here as the effect size specifying the strength of correlation between 1 and 2 . To spike in negative correlation between the two, we simply keep as the effect for one of the features and usefor the other.

Evaluation with real-world datasets
For evaluation and comparison of microbiome simulation methods, we examined three real-world datasets with different host environments and disease statuses (Supplemental Table 1) [2,22]. We subset publicly available species level profiles from [22] Table 1.
To evaluate the performance of individual methods, we randomly partitioned each dataset (Stool, Vaginal, IBD) into halves for five iterations. For each partitioning, we fit the parameterization/simulation methods (DMM, metaSPARSim, and SparseDOSSA 2) on one half of the data (training). We then simulated synthetic microbial observations with the same sample size using the fitted results. Lastly, we compared the synthetic observations with the other half of the partitioned data (testing) in terms of both overall dissimilarity with PERMANOVA [24] and perfeature distribution differences (methods detailed below). Within each partitioning, this simulation was also performed five times for each method. That is, each method was used to randomly simulate five synthetic datasets for comparison with the testing half. The partitioning procedure allows us to evaluate method performance without a model overfitting effect. The DMM was fitted using R package "dirmult" and metaSPARSim was fitted using the implementation referred to in its publication [18]. For metaSPARSim fitting, the percent not zero filter for features was set to 0 instead of the default 0.2. In our evaluation this led to an observed performance increase (thus a favorable assessment), likely due to the existence of many highly zero inflated microbial species.
To evaluate overall dissimilarity between the original and synthetic samples, for each partitioning we combined the testing half of the original samples with the simulated datasets (five for each fitted method). We calculated the sample-to-sample Bray-Curtis dissimilarity matrix on the combined dataset. The univariate PERMANOVA model, ∼ { } was fitted.
The corresponding 2 statistic quantifies the percentage of variability between samples attributable to the difference between original "real-world" as compared to simulated samples.
Smaller 2 statistics indicate less difference, and better performance of the simulation method.
For each method, a total of 25 PERMANOVA evaluations (5 original dataset partitioning × 5 simulation) was performed for each real-world dataset. Lastly, we additionally evaluated the 2 between the training and testing halves of a dataset for each partitioning; this yields an estimation of minimum achievable 2 's for each dataset.
To evaluate the difference between distributions of individual features in the original and synthetic datasets, we simply combined the synthetic datasets generated across all partitioning and simulation repetitions. An individual feature was compared for its relative abundance distribution between the original real-world data and combined synthetic samples. This was quantified with the Kolmogorov-Smirnov (K-S) test statistic, which is defined as the largest absolute difference between the empirical cumulative distribution functions of the real-world and synthetic abundances. Smaller K-S statistics indicate better approximation of the targeted real-world distributions with the simulation method.

Association spike-in evaluation
We simulated spiked-in associations between microbial features and a synthetic case/control variable, based on the SparseDOSSA 2 fitted results. A total of 1,000 synthetic samples were simulated (500 cases and 500 controls). For non-zero abundance spike-in (Fig. 3A), the top 5% (16 total) most prevalent features were selected for spiking-in; this yields the highest effective sample size for the selected features because our abundance spiking-in targets only the non-zero component of a feature's distribution. Half of the features were spiked for a targeted log fold change ( ) of 1 in cases compared to controls, and the other half were spiked for a log fold change of -1. Actual log fold changes in the simulated relative abundances, along with 95% confidence intervals, were calculated by performing a linear regression on the log transformed non-zero relative abundances for each feature.
Similarly, for prevalence spiking-in (Fig. 3B), the top 5% features with prevalence closest to 0.5 were selected; as with abundance spike-ins, this was to ensure the spiked-in features had the highest effective sample size, as the association between a binary outcome (presence/absence here) and a binary covariate (case/control) is best-powered when the sample distribution is balanced across all different outcome/covariate combinations [46]. Again, half of these features were spiked for a targeted log odds ratio ( ) of 1 in cases compared to controls, and the other half were spiked for a log odds ratio of -1. Actual log odds ratios of the simulated feature prevalence, along with 95% confidence intervals, were calculated by performing a logistic regression on the presence for each feature.
For simulation of feature-feature associations, we first set the correlation between feature pairs in SparseDOSSA 2 to zero (i.e., = where is the identity matrix). This ensures that feature absolute abundances are independent in the "null" dataset ( Fig. 3C left panel, bottom right), whereas spurious correlation still exists in relative abundances due to compositionality. Two random pairs (four features) in the top ten most abundant features were selected for non-zero feature-feature association spike-in. As specified in Methods above, we simulated two independent normal synthetic hidden metadata variables, one for each feature pair to be associated. For the first feature pair, they were spiked in both abundance and prevalence with the same effect ( = = ) at varying sizes, for positive association. The second pair were spiked with opposing effects ( for one, -for the other) for negative association. We used Spearman correlation to estimate the empirical association between feature pairs in the simulated absolute and relative abundances. Target association effect size was also varied ( of 0, 1, 2, and 5) to showcase the relative signals of "true" associations that exist for both absolute and relative abundances, and spurious associations that are only induced in relative abundances due to compositionality (Fig. 3C, Supplemental Fig. 4)

Benchmarking and power analysis
Since "true" associations with prescribed effect sizes are known for SparseDOSSA synthetic datasets, they can be used for benchmarking microbiome analysis methods as well as for power analysis of microbiome study designs. For benchmarking analysis (Fig. 4A), we again selected the top 5% (16 total) most prevalent features in the Stool dataset to perform abundance spike-in, such that the selected features had the highest effective sample size. A total of 200 microbial profiles were simulated to be associated with a balanced binary metadata (100 cases, 100 controls). We varied effect sizes with half spiked features at = (0, 0.5, 1, 2) and the other half with = (0, -0.5, -1, -2), correspondingly (in the effect size 0 case no spike-in was performed and microbial profiles are generated independently of metadata). A total of 500 random simulations were performed for each parameter combination. We applied existing differential abundance analysis methods to detect the spiked-in features in each simulation dataset [28][29][30], with individual method configurations as reported in our previous benchmarking analysis [28]. We summarized the empirical power and FDR of a method in one simulation dataset, across the twenty random replicates for each parameter configuration, and reported the mean and standard error in Fig. 4A.
For showcasing SparseDOSSA's utility in a power analysis, we spiked in non-zero abundance associations with a balanced case-control variable for a simulated species parameterized by fitting Escherichia coli. This was performed at varying effect sizes (log fold change, = (0.5, 1, 2)) and sample sizes (100 to 1000). For each parameter configuration, a total of 500 replicates were simulated. The empirical power and its standard error of using MaAsLin 2 to detect the differential abundant effect in "E. coli" was summarized across the 500 replicates and reported in Fig. 4B; this was repeated for each effect size/sample size configuration.

Murine diet microbiome analysis
We applied SparseDOSSA 2 to the longitudinal diet dataset of the mouse gut microbiome in [23], to show that our method is capable of reproducing a complex study's findings. To recapitulate the longitudinal diet effect as reported in [23]'s Fig. 1a, we fitted SparseDOSSA 2 separately on 1) the control Chow diet samples at baseline, 2,3) Tuber diet samples at day 1 and day 5, separately, and 4,5) Meat diet samples at day 1 and day 5, separately. This approach allows SparseDOSSA 2 to independently fit subsets of the data, without assuming a priori the observed differences noted in [23]. We then used SparseDOSSA 2 fitted results to simulate synthetic observations for each diet/timepoint combination, with five times the original sample size (to reduce variability due to random sampling). Bray-Curtis MDS ordination on these synthetic data displayed a striking resemblance to that observed in [23] (Fig. 5A), in that a) communities cluster according to dietary treatment, and 2) this response is consistent after one day of switching from chow to whole-food diets and is strengthened at day 5.
We next reproduced the differential gut microbial profiles observed in mice fed raw versus tuber diets as presented in [23]'s Fig. 1f-g. [23] adopted three different types of Tuber diet: the raw/freefed (TRF), the cooked/free-fed (TCF), and the cooked/restricted (TCR). This study presented that on the phylum level, TRF induced enrichment of Bacteroidetes and depletion of Firmicutes when compared to TCF/TCR. We applied SparseDOSSA 2's feature spike-in procedure to approximate this effect. Specifically, we generated a balanced, three category (TRF/TCF/TCR) variable. Based on our fitted model of the Tuber diet at day 5, we spiked in a two-fold ( = log 2) increase in the non-zero abundance of Bacteroidetes OTUs and a two-fold decrease in Firmicutes OTUs ( = -log 2) in TRF samples when compared to TCF/TCR samples. This roughly agrees with the presented results in [23] Fig. 1d. We next simulated SparseDOSSA synthetic datasets for both the baseline Chow diet samples (sample size = 20), and the spiked-in Tuber diet samples (60 samples total, 20 each for TRF/TCF/TCR). We calculated the Shannon index and Firmicutes/Bacteroidetes ratios of these samples, and show in Fig. 5B that they agreed with the corresponding findings presented in [23] Fig. 1f-g.