Marginalized mixture models for count data from multiple source populations

Mixture distributions provide flexibility in modeling data collected from populations having unexplained heterogeneity. While interpretations of regression parameters from traditional finite mixture models are specific to unobserved subpopulations or latent classes, investigators are often interested in making inferences about the marginal mean of a count variable in the overall population. Recently, marginal mean regression modeling procedures for zero-inflated count outcomes have been introduced within the framework of maximum likelihood estimation of zero-inflated Poisson and negative binomial regression models. In this article, we propose marginalized mixture regression models based on two-component mixtures of non-degenerate count data distributions that provide directly interpretable estimates of exposure effects on the overall population mean of a count outcome. The models are examined using simulations and applied to two datasets, one from a double-blind dental caries incidence trial, and the other from a horticultural experiment. The finite sample performance of the proposed models are compared with each other and with marginalized zero-inflated count models, as well as ordinary Poisson and negative binomial regression.


Introduction
The analysis of data from populations with unexplained heterogeneity presents special challenges to researchers. When count data arise from mixtures of unobserved populations, models based on standard probability distributions are often inadequate to explain observed variability (Frühwirth-Schnatter 2005;Wedel and DeSarbo 1995). For example, in dental caries research and many other areas, proportions of observations with zero counts are often higher than expected under the Poisson or negative binomial distributions, and regression models based on these distributions may result in biased estimates and poor predictions. To account for such excess zeros, Mullahy (1986) and Lambert (1992) proposed zero-inflated Poisson (ZIP) regression. ZIP models, which employ twocomponent mixture distributions, hypothesize that observed counts arise from one of two latent classes within the source population: one class provides only zeros and the other produces both zero and non-zero values. However, the assumption of a model based on 'at-risk' and 'not-at-risk' latent classes may not be appropriate in some settings or may provide an inadequate fit. To model counts from multiple source populations, Wang et al. (1996) proposed multi-component Poisson mixture distributions, and their approach has been extended to other finite mixtures of non-degenerate count distributions. Despite the flexibility of finite mixtures for describing highly dispersed count data, parameters from standard mixture regression models are not directly applicable to making inferences about the overall effects of covariates on marginal means of count outcomes (Albert et al. 2014;Preisser et al. 2012). Even with the application of indirect methods of parameter estimation such as the use of post-modeling transformations, there are many instances where latent class model formulations fail to fully explain relationships between covariates and population-wide parameters.
While the importance of the marginal mean as a target of inference in the analysis of finite mixtures of counts is well established (Albert et al. 2014;Böhning et al. 1999;Lambert 1992;Preisser et al. 2012), marginally-specified mean models for finite mixtures of count distributions have only recently been proposed. Within a ZIP likelihood framework, Long et al. (2014) proposed marginalized zero-inflated Poisson (MZIP) regression, which specifies a two-part model for counts with a set of regression coefficients for the marginal mean and, to complete model specification, a second set of regression coefficients for the latent parameter defining membership in the 'excess-zero' class. The marginalized zero-inflated negative binomial (MZINB) model (Preisser et al. 2016) extended the MZIP model to account for overdispersion in addition to excess zeros. Todem et al. (2016) described a general representation of two-part marginalized mean count models including distributions for bounded counts, e.g., the zero-inflated betabinomial distribution. In each case, the model is assumed to follow a two-component mixture consisting of a standard count distribution with a degenerate point-mass at zero. However, data-generating mechanisms based on mixtures of non-degenerate count distributions can often provide better fits when the data suggest that a simple degenerate point-mass is insufficiently flexible to capture the heterogeneity in the counts. This can arise, for example, when there is overdispersion in the counts that cannot be fully explained by standard count data distributions (e.g., Poisson, negative binomial) amended by excess zeros.
In this article, we seek to expand the class of marginalized mixture models for zeroinflated and other heterogeneous count data to allow for greater model choice with maximum likelihood estimation, when there is interest in evaluating the effects of exposures on the overall mean count. For counts having unexplained heterogeneity, we extend the degenerate component of marginalized zero-inflated models to standard count distributions for more flexible modeling of the marginal mean. Our motivation comes from a randomized double-blind caries incidence trial conducted between 1988 and 1992 in Lanarkshire, Scotland, to compare the anti-caries efficacy of three toothpaste formulations in 4294 children ages 11-12 (Stephen et al. 1994). The outcome variable of interest was the number of new decayed, missing and filled surfaces (DMFS) two years following the baseline dental exam. Caries development is a complex process, which depends upon multiple biological and environmental factors; moreover, the clinical manifestation of disease is influenced by preventive care and restorative dental treatment decisions. For this reason, standard count models such as Poisson and negative binomial regression may not adequately account for heterogeneity in the DMFS counts. We consider marginalized, two-component finite mixture models to obtain direct inference about the relationship between toothpaste formulation and the marginal mean caries count in the trial population. "Methods and Results" section reviews traditional and marginalized zero-inflated count regression models, while "Models for mixtures of nondegenerate count distributions" section discusses traditional finite mixture regression models and proposes marginalized two-component count regression models involving mixtures of non-degenerate distributions. Simulation studies and two applications of the proposed models including the Lanarkshire caries trial are discussed in "Simulation study & Applications" sections, respectively. Concluding remarks follow in "Discussion and conclusions" section.

Zero-inflated Poisson and negative binomial models
Traditional zero-inflated models assume that counts arise from a two-component mixture of a standard count distribution with a distribution degenerate at zero. Under such models, counts are generated either from a 'non-susceptible' or 'perfect' state that always gives zeros, or from a 'susceptible' , 'imperfect' state that produces both zero and positive counts according to a standard count data distribution. Lambert (1992) introduced the zero-inflated Poisson (ZIP) regression model and applied it for modeling defects in manufacturing processes, where defects are assumed coming from a 'perfect' state with a probability π or an 'imperfect' state with a probability 1 − π. While counts from the 'perfect' , 'no-defect' state are always zero, those from the 'imperfect' state follow a Poisson distribution. The probability mass function (pmf ) of a random variable having a ZIP or zero-inflated negative binomial (ZINB) distribution can be written as where the mixing parameter π i is interpreted as the probability of a count being from the 'non-susceptible' or 'not-at-risk' latent class, I(T) is an indicator variable taking 1 when T is true, and 0 when T is false; g is a Poisson or negative binomial mass function, and θ i is the vector of parameters in g. When g is the Poisson mass function, θ i is equal to the mean μ i of the distribution, and for a negative binomial probability mass function g, where μ i is the mean of the distribution and α is the dispersion parameter.
In this paper, we will use the following parameterization for the probability mass function of a negative binomial distribution with mean μ and dispersion parameter α.
In zero-inflated count models, the logit and the log link functions are typically specified for the mixing probability π i and the mean of the assumed standard distribution μ i , respectively, as logit(π i ) = w i γ and log(μ i ) = x i ξ , where w i and x i are q × 1 and p × 1 vectors of covariates for the i th subject, and γ = (γ 1 , γ 2 , . . . , γ q ) and ξ = (ξ 1 , ξ 2 , . . . , ξ p ) are regression parameters. For n independent observations, the ZIP likelihood function is The corresponding likelihood function for the ZINB model can be written as Since interpretations of parameters γ and ξ in ZIP and ZINB models apply to the two latent subpopulations, they do not directly describe the overall population mean. Although the overall mean, E(Y i ) = ν i , for i th subject could be estimated from such models by and transformations such as the delta method could be applied to estimate the corresponding variance, it is not always easy to understand the behavior of ν i . In particular, determining the overall effects of an exposure variable on incidence density ratios is challenging especially when the linear predictors from both the mixing proportions and the Poisson mean model contain the exposure variable (Long et al. 2014).

Marginalized ZIP and ZINB models
To estimate the overall effects of covariates on the population mean, marginalized zeroinflated Poisson (Long et al. 2014) and marginalized zero-inflated negative binomial (Preisser et al. 2016) models specify parameters for the probability of being an excess zero (i.e., π i ) and the marginal mean where β = (β 1 , β 2 , . . . , β p ) is a vector of regression parameters for ν i that have the same interpretations for the effects of exposures on the marginal mean as in Poisson and negative binomial regression, whereas the parameters in γ have the same latent class interpretations for zero-inflation as in ZIP and ZINB models. The MZIP and MZINB likelihood functions are obtained by replacing μ i by ν i /(1 − π i ) in the ZIP and ZINB likelihoods, respectively. The next section introduces methods of estimating regression parameters for the overall population mean of heterogeneous counts generated from non-degenerate mixture distributions. With the aim of expanding the pool of two-part marginalized models for counts, special consideration is given to data generating mechanisms based on mixtures of two Poissons and a negative binomial with a Poisson distribution.

Finite mixture models
Finite mixture distributions have been used to model counts obtained from heterogeneous populations (Wang et al. 1996;Schlattmann 2009;Morgan et al. 2014). In the general finite mixture model, the source population is assumed to be a partition of m ≥ 2 latent subpopulations; with a probability π ij , the count random variable Y i corresponding to the i th individual takes a value from the j th subpopulation according to a distribution specific to the subpopulation. An m-component mixture distribution can be defined as (Frühwirth-Schnatter 2005; Wedel and DeSarbo 1995) where the components f 1 , f 2 , . . . , f m are probability mass functions of known distributions, is a vector of mixing probabilities with 0 ≤ π j ≤ 1 and m j=1 π j = 1. While the mixture distribution for zero-inflated counts in equation (1) allows mixing probabilities to vary across individuals, conventional finite mixture models assume a constant probability, π j , corresponding to the j th subpopulation and impose heterogeneity through f j (y i |θ ij ).
The Poisson mixture distribution, where with μ ij being the mean of the j th component distribution, is a popular finite mixture model for count data. In Poisson mixture regression, the mean μ ij , j = 1, . . . , m, is modeled as a function of covariates using the log link. Wang et al. (1996) discuss that such models are identifiable for full rank design matrices. While finite mixture models enable flexible modeling of counts from heterogeneous populations, their parameters have latent class interpretations. Such coefficients do not directly provide inferences regarding the effects of covariates on the overall population mean (Min and Agresti 2005;Roeder et al. 1999). For m = 2, the pmf of a random variable with a Poisson-Poisson mixture distribution can be written as where π is a mixing probability, and f P1 and f P2 are Poisson mass functions with corresponding mean parameters μ 1i and μ 2i . Similarly, a negative binomial-Poisson random variable has a pmf given by In Eq.
(2), f P is a Poisson pmf with mean parameter μ 1i and f NB a negative binomial pmf with mean and dispersion parameters μ 2i and α, respectively. The marginal mean, ν i , of a random variable Y i having either of the two mixture distributions can be written as In traditional finite mixture models, separate regression equations are specified for the mean of each component of the mixture. In general, ν i depends upon a complicated function of the regression coefficients from the components. In the next section, new marginalized models are specified for direct inference regarding the effects of covariates on ν i .

Marginalized finite mixture models
Solving for μ 2i in Eq. (3) gives To estimate a model for ν i , the likelihood functions of Poisson-Poisson and negative binomial-Poisson mixture models can be written as functions of ν i by replacing μ 2i by the linear function of the marginal mean in Eq. (4). Thus, marginalized Poisson-Poisson (MPois-Pois) and negative binomial-Poisson (MNB-Pois) pmfs can be written as in Eqs. (5) and (6), respectively: The MPois-Pois model is defined through Eq. (5) with specification of generalized linear models in (7) for the relationship of covariates to ν i and μ 1i , where x i and z i are vectors of covariates and β and ξ are corresponding vectors of regression coefficients, and −∞ < ρ < ∞ is a constant. Although ξ and ρ are considered nuisance parameters that are not of primary interest, they need to be modeled to facilitate maximum likelihood estimation of β in the marginal mean model. The logarithm of μ 1i is modeled by using a linear predictor that involves covariates as in standard finite mixture Poisson models. The mixing parameter π is modeled as a constant using the logit link to guarantee that its estimate is between 0 and 1. A common model specification is x i = z i such that β and ξ are p × 1 vectors of parameters. However, the covariates that are included in modeling ν i and μ 1i may be different. As the main interest is in β, a reduced set of covariates z i may be considered when it is necessary for computational tractability. Alternatively, a shared-parameter model (Preisser et al. 2016) may be used to incorporate a large number of covariates with relatively few parameters.
Similarly, the likelihood function for the MNB-Pois model can be specified as has the same expression as in Eq. (10). With carefully chosen starting parameter values, marginalized finite mixture models can be fitted using quasi-Newton optimization. Guidance for specifying starting values and use of SAS Proc NLMIXED for fitting the proposed models is presented as Additional file 1 (Benecha et al., 2017) along with further discussion of connections between the models in "Marginalized ZIP and ZINB models" and "Marginalized finite mixture models" sections. Finally, with respect to mixture Eq. (2), solving for μ 1i in (3) gives Inserting this expression for μ 1i in the standard mixture likelihood function based on Eq.
(2) gives a likelihood function for a model that is different from MNB-Pois. The alternative model, which marginalizes over the Poisson part versus MNB-Pois that marginalizes over the NB part, is not considered owing to unresolved computational issues in the applications.

Simulation study
Simulation studies were performed to examine the properties of MPois-Pois and MNB-Pois models for various sample sizes. Counts from these models were generated from the probability mass functions in Eqs. (5) and (6), where π, μ 1i , ν i and α are determined from To estimate Type I error rates of testing H 0 : β 1 = 0 vs H 1 : β 1 = 0, all the simulations were repeated by generating data using β 1 = 0, but keeping all the remaining parameter and covariate values the same as described previously. For each of the six models, the Type I error rates were calculated among converged model fits as the proportion of p-values from two-sided Wald tests that were less than 0.05. For MPois-Pois generated data, estimates of β 1 , β 2 and β 3 had low biases for all models and all sample sizes (Table 1). The MPois-Pois model had Type I error rates for β 1 close to 0.05, while the remaining models tended to over-estimate the error rates ( Table 2). The MPois-Pois model estimated coverages of 95% confidence intervals for β 1 , β 2 and β 3 that were close to the nominal value (Table 3). Whereas NB, MZINB and MNB-Pois models tended to have only slight undercoverage, Poisson and MZIP had coverage ranging from 88 to 92%. Convergence rates for MPois-Pois simulation scenarios ranged from 96.2 to 99.3%, while convergence rates ranged from 88.0 to 90.2% for MNB-Pois, from 75.9 to 98.4% for MZIP, and from 72.0 to 96.6% for the MZINB models. Convergence was 100% for Poisson and NB regression for all sample sizes.
When the data are generated from the MNB-Pois model, the MNB-Pois model had low percent relative median biases for β 1 , β 2 and β 3 , and the biases appear to decrease as sample sizes increase (Table 4). The corresponding estimates from the Poisson, NB and MZINB models also have low biases, but those from MPois-Pois and MZIP models are generally higher. In addition, the performance of the true MNB-Pois model with regard to Type I error rates (for β 1 ) and coverages of 95% confidence intervals (for β 1 , β 2 and β 3 ) is superior to Poisson, MZIP and MPois-Pois models at all sample sizes (Tables 5   and 6, respectively) and has better performance than NB and MZINB for the sample sizes of 500 and 1000. Over 96% of MNB-Pois models converged for sample sizes of 200 or more, with 91% convergence for sample size of 100. Coverage ranged from 97.4 to 100% for MZIP, from 92.0 to 99.4% for the MPois-Pois models, from 85.3 to 91.4% for MZINB models and rates were 100% for Poisson and NB regression. Overall, the simulation results indicate that when the true model is specified, MPois-Pois or MNB-Pois models estimate marginal mean regression parameters with small biases, Type I errors close to the assumed rate and coverages of 95% confidence intervals near 95% for sample sizes of 200 or greater.

A caries incidence trial
The methods described in this article were applied to the Lanarkshire caries incidence trial introduced in "Introduction" section. A total of 4294 children ages 11-12 were randomized to either sodium fluoride (NaF), sodium fluoride plus sodium trimetaphosphate (NaFTMP) or sodium monofluorophosphate (SMFP) toothpaste formulations and dental exams were performed at baseline and after 1, 2 and 3 years. The analysis was based on 3412 children followed up until year 2 and the response variable of interest was the number of new decayed, missing and filled surfaces (DMFS). Let NaF = 1 if the child was given sodium fluoride and 0 otherwise and let NaFTMP = 1 if the child was randomized to the NaFTMP group and 0 otherwise; children in the SMFP group make up the reference treatment category (NaF = NaFTMP = 0). In addition to treatment allocation, baseline caries (bc: 1= high, 0 = low) and baseline calculus (calc:1=yes, 0= no) were considered as explanatory variables. High baseline caries values correspond to at least one decayed, missing or filled anterior tooth or premolar, and a baseline calculus value of '1' refers to the existence of calcified deposits on the teeth formed by the continuous presence of dental plaque (Stephen et al. 1994;Preisser et al. 2014). An important feature of the data is the large number of zero counts in the outcome variable, as 658 (19.28%) of the 3412 children had zero DMFS counts (Fig. 1). Since the percentage of zeros is high, twopart marginalized models may provide less biased estimates and better model fits than one-part generalized linear models. Poisson, NB, MZIP, MZINB, MPois-Pois and MNB-Pois models were applied to compare the efficacy of the toothpaste formulations with respect to the marginal mean DMFS Table 3 Coverages of 95% confidence intervals for estimates of β 1 , β 2 and β 3 from marginalized models fitted to data generated from the MPois-Pois model with 10,000 replications  count. In the two-part models, the four binary covariates defined above were included in each model part. The three best models were NB, MZINB and MNB-Pois, which produced fitted values that best matched the observed distribution of DMFS counts (Fig. 2) and have the lowest AICs (Table 7). On the other hand, Poisson, MZIP and MPois-Pois models, which did not directly account for overdispersion, had poor fits and gave standard errors of regression coefficients for the marginal mean model that were too small. The MNB-Pois model gave the best fit (lowest AIC) while its marginal mean model parameter estimates and standard errors were similar in value to those of the next best fitting model, MZINB.
Based on the MNB-Pois model, the estimated caries incidence density ratio for the children who used the NaF toothpaste formulation versus children with the same baseline status of caries and calculus who used SMFP was exp(−0.060) = 0.942 (95% CI: 0.874, 1.015; Table 8). The estimated caries incidence density ratio for the NaFTMP toothpaste relative to SMFP was exp(−0.033) = 0.968 (95% CI: 0.882, 1.062). Thus, children in the NaF and NaFTMP groups had a decrease in the marginal mean DMFS count by 5.8 and 3.2%, respectively, compared to children with the same baseline characteristics who were assigned to the SMFP group. However, the associations are not statistically significant since the confidence intervals of the two incidence density ratios include 1. Conversely, inappropriate selection of the Poisson, MZIP or MPois-Pois models would have resulted in the potentially misleading conclusion that toothpaste formulation with sodium fluoride significantly reduces two-year incident caries relative to SMFP in this population of children.  Table 6 Coverages of 95% confidence intervals for estimates of β 1 , β 2 and β 3 from marginalized models fitted to data generated from the MNB-Pois model with 10,000 replications

Number of roots produced by shoots of the apple cultivar Trajan
In a horticultural experiment, Marin et al. (1993) recorded the number of roots produced by 270 micro-propagated shoots of the columnar apple cultivar Trajan. During the rooting period, all shoots were maintained under identical conditions, but the shoots themselves were cultured on media containing different concentrations of the cytokinin 6-benzylaminopurine (BAP), i.e., 2.2, 4.4, 8.8, and 17.6 μM, in growth cabinets with an 8 or 16 hour photoperiod. These data have been previously analyzed by Ridout et al.(1998Ridout et al.( , 2001 and Yang et al. (2009). Each of the eight treatment combinations consisted of either 30 or 40 shoots, hence resulting in a total of 270 shoots. Overall, 23.7% of the root counts were zero. However, only two of 140 shoots produced under the 8 hour photoperiod were zeros whereas 62 of 130 shoots produced under the 16 h photoperiod failed to produce roots. Fig. 1 Histogram of two-year DMFS increment for 3412 children ages 11-12 from a dental caries incidence trial conducted in Lanarkshire, Scotland between 1988 and 1992 Among the five models, the MPois-Pois model provided the best fit having the lowest AIC with the MZINB model fitting the second best (Table 9). Based on the MPois-Pois model, under the 8 hour photoperiod, each doubling of BAP concentration (i.e., a natural log(2) change) resulted in a statistically significant 5.7% (= [exp(log(2) × 0.080) − 1] ×100%) increase in the number of roots produced (95% CI: 0.9%, 10.7%). Conversely, under the 16 hour photoperiod, each doubling of BAP concentration resulted in a statistically significant 9.1% (=[ 1 − exp(0.693 × −0.138)] ×100%) decrease in the number of roots produced (95% CI: 0.5%, 17.1%). The 16 hour photoperiod produced about half the number of roots as the 8 hour photoperiod (Table 10).

Discussion and conclusions
In this article, marginal means of counts with unexplained heterogeneity were modeled using two-component finite mixture distributions. Regression parameters were specified in two-part marginalized models for direct estimation of exposure effects on the overall mean count using maximum likelihood methods. Specifically, the proposed MPois-Pois and MNB-Pois mixture models provide alternative model choices to MZIP and MZINB for counts that are overdispersed or have many zeros. It may not always be clear whether a zero-inflated count model or a model based on a finite mixture of two non-degenerate components is more appropriate as Poisson and negative binomial distributions with small means can generate a large amount of zeros. In the case of dental caries, zeroinflated count regression models are sometimes used (Preisser et al. 2012) even though caries researchers question whether any child can be immune to developing caries    (Mwalili et al. 2008). In many fields, use of finite mixtures of non-degenerate components may have a stronger theoretical basis than assuming a mixture of at-risk and not-at-risk latent classes. While there is sometimes interest in latent classes, researchers across many fields of inquiry are frequently interested in quantifying the effects of covariates on the overall mean count while adjusting for unexplained heterogeneity. In such cases, marginal mean regression parameters in MZIP, MZINB, MPois-Pois and MNB-Pois models have straightforward interpretations in describing overall exposure effects on count outcomes. As described in the Additional file 1 (Benecha et al. 2017), the marginalized models proposed in this article belong to a larger class of marginalized mixture models for counts. In particular, when the mixing probability component of the model is fixed either with or without covariates, the MZIP and MZINB models may be viewed as special cases of corresponding MPois-Pois and MNB-Pois models where the Poisson component of the latter two models has a mean of zero, rendering that component degenerate. In this sense, the proposed models expand the family of two-part marginalized regression models by providing alternatives to MZIP and MZINB regression. In the absence of theoretical justification, the merit of each model in the larger class of alternative marginalized models is judged based on goodness of fit considerations. Because our main interest is in modeling marginal means of counts, model parameters that are not of primary interest are allowed to depend on covariates, or none whatsoever, to complete specification of the likelihood function. This provides for model parsimony as needed while allowing all the relevant covariates to be estimated in the marginal mean model.
A simulation study indicated that when the true model is specified, each of the proposed marginalized mixture models provides low biases, Type I errors and confidence interval coverages close to the nominal levels. As shown in additional simulation studies reported in Benecha et al. (2017), model mis-specification can result in undercoverage and inflated Type I errors. Use of empirical covariance estimation as proposed by Long et al. (2014) for MZIP models would likely improve coverage and Type I errors for large samples. In any case, assessment of model goodness-of-fit is highly recommended. Unfortunately, such assessment is often hampered by computational difficulties in fitting complex models such as MNB-Pois when the data at hand do not contain sufficient information to estimate all the model parameters. Reducing the number of covariate parameters often provides an expeditious remedy for this situation. Another advance would be to develop score tests for goodness of fit, as proposed by Ridout et al. (2001) in comparing ZIP and ZINB models, that do not require fitting the model under the alternative hypothesis.
In summary, the proposed marginalized mixture modeling framework provides a wide range of alternatives to directly estimate exposure effects on marginal means of counts