Estimating cost-offsets of new medications: Use of new antipsychotics and mental health costs for schizophrenia

Estimation of the effect of one treatment compared to another in the absence of randomization is a common problem in biostatistics. An increasingly popular approach involves instrumental variables—variables that are predictive of who received a treatment yet not directly predictive of the outcome. When treatment is binary, many estimators have been proposed: method-of-moments estimators using a two-stage least-squares procedure, generalized-method-of-moments estimators using two-stage predictor substitution or two-stage residual inclusion procedures, and likelihood-based latent variable approaches. The critical assumptions to the consistency of two-stage procedures and of the likelihood-based procedures differ. Because neither set of assumptions can be completely tested from the observed data alone, comparing the results from the different approaches is an important sensitivity analysis. We provide a general statistical framework for estimation of the casual effect of a binary treatment on a continuous outcome using simultaneous equations to specify models. A comparison of health care costs for adults with schizophrenia treated with newer atypical antipsychotics and those treated with conventional antipsychotic medications illustrates our methods. Surprisingly large differences in the results among the methods are investigated using a simulation study. Several new findings concerning the performance in terms of precision and robustness of each approach in different situations are obtained. We illustrate that in general supplemental information is needed to determine which analysis, if any, is trustworthy and reaffirm that comparing results from different approaches is a valuable sensitivity analysis. Copyright © 2011 John Wiley & Sons, Ltd.


Introduction
Estimation of the effect of one treatment compared to another in the absence of randomization is a common problem in biostatistics. With more emphasis placed on value in the health care setting illustrated with increased funding for comparative effectiveness research in the United States [1], researchers are increasingly utilizing observational studies to learn about effectiveness of interventions. It is well understood that a simple comparison of average outcomes between treatment arms will potentially confound the treatment effect with various selection effects (associations of predictors with treatment). If the treatment assignment mechanism depends on unmeasured variables affecting the outcome of interest (unmeasured confounders) then regression adjustment and propensity score methods [2] may fail to account for selection effects. In this case, instrumental variables methods may provide a pathway to causality. An instrumental variable (IV) is a random variable that is predictive of the treatment a patient receives but uncorrelated with the outcome conditional on treatment [3].
Despite the existence of the various estimators for IV analysis, there is little research on their comparative operating characteristics, and far less on empirical experience in real world settings. A compelling issue is that traditional instrumental variables methods are invariant to the form of the data (continuous versus binary outcome, continuous versus binary treatment) prompting the question of whether one can do better by tailoring methods to a given situation. Terza et al. [4] proposed a residual inclusion method for cases when the outcome or selection equation is nonlinear (e.g. as in generalized linear models). Another important consideration is that traditional IV methods do not utilize parametric assumptions, perhaps surprisingly so given the recent explosion in the adoption of latent variable models derived using parametric assumptions (e.g. IRT models, Rasch models, latent class models, latent factor models). An exception is the bivariate probit model, which is generated from assumptions on the underlying latent variables [5]. We study the traditional IV methods, the residual inclusion method, and the latent variable approach to IV in the context of evaluating whether newer antipsychotic drugs are less costly than their predecessors. We focus on the estimation of the causal effect of a binary treatment on a continuous outcome.
Our research is motivated by the problem of comparing mental health spending between schizophrenia patients using newer atypical antipsychotic medications and those using older conventional antipsychotic medications in Florida's Medicaid population over the period 1994-2001. The older drugs, which are D2-antagonists such as chlorpromazine and haloperidol, were introduced in the 1950s to alleviate hallucinations and delusions in psychotic patients. Atypical antipsychotics, including clozapine, olanzapine (trade name zyprexa), quetiapine (trade name seroquel), and risperidone (trade name risperidal), were first marketed in the late 1980s and 1990s, and while considerably more expensive than the D2-antagonists, were associated with a different profile of side effects. While the conventional antipsychotics were associated with neurologic side effects, the newer atypicals have been linked to other side effects such as weight gain, diabetes, and lipid problems. During our study observation period, three atypicals were introduced-zyprexa, seroquel, and geodon. Some have claimed that atypical antipsychotics, while more expensive ultimately pay for themselves by leading to reductions in other types of health spending [6]. This claim has come to be known as the offset hypothesis. The offset hypothesis asserts that the greater tolerability of the new antipsychotics will improve adherence to treatment regimens, thereby reducing relapses, resulting in declines in the use of hospital and emergency room services. However, it is disputed whether lower subsequent costs for atypicals are sufficiently large as to offset their greater upfront cost [7].
Study of the offsets hypothesis is complicated by the fact that patients that receive the newer atypical drugs likely differ from those getting the older drugs on a number of systematic factors that may not be fully measured. These include existing medical and mental health comorbidities, severity of illness, and treatment preferences.
We utilize variation in the availability of atypical drugs across the state of Florida that arises because the time-lag between Federal approval and local availability varies by geographic area to form instrumental variables. The instruments are indicators of whether a specific atypical was available in a patient's geographic area of residence defined as one of 11 area Medicaid offices representing geographic, cultural, social, and economic factors in a given year. Using these instruments we illustrate several different estimators that account for unmeasured selection effects to test the offsets hypothesis in the Florida Medicaid population. Our goal involves quantifying the evidence for or against the offsets hypothesis using multiple approaches encompassing different assumptions, thereby enabling one approach to act as a sensitivity analysis for another and yielding real-world experience of the extent to which methodological concerns about the various approaches matter. We also use simulations to evaluate the operating characteristics of the various methods when assumptions hold and when they are violated.
We next define notation, describe assumptions, and introduce models. General methods for estimation are detailed in Section 3 and implemented on the Florida Medicaid data in Section 4. Section 5 describes a simulation study to evaluate the operating characteristics of the methods when assumptions hold and when they are violated. We provide concluding remarks in Section 6.

Notation and definitions
We use simultaneous equations to specify models and the potential outcomes nomenclature of [8] to define treatment effects. Let y i , z i , x i , u i , and c i denote the outcome, the treatment variable, a vector of exogenous covariates, a vector of instrumental variables, and an unmeasured confounding variable for the ith of n subjects.
The instrumental variables u i are assumed to be: (1) associated with z i conditional on x i , (2) uncorrelated with y i conditional on (z i , c i , x i ), and (3) uncorrelated with c i conditional on x i [9,10]. Assumption (2) says that there is no direct effect of u i on y i (the exclusion restriction), while assumption (3) says that u i shares no common causes with y i (i.e. u i is uncorrelated with any unmeasured variables that predict y i ). If assumption (3) is violated then u i may be related to y i through an uncontrolled confounding variable [11], thereby introducing bias. In models where y i is modeled with an explicit error term, y,i , assumptions 2 and 3 reduce to the assumption that u i and y,i (which includes c i ) are uncorrelated conditional on (z i , x i ). Although x i and c i might predict both y i and z i and so structurally are equivalent, c i is problematic because it is unobserved. Controlling for x i generally makes assumptions (2) and (3) above more believable by controlling for variation in unmeasured confounders that is correlated with x i [12].
The annual mental health spending for patient i, denoted cost i , is the sum of all payments made for services with mental health diagnoses, mental health procedures (e.g. psychotherapy), or psychotropic drugs that are primarily used for mental health treatment such as antidepressants and mood stabilizers. The distribution of cost i is right skewed. As discussed in Section 4.1, Box-Cox transformations under various models indicated that the log-transformation traditionally used for spending data to account for right-skew would be reasonable. Accordingly, y i = log(cost i ). Because all patients in the data set received services from a health care provider, the 29 observations with cost i = 0 were considered impossible and excluded from the analysis.
The treatment z i is a binary-valued indicator of whether a patient filled an atypical (z i = 1) or a conventional antipsychotic (z i = 0) prescription in a given year. If a patient filled both we assigned them to the drug that accounted for the greatest share of their health costs for that year. Thus, z i is defined in the same year as cost i . In a sensitivity analysis we restricted the data to new users (i.e. those who initiated treatment with an antipsychotic during our study period) and the first year of data on each individual, thereby obtaining the subset of subjects for whom we could reasonably assume made their initial antipsychotic choice during the study period. This allowed us to check whether it made sense to combine new users and longer term users, and those staying on a single drug from those who switched drugs, in a single analysis. The results were minimally affected suggesting that a pooled analysis that controlled for year was justified.
The predictors in x i are race/ethnicity, female, age, receipt of Supplementary Security Income (SSI) benefits, history of substance abuse, area of residence, and year. Variables represented by c i could include health status of the patient, access to skilled physicians, and physician prescribing habits. The vector of instrumental variables u i consists of the products of binary indicators of whether zyprexa, seroquel, and geodon were FDA-approved at the start of each year and the 10 area-of-residence indicators; the most populous area, Miami, was the excluded category. The variables (x i , c i , u i ) are all defined in the same year as cost i and z i .
The rationale for the above choice for u i is that the availability of antipsychotics depends on physician learning which in turn depends on local area attitudes towards innovation, information dissemination, and other conditions that varied substantially across Florida. Thus, drug approval and area of residence are related to antipsychotic use at a given time. In order for u i to be an appropriate instrument, it cannot be directly related to health care costs or to unmeasured confounders affecting health care costs. This would not be the case if patients with higher costs lived in areas that were faster adopters or if attitudes towards innovation, information dissemination, and other conditions directly affect costs. Thus, the inclusion of area of residence indicator variables in x i helps make drug approval interacted with region a valid IV.

Assumed underlying model
The outcome y i depends on treatment z i and the exogenous predictors x i through the linear regression equation where y,i has mean 0 and variance 2 y . The validity of this model relies on the existence of linear relationships, homogeneous variances, independent observations, and orthogonality between (z i , x T i ) T and y,i . In the Medicaid data, z i and y,i are likely to be correlated as (e.g.) detailed measures of the severity of a patient's health condition were not available, and these likely affect a patient's propensity to fill an atypical prescription and their net health spending.
A second equation describes the relationship between z i and (u i , x i ) where z i = I (z * i > 0). In terms of the Medicaid data, z * i represents the patient's propensity to be prescribed an atypical antipsychotic. By assumption, the predictors on the right-hand side (rhs) of (2), including the IV u i , are independent of y,i .
The regression parameter 1 , the difference in the outcomes when all other factors (including those influencing y,i ) are equal, is of primary interest. When z i is exogenous (uncorrelated with y,i ),

Parametric model: structural and distributional assumptions
In parametric analyses we follow the construction of the bivariate probit model. This model assumes that the error term i = ( y,i , z,i ) is an additive function of c i , an unmeasured confounder that linearly affects (y i , z * i ), and ( y,i , z,i ), a random disturbance. That is, i = ( 3 c i + y,i , 3 c i + z,i ), where c i , y,i , and z,i are mutually independent random variables each with mean 0 and variance 2 c , 2 y , and 2 z , respectively. Hence, i has mean 0 and covariance Because we can multiply 2 c by k, and divide 3 and 3 by k 1/2 without changing the model, for model identification we set 2 c = 1. Derivation of the bivariate probit is completed by assuming that c i , y,i , and z,i are normally distributed, implying that i is bivariate normal. Thus, the model is identified through the first and second moments of the distribution of i . Because z i is binary we can only identify the standardized effects h 1 /( 2 3 + 2 z ) 1/2 and h 2 /( 2 3 + 2 z ) 1/2 , leading to the constraint 2 3 + 2 z = 1. With three parameters and two degrees of freedom in cov( i ) we set 3 = 1 (equivalently, 2 z = 0) to identify the model, defining 3 and 3 /( 2 3 + 2 y ) 1/2 as the covariance and correlation between y,i and z,i , respectively. In the normal case unobserved selection is thus quantified by = 3 / y , where 2 y = 2 3 + 2 y . Clearly, ∈ [−1, 1].

Ordinary least squares (OLS)
Linear regression fits the model y i = 1 z i +b T 2 x i + y,i where var( y,i ) = 2 y . The least-squares estimator isˆ = (X T X) −1 X T y, where y = (y 1 , . . ., y n ) T and X is the n× p matrix with ith row (z i , x T i ). When y,i is mean independent of all predictors and has homoscedastic variance (as assumed here), the estimator is minimum variance unbiased among the class of linear estimators (Gauss Markov theorem). However, if any predictor is correlated with y,i , OLS will be inconsistent [13, Chapter 5].

Two-stage least squares (2SLS)
The classic IV estimator of b in (1) is the minimum variance estimator among those satisfying the constraint that u i and i are orthogonal (see Appendix A for construction). This method-of-moments estimator is equivalent to the two-stage least-squares (2SLS) procedure, in which we first fit to obtainẑ i , and then fit to estimate b. In the special case where u i is univariate-binary and there are no other covariates, the 2SLS procedure given by (3) and (4) is equivalent to the Wald estimator [14]. The standard error ofb is The orthogonality condition enforced in (3) holds factors affecting i constant, allowing b to be estimated for those subjects for whom u i influences z i , the 'population on the margin'. In the offsets analysis, the population on the margin is patients whose uptake of an atypical antipsychotic medication was influenced by the availability of zyprexa, seroquel, or geodon in the city where they lived. Thuŝ 1 is a 'structural shift' of using an atypical. A notable feature of 2SLS is that no presumption is made about the type (e.g. binary, ordinal, interval) of variables that y i and z i are or about the distribution of i . The binary nature of z i led us to consider whether more efficient results could be obtained by accounting for the form of z i .

Alternative two-stage approaches
Although the 2SLS estimator is consistent when the IV assumptions hold [16,17], inferences may be inefficient because the binary form of z i is not respected. As an alternative to 2SLS, we can replace (3) with where (·) is the cdf of the standard normal distribution. The implied (nonlinear) 2SLS procedure fits (6) using nonlinear least squares (NLS) or a generalized linear model, setsẑ i = (u T iĥ 1 +x T iĥ 2 ), and then evaluatesb = (X TX ) −1X T y. The interpretation of 1 is unchanged by the nonlinear first-stage equation because z i is the main effect in the outcome equation, not z * i . Following Terza et al. [4] we term this approach 'two-stage predictor substitution (2SPS).' Despite the fact that orthogonality ofẑ and is no longer enforced, 2SPS has been said to yield consistent estimates of 1 when the outcome equation is linear and (h 1 , h 2 ) is estimated consistently [5]. However, in the case when the outcome equation is nonlinear, 2SPS has been shown to perform poorly even when the first-stage equation is linear [4].
In the case of a linear model for z i , point estimates of 1 under the model are identical to those obtained from (4). However, when z i depends on a nonlinear model such as (6), the effect of z i above and beyond the effect ofẑ i − z i (the 'endogenous (bad) variation' in z i ) on y i does not equal the effect ofẑ i (the 'exogenous (good) variation' in z i ) on y i . The two-stage residual inclusion (2SRI) procedure of [4], whose origins date to a test for endogeneity in [18], is the estimation of (6) followed by (7). It has been shown that 2SRI yields consistent estimates for linear and nonlinear models [13, Chapter 12].

Maximum likelihood
Assuming normality and using the parameterization of Section 2.2, it follows that In this model quantifies the extent to which unobserved factors affecting z * i are correlated with those affecting y i [19]. A positive value of indicates atypical-favorable selection because unobserved factors that make individuals more likely to take an atypical are also more likely to have higher health costs; i.e. ignoring selection would lead to over estimation of 1 .
The joint marginal density for (y i , z i ) is obtained by integrating over z * i in the model defined by (1), (2), and (8) (see Appendix B for details). The product of these densities is the observed data likelihood function for (b, h, , 2 y ), given by for = ±1. The presence of y,i and thus b in (9) and (10) precludes separate maximization of the y i and z i | y i components of the likelihood function. The two components need to be fit simultaneously in order forˆ 1 to have a structural shift interpretation. Only when = 0 does it suffice to fit separate linear and probit regression models.

Bayesian inference
Bayesian analysis provides a more flexible approach to inference than maximum likelihood by incorporating prior distributions containing information about the parameters. In the absence of prior information, a non-informative prior is often reasonable.
Because it is the mechanism governing selection, plays a crucial role in the offsets analysis. We are interested in the sensitivity of the results to the prior for and the extent to which it characterizes the other approaches. Prior distributions for that cover a wide range of levels of precision will be used.
Conditional on , the other model parameters are well identified by the data. Therefore, to investigate sensitivity to the prior on , we specify priors diffuse in ( 1 , b 2 , h 1 , h 2 , 2 y ) and with varying levels of informativeness about . Specifically, we assume where ( +1)/2 ∼ Beta( 1 , 2 ); p( ) has the shape of a Beta density but its support is extended from [0, 1] to [−1, 1]. Note that = −2 y (1− 2 ) −1/2 maps the correlation coefficient to (−∞, ∞). In the special case where 1

the density of a t-distribution
with two degrees of freedom, mean 0, and scale parameter (2 2 y ) −1/2 ; a thick-tailed distribution. The values considered for m = ( 1 , 2 ) are such that E[ ] = 0 and var( ) = 2 . This requires that 1 = 2 = ( −2 −1)/2, which places a supremum of 1 on 2 (a bound that is only obtained in the limiting case where p( ) has point masses of 1/2 at ±1). The larger 1 = 2 the smaller 2 .
Although 1 is the same conditional effect as for 2SLS and maximum likelihood, Bayesian interpretations are conditioned on data obs = {y i ,

Testing the exclusion restriction
While the exclusion restriction is a necessary condition for parameter identifiability in the two-stage approaches, the specification of a parametric distribution for i makes the exclusion restriction nonessential for identifiability of likelihood-based procedures. Therefore, the exclusion restriction may be tested by fitting the model where i is specified as in (8). Equation (11) is equivalent to the selection model in [20] and is a special case of the structural shift model in [21]. The model is fully identified when i is bivariate normal if (x i , u i ) contains at least one non-constant predictor [22]. A small non-significant value of b 3 supports the exclusion restriction. However, we emphasize that this test is only valid if i truly is bivariate normal, an assumption that itself cannot be fully evaluated using the observed data.

Computation
The two-stage procedures can be implemented using standard methods for fitting linear or generalized linear models. However, the computation of standard errors is complicated by the need to simultaneously account for the estimation error from both equations. Equation (5) may be used for 2SLS while asymptotic approximations, such as those outlined in [13,Chapter 12], are needed to obtain closed-form expressions for 2SPS and 2SRI.
Because the likelihood function depends on unobserved latent variables, specialized model-fitting routines are needed for maximum likelihood and Bayesian inferences. MLEs are obtained by directly maximizing the observed data log-likelihood function in (10) using a nonlinear optimization package in R. Standard errors are computed using the delta method to obtain closed-form expressions approximating the covariance matrix of the parameters or functions thereof that is then evaluated at the MLEs of the parameters (see Appendix C). WinBUGS [23] is used for Bayesian inference with inferences evaluated as Monte Carlo averages over draws from the posterior distribution of the model parameters. Convergence is monitored using trace plots and the diagnostics available in CODA [24].

Cost-offsets: atypical and conventional antipsychotic use in adults with schizophrenia in Florida
The dependent variable is the log-transformed aggregate spending for all services with mental health diagnoses, mental health procedures (e.g. psychotherapy), or psychotropic drugs that are primarily used for mental health treatment such as antidepressants and mood stabilizers for a patient in a given year. The objective is to infer 1 , the difference in the annual log-spending of treatment of using an atypical versus a conventional antipsychotic for individuals suffering from schizophrenia in Florida's Medicaid population during 1994-2001 when all other factors, including unmeasured factors influencing i , are fixed.
To facilitate interpretation we transform estimates from log-spending to spending (units of $). At a given value of (z i , x T i ) T , mean spending equals the exponential of mean log-spending multiplied by a retransformation factor. The retransformation factor may in general be estimated using the smearing estimate [25], given byŜ = n −1 n i=1 exp(ˆ y,i ), whereˆ y,i is the estimated residual in (1). Therefore, in $ the savings attributed to using atypicals over conventionals is given by Under likelihood-based approaches there are alternatives to the smearing estimate. For example, the MLE of the retransformation factor is given byŜ = exp(ˆ 2 y /2). In Bayesian implementations, the retransformation factor from log-normal to normal, S = exp( 2 y /2), is incorporated in the posterior means of any quantities on the scale of the retransformed outcome and so is automatically accounted for when quantities of interest are evaluated as Monte Carlo averages over draws from the posterior distribution of the model parameters.
Another quantity of interest is the average treatment effect (ATE). The ATE evaluates the combined effect of selection and treatment on spending by evaluating the expectation with respect to f (y i (z)), the marginal distribution of y i (z) after integrating over z * i ∈ (R : z i = z), whereas 1 is the pure effect of the latter. The average treatment effect over individuals with covariates values {(x i , u i )} n i=1 is given by  30 1971--1988 or in terms of dollars (14) where (·) and (·) denote the pdf and cdf of the standard normal distribution, respectively. Equation (13) illustrates that ATE = 1 when = 0 (no selection). A feature of parametric methods is that they are able to delineate between population effects (e.g. 1 ) and (average) treatment effects specific to the subjects in the sample. The expressions for the ATE's in (13) and (14) are informative because they show their relationship to 1 . Such expressions can only be determined through the specification of a full parametric model for the data.
The local average treatment effect (LATE), the effect of treatment on those whose treatment status can be changed by u i (the marginal population), equals 1 when there is a single binary instrument, no covariates, the exclusion restriction holds, and the effect of u i is monotone across i [26, p. 155]. However, under our model different LATEs correspond to the values of u i defining the marginal population. Specified mathematically, the LATE is given by is the potential treatment of subject i when u i = u (k) (k = 0, 1). (See [27,28] for summary measures of heterogeneous LATE.) However, 1 (the 2SLS estimand) can be thought of as a weighted average of a LATE for the marginal subpopulations identified (one at a time) by each component of u i . Therefore, despite not corresponding to a single LATE, our primary interest is in 1 and so we do not report LATE for any particular subpopulations.
To gauge the sensitivity of results computed under the Bayesian model with respect to the prior for , we fit this model with 2 between 0.96 (prior has a U-shape) and 10 −4 /3 (prior is a spike at 0).

Descriptive results
The Florida Medicaid data set comprises 26 759 adults diagnosed with schizophrenia at some point during 1994-2001 yielding n = 78 349 person-year observations (Table I) of health care spending. The vector x i has 18 elements (intercept, black, other non-white (largely Latino), female, age, receipt of supplemental security income (SSI), substance abuse history, year, and 10 area dummies), whereas u i contains 33 elements (the availability of zyprexa, seroquel, and geodon and their interactions with area).
A comparison of means based on Table I suggests that atypical antipsychotics are much more expensive than conventional drugs. However, this analysis does not account for non-random selection of patients into treatment; for example, patients with more severe conditions may have higher propensity to receive an atypical and also be more costly. It is clear from Table I that spending is higher for males, whites over blacks, blacks over other non-whites, substance abusers, and those receiving SSI. For all predictors other than male (versus female), the magnitude of the difference is greater within atypical antipsychotics than conventionals. However, the magnitude of the correlations between year and spending, and between age and spending, was greater for conventional antipsychotics.
Because the distribution of cost i is naturally skewed to the right, we sought a transformation that induced normality. The maximum likelihood estimate of the Box-Cox transformation is 0.140 for unadjusted cost i , 0.153 under OLS, and approximately −0.05 under the fully parametric simultaneous equations model. Because the log-transformation is a compromise between these alternatives, corresponding to a transformation parameter of 0, we transformed the data using y i = log(cost i ). Using the alternative Box-Cox transformations had minimal effect on the results of the analysis.

Treatment effects
We analyzed the data using each of the procedures discussed in Section 3: ordinary least squares (OLS), two-stage least squares (2SLS), two-stage predictor substitution (2SPS), two-stage residual inclusion (2SRI), maximum likelihood, and Bayesian analysis with various priors.
The large value of estimated by the likelihood-based methods (MLE and Bayesian models) was verified by plotting the profile likelihood function of and confirming the existence of a unique global optimum atˆ = 0.721, far from the edge of the parameter space (Figure 1). Under such a strong unmeasured selection effect, 1 and ATE are destined to have very different values. The Staiger-Stock test F-statistic of 9.86 in the 2SLS analysis suggests that the instrument only accounts for a small fraction of the selection effect and would be considered borderline-weak compared to the conventional standard of 10 [29]. Ordinary least squares (OLS) suggests that the newer atypical antipsychotics result in more spending (Table II: 1 estimated to be near 1), the two-stage procedures give inconclusive results ( 1 estimated to be near 0), and the likelihood-based methods suggest that the newer atypicals lead to lower levels of spending ( 1 estimated to be near −0.7). In terms of annual patient dollars, the cost of atypicals less the cost of conventionals was estimated to be $9948, range from −$263.3 to $2262, and range from −$10 010 to −$9065 under OLS, the two-stage procedures, and the likelihood-based procedures, respectively. Because the left-skewness of the data inflates the treatment effect upon retransformation, the predicted mean OLS estimate is substantially larger than the raw mean difference ( Table I). Inflation of the mean due to retransformation combined with the highly positive selection effect leads to the large saving found under the likelihood-based analyses.
The OLS estimate of 1 and the likelihood-based estimate of the ATE are fairly similar, illustrating that the former is actually estimating the ATE. The SE of the MLE of the ATE is slightly smaller than the SE of the OLS estimate, consistent with the result in [30] that regression parameters of terms unique to one regression equation are estimated more efficiently in a bivariate model than with the corresponding univariate model.  0.008 0.008 The ATE, ATE $ , and are only estimated for the likelihood-based procedures as estimation relies on the specification of a probability distribution for the observations. Despite estimating the same quantity, differences between the two-stage and likelihood-based estimates of 1 are substantial. Because we thoroughly check the distribution of the observed variables graphically and using several diagnostics, and also explored various variable transformations, we believe that the discrepancy in these estimates is due to things we do not observed that cannot be tested fully empirically: violations of the exclusion restriction or departures of the distribution of the error terms from the bivariate normal distribution of the data. To gain further insight into possible causes of the discrepancy we conducted a simulation study (Section 5).
The 95 per cent confidence interval for 1 under 2SLS only just overlapsˆ 1 under 2SPS and 2SRI and conversely the 95 per cent confidence intervals of 1 under 2SPS and 2SRI only just encompassˆ 1 under 2SLS, illustrating that the results are sensitive to small differences in the method of estimation. The MLEs had SEs about one-fifth and one-third those for 2SLS and its nonlinear variants (2SPS and 2SRI), respectively, thus highlighting the ability of the likelihood procedures to yield more precise inferences.
Comparing the scale of the vertical axes in Figure 2, the Bayesian point and interval estimates of 1 and $ 1 were substantially more sensitive to p( ) than Bayesian estimators of ATE and ATE $ . From log 10 ( 2 ) = −3 (i.e. 2 ≈ 0.001) to log 10 ( 2 ) = −2 (i.e. 2   data obs ]) move from being close to the OLS estimate to close to the MLE. However, as indicated in the plot of the selection effect (bottom-right), a very precise prior on 2 is required to obtain Bayesian estimates that correspond to those of the two-stage approaches. The 33 elements of u i in (11), the model for testing the exclusion restriction, had standardized effects (estimate divided by standard error or posterior standard deviation) ranging from 1.054 to 1.954, not significant at the 0.05 level. The F-statistic for the test that b 3 = 0 equals 237, well above the critical value at the 0.05-level of 47.4. Thus, there is strong evidence under the assumed bivariate normal model that the exclusion restriction is violated.

Simulation study
We conducted a simulation study to evaluate the sensitivity of the estimators of 1   In the first group of simulations, i was drawn from a bivariate normal distribution, the case where the likelihood function is correctly specified. In subsequent simulations, observations were randomly drawn from a bivariate t-distribution with seven degrees of freedom or were correlated draws from gamma distributions, allowing assessment of the robustness of the approaches to thicker-tailed and skewed distributions. Finally, we simulated data in violation of the exclusion restriction by setting sim 3 = sim 1 to evaluate sensitivity with respect to the exclusion restriction. We also evaluated the normal likelihood-based test of the exclusion restriction for sim 3 ranging from 0 to sim 1 . The bias and root mean square error (RMSE) for each estimator and scenario are displayed in Figures  3 and 4, respectively, while Table III contains operating characteristics of the likelihood-based test of the exclusion restriction. However, in discussing these results we use a method-by-method approach; this was most helpful in describing the scenarios under which each approach works best and when it absolutely should not be used. To supplement the results, Figure 5 depicts an algorithm for determining which approach is best to use in practice.

Results for OLS
As expected the OLS estimates became increasingly biased the further was from 0 ( Figure 3) with RMSE is essentially equal to bias in all cases where = 0. Any variations in its performance across distributions or under violation of the exclusion restriction (which is irrelevant as far as OLS is concerned) were drowned out by the impact of an unmeasured confounder. Clearly, if there are legitimate concerns about unmeasured confounders then OLS is not appropriate. Figure 3 shows that 2SLS performs well across all conditions other than violation of the exclusion restriction, in which case 2SLS is even more biased than OLS and thus should not be used. However, if the IV is supported by theoretical arguments or other insights indicating that the exclusion restriction holds use of 2SLS is appropriate (as shown in Figure 5).

Results for 2SLS
It is clear from Figure 4, where the range of the vertical axis for 2SLS is much greater than that of the other methods, that the standard errors of 2SLS estimates exceed those of all other methods. Thus use of 2SLS typically lowers the statistical power of the analysis compared to other approaches.  Figure 5. Algorithm for choosing the best method in practice. The decision process begins with the left-hand rectangle and at each subsequent step selects a method (ovals) or moves to the next decision (boxes). Because our results suggest that 2SPS is dominated by either 2SRI or 2SLS, it does not appear. Data transformations and other analyses that inform model specification should be performed prior invoking this algorithm.

Results for 2SPS and 2SRI
As shown in Figures 3 and 4 these alternative moment-based IV procedures compared favorably to 2SLS when the underlying distribution is symmetric or the exclusion restriction is violated (although they still perform consistently poorly in this scenario) but not so favorably when the underlying distribution is skewed. In general, 2SRI is more precise (smaller variance and RMSE) than 2SPS which is more precise than 2SLS while the reverse order holds for sensitivity to skewness (i.e. 2SRI is worst performed). These results, not previously reported in the literature, may be a consequence of the nonlinearity of Equation (7) introducing bias when the distribution of i is skewed. Because the theoretical results reported in [4] imply that 2SRI and 2SPS are consistent (irrespective of the underlying distribution), bias should approach 0 as n increases. However, additional simulations at different values of n suggested that, at best, the convergence is very slow.
Based on the above, 2SRI may be the best method to use when the evidence supporting the validity of the IV is strong (as for 2SLS) but n is such that the study is insufficiently powered under 2SLS. However, if there is evidence that i has a skewed distribution, particularly y,i (see Section 5.5), then 2SLS would be the safer (more robust) choice.

Results for likelihood-based estimators
The MLE and the Bayesian estimators of 1 yield better results than the two-stage estimators when the underlying distribution is normal and are more robust to violations of the exclusion restriction. However, they are more sensitive to departures of the underlying distribution from normality. An interesting finding is that likelihood-based estimators of 1 are relatively more robust when non-normality is in the form of thicker tails than skewness while the reverse is true for likelihood-based estimators of the ATE.
The robustness of the MLE of the ATE is due to the presence of . The two-stage methods do not account for and so, with no way to compensate for 3 = 0, yield biased results, while the MLE of 1 is only partially affected by violations of the exclusion restriction due to the fact that the error correlation partially absorbs 3 u sim i = 0. As indicated in Figure 5, likelihood-based methods are recommended when unmeasured confounders are thought to exist but it is questionable whether the IV is valid or no IV is available. To make likelihood-based analyses as believable as possible, transformations of y i that induce normality inˆ y,i should be considered. Table III shows the results of including u sim i as a predictor of y i when the true value of its coefficient, 3 , varies from 0 to 0.144 (i.e. up to the magnitude of the effect of the IV). When the underlying distribution is bivariate normal, 3 is estimated with high precision and no bias. Furthermore, the power of the test H 0 : 3 = 0 against the alternative H A : 3 = 0 increases from 0.05 when the true value is 0 (in this case power = type I error) to over 0.95 at 0.072. Therefore, if normality holds the likelihood-based methods provide a valid test of the exclusion restriction.

Test of exclusion restriction.
Inferences about 3 are almost as reliable if the underlying distribution has t 7 as opposed to normal marginals, slightly over-covering when 3 > 0. However, if the underlying distribution is skewed (as for a Gamma distribution), then estimates of 3 are biased and the type I error of the test of the exclusion restriction is excessive. Therefore, the bivariate normal test of the exclusion restriction cannot be relied upon if the true outcome distribution is asymmetric.

Other results
We also evaluated the approaches under various other scenarios for which we do not present results. In one series of simulations one of y,i and z,i was normal and the other was non-normal (t or gamma) distributed. Results were more sensitivity to non-normality of y,i than z,i . In fact, as long as y,i was symmetric, 2SRI appeared to be robust to the distribution of z,i and the likelihood-based procedures were only biased by small amounts.
When h 1 increases by a factor of 2, the MSE of the two-stage estimators decreases by a factor of 4. Although the MLE becomes more precise as h 1 increases, the trend is nowhere as dramatic as for the two-stage approaches. This reflects the fact that the likelihood-based procedures are identified from the distribution of the data and so the involvement of u sim i improves the stability and precision of the estimates. Multiplying 1 by 2 does not alter the precision of the estimators revealing that the magnitude of 1 is not tied to the precision with which it is estimated.
The performance of interval estimators for 1 was highly correlated with the bias and variance of the corresponding point estimators. If the point estimator was unbiased then the coverage of the interval estimator was close to 0.95.

Discussion
We used data from a large state database to investigate whether newer atypical antipsychotics lowered net costs of health care relative to conventional antipsychotics. Because treatment is non-randomly assigned, instrumental variables methods were used to separate the true effect of treatment on log-cost from selection effects. To aid interpretation, we converted the total payments made under each treatment from log-spending to spending (in $). We used several approaches for the analysis with the rationale that the methods would validate one another if similar results were obtained. The methods yielded results that were surprisingly disparate; atypicals were estimated to save about $10 000 under likelihood-based procedures (the MLE and Bayesian models), in contrast to no saving or increased spending, of about $2000, under the two-stage procedures. These results bring the assumptions underpinning the methods into question.
To gain a sense of which results to believe, we used simulations that studied the properties of the two-stage procedures and the MLE. We observed the following: (1) OLS only works in the absence of confounding, (2) 2SLS works well in all scenarios other than when the exclusion restriction is violated, in which case it fails completely, (3) 2SPS performs better than 2SLS unless the underlying distribution is skewed, (4) 2SRI performs better than 2SPS unless the underlying distribution is skewed, and (5) likelihood-based estimators perform better when the underlying distribution is normal and when the exclusion restriction is violated. While results (1) and (2) are well known, (3)-(5) are new findings. The poor performance of the alternative two-stage procedures when the distribution of the data is asymmetric illustrates that the complete robustness of 2SLS to the distribution of the data is compromised by seeking to improve efficiency through the use of a nonlinear first-stage equation.
Likelihood-based estimators of 1 and 3 (the direct effect of the candidate IVs on the outcome) are surprisingly robust to violations of normality as long as the true distribution is symmetric, but fail when the true distribution is skewed. Thus, while likelihood-based models can be robust and, therefore, can be used to test the validity of 2SLS when the true distribution is symmetric, their findings are quickly compromised if the true distribution is skewed.
With the above in mind, where does the evidence for offsets of atypical antipsychotics point? Based on the analysis of the Florida Medicaid data, there is evidence that the assumption of normal residuals, although a reasonable approximation, does not hold exactly. Therefore, because 2SLS is robust to the distribution of the residuals and the exclusion restriction can be defended heuristically (i.e. from an economic standpoint), 2SLS might be most trustworthy. However, the findings in this paper reveal that even a small departure from the exclusion restriction makes 2SLS and the alternative two-stage procedures likely to produce results that are substantially biased. Given the highly conflicting results across the approaches we feel there is insufficient evidence to conclude the offset of atypical antipsychotics is positive or negative. This is generally consistent with the research from clinical trials such as the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) study [31].
The fact that different statistical procedures result in such different results is alarming. The finding that the alternative two-stage procedures are sensitive to the underlying distribution of the data is important for researchers using these methods. Similarly, the results on the sensitivity of parametric models derived from latent variable constructs, as for the likelihood-based analysis here, is an important lesson to statisticians and other users of these approaches.
The methodology we have developed is generally applicable to any observational study in which an IV is available for the treatment and outcome of interest. In light of the sensitivity results reported here, the availability of a valid IV is critical. However, finding IVs in practice can seem an art form to one not familiar with IV analysis as the arguments supporting the exclusion restriction are theoretically rather than empirically driven. Therefore, we recommend a subject matter expert with an acute sense of the outcome, treatment and unobserved confounding variables is integrally involved in the determination of candidate IVs.
A limitation of our empirical analysis is that we did not account for repeated measurement of subjects that appeared in the data set in multiple years. We subsequently fit a hierarchical Bayesian model that included a random intercept for subject. The posterior mean of 1 was −0.492 (sd = 0.0197), suggesting that single-level analyses might over-estimate the magnitude of 1 . The withinsubject variance had a posterior mean of 1.036 (0.0125) while the between-subject variance had a posterior mean of 1.118 (0.0141), suggesting there is substantial unexplained variance between subjects.
Another direction in which the likelihood-based estimators could be extended is by assuming a more flexible family of distributions for i (e.g. bivariate t-distribution) and constructing estimators under those less restrictive assumptions. However, because a more flexible model is likely to be less well identified by the data, it is not clear that it would yield an estimator that is more robust to the distribution of the data. Equation (A1) can be solved exactly if the number of IVs equals the number of endogenous predictors. However, if dim(u i ) > dim(z i ) it is generally not possible to satisfy all of the orthogonality conditions simultaneously; z i is said to be overidentified. An 'optimal' solution is obtained by finding the parameter values that minimize the quadratic form where W is a positive-definite weighting matrix quantifying the relative importance of the orthogonality conditions across the instruments. Setting W proportional to cov(X y ) = 2 y X T X, where y = ( y,1 , . . ., y,n ) T , minimizes the variance of Q(b). The IV estimator is then the value of b which minimizes T (b) T X T XT (b) subject to (A1), yieldinĝ whereX = U(U T U) −1 U T X is the predicted value of X in a regression of X on U and U is the matrix with ith row (u T i , x T i ) [15, p. 530]. Because x i is contained in both X and U it projects onto itself and sox i = x i . Geometrically,ẑ is the projection of z = (z 1 , . . ., z n ) T onto U;ẑ is orthogonal to y and thus contains only the component of variation in z that can be used to estimate b.

Appendix B: Derivation of the likelihood function
Make y,i and z,i the subject of Equations (1) and (2) and factor the joint density as f ( y,i , z,i ) = f ( y,i ) f ( z,i | y,i ). Then integrate with respect to z,i over those values of z,i for which z i = 1, and analogously for those values for which z i = 0, to obtain f (z i | y,i ). Finally, substitute for y,i to obtain the joint density of (y i , z i ).  All other elements of the Hessian matrix, the matrix of second derivatives of the log-likelihood function with respect to (b, h, , 2 y ), equal 0. The covariance matrix of the estimated parameters is estimated by the negative-inverse-Hessian matrix. The approximate variance of estimators of functions of these parameters, such as and the ATE, are then derived using the delta method.