Challenges in Estimating the Impact of Vaccination with Sparse Data

Supplemental Digital Content is available in the text.

where Observed_cases i,t was the actual number of hospitalizations for disease i reported in time period t at the national-level. Cases were randomly sampled at rates of 10%, 1%, and 0.25%, to simulate the population sizes of different regions and states in Brazil. For example, the population sizes of the five regions were between 5% and 48% of the entire national population of infants and seniors. At the state level, the smallest state represented only 0.3% (State 14 in the North region; n=10,500) for children under 12 months of age and 0.2% (State 12 in the North region; n=5,900) of people 80+ years of age across Brazil.
We repeated this sampling process 100 times and created 100 down-sampled datasets at each of these rates. We then fit the models to each of these down-sampled datasets and quantified the impact of the vaccine. To evaluate the effect of sample size on the performance of the models, the estimated impact of the vaccine from the down-sampled data were compared to those from the national-level data, assuming the national estimate was the "ground truth."

eAppendix 3: Additional information on the simulated time series data
The number of outcome ( 0 ) and four control diseases ( 0 , 0 , 0 , and 0 ) are randomly sampled from the following means to generate the time series: where 0 represents the number of pneumonia hospitalizations at time t; x q0 represents the count of control disease j at time t; is a function that maps a time point to the corresponding calendar month; q represents the month j regression coefficient; . represents the indicator function; p is the total number of control diseases in the analysis; q ( q ) is the regression coefficient for control disease j which is given a spike-and-slab prior distribution (depending on q ) in order to allow for data-driven variable selection 5 ; q are binary random variables that indicate if control disease j is included in the regression model ( q = 1) or not ( q = 0); and we used spike and slab priors to select variables in the candidate models with equal prior probability of inclusion for each variable (π = 0.5). We collected 9,000 posterior samples after a burn-in period of 1,000 iterations for each fitted model. The convergence of the Markov chain Monte Carlo sampling algorithm was evaluated using the Geweke diagnostic and Gelman Rubin diagnostic. [6][7][8] There were no obvious signs of non-convergence as the p-values from the Geweke diagnostic tests were all larger than 0.05 and the potential scale reduction factors were <1.1 across all parameters. The effective sample size was also checked to confirm that we had enough posterior samples to make valid inference. 9,10 eAppendix 5: Additional information on the STL+PCA model The seasonal-trend decomposition procedure based on locally weighted scatterplot smoothing (STL method) decomposes time series into three components: trend, seasonality, and the remaining variation in data (eFigure 3). 11 The observed number of control disease cases in month t, denoted by q0 , can be written as follows: ln( q0 + 0.50) = q0 + q0 + q0 where q0 , q0 , and q0 are the trend component, annual seasonal component, and remainder component, respectively. For the STL trend extraction, the span of the locally weighted scatterplot smoothing (LOESS) window can be changed to control the smoothness of the extracted trends. The larger the LOESS window is, the smoother the extracted trends are (eFigure 4). We set it to be 5, 25, and 59 months, and selected the optimal size using the deviance information criterion (DIC). 12 The model with the optimal window size was then used to generate the counterfactual for the post-vaccine period and to quantify the impact of vaccine.    The performance of models was compared using the mean squared error (MSE), which was calculated as follows: where . represents a rate ratio for dataset i and n is the number of total datasets compared in each analysis (i.e., total number of states for state-level analysis and 100 for down-sampled analysis).
MSE was decomposed to variance and bias squared, which were calculated as follows: . The sum of variance and bias squared is equal to MSE.

eTable 1. List of Control Diseases Included in the Synthetic Control Model.
Grouping scheme ICD-10 Exclusions ICD-10 chapters C00_D48 Neoplasms D50_89 Diseases of blood and blood-forming organs and certain disorders involving the immune mechanism E00_99 Endocrine, nutritional, metabolic disorders G00_99_SY Diseases of the nervous system G00_04 H00_99_SY Diseases of the ear and mastoid process H10, H65, H66 I00_99 Diseases of the circulatory system K00_99 Diseases of the digestive system L00_99 Diseases of the skin M00_99 Diseases of the musculoskeletal system N00_99 Diseases of  Step 2: PCA Each black dot represents a rate ratio (RR) estimated for each down-sampled dataset. Dark grey bars associated with these dots represent 95% credible intervals for RRs. Vertical red lines represent the null value (RR= 1). The percentages at the top represent the down-sampling rates.