COVID-19: Short-term forecast of ICU beds in times of crisis

By early May 2020, the number of new COVID-19 infections started to increase rapidly in Chile, threatening the ability of health services to accommodate all incoming cases. Suddenly, ICU capacity planning became a first-order concern, and the health authorities were in urgent need of tools to estimate the demand for urgent care associated with the pandemic. In this article, we describe the approach we followed to provide such demand forecasts, and we show how the use of analytics can provide relevant support for decision making, even with incomplete data and without enough time to fully explore the numerical properties of all available forecasting methods. The solution combines autoregressive, machine learning and epidemiological models to provide a short-term forecast of ICU utilization at the regional level. These forecasts were made publicly available and were actively used to support capacity planning. Our predictions achieved average forecasting errors of 4% and 9% for one- and two-week horizons, respectively, outperforming several other competing forecasting models.


Introduction
The first cases of COVID-19 pandemic were detected in Chile by early March 2020. A few days later, all schools were closed and a few counties with a relatively high number of cases were quarantined. By the end of April, the available data showed that the outbreak was kept relatively under control with a few hundred confirmed new cases every day.
However, by early May the infection rate started to increase rapidly threatening the ability of health services to accommodate all incoming COVID-19 cases. In middle May, the Chilean Society of Intensive Medicine (SOCHIMI) reported a worrisome occupation of ICU beds of more than 95% in the capital city of Santiago, where most of the cases were concentrated. Suddenly, ICU capacity planning became a first order concern. On May 12th, the Instituto Sistemas Complejos de Ingeniería (ISCI) -who was already working on analytics of mobility-was urged to prepare short-term forecasts of ICU beds occupancy for those regions with the highest utilization rates. In 24 hours, we submitted our first report. From then on, we prepared forecasts every two days for several weeks, and then we reduced the frequency to report every four days. These reports were sent directly to the authorities -particularly those in the coronavirus response committee-and to SOCHIMI.
Additionally, we published the reports on ISCI's website (https://isci.cl/covid19/). The reports grew in complexity and coverage of regions over time, in agreement with decision makers on what seemed to be more pressing.
We developed a solution to generate predictions of the number of ICU beds that were going to be required by COVID-19 patients for every region in the country in a time horizon 14 days ahead. Our methodology was based on an ensemble of a variety of forecasting models that capture different components of the evolution of the outbreak. The first model we built was a compartmental model that described the patient flow as a stochastic progression through different clinical states. Here we contemplated that new patients were going to require an ICU bed after a specific number of days with a given probability and they would be discharged after a given number of days with a certain distribution.
Compartmental models have been one of the most popular approaches to characterize the evolution of epidemics (Ferguson et al. 2020, Gatto et al. 2020, but it has limited flexibility to accommodate dynamic variations in the environment. In the context of COVID-19, the delay between the identification of a new case and the requirement of an ICU bed, the duration of the mechanical ventilation, and the likelihood of requiring urgent care, are some of the critical parameters that might change over time and that are not properly captured by this kind of models. Therefore, we included several autoregressive and machine learning models that could better capture dynamic variations in the environment. Then, we combined the forecasts of the different models using a trimmed mean ensemble. Our approach could generate accurate forecasting, achieving an average prediction error of 4% and 9% for a one and two-week horizon respectively. These predictions were informative to support decision makers in the sanitary crisis and outperformed other competing ensembles of forecasting models.
In this article, we describe in detail the methodology we used to generate forecasts for this very urgent problem, showing how the use of analytics provided relevant support for decision making in critical times, even with incomplete data and without enough time to fully explore the numerical properties of all forecasting methods. Using this methodology, we produced predictions with small forecast errors that not only were useful to support decision making in critical times, but could also be informative about resource planning in potential new outbreaks.
The rest of the article is structured as follows. In section 2 we revise the relevant literature. In section 3 we describe the context and provide some institutional background that should be considered in the design of the forecast. In section 4, we describe the models we used and how we combined them to produce our forecast and in section 5 we discuss some adjustments we introduced to accommodate changing conditions in the spread of the COVID-19. Finally, in section 6 we summarize our results to conclude in section 7 with a discussion of the main findings and avenues for future research.

Literature Review
Our work is related to two streams of research. First, our research is related to the use of analytics in health care and in particular to the use of forecasting methods to plan healthcare capacity. Second, our research is related to the use of forecasting pooling and the combination of multiple methods to generate robust predictions. Next, we discuss both streams of literature with a special focus on other recent works in the context of COVID-19.
Analytics have shown to be relevant in supporting decisions in different components of healthcare systems (Manyika et al. 2011, Raghupathi andRaghupathi 2014). In recent years, we have seen an explosive growth of applications of analytics in diverse facets of health care, including medical diagnosis, human resources, supply chain management, and the design of health care insurance to name a few (Ward et al. 2014, Galetsi andKatsaliaki 2019). Although the use of mathematical modelling in this area has brought a number of challenges, there are ample opportunities to generate essential and timely knowledge to support decision-making (Nambiar et al. 2013). In the context of the control of infectious diseases, the combination of big data and tractable analytical techniques provides new tools to fight against pandemics (Kao et al. 2014). The global impact of the COVID-19 has motivated numerous modelling efforts to provide guidelines for the control and This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=3693447 management of the outbreaks. Certainly, there is a close relationship between the spread of the infection and the demand for medical resources. Therefore, mathematical models that describe the evolution of the pandemic can provide a first order approximation for the demand of ICU beds. For this reason, we are especially concerned about the modelling effort forecasting the spread of the outbreak to estimate the requirements of hospital resources.
For example, Ferguson et al. (2020) and Gatto et al. (2020) used different scenarios of nonpharmaceutical interventions in the UK and Italy respectively. Similarly, Chinazzi et al. (2020) and Villas-Boas et al. (2020) evaluated the impact of mobility and travelling in the spread of the virus, while Dowd et al. (2020) assessed the effect of age structure fatality rates.
Closer to our research, other works have proposed different models to forecast the number of infections. For instance, Roosa et al. (2020) used logistic growth models and a subepidemic wave model and Hu et al. (2020) used auto-encoders to provide short-time forecast for the total cumulative and new confirmed cases in several provinces of China. Perc et al. (2020) used an exponential growth model that includes recovery and fatality rates to analyze the evolution of the total number of cases in the US, Slovenia, Iran and Germany.
Finally, Oliveira and Moral (2020) introduced a state-space hierarchical model to generate short-term daily forecasts considering the relations between the series of different countries.
These investigations have advanced our understanding on how the pandemic spread, but they are silent about the use of hospital resources.
The use of forecasting methods to aid hospital resource planning has been an active area of research. In this regard, time-series analysis has been one of the most widely used approaches to generate short-term demand forecast because they provide a comprehensive treatment of seasonality and serial correlations. For example, Schweigler et al. (2009) assessed the prediction accuracy of short-term emergency beds occupancy of different timeseries methods and historical average models. Abraham et al. (2009) analyzed different time-series methods to forecast emergency bed occupancy and show they can provide meaningful information up to one week ahead. To predict the medical requirements in a longer horizon, Jones et al. (2002) proposed a seasonal ARIMA to characterize the volatility of the series. They found that the model produces good forecasts most of the time, but it breaks down during a crisis. On a different research line, Littig and Isken (2007)

employed
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=3693447 individual patient level data in a computational intensive model to forecast demand for beds at different units in a hospital.
Since the start of the COVID-19 pandemic, there has been several attempts to estimate the demand for hospital resources. However, as most of this work is devoted to describe the aggregated evolution of requirements, the results are useful to anticipate global policy making, but not to support tactical decisions. For example, Murray et al. (2020) used a simple gaussian curve fitting to predict the number of ICU beds and mechanical ventilators used at the peak and the cumulative use of bed days. More complex epidemiological models have also been used extensively in this context. Cancino et al. (2020b) and Rainisch et al. (2020), used compartment models with age structures to simulate the spread of the virus and evaluate the impact of mitigation strategies in the healthcare demand at the peak of the epidemic.
Similar to these investigations, in our predictions we developed a compartmental model, but we tailored it to the predictions of the demand of ICU beds in the short term. To do so, we limited our attention to the progression of patients after they have been diagnosed and we considered a parametric distribution of patients requiring an ICU bed. Here we incorporated clinical parameters that describe the clinical evolution of patients and we derived detailed predictions for critical medical resources. An important drawback of compartment models, is that they have limited ability to accommodate dynamic changes in key parameters and therefore their predictions may fail to capture important variations in the process such as congestion or delay in testing. To overcome this limitation, we relied on ensemble forecasting models where we combined compartment model predictions with those derived from auto-regressive and machine learning models that can better capture dynamic variations in the environment.
To integrate different forecasting models we used an ensemble approach. Previous works offer strong evidence supporting that combining forecasts can improve accuracy of the predictions (Armstrong 1989, Zou andYang 2004). Literature of forecasting pooling indicates that simple average of predictions perform well, but in our case we used a special form of trimmed mean to accommodate some specific features of our problem (Jose and Winkler 2008).
Forecasting pooling have been applied in many diverse domains (Clemen 1989), but its applications in the context of the pandemic is scarce. Wang et al. (2020) combined a logistic growth model with machine learning predictions to estimate the epidemic curve and This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=3693447 predict the overall trends of the epidemic. Ribeiro et al. (2020) applied a wide variety of forecasting models including autoregressive models, random forests, ridge regression and support vector regression to provide very short-term forecast of cumulative confirmed cases in Brazil and they compare the performance of individual models against an ensemble prediction. Similarly, Benítez-Peña et al. (2020) used an optimization-based ensemble to find the best combination over a family of machine learning predictions and apply this methodology to predict the cumulative number of hospitalized patients in Andalusia. Following a different approach, Uhlig et al. (2020) used neural networks to extract features from time series data and then used those features to feed standard compartment models to describe the aggregated spread of the pandemic.
There are two main elements that differentiate our research from other works using multiple models. First, our method combines different predictions to produce robust estimations of the requirements for ICU beds. These models can capture different components of the spread of COVID-19. For example, we considered machine learning models that can provide a great deal of flexibility to accommodate short-term variations in the environment, but we also included compartment models that provide more detailed description of the clinical components of the decease. The second distinctive feature is that our approach is specifically devoted to support ICU capacity decisions and therefore we tailored our predictions to estimate the number of beds that will be required in each point of time and not only in aggregated metrics such as the number of beds at the peak or the cumulative number of beds that will be required during the whole duration of the pandemic.

The urgent problem of forecasting ICU beds
The first COVID-19 cases were detected in Chile by early March 2020, and for the first two months, the number of new infections was relatively under control with a few hundred confirmed new cases every day. However, by early May the number of new COVID-19 cases increased rapidly creating numerous and complex challenges for the country. A graphical illustration of the evolution of the pandemic in Chile is displayed in Figure 1. In the left panel, we display the series of new confirmed cases, and in the right panel, we display the series of new deaths. In both cases, we highlight the state of the series at the time we started producing the forecasts, a few days after the country entered a severe exponential growth phase. By looking at the international experience and learning from the problems This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=3693447 faced by other countries that were affected earlier by the pandemic, it was clear that the management of hospital capacity was going to be a critical decision Grasselli et al. (2020). Furthermore, following the exponential trend of new cases, there was a serious concern that the capacity of ICU beds could be dramatically surpassed, leading to much higher mortality rates. To increase ICU capacity, hospital management can follow a number of complementary strategies with different levels of complexity. A simple mechanism to increase hospital capacity is through the release of medical resources by re-scheduling non urgent procedures.
Other strategies require more time to be implemented. For example, pediatric rooms could be converted to receive adult patients or anaesthetic machines could be adapted to provide mechanical ventilation. As most of these mechanisms should be implemented in a time span of a few days, we decided to provide forecast for a 14-days horizon. Despite generating forecast of ICU utilization for each of those fourteen days, in the reports we highlighted the number of beds that would be required in exactly one and two weeks.
Chile is administratively divided in sixteen regions and, in terms of geographical aggregation, the forecasts were produced at the regional level. The country's population is very unevenly distributed and the Metropolitan Region, which includes the capital city of Santiago, concentrates near half of the national population. Despite this heterogeneous population distribution, our decision to produce regional demand forecasts is justified for two reasons. First, and consistently with the administrative division, the budgets are executed at the regional level. Second, from an operational perspective, if needed, patients can be transported from one hospital to another within the region, and therefore, the capacity at the regional level provides the more useful aggregation for decision making.
To estimate the models we used data that was publicly available. Given the crucial importance of the consequences of the pandemic for the whole nation, the Ministry of Health provided frequent epidemiological reports since the day of the first infection. Later on, the Ministry of Science consolidated all available information and created a public repository with an extensive list of statistics reported in a time series format 1 . The data series are reported at either national, regional or at the county level, with only a few statistics available at a more disaggregated level. Through the whole process, we tried to include different covariates in the model, but the results we generated are based on the list described in Table 1. The main series we studied was the Regional Number of COVID-19 Patients in ICU. The other series were used as additional explanatory variables.

Aggregation
Since Geographical Time Number of PCR test 2020-04-09 Regional Daily Number of COVID-19 Patients in ICU 2020-04-01 Regional Daily Number of COVID-19 Patients in ICU by age group 2020-04-01 National Daily Number of New symptomatic cases 2020-03-03 Regional Daily Table 1 Publicly available data used in the forecast At the beginning of our study, the repository had information of the total daily number of new infections by region, but a few weeks later, the repository started reporting the number of new cases distinguishing between symptomatic and asymptomatic cases. As the latter were not going to require ICU beds, from then on, we decided to only consider the series of symptomatic cases.
With this data at hand, we embarked on the challenging task of producing demand forecast for ICU beds. Certainly, accurate predictions could assist decision makers to better prepare for the large number of hospitalizations that were expected. However, the exponential nature of the infections generated huge variations in the expected number of patients in different scenarios. Our goal was, as we were urged to do, to create robust predictions and to deliver them to health officials, with the aim to support them with information that will help to understand the rate at which they should be increasing the capacity.

Models
The most widely used approach to describe the evolution of infectious diseases are compartmental models, where the population dynamically evolves through different stages (Yang et al. 2020, Gatto et al. 2020, Kucharski et al. 2020. In one of the simplest versions, healthy (and susceptible) people become infected when they are in contact with an infectious individual, to eventually recover or die. These models have been extended to include clinical stages, providing a first approximation to the use of hospital resources (Ivorra et al. 2020, Cancino et al. 2020a). In addition, as compartmental models directly describe the dynamics of the disease, they can be suitable to guide the evaluation of mitigation policies (Ferguson et al. 2020). Nonetheless, these models require precise estimation of epidemiological and clinical parameters, which have been proven to be difficult to estimate in practice (Jewell et al. 2020, Roda et al. 2020. Furthermore, for the specific problem of forecasting the COVID-19 related demand of ICU beds, we had good reasons to believe that several of the key parameters could change rapidly over time generating biased predictions. We identified at least three reasons why epidemiological parameters could be non-stationary: 1. The proportion of symptomatic patients requiring mechanical ventilation can change over time and similarly, clinical criteria to release patients from ICU can be adjusted dynamically depending on the actual usage of the existing capacity. This is not only because hospitals can relax nominal criteria, but also because SARS-CoV-2 is a new virus that involves continuous learning from medical teams. For instance, the head of the Chilean Society of Intensive Medicine stated that "Patients initially stay in the ICU between 10 and 11 days, and now they are staying between 14 and 16 days. This is because, with everything we learned, we intubated less and selected more serious patients" 2 .
2. Despite governmental efforts in providing timely access to relevant information, a large portion of the system to generate the data was in constant stress and therefore the information that we had available for any single day could be lagged. Among others, the results of lab tests exhibited important delays worldwide 3 and hence, the number of new cases might be more or less informative depending on the congestion of the laboratories.
3. The data is not always available at the patient level and there are important factors that were never observed. For example, every day the government reports the number of new cases and the current occupation of ICU beds per region, but there is no information about how many patients enter, exit nor the length of the stay of the patients in intensive care. Likewise, when the capacity was lacking in some regions, the Government has the ability to move patients between regions which was not systematically reported.
To overcome the limitations of compartmental models and properly capture short-term dynamics, we combined them with other time-series models that may be better suited to capture those dynamics. From a theoretical point of view, the use of forecast combinations is justified because they can lead to smaller forecasting errors and they can even reduce biases of individual forecasts (Granger and Ramanathan 1984, Elliott and Timmermann 2004, Yang 2004. Beyond theory, combined forecasts have shown to lead to better performance in a wide range of applications (Clemen 1989, Stock and Watson 1999, Zou and Yang 2004. Previous literature have offered several reasons to justify the empirical success of combining forecasts, such as model misspecification, changes in the underlying parameters and heterogeneous use of different information sets (Hendry and Clements 2004). As we have explained, several of these reasons were present in our setting and consequently, our general approach is based on a ensemble of different forecasting models. Next, we briefly discuss the individual models we included in the ensemble. To organize the discussion, we group the list of models in three categories: autoregressive, artificial neural networks and compartment models.

Autoregressive Models
ARIMAX. We start with a classic autoregressive integrated moving average (ARIMA) approach (Box et al. 2015). In this model, the value of the time series in day t (y t ) depends on its lagged values and its lagged errors, and the series are further differentiated to estimate stationary processes. The ARIMAX variant is the result of considering an additional set of exogenous explanatory variables x t . In the vector x t we considered the whole series of new cases and the positivity rate. By introducing the backward shift operator B, the model can be expressed in a compact form as: The model depends on the relative weight of the own values (φ), the weights of the errors (θ), the weight of the exogenous variables (β), and the constant term (θ 0 ). The model also depends on the number of lags (p, q) and the number of difference operations (d). To determine the value of (p, d, q) we used a step-wise selection based on AIC criteria (Hyndman and Khandakar 2008).
In our analysis, we considered ARIMA and its ARIMAX variation, but the forecast of both were fairly similar and therefore, to ensemble the forecast we only considered one of the two. For the ARIMAX, we included the number of new symptomatic infections in previous days as one of the key explanatory variables. While for more flexible models we considered the whole sequence of new symptomatic infections, in this case we only included a few values in the [6][7][8][9][10][11][12] range that showed to provide more stable estimates than feeding the model with the complete series.
TBATS. We then looked at a trigonometric seasonality, Box-Cox transformation, ARMA errors, and trend seasonal components (TBATS) model. This model uses a combination of exponential smoothing and a Box-Cox transformations to automatically accommodate multiple seasonal components. Each of these seasonalities are modeled by a trigonometric representation based on Fourier series. Although this model considers a series of nested equations to represent a detailed decomposition of the series, using the backward shift operator, the model can also be expressed in a reduced form as: Here y One of the advantages of this model is that it provides a great deal of flexibility to automatically accommodate a large number of seasonal and trend components. However, unlike the previous ARIMAX model, TBATS does not include exogenous variables and hence, it has limited ability to anticipate how variations in the infections can be translated in different requirements of ICU beds.

Time-Delay Artificial Neural Networks
In our approach we included a number of neural network models. To accommodate the time series structure, we used a special class called time-delay neural networks (TDNN).
In this class, the inputs to any node can include outputs of earlier nodes not only during the current time step, but also from previous time steps (Clouse et al. 1997).
As is common in neural network learning, we trained the structure adjusting its parameters to minimize the error using a generalized feed-forward network. Thus, without loss of generality, the predictions are given by: In this expression, f is the activation function for the hidden layers and g a nonlinear transformation in the output layer. Additionally, the φ i (x) are the basis functions and (v, w) is the list of weights that are calibrated in the training. Neural networks have shown to have superior forecasting power in different settings (Hill et al. 1996). The TDNN models we present next, vary in their structure (number of input nodes p and hidden nodes q), as well as, in the pruning criteria to reduce the dimension of the net and avoid overfitting.
MLPR. The perceptron is a classifier that maps a vector of inputs to a single binary value through a threshold activation function. A Multi-layer perceptron is a network of individual classifiers that enable learning about complex processes and it is one of the most used perceptron-based learning algorithms (Stephen 1990). The flexibility of a MLPR algorithm allow to include an arbitrary set of input variables, such as lagged values of the series and other explanatory factors. In our implementation of MLPR, we followed the This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=3693447 general expression of (3) with logistic activation functions and four hidden layers. We tried different number of nodes per layer, but ended up using a {5:10:10:5} architecture that performed well.
ELM. Extreme Learning Machines is a special feed forward neural network that only uses a single-hidden layer. In this layer, nodes are randomly chosen and the weights of the outputs are analytically determined (Huang et al. 2006). Unlike other back-propagation learning algorithms, the parameters of the hidden layer of the ELM do not need be tuned.
In fact, ELM aims to not only minimize the training error, but also to reduce the norm of the output weights. Thus, ELM models tend to achieve good generalization performances with much faster training than other artificial networks (Tang et al. 2015).
We implemented ELM following the general TDNN expression in (3). The norm of the output weights was controlled by the LASSO, where the norm-1 of the weight is imposed to be smaller than a given threshold. We decided to use 11 nodes in the single hidden layer, because that exhibited good performance and small prediction errors.

GMDH. A Group Method of Data Handling approach involves the successive selection
of models based on an external prediction criteria. Starting with a simple set of models, the method constructs a new generation of more complex models and combines them to maximize the forecasting performance (Ivakhnenko 1970). In our case, we organized the sequence of models in a neural network, where each layer corresponds to a new generation of models. Following other GMDH applications in the literature, we considered polynomial models as follows (Farlow 1984): The GMDH method allows the inclusion of an arbitrary set of covariates in the polynomial, but in the context of time series we only considered the lagged values of the series.
Our motivation to include this model in the pool of forecasts is that it was conceived to learn about complex relationships when lacking of a detailed knowledge of the fundamentals of the process. In our case, we had epidemiological theory characterizing the evolution of the pandemic, but the observed data is mediated by a number of unobservable processes that might require additional layers of complexity. Another strength of GMDH is that recent computational implementations of the algorithm include automatic normalizations of the variables (Dag and Yozgatligil 2016).

ICD Compartment model
The goal of our compartment model is to predict the future utilization of ICU beds by critically ill cases of COVID-19. Thus, our model aims to replicate the behaviour of the ICU process balancing inbound and outbound flows of patients in different stages of the process. Our model considers three compartments through which the patients evolve. For each one of them we tracked the number of patients in each stage as follows: • I: The number of infectious individuals that show symptoms of COVID-19.
• C: The number of critically ill people that need an ICU bed.
• D: The number of individuals that are discharged from the ICU.
The number of infected, critically hospitalized and discharged patients fluctuate over time and therefore we made the state variables depending on time. Thus, the variables I t , C t and D t represent the number of new symptomatic cases, the number of critical patients and the number of discharged cases at day t. We describe the transition between states using a probabilistic approach. These probability distributions do not only consider how likely is that a given patient evolves to another state, but also the expected duration of that transition as is illustrated in Figure 2 I t−l−m   Figure 2 illustrates that the number of ICU beds that are going to be used in day t depends on the number of beds used the previous day, the number of new cases that require critical care and the number of patients that will be discharged. Here, a fraction a of the symptomatic cases will require an ICU bed, but they only demand it l days after they are diagnosed. Acknowledging there is variation on the delay since the diagnosis, we considered that patients requiring a new ICU bed in day t could have been diagnosed between l − m and l + m days before. While our model allows for any arbitrary distribution to characterize the requirements of beds over time, in this work we only considered uniform distributions and therefore a fraction a 2m+1 of new symptomatic cases detected on t − l will require an ICU bed on t. The logic for discharging patients is similar, but we know that sooner or later all patients were going to be discharged and therefore all uncertainty is associated to the duration of the bed usage. We assumed that on average patients are discharged after k days, but as before we allowed dispersion and let patients to complete their clinical cycle in the [k − h, k + h] range. If the lengths of the stay were uniform, the fraction of patients entering the ICU in t, being discharged in time t + k is 1 2h+1 . In the empirical application, we used a bimodal distribution with a fraction d of moderately severe cases staying in ICU between k 1 − h 1 and k 1 + h 1 days, and more severe cases staying between k 2 − h 2 and k 2 + h 2 days. Formally speaking, the equations describing the evolution of patients over time is given by: In these equations, the series of ICU utilization and the number of new symptomatic cases are the data, while (a, d, l, m, k 1 , h 1 , k 2 , h 2 ) are parameters to be estimated. These parameters are disease-specific and we could retrieve their value for medical literature on the SARS-CoV-2. For example, the mean duration of symptoms before hospital admission was reported to be 10 ± 2 days (Bhatraju et al. 2020, Phua et al. 2020. Similarly, medical reports indicate that the length of stay depends on the how severe the decease manifests in the patient and it is estimated to be 21 ± 7 for those who shows extremely severe symptoms, and 14 ± 3 for those who do not (Phua et al. 2020). In our model, we used these clinical estimates as references, but we conducted an exhaustive search over a grid of values centered around those values to chose the parameters that minimize the forecasting error on a holdout sample.

Ensemble
An extensive body of literature has shown that combining forecasts can improve prediction accuracy, and that a simple average often performs better than more complex combination schemes (Bates and Granger 1969). However, as the mean can be sensitive to extreme values, more recent literature have suggested that deleting more extreme predictions might further improve forecasting pooling. For example, the median forecast might be less susceptible to be affected by outliers (Armstrong 1989).
In our application we used a trimmed mean approach (Jose and Winkler 2008), where we used the simple forecast mean after discarding the two most extreme predictions. There are two variations we introduced to this procedure to accommodate our forecasting needs.
First, as the predictions of ARIMA and ARIMAX where highly correlated, in the pool of selected forecasts we only considered at most one of them. Second, as the medical personnel in charge of abilitating new ICU beds had more intuitive interpretation of the ICD model, we always included it in the pool. To rank the forecasts we considered the prediction of ICU beds in a two-weeks horizon. Thus, if y (k) t is the k-th order statistic for the series, then our forecast is given by the following trimmed mean: Thus, our forecast is composed by an average of four models (including ICD). Considering these predictions were directly informing health officials bout a very critical decision, we visually inspected all forecasts before producing the final reports. In these inspections, in very exceptional cases, when more than one forecast dramatically deviate from the mean, we overruled our trimmed criteria and include both ARIMA and ARIMAX in the forecasting pool.

Implementation
When inspecting the series, we found no evidence of seasonality for any variable and therefore all models were estimated using no seasonal components. To determine the number of observations to use in every forecast, we considered information since April 1st, when the accumulated number of symptomatic patients reached three thousands cases. Later on, when more data was accumulated, we only considered the last sixty days of data to estimate the models.
All models were estimated using daily data. During the pandemic, the Ministry of Health To decide the optimal values (p, d, q) of ARIMA and ARIMAX models, we proceed iteratively. If the value d is known, the model selects the orders p, q via an AIC information criterion. For non-seasonal data, d is selected based on successive KPSS unit-root test (Kwiatkowski et al. 1992) and stops when finding an non-significant result (Hyndman and Khandakar 2008). In the case of TBATS, the general model considers several components and therefore several variations are estimated (e.g. with an without trends, with and without Box-Cox transformations) and the final model is also chosen using AIC (Livera et al. 2011).
Models based on artificial neural networks can be estimated using standard backpropagation learning algorithms. However, given the time-series structure, the estimation benefits from using automatic feature selection (Crone and Kourentzes 2010). For the case of GMDH the weights of the polynomial are calibrated using a Regularized Least Square Estimation method (RLSE) reducing the potential problems of multi-collinearity (Dag and Yozgatligil 2016) In terms of computational tools, the data aggregation and pre-processing were conducted using R libraries. To normalize the data, we used Z-scores for neural network models and For the compartment model and the ensemble, we coded our own routines to accommodate the specific requirements of the problem.

Timeline of events and methodological adjustments
In the previous section we described the general methodology we employed to produce daily forecasts for ICU beds. However, a key premise of this work is that the situation required predictions urgently. Moreover, the general environment was constantly changing and therefore we had to continuously update our methodology to accommodate the evolution of the pandemic and the information needs from the health officials. This is the list of the most relevant events that required adjustments in the methodology.
• We generated our first solution only a few hours after the government realized that ICU planning was going to be a key element in mitigating the consequences of the pandemic. These early solutions only considered reduced-form models with no epidemiological considerations. However, we quickly realized we needed to complement these models with others capturing the medical structure of the problem. This is because a large fraction of the decision makers who were actively reading our reports were health-care professionals needing a medical narrative to explain the variations in the demand for ICU beds. This narrative was only provided by a compartment model and therefore in all public reports we generated, we always included those models 4 .
• During the first two weeks, we used the series of new confirmed cases regardless if they presented symptoms or not since that was the only information readily available.
For the prediction of ICU beds, only patients with symptoms have a positive probability of requiring intensive care and therefore the number of cases with symptoms should provide a more direct signal of the requirement of ICU beds. When the series of new cases was systematically reported depending on the existence of symptoms, we started to use symptomatic cases only.
• The first two reports we generated only considered the Metropolitan Region because it concentrated by far the largest number of cases and consequently it was the most urgent concern for local authorities. After a week, we added reports for other three regions (Tarapacá, Antofagasta and Valparaiso) that also showed an alarming rise of new cases. At this point our model was completely automated to generate predictions for all regions in the country, but we only progressively added more regions as they became more worrisome.
By early July we started reporting forecast for all sixteen regions of the country.
• The GMDH model was not considered in the original list of models and it was only introduced in June 11th. From then on, this model was available to be considered in the ensemble.
• In early July, we identified that most models were starting to show some slowing down in the rate at which additional ICU beds were going to be needed for the Metropolitan Region. However, the ICD compartment model did not show any sign of saturation. After interviewing medical personnel we realized that some patients were starting to be mechanically ventilated in Emergency Rooms (ER) and they were not counted in the nominal series of ICU utilization. Thus, in terms of capacity planning, we were required to report how many beds should be made available to cover both new cases and ventilated cases in emergency rooms. Therefore, we complemented the series of ICU beds with the number of patients ventilated in ERs 5 .
• As laboratories were reaching their testing capacity, the variation in the number of reported new cases increased significantly in middle June. As a consequence, the forecast were less stable. To overcome this problem, we pre-processed the series of new cases and we used the five-days moving average instead the raw series.
The results that we present in the next section are devoted to represent what we reported in each point of time and already include all methodological changes we introduced along the process.

Results
Starting from May 16th, we generated standardized and frequent reports with the two weeks ahead forecasts. The reports were made publicly available at https://isci.cl/covid19/ and were generated regularly every other day, except for the last two weeks of July when the reports were generated only twice a week. The first reports only provided forecasts for the most critical regions, but we later provided reports for the whole country. In the analysis we present here, we only consider results since May 20th, when our routines were fully automatized to generate predictions for all regions.
The main body of the reports consisted on a summary of the number of beds that were going to be required for each region in a time horizon of two weeks, followed by a graphical summary of the forecast. A very important requirement for these reports was that they had to be concise and easy to read. The crisis committee had a very short time to evaluate all the information, so our reports were tailored to consider this situation. In Figure 3 we display the predictions reported for the Metropolitan Region on July 24th. Graphical reports for another two regions and the national summary is available in Appendix B. the predicted number of beds that will be required in exactly seven and fourteen days. For the example presented in the Figure 3, the reports indicated that the Metropolitan Region was going to require 937 beds in a week (172 beds less than occupation at that date) and 802 beds in two weeks (307 beds less than the occupation at that date). For this particular example, the ensemble was produced with MLPR, ELM, TBATS and ICD models, but those were changing depending on the values of the forecasts. For a detailed count of the frequency in which each model was used in the ensemble, see Appendix D.
For a systematic evaluation of our forecasts we decompose the analysis in two parts. We first compare the performance of each model and the ensemble in terms of their forecasting errors, and then we discuss how our ad-hoc trimmed algorithm compares against other pooling criteria.

Model Evaluation
From May 20th to July 28th, we produced 30 reports of ICU utilization. In each report we presented daily forecasts of the demand for ICU beds for the next two weeks for the regions considered in that instance. In every case we generated predictions for different forecasting models and we built our best guess trough a conditional trimmed mean ensemble. A visual representation of all forecasts we reported for the Metropolitan Region is displayed in Figure 4. In this figure, we display the actual series of ICU occupation with a black line and each of the thirty fourteen-days ahead forecasts are presented with a different color.
These results indicate that except for a few cases in early June, when we overestimated the demand for ICU beds, the predictions were, at least visually, quite accurate. To summarize all these daily forecasts, we compute the Mean Absolute Percentage Error (MAPE) for each model as is displayed in Table 2. To simplify the exposition, we only report the performance of the models for the Metropolitan Region because this region required by far the largest majority of ICU beds in the country. For example, at the peak of the outbreak, the Metropolitan Region demanded more than 11 times more beds than the second most congested region. An analog Table illustrating the errors for Valparaiso Region (the second largest) is available in Appendix C. Further metrics of other regions are available upon request to the authors.
In  were "tremendously important to support decision making in difficult times".
It is important to evaluate the performance of our forecasting task in the context of a pandemic characterized by phases of exponential growth that can lead to large errors in predictions. Consider for example the case of the U.K. where early epidemiological models initially projected around 500,000 deaths, number that was updated to under 20,000 deaths just after two weeks (Uhlig et al. 2020). For an additional discussion of the challenges in predicting the spread of COVID-19, see Petropoulos and Makridakis (2020).
To further understand how individual models performed against the ensemble, we plot the series of MAPE for all models in Figure 5. From this figure we observe that the precision of the models is not uniform over time. With the exception of ICD, all models performed well early in the process and at the end where the pandemic was either on the continuous rise or in steady decline. However, the ICD model showed to be more accurate in the middle of the process. Interestingly, the ensemble was frequently associated to smaller errors than the individual models. In the next section, we provide a more comprehensive discussion about this pattern, evaluating how our ensemble compares to other methodologies proposed in the literature.

Validation of the Ensemble
To complete the analysis, we discuss how our conditional trimmed mean ensemble performed against other criteria to combine forecasts. As we forced our predictions to include the ICD compartment model regardless of the value of its predictions, it is possible that our ensemble can lead to a worse performance than other criteria that are not subject to this restriction. By design, we were willing to sacrifice precision to gain interpretation, but it is worth exploring whether our predictions deteriorated by considering this interpretability constraint. Figure 6 displays the Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) for our conditional trimmed mean along with two other commonly used ensembles: the mean and the median forecast. For simplicity, in these series we only report the errors for the whole forecasting horizon with no distinction between the first or second week. Instead, in these plots we highlight three stages depending on whether the series of ICU occupation was exhibiting positive, neutral or negative trends; we label them as ascending, plateau and descending phases respectively. Although the definition about the exact time when the series change its slope is somewhat discretionary 6 , this qualitative decomposition helps to understand the role of the compartmental model in the forecasts.
Summaries of the comparison of ensemble criteria are displayed in Table 3  Descending 3.26 7.60 6.48 Table 3 Error Metrics per Combination across Iterations From Table 3 we observe that forcing the inclusion of ICD model did not induce any deterioration of the forecast precision and our ensemble forecast exhibited the smallest 6 In this exercise the second phase starts on June 23th and the third phase is decided to start on July 8th In early stages of the pandemic, our trimmed mean criteria was outperformed by the standard mean and median ensembles. However, after a few iterations our predictions consistently exhibited the smallest error. This result can be explained because the ICD model produced prediction errors with opposite signs that cancel out the errors of other models.
We believe that feeding the model with structural information about the clinical evolution of COVID-19 patients can provide a useful forecasting signal and it gives additional support to the convenience of using combined forecasts.

Conclusions
In this research we proposed a methodology to produce short-term forecasts for ICU beds in the context of COVID-19 epidemic in Chile. Our algorithm is based on an ensemble method that combines autoregressive, artificial neural networks and a compartment model to generate our best prediction of the ICU utilization for a time horizon of fourteen days.
This algorithm captures the epidemiological dynamics of the disease with a compartmental model and it is complemented by time series models that capture short-term changes in the clinical parameters. This approach resulted in very accurate predictions with a mean error of 4% for the first week and 9% for the second week. The analysis of the performance over time indicates that, in relative terms, the proposed model produced larger errors earlier in the process. This can be explained because in early stages of the pandemic, each individual model had less data to learn from. However, we believe that a more fundamental reason is that after a few iterations different models produced complementary results and therefore the trimmed mean we use to ensemble the forecast generated better forecast than any single model in isolation. Hence, every model contributed with a different key signal that increased the accuracy of ICU beds predictions in most of our reports. In this regard, the inclusion of a compartmental model helped to generate more precise predictions, in spite of being the least accurate model overall.
In terms of the application, the reports we made publicly available were a very useful tool to anticipate the availability of critical resources in the hospitals. We generated consistent information to characterize the progression of the pandemic providing health officials with a data-driven tool to make quick decisions about ICU planning. These reports facilitated the Ministry of Health to implement a progressive increment of the number of beds that resulted in more than doubling the capacity in the most congested regions. We heard from health and science authorities, and from SOCHIMI how these forecasts were useful to know what was coming, and to better focus resources and efforts across the country. Importantly, the messages we were sending were better received because, following interaction with authorities, we tailored the reports to ease communications.
We are confident that our model contributed to better planning in a critical situation where the lives of many were at risk. However, as the COVID-19 pandemic is still a major threat in many countries around the world, we consider important to discuss potential ideas to further improve the methodology. In our work, we used the data that was available and that we identified as having predictive power. However, more disaggregated data is likely to further improve forecasting. For example, more detailed information on patient demographics and medical histories can further help to identify what fraction of patients might require mechanical ventilation, and thus, give more detailed guidelines about focalized mitigation policies.
The proposed methodology can be also improved by adding more forecasting methods into the pool of models. Although we use a wide variety of models, there are other that we did not try. For example, the recently developed prophet forecasting model (Taylor and Letham 2018) has shown to produce proficient predictions for the number of active cases (Papastefanopoulos et al. 2020). Our methodology could not only benefit by adding more forecasting models, but also other ensemble criteria. For example, recent literature have shown that combining forecasts through Ordinary Least Squares and Least Absolute Deviations can lead to further improvement to the ensemble (Nowotarski et al. 2014).
To produce our predictions, we treated different regions independently. Although this is a reasonable assumption for the case of Chile where commuting between regions was limited, it might not be a good assumption to replicate our work in other geographies. In that case, a hierarchical model allowing for spatial correlation might be more appropriate (Oliveira