Artificial Neural Network Modeling of Novel Coronavirus (COVID-19) Incidence Rates across the Continental United States

Prediction of the COVID-19 incidence rate is a matter of global importance, particularly in the United States. As of 4 June 2020, more than 1.8 million confirmed cases and over 108 thousand deaths have been reported in this country. Few studies have examined nationwide modeling of COVID-19 incidence in the United States particularly using machine-learning algorithms. Thus, we collected and prepared a database of 57 candidate explanatory variables to examine the performance of multilayer perceptron (MLP) neural network in predicting the cumulative COVID-19 incidence rates across the continental United States. Our results indicated that a single-hidden-layer MLP could explain almost 65% of the correlation with ground truth for the holdout samples. Sensitivity analysis conducted on this model showed that the age-adjusted mortality rates of ischemic heart disease, pancreatic cancer, and leukemia, together with two socioeconomic and environmental factors (median household income and total precipitation), are among the most substantial factors for predicting COVID-19 incidence rates. Moreover, results of the logistic regression model indicated that these variables could explain the presence/absence of the hotspots of disease incidence that were identified by Getis-Ord Gi* (p < 0.05) in a geographic information system environment. The findings may provide useful insights for public health decision makers regarding the influence of potential risk factors associated with the COVID-19 incidence at the county level.


Introduction
Novel coronavirus disease (COVID-19) has rapidly spread worldwide, becoming a global health threat [1]. The disease was first identified in Wuhan, China, and continued to spread out across the world [2]. According to the World Health Organization [3], as of 4 June 2020, there have been more than 6.4 million confirmed cases and over 380 thousand deaths worldwide. These statistics have surpassed the number of deaths and cases for Middle East respiratory syndrome (MERS) and severe acute respiratory disorder (SARS) since their outbreaks [4]. The pandemic has directly impacted the economy, society, and healthcare systems. According to the International Monetary Fund [5], global economic growth in the year 2020 is estimated to be -3.0%, compared to +2.9% in 2019. The United Nations predicts that the pandemic can continue to adversely impact societies with perpetual disease spread due to improper policy interventions [6].
Although the United States is ranked number one in the global health security index [7], it is the leading country in the number of confirmed cases and deaths globally [8]. As of 4 June 2020, there and terrain slope), and demographic (such as proportions of age groups, race, gender, and access to primary care) factors were prepared at the county level and were used as explanatory variables. To avoid reiteration, a complete description of the used variables has been provided in Mollalo et al. [17].
In addition to the above explanatory variables, which were also used in the study of Mollalo et al. [17], age-adjusted mortality rates of several diseases were incorporated, including infectious diseases (i.e., TB, HIV/AIDS, hepatitis, and lower respiratory infection), cardiovascular diseases (i.e., cerebrovascular disease, hypertensive heart disease, ischemic heart disease, cardiomyopathy and myocarditis, atrial fibrillation and peripheral vascular disease), chronic respiratory diseases (i.e., COPD, asthma, interstitial lung disease, and pulmonary sarcoidosis), cancer (i.e., pancreatic, gallbladder and biliary tract, mesothelioma, Hodgkin lymphoma, leukemia, tracheal, bronchus, and lung cancer), and substance use disorders (i.e., drug and alcohol use). The data were retrieved from the University of Washington Global Health Data Exchange (http://ghdx.healthdata.org/us-data) and joined to the preexisting database. All data were collected and prepared at the county level and are publicly available. A list of all variables can be found in the Supplementary Materials.

Spatial Analysis
We examined the geographic distribution of the COVID-19 incidence rate using global and local indices. The global Moran's index [32,33] was used to identify the overall pattern (random, clustered, or dispersed) of disease incidence rate using the following formula: where C i and C j are the deviations of COVID-19 incidence rates from the mean incidence rate for county i and county j, respectively; w ij is the spatial weight between county i and county j, which is non-zero when the counties are neighbors (i.e., share borders); and n is the total number of counties. The value of I ranges between −1 and +1. The values close to 0 indicate random distribution (null hypothesis), while values close to +1 and −1, respectively, indicate positive and negative spatial autocorrelations [34,35].
As the global Moran's index is unable to identify the location of hotspots [35], Getis-Ord G i *, statistics developed by Getis and Ord [36] were used to identify the hotspots of COVID-19 incidence rates (p < 0.05) as follows [37]: The positive and high value of G * i indicates a more intense clustering of high values (hotspot(s)). The output of the G * i statistic was mapped in ArcGIS 10.7 (Esri, Redlands, CA, USA) to locate the hotspots of COVID-19 incidence rates.

Feature Selection
The presence of a relatively large number (n = 57) of potentially relevant variables can create a technical problem and a theoretical discrepancy, which can in turn decrease the generalizability of the neural networks [38]. Therefore, we applied the Boruta algorithm [39] to identify feature importance, and ultimately chose "all-relevant" important features [40]. This algorithm is a wrapper around the Random Forest classification algorithm and is implemented in the "Boruta" package in R.
To determine important and unimportant features, this algorithm creates random shadow variables and runs a random forest classifier on the set of original and shadow variables. Based on the results of a statistical test (using z-scores), the algorithm iteratively removes the variables that have lower z-scores compared to the shadow variables [39]. After performing the Boruta feature selection algorithm and also Pearson's correlation analysis on the training dataset, important and less correlated (r < 0.7) variables were identified and selected as input variables in the neural networks.

Artificial Neural Networks
Artificial neural networks (ANNs) are computational structures that can learn the relationship between a set of input and output variables through an iterative learning process. These networks use simple computational operations such as addition and multiplication, yet they are capable of solving complex, non-linear problems [41][42][43]. Once a network is properly trained, it can be used to predict a variable of interest based on an independent (holdout) dataset, usually with minimal modifications [44].
The main components of ANNs are neurons that are organized in layers and are fully connected to the next layer by a set of weights (edges). Each ANN consists of one input layer, one output layer, and at least one hidden layer. The simplest form of ANN is called a perceptron, first introduced by Rosenblatt [45], which is the building block of neural networks. In a perceptron, each input is multiplied by a corresponding weight and then aggregated by a mathematical function called "activation of the neuron." Another function then computes the output. ANNs are a set of layers that are created by stacking perceptrons. For instance, if the inputs to the ith perceptron in a network are denoted by x 1i , . . . , x ni , assuming that a summation function is used to calculate the outputs (denoted by z i ), we will have [44]: where n is the number of inputs; m is the number of neurons in the current layer; w ij is the weight of the jth neuron (jth input to the ith cell), and b i is a bias term. In matrix form, z i can be simplified to: where Given a specific loss function, the perceptron can reach better estimates of the output values by adjusting the weights and bias terms through an iterative process referred to as error-correction learning. This process calculates the "errors" using observed and estimated values and "corrects" network parameters based on those errors. Given the estimated value of the network output at iteration n, (i.e., d n ), and the observed output value y n , a loss term is defined by [46]: where Loss is a function of d n and y n , which gives a measure of the difference between observed and estimated output values and is defined based on the type of problem. This Loss term can be used locally at each neuron to update the weights of the network (in that neuron) using gradient descent learning: where, at iteration n, w ij is the weight from neuron j to neuron i, η is the step size, and ∂w ij (n) is the partial derivative (gradient) of Loss with respect to w ij .
Step size is one of the (hyper) parameters of a network and can be optimized by trial and error. A similar procedure is used to update bias terms. The activation function is a non-linear function applied to each neuron to transfer its values into a known range, for instance, [−1, 1] or [0, 1]. The most common activation functions in ANNs are rectified linear unit (ReLU), sigmoid, and hyperbolic tangent (tanh) [47]. The summation term in Equation 4 acts as an activation function for the perceptron.
In this study, the performance of multilayer perceptron (MLP) neural networks in modeling the disease incidence is investigated across the continental United States. MLP is a variant of the (single) perceptron model explained above and is one of the most popular classes of feedforward ANNs, with one or more hidden layers between the input and output layers [48]. MLP is used in supervised learning tasks for classification or regression. Figure 1 represents the topology of the MLP neural network. In this regression study, we employed MLP with 1 and 2 hidden layers. The "Neuralnet" package in R was used to train the MLP.

Int. J. Environ. Res. Public Health 2020, 17, x FOR PEER REVIEW 5 of 13
The activation function is a non-linear function applied to each neuron to transfer its values into a known range, for instance, [−1, 1] or [0, 1]. The most common activation functions in ANNs are rectified linear unit (ReLU), sigmoid, and hyperbolic tangent (tanh) [47]. The summation term in Equation 4 acts as an activation function for the perceptron.
In this study, the performance of multilayer perceptron (MLP) neural networks in modeling the disease incidence is investigated across the continental United States. MLP is a variant of the (single) perceptron model explained above and is one of the most popular classes of feedforward ANNs, with one or more hidden layers between the input and output layers [48]. MLP is used in supervised learning tasks for classification or regression. Figure 1 represents the topology of the MLP neural network. In this regression study, we employed MLP with 1 and 2 hidden layers. The "Neuralnet" package in R was used to train the MLP.

Model Performance
The entire dataset was randomly divided into three different categories: 1) training samples: 60% (nt = 1865) of data used for developing the models; 2) cross-validation samples: 15% (nc = 466) of data used to fine-tune network weights and to avoid overfitting; 3) holdout samples: 25% (nh = 777) of data used to test the accuracy and generalizability of the models. The same partitioned data were used for all models for the purpose of comparison. The process of training models stopped at earlier stages to avoid overfitting. The performances of neural networks in predicting COVID-19 cumulative incidence rate (output) based on selected variables (inputs) were compared to each other, and to the linear regression model as a baseline on holdout samples. We used three different evaluation measures for accuracy assessments: root-mean-square error (RMSE), mean absolute error (MAE), and the correlation coefficient between observed COVID-19 incidence rate and model predictions (r). In

Model Performance
The entire dataset was randomly divided into three different categories: 1) training samples: 60% (n t = 1865) of data used for developing the models; 2) cross-validation samples: 15% (n c = 466) of data used to fine-tune network weights and to avoid overfitting; 3) holdout samples: 25% (n h = 777) of data used to test the accuracy and generalizability of the models. The same partitioned data were used for all models for the purpose of comparison. The process of training models stopped at earlier stages to avoid overfitting. The performances of neural networks in predicting COVID-19 cumulative incidence rate (output) based on selected variables (inputs) were compared to each other, and to the linear regression model as a baseline on holdout samples. We used three different evaluation measures for accuracy assessments: root-mean-square error (RMSE), mean absolute error (MAE), and the correlation coefficient between observed COVID-19 incidence rate and model predictions (r). In this study, the model with minimum error values and a higher correlation coefficient was considered as the optimal model [47]. Below are the formulae to assess the accuracies: where O i is the observed value of the COVID-19 incidence rate, P i is the predicted value by the model, and n is the number of observations on a holdout dataset. Sensitivity analysis was carried out on the optimal model to assess the contributions of variables in predicting disease incidence. Finally, vanilla logistic regression was utilized to explain the relationship of the most contributing factors obtained from sensitivity analysis and the presence/absence of hotspots identified by Getis-Ord G i *.

Results
Results of spatial analysis with Global Moran's I indicated that the distribution of COVID-19 incidence rate in the continental United States is clustered (Index: 0.36, z-score: 34.75, p < 0.0001), rejecting the null hypothesis (random distribution). Moreover, Getis-Ord G i * could identify the location of hotspots of disease incidence rates (Figure 2). In total, 217 counties were identified as hotspots (p < 0.05), which were mainly located in the northeastern regions of the continental United States, western Georgia, central Ohio, southern Louisiana, and northeast Iowa.
The Boruta algorithm and Pearson's correlation analysis selected 34 variables as less correlated and important variables (Supplementary Materials), which were then fed as inputs to ANNs. Overall, among the activation functions, "tanh" had slightly better performance (lowest RMSE) and thus was used in the MLPs. We systematically increased the number of neurons in the hidden layers from 10 to 30. The lowest errors were obtained with 15 neurons in the hidden layer. The performances of all employed models, in terms of RMSE, MAE, and r between observed COVID-19 incidence rate and model predictions on the holdout sample are presented in Table 1. Correlation coefficients of the models ranged between 0.30 and 0.65. The linear regression model achieved the least correlations with observed COVID-19 incidence rates (r < 0.3). On the contrary, the MLP with one hidden layer achieved the highest correlation (r = 0.65), indicating a satisfactory agreement between model predictions and observed COVID-19 incidence rates. Moreover, the accuracy assessment of the results indicated that the prediction error of the MLP with one hidden layer is less than others (RMSE = 0.72, MAE = 0.36). The worst performance was obtained by linear regression (RMSE = 0.99, MAE = 0.58), while the MLP with one hidden layer yielded better accuracy and generalization capability than other models and was thus considered as the proposed model for further analysis. Figure 3 compares the z-scores of actual and predicted values of the dependent variable for holdout samples using the one-hidden-layer MLP. The Boruta algorithm and Pearson's correlation analysis selected 34 variables as less correlated and important variables (Supplementary Materials), which were then fed as inputs to ANNs. Overall, among the activation functions, "tanh" had slightly better performance (lowest RMSE) and thus was used in the MLPs. We systematically increased the number of neurons in the hidden layers from 10 to 30. The lowest errors were obtained with 15 neurons in the hidden layer. The performances of all employed models, in terms of RMSE, MAE, and r between observed COVID-19 incidence rate and model predictions on the holdout sample are presented in Table 1. Correlation coefficients of the models ranged between 0.30 and 0.65. The linear regression model achieved the least correlations with observed COVID-19 incidence rates (r < 0.3). On the contrary, the MLP with one hidden layer achieved the highest correlation (r = 0.65), indicating a satisfactory agreement between model predictions and observed COVID-19 incidence rates. Moreover, the accuracy assessment of the results indicated that the prediction error of the MLP with one hidden layer is less than others (RMSE = 0.72, MAE = 0.36). The worst performance was obtained by linear regression (RMSE = 0.99, MAE = 0.58), while the MLP with one hidden layer yielded better accuracy and generalization capability than other models and was thus considered as the proposed model for further analysis. Figure 3 compares the z-scores of actual and predicted values of the dependent variable for holdout samples using the one-hidden-layer MLP.     We performed a sensitivity analysis to investigate the effect of each variable on the COVID-19 incidence rate using the MLP with one hidden layer. Figure 4 shows the top 10 contributing variables in order of importance. According to Figure 4, age-adjusted mortality rates of ischemic heart disease, Int. J. Environ. Res. Public Health 2020, 17, 4204 8 of 13 pancreatic cancer, leukemia, Hodgkin's disease, mesothelioma, and cardiovascular disease were among the top 10 factors with the highest relative importance for COVID-19 incidence rates, showing the potential importance of these preexisting conditions to COVID-19 incidence rate. In addition to the mortality rates, the proportion of males above 65 years old, higher median household income, precipitation, and maximum terrain slope were other important contributing variables. We performed a sensitivity analysis to investigate the effect of each variable on the COVID-19 incidence rate using the MLP with one hidden layer. Figure 4 shows the top 10 contributing variables in order of importance. According to Figure 4, age-adjusted mortality rates of ischemic heart disease, pancreatic cancer, leukemia, Hodgkin's disease, mesothelioma, and cardiovascular disease were among the top 10 factors with the highest relative importance for COVID-19 incidence rates, showing the potential importance of these preexisting conditions to COVID-19 incidence rate. In addition to the mortality rates, the proportion of males above 65 years old, higher median household income, precipitation, and maximum terrain slope were other important contributing variables.  The logistic regression model was used to explain the association between the presence/absence of the identified hotspots (p < 0.05) of COVID-19 incidence rates and the explanatory variables obtained from sensitivity analysis. The results indicate that age-adjusted pancreatic cancer mortality rates followed by median household income, precipitation, and Hodgkin's disease mortality rates could explain the positive association with the presence/absence of hotspots. Meanwhile, age-adjusted mortality rates for leukemia and cardiovascular disease, and maximum terrain slope, were negatively correlated with the occurrence of the hotspots. Table 2 summarizes the results of the logistic regression model statistics.

Discussion
COVID-19 is an RNA virus that has the potential to mutate like the flu and measles, which may have contributed to the rapid transmission of the disease [49]. Due to the successful performance of ANNs in modeling many complex relationships, we examined the applicability of ANNs in predicting COVID-19 incidence in the continental United States. One of the main advantages of ANNs over widely applied traditional statistical techniques is their predictive capabilities even when working with noisy, complex, and incomplete datasets [18], which may also be useful for modeling other viruses with complex epidemiology, such as Zika virus. This motivated us to compile a relatively broad range (n = 57) of socioeconomic, behavioral, environmental, topographic, and demographic factors together with mortality rates of preexisting conditions. The variables were either suggested by previous studies or were based on domain knowledge (rarely investigated at the county level).
Among the different combinations of network topologies and learning parameters that were examined, the MLP with one hidden layer performed better and thus was used for predictions. Sensitivity analysis of this model indicated that six age-adjusted mortality rates, including ischemic heart disease, pancreatic cancer, leukemia, Hodgkin's disease, mesothelioma, and cardiovascular disease, had substantial impacts on county-level COVID-19 incidence across the continental United States. While there is still much to discover and research, the results suggest that the disease incidence may be influenced by the fluctuance in mortality rates' distribution nationwide. Therefore, counties with elevated proportions of mortality rates of one or more chronic conditions may be more vulnerable to the higher incidence of COVID-19, when compared to other counties. As a result, it may potentially impact mortality rates during the pandemic. Lai et al. [50] indicated that comorbidities and cancer might be substantial contributors to COVID-19 mortality excess rates. They proposed that their findings are applicable to COVID-19 incidence and mortality in the United States. Hanff et al. [51] convey that COVID-19 mortality is significantly associated with comorbidities, including cardiovascular diseases (i.e., hypertension), suggesting that further studies may focus on detailed descriptions of comorbid physiological implications in COVID-19 patients, especially in the use pharmacological therapies. Alimadadi et al. [52] proposed that sophisticated analysis, such as machine learning and artificial intelligence, may aid in combating the pandemic. They also suggest that these methods may provide a better understanding of COVID-19 diagnosis, medication treatment, prevention, and hospital logistics. Although our findings seem consistent with recent studies, drawing conclusions at the individual level is not valid due to ecological fallacy, thus the findings can only be interpreted at the county level.
According to our findings, demographic (i.e., % male above 65), socioeconomic (i.e., median household income), and environmental factors (i.e., maximum terrain slope and precipitation) are influential in predicting COVID-19 incidence, indicating that the disease is not merely affected or driven by physiological conditions. The findings support and extend the previous study of Mollalo et al. [17], who utilized multiscale geographically weighted regression to explain geographic county-level variations of COVID-19 incidence in the United States. Their results indicated that counties with higher median household income and income inequalities were positively correlated with elevated disease incidence, predominantly in the tristate area. Kavanagh et al. [53] proposed that socioeconomic and demographic factors are vital to consider when addressing the pandemic as they may be associated with income disparities that exist in the United States. This may be the case of some employees that may not have the option to work remotely from home, instead, potentially resulting in more frequent exposure to the virus, contributing to further spread of the disease. The study of Qu et al. [54] emphasize the significance of examining the effects of environmental factors pertaining to COVID-19. Their results suggest that COVID-19 may be aggravated by air pollutants (i.e., airborne particulate matter), influencing infectivity. Hence, further studies on preexisting conditions, socioeconomic, demographic, and environmental impacts on COVID-19 incidence preferably at a less coarse granularity level are essential.
We acknowledge that the obtained consistency between the model and ground truth is not notably large. This is likely due to the limited knowledge about the recently emerged disease and factors that may be influential but not included in this study. Therefore, future studies should focus on improving the prediction accuracy of this initial model. Additionally, even though no significant difference is observed between the performance of MLP networks with one and two hidden layers, there may still exist complex relationships in the data that are not captured. This leads us to another limitation of this study, which is the number of training samples. With a higher amount of training data, one could apply deeper networks, i.e., networks with more than two hidden layers, and leverage the power of deep learning models. Deeper neural networks can capture potential non-linearity in the relationship between dependent and independent variables by stacking two or more hidden layers. Thus, such networks are, in general, capable of reaching higher accuracies and can reveal the nuances of the data. However, the amount of training data that was available in this study does not justify utilizing deep networks. A few possible solutions to increase the amount of data are to consider a longer temporal interval (which was not possible in this case), to incorporate data from other countries and regions, to use finer spatial units data (if available), or to use data augmentation techniques to (artificially) generate more training data and features. Moreover, although adjusted mortality rates of the diseases used in this study cannot be directly interpreted as preexisting conditions, higher mortality rates of a certain disease could allude to a higher incidence rate of it. Therefore, this study could be used to further investigate any potential correlation between disease prevalence and COVID-19 incidence.
After more than three months since the first confirmed case of COVID-19 in the US, and due to the substantial economic and social impacts of the pandemic itself and the resulting lockdown policies, discussions regarding "re-opening the country" are omnipresent. The findings of this paper could be used as one of the many guidelines needed by policymakers to decide if and where (at the county level) lockdown policies should be relaxed.

Conclusions
In this study, we examined the applicability of multi-layer perceptron artificial neural networks in modeling cumulative incidence of COVID-19 at the county-level across the continental United States. Although the employed model indicated a reasonable but not large consistency with ground-truth on holdout samples, the prediction capability of the model requires a significant improvement possibly by incorporating new related variables or perhaps by employing different machine learning algorithms. However, with the obtained accuracy, (age-adjusted) mortality rates of ischemic heart disease, pancreatic cancer, leukemia, Hodgkin's disease, mesothelioma, and cardiovascular disease together with two socioeconomic and environmental factors (median household income and total precipitation) could contribute with the disease incidence. Therefore, further studies of the factors and their associations with the disease may reveal useful information for monitoring COVID-19 outbreak.