Prediction of the Number of Patients Infected with COVID-19 Based on Rolling Grey Verhulst Models

The outbreak of a novel coronavirus (SARS-CoV-2) has recently caused a large number of residents in China to be infected with a highly contagious pneumonia. Despite active control measures taken by the Chinese government, the number of infected patients is still increasing day by day, and the trend of the epidemic is attracting widespread attention. Based on data from 21 January to 20 February 2020, six rolling grey Verhulst models were built using 7-, 8- and 9-day data sequences to predict the daily growth trend of the number of patients confirmed with COVID-19 infection in China. The results show that these six models consistently predict the S-shaped change characteristics of the cumulative number of confirmed patients, and that the daily growth decreased day by day after 4 February. The predicted results obtained by the different models are very close to one another and have very high prediction accuracy. In the training stage, the maximum and minimum mean absolute percentage errors (MAPEs) are 4.74% and 1.80%, respectively; in the testing stage, the maximum and minimum MAPEs are 4.72% and 1.65%, respectively. This indicates that the predicted results are highly robust. If the number of clinically diagnosed cases in Wuhan City, Hubei Province, China, where COVID-19 was first detected, is not counted from 12 February, the cumulative number of confirmed COVID-19 cases in China will reach a maximum of 60,364–61,327 during 17–22 March; otherwise, the cumulative number of confirmed cases in China will be 78,817–79,780.


Introduction
An epidemic of the novel coronavirus disease 2019 (COVID-19) broke out in Wuhan City, Hubei Province, China, in early 2020, and spread rapidly in China and across the world, causing tens of thousands of people to be infected with the virus. On 21 January, the prevention and control of the epidemic began at the national level, and many provinces in China launched the first-level emergency response of epidemic prevention. By 30 January, more than 7000 COVID-19 infection cases had been confirmed in China and more than 50 cases had been confirmed in countries and regions outside China, so the World Health Organization (WHO) declared the outbreak of the novel coronavirus in China a public health emergency of international concern (PHEIC). According to the latest data (http://www.nhc.gov.cn/) released by the National Health Commission of the People's Republic of China, it can be seen from Figure 1 that the number of confirmed cases increases in an S-shaped trend. Although the cumulative number is rising, the number of newly confirmed cases peaked on 4 February and then gradually decreased. The number of newly confirmed cases rose sharply on 27 January due to the increase in the number of institutions with the capacity for pneumonia detection. Figure 2 demonstrates that the number of suspected cases rose until 8 February and then decreased, while the number of deaths slowly increased and reached 2239 in total on 20 February. Moreover, the number of cured patients significantly increased and reached 18,278 on the same day. On 11 February, the new coronavirus disease was officially named coronavirus disease 2019 (COVID-19) by the WHO. The International Committee on Taxonomy of Viruses (ICTV) named the virus strain severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).
The accurate prediction of the number of patients infected with COVID-19 in China is undoubtedly of great significance for implementing prevention and control measures and carrying out economic and social activities. Because only a limited sample has been available since the outbreak of the epidemic, whereas classical statistical prediction methods need a large sample, the present research used grey system theory, which allows a model to be built with as few as four data points [1,2]. On this basis, a rolling grey Verhulst model and its derived models were established to predict the change trend of the number of cases of COVID-19 infection in China.

Literature Review
Since the outbreak of the epidemic of COVID-19, many scientists around the world have studied the causes and mechanisms of spread of the virus, along with relevant treatment programs. For example, based on the empirical analysis of a large amount of genomic data on a global scale, Chen et al. [3] first tried to explain the reasons for rapid mutation, multiple hosts and strong host adaptability of Betacoronavirus at the molecular level. Xu et al. [4] found genetic evolutionary relationships of the novel coronavirus with the severe acute respiratory syndrome (SARS) coronavirus and the Middle East respiratory syndrome (MERS) coronavirus. Furthermore, Lu et al. [5] found that this virus has specific nucleic acid sequences that are different from those of the known coronaviruses. Zhang et al. [6] studied and summarized the new methods for detecting common respiratory viruses. Koo et al. [7] found that a variety of interventions to maintain physical distance are effective in reducing the number of SARS-CoV-2 cases. Bi et al. [8] tracked COVID-19-infected persons and their close contacts, showing that detection, isolation and tracking of cases can reduce the spread of the virus and help control the epidemic. Archer et al. [9] used patients from 24 countries/regions, including Europe and the United States, as data sources to investigate, for the first time, the impact of SARS-CoV-2 infection on pulmonary complications and mortality.
To predict the number of people infected with SARS-CoV-2 accurately, scholars have established a series of models, generally divided into three categories: the basic infectious disease models in the fields of mathematics and medicine [10][11][12][13], the economic models based on traditional statistical methods [14][15][16] and the algorithm models based on machine learning [17][18][19]. However, the above prediction models have some flaws. For example, the infectious disease models rest on strong assumptions, such as the absence of super-spreaders, and are easily affected by factors such as geography; economic models and intelligent algorithms have strict requirements on data, requiring a large amount of data for training and testing to obtain relatively accurate prediction results. For a new virus such as this one, with low data availability and incomplete understanding, the above models may not be applicable. For this reason, this research selected an innovative topic as the research object, namely the prediction of the final number of infected cases, which can provide a scientific basis for government policies.
Grey system theory and grey prediction models have been widely used in many fields, such as economics [20], demand prediction [21,22] and environmental protection [23,24], since being proposed. The traditional GM(1,1) and the grey Verhulst model are the core grey prediction models. Bao et al. [25] predicted all factors of disability in middle-aged and old people and the probability of specific injuries through the grey GM(1,1) model and found that non-communicable diseases (NCDs) are still the main threats to health for the elderly. In order to accurately predict the development of human echinococcosis in Xinjiang Uygur Autonomous Region, China, Zhang et al. [26] made short-term predictions using three models, i.e., a traditional GM(1,1) model, a grey-periodic extensional combinatorial model (PECGM(1,1)) and a GM(1,1) model modified by Fourier series (FGM(1,1)). Meanwhile, based on the transmission mechanism of echinococcosis, they established a prediction model for dynamic epidemics that can effectively predict the future development trend of epidemics. The traditional GM(1,1) model is mainly applicable to sequences with strong exponential laws and can only describe a monotonic change process. However, the grey Verhulst model has strong prediction capability for non-monotonic swinging developmental sequences or saturated S-shaped sequences due to the first-order accumulated generating operation (AGO-1) of the original data. In recent years, some scholars have explored the optimization of initial conditions [27] and background values [28,29], the research of model properties [30,31] and the accuracy improvement [32] of the grey Verhulst model.
For application scenarios of the grey Verhulst model, some scholars have conducted research based on real socioeconomic systems, which to some extent reflects the effectiveness and superiority of the grey Verhulst model in comparison with the traditional model. In order to improve the accurate prediction of short-term traffic speed and travel time, Bezuglov and Comert [33] utilized the GM(1,1) model, the GM(1,1) model modified by Fourier error and the grey Verhulst model modified by Fourier error for prediction. The results demonstrate that the grey Verhulst model modified by Fourier error can better process sudden changes in the parameters of a traffic system sequence. Wang and Li [34] constructed a non-equal-interval grey Verhulst model and its derived model and optimized parameters of the models by using particle swarm optimization. On this basis, they verified the environmental Kuznets curve (EKC) of carbon dioxide emission in China by discussing the relationship between carbon dioxide emission and economic growth using a grey model. By building a grey Verhulst model, Wu et al. [35] predicted comprehensive air quality indexes in the Chinese cities of Beijing, Tianjin and Shijiazhuang, and the results show that the input of the government can promote improvement of air quality to some extent. Zhang et al. [36] combined the Verhulst model with the BP neural network to gain complementary advantages and improve prediction accuracy and stability. By utilizing the grey Verhulst model, Wang et al. [37] predicted the state of the iron and steel industry in 2025 based on the relationship between carbon emission of the industry and economic growth from 2001 to 2016; they also put forward relevant policy implications. In order to increase the predictive ability of the initial model, the model is changed appropriately to better adapt to the current needs. By introducing a new non-homogeneous exponential function, Zeng et al. [38] constructed a grey N-Verhulst model that overcame the defects of parameter dislocation and unreasonable initial value selection of the traditional Verhulst model.
In addition, some scholars have built grey models based on a rolling mechanism to explore the hidden useful information of the original data sequence and improve the modeling accuracy. Akay and Atak [39] proposed a grey prediction model based on a rolling mechanism to predict the total and industrial power consumption in Turkey. The results demonstrate that the grey model based on a rolling mechanism shows greatly improved prediction accuracy. Considering the complex randomness and nonlinearity of short-term traffic flow, Xiao et al. [40] proposed a seasonal grey rolling prediction model based on the cycle truncation accumulated generation method and a rolling mechanism. Xu et al. [41] established a BR-AGM(1,1) model based on the adaptive rolling mechanism to predict greenhouse gas emissions in China and discussed the policy significance of model overfitting and the modeling process. Şahin [42] used a metabolism grey model, a nonlinear metabolism grey model and the optimized versions of the two to predict greenhouse gas emissions in Turkey. The results demonstrate that prediction accuracy of the optimized nonlinear metabolism model based on a rolling mechanism is higher.
The remainder of the paper is arranged as follows: Section 3 introduces the traditional grey Verhulst model, the derived model of the grey Verhulst model, the grey Verhulst model based on a rolling mechanism and its derived models; Section 4 presents the prediction and empirical results for COVID-19 infection in China; and the conclusions are given in Section 5.

Models and Methods
The grey Verhulst model is an effective model to describe and predict a process with a saturation state (S-type) under the condition of small samples; it is commonly used in the prediction of population, biological reproduction and product life. Under strict anti-epidemic measures in China, it is assumed that COVID-19 cannot spread indefinitely in China, so the cumulative number of confirmed cases will not increase indefinitely and will eventually converge to a saturation value. Therefore, the grey Verhulst model is suitable for modeling growth and changes in the number of virus infection cases, especially for prediction of the final value and inflection point of the number of confirmed cases. The grey Verhulst model selected in this study has been shown in multiple case studies to predict non-linear data changes with small errors [43]. Furthermore, the rolling grey Verhulst model established based on a rolling mechanism and its derived model can capture the dynamic characteristics of the future development trend of the system. To be clear, the model in this paper is only applicable to data with S-shaped growth; the mortality rate and the number of deaths largely depend on a country's medical system, population parameters and various disease characteristics, which do not conform to the assumptions of the model and may lead to large errors.

Definition 1. It is assumed that the original sequence is $X^{(0)} = (x^{(0)}(1), x^{(0)}(2), \ldots, x^{(0)}(n))$; $X^{(1)} = (x^{(1)}(1), x^{(1)}(2), \ldots, x^{(1)}(n))$ is the AGO-1 sequence of $X^{(0)}$, where $x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i)$, $k = 1, 2, \ldots, n$; and $Z^{(1)} = (z^{(1)}(2), z^{(1)}(3), \ldots, z^{(1)}(n))$ represents the generated mean sequence of consecutive neighbors of $X^{(1)}$, where $z^{(1)}(k) = 0.5\,(x^{(1)}(k) + x^{(1)}(k-1))$, $k = 2, 3, \ldots, n$.
Definition 2. If $X^{(0)}$, $X^{(1)}$ and $Z^{(1)}$ are described as in Definition 1, the following formula is obtained.
$$x^{(0)}(k) + a z^{(1)}(k) = b \left(z^{(1)}(k)\right)^{r}$$
The above formula is the basic form of the GM(1,1) power model, and
$$\frac{dx^{(1)}}{dt} + a x^{(1)} = b \left(x^{(1)}\right)^{r}$$
is the whitening equation of the GM(1,1) power model. If $X^{(0)}$, $X^{(1)}$ and $Z^{(1)}$ are as shown in Definition 1, the parameter list $\hat{a} = [a, b]^{T}$ can be calculated by using the least squares method and is shown as follows:
$$\hat{a} = (B^{T}B)^{-1}B^{T}Y, \quad B = \begin{bmatrix} -z^{(1)}(2) & \left(z^{(1)}(2)\right)^{r} \\ -z^{(1)}(3) & \left(z^{(1)}(3)\right)^{r} \\ \vdots & \vdots \\ -z^{(1)}(n) & \left(z^{(1)}(n)\right)^{r} \end{bmatrix}, \quad Y = \begin{bmatrix} x^{(0)}(2) \\ x^{(0)}(3) \\ \vdots \\ x^{(0)}(n) \end{bmatrix}$$
Definition 3. Particularly, when r = 2, then
$$x^{(0)}(k) + a z^{(1)}(k) = b \left(z^{(1)}(k)\right)^{2}$$
is the basic form of the grey Verhulst model, and
$$\frac{dx^{(1)}}{dt} + a x^{(1)} = b \left(x^{(1)}\right)^{2}$$
is the whitening equation of the grey Verhulst model.
The solution to the whitening equation of the grey Verhulst model is shown as follows:
$$x^{(1)}(t) = \frac{a x^{(1)}(0)}{b x^{(1)}(0) + \left(a - b x^{(1)}(0)\right) e^{at}}$$
By substituting the initial value $\hat{x}^{(1)}(1) = x^{(1)}(1)$ into the above formula, the corresponding time response formula of the grey Verhulst model is obtained as:
$$\hat{x}^{(1)}(k+1) = \frac{a x^{(1)}(1)}{b x^{(1)}(1) + \left(a - b x^{(1)}(1)\right) e^{ak}}$$
Finally, it needs to be reverted to get the predicted value of the original sequence:
$$\hat{x}^{(0)}(k+1) = \hat{x}^{(1)}(k+1) - \hat{x}^{(1)}(k)$$
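The estimation and time response steps of the grey Verhulst model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fit_grey_verhulst`, `time_response`, `restore`) are hypothetical, the standard least squares form of the model is assumed, and the S-shaped input series is synthetic rather than the National Health Commission data.

```python
import numpy as np

def fit_grey_verhulst(x0):
    """Estimate (a, b) in x0(k) + a*z1(k) = b*z1(k)**2 by least squares."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                      # AGO-1 sequence
    z1 = 0.5 * (x1[1:] + x1[:-1])           # mean of consecutive neighbours
    B = np.column_stack([-z1, z1 ** 2])
    Y = x0[1:]
    (a, b), *_ = np.linalg.lstsq(B, Y, rcond=None)
    return a, b

def time_response(a, b, x1_first, n_points):
    """Fitted cumulative values x1_hat(1), ..., x1_hat(n_points)."""
    k = np.arange(n_points)                 # k = 0 reproduces x1_first
    return a * x1_first / (b * x1_first + (a - b * x1_first) * np.exp(a * k))

def restore(x1_hat):
    """Difference the cumulative fit to recover the original-scale series."""
    return np.diff(x1_hat, prepend=0.0)

# Hypothetical S-shaped cumulative counts (illustrative only, not the NHC data).
t = np.arange(15)
cumulative = 1000.0 / (1.0 + 19.0 * np.exp(-0.3 * t))
daily = np.diff(cumulative, prepend=0.0)    # first entry equals cumulative[0]

a, b = fit_grey_verhulst(daily)
fitted = time_response(a, b, cumulative[0], len(t))
print("a =", a, "b =", b, "saturation a/b =", a / b)
```

For an S-shaped series both estimates are negative, and the ratio a/b approximates the saturation level toward which the fitted cumulative curve converges.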

Derivation of Derived Form of the Grey Verhulst Model
In the modeling process of the traditional grey Verhulst model, a difference equation is first built and then converted into a differential equation, namely the whitening equation. The whitening time response function is then derived through integration, and the prediction and simulation are finally conducted. This transformation inevitably leads to an inherent deviation of the grey Verhulst model (see the work of Wang et al. [30]), so this study obtained a derived model of the grey Verhulst model by referring to the traditional GM(1,1) derived model proposed by Deng [44] and the method to derive the GM(1,1) power model by Wang [45]. This derived model does not need the whitening response formula for prediction, a property that the traditional grey Verhulst model does not possess.
In order to facilitate the deduction of the derived models of the grey Verhulst model, the variable of the traditional GM(1,1) model is recorded as $y$ and the development coefficient is recorded as $a'$. Moreover, the grey action quantity is expressed as $b'$. The derived model, namely the GM(1,1,$y^{(1)}$) model, is defined as follows:
$$y^{(0)}(k) = \beta - \alpha\, y^{(1)}(k-1)$$
where $\alpha = \frac{a'}{1+0.5a'}$ and $\beta = \frac{b'}{1+0.5a'}$.

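Theorem 3.
According to this derived model, the derived model of the grey Verhulst model can be obtained as follows:
$$\hat{x}^{(1)}(k) = \frac{(1-0.5a)\,\hat{x}^{(1)}(k-1)}{(1+0.5a) - b\,\hat{x}^{(1)}(k-1)}, \quad k = 2, 3, \ldots, n$$
Proof. Based on the derived $y^{(1)}$-type GM(1,1,$y^{(1)}$) model, the following formulas can be derived.
$$y^{(0)}(2) = \beta - \alpha\, y^{(1)}(1), \quad y^{(0)}(3) = \beta - \alpha\, y^{(1)}(2), \quad \ldots, \quad y^{(0)}(k) = \beta - \alpha\, y^{(1)}(k-1)$$
By adding the above k − 1 formulas up,
$$\sum_{i=2}^{k} y^{(0)}(i) = (k-1)\beta - \alpha \sum_{i=1}^{k-1} y^{(1)}(i)$$
By simultaneously adding $y^{(0)}(1)$ on both sides of the formula, the following formula is obtained:
$$y^{(1)}(k) = y^{(0)}(1) + (k-1)\beta - \alpha \sum_{i=1}^{k-1} y^{(1)}(i)$$
By substituting $y^{(1)}(k) = \frac{1}{x^{(1)}(k)}$ into the above formula, the following equation can be obtained:
$$\frac{1}{x^{(1)}(k)} = \frac{1}{x^{(1)}(1)} + (k-1)\beta - \alpha \sum_{i=1}^{k-1} \frac{1}{x^{(1)}(i)}$$
When k = 2,
$$\frac{1}{x^{(1)}(2)} = \frac{1-\alpha}{x^{(1)}(1)} + \beta$$
When k = 3, 4, . . . , n, subtracting the equation for k − 1 from the equation for k gives
$$\frac{1}{x^{(1)}(k)} = \frac{1-\alpha}{x^{(1)}(k-1)} + \beta$$
Therefore,
$$x^{(1)}(k) = \frac{x^{(1)}(k-1)}{(1-\alpha) + \beta\, x^{(1)}(k-1)}$$
Since the substitution $y^{(1)} = 1/x^{(1)}$ turns the whitening equation of the grey Verhulst model into that of a GM(1,1) model with $a' = -a$ and $b' = -b$, by substituting $\alpha = \frac{-a}{1-0.5a}$, $\beta = \frac{-b}{1-0.5a}$ into the above formula, the following formula is obtained:
$$\hat{x}^{(1)}(k) = \frac{(1-0.5a)\,\hat{x}^{(1)}(k-1)}{(1+0.5a) - b\,\hat{x}^{(1)}(k-1)}$$
that is, the derived model of the grey Verhulst model.
The flow chart of the derived grey Verhulst model is shown in Figure 3.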
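As a rough illustration of how a derived model can be evaluated without the whitening response formula, the sketch below iterates a one-step recurrence. It assumes the derived model takes the form obtained by substituting y(1)(k) = 1/x(1)(k) into the GM(1,1,y(1)) model with α = −a/(1−0.5a) and β = −b/(1−0.5a); the function name and the parameter values are hypothetical and chosen only so that the saturation level a/b equals 1000.

```python
def derived_verhulst(a, b, x1_first, n_points):
    """Iterate x1(k) = (1 - 0.5a) x1(k-1) / ((1 + 0.5a) - b x1(k-1))."""
    values = [float(x1_first)]
    for _ in range(n_points - 1):
        prev = values[-1]
        values.append((1 - 0.5 * a) * prev / ((1 + 0.5 * a) - b * prev))
    return values

# Hypothetical parameters (a < 0, b < 0), chosen so that a / b = 1000.
a, b = -0.3, -0.0003
path = derived_verhulst(a, b, x1_first=50.0, n_points=200)
print(path[:5], path[-1])
```

Under these assumptions the iterates rise monotonically and settle at a/b, which is the same saturation level implied by the whitening solution of the traditional model.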

Grey Verhulst Models with a Rolling Mechanism
When using the grey Verhulst model for modeling, the data before the real moment t = n are adopted. However, as time goes on, the development of any real socioeconomic system is accompanied by the continual entry of random disturbance factors, which affect the development of the system. A rolling mechanism can dynamically update the initial value of the data sequence and take the disturbance factors of the system into account, which has been shown to greatly improve prediction accuracy [46]. Therefore, this study introduced a rolling mechanism into the grey Verhulst model and its derived model in order to reduce the influence of uncertain disturbance factors on the grey system in the future. The modeling process is presented as follows: The traditional grey Verhulst model and its derived model established from the original data sequence $X^{(0)} = (x^{(0)}(1), x^{(0)}(2), \cdots, x^{(0)}(n))$ are used to predict the next value $\hat{x}^{(0)}(n+1)$. By supplementing this value into the original sequence and removing the earliest data point $x^{(0)}(1)$, a new sequence, $X^{(0)} = (x^{(0)}(2), x^{(0)}(3), \cdots, x^{(0)}(n), \hat{x}^{(0)}(n+1))$, is formed. This sequence is taken as the original sequence used to build the model, and the above steps are repeated for prediction and supplementation one by one. The models established according to the above steps are the grey rolling Verhulst model and its derived model. The length of the rolling sequence is expressed as L. When L is 9, the rolling modeling process of the two models is shown in Figure 4. This model can make good use of new information and obtain more accurate predicted results.
In order to compare accuracy and verify the effectiveness and reliability of the models, the absolute percentage error (APE) and mean absolute percentage error (MAPE) were used to calculate errors [47]:
$$\mathrm{APE}(i) = \left| \frac{e(i)}{x^{(0)}(i)} \right| \times 100\%, \quad \mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{APE}(i)$$
where $e(i) = x^{(0)}(i) - \hat{x}^{(0)}(i)$, in which $x^{(0)}(i)$ and $\hat{x}^{(0)}(i)$ indicate the actual value and the predicted value, respectively. The levels of the accuracy of MAPE are shown in Table 1.
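The rolling update and the two error measures can be sketched as follows. This is an illustrative sketch only: `roll_forward`, `ape` and `mape` are hypothetical names, and `predict_next` stands in for whichever model (grey Verhulst or its derived form) is refitted on each window. The toy predictor used in the usage lines is a plain linear extrapolation, not a grey model.

```python
def roll_forward(window, predict_next, steps):
    """Predict one step, append it, drop the earliest point, and repeat."""
    window = list(window)
    forecasts = []
    for _ in range(steps):
        nxt = predict_next(window)      # model is rebuilt on the current window
        forecasts.append(nxt)
        window = window[1:] + [nxt]     # window length L stays constant
    return forecasts

def ape(actual, predicted):
    """Absolute percentage error of each point, in percent."""
    return [abs((a - p) / a) * 100.0 for a, p in zip(actual, predicted)]

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = ape(actual, predicted)
    return sum(errors) / len(errors)

# Toy predictor (linear extrapolation) used only to exercise the rolling loop.
toy = lambda w: w[-1] + (w[-1] - w[-2])
print(roll_forward([1, 2, 3], toy, steps=3))
print(mape([100.0, 200.0], [110.0, 190.0]))
```

Keeping the predictor as a function argument separates the rolling logic from the model itself, so the same loop can serve both the basic and the derived model.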

Empirical Analysis
Relevant data on the numbers of confirmed cases, suspected cases, cured patients and deaths were acquired from the latest data released by the National Health Commission (http://www.nhc.gov.cn/). Using the number of patients infected with COVID-19 in China from 20 January to 20 February as the original data, empirical modeling and analysis were performed. Firstly, the rolling grey Verhulst model and its derived model were established with rolling sequence lengths of 7, 8 and 9. In accordance with the length of the rolling sequence, the original data were divided into a training set and a testing set, and the prediction accuracies of the basic model and the derived model were compared. Secondly, for each rolling sequence length, the model with the highest prediction accuracy was selected to predict the final value and inflection point of the number of confirmed cases.

Parameter Estimation
In accordance with the rolling mechanism described in Section 3.3, the rolling grey Verhulst model and its derived model were built. By replacing the earliest data with the latest ones, the prediction and supplementation were carried out successively. Although the basic model and the derived model have different analytic formulas, their parameter estimation results are identical. By using the least squares method to estimate the parameters of the two models, the parameter lists $\hat{a} = [a, b]^{T}$ are obtained when the rolling sequence lengths are 7, 8 and 9. The results are shown in Table 2.
After the parameters of the rolling grey Verhulst model and its derived model were obtained for the different rolling sequence lengths, the predicted values were calculated using the corresponding prediction formulas.

Comparison of Model Accuracy
By establishing the rolling grey Verhulst model and its derived model with rolling sequence lengths of 7, 8 and 9, data from 20 January to 20 February 2020 were predicted, and the training and testing sets were established based on the length of the rolling sequence. On the basis of ensuring that the models could accurately simulate the number of confirmed patients in the training set, the three models with the highest prediction accuracy in the testing set, one for each rolling sequence length, were selected to predict the maximum value and inflection point. The results of the training and testing sets are presented in Tables 3 and 4. As displayed in Table 3, the MAPEs of the six models in the training set are less than 10%, indicating that the six models can accurately simulate changes of the number of confirmed cases in China. The MAPEs of the six models in the testing set are also all less than 10%, suggesting that the models can accurately predict changes of the number of confirmed cases in the future. As demonstrated in Table 4, in the testing set, the prediction accuracy of the grey Verhulst model with the rolling sequence length of 7 is higher than that of its derived model; the grey Verhulst model and its derived model have MAPE values of 3.30% and 3.83%, respectively. When the rolling sequence lengths are 8 and 9, the grey Verhulst models show lower prediction accuracy than their derived models. The MAPEs of the grey Verhulst model and its derived model are 4.72% and 3.13%, respectively, when the rolling sequence length is 8, and 2.93% and 1.65%, respectively, when the rolling sequence length is 9. Therefore, the grey Verhulst model with the rolling sequence length of 7 and the derived grey Verhulst models with rolling sequence lengths of 8 and 9 were selected to predict the final value and inflection point of the number of confirmed cases in China.
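The selection rule described above, choosing for each rolling sequence length the model with the lower testing-set MAPE, can be written out directly from the reported values. The dictionary layout and variable names below are illustrative assumptions; the MAPE figures are the ones quoted in the text.

```python
# Testing-set MAPEs (%) quoted in the text, keyed by rolling sequence length.
test_mape = {
    7: {"grey Verhulst": 3.30, "derived grey Verhulst": 3.83},
    8: {"grey Verhulst": 4.72, "derived grey Verhulst": 3.13},
    9: {"grey Verhulst": 2.93, "derived grey Verhulst": 1.65},
}

# For each length, keep the model with the smallest testing-set MAPE.
selected = {length: min(scores, key=scores.get)
            for length, scores in test_mape.items()}

for length, model in selected.items():
    print(length, model, test_mape[length][model])
```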

Prediction of Final Value and Inflection Point of the Cumulative Number of Confirmed Patients
Based on the test results of the models in Section 4.2, the grey Verhulst model with the rolling sequence length of 7 and the derived grey Verhulst models with the rolling sequence lengths of 8 and 9 were used to predict the final value and inflection point of the number of confirmed cases in China. To combine the rolling mechanism with the selected models in out-of-sample rolling prediction, the data from 12 to 20 February, 13 to 20 February and 14 to 20 February (windows of 9, 8 and 7 days, respectively) were used to predict the next value, which served as the number of newly confirmed cases in the next period. Then, by removing the data on 12, 13 and 14 February, respectively, new initial sequences were established to continue the rolling operation until the predicted result for the latest day no longer changed, and this result was taken as the final value.
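The stopping rule for the out-of-sample rolling prediction, continuing until the latest prediction no longer changes, can be sketched as below. This is a hedged illustration: `roll_until_stable`, the tolerance `tol` and the step limit `max_steps` are assumptions introduced here, and the toy predictor merely produces a saturating series in place of a refitted grey Verhulst model.

```python
def roll_until_stable(window, predict_next, tol=0.5, max_steps=1000):
    """Roll the window forward until successive predictions differ by < tol."""
    window = list(window)
    for step in range(1, max_steps + 1):
        nxt = predict_next(window)
        change = abs(nxt - window[-1])
        window = window[1:] + [nxt]     # drop earliest point, append newest
        if change < tol:
            return nxt, step            # treated as the final value
    raise RuntimeError("forecast did not stabilise within max_steps")

# Toy saturating predictor: each increment is half of the previous one.
toy = lambda w: w[-1] + 0.5 * (w[-1] - w[-2])
final_value, steps_needed = roll_until_stable([0.0, 8.0], toy)
print(final_value, steps_needed)
```

A tolerance below one case is a natural choice when the series counts patients, since a change smaller than that no longer alters the rounded cumulative total.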
On 13 February, Hubei Provincial Health Commission announced on its official website that, to be consistent with the case diagnosis and classification issued by other provinces in China, the province would release the number of clinically diagnosed cases and include this number in the confirmed cases. Since then, due to the appearance of clinically diagnosed cases, the number of confirmed patients increased greatly every day. Therefore, this study predicted changes of the number of confirmed patients with and without consideration of clinically diagnosed cases. This research further calculated out-of-sample prediction accuracies on 21 and 22 February to verify the predictive ability of the models. The results of final predicted values are demonstrated in Table 5. It was found that the three models could accurately predict out-of-sample data; the specific data are presented in Table 6.
As illustrated in Figures 5-7, in the case of not considering clinically diagnosed cases, the maximum prediction values of the three models with rolling sequence lengths of 7, 8 and 9 are 60,364, 61,327 and 61,327, respectively. Under the condition of considering clinically diagnosed cases, the final prediction values are 78,817, 79,780 and 79,780 on 17, 22 and 22 March, respectively. By analyzing changes in the number of confirmed patients, it is found that the number of confirmed patients does show an S-shaped trend. Moreover, the current number of confirmed patients is already close to the final value, and the number of newly confirmed patients per day has decreased. In order to further determine the inflection point, that is, the day with the maximum number of newly confirmed patients, this research plotted the daily growth obtained with the three models, as shown in Figure 8. As shown in Figure 8, the inflection point appeared on 4 February, with 3892 newly confirmed patients. In addition, the daily growth declines rapidly, and the predicted results of the three models are basically the same, which indicates that the results are robust.
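For a single grey Verhulst curve with fixed parameters, the link between the final value and the inflection point can be made explicit. The short derivation below assumes the standard whitening equation of the grey Verhulst model with a < 0 and b < 0; in the rolling models the parameters are re-estimated at every step, so it describes each fitted curve rather than the whole rolling forecast.

$$\frac{dx^{(1)}}{dt} = -a\,x^{(1)} + b\left(x^{(1)}\right)^{2} = 0 \;\Rightarrow\; x^{(1)} = \frac{a}{b} \quad \text{(saturation value)}$$

$$\frac{d^{2}x^{(1)}}{dt^{2}} = \left(-a + 2b\,x^{(1)}\right)\frac{dx^{(1)}}{dt} = 0 \;\Rightarrow\; x^{(1)} = \frac{a}{2b} \quad \text{(inflection point)}$$

Under these assumptions the daily growth of a fitted curve peaks when the cumulative count reaches half of that curve's saturation value.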

Conclusions
Based on a rolling mechanism, the rolling grey Verhulst model and its derived models for predicting the number of patients infected with COVID-19 in China were constructed by adding the latest data and removing the earliest data. Empirical modeling and analysis were conducted by using the number of infected cases from 20 January to 20 February. Firstly, in order to ensure the stability of the prediction results, rolling sequence lengths of 7, 8 and 9 were selected for the models. The original data were divided into training and testing sets to compare the prediction accuracies of the basic model and the derived models. Secondly, for each rolling sequence length, the model with the higher prediction accuracy was selected to predict the final value and inflection point of the number of confirmed patients. The results showed that the rolling grey Verhulst model and its derived models could accurately predict the changes in the number of confirmed patients in China. The prediction accuracy of the rolling grey Verhulst model with the rolling sequence length of 7 was higher than that of its derived model, while the prediction accuracies of the rolling grey Verhulst models with rolling sequence lengths of 8 and 9 were lower than those of the derived models. Therefore, this study used the more accurate rolling grey Verhulst model for each sequence length to predict the final number of confirmed patients and the date of reaching the final number.
By predicting the final number of confirmed patients in China using the rolling grey Verhulst model, the maximum predicted numbers by the three models with rolling sequence lengths of 7, 8 and 9 were 60,364, 61,327 and 61,327, respectively, when clinically diagnosed cases were not considered.