Quantifying bias of COVID-19 prevalence and severity estimates in Wuhan, China that depend on reported cases in international travelers

Risk of COVID-19 infection in Wuhan has been estimated using imported case counts of international travelers, often under the assumption that all cases in travelers are ascertained. Recent work indicates variation among countries in detection capacity for imported cases. Singapore has historically had very strong epidemiological surveillance and contact-tracing capacity and has shown evidence of highly sensitive case detection during the COVID-19 epidemic. We therefore used a Bayesian modeling approach to estimate the imported case detection capacity of other countries relative to that of Singapore. We estimate that the global ability to detect imported cases is 38% (95% highest posterior density interval [HPDI] 22% – 64%) of Singapore’s capacity. Equivalently, an estimated 2.8 (95% HPDI 1.5 – 4.4) times the current number of imported cases could have been detected if all countries had had the same detection capacity as Singapore. Using the second component of the Global Health Security index to stratify likely case-detection capacities, we found that the ability to detect imported cases relative to Singapore is 40% (95% HPDI 22% – 67%) among high surveillance locations, 37% (95% HPDI 18% – 68%) among intermediate surveillance locations, and 11% (95% HPDI 0% – 42%) among low surveillance locations. Using a simple mathematical model, we further find that treating all travelers as if they were residents (rather than accounting for the brief stay of some of these travelers in Wuhan) can modestly contribute to underestimation of prevalence as well. We conclude that estimates of case counts in Wuhan based on assumptions of perfect detection in travelers may be underestimated by several fold, and severity correspondingly overestimated by several fold. Undetected cases are likely in countries around the world, with greater risk in countries of low detection capacity and high connectivity to the epicenter of the outbreak.

Introduction:
During the outbreak of a new virus, SARS-CoV-2, and its associated disease, COVID-19, infection in travelers has been used to estimate the risk of infection in Wuhan, Hubei Province, China, the epicenter of the outbreak 1. This approach is similar to that used for the 2009 influenza pandemic, where infections in tourists returning from Mexico were used to estimate the time-specific risk of infection (incidence or cumulative incidence) with the novel pandemic H1N1 influenza strain in Mexico (or parts thereof). The idea was that surveillance for the novel virus was not intense during the early days of the pandemic in Mexico, the source country, and that detection would be far more sensitive in travelers leaving Mexico, who would be screened when returning home as a means of preventing introductions of cases into destination countries 2,3. Reports that health systems in Wuhan are overwhelmed and many cases are not being counted have led to the use of outgoing traveler data to estimate the time-specific risk of COVID-19 in Wuhan 4. This estimate, in turn, has been used to estimate the cumulative incidence of infection by a certain date in Wuhan, and from there (often assuming exponential growth and no appreciable depletion of susceptibles) the cumulative number of cases. Two important assumptions underlie this calculation: i) that the detection of cases in the destination country has been 100% sensitive and specific, whether they are detected at the airport (prevalent cases with symptoms) or later after arrival at their destination (cases that were incubating during travel); ii) that travelers have the same prevalence of infection as the average resident of Hubei, so the prevalence inferred in travelers may be directly applied in Hubei. Here we consider the extent to which these two assumptions are justified.
We conclude that the first assumption is strongly inconsistent with observed data, resulting in potentially substantial underestimates of prevalence in Hubei and corresponding overestimates of case-severity measures that are normalised by case counts.
We previously showed that the relationship between the number of travelers from Wuhan to each international destination and the number of imported cases detected in that destination varied between locations. On average, for countries presumed to have high surveillance capacity, one imported case reported over the period 8th January to 4th February was associated with each additional 14 passengers/day of historical travel volume 5. However, there was variation around this average. Among countries with substantial travel volume, Singapore showed the highest ratio of detected imported cases to daily travel volume, a ratio of one case per 5 daily travelers. Singapore is historically known for exceptionally sensitive detection of cases, for example in SARS 6, and has had extremely detailed case reporting during the COVID-19 outbreak 7. We therefore use Singapore in this work as the upper limit of case detection capacity, and we estimate the detection capacity of other locations relative to Singapore.
Regarding the second assumption, we demonstrate that the point prevalence of infection may be lower in visitors who have stayed only briefly in the source population (Wuhan) than in residents. All else equal, the discrepancy between resident and visitor prevalence is most pronounced if the visitors' durations of stay are shorter, for slower-growing epidemics, and for longer durations of detectable infection; conversely, visitors are more similar to residents in their prevalence of infections if they stay longer, if the epidemic is growing faster, and if the duration of detectable infection is shorter. We quantify this discrepancy as a function of these features using a simple model.

medRxiv preprint doi: https://doi.org/10.1101/2020.02.13.20022707; this version posted February 18, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Methods:
Data
From a total of 195 worldwide locations (reflecting mainly countries without taking any position on territorial claims), we included n = 194, which excludes the epicenter China. Data on imported cases aggregated by location were obtained from the WHO technical report dated 4th February 2020 1 (a zero case count was assumed for all locations not listed). We used case counts up to 4th February, because after this date the number of exported cases from Hubei province drops rapidly 1, likely due to the Hubei-wide lockdowns.
We defined imported cases as those with known travel history from China (of those, 83% had travel history from Hubei province, and 17% from unknown locations in China 1). Estimates of daily air travel volume were obtained from Lai et al. 8. We treat Singapore as a special case for surveillance of COVID-19 and assign it its own category of highest-achieved surveillance. This study did not include human subjects and used publicly available data; therefore no ethical approval was required.

Estimating detection probability relative to Singapore
We consider the detection of 18 cases by 4th February 2020 in Singapore 1 to reflect the highest surveillance capacity among all locations, and estimate the probability of detection in other countries relative to Singapore according to the following model. We model the case detection across $i = 1, \ldots, n$ worldwide locations, where the $n = 194$ locations are indexed with Singapore being $i = 1$, followed by the rest of the locations in order of decreasing GHS 2 index. We assume that the observed case count across the $n$ locations follows a Poisson distribution, and that the expected case count is linearly proportional to the daily air travel volume and to a random variable $\theta_{\text{level}}$ reflecting the capacity of detecting cases relative to Singapore:

$$y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \lambda_i \propto (\text{daily air travel volume})_i \times \theta_{\text{level}},$$

where $y_i$ denotes the reported case count in the $i$-th location and $\lambda_i$ denotes the expected case count in the $i$-th location [...]

In 2009, during the influenza pandemic which originated in Mexico, it was assumed that most travelers leaving Mexico were tourists, or other temporary visitors, with relatively short stays in Mexico, and that the risk that they were infected represented a cumulative hazard over the period of their stay 2,3. The basic assumption was that short-term visitors faced the same hazard of infection as residents of Mexico, but, given the shorter stay, they had a somewhat lower prevalence of infection when returning to their home country. Many estimates in 2019-20 for COVID-19 have instead made the assumption of equal prevalence in travelers leaving Wuhan and in residents, which is equivalent to assuming either that all travelers are Wuhan residents, or that all visitors had stayed long enough during the epidemic that their prevalence was similar to that of residents.
To quantify the difference between these two scenarios (assuming that all travelers are short-term visitors versus assuming that all travelers are residents or long-term visitors), we considered a simple model [...] not in the intensity of exposure. Under these assumptions, the ratio of prevalence in visitors to that in residents, which we call $V$, would be [...]. Once the number of cases is substantial, this term can be well approximated as

$$V \approx 1 - e^{-(r+\gamma)d},$$

where $r$ is the epidemic growth rate, $\gamma$ is the recovery rate (the inverse of the duration of detectable infection), and $d$ is the duration of stay. We plot this approximation of $V$ given doubling times aligned with 4, a range of durations of detectable infection and a range of lengths of stay 2 (Figure 2).
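The approximation $V \approx 1 - e^{-(r+\gamma)d}$ is easy to evaluate numerically. The following is a minimal sketch rather than the code behind Figure 2: the function name is hypothetical, and the 6-day doubling time and 7-day duration of detectable infection are illustrative assumptions, not values taken from the text.

```python
import math


def visitor_to_resident_ratio(doubling_time_days, detectable_days, stay_days):
    """Approximate prevalence ratio V = 1 - exp(-(r + gamma) * d).

    r is the exponential growth rate implied by the doubling time,
    gamma is the inverse of the duration of detectable infection,
    and d is the visitor's length of stay (all in days).
    """
    r = math.log(2) / doubling_time_days
    gamma = 1.0 / detectable_days
    return 1.0 - math.exp(-(r + gamma) * stay_days)


# Illustrative (assumed) parameters: 6-day doubling time, 7 days detectable.
DOUBLING_TIME = 6.0
DETECTABLE = 7.0

ratios = {
    stay: visitor_to_resident_ratio(DOUBLING_TIME, DETECTABLE, stay)
    for stay in (3, 7, 14)
}

for stay, v in ratios.items():
    print(f"stay of {stay:2d} days: V = {v:.2f}")
```

With these assumed values, a 3-day stay gives a ratio of roughly 0.54 and a 7-day stay roughly 0.84. Shortening the stay, slowing the epidemic, or lengthening the duration of detectable infection all push $V$ toward zero.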

Results:
We estimate that the global ability to detect imported cases is 38% (95% HPDI 22% – 64%) of Singapore's capacity [...]

We further find that the prevalence ratio between temporary visitors and residents approaches 1 as the epidemic growth rate, the duration of stay, and the recovery rate increase, and that it approaches zero for short duration of stay, long duration of infection, and slower epidemic growth (Figure 2).

[...] countries has relied largely on symptoms and travel history, so the number of asymptomatic or low-severity cases missed by such a strategy is unknown.
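The logic of estimating detection capacity relative to Singapore can be illustrated with a small grid-based posterior calculation. This is a sketch under strong simplifying assumptions and not the authors' Bayesian implementation: the proportionality constant `beta` is fixed from the reference location instead of being estimated jointly, the prior on theta is assumed uniform on (0, 1], the reference travel volume of 90 passengers/day is back-calculated from "18 cases" and "one case per 5 daily travelers", and the comparison group (100 cases, 1400 passengers/day) is synthetic, chosen only to mimic one case per 14 daily passengers. All function names are hypothetical.

```python
import math


def log_poisson_pmf(y, lam):
    """Log probability of y events under a Poisson distribution with mean lam."""
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def relative_detection_posterior(y, x, y_ref, x_ref, grid_size=2000):
    """Grid posterior for theta in y ~ Poisson(beta * x * theta).

    Assumptions: uniform prior on (0, 1]; beta is a plug-in estimate
    from the reference location, which is taken to have theta = 1.
    """
    beta = y_ref / x_ref
    thetas = [(k + 0.5) / grid_size for k in range(grid_size)]
    log_w = [log_poisson_pmf(y, beta * x * t) for t in thetas]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    return thetas, [w / total for w in weights]


def posterior_mean(thetas, probs):
    return sum(t * p for t, p in zip(thetas, probs))


def hpd_interval(thetas, probs, mass=0.95):
    """Range of the highest-density grid points holding `mass` (unimodal case)."""
    order = sorted(range(len(thetas)), key=lambda k: probs[k], reverse=True)
    kept, acc = [], 0.0
    for k in order:
        kept.append(thetas[k])
        acc += probs[k]
        if acc >= mass:
            break
    return min(kept), max(kept)


# Reference location: 18 cases; 90 passengers/day is back-calculated (assumed).
Y_REF, X_REF = 18, 90.0
# Synthetic comparison group (NOT real data): one case per 14 daily passengers.
Y_GROUP, X_GROUP = 100, 1400.0

thetas, probs = relative_detection_posterior(Y_GROUP, X_GROUP, Y_REF, X_REF)
mean = posterior_mean(thetas, probs)
low, high = hpd_interval(thetas, probs)
print(f"relative detection: mean {mean:.2f}, 95% HPD {low:.2f} to {high:.2f}")
```

Because `beta` is fixed, this sketch ignores the uncertainty in Singapore's own case count, so its interval is narrower than one from a full joint model would be.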
The second assumption we tested is that the true prevalence in travelers is similar to that of residents. It may differ for either of two reasons, one of which we attempt to quantify here. It could be lower if those who travel are, for some reason, less well integrated into the social mixing that produces infection, for example if they tend to have stayed in certain parts of the city or in hotels; more generally, differences in mixing could conceivably increase or decrease prevalence in travelers relative to residents. Here we quantify a second difference, which is that some travelers (whom we refer to as visitors) will have been in the city only for a short time and had less exposure to the infection than residents. This effect, we find, is most pronounced when the epidemic is growing slowly, when the visitors have stayed only briefly, and when the duration of detectable infection is long. We find that for plausible parameters for COVID-19, prevalence in visitors staying only 3 days could be as little as half that of residents, but for longer stays of over a week the visitor prevalence should be 80% or more of that of residents. Assuming that the traveler population is a mix of visitors of various durations and residents, this suggests that underestimation of source population prevalence due to the presence of short-stay visitors could be appreciable but more modest than the effect of imperfect detection.
These findings, that detected cases in travelers likely underrepresent the source population prevalence, have two important implications for the public health response to SARS-CoV-2. First, they bear on approaches to case burden and severity estimation that use cases in travelers to impute cases in Wuhan, which are then compared (for severity estimation) against deaths in Wuhan. If the true number of cases in travelers is higher than previously thought, this implies more cases in Wuhan and a larger denominator, resulting in reduced estimates of severity compared to estimates assuming perfect detection in travelers. Future studies should account for our evolving understanding of detection capacity when estimating case numbers and severity in the source population on the basis of traveler case numbers.
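The effect on severity can be written out as a short calculation. This is a sketch under simplifying assumptions, not a formula given in the text: $D$ (deaths in Wuhan), $C_{\text{imputed}}$ (cases in Wuhan imputed under perfect detection in travelers), and $k$ (the factor by which true cases exceed imputed cases) are symbols introduced here for illustration.

```latex
\[
\text{severity}_{\text{adjusted}}
  = \frac{D}{k \, C_{\text{imputed}}}
  = \frac{1}{k} \cdot \frac{D}{C_{\text{imputed}}}
  = \frac{\text{severity}_{\text{naive}}}{k}
\]
```

If $k$ were taken to be about 2.8, the multiplier estimated for imported cases under Singapore-level detection, the naive severity estimate would be roughly 2.8 times too high. Imperfect detection in Singapore itself, or a large share of short-stay visitors among travelers, would make $k$ larger still.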
Second, the scenario where the virus has been imported from Wuhan and remained undetected in various worldwide locations is a plausible one, at least until the city lockdown (23rd January 2020), and one might speculate that detection capacity remained limited beyond this period as travelers infected elsewhere in China continued to leave China. Based on our model, the risk of undetected circulation correlates with air travel connectivity and (inversely) with outbreak detection capacity, but undetected importation could have happened in virtually any location worldwide, leading to the potential risk of self-sustained transmission, which may be an early stage of a global pandemic.