Accuracy of deep learning based computed tomography diagnostic system of COVID-19: a consecutive sampling external validation cohort study

Objectives: Ali-M3, an artificial intelligence, analyses chest computed tomography (CT) and detects the likelihood of coronavirus disease (COVID-19) in the range of 0 to 1. It demonstrates excellent performance for the detection of COVID-19 patients with a sensitivity and specificity of 98.5 and 99.2%, respectively. However, Ali-M3 has not been externally validated. Our purpose is to evaluate the external validity of Ali-M3 using Japanese sequential sampling data. Methods: In this retrospective cohort study, COVID-19 infection probabilities were calculated using Ali-M3 in 617 symptomatic patients who underwent reverse transcription-polymerase chain reaction (RT-PCR) tests and chest CT for COVID-19 diagnosis at 11 Japanese tertiary care facilities, between January 1 and April 15, 2020. Results: Of 617 patients, 289 patients (46.8%) were RT-PCR-positive. The area under the curve (AUC) of Ali-M3 for predicting a COVID-19 diagnosis was 0.797 (95% confidence intervals [CI]: 0.762-0.833) and goodness-of-fit was P = 0.156. With a cut-off of probability of COVID-19 by Ali-M3 diagnosis set at 0.5, the sensitivity and specificity were 80.6% and 68.3%, respectively, while a cut-off of 0.2 yielded a sensitivity and specificity of 89.2% and 43.2%, respectively. Among 223 patients who required oxygen support, the AUC was 0.825 and sensitivity at a cut-off of 0.5 and 0.2 were 88.7% and 97.9%, respectively. Although the sensitivity was lower when the days from symptom onset were few, sensitivity increased for both cut-off values after 5 days. Conclusions: Ali-M3 was evaluated by external validation and shown to be useful to exclude a diagnosis of COVID-19.


Title
Accuracy of deep learning based computed tomography diagnostic system of COVID-19: a consecutive sampling external validation cohort study

Abstract:
Objectives: Ali-M3, an artificial intelligence, analyses chest computed tomography (CT) and detects the likelihood of coronavirus disease  in the range of 0 to 1. It demonstrates excellent performance for the detection of COVID-19 patients with a sensitivity and specificity of 98.5 and 99.2%, respectively. However, Ali-M3 has not been externally validated. Our purpose is to evaluate the external validity of Ali-M3 using Japanese sequential sampling data.

Methods:
In this retrospective cohort study, COVID-19 infection probabilities were calculated using Ali-M3 in 617 symptomatic patients who underwent reverse transcription-polymerase chain reaction (RT-PCR) tests and chest CT for COVID-19 diagnosis at 11 Japanese tertiary care facilities, between January 1 and April 15, 2020. and goodness-of-fit was P = 0.156. With a cut-off of probability of COVID-19 by Ali-M3 diagnosis set at 0.5, the sensitivity and specificity were 80.6% and 68.3%, respectively, while a cut-off of 0.2 yielded a sensitivity and specificity of 89.2% and 43.2%, respectively. Among 223 patients who required oxygen support, the AUC was 0.825 and sensitivity at a cut-off of 0.5 and 0.2 were 88.7% and 97.9%, respectively. Although the sensitivity was lower when the days from symptom onset were few, sensitivity increased for both cut-off values after 5 days.

Results
Conclusions: Ali-M3 was evaluated by external validation and shown to be useful to exclude a diagnosis of COVID-19.
The copyright holder for this this version posted November 18, 2020. ; Introduction A proper triage system is necessary during this coronavirus disease (COVID-19) pandemic era, [1,2] as improper triage systems may disadvantage patients and lead to wastage of personal protective equipment (PPE) and hospital infections through admission of infected patients to facilities, causing collapse of the medical system. Although reverse transcription-polymerase chain reaction (RT-PCR) tests have been developed, the delay in waiting for RT-PCR results can hamper proper triage.
Computed tomography (CT) is a fast and useful diagnostic tool. Some studies have reported the characteristic findings on chest CT images of COVID-19 patients. [3][4][5][6][7][8] Use of chest CT images by radiologists has shown high diagnostic performance for COVID-19. However, even radiologists' interpretations vary largely, because of the influence of their habituation in the interpretation of COVID-19 CT images. [9] Therefore, using CT as a diagnostic tool in general clinical practice is difficult in the current situation.
Diagnostic support systems using artificial intelligence (AI) have the potential to replace many of the routine detection, characterisation, and quantification tasks currently performed by radiologists using cognitive ability. [10] AI can prevent the variability of diagnosis from inter-and intra-reader variability. In China, where COVID-19 infection originated, many AI systems were developed for establishing a diagnosis of COVID-19 based on chest CT images. [11][12][13][14][15] One such system, Ali-M3, can detect the likelihood of COVID-19 in the range of 0 to 1, and has excellent accuracy for the detection of COVID-19 with an accuracy, sensitivity, and specificity of 99.0, 98.5, and 99.2%, respectively. Although Ali-M3 has excellent accuracy, it was developed in a virtual population, which consisted of 3,067 examinations for COVID-19; 1,996 for community-acquired pneumonia; and 1,975 for non-pneumonia, which was different from the general population and its accuracy could be overestimated. [16] To use Ali-M3 to diagnose exclusion of COVID-19, its external validity must be evaluated based on the 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.15.20231621 doi: medRxiv preprint distribution of diseases in a real-world setting. We here conducted a retrospective cohort study to evaluate the external validity of Ali-M3 using the Japanese sequential sampling data of patients who underwent RT-PCR tests and chest CT for diagnosis of COVID-19.

Study design
This retrospective cohort study consisted of 11 Japanese tertiary care facilities that provided treatment for COVID-19 in each area. We partially followed the guidelines of the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis Statement to plan and report this study (Supplemental Table 1). [17] The institutional review board of each facility approved the study and the need to obtain written informed consent was waived.

Participants
We included patients who underwent both RT-PCR examinations and chest CT for the diagnosis of COVID-19. The potentially eligible participants were identified on the advice of physicians that both RT-PCR test and chest CT be obtained when the patients presented with symptoms or were suspected of having COVID-19. The detailed information of the inclusion criteria is shown in Supplemental Table 2.
We selected patients by using consecutive sampling methods between January 1 and April 15, 2020. The RT-PCR results were extracted from the patients' medical records at each facility. Patients were excluded when the time-interval between chest CT and the first RT-PCR assay was longer than 7 days.
All available data on the database were used to maximize the power and generalizability of the results.

Chest CT protocols
All images were obtained on one of five types of CT systems, with the patient in the supine position.
The details of scanning parameters and systems are shown in Supplemental All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Image analysis
We used a three-dimensional deep learning framework for the detection of COVID-19 infections. [16] The details of this model are included in Appendix 1. The learning of Ali-M3 was stopped before our evaluation. We set a cut-off point for the model output at 0.5, because this cut-off point was used during the developing stage. The investigators who entered the CT images data into Ali-M3 were blinded to the RT-PCR results.

Reference standard
The diagnosis of COVID-19 was established by the RT-PCR test, which detected the nucleic acid of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the sputum, throat swabs, and secretions of the lower respiratory tract samples. [18] We established the RT-PCR tests as the main reference standard. Although the findings of chest CT, interpreted by radiologists, were included as the reference standard in the derivation study, we did not include it as the reference standard in the present study.

Statistical analysis
Statistical analysis was performed using R statistical software, version 3.6.3 (R Foundation for Statistical Computing). Data analysis was performed in a complete-case dataset. Continuous variables are presented as means (standard deviation) and categorical variables are presented as counts and percentages. Using the RT-PCR results as reference, the area under the curve (AUC), sensitivity, specificity, positive-predictive value, and negative-predictive value of the likelihood of COVID-19 as derived from the Ali-M3's analysis of chest CT imaging were calculated. A 95% confidence interval (CI) was determined by the Wilson score method. The goodness-of-fit was calculated using the Le Cessie-Van Houwelingen normal test statistic for the unweighted sum of squared errors.  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Moving cut-off point
The objective of this study was to determine whether this AI model could be used as a screening tool for COVID-19 in the real world. In a clinical situation, physicians require an accurate diagnosis of COVID-19; hence, they insist more on sensitivity than on specificity. For sensitivity analysis, we moved the cut-off point and observed sensitivities and specificities to minimize overlooking COVID-19 patients.

Simulation of imperfect reference
In the main analysis, we assumed RT-PCR as the perfect reference (100% sensitivity and 100% specificity). However, in the real world, RT-PCR is not the perfect reference since the sensitivity of the RT-PCR test was estimated at 0-80%. [19] To evaluate the effect of this imperfect reference, we calculated the sensitivity, specificity, and AUC of Ali-M3 using the methods and R code described in the Supplemental Material when varying the sensitivity, but fixing the specificity of RT-PCR at 100%. [20] 3. Effect of the number of days after symptom onset The number of days that have passed since the onset of symptoms affects the performance of antibody and RT-PCR tests in COVID-19 patients. [19,21] However, it was not clear if this could affect CT images in COVID-19 patients. Sensitivity and specificity were calculated for a group of patients whose symptom onset date was known, among those were those with 14 days or more, as well as those at every 2 days from 0 to 13 days after symptom onset.

Effect of reconstruction slice
The thickness of the reconstruction slice can affect the diagnostic performance. [23] We separated the dataset for the main analysis by a 3-mm reconstruction slice thickness to account for the fissure in our data set between 3 mm and 4 mm and calculated the performance of the model in each dataset.  Table 1. Overall, 289 patients (46.8%) were diagnosed with COVID-19 using the RT-PCR test. Thirteen patients need more than two RT-PCR tests before being diagnosed with COVID-19. Major symptoms were dry cough (37.6%), fever (33.5%), and sore throat (25.8%).

Model performance
The performance of the confidence score after validation among symptomatic patients is shown in

Sensitivity analysis
1. Moving cut-off point Table 2 shows the relationship between cut-off points for the confidence score and performance.
The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.15.20231621 doi: medRxiv preprint 8 2. Simulation of imperfect reference Figure 3 shows the sensitivity and specificity, with the assumption of imperfect reference of RT-PCR test. The AUC was 0.865. When the cut-off point was set at 0.5, using the Youden Index, the sensitivity and specificity were 80.6% and was 81.3%, respectively. When the cut-off point was set at 0.2, the sensitivity and specificity were 89.2% and 51.9%, respectively.

Effect of number of days after symptom onset
Of all symptomatic patients, 600 patients (97.2%) were included in this sensitivity analysis. Of these, the number of days after the onset of symptoms was not known for 17 patients. Figure 4 shows the relationship between test performance and the number of days since the onset of symptoms when the confidence score of Ali-M3 was set at 0.5 or 0.2. Sensitivity values started at 0.7 and increased up to 1.0 until 10-11 days in both cases. However, specificity values remained similar across the strata. The sensitivity increased over 0.9 when the confidence score was set at 0.2 than when the confidence score was set at 0.5.

Changing the eligibility criteria
The effects of changing the criteria for patient eligibility are shown n Figure 5.

Dataset focused on asymptomatic patients
There were 86 asymptomatic patients (RT-PCR positive: 37). Using these patients only, the AUC was 0.623. When the cut-off point was 0.5, the sensitivity and specificity were 51.4% and 59.2%, respectively. When the cut-off point was 0.2, the sensitivity and specificity were 44.9% and 73.0%, respectively.
The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.15.20231621 doi: medRxiv preprint 9 5. Effect of the thickness of the CT reconstruction slice of CT There were 320 patients (RT-PCR positive: 121) with a reconstruction slice thickness of under 3 mm When considering these patients only, the AUC was 0.825. When the cut-off point was set at 0.5, the sensitivity and specificity were 82.6% and 69.7%, respectively. When the cut-off point was set at 0.2, the sensitivity and specificity were 94.2% and 51.5%, respectively. In patients with a reconstruction slice thickness over 3 mm, the AUC was 0.789 (Supplement Figure 1)

Discussion
In this external validation study, our results indicated that Ali-M3 could be useful for early triage of suspected COVID-19 patients with symptoms at a lower cut-off. In particular, higher accuracy was observed in patients with higher severity and a few days since symptom onset, and with images with a thinner reconstructed CT slice thickness.
Currently, all patients with symptoms, such as fever, are triaged as COVID-19 patients. Thus, medical practitioners must use PPE for each patient. [24] Additionally, bed zoning is essential to avoid contamination of non-infected patients. [25] On the other hand, under-triage cause hospital infections through admission of infected patients to facilities. This should continue until a definitive diagnosis is established. Since Ali-M3 is available on the cloud, the physician can receive the results immediately by sending the digital imaging and communications in medicine images from the ordinal picture archiving and communication system. When applying triage, clinicians require sufficient accuracy in terms of sensitivity, but specificity is less important. [19] The high sensitivity obtained at a cut-off of 0.2 with the AI diagnosis is useful for exclude the diagnosis of COVID-19.
The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.15.20231621 doi: medRxiv preprint level, and is currently inferior to RT-PCR tests. As the same patient sample is used, the antigen test cannot support the RT-PCR test. The RT-PCR test is currently used as a gold standard, but the sensitivity changes depending on the number of days after the onset of symptoms. [19] Therefore, for an exclusion diagnosis, multiple tests staggered over time are needed, rather than a single negative RT-PCR test. Even when this test is performed as rapidly as possible, it still requires a few days to obtain multiple test results.
On the other hand, Ali-M3 uses the configurational information of patients' lungs, and can add different information than obtained from RT-PCR, thereby complementing the drawbacks of RT-PCR.
In this study, the diagnostic accuracy at the validation stage was lower than the accuracy at the development stage. A two-gate (case-control) design was used in the development of the AI system but in the present study for evaluating the ability of Ali-M3 to assess a COVID-19 diagnosis by chest CT image, we used a single-gate (cohort) design. Although many studies have used the two-gate design in evaluation of AI for the diagnosis of COVID-19, [26] the two-gate design is generally prone to overestimation of diagnostic test results. [27] Thus, blindly using the results based on a two-gate design in a clinical situation can be inappropriate. Moreover, other factors should be considered. With the use of a two-gate design, the fact that RT-PCR is an imperfect reference standard is typically ignored. Furthermore, performing culture and tests to ascertain the true sensitivity of this test is difficult. In the present study, we simulated the diagnostic ability of Ali-M3 with consideration that the sensitivity of the reference standard was imperfect, which leads to underestimation of the specificity and AUC of Ali-M3, without distortion of the sensitivity. Furthermore, the outcomes while developing Ali-M3 and while examining its adequacy were different. Taking into account the patient flow in China, the outcomes at the development stage were set as positive cases with RT-PCR negative results and positive CT image findings. [28] This had a small effect on the sensitivity, but a large effect on the specificity. For example, if in the development stage, 33.9% of the positive patients had negative RT-PCR results and positive CT image findings,[28] then the performance that showed a sensitivity of 98.5% and specificity of 99.2% in the developing Ali-M3, [16] changes from 97.7% to 100% for sensitivity and from 80.8% to 81.6% for specificity when positive RT -1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 All rights reserved. No reuse allowed without permission. preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this this version posted November 18, 2020. ; https://doi.org/10.1101/2020.11.15.20231621 doi: medRxiv preprint PCR is the only reference used. Upgrading to a diagnostic AI that targets only RT-PCR-positive cases at the development stage is desirable.
This study had some limitations. First, the differentiation performance of Ali-M3 was poor in asymptomatic patients; thus, Ali-M3 should not be used to screen asymptomatic patients. While an alternative to the RT-PCR test for COVID-19 is expected in terms of screening for nosocomial infections and screening on admission for patients with other diseases, Ali-M3 is not recommended for this purpose.
Second, we could not differentiate COVID-19 from other viral pneumonias. Compared to the past five seasons, the number of Japanese people infected with influenza during this season was markedly low. [29] In fact, only a few cases in our cohort were diagnosed with other viral pneumonias. Third, it could not reflect the difference in imaging features caused by different COVID-19 types. In addition to type A COVID-19 that was initially prevalent in Asia, type B and type C were prevalent in Europe and in the United States. These different types were not determined in the PCR test, and thus we could not evaluate these differences.
In conclusion, we conducted a retrospective cohort study for external validation of Ali-M3. Our results indicated that AI-based CT diagnosis could be useful for a diagnosis of exclusion of COVID-19 in symptomatic patients, particularly those requiring oxygen and with only a few days since symptom onset.
The copyright holder for this this version posted November 18, 2020. ; PCR, polymerase chain reaction; bpm, beats per minute *Patients using oxygen support were included in symptomatic patients + is continuous data and the others are count data. Continuous variables are expressed as mean (SD) and count data as number (percentage).