Systematic review with meta-analysis of diagnostic test accuracy for COVID-19 by mass spectrometry


 Background
 The global COVID-19 pandemic has led to extensive development in many fields, including the diagnosis of COVID-19 infection by mass spectrometry. The aim of this systematic review and meta-analysis was to assess the accuracy of mass spectrometry diagnostic tests developed so far, across a wide range of biological matrices, and additionally to assess risks of bias and applicability in studies published to date.
 
 Method
 23 retrospective observational cohort studies were included in the systematic review using the PRISMA-DTA framework, with a total of 2858 COVID-19 positive participants and 2544 controls. Risks of bias and applicability were assessed via a QUADAS-2 questionnaire. A meta-analysis was also performed focusing on sensitivity, specificity, diagnostic accuracy and Youden's Index, in addition to assessing heterogeneity.
 
 Findings
 Sensitivity averaged 0.87 in the studies reviewed herein (interquartile range 0.81–0.96) and specificity 0.88 (interquartile range 0.82–0.98), with an area under the receiver operating characteristic summary curve of 0.93. By subgroup, the best diagnostic results were achieved by viral proteomic analyses of nasopharyngeal swabs and metabolomic analyses of plasma and serum. The performance of other sampling matrices (breath, sebum, saliva) was poorer, indicating that these protocols are not yet mature enough for clinical application.
 
 Conclusions
 This systematic review and meta-analysis demonstrates the potential for mass spectrometry and 'omics in achieving accurate test results for COVID-19 diagnosis, but also highlights the need for further work to optimize and harmonize practice across laboratories before these methods can be translated to clinical applications.



Rationale
The COVID-19 pandemic has resulted in significant morbidity and mortality across the globe. 1 The severity of the pandemic has also triggered developments and accelerated application in many scientific fields, including vaccine technology, drug treatment, and testing. Whilst the global standard in diagnosis has been the polymerase chain reaction combined with reverse transcription (RT-PCR), at times demand has exceeded supply, leading to research across many analytical disciplines for alternative diagnostic solutions. 2,3 The potential of mass spectrometry (MS) for research into diseases and their diagnosis is well-established, 4,5 with the flexibility of the technique allowing both proteomic and metabolomic analysis across a wide array of biological matrices. A number of methods have been developed and improved over the last eighteen months, 6 but given the exigencies of the pandemic, researchers have often been unable to establish ideal case-controls, blind tests or sufficient participant recruitment to meet best-practice thresholds for either point of care or laboratory-based detection tests. 7 Whilst clinical diagnostic tools such as bilateral chest X-rays and similar methods have been systematically reviewed, 2 no such systematic review and diagnostic meta-analysis has to our knowledge been published on tests based on mass spectrometry.
In this review we explored the state of mass-spectrometry-led diagnostic testing for COVID-19 infection across different biological matrices using 'omics approaches, incorporating a meta-analysis of key parameters. These included accuracy, sensitivity, specificity, and Youden's Index, as well as an assessment of heterogeneity. Any diagnostic test must have a relevant use-case, and in this review we focused on applicability to hospital admissions, 8 given that this use-case for MS would complement the capabilities offered by RT-PCR (highly sensitive, but slow turnaround relative to point of care tests) and lateral flow tests (faster but do not take advantage of the facilities and expertise available in a hospital setting). We additionally aimed to assess published studies for issues relating to bias and applicability, in order to review the undoubted progress made so far, as well as to highlight improvements that can be made in future work.

Objectives
The objective of this review was to benchmark a series of MS based diagnostic index tests against each other using RT-PCR as a reference test. We also aimed to identify how well new tests might meet a clinical role of accurate identification of COVID-19 infection, with a focus on admission settings. The review also sought to identify areas in existing research where bias or applicability issues may occur, and how future research may mitigate these issues.

Information sources and search strategy
the reference lists of all studies identified by the search strategy described above. The search strategy included articles published on the above-listed databases up to and including 14 September 2021.

Study selection
For all articles identified under the search strategy, titles and abstracts were screened for eligibility. The relevant articles were then read in full, including data extraction for meta-analysis. In this work, the eligibility criteria for inclusion in the systematic review and meta-analysis were set as follows: (a) evaluation of a diagnostic method for COVID-19 using mass spectrometry, based on 'omics approaches, (b) using human biological matrices and (c) including diagnostic analyses, at a minimum reporting sensitivity and specificity by confusion matrix, or receiver operating characteristic (ROC) curves provided that the sensitivity / specificity trade-off was unambiguous. Articles in non-Roman characters were not included.
The above search and eligibility steps were carried out by two researchers, with differences in identified articles reviewed by a third author for inclusion / exclusion.

Data collection process
The following items were collected by two researchers from articles identified above: key metadata for each article (authors, date of publication, country of origin); methods employed (mass spectrometry, separation, biological samples collected) and diagnostic outcomes (true positive, TP; false positive, FP; false negative, FN; and true negative, TN).
Diagnostic outcomes were taken directly from research where possible, or were calculated using confusion matrices based on reported sensitivity and specificity outcomes as applied to cohort data, or in one case by use of a reported ROC chart.
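The back-calculation of a confusion matrix from reported rates can be sketched as follows. This is a minimal illustration rather than the procedure the authors actually ran: the function name and the example cohort (50 RT-PCR positives, 40 controls) are hypothetical, and rounding to whole participants is an assumption.

```python
def confusion_from_rates(sensitivity, specificity, n_positive, n_negative):
    """Back-calculate TP/FP/FN/TN counts from reported rates and cohort sizes.

    Assumption: counts are rounded to the nearest whole participant
    (Python's round() resolves exact .5 ties to the even integer).
    """
    tp = round(sensitivity * n_positive)   # positives correctly detected
    fn = n_positive - tp                   # positives missed
    tn = round(specificity * n_negative)   # controls correctly cleared
    fp = n_negative - tn                   # controls wrongly flagged
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical study: 50 RT-PCR positive participants and 40 controls.
counts = confusion_from_rates(0.90, 0.85, 50, 40)
print(counts)
```

Because reported rates are usually rounded to two decimals, the recovered counts may differ by one participant from the true table, which is one reason directly reported counts are preferable where available.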

Risks of bias and applicability
Two researchers independently evaluated risks relating to both bias and applicability using the Quality Assessment of Diagnostic Accuracy Studies tool (QUADAS-2), 10 with the approach (and conflicts between the researchers) being reviewed by a third author.

Diagnostic accuracy measurements including meta-analysis of diagnostic accuracy
Meta-analysis was performed for the aggregate of mass spectrometry 'omics based approaches. Given the small sample sizes, not all subgroups offered meaningful results, but subgroups comprising viral proteomics, blood-based metabolomics, and novel 'omics approaches (saliva, sebum and breath) were reviewed independently from the aggregate.
Sensitivity was defined as the true positive rate, i.e. the probability that a positive test result will be obtained when the disease is present, and calculated as TP / (TP + FN). Specificity was defined as the true negative rate, i.e. the probability that a negative test result will be obtained when the disease is not present, and calculated as TN / (TN + FP). Youden's Index was defined as sensitivity - (1 - specificity), or equivalently, one minus the sum of the error rates. The positive likelihood ratio (PLR) was defined as the true positive rate / false positive rate. The negative likelihood ratio (NLR) was defined as the false negative rate / true negative rate.
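The definitions above translate directly into code. The sketch below is illustrative only: the function name and the example counts are hypothetical, and the accuracy formula, (TP + TN) / total, is the conventional definition, assumed here because the text does not spell it out.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute summary diagnostic metrics from one 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    fpr = fp / (tn + fp)                  # false positive rate
    fnr = fn / (tp + fn)                  # false negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Conventional definition of accuracy (an assumption, see lead-in).
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Equivalent to sensitivity - (1 - specificity).
        "youden": 1.0 - (fnr + fpr),
        # Likelihood ratios are undefined when the denominator is zero.
        "plr": sensitivity / fpr if fpr > 0 else float("inf"),
        "nlr": fnr / specificity if specificity > 0 else float("inf"),
    }


# Hypothetical counts: 45 TP, 6 FP, 5 FN, 34 TN.
metrics = diagnostic_metrics(tp=45, fp=6, fn=5, tn=34)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning infinity for an undefined likelihood ratio is a design choice for this sketch; a production analysis would more likely apply a continuity correction to zero cells.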
Heterogeneity of diagnostic power across the different biofluids investigated in this work was investigated by measuring Cochran's Q and Higgins' I². In this work, a p-value below 0.10 or I² value greater than 50% was taken as evidence of substantial heterogeneity of diagnostic power; it should be noted however, that lower values do not necessarily confirm homogeneity, only an absence of evidence for heterogeneity. 11,12 A summary receiver operating characteristic (sROC) curve was also constructed for the studies included herein.
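For readers unfamiliar with these statistics, a generic inverse-variance calculation of Cochran's Q and Higgins' I² is sketched below. The effect measure and weighting scheme used in this review are not stated in the text, so the function name, the fixed-effect weighting and the example inputs are all assumptions made for illustration.

```python
def cochran_q_i2(effects, variances):
    """Return (Q, degrees of freedom, I^2 as a fraction) for study estimates.

    Assumption: inverse-variance (fixed-effect) weights, the textbook
    form of Cochran's Q; the review may have used a different effect scale.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is the share of Q in excess of its expectation (df), floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i2


# Hypothetical per-study estimates with equal variances.
q, df, i2 = cochran_q_i2([0.5, 0.7, 0.9], [0.01, 0.01, 0.01])
print(f"Q = {q:.2f}, df = {df}, I2 = {100 * i2:.0f}%")
```

The p-value for Q comes from a chi-squared distribution with df degrees of freedom; it is omitted here to keep the sketch free of third-party dependencies.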
ROC curves show the trade-off between sensitivity and specificity, whereby a test can be more sensitive (by over-diagnosing disease) at the cost of being less specific (more false positives), and vice versa. A test that was 100% sensitive and 100% specific would generate an area under the curve (AUROC) of exactly 1, and more generally values closer to 1 indicate better diagnostic performance.
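The relationship between ROC points and the area under the curve can be illustrated with a simple trapezoidal calculation. This is not how a summary ROC curve is fitted in a meta-analysis, which relies on dedicated models; the sketch only shows what the area measures, and the function name and operating points are hypothetical.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given (false positive rate, sensitivity) points.

    The corner points (0, 0) and (1, 1) are added automatically, and the
    area is accumulated with the trapezoidal rule.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# A perfect test passes through FPR = 0, sensitivity = 1.
perfect = auc_trapezoid([(0.0, 1.0)])
# A test on the chance diagonal has no discriminating power.
chance = auc_trapezoid([(0.5, 0.5)])
# Hypothetical test with three operating points.
example = auc_trapezoid([(0.05, 0.60), (0.15, 0.85), (0.40, 0.95)])
print(perfect, chance, round(example, 3))
```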

Statistical tools
All statistical analysis was performed in the RStudio environment, 13,14 with additional functionality using the epiR, forestplot and mada packages. 15,16,17

Results

Study characteristics
In total, 253 articles were identified in the initial search strategy by the terms described, after removing 308 duplicate results. From this initial list, 51 were identified as meeting the eligibility criteria and 202 were excluded. The articles on this shortlist were then read in full.
23 of the 51 identified articles contained the complete set of diagnostic accuracy data to allow for meta-analysis, albeit for one article 18 the data were imputed from provided ROC charts. Figure 1 provides a flowchart illustrating these steps. The included studies are listed in Table 1, grouped by methods whose focus was on host characteristics, methods focused on the virus (by proteomics), and groups that identified features but were agnostic as to the source of those features.
In the analysis that follows, 'Unclear' does not denote 'medium' risk of concern; rather it denotes that insufficient information was provided, and there is no basis to consider the study to be at 'low' risk of bias or inapplicability.

Risk of bias and applicability of the tests reviewed
In terms of risks of bias around patient selection, 30% of the studies provided no cohort analysis, making it impossible to ascertain whether the work was free from bias in this regard. Furthermore, only 9% of studies specified whether participants were recruited consecutively or at random. Only 23% of studies explicitly stated that asymptomatic patients were included, 39% stated that they were excluded, and 39% provided no information.

Diagnostic results of the studies
The key extracted diagnostic indicators are summarised in Table 2.
Across the studies reviewed, sensitivity ranged from 0.62 to 1.00 (aggregate sensitivity of 0.87), and specificity ranged from 0.72 to 1.00 (aggregate specificity of 0.88). Specificity was greater than sensitivity on average, albeit the difference was not statistically significant based on a two-tailed t-test (p-value of 0.34).
In terms of biofluids analysed, sebum was analysed in 2 papers, and delivered the lowest aggregated sensitivity (0.76) and specificity (0.82), calculated by summing confusion matrices. Saliva was investigated in 2 studies, with sensitivity and specificity of 0.74 / 0.75 for metabolomic analysis of saliva, and 1.00 / 0.93 for proteomic analysis. Breath was analysed in 4 studies, with comparable sensitivity (0.78) and specificity (0.81) to sebum.
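Aggregation by summing confusion matrices, as used for the per-biofluid figures above, can be sketched as follows. The two studies and their counts are hypothetical and do not correspond to any study in this review.

```python
def pooled_rates(studies):
    """Sum TP/FP/FN/TN across studies, then compute pooled rates."""
    total = {key: sum(s[key] for s in studies) for key in ("TP", "FP", "FN", "TN")}
    sensitivity = total["TP"] / (total["TP"] + total["FN"])
    specificity = total["TN"] / (total["TN"] + total["FP"])
    return total, sensitivity, specificity


# Hypothetical confusion matrices for two studies of the same biofluid.
studies = [
    {"TP": 30, "FP": 5, "FN": 10, "TN": 25},
    {"TP": 20, "FP": 3, "FN": 5, "TN": 27},
]
total, pooled_sens, pooled_spec = pooled_rates(studies)
print(total, round(pooled_sens, 2), round(pooled_spec, 2))
```

Summing counts weights each study by its number of participants, so a large study dominates the pooled figure; this differs from averaging the per-study rates.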
Nine studies sampled nasopharyngeal swabs, with high sensitivity (0.89) and specificity (0.88). The remaining 5 studies sampled blood (either plasma or serum), with aggregated sensitivity of 0.89 and specificity of 0.96. Proteomic approaches that targeted the virus reported higher sensitivity and specificity than approaches that targeted the impact on the host, albeit within the latter category there was considerable variation.
Table 1 lists the major features differentiating the populations by study. In studies focusing on proteomics, a number identified features by m/z only, but 2 studies targeted peptides originating from spike proteins, and 2 identified peptides originating from nucleocapsid proteins. For the 4 studies analysing breath, a wide variety of alcohols, aldehydes and ketones were found to differentiate the populations, but there was limited overlap, with heptanal and octanal featuring in 2 of the 4 studies. In terms of sebomics, the studies described in this review found no differentiating features in common. Within plasma and serum, 2 papers identified ratios of amino acids (kynurenine in particular) as key differentiating features, and 2 papers focused on lipid dysregulation.
As a single measure of performance, estimates of Youden's Index including confidence intervals are shown in Figure 3, with Youden's Index calculated as sensitivity minus (one minus specificity), or alternatively one minus the sum of error rates.

Heterogeneity assessment of the studies
The studies show variation in their diagnostic performance measured by either sensitivity, specificity or Youden's Index (Table 2 and Figure 3) and, partly due to small participant populations, confidence intervals are wide. Cochran's Q was calculated as 26.2 with a p-value of 0.24, and Higgins' I² was calculated as 16%. The latter value should be treated with caution given the small sample sizes assessed in this meta-analysis, as Higgins' I² tends to be underpowered in the meta-analysis of studies with small n and therefore lower precision. 42 A low I² does not represent evidence of homogeneity per se, but may indicate that the variability in results could be due to wide confidence intervals rather than unexplained heterogeneity, as is the case in this work (Figure S1, Supplementary Material).
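As an arithmetic check on the aggregate figures, Higgins' I² can be recovered from Cochran's Q using the standard relation below. The degrees of freedom, df = 22, are an assumption (one fewer than the 23 included studies); the text does not state them.

```latex
I^2 = \max\!\left(0,\ \frac{Q - \mathrm{df}}{Q}\right) \times 100\%
    = \frac{26.2 - 22}{26.2} \times 100\%
    \approx 16\%
```

Under that assumption the result agrees with the reported value of 16%.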
Heterogeneity was also investigated by broad method employed, specifically proteomics versus metabolomics, and also by subgroup. Heterogeneity was notably low for proteomics including viral proteins, with Cochran's Q calculated as 7.6 and Higgins' I² as 0%. For blood-based analyses, Cochran's Q was calculated as 4.1, and Higgins' I² was calculated as 3%. For saliva, sebum and breath (the more novel 'omics analyses), Cochran's Q was calculated as 3.2 and Higgins' I² was calculated as 0%.
Visual inspection also illustrates the differences between, but similarity within, these methods (Figures 4A and 4B). This can also be illustrated by calculating summary area under the sROC curves for these groups.
metabolomics analyses applied to other sampling matrices (saliva, breath, sebum) may partly reflect instrumental setup, but could also relate to confounders, and illustrates the need for much more inter-laboratory validation and comparison before these diagnostic techniques are likely to be suitable for translation to clinical practice.
RT-PCR as a reference standard achieves very high analytical sensitivity and specificity and is generally seen as the clinical gold standard for release of patients from isolation, 47 but there must also be a role for less sensitive, faster approaches to support a triage environment, e.g. for ward allocation on hospital admission, where a negative RT-PCR result will often require additional testing for confirmation. 48 Antigen detection assays can offer an alternative to RT-PCR with faster response time, depending on type; one meta-analysis found sub-category sensitivity ranging from 0.66 (for lateral flow immunoassays) to 0.98 (for chemiluminescent immunoassays). 49 Bilateral chest X-rays have also been reported to be a useful supplementary tool in COVID-19 diagnosis. In a recent meta-analysis chest X-rays were found to have sensitivity of 0.91 and specificity of 0.78, again with RT-PCR as the reference, 50 albeit the American College of Radiology has noted that chest imaging in COVID-19 is not specific, and overlaps with other infections. 51 Compared with these benchmarks, MS-based approaches show promise based on achieved sensitivity and specificity and, given that mass spectrometry facilities are often available in hospital settings, may find a use-case by offering faster turnaround than RT-PCR and so supplementing clinical diagnosis. In addition, MS-based approaches offer alternatives in the initial stages of a pandemic, when PCR or other tests may be in short supply.
Because of the ability of MS based techniques to identify dysregulation involving many pathways, such tests could provide information on the wider host metabolome and proteome. This potentially allows for prognosis as well as diagnosis, and promising results have already been obtained for mass spectrometry-based prognostic analyses of serum, plasma and saliva. 31,52,53 In this work, the best results were found to be delivered by metabolomic study of homeostatically regulated biofluids (serum and plasma) and by proteomic study of nasopharyngeal swabs, with areas under their respective sROC curves of xxx and xxx respectively. These results were mainly obtained by UHPLC-MS (for blood metabolomics) and MALDI-TOF-MS (for proteomics). These sampling methods are, of course, more invasive than skin swabs or exhaled breath, but based on the studies reviewed here, the invasive methods deliver the greatest diagnostic accuracy and are most concordant with the WHO's  shows low heterogeneity of diagnostic performance for proteomics and blood-based sampling, the variation (and lack of overlap) in differentiating features suggests that much more inter-laboratory validation and optimisation will be required before these results can be translated into a clinical setting. The pilot studies described herein have shown the potential for accurate diagnosis of COVID-19, but we believe that future work should focus on larger recruitment cohorts, the inclusion of more blind tests for validation, validation across different locations, and optimisation of techniques.

Conclusions
The detection and diagnosis of COVID-19 by mass spectrometry has made substantial progress over the course of the SARS-CoV-2 pandemic. Achieved sensitivity and specificity of the diagnostic tests discussed in this review are encouraging, but with clear limits in the biases and applicability of the research undertaken so far. Whilst results based on proteomics and blood metabolomics delivered the most compelling performance, and these methods are most promising in terms of clinical application in the near term, more validation studies are still needed to reduce risks of bias and applicability. In the case of less invasive matrices, whilst the potential advantages are attractive, as yet there is little agreement between studies on suitably robust and reproducible targets.
Whilst mass spectrometry techniques may show promise, and advances in this field could be applicable to disease diagnosis beyond COVID-19, future research should focus on reducing bias by recruiting larger numbers of participants without inappropriate exclusions, especially to meet thresholds for determining suitability for point of care or other use-cases. In addition, greater use of blind test sets for validation would reduce bias from over-fitted machine learning models in MS based diagnostic testing. Furthermore, and especially for the less invasive sampling matrices, considerable work is required to harmonise and optimise methodologies so that features can be validated between labs.