Impact of partial-volume correction in oncological PET studies: a systematic review and meta-analysis

Purpose: Positron-emission tomography can be useful in oncology for diagnosis, (re)staging, determining prognosis, and response assessment. However, partial-volume effects hamper accurate quantification of lesions <2–3× the PET system’s spatial resolution, and the clinical impact of this is not evident. This systematic review provides an up-to-date overview of studies investigating the impact of partial-volume correction (PVC) in oncological PET studies.
Methods: We searched the PubMed and Embase databases according to the PRISMA statement, including studies from inception until May 9, 2016. Two reviewers independently screened all abstracts and eligible full-text articles and performed quality assessment according to QUADAS-2 and QUIPS criteria. For a set of similar diagnostic studies, we statistically pooled the results using bivariate meta-regression.
Results: Thirty-one studies were eligible for inclusion. Overall, study quality was good. For diagnosis and nodal staging, PVC yielded a strong trend of increased sensitivity at the expense of specificity. Meta-analysis of six studies investigating diagnosis of pulmonary nodules (679 lesions) showed no significant change in diagnostic accuracy after PVC (p = 0.222). Prognostication was not improved for non-small cell lung cancer and esophageal cancer, whereas it did improve for head and neck cancer. Response assessment was not improved by PVC for (locally advanced) breast cancer or rectal cancer, and it worsened in metastatic colorectal cancer.
Conclusions: The accumulated evidence to date does not support routine application of PVC in standard clinical PET practice. Consensus on the preferred PVC methodology in oncological PET should be reached. Partial-volume-corrected data should be used as adjuncts to, but not yet replacement for, uncorrected data.
Electronic supplementary material: The online version of this article (doi:10.1007/s00259-017-3775-4) contains supplementary material, which is available to authorized users.


Introduction
Positron-emission tomography (PET) enables in vivo assessment of metabolic and intracellular processes. Whereas in clinical practice, PET is predominantly used to qualitatively assess tracer uptake, PET(/computed tomography [CT]) may also serve as a surrogate quantitative biomarker of, for example, tumor metabolism and proliferation. The application of quantitative tumor assessment methods for distinguishing benign from malignant lesions, staging, prognostication, and determining or predicting response to therapy has garnered increasing interest [1][2][3][4].
Accurate quantification of metabolic volumes <2–3× the spatial resolution of PET is hampered by partial-volume effects, leading to underestimations of standardized uptake value (SUV), and possibly compromising lesion detection [5,6]. Many methods for partial-volume correction (PVC) have been advocated [7]. The simplest technique uses recovery coefficients (RC) obtained from phantom experiments under the assumption that true metabolic volume is known and that lesions are spherically shaped with homogeneous uptake. More sophisticated methods have been developed, but all suffer from limitations [7,8]. Voxel-wise resolution recovery methods, incorporating the point spread function (PSF) within iterative reconstruction [9] (PSF reconstruction) or performing post-reconstruction iterative deconvolution [10], could improve both qualitative and quantitative reads. To date, consensus on standardized application of PVC in oncological PET/CT studies is lacking, and perhaps as a consequence PVC is not yet routinely applied. In fact, most current clinical quantitative PET studies merely exclude small lesions (e.g. <2 cm in diameter), as recommended in the PET Response Criteria in Solid Tumors (PERCIST) criteria [3].
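To make the recovery-coefficient idea above concrete, the following is a minimal sketch, not the procedure of any specific study. It assumes a spherical lesion of known diameter with homogeneous uptake, as the RC method requires. The RC table values, the function names, and the background-adjusted form of the correction are illustrative assumptions; real RCs would come from a phantom experiment on the scanner in question.

```python
from bisect import bisect_left

# Hypothetical (diameter in mm, recovery coefficient) pairs.
# These numbers are made up for illustration, not phantom measurements.
RC_TABLE = [(10.0, 0.35), (13.0, 0.50), (17.0, 0.65),
            (22.0, 0.80), (28.0, 0.90), (37.0, 0.97)]


def recovery_coefficient(diameter_mm):
    """Linearly interpolate the RC for a sphere of the given diameter."""
    diameters = [d for d, _ in RC_TABLE]
    # Outside the tabulated range, fall back to the nearest tabulated RC.
    if diameter_mm <= diameters[0]:
        return RC_TABLE[0][1]
    if diameter_mm >= diameters[-1]:
        return RC_TABLE[-1][1]
    i = bisect_left(diameters, diameter_mm)
    (d0, rc0), (d1, rc1) = RC_TABLE[i - 1], RC_TABLE[i]
    return rc0 + (rc1 - rc0) * (diameter_mm - d0) / (d1 - d0)


def rc_corrected_suv(suv_measured, diameter_mm, suv_background=0.0):
    """Scale the uptake above background by 1/RC (assumed correction form)."""
    rc = recovery_coefficient(diameter_mm)
    return suv_background + (suv_measured - suv_background) / rc
```

Because the RC depends strongly on diameter for small spheres, an error in the assumed lesion size translates directly into an error in the corrected SUV, which is one of the limitations of this approach.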
The clinical impact of PVC in an oncological setting, and thus the need for standardized application, is not yet fully elucidated [7]. We performed a systematic review and meta-analysis to assess the impact of PVC in clinical PET studies, focusing on diagnosis, staging, prognostication, and response assessment.

Search strategy
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. A comprehensive search (Supplemental Tables 1 and 2), in collaboration with a medical librarian (LJS), was performed in PubMed and Embase.com from inception to May 9, 2016. Both controlled terms (MeSH in PubMed, Emtree in Embase) and free-text terms were included in the search. The following were used (including synonyms and closely related words) as index terms or free-text words: 'positron-emission tomography' or 'PET' and 'partial volume correction' or 'point spread function reconstruction' and 'neoplasms' or 'cancer'.

Selection process
Abstracts and titles of all studies retrieved from the search were independently screened by two researchers (MCFC and GMK). Afterwards, eligible articles were studied in full text. In case of differences in judgment, consensus was reached through discussion. Cross-referencing was performed to further identify relevant articles.
The following were included: studies applying PVC in clinical PET studies, using oncological patients, reporting PET data with and without PVC, and investigating clinical impact of PVC on either diagnosis, staging, prognostication (reporting survival data), or response assessment.
Exclusion criteria were as follows: reviews, letters, editorials, conference abstracts, case reports, full text not available or not in English, no adequate reference data, no description of or reference to PVC method, combined PVC and motion blur correction method, or patient cohort overlapping with another included study.

Quality assessment
The quality of included articles was assessed (independently by MCFC and GMK) according to the QUADAS-2 [11] (n = 21) or QUIPS [12] (n = 10) tools. QUADAS-2 assesses bias and applicability of diagnostic studies, whereas QUIPS assesses bias of studies investigating prognostic factors. Staging and response assessment studies were each assigned to one of the two quality assessment tools. Consensus was reached through discussion.

Data extraction and meta-analysis
Both researchers independently extracted results regarding impact of PVC on diagnostic accuracy (for diagnosis and staging), prediction of survival (for prognostication), and response assessment. Measures of diagnostic accuracy were derived with and without PVC. If test characteristics were described for subgroups, overall measures of accuracy were calculated when possible. When p-values of differences in accuracy between uncorrected and PVC data were not reported, these differences were deemed not statistically significant. Descriptive data regarding cancer type, number of patients, lesion sizes, scanner type, and PVC method were also extracted. Unless stated otherwise, we presented data on SUV quantification.
Diagnostic studies on the same topic were pooled using bivariate random effects meta-regression analysis, which is the recommended method for meta-analysis of diagnostic studies [13]. This method provides summary estimates of sensitivity and specificity with 95% confidence intervals, taking into account the correlation between sensitivity and specificity and heterogeneity in results between studies. We tested for differences in overall diagnostic accuracy between different diagnostic tests using a likelihood ratio test, comparing models that included and excluded a covariate for the diagnostic test. For illustrative purposes, summary receiver operating characteristic (ROC) curves were calculated according to the Moses-Littenberg method [14]. We used Stata software (version 14; StataCorp LP, College Station, TX) for statistical analyses.
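As an illustration of the summary ROC step only (the bivariate model itself was fitted in Stata and is not reproduced here), the sketch below implements the Moses-Littenberg approach in its common unweighted form. Each study's true- and false-positive rates are logit-transformed, D = logit(TPR) − logit(FPR) is regressed on S = logit(TPR) + logit(FPR) by ordinary least squares, and the fitted line is back-transformed into a curve. The function names, the 0.5 continuity correction, and the 2×2 counts are assumptions for illustration and are not data from the included studies.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def fit_moses_littenberg(studies):
    """Fit D = a + b*S by unweighted least squares.

    studies: list of (tp, fn, fp, tn) counts, one tuple per study.
    Requires at least two studies with different S values.
    """
    d_vals, s_vals = [], []
    for tp, fn, fp, tn in studies:
        # 0.5 continuity correction keeps the logits finite.
        tpr = (tp + 0.5) / (tp + fn + 1.0)
        fpr = (fp + 0.5) / (fp + tn + 1.0)
        d_vals.append(logit(tpr) - logit(fpr))
        s_vals.append(logit(tpr) + logit(fpr))
    n = len(studies)
    s_mean = sum(s_vals) / n
    d_mean = sum(d_vals) / n
    sxx = sum((s - s_mean) ** 2 for s in s_vals)
    sxy = sum((s - s_mean) * (d - d_mean) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b


def sroc_sensitivity(fpr, a, b):
    """Sensitivity on the summary ROC curve at a given false-positive rate."""
    logit_tpr = a / (1.0 - b) + (1.0 + b) / (1.0 - b) * logit(fpr)
    return 1.0 / (1.0 + math.exp(-logit_tpr))


# Hypothetical 2x2 counts (tp, fn, fp, tn) for three studies.
example_studies = [(40, 10, 5, 45), (30, 20, 2, 48), (55, 5, 12, 38)]
a_hat, b_hat = fit_moses_littenberg(example_studies)
curve = [(f / 20.0, sroc_sensitivity(f / 20.0, a_hat, b_hat)) for f in range(1, 20)]
```

Unlike the bivariate model, this regression ignores within-study sampling error and between-study correlation, which is why it is used here for illustration of the curve rather than for inference.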

Study selection
PubMed and Embase searches yielded 371 potentially eligible studies (Fig. 1). Three additional studies were found through reference screening. Two hundred and ninety-three abstracts were excluded based on eligibility criteria, leaving 81 for full-text screening. For 19 (5.1%) abstracts, judgments were conflicting, and consensus was reached through discussion. After full-text review, 31 studies met eligibility criteria (Fig. 1). Studies on diagnosis (n = 10), staging (n = 10), prognostication (n = 6), and response assessment (n = 5) are presented in Tables 1, 2, 3, and 4, respectively. Supplemental Table 3 contains the PVC and tumor delineation methodologies, reconstruction settings, full-width-at-half-maximum values, and voxel sizes of each included study. Thirty studies used 18F-FDG as radiopharmaceutical; one study used 18F-choline.

Quality assessment
For extensive descriptions of the QUADAS-2 and QUIPS scoring criteria, we refer to their respective primary publications [11,12].
Considering QUADAS-2 (Fig. 2a), the 'reference standard' and 'patient selection' items resulted in low risk of bias (high risk of bias in 14% of studies for either item). Elevated risk of bias for the 'reference standard' item was caused by use of multiple reference tests within the same study. Risk of bias in the index test was high in 24% of studies due to the use of data-driven instead of pre-defined SUV cut-offs. Applicability concerns regarding patient selection were mainly caused by large tumor size spectra and unspecified tumor sizes.
Using QUIPS (Fig. 2b), low risk-of-bias scores were found in the majority of the studies for the items measurement of outcome and prognostic factors, study attrition, and statistical analysis and reporting. Several studies did not adequately investigate potential confounding factors, which resulted in a moderate risk of bias in 40% of studies and high risk of bias in 40% of studies. Unclear descriptions of included patient cohorts ('study participation' item) resulted in moderate risk of bias in 40% of included studies.

Diagnosis
For diagnosis of breast lesions, using data-driven SUVmean thresholds of 2.1 for PVC and non-PVC, at a fixed specificity of 90%, PVC increased sensitivity from 69 to 81%, but the impact on accuracy was not statistically significant [15]. In discriminating between aggressive and indolent non-Hodgkin lymphoma (NHL), PVC decreased specificity without affecting sensitivity [22]. Similarly, PVC did not improve differentiation between high- and low-grade NHL [16]. PVC also enabled differentiation between indolent NHL and Kikuchi-Fujimoto disease [24].

Staging
In non-small cell lung cancer (NSCLC) patients the association between primary tumor SUVmax and overall TNM stage disappeared after PVC [25]. For nodal staging using SUV, non-significant trends of increased accuracy for breast, head and neck squamous cell, and thyroid cancer (from 80%, 66% and 95% to 84%, 71% and 100%, respectively) [26,27,31], and decreased accuracy for nasopharyngeal and prostate cancer (from 84% and 85% to 73% and 80%, respectively) were observed [32,33]. The study investigating accuracy of nodal staging of nasopharyngeal cancer did observe a large increase in accuracy, from 14 to 71%, when stratifying for lesion size (6-7 mm diameter) [32].

Prognosis
Impact of PVC on prognostication (Table 3, n = 6) was investigated for NSCLC (n = 2), esophageal (n = 2), and head and neck cancer (n = 2). Applied PVC methods were the RC method (n = 4), iterative deconvolution (n = 1), and mask-based PVC (n = 1). Only prognostic studies providing survival data were included. One study did not specify lesion sizes. None of the studies stratified results on PVC for lesion size in secondary analysis.
PVC did not alter the association of SUVmax with disease-free survival of NSCLC (various histological types) patients in multivariate analysis [35,38]. Similarly, in NSCLC patients (various histologic types), PVC did not alter the ROC area under the curve of primary tumor SUVmax to differentiate between groups of patients in terms of disease-free and overall survival [38]. Primary tumor SUVs, regardless of PVC, were insufficient as prognostic markers in esophageal (adeno- and squamous cell) cancer in univariate and ROC analysis [36,37]. In head and neck cancer patients, partial-volume-corrected SUV was significantly different between patient groups stratified according to disease-free survival, whereas uncorrected SUV was not [39]. In univariate analysis, PVC did not affect predictive value of head and neck cancer primary tumor SUV on local recurrence-free survival, distant metastasis-free survival, and disease-free survival, but did allow for prediction of distant metastasis-free survival in a subgroup of patients with PET-positive lymph nodes [40].

Response assessment
Impact of PVC on response assessment (Table 4, n = 5) was investigated for breast (n = 2), rectal (n = 1), colorectal (n = 1), and NSCLC (n = 1). Applied PVC methods included the RC method (n = 2), iterative deconvolution (n = 2), and both RC method and iterative deconvolution (n = 1). One study did not specify lesion sizes. None of the studies stratified results on PVC for lesion size in secondary analysis.
For locally advanced breast cancer [41], regardless of PVC, primary tumor FDG metabolic rate was not able to differentiate between clinical and pathologic responders and non-responders during neoadjuvant chemotherapy (after 2 months). In another study in breast cancer patients, PVC did not significantly change prediction of pathologic response with primary tumor SUV during neoadjuvant therapy (after two cycles) [42]. In locally advanced rectal cancer patients treated with (preoperative) chemoradiotherapy, PVC had no impact on histopathological response prediction, at baseline or after 1 or 2 weeks of therapy [43]. In patients with metastatic colorectal cancer, PVC significantly reduced the ROC area under the curve of SUV in discriminating between responders and non-responders after 2 weeks of chemotherapy, as defined with RECIST [44]. In NSCLC patients treated with radio- or radiochemotherapy, PVC changed PERCIST [3] classification of response in 5/24 lesions, which were verified as correct alterations in clinical follow-up [45].

Discussion
Quantification of functional tumor characteristics with PET is considered to be useful in clinical oncology, and often uses semi-quantitative analyses, resulting in SUVs. Unfortunately, partial-volume effects are known to cause underestimation of tumor activity, and hence the necessity of PVC for accurate semi-quantitative reads for small lesions is well recognized [5]. However, many factors affect its accuracy and potentially hamper its optimal usage. Perhaps as a consequence, its resulting advantage in oncological PET studies is not yet evident. Additionally, the lack of consensus on the preferred PVC and delineation method may result in suboptimal results and could hamper comparisons between studies. This review discusses the clinical impact of PVC and provides recommendations for specific research questions and analyses to be included in future studies applying PVC.
When applied to diagnosis of primary lesions and (mainly nodal) staging, PVC often yielded higher sensitivity at the expense of specificity (Tables 1 and 2 and Figs. 3 and 4), which is an expected consequence of using the same test-positivity SUV thresholds for uncorrected and PVC data. In the subset of studies which allowed statistical pooling (679 lesions), meta-analysis showed that PVC did not significantly alter the overall diagnostic accuracy in characterizing pulmonary lesions with PET (Fig. 5). When estimating the effect of PVC, the optimal trade-off between sensitivity and specificity (the SUV threshold of test positivity) may be different for PVC and uncorrected data. At an exploratory level, one should define this cut-off separately for each method. Of note, Degirmenci et al. (on pulmonary nodules) used data-driven SUV cut-offs of 2.4 and 2.9 for uncorrected and PVC data, respectively, which yielded a specificity fixed at 80%, with sensitivity of 62 and 73% for uncorrected and PVC data, respectively [21]. We performed a similar analysis using the (individual patient) data from Hickeson et al. [18]. At a predefined SUV cut-off of 2.5, PVC decreased specificity and increased sensitivity (Table 1). However, when applying cut-offs of 2.55 and 2.8 (as derived from ROC analysis) for uncorrected and PVC data, respectively, PVC increased sensitivity from 72 to 94%, while specificity remained constant at 91%. This further suggests that PVC may indeed increase diagnostic accuracy when SUV cut-offs are adequately adapted for this correction. Obviously, each proposed threshold requires external validation.
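The threshold argument above can be sketched with a toy example. The code below is an illustration under stated assumptions, not a re-analysis of any included study: the SUV values are invented, lesions with SUV at or above the cut-off are called positive, and the helper names are hypothetical. It shows how a shared cut-off shifts the sensitivity/specificity trade-off after PVC, and how choosing a cut-off per method at a fixed specificity gives a fairer comparison.

```python
def sensitivity_specificity(suvs, is_malignant, cutoff):
    """Return (sensitivity, specificity); SUV >= cutoff counts as test-positive."""
    tp = sum(1 for s, m in zip(suvs, is_malignant) if m and s >= cutoff)
    fn = sum(1 for s, m in zip(suvs, is_malignant) if m and s < cutoff)
    tn = sum(1 for s, m in zip(suvs, is_malignant) if not m and s < cutoff)
    fp = sum(1 for s, m in zip(suvs, is_malignant) if not m and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def cutoff_at_specificity(suvs, is_malignant, target_specificity):
    """Lowest observed SUV whose specificity reaches the target, else None."""
    for cutoff in sorted(set(suvs)):
        _, spec = sensitivity_specificity(suvs, is_malignant, cutoff)
        if spec >= target_specificity:
            return cutoff
    return None


# Invented data: five malignant lesions followed by five benign lesions.
labels = [True] * 5 + [False] * 5
suv_uncorrected = [1.8, 2.2, 2.6, 3.5, 5.0, 0.9, 1.2, 1.6, 2.0, 2.7]
suv_pvc = [3.0, 3.4, 3.6, 4.2, 5.4, 1.1, 1.5, 2.1, 2.6, 3.1]

# Same cut-off for both data sets: PVC raises sensitivity but lowers specificity.
shared_uncorrected = sensitivity_specificity(suv_uncorrected, labels, 2.5)
shared_pvc = sensitivity_specificity(suv_pvc, labels, 2.5)

# Method-specific cut-offs at a fixed specificity of 80%.
cut_uncorrected = cutoff_at_specificity(suv_uncorrected, labels, 0.8)
cut_pvc = cutoff_at_specificity(suv_pvc, labels, 0.8)
adapted_uncorrected = sensitivity_specificity(suv_uncorrected, labels, cut_uncorrected)
adapted_pvc = sensitivity_specificity(suv_pvc, labels, cut_pvc)
```

Cut-offs chosen on the same data they are evaluated on are optimistic, which is why data-driven thresholds were scored as a source of bias and why external validation is needed.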
Another explanation for the limited impact of PVC on diagnostic accuracy as published in the literature may relate to the size spectra of included lesions, with the distribution of benign and malignant lesions therein. When performing PVC analysis simultaneously on all lesions, both large and small, the overall impact of PVC on diagnostic accuracy will be diminished. Indeed, several studies demonstrated a high impact of PVC on accuracy for small lesions (when stratifying for lesion size), but less so when including all lesions regardless of size [18,32]. Therefore, we suggest that investigators stratify diagnostic performance results for lesion size in secondary analyses. However, since partial-volume effects are not merely size-dependent, but are also affected by lesion contrast and shape, reliable classification of lesions that are (most) affected by partial-volume effects will be difficult. In our previous simulation study, we observed that for high-contrast spherical lesions, partial-volume effects started to occur below 3-cm diameter [8]. A practical approach for stratification would thus be to stratify results using a 3-cm lesion diameter or a 14-mL metabolic volume cut-off (corresponding to a 3-cm-diameter sphere). Even though larger lesions may also be somewhat affected by partial-volume effects, depending on their shape and contrast, such a size cut-off will ensure that lesions that are most affected by partial-volume effects are separated. Another approach would be to plot the percentage increases in SUV after PVC as a function of metabolic tumor volume to determine an appropriate size cut-off for stratification of results within studies (not possible when applying the RC method).
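The correspondence between the 3-cm diameter and the 14-mL volume cut-off follows from the volume of a sphere, V = (4/3)πr³, with 1 cm³ equal to 1 mL. A short sketch of that arithmetic and of a size-based stratification helper is given below; the function names and the "small"/"large" labels are illustrative assumptions.

```python
import math


def sphere_volume_ml(diameter_cm):
    """Volume of a sphere in mL (1 cm^3 = 1 mL)."""
    radius_cm = diameter_cm / 2.0
    return 4.0 / 3.0 * math.pi * radius_cm ** 3


def size_stratum(metabolic_volume_ml, cutoff_ml=14.0):
    """Assign a lesion to a size stratum using the suggested volume cut-off."""
    return "small" if metabolic_volume_ml < cutoff_ml else "large"


# A 3-cm sphere has a volume of 4.5 * pi mL, i.e. about 14.1 mL.
volume_3cm = sphere_volume_ml(3.0)
```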
Regarding visual nodal staging, PSF reconstruction did not significantly alter accuracy, but tended to increase sensitivity in lung, breast, and colorectal cancer (Table 2) [28,30,34]. This may be attributed to improved qualitative reads, improved (small) lesion detection, and higher diagnostic confidence [28,30,34]. Therefore, it may be worthwhile to validate these higher-resolution reconstruction algorithms for use in clinical practice, especially for detection of small lymph node metastases and lesions embedded in high background activity such as in the liver or mediastinum. However, PSF reconstructions may suffer from Gibbs artifacts (overshoot in activity); moreover, they are known not to guarantee full signal recovery [9]. Also, further research into their impact on compliance with European Association of Nuclear Medicine (EANM) standards is needed to ensure equal scanner calibration in multicenter quantitative PET/CT studies, which may require an SUV harmonization procedure [46].
We found that PVC might improve prognostication in head and neck cancer [39,40], but these studies did not stratify for human papillomavirus status, a prognostic marker associated with lower tumor SUV and smaller metabolically active tumor volume (MATV) [47]. Future studies should bear in mind that appropriate PVC may not necessarily improve prognostication with SUV, but instead may enable SUV to reflect its true prognostic value. For example, Vesselle et al. found that PVC mitigated the correlation between primary tumor SUV and overall survival in NSCLC patients, and they also observed that the correlation between SUV and overall TNM stage, which in essence is based on patient prognosis, disappeared after PVC, suggesting that the 'prognostic value' of uncorrected SUV was based on tumor volume rather than metabolic activity [5,25,48].
For response assessment, no conclusions regarding the effect of PVC can be made at this point due to the small number of heterogeneous studies. One included study demonstrated that after PVC, PERCIST classification of response was altered for 5/24 NSCLC lesions during radio- or radiochemotherapy [45]. This is an important observation, since, conceptually, PVC may correct changes in SUV during treatment for changes in tumor volume and contrast, allowing for more appropriate PET-based classification of tumor response. Interestingly, two studies (excluded since no clinical verification was performed) demonstrated PVC to alter response classifications according to European Organisation for Research and Treatment of Cancer (EORTC) or PERCIST criteria in patients with bone metastases and NSCLC [39,49]. In conclusion, future PET response assessment studies should include PVC to allow for metabolic response assessment, irrespective of tumor shrinkage or growth, and quantify its clinical impact.
To improve comparison of PVC's impact between studies, consensus on the preferred combination of PVC and lesion delineation methodologies should be reached. Many PVC methods have been advocated, some specific for oncological application [5,7,50,51]. Still, most studies in this review applied an RC method, a quite simple method assuming spherically shaped lesions, homogeneous activity distributions, and known tumor sizes. Using this method, even small errors in tumor size measurements may result in over- or underestimations of true SUVs. Also, size measurements are often CT-based, whereas partial-volume effects affect metabolic volumes, which may be different from anatomical tumor volume [52] (e.g. due to necrosis and treatment effects). In a previous phantom and simulation study we found that voxel-wise PVC methods such as iterative deconvolution may be preferred, since this only assumes approximate knowledge of PET/CT systems' resolution kernel size, has low dependency on accurate delineation, and has only limited effect on precision [8]. Additionally, such a voxel-wise PVC method could allow for more accurate delineation of tumors [53] and, theoretically, heterogeneous tumor background. However, iterative deconvolution is known to increase image noise levels, which may require some form of a denoising algorithm to be applied [37]. Iterative deconvolution may be relatively easy to implement, and has been demonstrated to perform well using commonly applied background-adapted threshold-based delineation methods [8]. To date, iterative deconvolution has been applied predominantly by the same research group (Supplemental Table 3); more extensive clinical evaluation is warranted. Our previous phantom and simulation study showed that for lesions ≤10 mm in diameter, even with PVC, the acquisition of fully accurate results was not yet possible [8], which may contribute to the relatively low impact of PVC.
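To illustrate what a voxel-wise iterative deconvolution does, the sketch below applies a Van Cittert update to a noise-free 1D activity profile. The choice of the Van Cittert scheme, the Gaussian resolution kernel, the non-negativity clamp, and all names and parameter values are assumptions made for this illustration; the included studies do not necessarily use this variant, and clinical implementations work on 3D images and typically add denoising.

```python
import math


def gaussian_kernel(fwhm_vox, half_width):
    """Normalized, truncated Gaussian kernel with the given FWHM (in voxels)."""
    sigma = fwhm_vox / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    weights = [math.exp(-0.5 * (i / sigma) ** 2)
               for i in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_same(signal, kernel):
    """Convolution with zero padding, output length equal to the input length."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                acc += w * signal[j]
        out.append(acc)
    return out


def van_cittert(measured, kernel, iterations=10, step=1.0):
    """Iteratively add back the residual between measured and re-blurred data."""
    estimate = list(measured)
    for _ in range(iterations):
        reblurred = convolve_same(estimate, kernel)
        # Clamp at zero because activity concentrations cannot be negative.
        estimate = [max(0.0, e + step * (m - r))
                    for e, m, r in zip(estimate, measured, reblurred)]
    return estimate


# Toy profile: background of 1.0 with a 5-voxel lesion of 8.0 in the middle.
true_profile = [1.0] * 41
for i in range(18, 23):
    true_profile[i] = 8.0
psf = gaussian_kernel(fwhm_vox=6.0, half_width=8)
blurred = convolve_same(true_profile, psf)
recovered = van_cittert(blurred, psf, iterations=10)
```

In this noise-free setting the recovered profile lies closer to the true one than the blurred profile does. With real, noisy data each iteration also amplifies noise and can produce overshoot near edges, which is the motivation for the denoising step mentioned above.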
Owing to heterogeneity between studies, the impact of chosen PVC methods on outcomes cannot be established in this review.
A limitation of this systematic review and meta-analysis was the small number of studies included (only six diagnostic studies could be pooled, which is the maximum number of studies in any of the other subsections), with several sources of heterogeneity, such as the included lesion types, malignancy prevalence, lesion size spectra, PET acquisition and reconstruction settings, quantitation methods, and methodological quality. The overall study quality as assessed by QUADAS-2 and QUIPS was good (Fig. 2), but more specific research questions regarding PVC are needed, along with more rigorous designs. Although it was a limitation in this review, the small number of retrieved studies applying PVC in oncology is also an important finding, highlighting the limited application of PVC in recent decades.

Recommendations
When applying PVC in studies investigating diagnostic accuracy, SUV thresholds should be redefined for corrected data. Also, results on test characteristics should be stratified for lesion size (using a 3-cm-diameter or 14-mL cut-off). In prognostication studies, partial-volume-corrected SUV may complement rather than substitute uncorrected SUV, and could be included separately in prognostic models. The impact of PVC on PERCIST classifications of response merits further investigation in prospective studies. For now, we recommend that lesions ≤10 mm in diameter should not be included in quantitative analyses until novel PVC methods proven to be efficacious for these lesions are available. To demonstrate dependency of results on the applied PVC methodology, studies comparing multiple methods in the same sample of patients are highly recommended. Both functional and volumetric semiquantitative PET metrics should be investigated simultaneously, including SUVs, MATV, and their product TLG (see for example refs. [31,37,40,42,43]). Also, when PET is used for therapeutic dosimetry applications, e.g. for nuclide radiotherapy, PVC will likely improve estimates of tracer or radionuclide uptake, and thereby improve estimates of tumor radiation dose.
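Since the recommendation above treats TLG as the product of SUV and MATV, a small sketch of that computation is given below, together with the relative change after PVC. It assumes SUVmean is the SUV measure used in the product; the function names and the example numbers are illustrative and are not taken from any included study.

```python
def total_lesion_glycolysis(suv_mean, matv_ml):
    """TLG as the product of mean SUV and metabolically active tumor volume."""
    return suv_mean * matv_ml


def percent_change(before, after):
    """Relative change in percent, e.g. of SUV or TLG after PVC."""
    return 100.0 * (after - before) / before


# Invented example: PVC raises SUVmean while the delineated volume is unchanged.
tlg_uncorrected = total_lesion_glycolysis(suv_mean=4.0, matv_ml=10.0)
tlg_pvc = total_lesion_glycolysis(suv_mean=5.0, matv_ml=10.0)
tlg_change = percent_change(tlg_uncorrected, tlg_pvc)
```

In practice PVC may also change the delineated volume, so SUV, MATV, and TLG should each be reported with and without correction.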

Conclusion
The accumulated evidence to date does not support routine application of PVC in standard clinical PET studies. In meta-analysis of quantitative diagnostic PET studies, PVC did not increase diagnostic accuracy. Limitations of published studies include the lack of analysis stratified for size, limited exploration of the impact of alternative (SUV) thresholds of test positivity on diagnostic accuracy measures, and heterogeneity in applied PVC methodologies. For accurate and reproducible results on tumor uptake quantification, consensus on the preferred tumor delineation and PVC methodologies needs to be reached. Partial-volume-corrected metrics should be used as adjuncts to, but not yet replacement for, uncorrected data.