Comparison of clinical MRI liver iron content measurements using signal intensity ratios, R2 and R2*

Purpose To compare three types of MRI liver iron content (LIC) measurement performed in daily clinical routine in a single center over a 6-year period. Methods Patients undergoing LIC MRI-scans (1.5T) at our center between January 1, 2008 and December 31, 2013 were retrospectively included. LIC was measured routinely with signal intensity ratio (SIR) and MR-relaxometry (R2 and R2*) methods. Three observers placed regions-of-interest. The success rate was defined as the number of correctly acquired scans divided by the total number of scans. Interobserver agreement was assessed with intraclass correlation coefficients (ICC) and Bland–Altman analysis; correlations between LIC SIR, R2, R2*, and serum values were assessed with Spearman’s rank correlation coefficient. Diagnostic accuracies of LIC SIR, R2, and serum transferrin, transferrin-saturation, and ferritin for detecting increased R2* (≥44 Hz) as an indicator of iron overload were assessed using ROC-analysis. Results LIC MRI-scans were performed in 114 subjects. SIR, R2, and R2* data were successfully acquired in 102/114 (89%), 71/114 (62%), and 112/114 (98%) measurements, respectively, with the lowest success rate for R2. The ICCs of SIR, R2, and R2* did not differ (0.998, 0.997, and 0.999, respectively). R2 and serum ferritin had the highest diagnostic accuracies for detecting elevated R2* as a marker of iron overload. Conclusions SIR and R2* are preferable over R2 in terms of success rates. The shorter acquisition time of R2* and its wide range of measurable LIC values favor R2* over SIR for MRI-based LIC measurement. Electronic supplementary material The online version of this article (doi:10.1007/s00261-016-0831-7) contains supplementary material, which is available to authorized users.

Various diseases are associated with increased liver iron content (LIC), which may induce or contribute to liver damage [1-3]. Serial measurement of LIC during long-term follow-up and treatment is highly desirable, but repeated invasive measurements are not recommended because of the risk of complications from serial liver biopsies. Surrogate biochemical markers, including serum ferritin and transferrin-saturation, are widely used but are flawed by limited specificity. Thus, accurate non-invasive MRI-based methods of LIC measurement are used in clinical practice for patients with (suspected) increased LIC [4,5].
Several types of MRI LIC measurement have been described in the literature. Straightforward in- and out-of-phase gradient echo (GRE) imaging shows signal loss at the later echo time (TE) but is only qualitative and easily confounded by the presence of hepatic steatosis. Quantitative approaches include (i) signal intensity ratio (SIR) measurement (e.g., the Gandon method) and (ii) MR-relaxometry. The Gandon method (henceforth referred to as "SIR") utilizes the liver-to-muscle SIR on differently weighted MRI-scans [6]. This method allows easy and free calculation of the LIC SIR by entering ROI values in an online tool [7]. Hence, assuming the acquisition and placement of regions-of-interest (ROIs) are performed correctly, the method is robust to observer influences. A major limitation is its upper limit of detection of 350 µmol/g (equal to 20 mg/g): changes above that threshold cannot be measured.
MR-relaxometry relies on the calculation of tissue relaxation rates (R2 and R2*, the inverse of the relaxation times T2 and T2*), which increase as iron accumulates and are sensitive to changes in LIC values well above the SIR-threshold. One commercialized R2 approach using single-echo spin-echo (SE) MRI is the FDA-approved St. Pierre method (FerriScan®), performed in 10 min in free-breathing [8]. The per-scan analysis price is approximately $300, on top of the costs of the MRI-scan itself. Alternative free-of-charge approaches are available for R2 using free-breathing or respiratory-triggered SE-MRI and for R2* using single breath-hold GRE MRI [9].
Recent developments in MR-relaxometry include multipeak fat corrections and the use of complex instead of magnitude-only data fitting [10], assessment of the effect of fat suppression on R2* [11], and the comparison of advanced data fit models [12] and analysis approaches [13].
A comparative study of LIC SIR, R2, and R2* in 94 patients with β-thalassemia reported high correlations [14]. However, success rates, interobserver agreement, and applicability for diseases other than β-thalassemia were not investigated, nor were serum markers assessed. The latter may be useful to screen for elevated LIC (i.e., >36 µmol/g), saving expensive and limited MRI time. We hypothesize that R2* is preferable over SIR and R2 in terms of success rate, acquisition time, and range of detection, and over serum values in terms of accuracy in detecting elevated LIC.
In our center, the clinical LIC protocol has included SIR, R2, and R2* since 2005, with regular weekly clinical referrals since 2008. The SIR measurement is recommended by the national guideline for hemochromatosis [15]. It is supplemented by R2 and R2* measurements to fill the gap caused by the SIR method's hard cut-off at 350 µmol/g. To investigate our hypothesis, we (i) assessed SIR, R2, and R2* LIC measurements and their success rates and interobserver agreement; and (ii) compared the diagnostic accuracies of LIC SIR, R2, and surrogate serum markers for correctly predicting elevated LIC based on increased R2*.

Ethical
All data used for this study were acquired in a clinical setting and were anonymized prior to analysis. Informed consent was waived by the Medical Research Ethics Committee of the AMC Amsterdam.

Patients
All MRI-based LIC measurements performed between January 1, 2008 and December 31, 2013 were retrospectively included in this study. As additional measurements were added to the protocol in 2014, only measurements up to the end of 2013 were included. Clinical diagnosis and, when available, serum markers of iron metabolism (total iron, transferrin, transferrin-saturation, ferritin) were collected and subsequently anonymized by a colleague not otherwise involved in this study.

MRI
MRI-scanning was performed supine, feet first, on a 1.5T Avanto MRI-scanner (Siemens AG, Erlangen, Germany) using phased-array coils (body array and spine coil) for localizers and R2 and R2* measurements and the body coil for the SIR measurement [6]. Use of the body coil provided a B1 field that was as homogeneous as possible, reducing variation in SIR measurements due to variations of flip angles between patients. For R2* and R2, the B1 variation is eliminated via the data fit. Breath-hold imaging (localizers, SIR, and R2*) was performed in expiration. Three 10-mm slices with a variable slice gap to cover the liver were positioned identically for all three LIC measurements. Especially for the GRE-based SIR and R2* measurements, careful B0 shimming is important to achieve a homogeneous B0 field, ensuring correct measurements. Shimming was performed with a shim box covering the field-of-view in the feet-head direction and the contours of the abdomen (i.e., excluding the arms) in the left-right and anterior-posterior directions. The SIR measurement according to Gandon et al. requires five (T1, PD, T2, T2+, and T2++) image weightings with specific TR/TE combinations [6]. Table 1 contains an overview of the relevant scan parameters. Of note, the TE interval used for R2* was shorter (1.41 ms) than the standard in- and out-of-phase interval (2.26 ms).

Data analyses
After inclusion, all measurements were checked for correct TRs, TEs, and RF coils using DICOM header information, as specific TR/TE combinations and the use of the body coil are mandatory for SIR measurements. Image quality was assessed by a research trainee (JHR, 4 years of experience) and an abdominal radiologist (JS, 20 years of experience) using a 3-point scale (good/adequate/inadequate). The type of artifact(s) was noted. Measurements with incorrect scan parameters or inadequate image quality were classified as unsuccessful.

ROI-placement
SIR, R2, and R2* data were processed using custom-made software that allowed ROI-placement, LIC SIR calculation, and R2 and R2* data fitting. Three blinded observers (JHR, MAT, and EMA) with 4, 0.5, and 9 years of experience, respectively, independently placed regions-of-interest (ROIs) for three slices per scan. First, the liver parenchyma was masked on R2* source data, excluding a rim near the liver edge (ROI-1, Fig. 1A). Next, non-liver voxels (e.g., vessels, gall bladder) inside the liver contour were masked (ROI-2, Fig. 1B). By subtracting ROI-2 from ROI-1, only liver parenchyma remained (Fig. 1C). Liver ROIs were copied from the R2* data for SIR analysis, with two additional ROIs in both paraspinal muscles, carefully avoiding areas of signal intensity loss close to the lung (Fig. 1D). This also allowed a check to identify whether patients had moved between R2* and SIR measurements, in which case new ROIs were placed. Ghosting artifacts caused by aortic blood flow were present in SIR measurements before November 2012 (when saturation slabs were added). Separate ROIs were placed to remove these artifacts from the liver and muscle ROIs (Fig. 1E, F). Some reports indicate that susceptibility artifacts may affect R2* measurements when using a single ROI in liver segments VII or VIII [16]. Due to the limited number of slices, we did not formally assess segmental variations of R2, R2*, or LIC SIR in this study.
The respiratory triggering applied for R2 data acquisition resulted in slight changes in slice positioning, so new ROIs were placed using R2 source data as described above.

LIC SIR
The calculations published by Gandon et al. were entered into the aforementioned program [7,17], which automatically chooses the most reliable SIR (i.e., T1, PD, T2, T2+, or T2++), which is then converted to LIC SIR. The mean LIC SIR of three slices was used and, when one or more values exceeded the 350 µmol/g threshold, the final value was noted as >350 µmol/g. In two subanalyses, the R2 and R2* values and the individual SIR ratios in patients with LIC SIR >350 µmol/g were evaluated.
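The per-patient summary rule described above (mean of three slices, reported as >350 µmol/g when any slice exceeds the detection limit) can be sketched as follows. This is a minimal illustration, not the authors' software: the function name `summarize_lic_sir` is hypothetical, and it assumes that a slice above the limit is passed as any number greater than 350 (for example `float("inf")`).

```python
from statistics import mean

SIR_UPPER_LIMIT = 350.0  # µmol/g, upper limit of detection of the SIR method


def summarize_lic_sir(slice_values):
    """Combine per-slice LIC SIR values (µmol/g) into one reported value.

    Assumption: a slice above the detection limit is represented by any
    number greater than 350, e.g. float("inf").
    """
    if any(v > SIR_UPPER_LIMIT for v in slice_values):
        # One or more slices exceed the threshold: report the capped value.
        return ">350"
    return mean(slice_values)


print(summarize_lic_sir([120.0, 150.0, 135.0]))
print(summarize_lic_sir([310.0, float("inf"), 330.0]))
```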
In magnitude images, the noise is distributed in a non-Gaussian manner. This is known as Rician noise [18]. At high signal levels, the non-zero mean has a negligible effect on the average signal, but near the noise level, a noise bias exists which needs to be taken into account when fitting R2*. We explored three different fit routines: a truncated exponential fit (A) [19,20], an exponential + constant fit (B) [9,21], and an exponential + Rician noise fit (C). The truncated exponential method A is considered the reference standard but is time-consuming, whereas methods B and C do not require further manual input. We compared methods B and C with method A as reference using Bland-Altman analysis and R2* data from a single reader (EMA). Based on this comparison (the mean paired difference (d̄) was 0.8 Hz for A-C and 33.6 Hz for A-B), we employed method C (Rician noise bias) for the remaining analyses [22,23].
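To illustrate why the noise floor matters, the sketch below compares a fit over all echoes with a truncated fit (in the spirit of method A) on synthetic data. Everything in it is an assumption made for illustration: the first TE of 1.0 ms, the 12 echoes, the noise level, the rule of discarding echoes below three times the noise level, the quadrature approximation of the noise floor, and the use of a log-linear least-squares fit instead of a nonlinear exponential fit. Only the 1.41 ms echo spacing comes from the text.

```python
import math


def loglinear_r2star(te_ms, signal):
    """Least-squares line through ln(signal) versus TE; returns R2* in Hz."""
    n = len(te_ms)
    y = [math.log(s) for s in signal]
    mean_x = sum(te_ms) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in te_ms)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(te_ms, y))
    return -1000.0 * sxy / sxx  # slope is per ms; 1/ms equals 1000 Hz


def truncated_r2star(te_ms, signal, noise_level, factor=3.0):
    """Discard echoes near the noise floor (hypothetical rule), then fit."""
    kept = [(t, s) for t, s in zip(te_ms, signal) if s >= factor * noise_level]
    if len(kept) < 2:
        raise ValueError("fewer than two echoes above the noise threshold")
    return loglinear_r2star([t for t, _ in kept], [s for _, s in kept])


# Synthetic magnitude decay with a noise floor added in quadrature.
TRUE_R2STAR_HZ = 500.0
NOISE_LEVEL = 20.0
te = [1.0 + 1.41 * k for k in range(12)]
sig = [math.hypot(1000.0 * math.exp(-TRUE_R2STAR_HZ / 1000.0 * t), NOISE_LEVEL)
       for t in te]

full = loglinear_r2star(te, sig)
trunc = truncated_r2star(te, sig, NOISE_LEVEL)
print(f"all echoes: {full:.0f} Hz, truncated: {trunc:.0f} Hz")
```

On these synthetic data, the fit over all echoes underestimates R2* because the late echoes sit on the noise floor, whereas the truncated fit stays close to the simulated value.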
R2* calculation was thus performed with a monoexponential model (Eq. 1) with a Rician noise factor. In Eq. 1, E_R describes the Rice distribution (Online Resource 1), where σ is a noise parameter and $S_0 \cdot e^{-R_2^{*} \cdot TE}$ reflects the true magnitude value. Data were averaged inside the ROI before data fitting (average-then-fit).
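Eq. 1 and the definition in Online Resource 1 are not reproduced in this text. As a hedged sketch, a model of the kind described could be written using the standard expression for the mean of a Rice distribution, with the noise parameter written as σ and $I_0$, $I_1$ the modified Bessel functions of the first kind. The exact form used by the authors may differ.

```latex
\begin{align}
S(TE) &= E_R\!\left(S_0\, e^{-R_2^{*}\, TE},\ \sigma\right) \\
E_R(A, \sigma) &= \sigma \sqrt{\frac{\pi}{2}}\;
  L_{1/2}\!\left(-\frac{A^{2}}{2\sigma^{2}}\right) \\
L_{1/2}(x) &= e^{x/2}\left[(1 - x)\, I_0\!\left(-\frac{x}{2}\right)
  - x\, I_1\!\left(-\frac{x}{2}\right)\right]
\end{align}
```

For $A \gg \sigma$ this expectation approaches $A$, and for $A = 0$ it equals $\sigma\sqrt{\pi/2}$, which is the noise floor that biases late echoes.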
The effect of intrahepatic fat on R2* was assessed by applying a biexponential model in a subset (n = 10) with definite presence of fat, as identified by the presence of an oscillating signal intensity decay over time. R2* values with and without correction were compared using Bland-Altman analysis. The mean paired difference (d̄) was 0.1 Hz, indicating low overall fat content in this cohort, and was deemed negligible compared to the subset mean of 70 Hz. Monoexponentially fitted R2* values were used for all comparisons.
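The Bland-Altman quantities used in these comparisons (mean paired difference d̄ and 95% limits of agreement) can be computed as in the sketch below. The example values are invented for illustration and are not study data; the function name `bland_altman` is hypothetical.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (mean paired difference, lower limit, upper limit).

    Limits of agreement are d_bar +/- 1.96 * SD of the paired differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    d_bar = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation
    return d_bar, d_bar - 1.96 * sd, d_bar + 1.96 * sd


# Invented R2* values (Hz) from two fit routines on the same four scans.
fit_a = [40.0, 55.0, 70.0, 120.0]
fit_c = [39.0, 56.0, 68.0, 119.0]
d_bar, lower, upper = bland_altman(fit_a, fit_c)
print(f"d_bar = {d_bar:.2f} Hz, limits of agreement = {lower:.2f} to {upper:.2f} Hz")
```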

R2
For R2 calculation, an average-then-fit routine was applied using a biexponential model as shown in Eqs. 2 and 3. In Eq. 2, S_T(TE) is the signal intensity without noise at time TE, S_0 is the signal intensity at TE = 0, and R2 is the relaxation rate. The subscripts a and b indicate fast and slow relaxation components, respectively. For R2, Rician noise bias was approximated by the Pythagorean addition of an extra fit parameter, the noise factor 'm' in Eq. 3.
In the biexponential model, an iron-dense and an iron-sparse component are assumed, with short and long T2 (i.e., high and low R2), respectively. For further comparisons with LIC SIR and R2*, the bulk R2 was calculated (Eq. 4) in accordance with the literature [8,9,14].
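Eqs. 2 and 3 are not reproduced in this text. Based on the verbal description, a plausible forward model is a sum of a fast and a slow exponential (Eq. 2) combined in quadrature with a noise factor m (Eq. 3). The sketch below encodes that reading; the function names, parameter values, and units (TE in ms, rates in Hz) are assumptions, and the bulk R2 of Eq. 4 is deliberately left out because its form is not given here.

```python
import math


def biexp_signal(te_ms, s0_a, r2_a_hz, s0_b, r2_b_hz):
    """Noise-free biexponential signal S_T(TE): fast (a) plus slow (b) component."""
    return (s0_a * math.exp(-r2_a_hz * te_ms / 1000.0)
            + s0_b * math.exp(-r2_b_hz * te_ms / 1000.0))


def measured_signal(te_ms, s0_a, r2_a_hz, s0_b, r2_b_hz, m):
    """Magnitude signal with the noise factor m added in quadrature."""
    return math.hypot(biexp_signal(te_ms, s0_a, r2_a_hz, s0_b, r2_b_hz), m)


# Hypothetical parameters: fast component 100 Hz, slow component 20 Hz.
for te in (5.0, 20.0, 80.0):
    print(te, round(measured_signal(te, 600.0, 100.0, 400.0, 20.0, 15.0), 1))
```

At short TE the noise factor has almost no effect, while at long TE the modeled signal levels off at m instead of decaying to zero.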

Statistical analyses
Data are described as number (%) or median (interquartile range, IQR). Results of observers were compared using a Friedman test, with the Wilcoxon signed-rank test as post hoc test. Success rates are defined as the number of correctly acquired scans of at least "adequate" quality divided by the total number of measurements. These were compared using a McNemar test. Correlations were assessed with Spearman's correlation coefficients (rS), and interobserver agreement with two-way random, absolute-agreement intraclass correlation coefficients (ICCs). Both were graded according to Landis et al. [24]. Bland-Altman analysis was performed to compare accuracy between the three MRI methods for a single observer and to compare the performance of the three observers [22]. In a separate analysis, the calculated R2 and R2* values were converted to LIC R2(*) values in µmol/g using the formulas provided by St. Pierre et al. and Garbowski et al. [8,20], as these were established with image analysis protocols similar to ours. ROC-analyses were performed for LIC SIR, R2, and serum values with significant correlation with R2* to establish their diagnostic accuracy to identify increased R2*, i.e., ≥44 Hz [9]. R2* was chosen as a reference value as it had the best success rate and shortest acquisition time. The optimal cut-off value for R2 was found by optimizing the Youden index, while for LIC SIR
Table 2. Thirty patients had multiple measurements. To prevent a repeated measurements effect on correlation assessment between LIC SIR, R2, and R2*, only the 114 baseline measurements were used. SIR, R2, and R2* data were available for 108/114 (95%), 72/114 (63%), and 113/114 (99%) baseline measurements.
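The Youden-index cut-off selection mentioned in the statistical analyses can be sketched as follows: for each candidate threshold, sensitivity and specificity are computed against the reference (here, increased R2*), and the threshold maximizing J = sensitivity + specificity - 1 is kept. The data are invented, the function name `youden_cutoff` is hypothetical, and the sketch assumes that both reference classes are present.

```python
def youden_cutoff(values, reference_positive):
    """Return (cutoff, J, sensitivity, specificity) maximizing the Youden index.

    A case is called test-positive when its value is >= cutoff.
    Assumes at least one reference-positive and one reference-negative case.
    """
    best = None
    for cutoff in sorted(set(values)):
        pairs = list(zip(values, reference_positive))
        tp = sum(1 for v, pos in pairs if pos and v >= cutoff)
        fn = sum(1 for v, pos in pairs if pos and v < cutoff)
        tn = sum(1 for v, pos in pairs if not pos and v < cutoff)
        fp = sum(1 for v, pos in pairs if not pos and v >= cutoff)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1.0
        if best is None or j > best[1]:
            best = (cutoff, j, sens, spec)
    return best


# Invented R2 values (Hz) and whether the reference R2* was increased.
r2_values = [12.0, 14.0, 18.0, 16.0, 19.0, 22.0, 30.0, 45.0]
r2star_increased = [False, False, False, True, True, True, True, True]
cutoff, j, sens, spec = youden_cutoff(r2_values, r2star_increased)
print(cutoff, round(j, 2), sens, spec)
```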

MRI success rates
Five SIR measurements were classified as unsuccessful because a surface coil was used, and one due to erroneous TR/TE combinations. Furthermore, image quality was inadequate (respiration artifacts) in a single patient (only R2 and R2* acquired). Hence, SIR was successful in 102/114 (89%), R2 in 71/114 (62%), and R2* in 112/114 (98%) subjects. The success rate of R2 was lower than that of SIR and R2* (P < 0.0001, each). Missing datasets were presumed to not have been scanned, with time constraints and respiratory triggering problems as the major cause of the low success rate of the R2 measurement. For subsequent analyses, only successful baseline measurements were used.

Comparison with the literature
Figure 2A, B also shows published regression lines between either LIC SIR or LIC BIOPSY and R2 (Fig. 2A) and R2* (Fig. 2B). Contrary to our finding, these lines indicate a linear increase of R2* as LIC increases, and a nonlinear increase of R2 as LIC increases. To assess whether this is caused by LIC SIR or by R2 or R2*, we applied established conversion formulae to convert our R2 (Eq. 7) and R2* (Eq. 8) values to LIC values [8,20]. These established conversion formulae show a nonlinear relation between R2 and true LIC (Eq. 7) and a linear relation between R2* and true LIC (Eq. 8). Hence, the scatter plot between LIC R2* and LIC SIR also revealed a quadratic relation, and that between LIC SIR and LIC R2 a linear one (data not shown).
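As a quick arithmetic check, the success rates follow directly from the counts reported above (successful measurements out of 114 baseline measurements). The helper name `success_rate` is hypothetical.

```python
TOTAL_BASELINE = 114
successful = {"SIR": 102, "R2": 71, "R2*": 112}


def success_rate(n_successful, n_total):
    """Success rate as a percentage of all measurements."""
    return 100.0 * n_successful / n_total


rates = {name: success_rate(n, TOTAL_BASELINE) for name, n in successful.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f}%")
```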

Diagnostic accuracies of LIC SIR, R2, and serum values
Serum total iron, transferrin, transferrin-saturation, and ferritin were available for 56, 56, 54, and 96 out of 114 measurements. All four correlated significantly with R2*, with the best correlation for ferritin at rS = 0.80 (P < 0.0001, n = 94). Increased R2* (≥44 Hz) was present in 91 subjects. Of the MRI and serum methods, R2 and ferritin had the best diagnostic accuracies to detect increased R2* (Table 4). Figure 4A-C shows true and false positive and negative results of R2 (Fig. 4A), LIC SIR (Fig. 4B), and ferritin (Fig. 4C) for establishing increased R2*.

Discussion
This study shows that for routine clinical MRI-based LIC measurements, SIR and R2* are more often successful than R2. Interobserver agreement was near perfect (ICC > 0.9) for all methods. R2 and R2* methods provided relaxation rates when the SIR-threshold (>350 µmol/g) was already exceeded. This gives them an advantage over SIR in subjects with transfusional hemosiderosis (at least 55% of our population), in whom LIC values can easily surpass 350 µmol/g. The combination of high success rate, high interobserver agreement, ability to detect changes in LIC over a wide range of LIC values, and single breath-hold acquisition favors the R2* method for LIC measurement.
In our study, the relationship between R2* and LIC SIR was quadratic and remained quadratic when R2* was expressed as a LIC value using a previously published (biopsy-proven) conversion formula. Other authors report linear relationships. Given the physics of the R2*-iron relationship, which is basically linear [25], this discrepancy arises either from our R2* acquisition and analysis or from the reference standard. To rule out the former, we compared three fit routines. The exponential + Rician noise factor fit provided results nearly identical to those of the established and widely applied, but labor-intensive, method of manual truncation before exponential fitting, in a fraction of the required time.
With respect to the reference standard, St. Pierre et al. [8], Wood et al. [9], Hankins et al. [19], Garbowski et al. [20], and Anderson et al. [21] all used biopsy-determined LIC BIOPSY as reference standard, whereas we and Christoforidis et al. [14] used the LIC SIR according to Gandon. Given the similarity of our MRI protocols, it is unsurprising that Christoforidis' and our data points show considerable overlap. Arguably, their linear relation between LIC SIR and R2* could also be described by a quadratic polynomial.
Fig. 3. T1W liver-to-muscle SIR against R2*. Scatter plot of R2* values (x-axis) against the liver-to-muscle SIR (y-axis) of successful baseline T1W SIR measurements. Data are grouped into LIC SIR <350 and LIC SIR >350 µmol/g.
Apart from the linear relationship, the other authors report a much steeper increase of R2* as LIC increases [9,19-21]. Anderson et al.'s very steep increase could be due to a long TE1 of 2.2 ms compared to all other studies (range of TE1: 0.8-0.99 ms), which hampers the ability to accurately estimate high R2* values. The fact that the control values of R2* in subjects without iron overload in those studies, as in this paper, hover around 40 Hz is a further argument that the observed difference in the LIC-R2* relationship does not arise from the R2* acquisition or analysis but from the reference standard.
Hence, the most likely cause of the deviating quadratic relation between R2* and estimated LIC is the piecewise sampling of the LIC range with five differently weighted GRE-sequences for LIC SIR. This has artificially imposed a quadratic behavior on the actually linear relationship between R2* and true LIC BIOPSY. If one looks at the fundamental GRE signal equation (Eq. 9), where PD is proton density and α is the flip angle, and applies this to the liver-to-muscle signal intensity ratio, the PD and sin(α) terms drop out. By taking the natural logarithm, we find Eqs. 10 and 11. The latter shows that the relationship between R2* and SIR is logarithmic. Indeed, plotting Fig. 3 with a log-scale for the signal intensity ratio on the y-axis linearized the line (data not shown).
For R2, single- and multiecho SE acquisitions are possible. Multiecho SE decreases R2 due to residual signal of stimulated echoes at a given TE. Single-echo SE increases R2 because long TEs cause increased sensitivity to diffusion, and hence increased signal loss at a given TE.
Reported single-echo SE R2 values [8,9] were concordantly higher for the same estimated LIC compared to multiecho SE results as in this study and in [14]. In terms of R2 data fitting, we, like many others, applied a biexponential model, and we did not assess non-exponential decay models as proposed, for instance, by Jensen et al. [26].
The main limitation of our study is the lack of biopsy confirmation. In our center, liver biopsy for iron determination is seldom performed. National, European, and American guidelines all recommend restraint in performing biopsy and underline the high sensitivity of MRI [15,27,28]. Moreover, differing processing steps to obtain LIC BIOPSY are reported, compromising generalizability. In Gandon's method, paraffin-embedded liver biopsy specimens are dewaxed using a protocol with a triple xylene wash to remove lipid solids from the sample. This approach was shown to have an elevating effect on the dry weight liver iron calculation compared to processing fresh tissue samples [29]. Another limitation is that we did not perform multipeak fat-correction on complex data [10]. This was not feasible with only magnitude data available. Comparison to other literature is further hampered by the use of different image acquisition and postprocessing protocols, which directly influence the calibration curves between the reference standard and the index test. We have opted to compare our findings to calibration curves obtained with similar postprocessing protocols.
ROC-analyses showed that R2 and ferritin have the highest diagnostic accuracy to identify increased R2* (≥44 Hz). Both ferritin (≥524 µg/L) and R2 (≥18.3 Hz) had positive predictive values of 100%, but the wide distribution of ferritin levels for R2* ≥44 Hz indicates that ferritin cannot be used confidently to follow up treatment or to determine the LIC accurately. In contrast, R2 values are closely distributed around the regression line. In addition, ferritin lacks the spatial information that MRI provides, which allows segmental LIC measurement and follow-up. R2 datasets were missing (i.e., not scanned) in 42/114 (37%) subjects. As R2 is part of our routine scan protocol, this illustrates that the long and artifact-prone R2 series is skipped first by the radiographer. This makes the R2 series less suited as first choice for LIC measurement. Our results favor the use of R2* measurements for daily clinical practice, with the use of an exponential + Rician noise fit method to save time in analysis. The recommendation to (only) use R2* comes with cautions. It requires careful consideration of scan parameters, which should be kept equal for all measurements. Ideally, routine quality control with phantom testing should be performed.
AUROC, area under the ROC curve; PPV, positive predictive value; NPV, negative predictive value. Values in parentheses reflect the 95% confidence intervals.
In conclusion, as R2* can be obtained in a single breath-hold with excellent success rates, high interobserver agreement, and the ability to detect changes over a wide range of LIC values, and is available from all major vendors without additional per-scan costs, it is our first choice for LIC measurement.
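Eqs. 9 to 11 are not reproduced in this text. As a hedged sketch of the argument, and assuming the standard spoiled-GRE signal equation with equal proton density and flip angle in liver (l) and muscle (m), the derivation may run along the following lines. The shorthand $f(T_1)$ is introduced here for illustration only.

```latex
\begin{align}
S &= PD \,\sin\alpha\;
  \frac{1 - e^{-TR/T_1}}{1 - \cos\alpha\, e^{-TR/T_1}}\;
  e^{-TE \cdot R_2^{*}}
  \;=\; PD \,\sin\alpha\; f(T_1)\; e^{-TE \cdot R_2^{*}} \\
\mathrm{SIR} &= \frac{S_l}{S_m}
  = \frac{f(T_{1,l})}{f(T_{1,m})}\;
    e^{-TE \left(R_{2,l}^{*} - R_{2,m}^{*}\right)} \\
\ln(\mathrm{SIR}) &= \ln\!\frac{f(T_{1,l})}{f(T_{1,m})}
  - TE \left(R_{2,l}^{*} - R_{2,m}^{*}\right)
\end{align}
```

Under these assumptions, ln(SIR) decreases linearly with liver R2* for a fixed TE, which is consistent with the observation that a log-scaled SIR axis linearizes the plot.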