Mapping Functions in Health-Related Quality of Life

Background. Clinical trials in cancer frequently include cancer-specific measures of health but not preference-based measures such as the EQ-5D that are suitable for economic evaluation. Mapping functions have been developed to predict EQ-5D values from these measures, but there is considerable uncertainty about the most appropriate model to use, and many existing models are poor at predicting EQ-5D values. This study aims to investigate a range of potential models to develop mapping functions from 2 widely used cancer-specific measures (FACT-G and EORTC-QLQ-C30) and to identify the best model. Methods. Mapping models are fitted to predict EQ-5D-3L values using ordinary least squares (OLS), tobit, 2-part models, splining, and to EQ-5D item-level responses using response mapping from the FACT-G and QLQ-C30. A variety of model specifications are estimated. Model performance and predictive ability are compared. Analysis is based on 530 patients with various cancers for the FACT-G and 771 patients with multiple myeloma, breast cancer, and lung cancer for the QLQ-C30. Results. For FACT-G, OLS models most accurately predict mean EQ-5D values with the best predicting model using FACT-G items with similar results using tobit. Response mapping has low predictive ability. In contrast, for the QLQ-C30, response mapping has the most accurate predictions using QLQ-C30 dimensions. The QLQ-C30 has better predicted EQ-5D values across the range of possible values; however, few respondents in the FACT-G data set have low EQ-5D values, which reduces the accuracy at the severe end. Conclusions. OLS and tobit mapping functions perform well for both instruments. Response mapping gives the best model predictions for QLQ-C30. The generalizability of the FACT-G mapping function is limited to populations in moderate to good health.


INTRODUCTION
In the United Kingdom, the National Institute for Health and Care Excellence (NICE) recommends the EQ-5D to measure preference-based health-related quality of life (HRQL) to estimate quality-adjusted life years (QALYs) in economic evaluations. 1 It has been demonstrated that the EQ-5D-3L is sensitive to changes in HRQL in patients with cancer. 2 However, many cancer studies do not include the EQ-5D and are more likely to include 1 of 2 cancer-specific HRQL questionnaires: the European Organization for Research and Treatment of Cancer Quality of Life Questionnaire Core 30 (EORTC QLQ-C30) or the Functional Assessment of Cancer Therapy-General Scale (FACT-G), both of which are non-preference-based measures. NICE guidelines acknowledge that the EQ-5D is not always available and in these situations recommend using mapping to estimate EQ-5D. Mapping allows health state utility values to be predicted when no preference-based measure is included in the study. This approach involves estimating the relationship between a non-preference-based measure and a generic preference-based measure using statistical association, and it requires a degree of overlap between the descriptive systems of the 2 measures and that the 2 measures are administered on the same population. A review of mapping functions by Brazier et al. 3 demonstrates that researchers use a number of models, including ordinary least squares regression (OLS), generalized linear models, tobit, censored least absolute deviations (CLAD), 2-part models, and response mapping, to predict health state preference values. Studies also report a variety of methods to assess model and predictive performance, including predicted mean and standard deviation, median, range of predictions, Akaike information criterion (AIC), Bayesian information criterion (BIC), R², pseudo-R², mean estimates across severity groups, root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE).
Only 1 mapping function has been published that maps from FACT-G to EQ-5D-3L: it fits separate OLS and CLAD models at the FACT-G dimension level, and shows that values are poorly predicted for high and low EQ-5D values. 4 There are several published mapping functions for the EORTC QLQ-C30. Potentially, the most useful function is that of McKenzie and van der Pol, 5 who used OLS to predict EQ-5D-3L values and ordered probit to predict EQ-5D-3L dimension levels. The ordered probit model did not produce reliable predictions, but the OLS gave reasonable EQ-5D-3L estimates. It is possible that other mapping functions such as the tobit model would produce more accurate estimates, although these are not explored in this article. Recently, Khan and Morris 6 explored a number of alternative models for predicting EQ-5D in patients with lung cancer; their models assume EQ-5D scores lie between 0 and 1, and they show that the nonlinear beta-binomial model gave the best predictions of EQ-5D. The beta-binomial model is not going to be applicable to all populations because some respondents will have negative EQ-5D scores. Proskorovsky et al. 7 used linear regression to predict EQ-5D-5L using a sample of 154 patients with multiple myeloma and to estimate models with and without the EORTC QLQ-MY20; this is used in conjunction with the EORTC QLQ-C30 to assess quality of life in patients with multiple myeloma, but the results are not validated using an external sample. The remaining published mapping functions are not as useful. Kontodimopoulous et al. 8 used OLS to predict EQ-5D-3L; the model is based on a small sample of patients, and they state that their function is unreliable. Wu et al. 9 required data on FACT-G as well as EORTC QLQ-C30 to produce mapping estimates, which studies may not routinely collect together. Pickard et al. 10 mapped to patient time trade-off (TTO) values rather than the EQ-5D-3L index. 
Crott and Briggs 11 developed their function from a female-only sample, and therefore the results are not generalizable, as evidenced by an assessment of this mapping function 12 demonstrating that EQ-5D-3L estimates are not stable among different data sets. Versteegh et al. 13 only used OLS in their analysis and only provided Dutch values, not UK ones, whereas Versteegh et al. 14 developed separate mapping functions for those in poor health versus those in better health, and these have limited reliability due to small sample sizes at the poor health end. Some of these mapping functions have worse predictive performance than the McKenzie and van der Pol 5 model in a multiple myeloma patient data set. 15 There are other papers that map to non-UK values of the EQ-5D-3L. [16][17][18] Given the lack of robust mapping studies in this area, the aim of this article is to estimate mapping functions from 2 cancer-specific HRQL measures, FACT-G and EORTC QLQ-C30, to the EQ-5D-3L to test the applicability of different mapping approaches that have been used in the literature and provide recommendations for future mapping studies. This article presents the results from testing alternative common modeling techniques that are recommended in the literature and uses recommended criteria to identify the most appropriate mapping functions.
EQ-5D-3L is referred to as EQ-5D and EORTC QLQ-C30 as QLQ-C30 in the rest of the article.

EQ-5D
The EQ-5D is the most widely used generic preference-based measure of health-related quality of life. The EQ-5D has 5 dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. 19 Each dimension has 3 levels of severity. Each health state described by the EQ-5D has a utility value anchored on a 0 to 1 scale, where 0 represents death and 1 represents full health. The values used here are produced using the UK value set. 20

FACT-G
The FACT-G is a 27-item cancer-specific HRQL measure that has been widely validated. 21 Each item has 5 options ranging from not at all (a score of 0) to very much (a score of 4), and these are summed to obtain a global score as well as 4 subscale scores: physical well-being, social/family well-being, emotional well-being, and functional well-being.

EORTC QLQ-C30
The QLQ-C30 is a 30-item cancer-specific HRQL measure that has also been widely validated. 22 Two items ask about overall quality of life and overall health, and the remainder cover 5 functioning scales (physical, role, social, emotional, and cognitive functioning) and 9 symptoms scales (fatigue, nausea and vomiting, pain, dyspnea, sleep disturbance, appetite loss, constipation, diarrhea, and financial impact).

Data
Four data sets are used in this analysis: one contains the FACT-G and EQ-5D, and the remaining three contain the QLQ-C30 and EQ-5D and are combined to produce a reliable mapping function. The FACT-G data set consists of 530 US respondents with 13 different types of stage III and IV cancers who completed the EQ-5D and FACT-G. 23 Fifty-two percent of respondents are male, and the average age of the sample is 59 years. The 3 data sets combined for the QLQ-C30 mapping analysis are a randomized controlled trial of 572 patients with multiple myeloma (VISTA study; ClinicalTrials.gov number NCT00111319), 24 and 100 patients with breast cancer and 99 patients with lung cancer having consultations at a Canadian cancer clinic (Vancouver Cancer Clinic data). This gives a total of 771 cases for the mapping study; 44% of responders are male, and the mean age of patients is 68 years.

Models
Five commonly applied alternative types of model are fitted to the data: OLS, tobit, 2-part models, and splining to map to EQ-5D values, and response mapping to map to individual EQ-5D dimension scores. The most commonly used mapping model reported in the literature is OLS. 25,26 These models are typically able to predict the mean values but are poor at predicting those in poor health and full health, and they do not allow for the fact that the EQ-5D is bounded at 1 for full health and -0.594 for the worst possible state described by EQ-5D. Tobit models are therefore fitted to allow for the bounded nature of EQ-5D, thus limiting predictions to within a credible range. An alternative model that can be fitted in an attempt to predict responders in perfect health is the 2-part model, which uses a combination of 2 different model types to predict different parts of the distribution of the data. Logistic regression is fitted to predict the probability of whether responders are in full health (FH), and a truncated OLS is applied to predict EQ-5D values for those not in full health. The results from the 2 parts of the model are combined to obtain an overall value using an expected value approach, 27 that is, $\widehat{\text{EQ-5D}} = \Pr(\text{FH}) \times 1 + [1 - \Pr(\text{FH})] \times \widehat{\text{EQ-5D}}_{\text{not FH}}$, where $\widehat{\text{EQ-5D}}_{\text{not FH}}$ is the value predicted by the truncated OLS. Given that the EQ-5D fails to approximate to the normal distribution, the final model that is fitted to the EQ-5D index uses splining, also known as fractional polynomials. Splining can be used to identify changes (cut points) in the distribution of the continuous explanatory variables (QLQ-C30 or FACT-G total or domain scores) and models these changes using different mathematical functions. The cut points were identified using the multivariable fractional polynomial function in Stata. 28 This function identifies cut points and fits all possible polynomial functions to the data using power functions ranging from -2 to 3, and identifies the best-fitting model for predicting the outcome variable (EQ-5D score).
Splining functions are applied to the best-fitting OLS/tobit dimension-based models to test whether splines offer an improvement over using squared terms.
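The expected value step of the 2-part model can be illustrated with a minimal Python sketch. It assumes that full health is valued at 1 and that the two component models have already been fitted; the coefficients and the predictor (a total score of 90) are hypothetical values chosen for illustration and are not estimates from this study.

```python
import math


def logistic(z):
    """Inverse logit: converts a linear predictor into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def two_part_prediction(p_full_health, value_if_not_full_health):
    """Expected EQ-5D value, with full health valued at 1."""
    return p_full_health * 1.0 + (1.0 - p_full_health) * value_if_not_full_health


# Hypothetical coefficients for a respondent with a total score of 90.
total_score = 90
p_fh = logistic(-6.0 + 0.06 * total_score)        # part 1: Pr(full health)
value_not_fh = min(0.2 + 0.006 * total_score, 1)  # part 2: value if not in full health
expected_value = two_part_prediction(p_fh, value_not_fh)
print(round(expected_value, 3))
```

Because the combined value is a probability-weighted average, a 2-part model of this form will rarely predict exactly 1 for an individual respondent even when the first part assigns a high probability of full health.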
In the mapping literature, OLS, tobit, 2-part, and splining models are usually reliable at predicting the group EQ-5D mean and median values, and they are able to distinguish between severity levels but are poor at predicting the overall range of EQ-5D values. An alternative to modeling the EQ-5D value is to use response mapping, which can predict the 5 EQ-5D dimension levels. 29,30 Multinomial logistic regression models are estimated for each dimension, and the estimates from these regressions are used to categorize respondents into levels 1, 2, or 3 of each of the EQ-5D dimensions and thus predict the EQ-5D health state for each respondent. A total of 2000 Monte Carlo simulations are run to estimate EQ-5D health states. The standard set of UK general population values is then applied to each predicted health state to obtain EQ-5D values. 20 Eight model specifications (models 1 to 8) are fitted for OLS, tobit, and 2-part models; these specifications can be seen in Table 2, illustrated using the FACT-G data set. Model 1 uses the FACT-G overall score. Models 2 to 5 are based on FACT-G domain scores; model 2 includes all domains regardless of statistical significance, model 3 includes only statistically significant domains, model 4 includes squared and square root terms, and model 5 includes interaction terms. Models 6 and 7 are item-level models that include only significant items; model 7 merges item levels for levels that are shown to be disordered in model 6 (item coefficient size does not increase or decrease by item level). Patient and disease characteristics were explored in model 8. Splining models were fitted to total score and domain level (models 1 to 5).
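The response-mapping procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the level probabilities are taken as given (in practice they would come from the fitted multinomial logistic regressions), the simulated values are simply averaged, and the decrements are those commonly reported for the UK time trade-off value set, which should be checked against the published source before use. They do reproduce the value of -0.594 for the worst state quoted in this article.

```python
import random

# Decrements for levels 1, 2, 3 of each dimension (assumed UK value set).
UK_DECREMENTS = {
    "mobility": (0.0, 0.069, 0.314),
    "self_care": (0.0, 0.104, 0.214),
    "usual_activities": (0.0, 0.036, 0.094),
    "pain_discomfort": (0.0, 0.123, 0.386),
    "anxiety_depression": (0.0, 0.071, 0.236),
}
ANY_PROBLEM = 0.081  # applied once if any dimension is above level 1
ANY_LEVEL_3 = 0.269  # applied once if any dimension is at level 3


def uk_value(state):
    """EQ-5D value for a state given as {dimension: level in (1, 2, 3)}."""
    levels = list(state.values())
    if all(level == 1 for level in levels):
        return 1.0
    value = 1.0 - ANY_PROBLEM
    for dimension, level in state.items():
        value -= UK_DECREMENTS[dimension][level - 1]
    if any(level == 3 for level in levels):
        value -= ANY_LEVEL_3
    return value


def simulated_value(level_probs, n_sims=2000, seed=0):
    """Mean value over Monte Carlo draws of a respondent's health state.

    level_probs maps each dimension to (P(level 1), P(level 2), P(level 3)).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        state = {
            dimension: rng.choices((1, 2, 3), weights=probs)[0]
            for dimension, probs in level_probs.items()
        }
        total += uk_value(state)
    return total / n_sims


# Hypothetical predicted probabilities for one respondent.
example_probs = {dimension: (0.6, 0.35, 0.05) for dimension in UK_DECREMENTS}
print(round(simulated_value(example_probs), 3))
```

Because the value set is applied only after the health state has been predicted, the same predicted states could be valued with a different country tariff by swapping the decrement table.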
To avoid overfitting models, the rule of 10 participants per variable for continuous models and 10 events for the smallest category for response-mapping models is used. The FACT-G data set does not include responders with very poor health, with only 0.9% of responders having negative EQ-5D values in contrast to 18% of responders reporting full health on the EQ-5D. No responder had extreme problems for mobility, and few responders indicated extreme problems for self-care (0.4%), usual activities (6%), pain/discomfort (3%), or anxiety/depression (2%). Applying the overfitting rule to the response-mapping EQ-5D level 3 predictions for models that would include FACT-G item levels (models 6 and 7) restricted the number of items that should be included to 1 item. A model that could include only 1 FACT-G item is not going to be useful at predicting EQ-5D dimension responses, so models 6 and 7 are not fitted. There were slightly more EQ-5D level 3 responders in the QLQ-C30 data set, but we were again restricted to including 1 QLQ-C30 item; again, we chose not to fit model 6. However, because there are more EQ-5D level 3 responses, it was possible to collapse QLQ-C30 items into a smaller number of levels. Therefore, we were able to fit model 7 to the QLQ-C30 data set.
Models are fitted using backward regression, and variables are removed from the model if nonsignificant at p < 0.1. When variables are highly correlated (correlation > 0.7), the variable that is most significant and judged most likely to map to the EQ-5D based on prior expectations is selected. Standard errors of regression coefficients are calculated from bootstrap estimates with 5000 bootstrap samples for each model.
Model goodness of fit is measured using AIC, BIC, and MAE, in which smaller values indicate better model fit. Model performance is also assessed visually by plotting observed and predicted EQ-5D values. Standard model tests are also examined, including R² and adjusted R² for OLS and pseudo-R² for the other models; the Ramsey Regression Equation Specification Error Test (RESET) is used in OLS to test whether nonlinear combinations of variables in the model help explain the variability, where a significant result indicates that a nonlinear model is more appropriate. Sigma is reported for tobit and truncated regression models, and is the equivalent of RMSE in linear regression models. The link test is used to check model specification. The Hosmer-Lemeshow test is used to assess goodness of fit for logistic regression models (first part of 2-part models); it assesses whether predicted probabilities agree with observed probabilities and should be nonsignificant for a model that accurately predicts observed values.
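For readers less familiar with the error-based statistics, the following Python sketch shows how MAE and RMSE are computed from observed and predicted values, and how MAE can be reported within severity subgroups. The function names and the example data are hypothetical and serve only as an illustration.

```python
import math
from collections import defaultdict


def mae(observed, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error: penalizes large errors more than MAE."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def mae_by_group(observed, predicted, groups):
    """MAE computed separately within each severity group."""
    errors = defaultdict(list)
    for o, p, g in zip(observed, predicted, groups):
        errors[g].append(abs(o - p))
    return {g: sum(e) / len(e) for g, e in errors.items()}


# Hypothetical observed and predicted EQ-5D values with a severity label.
obs = [1.0, 0.85, 0.62, 0.10]
pred = [0.90, 0.80, 0.65, 0.35]
severity = ["mild", "mild", "moderate", "severe"]
print(mae(obs, pred), rmse(obs, pred), mae_by_group(obs, pred, severity))
```

Reporting MAE by subgroup makes it visible when a model that predicts the overall mean well is still inaccurate for respondents in poor health.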

Model performance and discrimination
Summary statistics, including mean and range, are examined to assess overall model predictions. A severity measure is used to assess the discriminative performance of the predicted EQ-5D value among different severity groups. For FACT-G, the Eastern Cooperative Oncology Group (ECOG) performance status 31 is used to categorize respondents according to severity. The ECOG has 5 response categories: normal activity without symptoms, some symptoms but do not require bed rest during the waking day, require bed rest for less than 50% of the waking day, require bed rest for more than 50% of the waking day, and unable to get out of bed. There are no patients in the most severe level, and few patients (n = 21 [4%]) required bed rest for more than 50% of the waking day; therefore, this category is merged with the category requiring bed rest for less than 50% of the waking day. The general health status item of the QLQ-C30 is used to categorize respondents according to severity in the QLQ-C30 data set. Response options ranged from poor to excellent (i.e., from 1 to 7). Discriminative ability among severity groups using these measures is tested using ANOVA. MAEs are reported for each subgroup.

Model validation
Models are validated internally using bootstrapping techniques to estimate a shrinkage factor that allows for overoptimism of the predictive ability of the fitted model (a model is better at predicting estimates on the same data from which the model is derived, compared to an external data set). Methods reported by Steyerberg et al. 32 are used to assess all models, and shrinkage coefficients are reported to counter overoptimism of estimates. To estimate the shrinkage factors, 5000 bootstrap estimates are run, and for each bootstrap sample the EQ-5D predicted score (linear prediction) is calculated. The slope of the EQ-5D predicted score in relation to the observed score is then calculated for each sample, and the mean slope across the 5000 samples denotes the shrinkage coefficient. A shrinkage coefficient of less than 1 (the typical value expected for a shrinkage coefficient) reflects an ''overfitting'' of the data.
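The bootstrap estimate of the shrinkage coefficient can be sketched in Python as below. This is a simplified illustration with a single predictor and synthetic data, and it makes two assumptions that go beyond the description above: each model is refitted on the bootstrap sample and its linear prediction is evaluated on the original sample, and the slope is obtained by regressing observed values on those predictions. The number of bootstrap samples is reduced from 5000 to keep the example fast.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def shrinkage_coefficient(x, y, n_boot=500, seed=0):
    """Mean calibration slope across bootstrap refits of the model."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        # Resample respondents with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        intercept, slope = fit_line([x[i] for i in idx], [y[i] for i in idx])
        # Linear prediction of the bootstrap model on the original sample.
        predicted = [intercept + slope * xi for xi in x]
        # Calibration slope: observed values regressed on predictions.
        _, calibration_slope = fit_line(predicted, y)
        slopes.append(calibration_slope)
    return sum(slopes) / n_boot


# Synthetic data standing in for a total score and an EQ-5D value.
data_rng = random.Random(1)
score = [data_rng.uniform(33, 108) for _ in range(200)]
utility = [0.1 + 0.007 * s + data_rng.gauss(0, 0.15) for s in score]
print(round(shrinkage_coefficient(score, utility), 3))
```

With many candidate predictors and a small sample, the refitted models chase noise in each bootstrap sample and the mean slope falls further below 1; a single-predictor example such as this one will typically stay close to 1.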

Model selection
When producing a mapping model, the factors that are important in selecting a model are accuracy of the predicted mean and standard error, the range of predictions, MAE, shrinkage, and the reproducibility of the model among different severity states. Mapping and model-fitting literature do not suggest a single criterion for use in selecting the best-fitting model, and the most appropriate measure may depend on the purpose of the mapping function; for example, populating a model may require accurate predictions of mean preference-based values for different severity groups, whereas accurate overall means at different time points may be sufficient when subgroup analysis is not undertaken. Therefore, when selecting models, all criteria are given equal weighting, models are ranked based on these factors, and the mean rank per model is estimated. The model with the best ranking is then selected, and these are then compared among the different estimation methods (OLS, tobit, 2-part, splining, and response mapping). All mapping functions are fitted in Stata version 12. 33
Financial support for this study was provided entirely by a grant from the MRC-NIHR (UK Medical Research Council and National Institute for Health Research) Methodology Research Programme. The study was part of the NICEQoL project looking at the use of generic and condition-specific measures of HRQL for NICE decision making. 2 The funding agreement ensured the authors' independence in designing the study, interpreting the data, writing, and publishing the report.
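The equal-weight ranking used for model selection can be sketched in Python as follows. The criteria and the numbers are hypothetical; the sketch only assumes that each criterion can be expressed so that either smaller or larger values are better, and that tied models share the average of the ranks they occupy.

```python
def rank_models(scores, higher_is_better=False):
    """Rank models on one criterion (1 = best); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared_rank
        i = j + 1
    return ranks


def mean_rank(criteria):
    """Average rank per model, giving every criterion equal weight.

    criteria is a list of (scores, higher_is_better) pairs.
    """
    totals = {model: 0.0 for model in criteria[0][0]}
    for scores, higher_is_better in criteria:
        for model, rank in rank_models(scores, higher_is_better).items():
            totals[model] += rank
    return {model: total / len(criteria) for model, total in totals.items()}


# Hypothetical performance statistics (smaller is better for all three).
criteria = [
    ({"ols": 0.10, "tobit": 0.11, "two_part": 0.12}, False),     # MAE
    ({"ols": 0.001, "tobit": 0.004, "two_part": 0.002}, False),  # |error in mean|
    ({"ols": 0.08, "tobit": 0.05, "two_part": 0.03}, False),     # |1 - shrinkage|
]
result = mean_rank(criteria)
best_model = min(result, key=result.get)
print(best_model, result)
```

A rank-based summary of this kind ignores how far apart the models are on each criterion, so two models with nearly identical statistics are treated the same as two models that differ substantially.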

FACT-G
The characteristics and a summary of EQ-5D values and FACT-G responses are presented in Table 1. EQ-5D values did not cover the full possible range and went from -0.135 to 1. The distribution of the EQ-5D index for the FACT-G data set is shown in Figure 1a, which reflects the distribution of possible EQ-5D values. The average global FACT-G score ranges from 33 to 108; thus, like the EQ-5D, it did not cover the worse end of the FACT-G scale. The relationship between the global FACT-G score and EQ-5D is moderate (Spearman's correlation r = 0.575). The EQ-5D correlates moderately with the physical and functional domains of the FACT-G (r = 0.566, r = 0.501, respectively), although the correlations are weak for the social and emotional domains (r = 0.178, r = 0.382, respectively).

QLQ-C30
The characteristics and a summary of EQ-5D values and QLQ-C30 responses are presented in Table 1 for the combined sample and for each data set. Mean age and gender distribution varied by data set, as did mean EQ-5D values, which are lowest for the multiple myeloma data set. Only the multiple myeloma data set covered the entire range of the EQ-5D, and it has lower ceiling effects than the other data sets, with 8% of responses at full health on EQ-5D in comparison to 24% and 17% for the breast and lung cancer data sets, respectively. Figure 1b presents the histograms for each data set and the combined data set, showing that the distributions differed by data set, but without further information it was not possible to conclude whether this is due to differences in the severity of the patients in each data set or differences in the pattern of EQ-5D by condition. The scores for the QLQ-C30 scales most noticeably varied among the 3 data sets for physical functioning, role functioning, pain, dyspnea, constipation, and global quality of life. Assessment of the correlations between the EQ-5D and the QLQ-C30 scale scores indicated that the highest correlations are with physical functioning, role functioning, fatigue, and pain (r = 0.701, r = 0.688, r = -0.625, and r = -0.735, respectively).
Table 2 illustrates the model selection process for OLS. Model 8 included patient and disease characteristics, but these are not statistically significant predictors of EQ-5D score; therefore, the results for this model are not shown in Table 2. By definition, all models predict the overall mean EQ-5D value for the data set, but underestimate those in near-full or full health and overestimate those in poorer health states. No model predicts a value lower than 0.155 (observed values range from -0.135 to 1). All models are able to discriminate between different levels of health, as measured by categorized EQ-5D value and ECOG.
MAE is large for those in poor health, which is expected given the range of model predictions. Item-level models consistently performed better than the domain and total score models, although they are most likely to overfit and are poor at predicting values away from the overall mean for the data set, a finding similar to that of other mapping studies in the literature. 14,34 All of the summary statistics and model performance tests are ranked. Giving all performance statistics equal weighting indicates that model 6 (significant FACT-G items) is the best-performing OLS model for estimating EQ-5D values from the FACT-G. This process is then repeated for tobit, 2-part, splining, and response mapping (see Longworth et al. 2 for these results) for the FACT-G and the QLQ-C30. The best-ranked functions for each model (OLS, tobit, etc.) are then compared using the same approach.

Selecting models
Best-fitting models: FACT-G
Table 3 and Figure 2 summarize the best-fitting OLS, tobit, 2-part modeling, splining, and response-mapping models for the FACT-G. FACT-G item-level models give the best model predictions for OLS and tobit, whereas a significant domain-level model with square terms is the best model for the 2-part models (model 4). Only domain levels are fitted for splining and response mapping, and model 3, which is the one with significant domains, is the best for both of these models. OLS gives the best estimates of the overall mean and the mean by severity group, and has 1 of the 2 best ranges of predicted values (the 2-part model covers the widest range). OLS was the poorest at predicting the median and had the lowest shrinkage factor, suggesting it would be the most likely to overpredict results in other studies applying the mapping algorithm. The response-mapping model gave reasonable estimates of the mean and median, but the poorest MAE among severity groups. All models failed to predict anyone in perfect health, underpredicted the top of the EQ-5D scale, and overpredicted the bottom end of the scale. However, the overprediction at the lower end of the scale is perhaps unsurprising given that few responders in the FACT-G data set reported severe problems. A mean ranking of models among the different model performance statistics shows OLS to give the best predictions (mean ranking = 2.1), followed by tobit (mean = 2.4), with 2-part models and response mapping giving the poorest predictions (mean = 3.6, mean = 3.5, respectively). Table 4 presents the regression coefficients for the best-fitting model (OLS FACT-G significant items).

Best-fitting models: QLQ-C30
Seven model specifications (models 2 to 8; model 1 was excluded because the QLQ-C30 does not have an overall total score) are fitted for the QLQ-C30. Table 5 and Figure 2 present the predicted EQ-5D values for the best-fitting models for each estimation technique alongside model performance statistics for the QLQ-C30. As with FACT-G, the item-level models give the best model predictions for OLS and tobit models (model 8, including items and sociodemographic characteristics). These models are best at predicting the overall mean EQ-5D value. Item-level models with sociodemographic characteristics give the best model performance for 2-part models. The 2-part model resulted in a more accurate prediction of the median than OLS, tobit, splining, and response mapping. The splining model (model 3) has the shrinkage coefficient closest to 1. The best-performing response-mapping model includes all domains with age and gender for some of the dimensions, and this model has the lowest MAEs on average. None of the models predicts the full range of observed EQ-5D values, with no predictions at the best or worst EQ-5D values. The mean ranking indicates that response mapping was the best-performing model (mean rank = 2.4), with OLS and tobit also performing well (mean = 2.6, mean = 2.8, respectively) and splining giving the poorest overall performance (mean = 3.7). Table 6 presents coefficients for the best-fitting model (model 8 response-mapping model).

DISCUSSION
This article reports mapping functions from 2 widely used cancer-specific HRQL measures, the EORTC QLQ-C30 and the FACT-G, to the EQ-5D. Generally, OLS and tobit models perform well for both EORTC QLQ-C30 and FACT-G. However, the best-performing model for the EORTC QLQ-C30 was response mapping, whereas OLS gave the best estimates for FACT-G with response-mapping producing poor predictions. The advantage of response mapping being the best model for EORTC QLQ-C30 is that any EQ-5D-3L tariff can be applied to the model. However, these differences in model type are unlikely to represent general findings because it is expected that the best-performing model specification and type can vary by measures mapped to and from, including different EQ-5D-3L country tariffs, the population, and the data set.
The poor performance of the response-mapping approach in the FACT-G data set may be due to the limited number of responders in poor health. This restricted the response-mapping models we fitted to those including FACT-G overall or domain subscores. The low number of responders in poor health was surprising, because the responders all had stage III or IV cancer and covered a range of different cancers, but it might be due to the FACT-G study asking respondents to fill in a large number of questionnaires. In addition to the ones reported here, responders completed EQ-5D-5L, disease-specific FACT-G modules, and 2 further psychometric questionnaires, making the task quite lengthy and thus potentially biasing the sample toward healthier respondents. The limited range of the FACT-G scores in the estimation data set means that the FACT-G mapping results are not necessarily generalizable to other studies, unless applied to a population in mild to moderate health.
The FACT-G sample size may also have added to the poor performance of response mapping. A larger sample (e.g., 2000 respondents) would give more accurate predictions of those in level 3, because although the percentage of observations for this level might remain the same (e.g., 3%), the number of observations from which estimates could be made would increase, giving more reliable estimates. Further work is needed on sample size recommendations for the more complex models, such as response mapping.
Other common mapping models are CLAD and GLM. Like the tobit model, the CLAD model also deals with the censored nature of the data and produces consistent estimates in the presence of heteroscedasticity and nonnormality, 35,36 but it is a median-based model rather than a mean-based model, and there is some debate regarding its suitability for estimating utility values in economic evaluation. 37,38 Therefore, this model was not fitted here. Generalized linear models (GLMs) were not fitted either because initial GLM models gave similar results to OLS.
A number of mapping functions have been published in the literature for EORTC QLQ-C30. 5,6,[8][9][10][11]14 These published studies did not explore the full range of possible mapping functions, as has been done here. Proskorovsky et al. 7 estimated a mapping function from the same multiple myeloma data set used in this study but, similar to other studies, only reported using OLS. The results here are also based on pooled data from 3 data sets for 3 different types of cancer to produce reliable mapping estimates from a large sample. Pooling 3 types of cancer has increased the generalizability of the results rather than focusing on 1 specific type of cancer. Furthermore, a recent study by Arnold et al. 39 examined the external validity of published mapping functions from QLQ-C30 to EQ-5D and included the response-mapping function presented in this article. They found that our mapping function performed better than other published mapping functions in predicting EQ-5D. 2,5,6,7,10,12,15,16 For FACT-G, only 1 published mapping function exists in which the authors acknowledged that estimates were unreliable. 4

YOUNG AND OTHERS
It is evident from the literature that studies report different model fit and selection criteria, with some focusing on model goodness of fit and others on the predictive ability of the model. Mapping models should be selected based on their predictive ability; 25,40 however, within this, there are still a number of criteria from which a model can be selected, and different choices can result in alternative models being selected. For example, if the accuracy in predicting EQ-5D values based on mean values by severity groups was chosen as the key criterion, then the tobit model would be preferred to the OLS model in the FACT-G models. If focus was on the overall mean for the QLQ-C30 models, then the OLS models would be preferred. In this article, we have given equal weighting to all model-fitting criteria and have used this to generate a ranking of each model. The ranking criteria used here do not take account of the magnitude of the performance statistic and how accurate these are, and further research is needed to explore whether it is possible to account for this when selecting models and to produce more detailed guidelines on selecting appropriate mapping models. The OLS models perform reasonably well in predicting EQ-5D values from both cancer-specific measures, and response mapping performs the best for the QLQ-C30 data set. We recommend that both types of models are considered for future mapping studies, but note that the response mapping is likely to require a broad spectrum of EQ-5D responses to produce a reliable mapping function and potentially a large number of responses. Both preferred models presented here could be used to predict EQ-5D values in studies that include similar patients; however, the generalizability of the FACT-G mapping function is limited to predictions for respondents in mild and moderate health states.
We also recommend transparency in reporting the criteria that are used to select mapping functions that are recommended for use and whether equal weighting is used.