Estimating a constant WTP for a QALY—a mission impossible?

Economic evaluations are an important input to decision-making and priority-setting in the health care sector. Measuring preferences for health improvements, as the demand-side value (willingness to pay) of gaining a quality-adjusted life year (QALY), is one relevant component in the interpretation of the results from health economic evaluations. Our article addresses whether willingness to pay for a QALY (WTP-Q) is sensitive to the size of the health differences and the probability for improvement. We use data from a contingent valuation survey based on 1400 respondents conducted in the spring of 2014. The results show that the expectation of sensitivity to scope, or higher WTP to the larger expected quality of life improvement, is not supported. We find WTP-Q values that conform reasonably well to previous studies in Sweden.


Introduction
Total expenditure on health as a percentage of the gross domestic product in Sweden was about 11% in 2015, which was higher than the OECD average of 8.9% [1]. Although the share of health care expenditures in the GDP has increased in recent decades, prioritization between medical treatments and public health interventions has perhaps never been examined more carefully. Economic evaluations are an important input to decision-making and priority-setting in this sector, and measuring preferences for health improvements is one component in the interpretation of the results from health economic evaluations. The most commonly used tool for economic evaluation in health is cost-effectiveness analysis (CEA), where the incremental cost of an intervention is related to the incremental health effect (where the intervention is compared to the most relevant intervention alternative). The preferred/chosen metric of health benefits for a CEA is commonly quality-adjusted life years (QALYs).
Considering that costs and health benefits are measured in different units, the result can never in itself indicate whether an intervention improves welfare and/or population health. Two decision rules have been suggested when using CEA with QALYs (or similar) as an outcome measure [2]: choose interventions in ascending order of cost per QALY until the budget is exhausted, or select interventions with a cost per QALY less than or equal to a specified threshold value (V). The second decision rule ("threshold approach") is usually what policy makers rely on, given that decisions are typically made sequentially in time. The decision rule can be written as: if ΔCost∕ΔQALYs < V → the intervention is cost-effective, i.e., the incremental cost-effectiveness ratio (ICER) is below the threshold value. In policy contexts where the analyst only cares about health care sector-related costs and benefits and assumes a fixed budget, the estimate of V should be based on the value (cost per QALY) of the services displaced by implementing cost-increasing interventions. If the analyst considers a broader societal perspective and/or assumes a non-fixed budget, the estimate of V should be based on (or adjusted for) the consumption value of a QALY.
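The threshold rule can be illustrated with a minimal sketch. The function names and all numbers are hypothetical and chosen only for illustration; the sketch handles only interventions with a positive QALY gain.

```python
def icer(delta_cost, delta_qalys):
    """Incremental cost-effectiveness ratio: extra cost per extra QALY."""
    if delta_qalys <= 0:
        # Dominated or health-reducing interventions need separate handling.
        raise ValueError("this sketch assumes a positive QALY gain")
    return delta_cost / delta_qalys


def is_cost_effective(delta_cost, delta_qalys, threshold):
    """Threshold approach: cost-effective if the ICER is below V."""
    return icer(delta_cost, delta_qalys) < threshold


# Hypothetical intervention: SEK 150,000 extra cost for 0.5 extra QALYs.
example_icer = icer(150_000, 0.5)  # SEK per QALY
print(example_icer, is_cost_effective(150_000, 0.5, threshold=500_000))
```

The same intervention can pass or fail depending on V alone, which is why the empirical basis for the threshold matters.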
Attempts to assign a monetary metric to the consumption value of a QALY can be conducted by eliciting the willingness to pay (WTP) for a QALY, henceforth referred to as WTP-Q (e.g., [3][4][5][6][7][8][9]). It is well known that the theoretical assumptions required to directly translate QALYs into monetary units, and to treat QALYs as a utility metric, are restrictive [10][11][12]. It has also been shown that under a range of admissible utility functions the WTP-Q will vary with the type of QALY gain [10,13]. The empirical literature likewise indicates that the WTP-Q varies with the type of QALY gain. A review of 24 previously published WTP-Q studies shows that larger QALY changes give lower WTP-Q estimates and therefore that WTP is not proportional to the QALY change [14]. Several studies that have examined sensitivity to scope, i.e., how willingness to pay changes with the amount of QALYs gained, have found the same disproportionality [3,7,8,15].
However, although the theoretical obstacles to estimating one unique WTP per QALY may be insurmountable, there may still be valuable information to extract from studies on individual preferences for health improvements by examining the range of WTP-Q estimates in a population [13,15].
In policy/jurisdictional contexts that evaluate cost-effectiveness of new interventions and medical technologies with respect to the consumption value of a QALY, it is rare to rely on one constant WTP-Q; rather, an interval of what possibly constitutes the relevant value of a QALY is used (e.g., [16]). Empirical estimates are thus important to provide insights into the potential bounds of such an interval of WTP-Q estimates.
This study adds to the previous literature in building knowledge about the estimate of WTP-Q, the variables influencing its size, and testing whether WTP is sensitive to changes in health as well as the level of uncertainty regarding the health improvement. WTP is estimated based on responses to a web-based contingent valuation survey. We test the sensitivity of the WTP to the magnitude of the health change as well as to the probability for health improvement, and all valuation scenarios are framed in a decision context with uncertainty regarding the outcomes. This is standard in the literature on WTP for mortality risk reductions, for example, but most papers in the WTP-Q literature have used scenarios with choice under certainty (although with some recent exceptions [17]).
Specifically, we address the following two research questions: (1) whether WTP increases with the amount of health difference and probability for improvement and (2) whether WTP is approximately proportional to the magnitude of health difference and probability for improvement. We describe the methods and data in Sect. 2, where we also outline our specific hypotheses. Results are presented in Sect. 3, and the paper concludes with a discussion in Sect. 4.

Survey structure
We use data from an internet panel survey conducted in the spring of 2014. The full contingent valuation survey consisted of five sections as well as an introductory note to respondents. The five sections contain the following: (1) respondents' self-reported health status using a visual analog scale (VAS) between 0 and 100 as well as their views on the subjective health status of a number of different described health states, (2) a description of two health states (named A and B) where the respondent was asked about their individual WTP to move from the worse to the better health state, (3) a scenario where respondents were asked to act as social decision-makers "voting" yes or no to introducing/reimbursing a new pharmaceutical that would increase life expectancy at old age for terminally ill patients, (4) questions on respondents' attitudes to different prioritization "rules," e.g., if they agreed to different normative statements about how resources should be allocated, and (5) socioeconomics and demographics. Sections 3 and 4 are not analyzed in this article.
The survey consisted of 21 questions in total and was approved by the regional ethical vetting board. We tested the survey in small focus groups and subsequently in a pilot survey with approximately 200 respondents. As a result of the pilot survey, we modified the cost levels upwards and clarified the text further for specific sections (mainly by using extra bold type text to stress the importance of certain keywords).

Scenario design
In this article we focus on the individual WTP for an improved health state (Sect. 2 of the survey). A general description of the WTP scenarios took the following form (the valuation scenario and WTP question are also presented in the Appendix). First, we asked the respondents to consider a possible treatment that is able to improve a specific health state. The treatment does not cause any pain, has no side effects, and is not subsidized by society. We reminded the respondents about their opportunity cost, i.e., if they chose to pay for the treatment they would have less money for food, travel, entertainment, clothes, etc. (a "cheap talk" script to reduce the risk of hypothetical bias). It was also assumed that the national insurance compensates for all possible health care costs and loss of income due to sickness, so that the responses would not be biased by perceived income changes (we want to estimate the "pure" value of health).
Second, the respondents were asked to assume that their health state today was equal to a specific EQ-5D state (health state A). Then, they were told that there is a 1% probability of health improvement by natural causes that would result in a better EQ-5D state (health state B). If they chose to pay for the treatment, the probability of attaining the better health state would increase to 5%. Third, the respondents were asked if they were willing to pay SEK 20/200/500/1500/3000 per month over the next year for this treatment (yes/no). Finally, we included a certainty calibration question about how confident the respondents were about their response to the WTP question. The health states were chosen and paired to represent an expected good spread of QALY differences according to the UK EuroQoL tariffs [19]. We establish a 'small,' 'medium,' and 'large' health difference (Table 1). All have the same final health state, but different initial states. Since we did not have any Swedish EuroQoL scoring function at the time, we assumed that the UK values would represent the best approximation. In 2014, Burström et al. [20] published a Swedish value set for EQ-5D health states, and we compare the results in this respect as well. Further, from the results in Sect. 1 of the survey, we have the respondents' self-assessed EQ-5D VAS tariffs for health states A and B, which we also analyse.
The three health differences are paired such that five alternatives are established (Table 1). Three scenarios represent a value set where the probability for a better health state is increased by 4 percentage points. Two scenarios represent a set to enable a distinct scope test regarding the uncertainty level, since the differences in health are the same as for scenarios 'small' and 'large,' but the probability for improvement is increased to 40%. The expected quality of life (QoL) difference is calculated as the health difference multiplied by the probability difference. Each respondent was randomly presented with one scenario and answered only one valuation question.
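The expected QoL difference can be sketched as a one-line calculation. The health differences below are hypothetical placeholders, not the Table 1 values; the probability differences of 0.04 and 0.40 follow the risk reductions used in the estimation.

```python
import math


def expected_qol_diff(health_diff, prob_diff):
    """Expected QoL gain: tariff difference (B minus A) times probability change."""
    return health_diff * prob_diff


# Hypothetical tariff differences between health states A and B.
health_diffs = {"small": 0.10, "large": 0.50}

for name, diff in health_diffs.items():
    base = expected_qol_diff(diff, 0.04)   # base scenario
    scope = expected_qol_diff(diff, 0.40)  # scope scenario, same health states
    print(name, round(base, 4), round(scope, 4), round(scope / base, 1))
```

Under these assumptions each scope scenario has ten times the expected QoL difference of its base scenario, which is what the strong scope test relies on.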
The scenarios represent different baseline levels and magnitude differences between QALY scores. If we assume that both propositions hold (weak and strong scope sensitivity), we would expect WTP to be approximately proportional to the quality of life difference. Practically, this implies three hypotheses. Hypothesis (1) is based on proposition 1 of weak scope sensitivity, i.e., we expect WTP for a 'small' quality of life difference to be smaller than for a 'medium' quality of life difference, etc. (get more, pay more). Hypotheses (2) and (3) are based on the proposition of strong scope sensitivity, i.e., we expect WTP for a 'small'/'large' quality of life difference to be ten times smaller than for their scope alternatives. We expect WTP to be proportionally sensitive to the quality of life differences among the first three scenarios as well, i.e., ~3× WTP ('small') = WTP ('medium').
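One possible formalization of the three hypotheses, reconstructed here from the verbal description above rather than quoted from the authors, is:

```latex
\begin{align}
H_1 &: \mathrm{WTP}_{\text{small}} < \mathrm{WTP}_{\text{medium}} < \mathrm{WTP}_{\text{large}} \\
H_2 &: \mathrm{WTP}_{\text{small scope}} \approx 10 \times \mathrm{WTP}_{\text{small}} \\
H_3 &: \mathrm{WTP}_{\text{large scope}} \approx 10 \times \mathrm{WTP}_{\text{large}}
\end{align}
```

Here $H_1$ expresses weak scope sensitivity (get more, pay more), while $H_2$ and $H_3$ express strong scope sensitivity, with the factor of ten reflecting the tenfold difference in the probability change between a scenario and its scope alternative.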

Data
Respondents were 1400 members of a web panel consisting of Swedish citizens older than 17 years. They were randomly recruited to the panel by phone and had to be internet users. The on-line survey was conducted in the spring of 2014 and was carried out by the Scandinfo company. The respondents were assigned at random to one of the scenarios described in Table 1. In Table 2, we show the summary statistics for the full sample as well as for the five individual scenarios. The self-reported health is based on the individual responses to the EQ-5D descriptive system. There are no statistically significant differences across the scenarios for any of the observable variables. Compared to national statistics, the sample has a higher share of individuals with university education of 3 years or more (33% compared to 20% in the population), a higher disposable monthly income (approximately SEK 32,000 compared to SEK 24,000), and a lower share of unemployed individuals (4.5% compared to 5.7%) [22][23][24]. The share of females and mean age correspond to the national sample.
We exclude some respondents for the following reasons: respondents indicating an implausibly high number of children in the household (n = 36) and young respondents with an implausibly high income or education level (n = 22). In Sect. 3.2, we also exclude respondents based on other inconsistencies as part of a sensitivity analysis.

Proportions, WTP-Q, WTP, and determinants of WTP
The proportions of yes responses (Table 3) are decreasing for all samples and individual scenarios as the cost rises, although not monotonically for all. Notably, the scope scenarios still have a high proportion of yes responders (above 40%) at the highest bid level. We may also cross-compare the proportions between scope samples and their equivalent. Using a 95% confidence interval, there is no significant difference in proportions.
We estimate WTP-Q for each choice scenario by using a no-constant binomial logit model (Eq. 1), where the dependent variable was set to 1 if the respondents answered 'yes' to the WTP question and 0 if the answer was 'no' [13]. The cost variable is multiplied by 12 to account for WTP on an annual basis. The value difference between health states A and B is multiplied by the risk reduction (4 or 40%) to calculate the effective quality of life differences (QoL_diff).
\[ y = \beta_1\,\mathrm{bid} + \beta_2\,\mathrm{QoL\_diff} \tag{1} \]
Full sample estimates indicate a WTP-Q of SEK 167,947-373,979 including the scope sample (€17,100-€38,100) and SEK 749,189-1,153,523 excluding the scope sample (€76,400-€117,700), depending on the tariff used (Table 4). Scenario-specific estimates of WTP-Q range from SEK 104,091 to SEK 10,716,153 (€10,600-€1,093,500), clearly implying that the hypothesis of a constant WTP-Q value is not met. We can see that the WTP-Q estimate for the scenario 'small' is significantly higher than the rest of the scenarios and the estimate for the scenario 'large scope' is significantly lower, and the estimate for 'small' is significantly higher than for 'small scope' (UK and Swedish tariffs). For the Swedish tariffs, WTP-Q for the scenario 'small scope' is significantly higher than WTP-Q for the scenario 'large.' WTP-Q for the scenario 'large scope' is significantly lower than WTP-Q for the scenario 'large' for all tariffs. Estimates for the other VAS scenarios are not significantly separable, but the confidence intervals are very wide. Compared to the propositions (Sect. 2.2), estimated WTP-Q values are dependent on the quality of life differences, both the health difference itself and the probability for improvement. If we present the isolated estimated WTP values for each scenario (QoL_diff × WTP-Q), we can see that the result implies that estimated WTP is lower for scenarios 'small' and 'large' than for their scope equivalents (Table 5).
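A minimal sketch of how a WTP-Q value can be backed out of a no-constant logit is given below. It assumes the standard result that a respondent is indifferent when the index is zero, so WTP per unit of QoL_diff is the ratio -β2/β1; this is an assumption of the sketch, not a formula quoted from the article. The data are simulated, and the true coefficients, QoL levels, and the helper `fit_logit_no_constant` are all hypothetical.

```python
import math
import random


def fit_logit_no_constant(bids, qol_diffs, yes):
    """Newton-Raphson for P(yes) = logistic(b1*bid + b2*qol_diff), no intercept."""
    b1 = b2 = 0.0
    for _ in range(50):
        g1 = g2 = h11 = h12 = h22 = 0.0
        for x1, x2, y in zip(bids, qol_diffs, yes):
            p = 1.0 / (1.0 + math.exp(-(b1 * x1 + b2 * x2)))
            w = p * (1.0 - p)
            g1 += (y - p) * x1          # score vector
            g2 += (y - p) * x2
            h11 += w * x1 * x1          # information matrix X'WX
            h12 += w * x1 * x2
            h22 += w * x2 * x2
        det = h11 * h22 - h12 * h12
        d1 = (h22 * g1 - h12 * g2) / det  # Newton step via 2x2 inverse
        d2 = (h11 * g2 - h12 * g1) / det
        b1, b2 = b1 + d1, b2 + d2
        if abs(d1) + abs(d2) < 1e-10:
            break
    return b1, b2


random.seed(0)
TRUE_B1, TRUE_B2 = -0.1, 30.0            # hypothetical; implies WTP-Q of 300
MONTHLY_BIDS = [20, 200, 500, 1500, 3000]  # SEK per month, as in the survey
QOL_LEVELS = [0.01, 0.03, 0.06, 0.10]      # hypothetical expected QoL differences

bids, qols, yes = [], [], []
for _ in range(8000):
    bid = 12 * random.choice(MONTHLY_BIDS) / 1000.0  # annual bid, thousand SEK
    q = random.choice(QOL_LEVELS)
    p_yes = 1.0 / (1.0 + math.exp(-(TRUE_B1 * bid + TRUE_B2 * q)))
    bids.append(bid)
    qols.append(q)
    yes.append(1 if random.random() < p_yes else 0)

b1, b2 = fit_logit_no_constant(bids, qols, yes)
wtp_q = -b2 / b1  # thousand SEK per unit of expected QoL difference
print(round(b1, 3), round(b2, 2), round(wtp_q, 1))
```

In the simulation WTP is proportional to QoL_diff by construction, so a single ratio describes all scenarios; the scenario-specific estimates reported above indicate that the survey data do not behave this way.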
None of the other scenarios show significantly different WTP estimates between each other. We also calculate mean WTP by using a non-parametric method (Spearman-Karber) and achieved the same results.
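For readers unfamiliar with the non-parametric approach, the following is a minimal sketch of a Spearman-Karber style mean. It assumes that everyone accepts a bid of zero and that nobody accepts a hypothetical upper truncation bid; the yes shares and the truncation point are illustrative and are not taken from Table 3.

```python
import math


def spearman_karber_mean(bids, yes_shares, upper_bid):
    """Mean WTP from yes shares at ascending bids.

    Assumes share = 1 at bid 0 and share = 0 at upper_bid, and requires
    the yes shares to be non-increasing in the bid.
    """
    points = [(0.0, 1.0)] + list(zip(bids, yes_shares)) + [(upper_bid, 0.0)]
    mean = 0.0
    for (b_lo, s_lo), (b_hi, s_hi) in zip(points, points[1:]):
        if s_hi > s_lo:
            raise ValueError("yes shares must be non-increasing in the bid")
        # Probability mass between two bids, placed at the interval midpoint.
        mean += (s_lo - s_hi) * (b_lo + b_hi) / 2.0
    return mean


bids = [20, 200, 500, 1500, 3000]      # SEK per month, as in the survey
shares = [0.8, 0.6, 0.5, 0.45, 0.4]    # hypothetical yes shares
print(spearman_karber_mean(bids, shares, upper_bid=6000))
```

With 40% still saying yes at the highest bid, most of the estimated mean in this example comes from the assumed interval above SEK 3000, so the choice of truncation point is a substantive assumption rather than a technical detail.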
Depending on the values respondents assign to different health states, we expect WTP to differ proportionately (proposition 2). Table 6 shows the quality of life differences and expected WTP differences for the UK tariffs, the Swedish tariffs, and the self-assessed VAS tariffs. We can see that the quality of life differences vary and that the hypotheses of proportional scope sensitivity are somewhat altered depending on which tariff is used. Table 7 presents the estimated WTP (probability of a yes answer to the WTP question) by the samples and sociodemographic variables of interest in a logit regression. In general, we can see that neither gender nor age has a significant effect on WTP. Being unemployed, the income level, and having a university education are significant for some scenarios. As expected, we see a negative effect on the proportion of yes responses as the bid level increases (−11 to −23 percentage points per SEK 10,000).

Sensitivity analysis using certainty calibration and excluding inconsistent respondents
We have tested the results in the previous sections in two different sensitivity analyses to account for uncertainties in individual responses: (1) by using certainty calibration and (2) by excluding inconsistent respondents. Hypothetical bias has been found to be a serious problem in CV data, and incorporating respondent uncertainty can potentially improve the predictive power (e.g., [25,26]). We have used a version of the certainty approach that follows up on the WTP question by letting the respondents assess the degree of uncertainty with three statements: 'definitely sure,' 'probably sure,' and 'uncertain.' Only the 'definitely sure' yes responses were treated as yes responses, while the 'probably sure' yes responses and the 'uncertain' yes responses were treated as no responses. The no responses were left unchanged.
Of the 499 yes responses to the WTP question, 249 (i.e., 50%) stated they were 'definitely sure.' We can see the same decreasing proportions of yes responses after certainty calibration for all samples as the cost rises, as we saw in Table 3, although the proportions are naturally lower (since we convert yes responses to no responses). The estimated WTP-Q values are much lower and are not significantly different for any sample, nor are there any significant differences between any WTP values, i.e., neither weak nor strong scope sensitivity.
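The recoding rule described above can be sketched as follows. The function name and the string labels are assumptions made for illustration.

```python
def calibrate(answer, certainty):
    """Certainty-calibrated response: 1 only for a 'definitely sure' yes.

    'probably sure' and 'uncertain' yes responses are recoded to 0;
    no responses stay 0 regardless of stated certainty.
    """
    if answer == "no":
        return 0
    return 1 if certainty == "definitely sure" else 0


# Hypothetical responses: (answer, stated certainty).
responses = [
    ("yes", "definitely sure"),
    ("yes", "probably sure"),
    ("yes", "uncertain"),
    ("no", "definitely sure"),
]
print([calibrate(a, c) for a, c in responses])
```

Because the recoding can only turn yes into no, calibrated yes proportions are never higher than the raw proportions.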
Based on our subjective assessment of responses, we believe that some respondents gave answers of low quality. They may not have understood the survey, or they may have considered it not to be worthwhile to leave a thoughtful response. Inconsistent respondents were defined as those who: (1) rated the subjective health status of being dead higher than having 'perfect health' (n = 65), (2) rated the subjective health status of being dead higher than 50 on a VAS (n = 113), or (3) rated the better health state lower than the worse health state on a VAS (n = 73). Some respondents met more than one criterion, and in total 208 individual respondents were excluded.
The proportions of yes responses are decreasing for most bid levels. WTP-Q values for the adjusted sample are generally higher for all three tariffs than in Table 4, which indicates that the excluded respondents have lower WTP. The samples show significantly different WTP-Q values, especially between 'large' and 'large scope' (all tariffs). Estimated WTP is significantly higher for the scenario 'large scope' than for scenarios 'small,' 'medium,' and 'large.' The ratios are approximately two, implying weak but not strong scope sensitivity.
The last model in the sensitivity analysis combines certainty calibration with the exclusion of inconsistent respondents. The WTP-Q values are, in general, slightly higher for UK and Swedish tariffs, but lower for the self-assessed VAS tariffs. None of the scenario-specific WTP-Q or estimated WTP values were significantly different from one another.
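The three exclusion criteria can be sketched as simple flags on VAS ratings. The field names and the example respondents are hypothetical; a respondent is dropped if any flag is set, which is why the criterion counts can sum to more than the number of excluded individuals.

```python
def inconsistency_flags(vas_dead, vas_perfect, vas_worse_state, vas_better_state):
    """Flag each of the three inconsistency criteria for one respondent."""
    return {
        "dead_above_perfect": vas_dead > vas_perfect,
        "dead_above_50": vas_dead > 50,
        "better_below_worse": vas_better_state < vas_worse_state,
    }


def is_inconsistent(**ratings):
    """A respondent is excluded if at least one criterion is met."""
    return any(inconsistency_flags(**ratings).values())


# Hypothetical respondents with VAS ratings on a 0-100 scale.
respondents = [
    {"vas_dead": 0, "vas_perfect": 100, "vas_worse_state": 40, "vas_better_state": 70},
    {"vas_dead": 90, "vas_perfect": 80, "vas_worse_state": 40, "vas_better_state": 70},
    {"vas_dead": 10, "vas_perfect": 100, "vas_worse_state": 60, "vas_better_state": 30},
]
kept = [r for r in respondents if not is_inconsistent(**r)]
print(len(respondents) - len(kept), "excluded;", len(kept), "kept")
```

The second hypothetical respondent meets two criteria but is counted once, mirroring the overlap noted in the text.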

Discussion
Our article addressed one general research question: is WTP sensitive to the size of the health differences and the probability for improvement? We also examined what sociodemographic factors are related to the variations in WTP and what the willingness to pay per quality-adjusted life year (WTP-Q) in Sweden is. To answer these questions, we used data from an internet panel contingent valuation survey conducted in the spring of 2014. The results are based on 1400 respondents, who were randomly assigned to different scenarios in which the health differences as well as the probability for improvement were varied.
The survey results show that the estimated WTP-Q ranges from SEK 170,000 (UK tariffs) to SEK 370,000 (self-assessed VAS tariffs). Swedish EQ-5D tariffs result in intermediate WTP-Q values of SEK 280,000. Previous WTP-Q values in Sweden have been estimated to range from SEK 400,000 to SEK 655,000 [27,28]. One study, with a low response rate of ~12%, estimates WTP-Q to be in the wide interval SEK 100,000-1,900,000 [29]. A threshold value of SEK 500,000 has been suggested by government authorities [30].
In an international review of 24 WTP-Q studies, mean estimates amount to approximately SEK 700,000 [14].
However, our analyses looking at the sub-samples and addressing the questions of sensitivity to scope showed that the empirical results do not conform to necessary assumptions of the QALY concept. Our results showed that the prediction of expected utility theory and, more importantly, a standard assumption saying that the more QALYs the better, i.e., that WTP increases with the amount of quality of life improvements (weak scope sensitivity) or 'more is better,' can only be partially supported for the largest improvements (ratios above 10, and not always even then). We see no support for approximate proportionality (strong scope sensitivity), implying that we cannot empirically establish a constant WTP-Q value. The hypothesis set out was that WTP-Q values should be the same, while WTP values should differ depending on the health difference and probability for improvement, i.e., WTP varies with the QALY difference. The result was basically the reverse: WTP for the different QALY changes did not vary in any substantial way, and thus WTP-Q estimates vary substantially. Considering that WTP-Q = WTP/QALY difference and that WTP does not vary with the QALY difference, this implies that larger QALY changes give lower WTP-Q estimates.
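The arithmetic behind the last point can be made concrete with a short sketch. The WTP amount and QALY differences are hypothetical; only the identity WTP-Q = WTP / QALY difference is taken from the text.

```python
import math


def wtp_per_qaly(wtp, qaly_diff):
    """WTP-Q implied by a stated WTP for a given expected QALY difference."""
    return wtp / qaly_diff


constant_wtp = 6000.0  # hypothetical WTP in SEK, identical across scenarios
for qaly_diff in (0.004, 0.04):  # hypothetical tenfold difference in QALY gain
    print(qaly_diff, round(wtp_per_qaly(constant_wtp, qaly_diff)))
```

If WTP does not respond to the QALY difference at all, a tenfold larger QALY gain yields a tenfold lower WTP-Q, so the estimated value per QALY is largely a function of the scenario chosen by the researcher.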
The result points to an inadequate sensitivity in WTP to scope (scope bias), which is often reported in contingent valuation studies and stated preference approaches in general (e.g., [31,32]). Several studies regarding WTP per QALY have found evidence of scope bias [3,4,7,8,15], and a recent meta-analysis estimates a 64% decrease in WTP-Q per unit larger QALY change [14].
The low responsiveness to changes in quality of life (QoL) differences, i.e., to changes in both health and probability for improvement, is problematic given the policy aim of finding the population's consumption value of a QALY. There is a risk that WTP-Q values rest on the specification of the health differences and the probability of improvement, which is often set by a researcher for a specific design. We have specified large and small variations in QoL between the scenarios, and the empirical results show that WTP-Q is far from constant. In our sensitivity analysis, we have used certainty calibration and excluded inconsistent respondents. The results were that WTP-Q values are not statistically significantly different over the scenarios, but neither are the estimated WTP values. The confidence intervals are very wide, which explains this result.
We acknowledge that sensitivity to scope is one test of the validity of CV, but not the only one. We can see that the proportion of yes responses decreases with the cost level and there are some variations in WTP comparing the lowest and highest QoL differences (weak scope sensitivity). If the true QALY model is in fact non-linear, we would not expect the assumptions about expected utility to reflect respondents' preferences [33][34][35]. We also acknowledge that the conventional scope test, i.e., comparing mean values, can hide important relationships and lead to false positives and false negatives [36].
The comparison among the three separate tariffs is another interesting aspect of this study. Most previous studies have used one single tariff or individual self-assessed QoL estimates. We have no reliable information on which tariff most accurately reflects the respondents' preferences and, as we saw earlier (Table 6), the expected ratios between different scenarios are very dependent on this. As the estimated WTP values in scenarios 'small,' 'medium,' and 'large' are not significantly different, this would lead us to consider that self-assessed VAS tariffs are most likely to reflect the invariance in WTP values. In this case, the ratios are between 0.7 and 3, while for the Swedish tariffs the ratios are 1-14. Hence, it is easier to justify that an expected WTP difference of 3 does not arise than an expected difference of 14. In sum, the result that the chosen tariff substantially affects the WTP-Q is not encouraging for attempts to elicit an empirically reliable WTP-Q.
Our examination of determinants of WTP (Table 7) results in the conclusion that the only variable that is significant throughout all of the scenarios is the cost. There are some significant associations for specific scenarios, but the variables differ. Income and university education had a positive significant effect for the full sample, but not for specific scenarios.
Finally, a number of important study limitations should be mentioned. Compared to the Swedish population, the sample had a higher income and share of university level education, which implies that the generalizability can be questioned. However, given that the primary aim was not to elicit policy estimates of WTP-Q, the fact that the different subsamples that we used for comparison did not significantly differ from each other is reassuring. Further, it is well known that it is difficult to intuitively communicate small changes in probabilities. Failure to understand our survey scenario could of course be one argument as to why we fail to find adequate sensitivity to scope.
To conclude, we found that our expectation of sensitivity to scope, or higher WTP with the larger expected QoL improvement, was not supported. We have also documented that the willingness to pay per QALY was substantially affected by the chosen tariff to estimate QoL. Even though a single WTP-Q may not be theoretically or empirically attainable, we believe that a threshold based on state-of-the-art research from several data sources will help to improve efficiency in society. However, this article, as well as the broader literature on this topic, struggles to provide estimates that pass the contingent valuation (or stated preferences) validity tests, including the near proportionality of willingness to pay.

Appendix
Now we want to know how much you think it might be worth it to pay for a treatment that can improve a supposed state of health. Imagine that the treatment does not hurt, does not have any side effects, and is not paid for by the community.
Remember, if you think that the treatment is worth the money it means that you and your family get less money for other things, such as food, travel, entertainment, and clothing. We also assume that the social security system compensates you for any medical expenses and loss of income in case of illness.
<NEW SCREEN> Consider the following scenario: Your health state today can be described as the left below (health state A).
There is a chance of 1 in 100 that your health is improved by natural causes during the coming year. Your health would then be described as the right below (health state B). If you imagine that you are part of a group of 100 people, this means that one of you will get an improved state of health in the coming year. But beforehand, no one knows who is going to get better.
There exists a treatment that increases the chance of achieving better health, does not hurt, and does not have any side effects.
If you have the treatment, your chance of a better level of health is instead 5 in 100.
Note that it is your personal chance that is affected, not the chance for any of the others in the group. They choose for themselves if they want to have the treatment.