Item usage in a multidimensional computerized adaptive test (MCAT) measuring health-related quality of life

Purpose Examining item usage is an important step in evaluating the performance of a computerized adaptive test (CAT). We study item usage for a newly developed multidimensional CAT which draws items from three PROMIS domains, as well as a disease-specific one. Methods The multidimensional item bank used in the current study contained 194 items from four domains: the PROMIS domains fatigue, physical function, and ability to participate in social roles and activities, and a disease-specific domain (the COPD-SIB). The item bank was calibrated using the multidimensional graded response model and data of 795 patients with chronic obstructive pulmonary disease. To evaluate the item usage rates of all individual items in our item bank, CAT simulations were performed on responses generated based on a multivariate uniform distribution. The outcome variables included active bank size and item overuse (usage rate larger than the expected item usage rate). Results For average θ-values, the overall active bank size was 9–10%; this number quickly increased as θ-values became more extreme. For values of −2 and +2, the overall active bank size equaled 39–40%. There was 78% overlap between overused items and active bank size for average θ-values. For more extreme θ-values, the overused items made up a much smaller part of the active bank size: here the overlap was only 35%. Conclusions Our results strengthen the claim that relatively short item banks may suffice when using polytomous items (and no content constraints/exposure control mechanisms), especially when using MCAT. Electronic supplementary material The online version of this article (doi:10.1007/s11136-017-1624-3) contains supplementary material, which is available to authorized users.


Introduction
In the last decade, computerized adaptive tests (CATs) [1] based on item response theory (IRT) [2] have become increasingly popular in health measurement. A CAT can be seen as a questionnaire that is tailored to the test-taker on the fly: it continuously updates the estimate(s) of the position on the construct of interest (latent trait) based on answers given by the test-taker to the questions (items) posed. The underlying algorithm then selects the item that is most informative at that particular moment, given the current estimate of the latent trait value. It is clear why CATs appeal to healthcare professionals (HCPs): by selecting only those items that contribute most to the reliable measurement of a patient's latent trait value, measurement efficiency is increased, which results in a substantial decrease in response burden [3]. Furthermore, CAT estimates can be used to generate automatic reports instantly, providing the HCP with all necessary information (latent trait estimate, standard error, norms, and graphic display) to facilitate communication with the patient. These properties make CATs excellent candidates for monitoring patients' physical and mental health routinely, be it on a monthly or daily basis.
CATs draw their items from item banks: large collections of items that have been calibrated with an IRT model using a large sample representative of the target population. The quality of the CAT and the latent trait estimate it generates depend to a large degree on the quality of the item bank. A psychometrically sound item bank contains items with location parameters that cover the whole range of relevant latent trait values, while having adequate to high discrimination parameters. A CAT drawing items from such an item bank will result in efficient measurement for all patients (irrespective of their latent trait score). Most CATs currently used for health measurement are based on item banks that were calibrated using unidimensional IRT models (e.g., [4][5][6][7]). Although less frequently used, multidimensional IRT models are available as well, and can be used to support multidimensional CAT (MCAT) (e.g., [8][9][10]). It has been shown that test length can be further reduced by taking the correlation among constructs into account during item selection and latent trait estimation, while maintaining adequate levels of measurement precision [11,12]. Perhaps equally important, patients often experience quality-of-life (QoL) domains as interdependent; taking this into account allows a closer alignment between psychometric modeling and patient perspective.
Since health-related quality of life (HRQL) has taken a central role in the evaluation of treatment interventions in patients with chronic obstructive pulmonary disease (COPD), we recently developed an MCAT to measure HRQL in this patient group [13]. Following the steps outlined by Paap et al. [14], we first established which domains of HRQL are most important to patients with COPD, using relevant literature (articles and existing questionnaires), as well as interviews with patients and HCPs [14,15]. Based on these findings, three generic domains/item banks from the PROMIS (Patient-Reported Outcomes Measurement Information System) framework were selected (fatigue, physical function, and ability to participate in social roles and activities) and a new COPD-specific domain/item bank (COPD-SIB) was developed [16]. This approach ensures comparability with other patient groups (generic domains), while providing additional sensitivity for measuring change within the specific patient group (disease-specific domain). In this paper, we aim to evaluate an important performance measure for our CAT: item usage.
Due to the adaptive nature of a CAT, it can be expected that certain items are used more frequently than others. Successive items are typically chosen to optimize an objective function [17], such as the Fisher information function. Highly discriminating items, polytomous items covering a wide range of the latent trait (denoted θ), and items targeting average θ-values have a higher chance of being selected, all else being equal. If items are selected more frequently than could be expected based on chance or a predefined threshold, these items are typically referred to as being overexposed. Conversely, items selected less frequently than could be expected are referred to as underexposed. The terms item exposure and item usage seem to be used interchangeably in the literature. In the context of educational testing, item overexposure is seen as a threat to test security (examinees may be able to remember and share items with others) and receives a lot of attention in the literature (see, e.g., [18,20,21]); in health measurement, items do not need to be kept secret and therefore item exposure has received less attention [22]. Nevertheless, item usage is an important outcome measure in evaluating CAT and item bank performance. Variability in item usage rates indicates that the CAT is working as intended (if the items were selected at random, the item usage rate would be expected to be equal for all items). However, if a number of items are not used at all, or very rarely, the "real" (active) size of the item bank is smaller than it was designed to be. The main aim of the current study is to evaluate item usage for a newly developed MCAT which draws items from the PROMIS domains fatigue, physical function, and ability to participate in social roles and activities, as well as the COPD-SIB. We will report on both active bank size and item overuse/overexposure.

Multidimensional item bank
Adams et al. [23] divide multidimensional IRT models into two subclasses: within-item and between-item multidimensional models. Within-item multidimensional models allow items to relate to more than one latent dimension. When between-item multidimensional models are used, the restriction is imposed that the items relate to one dimension only; multidimensionality is expressed through the correlations among the latent dimensions (these are estimated jointly with the item parameters and latent trait values). In this study, we chose to use a between-item multidimensional model, since such models are useful when multiple distinct latent dimensions are measured and relatively high correlations are expected. The multidimensional item bank used in the current study contained 194 items from four domains: the PROMIS domains fatigue (example item: "To what degree did you have to push yourself to get things done because of your fatigue?"), physical function (example item: "Are you able to climb up five steps?"), and ability to participate in social roles and activities (example item: "I have trouble doing all of the activities with friends that are really important to me") [25,26]; and the COPD-SIB (example item: "It frustrated me that I couldn't do everything I wanted to do anymore") [16]. The PROMIS ability to participate in social roles and activities item bank was used in its entirety (35 items). We included a subset of the other two PROMIS item banks: we selected 50 fatigue and 63 physical function items. Item selection was performed by JP, who has ample experience with COPD patients and COPD research, and reviewed by an international colleague of JP's with comparable experience.
The COPD-SIB contains 46 items: both newly written items, and (adapted versions of) items from the SGRQ-C, the Quality of Life for Respiratory Illness Questionnaire (QoL-RIQ), the COPD Assessment Test, the Maugeri Respiratory Failure Questionnaire Reduced Form (MRF26), and the VQ11 [27][28][29][30]. In our application, a higher latent trait score indicated better HRQL for all domains.

Test design
Multidimensional calibrations are not currently available for the PROMIS general population sample, and therefore the PROMIS calibrations cannot be used in the current study. In order to facilitate multidimensional calibration, our test design needed to be constructed in a way that would allow for item parameter estimation as well as estimation of the covariance structure among the domains. We used a booklet design, whereby the total number of items was distributed among three booklets each containing around 100 items. The booklets were linked using ten anchor items per domain (this type of linking is also known as alternate form equating or common-item equating). Each booklet contained items pertaining to at least two domains.

Calibration sample
The following inclusion criteria were used: a medical diagnosis of COPD; sufficient oral and written mastery of the Dutch language; and being able to complete a questionnaire. HCPs (pulmonologists, general practitioners, physiotherapists, and nurse practitioners) were recruited by JP, through his professional network. HCPs distributed the questionnaires accompanied by an information letter among COPD patients attending their clinics from October 2014 through December 2015. Of the 1500 printed booklets, 795 were returned by the end of December 2015. Our sample had a mean age of 67.2 years (SD = 10.08), and consisted of 52.7% men. More detailed patient characteristics are reported in Supplement 1.

Data preparation
All items in the item bank were scored on a 5-point Likert scale ranging from 0 to 4. In total, 10 different types of answer categories were used (depending on the domain and item formulation), for example, "without any difficulty, with a little difficulty, with some difficulty, with much difficulty, unable to do" or "never, rarely, sometimes, usually, always". Twenty-eight percent of the items (55 out of 194) showed low endorsement (fewer than 10 responses) for one or more of their categories. Following Paap et al. [16], these categories were merged with adjacent categories. Among these 55 items, 18 pertained to the fatigue domain, 23 to physical function, 2 to ability to participate in social roles and activities, and 12 to the COPD-SIB. For the majority of these items (51), the lowest two or highest two categories were collapsed. In the other cases, either the lowest or highest three categories were collapsed, or both the lowest two and the highest two. Note that items having different numbers of response categories due to merging does not constitute a problem for the IRT model used (multidimensional GRM).
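The category merging described above can be illustrated with a short Python sketch. This is not the procedure used by the authors: the function name `collapse_sparse_categories`, the rule for choosing which neighbour to merge with (the next higher category, or the next lower one for the top category), and the toy response counts are all assumptions made for illustration. Only the threshold of 10 responses comes from the text.

```python
import numpy as np

def collapse_sparse_categories(responses, min_count=10):
    """Merge categories with fewer than min_count responses into a neighbour,
    then rescore the item to consecutive integers starting at 0.

    Assumed rule (hypothetical): a sparse category is merged with the next
    higher category, or with the next lower one if it is the top category.
    """
    responses = np.asarray(responses)
    cats = sorted(np.unique(responses).tolist())
    groups = [[c] for c in cats]                       # one group per new category
    gcount = [int(np.sum(responses == c)) for c in cats]
    i = 0
    while len(groups) > 1 and i < len(groups):
        if gcount[i] < min_count:
            j = i + 1 if i + 1 < len(groups) else i - 1
            lo, hi = min(i, j), max(i, j)
            groups[lo] += groups[hi]                   # merge the two groups
            gcount[lo] += gcount[hi]
            del groups[hi], gcount[hi]
            i = lo                                     # re-check the merged group
        else:
            i += 1
    mapping = {c: new for new, grp in enumerate(groups) for c in grp}
    recoded = np.array([mapping[r] for r in responses.tolist()])
    return recoded, mapping

# Toy item: categories 0 and 4 have fewer than 10 responses each.
toy = np.repeat([0, 1, 2, 3, 4], [3, 40, 50, 30, 5])
recoded, mapping = collapse_sparse_categories(toy)
```

In this toy case both the lowest two and the highest two categories end up merged, so the item is left with three categories. A graded response model handles this without modification because each item carries its own number of step parameters.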

Multidimensional IRT calibration
The multidimensional graded response model was used to obtain item parameter estimates and estimates of the covariance structure.
The probability of a response in category $j$ of item $i$ with $m$ total response categories (scored $0, \dots, m-1$), $P(X_{ij} = 1 \mid \boldsymbol{\theta})$, is given by

$$
P(X_{ij} = 1 \mid \boldsymbol{\theta}) =
\begin{cases}
1 - \Psi(x_{i1}) & \text{if } j = 0,\\
\Psi(x_{ij}) - \Psi(x_{i(j+1)}) & \text{if } 0 < j < m - 1,\\
\Psi(x_{i(m-1)}) & \text{if } j = m - 1,
\end{cases}
$$

where $\Psi(x) = \exp(x) / (1 + \exp(x))$, $x_{ij} = \mathbf{a}_i'\boldsymbol{\theta} - b_{ij}$, and $\mathbf{a}'\boldsymbol{\theta}$ denotes the dot product of the vector of discrimination parameters and latent traits. To ensure that the probabilities are always positive, response categories must be sorted by difficulty, that is, $b_{i1} < b_{i2} < \dots < b_{i(m-1)}$. Up to five parameters were calculated for each item $i$: one discrimination parameter (denoted $a_i$) and several $b_{ij}$ parameters; the number of $b_{ij}$ parameters equals the number of categories minus one. The $b_{ij}$ parameter is related to the difficulty with which a respondent will reach the $j$th step of each item. Note that in unidimensional IRT, two types of parametrization can be used for $x$: $a(\theta - b)$ or $a\theta - b$. In multidimensional IRT, $\mathbf{a}'$ is a vector containing an $a$ value for each dimension; here, only the $\mathbf{a}'\boldsymbol{\theta} - b$ parametrization can be used. Some software packages, such as IRTPRO, calculate "easiness" rather than "difficulty" parameters. In IRTPRO, this parameter is denoted as $c$. The $b_{ij}$ parameter described above equals the negative value of the $c$-parameter. The estimates of the item parameters and covariance structure were obtained using the software package IRTPRO [31]. A multivariate normal distribution was assumed for the four latent traits, with variances fixed to 1 and the covariances being estimated freely. The estimated correlation matrix among the four domains, $\Phi$, has rows and columns representing fatigue, physical function, ability to participate in social roles and activities, and the COPD-SIB, respectively. The item parameters are presented in Supplement 2. The discrimination parameters were relatively high for all domains (range: 0.82–5.40), which is quite common for clinical measures [32], and the $b_{ij}$ parameters showed a good spread (range: −7.57 to 7.67). Measurement precision for θ-estimates was excellent.
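To make the model concrete, the following Python sketch computes category probabilities for a single item under a graded response model with the slope-intercept form described above (discrimination vector times latent trait vector, minus a step parameter). It is an illustration under stated assumptions, not the calibration code used in the study (which relied on IRTPRO): the function name `grm_category_probs` and the example item parameters are hypothetical.

```python
import numpy as np

def grm_category_probs(a, b, theta):
    """Category probabilities for one graded-response item.

    a     : discrimination vector, one entry per dimension (in a between-item
            model, all entries but one are zero)
    b     : strictly increasing step parameters, length = categories - 1
    theta : latent trait vector
    """
    a, b, theta = (np.asarray(v, dtype=float) for v in (a, b, theta))
    if np.any(np.diff(b) <= 0):
        raise ValueError("step parameters must be strictly increasing")
    # Cumulative probabilities of reaching step j = 1..m-1 (logistic link).
    cum = 1.0 / (1.0 + np.exp(-(a @ theta - b)))
    # Pad with 1 (step 0 is always reached) and 0 (no step beyond the last),
    # then difference adjacent bounds to get category probabilities.
    bounds = np.concatenate(([1.0], cum, [0.0]))
    return bounds[:-1] - bounds[1:]

# Hypothetical five-category item loading on the second of four dimensions.
probs = grm_category_probs(a=[0.0, 2.0, 0.0, 0.0],
                           b=[-2.0, -0.5, 0.5, 2.0],
                           theta=[0.0, 0.0, 0.0, 0.0])
```

The ordering check mirrors the requirement that step parameters be sorted: if they were not, some of the differences between adjacent cumulative probabilities would be negative.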

Data generation and CAT simulations
CAT simulations were run with the package ShadowCAT [33] in R [34]. To evaluate the item usage rates of all individual items in our item bank, responses were generated based on 21,000 vectors of pre-specified θ-values: 1000 for every increment of 0.2 on the multidimensional θ-scale between −2 and 2. The maximum a posteriori (MAP) estimator was used in all simulations to estimate θ, at all stages of the CAT. The covariance matrix Φ estimated using the multidimensional GRM was used as a prior. Following Segall [19], item selection was based on the value of the determinant of the posterior information matrix. Diao and Reckase [35] refer to this item selection method as Bayesian Volume Decrease, whereas Yao [36] simply abbreviates it as Volume or Vm. One random item per domain was administered at the start in order to obtain initial θ-estimates to initialize the CAT. The CAT was terminated when the termination rule (threshold standard error of measurement SE(θ) < 0.316) was met for all four domains. Item selection for a particular dimension was terminated when the SE threshold had been met for that dimension.
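The determinant-based selection rule can be sketched in Python as follows. This is a simplified illustration, not the ShadowCAT implementation: the helper names, the two-dimensional toy bank, and the prior correlation of 0.8 are assumptions, and the sketch omits the MAP update of the θ-estimate and the per-dimension stopping rule. Each candidate item's Fisher information matrix is added to the inverse prior covariance plus the information of the items already administered, and the candidate yielding the largest determinant is chosen.

```python
import numpy as np

def grm_item_information(a, b, theta):
    """Fisher information matrix of one graded-response item at theta."""
    a, b, theta = (np.asarray(v, dtype=float) for v in (a, b, theta))
    cum = 1.0 / (1.0 + np.exp(-(a @ theta - b)))   # P(reaching step j)
    bounds = np.concatenate(([1.0], cum, [0.0]))
    probs = bounds[:-1] - bounds[1:]               # category probabilities
    slopes = bounds * (1.0 - bounds)               # derivative of each bound
    scalar = np.sum((slopes[:-1] - slopes[1:]) ** 2 / probs)
    return scalar * np.outer(a, a)

def select_next_item(bank, administered, theta_hat, prior_cov):
    """Index of the unused item maximizing det(posterior information)."""
    base = np.linalg.inv(prior_cov)                # prior contribution
    for i in administered:
        base = base + grm_item_information(*bank[i], theta_hat)
    candidates = [i for i in range(len(bank)) if i not in administered]
    dets = [np.linalg.det(base + grm_item_information(*bank[i], theta_hat))
            for i in candidates]
    return candidates[int(np.argmax(dets))]

# Hypothetical bank: (discrimination vector, step parameters) per item.
bank = [
    ([1.0, 0.0], [-1.0, 0.0, 1.0]),
    ([3.0, 0.0], [-1.0, 0.0, 1.0]),
    ([0.0, 1.5], [-1.0, 0.0, 1.0]),
]
prior_cov = [[1.0, 0.8], [0.8, 1.0]]
theta_hat = [0.0, 0.0]

first = select_next_item(bank, [], theta_hat, prior_cov)
second = select_next_item(bank, [first], theta_hat, prior_cov)
```

In this toy setting the most discriminating item is picked first, after which the rule moves to the other dimension rather than to the weaker item on the same dimension, which is the kind of behaviour that concentrates usage on a small number of highly discriminating items.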

Outcome variables
The outcome variables in this study were overuse and active domain/bank size, all conditional on θ. Each of the outcome variables will be reported by domain as well as across domains (i.e., at item bank level). An item was considered overused when its usage rate was higher than the expected item usage rate, defined as the average test length for a given θ-value divided by the total bank size (194). Active domain/bank size was calculated as total domain or bank size minus items that were never used in the respective domain or overall bank.
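As a worked illustration of these definitions, the Python sketch below computes the expected usage rate, the set of overused items, and the active bank size from a list of simulated tests at one θ-value. The function name `usage_summary` and the toy data (a 10-item bank and four simulated CATs) are hypothetical; in the study the bank contained 194 items and 1000 CATs were simulated per θ-value.

```python
import numpy as np

def usage_summary(administered, bank_size):
    """Summarize item usage for simulated CATs at a single theta value.

    administered : list of tests, each a list of administered item indices
    bank_size    : total number of items in the bank
    """
    counts = np.zeros(bank_size)
    for test in administered:
        for item in set(test):            # an item counts once per test
            counts[item] += 1
    usage_rate = counts / len(administered)
    mean_length = np.mean([len(test) for test in administered])
    expected_rate = mean_length / bank_size
    return {
        "expected_rate": float(expected_rate),
        "overused": np.flatnonzero(usage_rate > expected_rate),
        "active_size": int(np.count_nonzero(counts)),
    }

# Toy example: four simulated CATs drawing from a 10-item bank.
tests = [[0, 1, 2], [0, 1], [0, 3], [0, 1, 4]]
summary = usage_summary(tests, bank_size=10)
```

Here the average test length is 2.5 items, so the expected usage rate is 0.25; items used in more than a quarter of the tests count as overused, and the active bank size is the number of items used at least once.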

Results
The results of the CAT simulations are summarized in Tables 1 and 2, Fig. 1, and Supplement 4. Comparing Table 2 (percentage of overused items) to Table 1 (active bank size) shows that, for the total bank and average θ-values, overused items dominated the active part of the multidimensional item bank; there was 78% overlap between overused items and active bank size. For more extreme θ-values, the overused items made up a much smaller part of the active bank size: here the overlap was only 35%. [...] you answer this question. Getting washed or dressed"), SGRQ13 ("Please, indicate whether the following activity causes shortness of breath. If the weather influences your complaints, assume the weather conditions are favorable, when you answer this question. Walking around the home."), SGRQ26 ("I get afraid or panic when I cannot get my breath."), SGRQ42R1a ("My breathing problems make it difficult to do light gardening, such as weeding."), and SGRQ42R1b ("My breathing problems make it difficult to do things such as dancing, playing golf, or playing bowls."). Some items, such as CSIB13 ("It frustrated me that I couldn't do everything I wanted to do anymore") and SGRQ31 ("Everything seems too much of an effort."), show two peaks, which is typical for polytomous data. Polytomous items have more than one b parameter and thus cover a wider θ-range. A polytomous item can have more than one peak in its item information function, which would translate into more than one peak in the item usage plot. Longer CATs are needed to obtain reliable estimates of very low or high θ-values, which explains why as many as 38 items show relatively high item usage rates for low or high θ-values only.
In Fig. 1, the item step parameters are plotted against the discrimination parameters for each domain. The figure clearly shows that within each domain, the items with the highest discrimination values had the highest item usage rates. These items typically covered a wide range of θ-values.

Discussion
In this study, we evaluated active bank size and item overuse/overexposure in a recently developed MCAT designed to measure HRQL in COPD patients using four correlated domains. Three generic PROMIS domains were used (fatigue, physical function, and ability to participate in social roles and activities) [25,26], as well as a COPD-specific item bank (the COPD-SIB), which was recently developed [16]. We found that, for average latent trait values, the overall active bank size was 9–10%, compared to 39–40% for more extreme latent trait values (−2 and +2). Furthermore, as expected, domains with highly discriminating items were overrepresented in the active part of the multidimensional bank. For average latent trait values, the active part of the bank was almost entirely populated by overused items. In contrast, for more extreme latent trait values, the active part of the multidimensional bank was dominated by underused items. The number of items that showed good item usage and covered almost the entire latent trait range varied between 2 (physical function) and 5 (COPD-SIB) per domain. We used a multidimensional item bank consisting of 194 items (35–63 items per domain). Given that we developed an MCAT without content constraints and with no exposure control, our results indicate that the MCAT was working as intended: for average latent trait values, a small number of highly discriminating items was selected; for more extreme values, the item bank usage was more balanced. However, our results also showed that a relatively large part of the multidimensional item bank was never used (60%). The active part of the bank consisted of 77 items at most, across the four domains.
This may indicate that, if these findings can be generalized, roughly 19 polytomous items per domain (77 active items spread over four domains) might suffice when developing a multidimensional bank populated by items with high discrimination parameters that adequately cover the latent trait range of interest, and with high correlations among domains. Research focusing on unidimensional CATs has shown that CATs based on polytomous rather than dichotomous items can be performed with substantially smaller item banks; an item bank of 30 items may be sufficient for polytomously scored health outcomes [38,39]. Our results suggest that MCAT potentially requires smaller item banks than unidimensional CAT (UCAT). This warrants further study.
Item usage has received little attention in the field of clinical (psychological/health) measurement so far. One exception concerns developing IRT/CAT-based short forms. Several authors have suggested that CAT simulations can be used to select the most appropriate items for inclusion in a short form [40][41][42]. In these studies, typically the entire item pool is administered, after which the rank order in which the items were administered is calculated and averaged over all simulees. The "best" items (items with the lowest average CAT presentation ranks) would then be selected for the short form [41]. Items for the newest PROMIS short forms were selected based on the maximum interval information and CAT simulations (highest average administration rank) [43], making their measures easily accessible in situations where CAT may not be feasible. Because static short forms are typically targeted at a relatively wide latent trait range, they are relatively long compared to CATs, especially for respondents with average latent trait values. Furthermore, although a short form may achieve adequate measurement precision for average to moderately high latent trait scores, CATs provide much better precision at the extremes [41,42]. Our results showed that active bank size and the rate of overused items also depended on latent trait values. In other words, which items are the "best" items (in terms of administration rank/usage) depends largely on the respondent's latent trait values. This cannot be satisfactorily addressed in a short form.
Another topic that has received little attention in our field is the influence of capitalization on item calibration error. Since the item selection criterion most frequently used is a direct function of the discrimination parameter, item selection is sensitive to large standard errors of discrimination parameters [44,45]. Typically, extreme discrimination parameter estimates tend to be associated with larger standard errors [46]. Furthermore, the smaller the selection ratio (CAT length divided by total item bank size), the larger the danger of capitalization on chance [47]. Capitalization on item calibration error may lead to overestimation of test information and underestimation of the standard errors of latent trait estimates [46]. In this light, having a small set of items with very high item usage rates (and a large set not being used at all) may be worrying, regardless of the issue of test security. In this study, we did find a strong correlation (0.82) between estimated discrimination parameters and their respective standard errors. However, penalizing items with the highest discrimination parameter estimates (for example, by decreasing the estimates by 1 or 2 times their corresponding standard error) would have had a negligible effect on their ranking (data not shown). That said, had we penalized items with relatively high standard errors during the CATs, test length would most likely have been somewhat longer, and subsequently the active size of the item bank would also have been larger. Since estimates are typically (also in our case; data not shown) more precise when using a multidimensional rather than a unidimensional IRT model to calibrate the items, the impact of item calibration error can be expected to be smaller than if we had used separate unidimensional CATs. Research investigating the potential protective effect of multidimensional IRT and CAT on the consequences of capitalization on item calibration error is needed.

Conclusion
With this study, we extended the literature on item usage rates to multidimensional health measurement. We showed what happens when realistic CAT settings (typical for health measurement) are used: a relatively small number of highly discriminating items is selected. Currently, PROMIS item banks differ widely in length. Our results strengthen the claim that relatively short item banks may suffice when using polytomous items (and no content constraints/exposure control mechanisms), especially when using MCAT. This may be particularly relevant to item bank developers. However, if researchers or clinicians want to be able to influence the content (to ensure validity), different item selection procedures are necessary; in such instances, a larger item bank will be needed.