Patient-reported outcome measures for cancer caregivers: a systematic review

Purpose Informal caregivers provide invaluable help and support to people with cancer. As treatments extend survival and the potential burdens on carers increase, there is a need to assess the impact of the role. This systematic review identified instruments that measure the impact of caregiving, evaluated their psychometric performance specifically in cancer and appraised their content. Methods A two-stage search strategy was employed to: (1) identify instruments that measure the impact of caregiving, and (2) run individual searches on each measure to identify publications evaluating psychometric performance in the target population. Searches were conducted in MEDLINE, EMBASE, CINAHL and PsycINFO and restricted to English-language instruments and articles. Psychometric performance was evaluated for content and construct validity, internal consistency, test–retest reliability, precision, responsiveness and acceptability. Individual scale items were extracted and systematically categorised into conceptual domains. Results Ten papers were included reporting on the psychometric properties of eight measures. Although construct validity and internal consistency were most frequently evaluated, no study comprehensively evaluated all relevant properties. Few studies met our inclusion criteria so it was not possible to consider the psychometric performance of the measures across a group of studies. Content analysis resulted in 16 domains with 5 overarching themes: lifestyle disruption; well-being; health of the caregiver; managing the situation and relationships. Conclusions Few measures of caregiver impact have been subject to psychometric evaluation in cancer caregivers. Those that have been evaluated do not adequately capture changes in roles and responsibilities within the family and career, indicating the need for a new instrument.
Electronic supplementary material The online version of this article (doi:10.1007/s11136-016-1239-0) contains supplementary material, which is available to authorized users.


Background
Informal caregivers, whether they are spouse, family member or friend, often provide a significant amount of help and support for people with cancer. Informal caregiving is pivotal to the overall outcome of a patient's treatment, and thus, maintaining the health and satisfaction of caregivers is essential to maximise the well-being of both parties [1,2]. Caregiving can undoubtedly place a strain on the caregiver [3], but the role can also provide a source of happiness and boost self-efficacy and a sense of worth [4].
An increasing number of patients are living a longer life with cancer. As such, there is a growing recognition that broader aspects of their lives and those of the family are affected across the disease trajectory [5]. Informal caregivers could be viewed as "second-order patients in their own right" [6]; consequently, a well-validated measure to assess the impact of disease and treatment on their lives and overall well-being is crucial.
There is a raft of measures designed to assess caregiver impact so the choice for researchers may be unclear. The instruments currently used focus on three areas: caregiver burden, caregiver need and quality of life. Some measures are not well validated, and many have been developed for use with caregivers in very different circumstances, for example the elderly with cognitive impairment [7]. In order to better inform researchers on the content and evaluation of commonly used instruments, we identified and evaluated the psychometric performance of measures used in the cancer caregiver population and appraised their content, what is and what is not captured, with particular regard to broader areas of life experience such as the impact on career and family.

Methods
The review involved a two-stage search: (1) to identify generic and cancer-specific self-report instruments used to measure the impact of caregiving on informal caregivers, and (2) to identify evidence about psychometric properties and performance of these instruments in the specific context of cancer.

Search stage 1: Identifying candidate instruments
A combination of controlled vocabulary (MeSH) and free-text terms was used. Four groups of terms were generated: (1) generic names for measures; (2) impact on caregiver; (3) describing the population; and (4) psychometric performance. OvidSP was used for MEDLINE [MEDLINE(R) In-Process & Other Non-Indexed Citations and MEDLINE(R) 1946 to Present] and EMBASE (1947-current) searches. EBSCOhost was used for CINAHL (1937-present) and PsycINFO (from 1800s to present) searches. Terms were modified as appropriate for each database and limited to English language only. Searches were run on 20 November 2014 (see "Appendix 1" for the search strategy used for MEDLINE, adapted for other databases).

Study selection criteria
Inclusion criteria for stage 1 were self-report instruments of the impact of caring for patients with cancer or any other condition on the caregiver. Searches were not limited by study design or date, but were restricted to articles in the English language.
Instruments were excluded if they were developed: (1) to address a broad population not specifically for caregivers; (2) to focus on caregivers of children or children who are themselves caregivers; (3) to be administered only by an interviewer or clinician; (4) to measure unmet needs or objective aspects of caregiving, e.g. the amount of time or nature of tasks fulfilled; (5) to evaluate caregivers' assessment or beliefs about their caregiving skills or performance; (6) for use in a non-English-speaking population and for which an English version was not available; (7) for use by patients rather than caregivers (e.g. patient estimates of the impact of their illness on the caregiver); and/or (8) to measure caregiver bereavement.
Titles and abstracts were screened independently by two reviewers (VS/LM) for names of instruments that met the inclusion criteria, resulting in a list of eligible candidate instruments (Fig. 1).

Search stage 2: Identifying evidence of the psychometric properties of candidate instruments in the cancer caregiver population
Separate searches were conducted for each of the candidate instruments for studies designed to evaluate their psychometric performance in caregivers of cancer patients. Search terms were grouped as follows: (1) names and acronyms of the candidate instruments identified in stage 1; (2) target population; (3) psychometric terms; and (4) cancer terms (see "Appendix 2" for the search strategy used for MEDLINE and adapted for other databases). Searches were run on 16 January 2015 (CINAHL and PsycINFO) and 5 February 2015 (MEDLINE and EMBASE).

Study selection criteria
Studies that reported the reliability, validity, responsiveness, precision and/or acceptability of the caregiver impact measure and met the inclusion and exclusion criteria listed above were selected for this review. Cross-cultural studies were included only if referencing an English language version of the instrument. Searches were not limited by study design or date, but were limited to articles and instrument use in the English language.
In addition to the exclusion criteria from stage 1, papers were excluded if: the instrument was used as a "gold standard" to test other measures; psychometric evidence was reported incidentally in studies not designed to evaluate those properties; the study addressed preference weighting or scaling issues for preference-based measures; or the paper was an editorial, opinion, letter or meeting abstract. Titles and abstracts were screened independently by two reviewers (VS/LM, Fig. 2).

Citation chasing
Backwards citation chasing (one generation) using reference lists of all studies included in this stage of the review and forwards citation chasing (one generation) using Science Citation Index Expanded and Social Science Citation Index Expanded via Web of Science identified no additional eligible studies.

Data extraction
For each included measure, we extracted: name of measure and acronym, key reference/development paper, purpose of measurement, number of items, completion time, response options, recall period, population originally developed with and types of domains/dimensions assessed.
For each included paper, the following descriptive data were extracted: instrument version, first author name, publication year, study aim, study population, number of participants and setting/country where the study was conducted. Any data on evidence of the psychometric properties or performance of instruments were extracted including content validity (theoretical framework and/or qualitative research), construct validity (structural validity and hypothesis testing), internal consistency, test-retest reliability, precision, responsiveness and acceptability. Data were extracted by one reviewer (LM/VS) and checked by a second reviewer (LM/VS).

Evidence for psychometric performance
Evidence of psychometric performance was compared to reference criteria for (1) content validity (qualitative research with potential respondents and involvement in development stage and item generation, clear conceptual framework); (2) construct validity assessed through convergent and divergent validity demonstrated by the ability to differentiate known groups, and/or a pattern of correlation between the scale and other measures; (3) structural validity from factor analysis; (4) criterion validity (concurrent validity assessed through correlation with a gold standard and/or predictive validity where the predicted strength and direction of correlations/direction of group differences should be identified a priori); (5) reproducibility/test-retest reliability (intraclass correlation coefficient >0.7 adequate, >0.9 excellent); (6) internal consistency (Cronbach's alpha coefficient 0.7 ≤ α ≤ 0.9, item total correlations >0.2); (7) responsiveness (change pre-post intervention statistically significant and/or difference of expected magnitude); (8) precision (assessment of measurement error, floor or ceiling effects <15 %; evidence from Rasch analysis); and (9) acceptability (nonresponse/non-completion of questionnaires, proportion of missing data) [8].
For each property, the paper was given a rating of 0 if it did not evaluate or report the property, * if the property was evaluated and met the criteria partially (e.g. not for all domains), + if the property was evaluated and met the criteria and - if the finding went against the prediction.
Judgements on whether criteria were met were made by two reviewers (VS/LM) with disagreement resolved in discussion with another reviewer (LJF/VJ) where necessary. Content validity is only appraised for papers reporting measure development.
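The way the numeric reference criteria feed into a rating can be illustrated with a minimal Python sketch. It is not the procedure used by the reviewers, whose ratings were judgements made by two people. The names `Rating`, `rate_internal_consistency` and `classify_icc` are hypothetical, and treating "no subscale within range" as "criteria not met" is an assumption made only for illustration.

```python
from enum import Enum


class Rating(Enum):
    """Hypothetical labels mirroring the four rating levels described above."""
    NOT_EVALUATED = "not evaluated or reported"
    PARTIAL = "criteria partially met"
    MET = "criteria met"
    NOT_MET = "criteria not met"  # assumption: used when no subscale is in range


def rate_internal_consistency(alphas, lower=0.7, upper=0.9):
    """Rate one paper from the Cronbach's alpha values it reports.

    `alphas` is a list with one alpha per scale or subscale; an empty list
    means the property was not evaluated.
    """
    if not alphas:
        return Rating.NOT_EVALUATED
    in_range = [lower <= a <= upper for a in alphas]
    if all(in_range):
        return Rating.MET
    if any(in_range):
        return Rating.PARTIAL  # e.g. criterion met for some domains only
    return Rating.NOT_MET


def classify_icc(icc):
    """Classify an intraclass correlation coefficient (>0.7 adequate, >0.9 excellent)."""
    if icc > 0.9:
        return "excellent"
    if icc > 0.7:
        return "adequate"
    return "below criterion"
```

For example, a paper reporting subscale alphas of 0.80 and 0.91 would be rated as partially meeting the criterion under this sketch, because one value exceeds the upper limit.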

Examination of instrument content and categorisation into related domains
Individual scale items from all included measures were systematically categorised by the authors into conceptual domains. Initial domains were identified from the literature, and additional domains were defined until all individual items had been mapped. The content of each was then reviewed by the team to ensure that the concepts were consistently applied and had face validity [9].

Results
The purpose of stage 1 was to generate a list of eligible candidate instruments. Thirty-two were identified (Fig. 1), and in stage 2, individual searches were conducted for each. Ten measures were excluded as no candidate papers were identified. After deduplication, 186 unique records were screened. One hundred and seventy-six were excluded because they did not meet inclusion criteria, resulting in the inclusion of 10 papers that reported on the psychometric properties of 8 eligible measures in the cancer caregiver population (see Fig. 2 for full details of reasons for exclusion). Table 1 details the general characteristics of the 8 included measures, including full name and acronym. We refer to measures by the acronym. Of the 8 included measures, 5 were initially developed for cancer caregivers, 2 of which were specifically developed for use in the palliative setting [10,11]. Three measures were developed relatively recently [10][11][12] and five between 1980 and 1999 [13][14][15][16][17]. Two [10,13] measure caregiver appraisal specifically, with a theoretical underpinning from the stress and coping model of Lazarus and Folkman [18]; two were designed to measure subjective burden ± distress [12,17] (the Zarit Burden Interview was later revised [19]); three were multidimensional quality of life measures [11,14,16] and one a multidimensional measure of caregivers' reactions to caring for a family member [15]. Table 2 describes the ten studies reporting on the psychometric properties of the measures in the cancer caregiver population in terms of the instrument and version, study aim, population, setting/country and number of participants.

Psychometric performance
Appraisal of the psychometric performance reported in each paper is given in Table 3.

Content validity
Of the six studies describing measure development, content validity was generally well described and acceptable. Four (BASC, CQOLC, CRA, QOLLTI-F) describe qualitative work with potential respondents for item development and reduction [11,12,14,15]; two (ACS, FACQ-PC) describe a clear underpinning conceptual framework but no involvement of potential respondents [10,13].

Criterion validity
Concurrent validity: the 6 short forms of the ZBI were validated against the 22-item version as gold standard. Spearman rank order correlations ranged from 0.63 for the one-item version to 0.95 for the 12-item scale [20]. Concurrent validity of the ACS Benefit subscale only was assessed against the Benefit Finding Scale as the gold standard (r = 0.56) [21].
Predictive validity: predictive validity of the ACS was assessed against hopelessness and depression scores at time 2 [21]. Although overall a significant amount of variance in hopelessness (33.3 %) and depression (27.8 %) was explained by ACS scores at time 1, only half of the predictive validity hypotheses were supported. Criterion validity of the QOLLTI-F was assessed using a 2-item measure of global quality of life. The QOLLTI-F predicted between 43 and 55 % of the variance depending on whether individual items (55 %), subscale scores (53 %) or total score (43 %) were regressed.

Structural validity
Structural validity using factor analysis was described in five of the studies. For the CRA [15], exploratory factor analysis supported the five-subscale solution accounting for 65.1 % of variance. Confirmatory factor analysis demonstrated factorial invariance across disease (cancer vs dementia), caregiver type (spouse vs non-spouse) and over time. For the QOLLTI-F [11], the authors describe an acceptable seven-factor solution with exploratory factor analysis (although the total amount of variance explained is not reported) with factor loadings from 0.39 to 0.88. For the FACQ-PC [10], principal axis factor analysis supports a four-factor solution with factor loadings ranging from 0.33 to 0.92. Although all items load highest on the predicted factor, two items cross load (>0.3). Lambert and colleagues [21] report a three-factor solution for the ACS which supports the original subscales, had minimal cross-loadings and factor loadings ranging from 0.405 to 0.726. Glajchen et al. [12] report a five-factor solution for the BASC while noting that one item cross loads. The authors do not report their methods or the factor loadings.

Construct validity: hypothesis testing
Six studies assessed construct validity through convergent and divergent validity. For the ACS, only 5/12 correlations between subscales of the ACS and other measures exceeded the authors' criterion of ±0.3 to demonstrate construct validity [21]. Both papers assessing the CQOLC report moderate-to-high correlations with measures completed at the same time [14,22]. Only the initial validation study assessed divergent validity using dissimilar measures and found that these gave low correlations with CQOLC scores as expected [14]. Construct validity of the BASC was supported by an appropriate pattern of moderate-strong correlations with similar measures [12].
Strong correlations were found between subscales of the FACQ-PC [10] and measures used to test convergent validity; however, positive caregiving appraisals were only weakly to moderately associated with positive affect (r = 0.3). Similarly, to demonstrate divergent validity, four correlations were calculated between subscales and other measures which should yield low, negative correlations. While all were negative, two correlations were moderate in magnitude (r = -0.4 and r = -0.38).
Table notes: (a) The data on confirmatory factor analysis are also reported by Stommel and colleagues [24], which was identified in forward citation searches. This paper is excluded as it concerns the same data. (b) Authors have now produced a second version of this questionnaire with minor changes and an additional question; however, the validation studies are as yet unpublished.
Construct validity for the CRA [15] was assessed by correlating subscale scores with caregiver depression and patient dependencies in activities of daily living (ADL). The five subscales were, as predicted, weakly correlated with patient dependencies in ADL. Correlations with depression were in the appropriate direction and ranged from -0.23 to 0.57 in magnitude.
Three studies conducted hypothesis testing by assessing "known-group" differences. Group differences analysis for the ACS was only partially supportive of construct validity with only 3/9 hypotheses significant [21]. The BASC was able to discriminate between male and female caregivers and between different relationships between caregivers and patients. The negative personal impact subscale, but not the total score, differentiated between caregivers with and without mental health conditions. There were weak correlations overall with depression, high blood pressure and gastrointestinal complaints. All short forms of the ZBI were shown to have good discriminative validity, correctly classifying participants as those with and without burden (compared with the classification on the 22-item version as the gold standard).
Two studies [13,23] report comparisons between groups (e.g. male/female; spouse/non-spouse), but these were not established a priori as known-group differences for hypothesis testing. One did not examine construct validity with convergent, divergent or known-group analysis [11].

Internal consistency
Internal consistency was assessed in all papers. All subscales of the ACS had α > 0.7 in both papers [13,21] with the exception of the challenge subscale, which was subsequently dropped from the measure [13]; the threat subscale slightly exceeded the upper limit of α at 0.91 in one paper [13]. Overall α for the BASC was just acceptable (0.7); the negative personal impact factor, which can be used as an independent subscale, was 0.8. For the CQOLC, α approaches and slightly exceeds the upper limit (α = 0.87 and 0.91, respectively) [14,22]. The five subscales of the CRA range from α = 0.8 to 0.9 [15] and the four subscales of the FACQ-PC from α = 0.73 to 0.86. Item total correlations were all in excess of 0.2, the strongest 0.78 [10]. Overall α for the QOLLTI-F was 0.86. The individual subscales were generally weaker, ranging from α = 0.48 to 0.81, which may reflect the small number of items in some subscales. The measure also includes two single-item subscales [11]. Internal consistency for the QOL-FV was α = 0.89. Finally, internal consistency for the ZBI 22-item version as gold standard was α = 0.88 and ranged from 0.69 for the 4-item short version to 0.85 for the 12-item version.
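For readers less familiar with the statistic, Cronbach's alpha can be computed from a respondents-by-items score matrix as k/(k-1) × (1 − sum of item variances / variance of total scores). The Python sketch below uses only the standard library; the function name `cronbach_alpha` and the example scores are hypothetical and are not data from any included study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a scale.

    `scores[r][i]` is the response of respondent r to item i. Sample
    variances (n - 1 denominator) are used throughout.
    """
    k = len(scores[0])                        # number of items in the scale
    items = list(zip(*scores))                # one tuple of responses per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Illustrative only: three respondents answering a two-item subscale.
example = [[1, 2], [2, 3], [3, 4]]
alpha = cronbach_alpha(example)
```

Because alpha depends on the number of items k, short subscales tend to yield lower values, which is consistent with the weaker subscale results noted above for measures with few items per subscale.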

Test-retest reliability
Three of the five papers with at least two time points did not attempt to assess test-retest reliability [15,21,23]. Test-retest reliability of the CQOLC was found to be excellent (0.95) [14]. For the total QOLLTI-F, test-retest reliability was found to be acceptable between T1 and T2 (0.77) and T2 and T3 (0.80). Intraclass correlations for individual subscales were below an acceptable level in 10 out of 14 cases, which may reflect the small number of items in the subscales [11].

Responsiveness
Responsiveness of the QOLLTI-F [11] was assessed by contrasting subscale scores on days that participants considered to be bad, average and good. These differences were statistically significant in all comparisons with the exception of the financial concerns subscale between average and good days. All differences between good and bad days exceeded 0.5 s.d. for minimal important difference. Only 3/8 comparisons did so between good and average and 4/8 between average and bad days. The "potential to be responsive to change" of the CQOLC was assessed by using CQOLC to predict patient performance status at a single time point rather than measure responsiveness to change over time. The studies report contradictory findings: in one [14], the predicted significant negative correlation between CQOLC scores and patient performance status is reported as significant (r = -0.46, p < 0.0001), but in the other [22], this correlation approaches zero (r = 0.09).
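The half-standard-deviation rule mentioned above can be sketched as follows. This is a minimal illustration only: the function name `exceeds_half_sd` is hypothetical, and the choice of which scores supply the reference standard deviation is an assumption, since the source does not state it.

```python
from statistics import mean, stdev


def exceeds_half_sd(scores_a, scores_b, reference_scores, fraction=0.5):
    """Check whether two groups of scores differ by more than `fraction` SD.

    `reference_scores` supplies the standard deviation used as the yardstick
    (an assumption here; it could be, for instance, all pooled observations).
    """
    threshold = fraction * stdev(reference_scores)
    difference = abs(mean(scores_a) - mean(scores_b))
    return difference > threshold


# Illustrative scores only, not study data.
good_days = [8, 9, 10]
bad_days = [4, 5, 6]
important = exceeds_half_sd(good_days, bad_days, good_days + bad_days)
```

A statistically significant difference does not necessarily pass this check, which is why the two are reported separately above.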

Precision
None of the included studies conducted Rasch analysis or an assessment of measurement error. Floor and ceiling effects were not formally reported in any paper although two subscales of the QOLLTI-F [11] were described as having a lack of variance due to ceiling effects which made them less predictive of global quality of life. The subscales, quality of care and relationships, comprised two items each.
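A formal check of floor and ceiling effects against the 15 % criterion is straightforward to report. The Python sketch below shows one way it could be done; the function name `floor_ceiling_effects` and the example scores are hypothetical.

```python
def floor_ceiling_effects(scores, min_score, max_score, limit=0.15):
    """Proportion of respondents at the lowest and highest possible score.

    An effect is flagged when the proportion reaches `limit` (15 % by default).
    """
    n = len(scores)
    floor = sum(s == min_score for s in scores) / n
    ceiling = sum(s == max_score for s in scores) / n
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_effect": floor >= limit,
        "ceiling_effect": ceiling >= limit,
    }


# Illustrative scores on a 0-10 subscale, not study data.
result = floor_ceiling_effects([0, 10, 10, 10, 5, 5, 5, 5, 5, 5], 0, 10)
```

In this made-up example, 30 % of respondents score at the maximum, so a ceiling effect would be flagged, while the 10 % at the minimum would not count as a floor effect.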

Acceptability
The acceptability of measures was not consistently reported and was difficult to assess using missing data and participation rates, as the measure is often given as part of a pack and information is not assessed separately. No information pertaining to acceptability was provided by four studies [10,20,21,23]. In five studies, acceptability was appraised as only partially evidenced due to high dropout or incomplete data [11,12,14,15,22], surprising for the QOLLTI-F which had thoroughly tested acceptability in the development phase [11]. For the ACS [13], overall response rate was 74 % (including postal responses) and only 3/50 participants were eliminated due to missing data, suggesting the questionnaire was acceptable.

Examination of instrument content and categorisation into related domains
The 8 included instruments yielded 194 individual items. These were categorised into 16 conceptual domains under 5 overarching themes of approximately equal size: lifestyle disruption (22 % of items); well-being (22 %); health of the caregiver (21 %); managing the situation (18 %) and relationships (18 %). Most dominant domains were "confidence, self-esteem and self-efficacy" (24 items across 7 measures) and "psychological health of the caregiver" (22 items across 6 measures). Least represented were "impact on other family members" (2 items across 2 measures) and "impact on paid employment" (2 items across 2 measures). The distribution and total number of items across the different domains along with example items are given in Table 4.

Discussion
This systematic review was conducted to investigate instruments commonly used to measure caregiver impact in cancer. Specifically, we sought to identify (1) what caregivers were being asked about, and (2) whether the measures performed well in psychometric evaluation. Psychometric appraisal is critical to establish the quality and standards of a measure in a given context. With so many instruments available to researchers, this review is intended as a resource to enable researchers to judge for themselves whether the content and quality of the instruments described match their requirements.
For 24 of the 32 identified measures, we found no evidence of psychometric performance using English language versions with cancer caregivers (see electronic supplementary material for a list of these measures). This is not to say that the measures have not been evaluated, but that we found no evidence in cancer. When assessing the performance of an instrument, the context is critical as it may perform differently in other populations. Without evaluation in cancer, researchers cannot be sure that instruments are reliably measuring the intended constructs. For 6 of the remaining 8 questionnaires, evidence of psychometric performance was identified in only a single study. The small number of studies identified meant that the evidence for psychometric performance was appraised for each individual paper, rather than allowing the appraisal to consider performance across a group of studies. In particular, the lack of studies beyond the initial validation of some measures is of concern. Most studies collected data at a single time point, and so, responsiveness to change, test-retest reliability and measurement error were not assessed. For the most part, content validity and internal consistency were reported and were adequate. Structural validity was assessed using factor analysis in five studies; four met these criteria, one only partially. Construct validity was assessed using different approaches to hypothesis testing in seven papers, fully meeting the criteria in only three. In the limited number of papers included, strongest support for psychometric performance was reported for the CRA and CQOLC. We set out to examine not only the psychometric performance of these measures in a cancer population, but also to understand what concepts and domains were being assessed.
Table note: Items categorised under the domain "spirituality" make direct reference to spirituality, faith, prayer, church/temple, etc.
Only one of the 16 conceptual domains, time for self, social life and leisure, was represented in all eight instruments. There was considerable overlap in the domains measured, however, with 8/16 domains being assessed in at least 6/8 measures. We have identified several areas which are not well captured by the instruments included in this review.

Paid employment
Impact on paid employment was assessed with a single question on two measures, neither of which addressed impact on career aspiration and planning or career progression, simply whether paid employment had been affected (FACQ-PC [10], QOL-F [16]).

Family members
We also found that impact on the family as a unit was not well covered in the current measures. Impact on other family members was only addressed with a single question on two questionnaires (again CQOLC [14] and QOL-F [16]).
Current scales do not adequately capture role changes and responsibilities in the household and family routines, for example the impact on other caregiving responsibility such as to children or parents. Taking on new roles and responsibilities can raise issues around feelings of competence in the role, role strain and conflict and family cohesion. Such changes may be fluid as the cared-for member of the family moves through different phases of their cancer and treatment.
Five of the eight measures included in this review were developed between 1980 and 1999. There has been considerable societal change in the intervening years; families, roles and responsibilities are structured differently. It is not clear how appropriate some of the older questionnaires are for the present day. The activities associated with informal caregiving incorporate a range of tasks affecting different aspects of the life of the caregiver and the whole family [25,26]. The impact of caregiving will vary depending not only on the patient's situation but also on family make-up, in terms of other caregiving responsibilities, financial and occupation role responsibilities and time of life [6,27,28]. Impact is also likely affected by the number of other social roles, such as employment and other caregiving responsibilities that the caregiver has [3]. There is limited research about how the effect on variables such as employment and role strain might change over time, as caring responsibilities likely vary in line with different lines of treatment or transition to palliative care [27,29,30]. Future measures should attempt to capture the changing nature of caregiver impact.
We have identified a number of areas which are currently not well captured by measures that have been evaluated in cancer. These gaps may exist for several reasons. First, some measures were not initially developed for this population, and so, constructs important to cancer caregivers may not have the same salience. Second, some measures are old and may not reflect what is important in current society. Third, advances in cancer treatments mean that many more people are living a long life with cancer. For some, initial therapy is just the start of a journey that will involve repeated lines of treatment over time. The patient and the whole family have to continually adjust to a fluid situation and will be impacted variably at different times while trying to maintain a sense of normality in other aspects of their lives. Better treatments mean a longer life with cancer is a possibility for patients; however, we need ways to measure the longer-term impacts of cancer and cancer treatment for them and their informal caregivers. At the very least, we would suggest caregiver input into updating some of the older content if not the development of a new measure to capture the broader impacts we have described.

Limitations
The search strategy may have limited the number of papers identified in two ways: (1) measures were searched for by name and acronym, but the precise wording of the measure name and even the acronym sometimes varied; (2) reporting standards have changed, and some older papers make poor use of keywords and do not always include psychometric terms or the names of measures in the title, abstract or keywords. The impact of both of these limitations is mitigated by thorough backwards and forwards citation chasing.
We intentionally restricted the review to studies that reported on the psychometric properties of the English version of measures. This decision was taken as we felt we could not assume cultural equivalence for the caregiving role or the salience of different aspects of burden and impact in diverse populations. We took the decision to exclude all non-English versions of the measures rather than make subjective decisions as to whether one culture was sufficiently similar, while another was not. We are aware, however, that there are a number of studies reporting on the psychometric properties of other language versions of measures included in this review, e.g. [31][32][33][34][35][36][37][38]. We acknowledge there may be cultural differences between and within different countries where English is commonly spoken and where measures developed in English have been used. While this is an extremely important area of research, it is beyond the remit of the current review and it is not an aim of this study to investigate these potential differences. In this review, 7/10 included studies were conducted in the USA and 1 study each in the UK, Australia and Canada.
We also recognise that the pool of individual items identified is restricted by our stringent inclusion criteria for measures. Measures developed in other contexts, e.g. family function in a paediatric setting [39], for economic evaluation [40], domain-specific measures [41] and multidimensional measures which have not been subject to psychometric evaluation in cancer caregivers in the English language [6,42,43] may include concepts and items that are pertinent but which would need to be evaluated in appropriate studies.

Conclusions
A large number of measures purport to assess caregiver impact, but most have not been subject to psychometric evaluation in cancer populations. Few studies met our inclusion criteria so it was not possible to consider psychometric performance of the measures across a group of studies. Our content analysis identified several areas which are currently not well captured. These include changes to career aspiration and planning, changes in roles and responsibilities within the family and the way the family functions as a unit. We also note that some of the measures were developed up to 35 years ago, and their relevance to the current day may need to be reviewed. Strategies to overcome some of these limitations could include caregiver input into revising existing measures or using two or more measures to cover a broader range of outcome domains. However, our review suggests there is a need for a new measure capturing the impacts on broader areas of life for the caregiver and the family unit.