A classification of response scale characteristics that affect data quality: a literature review

Quite a lot of research is available on the relationships between survey response scales’ characteristics and the quality of responses. However, it is often difficult to extract practical rules for questionnaire design from the wide and often mixed amount of empirical evidence. The aim of this study is to provide first a classification of the characteristics of response scales, mentioned in the literature, that should be considered when developing a scale, and second a summary of the main conclusions extracted from the literature regarding the impact these characteristics have on data quality. Thus, this paper provides an updated and detailed classification of the design decisions that matter in questionnaire development, and a summary of what is said in the literature about their impact on data quality. It distinguishes between characteristics that have been demonstrated to have an impact, characteristics for which the impact has not been found, and characteristics for which research is still needed to make a conclusion.


Introduction
A challenge for questionnaire designers is to create survey measurement instruments (from now on called: survey questions) that capture the true responses from the population. To do so, they need to create survey questions that not only capture the theoretical concept under evaluation, but that also minimize the impact of their design characteristics on the quality of the responses.
Deciding on the right characteristics of a survey question is not a straightforward task. For instance, 'What is the optimal number of response options to use?' or 'Shall I label all options in the scale?' are recurrent questions without a clear answer in the field of questionnaire design and survey methodology. However, making the right decisions is crucial if one wants to minimize the impact of these characteristics on a survey's data quality (Alwin 2007; Dolnicar 2013; Krosnick 1999; Krosnick and Presser 2010; De Leeuw et al. 2008; Saris and Gallhofer 2014; Schuman and Presser 1981).
Within the Total Survey Error framework (Groves et al. 2009), the way a survey question is designed has a direct influence on the responses given to that question, and affects the overall data quality of the survey. The observational gap between the ideal measurement and the response obtained is defined as measurement error. Studies assessing the influence of question characteristics on measurement error show that these characteristics explain between 36 and 85% of its variance (Andrews 1984; Rodgers et al. 1992; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997). Saris and Revilla (2016, p. 4) state that if measurement errors are ignored, "one runs the risk of very wrong conclusions with respect to relationships between variables and differences in relationships across countries".
Among the wide range of components that influence the design of a survey question, the choice of the response scale is often the most important decision to ensure good measurement properties. For instance, Andrews (1984) showed that the number of categories had the biggest effect on measurement quality, followed by the provision or not of an explicit "don't know" option. Moreover, the design of the scale is often the most complex in terms of the number of decisions that influence the way respondents interpret the options provided.
The literature on how to design scales is extensive. Most research is directed to the study of a specific set of design characteristics, like the optimal number of points (Preston and Colman 2000; Revilla et al. 2014) or the kind of labels to use (Eutsler and Lang 2015; Moors et al. 2014; Weijters et al. 2010). Some literature reviews have been conducted to summarize all these findings (e.g. Dolnicar 2013; Krosnick and Fabrigar 1997; Krosnick and Presser 2010). However, these summaries focus on the most commonly used characteristics and do not provide an accurate guide to all the design decisions that developing a scale can require. Moreover, one can easily get lost because of the different classification strategies and the different terms researchers use to refer to the same aspects.
In this paper, I aim to provide an updated and detailed classification of the characteristics to be considered in the development of scales, together with their influence on data quality. Specifically, I focus on closed and ordinal forced-choice response scales because, in contrast to multiple-choice, open and nominal scales, many more subjective design decisions can take place.
To make such a classification, I conducted a review of the literature with two main objectives: (1) classify the characteristics of response scales, and (2) assess whether evidence has been found, in the literature, regarding the impact of those characteristics on data quality.
The remainder of this paper is organized as follows: Sect. 2 presents the methodological procedure followed to review the literature and make the classification. Section 3 presents the findings from the literature review following the classification. Finally, Sect. 4 concludes with the main findings of this research.
2 Methodological procedure
I conducted a review of the literature looking for evidence about the relevance of the characteristics of closed and ordinal response scales.
As a starting point, I took the list of characteristics developed by Saris and Gallhofer (2007) and further updated in Saris and Gallhofer (2014). They structured this list into characteristics, each of which groups different mutually-exclusive choices. For instance, the characteristic "labels of categories" groups three possible choices: no labels, partially-labelled or fully-labelled. In total, they considered more than 280 possible choices, among which 40 choices are related to the design of the scale and belong to 17 characteristics. Table 2 in the Appendix provides the list of response scale characteristics and the choices considered by these authors. This list covers most characteristics used in the development of scales for face-to-face surveys that use showcards as a visual aid for the respondent. Its major drawback concerns characteristics specific to the design possibilities offered by other modes of survey administration, such as the different formats of visual presentation of scales available in web surveys, which it does not cover. From this preliminary list, I conducted an in-depth search for publications that mention these 40 design choices in academic journals or book chapters.
While reviewing the literature, I focused, on the one hand, on identifying other characteristics and design choices and, on the other hand, on searching for empirical evidence and/or theoretical arguments that assess whether these design choices have an impact on data quality.
In relation to the empirical evidence, it is often difficult to extract general conclusions since studies differ in the type of questions under examination, the sample characteristics, the mode of administration, and especially the type of quality indicators used. Moreover, there are clear dependencies between characteristics. However, my goal in this paper is to identify whether there is any kind of empirical evidence in the literature; thus, I do not differentiate by study characteristics, by the sign of the effect found, or by the kind of indicators. In fact, a wide range of indicators of measurement quality, or of its complement, measurement error, are considered in the literature. Here, I consider different types of response style bias, like extreme and middle responding and acquiescence, as well as item non-response and satisficing bias, as indicators of measurement error. Furthermore, I consider different measures of reliability and validity as indicators of measurement quality.
The reviewed literature often uses different terms for the same types of design choices. To provide a clear summary of the literature review, an initial step is to harmonize the terminology. When necessary, I therefore renamed characteristics and added more possible design choices. I thereby also identified gaps, i.e. non-studied variations that should also be considered. Subsequently, as illustrated in Fig. 1, I grouped sets of related characteristics into families and, within each characteristic, the different mutually-exclusive choices one could take.
Next, using this classification, I summarize the results of the literature review.
3 The findings from the literature review
Table 1 presents this classification and indicates, for each characteristic, which of four possible scenarios applies regarding its impact on data quality: (1) the characteristic has been empirically demonstrated to have an impact on data quality (Yes); (2) it has been shown not to impact data quality (No); (3) it has not been studied (NS); or (4) its impact is not yet clear enough to draw a conclusion (NC). Below, a detailed description of each characteristic and its design choices, together with the findings related to their influence on data quality, is provided using the classification presented in Table 1. The description follows the detailed summary provided in Table 3 in the Appendix, which also provides all the theoretical and empirical references used as well as the indicators used to assess the impact on data quality in each study.

Scales' evaluative dimension
The evaluative dimension of the scale comes from the underlying theoretical concept that is intended to be measured by the survey question. The basic distinction is between agree-disagree and item- (or construct-) specific scales.
Agree-disagree scales can be used to evaluate the level of agreement or disagreement towards a statement or a stimulus, for instance, asking "Do you agree or disagree that your health is good?" and providing the respondents with the options "agree" and "disagree". This type of scale has received a lot of attention from researchers. These scales are simple to design (Brown 2004; Schaeffer and Presser 2003) but they require a major cognitive effort from respondents (Kunz 2015). Empirical evidence has shown the presence of acquiescence bias, i.e. the propensity to agree, in such scales (Billiet and McClendon 2000). Item-specific scales can be used to measure variables for which the scale options directly refer to the theoretical concept under evaluation. For instance, when asking "How good or bad is your health?", an item-specific scale would provide the respondents with the options "good" and "bad". Comparing item-specific with agree-disagree scales, studies have shown that item-specific scales provide higher measurement quality (Alwin 2007; Krosnick 1991; Revilla and Ochoa 2015; Saris et al. 2010; Saris and Gallhofer 2014). The choice of the scale's evaluative dimension therefore has an impact on data quality.

Scales' polarity
Every concept has a theoretical range of polarity, which can be either bipolar or unipolar. While bipolar constructs range from positive to negative with a neutral midpoint, unipolar constructs range from zero to some maximum level with no neutral midpoint. A scale's polarity refers to the conceptual extremes of the labels used in the scale. A bipolar scale uses the two theoretical poles of the bipolar concept being measured as the scale's extremes, for instance, "satisfied" and "dissatisfied". A unipolar scale uses only one pole of the concept being measured for one extreme and its zero point for the other, for instance, "important" and "not important at all". This distinction is relevant because, if a unipolar scale is used to measure a bipolar concept, the scale will be one-sided towards the positive or the negative pole. Moreover, it is important to consider since specific characteristics, like the use of a midpoint or the use of a symmetric scale, depend on whether the scale is provided as unipolar or bipolar. While bipolar scales ask about the neutrality, the direction and the intensity of an opinion, unipolar scales only ask about the extremity or intensity. Moreover, bipolar scales have the disadvantage that some respondents are reluctant to choose negative responses (Kunz 2015), and reliability is somewhat higher in unipolar scales than in bipolar scales (Alwin 2007). However, I have not found more studies assessing the impact of the scale's polarity on data quality. Thus, more research is needed to confirm its relevance.

Concept-scale polarity agreement
The distinction between the polarity of the concept and that of the scale is key, since the non-differentiation between bipolar and unipolar attributes has resulted in "misinterpretations of the empirical findings" (Rossiter 2011, p. 105). Even so, when designing survey questions, this characteristic has received little attention compared to other aspects of the survey questions. It has been shown that this characteristic has an impact on response styles (van Doorn et al. 1982) but no clear impact on measurement quality (Saris and Gallhofer 2007). Thus, more research is needed about its impact on data quality. Following the classification of Saris and Gallhofer (2007), the design of concept-scale polarity can be: both bipolar, both unipolar, or bipolar concept with a unipolar scale. In practice, even if unipolar concepts should theoretically be measured using unipolar scales, we also find bipolar scales. For instance, a scale ranging from "Completely unimportant" to [...]
[...] (3) relative metric scales, a kind of scale that also requires the specification of a standard to give relative evaluations. However, in this case, respondents are asked to draw a line relative to the standard provided instead of giving a numerical answer; and (4) absolute metric scales, where respondents should select a point in a continuum. Typically, such a scale is presented as a straight horizontal or vertical line with specified anchors on each endpoint.
Rounding is the major problem of continuous numeric options. It has been shown that respondents create their own grouped response categories, often using exact multiples of 5 (Liu and Conrad 2016; Tourangeau et al. 2000), except for relative metric scales which, in contrast, require the length of the lines to be measured later (Saris and Gallhofer 2014). Relative scales are argued to be more burdensome for respondents, who must give not an absolute evaluation but an answer relative to the standard value specified (Krosnick and Fabrigar 1997). Moreover, specifying an appropriate standard is sometimes hard, since it is important to use a standard that is "part of actual experience for all respondents" and "perceived as distinct from the 0 point" (Schaeffer and Bradburn 1989, p. 412). The impact on measurement error of using these types of scales has been studied by comparing absolute open-ended quantifiers with absolute metric scales, with mixed results: Liu and Conrad (2016) find non-significant differences in item-nonresponse, and Couper et al. (2006) find higher item-nonresponse for the metric scale.
Scales can also provide a limited number of categorical options. I distinguish four main types of categorical scales: (1) dichotomous scales, which provide only two substantive response options, typical examples being yes-no and true-false; (2) rating scales, which provide three or more categorical options; (3) closed quantifiers, which are mainly used for objective variables such as the frequency of activities, and which become open-ended quantifiers if their response alternatives are omitted; and (4) branching scales, which are used to simplify the respondents' task when answering long bipolar scales. Branching scales consist of dividing the response task into two steps. First, the respondents are asked about the direction of their judgement, i.e. the neutral alternative versus the extreme sides of the bipolar scale. Second, they are asked about the extremity or intensity of their judgement on the selected side.
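The two-step logic of a branching scale can be illustrated with a short sketch. The Python fragment below is only an illustration under stated assumptions and is not taken from the reviewed literature: the function name `branching_to_bipolar`, the coding of the direction as "negative", "neutral" or "positive", and the mapping of both answers onto a signed integer score are hypothetical choices made for the example.

```python
from typing import Optional


def branching_to_bipolar(direction: str,
                         intensity: Optional[int],
                         max_intensity: int = 3) -> int:
    """Combine the two answers of a branching question into one bipolar score.

    direction: answer to step 1 ("negative", "neutral" or "positive").
    intensity: answer to step 2, from 1 (weakest) to max_intensity (strongest);
               None when step 2 was not asked because step 1 was "neutral".
    The signed-integer coding is an assumption made for this illustration.
    """
    if direction == "neutral":
        # Step 2 is skipped for respondents choosing the neutral alternative.
        if intensity is not None:
            raise ValueError("no intensity is asked after a neutral answer")
        return 0
    if direction not in ("negative", "positive"):
        raise ValueError(f"unknown direction: {direction!r}")
    if intensity is None or not 1 <= intensity <= max_intensity:
        raise ValueError("intensity must lie between 1 and max_intensity")
    sign = 1 if direction == "positive" else -1
    return sign * intensity


# A respondent who is on the negative side with the weakest intensity.
print(branching_to_bipolar("negative", 1))
```

Under this coding, three intensity levels per side plus the neutral alternative span the same seven positions (-3 to +3) as a seven-point bipolar rating scale.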
Rating scales require more interpretative efforts that may harm the consistency of the responses compared to dichotomous scales (Krosnick et al. 2005), whereas branching scales have been argued to be useful to explore the neutral alternatives and to provide large fully-labelled scales without a visual presentation (Schaeffer and Presser 2003). A drawback of closed quantifiers, compared to open quantifiers, is that the specified ranges inform respondents about the researcher's knowledge of (or expectations about) the real world (Schwarz et al. 1985; Sudman and Bradburn 1983). Along these lines, Revilla (2015, p. 236) recommends, for sensitive questions, providing "answer categories with high enough labels such that respondents do not feel that their behaviour is not normal", and, for non-sensitive questions, to "use labels following the expected population distributions such that respondents can use the middle of the scale as a reference point as to what is the norm, and evaluate their own behaviour as lower or higher than the average".
Looking at the impact on measurement quality, two-point scales usually perform worse than scales with more categories, with the exception of three-point scales (Krosnick 1991; Lundmark et al. 2016; Preston and Colman 2000). Only Alwin (2007) reports that dichotomous scales provide higher reliabilities than rating scales and absolute metric scales. In contrast, some studies find evidence that branching scales produce higher measurement quality than rating scales (Krosnick 1991; Krosnick and Berent 1993). When rating scales are compared to continuous scales, like absolute metric scales or open-ended quantifiers, evidence is mixed: continuous scales are more reliable in Saris and Gallhofer (2007), but in Couper et al. (2001) and Miethe (1985) they produced higher item-nonresponse and lower reliability, respectively, than rating scales, and Koskey et al. (2013) found no differences in measurement quality between the two. Comparing rating to metric scales, the latter appeared less reliable and led to higher item-nonresponse in the studies of Cook et al. (2001), Couper et al. (2006) and Krosnick (1991); however, others find a comparable impact of the two (Alwin 2007; Funke and Reips 2012; McKelvie 1978). Finally, Al Baghal (2014b) compares closed with open-ended quantifiers, showing nonsignificant differences in measurement quality.
Overall, the decision on the type of scale to provide has an impact on data quality and should be considered carefully when designing survey questions.

Scales' length
The length of the scale is one of the key issues in scale development. As Krosnick and Presser (2010, p. 269) say, "the length of scales can impact the process by which people map their attitudes onto the response alternatives".
The minimum and maximum possible values are used to evaluate the length of continuous scales. This characteristic has received some attention. Reips and Funke (2008) argue that differences in the length of metric scales may depend on the screen size and resolution of the device, while Saris and Gallhofer (2007) find a significant effect on measurement quality of the maximum possible value that can be answered in continuous scales.
The number of categories is used to evaluate the length of categorical scales. Among the characteristics of categorical scales, the number of categories is one of the most studied and complex design decisions: while a two-point scale allows only the assessment of the direction of the attitude, a three-point scale with a midpoint allows the assessment of both the direction and the neutrality, and even more categories allow the assessment of its intensity or extremity. Furthermore, while too few categories can fail to discriminate between respondents with different underlying opinions, too many categories may reduce the clarity of the meaning of the options and limit the capacity of respondents to make clear distinctions between them (Krosnick and Fabrigar 1997; Schaeffer and Presser 2003). The results regarding its impact on data quality are mixed. Most evidence suggests using more than two points to increase measurement quality (e.g. Andrews 1984). Some find evidence in favour of using 5 to 7 points (Komorita and Graham 1965; Rodgers et al. 1992; Scherpenzeel and Saris 1997). Others argue that 7 to 10 points should be preferred (Alwin and Krosnick 1991; Lundmark et al. 2016; Preston and Colman 2000). Some others argue that even more categories, i.e. 11 points, can provide better measurements (Alwin 1997; Revilla and Ochoa 2015; Saris and Gallhofer 2007). Finally, others do not find differences across different numbers of points (Aiken 1983; Bendig 1954; Jacoby and Matell 1971; Matell and Jacoby 1971; McKelvie 1978). More recently, research has looked at the specific circumstances of the questions when evaluating the impact of the number of points. Some find, when distinguishing between item-specific and agree-disagree scales, that quality does not improve for agree-disagree scales with more than 5 points (Revilla et al. 2014; Weijters et al. 2010) and that for item-specific scales it goes up between 7 and 11 points (Alwin and Krosnick 1991; Revilla and Ochoa 2015). Similarly, Alwin (2007) argues that the optimal number of points in a scale should be considered in relation to the scale's polarity, and shows that the use of 4-point scales improved reliability in unipolar scales, while 2-, 3- and 5-point scales improved reliability in bipolar scales.
This summary has clearly shown that the length of the scale is a characteristic to consider.

Verbal labels
Verbal labels are words used as a reference to clarify the meanings of the different scale points and the scale's interval nature, and to reduce ambiguity (Alwin 2007; Krosnick and Presser 2010). However, fully labelling all points has been found to increase the cognitive effort of reading and processing all options (Krosnick and Fabrigar 1997; Kunz 2015). Studies of the effects on response style bias show that acquiescence is higher and extreme responding is lower with fully-labelled scales (Eutsler and Lang 2015; Moors et al. 2014; Weijters et al. 2010). Regarding reliability, some studies show higher reliability for scales labelled only at the end-points than for fully-labelled scales (Andrews 1984; Rodgers et al. 1992), while the majority show that labelling all points in the scale has a positive impact on reliability (Alwin 2007; Alwin and Krosnick 1991; Krosnick and Berent 1993; Menold et al. 2014; Saris and Gallhofer 2007). Thus, the impact on data quality is clear.
Usually a distinction is made between fully-labelled, partially-labelled and not at all labelled scales. However, there are multiple ways to design a partially-labelled scale, and these should also be considered when assessing its effects on data quality. Thus, I propose the following distinction to cover the possible design choices in surveys: scales not at all labelled, labelled only at the end-points, labelled at the end- and midpoints, labelled at the end-points and more points but not all, and fully-labelled.
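The five labelling choices proposed above can be made operational with a small sketch. The following Python fragment is a hedged illustration only: the function name `labelling_type`, the representation of a scale as a list with `None` for unlabelled points, and the extra fallback category for scales whose end-points are not both labelled are assumptions introduced for the example, not part of the classification itself.

```python
from typing import Optional, Sequence


def labelling_type(labels: Sequence[Optional[str]]) -> str:
    """Classify a scale by which of its points carry a verbal label.

    labels: one entry per scale point, in order; None or "" means no label.
    """
    n = len(labels)
    if n < 2:
        raise ValueError("a scale needs at least two points")
    has = [bool(label and label.strip()) for label in labels]
    if all(has):
        return "fully-labelled"
    if not any(has):
        return "not at all labelled"
    ends = has[0] and has[-1]
    inner = has[1:-1]
    if ends and not any(inner):
        return "labelled only at the end-points"
    # A midpoint only exists when the number of points is odd.
    if ends and n % 2 == 1 and has[n // 2] and sum(inner) == 1:
        return "labelled at the end- and midpoints"
    if ends:
        return "labelled at the end-points and more points, but not all"
    # Hypothetical fallback, not one of the five proposed choices.
    return "other partial labelling"


print(labelling_type(["Very bad", None, "Neither", None, "Very good"]))
```

The fallback category is kept separate so that unusual designs are flagged rather than forced into one of the five proposed choices.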

Verbal labels' information
Verbal labels can provide different lengths and amounts of information. The more information is provided in the labels, the less information is needed in the request. Saris and Gallhofer (2007) distinguish between short labels or complete sentences and conclude that reliability improved when short labels instead of sentences are used. But still, more research is needed to assess the impact of this characteristic on data quality.
The length of a label does not in itself provide sufficient guidance on how to design it. For instance, even if using complete sentences may improve reliability, are very long labels still preferable? For this reason, I believe that what affects data quality may be the amount of information provided in the label rather than its length. Thus, I propose the following differentiation. Non-conceptual labels require a previous specification of the type of measurement concept. For instance, the labels "Not at all" and "Completely" cannot be used without a previous specification of the concept, for example in the form of a question: "How satisfied are you with your job?". Scales can otherwise provide conceptual labels like "Not at all satisfied". Verbal labels can also provide information about the object and/or the subject under evaluation. An example of an objective label would be "Not at all satisfied with my job", and of a subjective label, "I am not at all satisfied". Finally, a fully informative label would be "I am not at all satisfied with my job".

Quantifier labels
Two types of labels for closed quantifier scales can be distinguished. First, vague quantifier labels, which are known to be prone to different interpretations, e.g. "often" can mean "once a week" for one respondent and "once a day" for another (Pohl 1981; Saris and Gallhofer 2014). In terms of their impact on data quality, no clear conclusions can be drawn so far: Al Baghal (2014b) shows that measurement quality is not affected by vague labels for closed quantifiers compared to open-ended responses, while Al Baghal (2014a) finds higher levels of validity than in open-ended scales. Second, closed-range (or interval) quantifier labels, which, compared to vague quantifiers, are argued to be more precise and less prone to different interpretations (Saris and Gallhofer 2014). However, when closed-range quantifiers are provided, respondents may use the frame of reference provided by the scale in estimating their own behaviour (Schwarz et al. 1985). Selecting unbiased ranges that allow respondents to use the middle of the scale as a reference point is preferable (Revilla 2015). More research is needed to shed light on whether the use of vague or closed-range quantifiers impacts data quality.

Fixed reference points
Fixed reference points are verbal labels used in a scale to prevent variations in the response functions and to leave no doubt about the position of the reference point in the mind of the respondent (Saris 1988; Saris and Gallhofer 2014). For instance, "always" and "never" can be fixed reference points on objective scales, and the words "not at all", "completely", "absolutely" and "extremely" on subjective scales. Usually, these are provided at the end-points of a scale. However, with closed-range quantifiers usually all labels are fixed reference points (e.g. "from 1 to 2 h"), and in bipolar scales the midpoint alternative is also one. The use of fixed reference labels makes the scale the same and comparable for all respondents (Saris and De Rooij 1988). Moreover, it has been shown to improve measurement quality (Revilla and Ochoa 2015; Saris and Gallhofer 2007), and when fixed reference points are not provided, respondents use different scales (Saris and De Rooij 1988).

Order of verbal labels
The ordering of verbal labels can be from negative (or passive)-to-positive (or active) or from positive-to-negative. The order of the verbal labels is an important characteristic since it provides an additional source of information to the respondents (Christian et al. 2007a). Moreover, scales ordered from positive-to-negative tend to produce quicker responses, which increases the chance that respondents do not process all options consciously (Kunz 2015). Studies find that the order does impact measurement error and response style bias (Christian et al. 2007a, 2009; Krebs and Hoffmeyer-Zlotnik 2010; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997).

Nonverbal labels
Nonverbal labels are numbers, letters or symbols, instead of words, attached to the options in the scale. The most commonly used are numbers and symbols, e.g. radio and checkbox buttons. Krosnick and Fabrigar (1997) suggest combining numerical and verbal labels. Similarly, others suggest that numbers may help respondents to decide whether the scale is supposed to be unipolar or bipolar (Schwarz et al. 1991; Tourangeau et al. 2007). However, respondents may take longer to submit an answer when numerical labels are provided since they are an additional source of information to process (Christian et al. 2009). Regarding the effect on data quality, Moors et al. (2014) show that scales without numbers and with only verbal end-labels evoked more extreme responses than those with numbers, while Christian et al. (2009) and Tourangeau et al. (2000) conclude that response style is unaffected by the use or not of numbers in the scale. Thus, slightly more evidence points toward the choice of nonverbal labels not affecting data quality.

Order of numerical labels
The order of numerical labels can be from low-to-high or from high-to-low. Of the few studies found about its impact on response style, two conclude that, when negative numerical labels are provided compared to when all numbers are positive, the differences in the response distributions are significant (Schwarz et al. 1991; Tourangeau et al. 2007), while Reips (2002) concludes that it does not influence the answering behaviour of participants.
Since there is no classification, I propose the following distinction to account for the different choices in surveys: numerical labels ordered from negative-to-positive, from positive-to-negative, from 0-to-positive, from 0-to-negative, from positive-to-0, from negative-to-0, from 1 (or higher)-to-positive or from positive-to-1 (or higher).
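The eight orderings listed above can be derived from the first and last numerical labels of a scale. The Python sketch below is one possible reading of that distinction, offered as an illustration only: the function name `numerical_order` and the decision to raise an error for orderings the list does not cover (for example, scales with only negative numbers) are assumptions made for the example.

```python
def numerical_order(first: float, last: float) -> str:
    """Classify the order of numerical labels from the first and last values.

    first, last: the numbers attached to the first and last scale points,
    in the order in which they are presented to the respondent.
    """
    if first == last:
        raise ValueError("first and last labels must differ")
    if first < 0 < last:
        return "negative-to-positive"
    if last < 0 < first:
        return "positive-to-negative"
    if first == 0:
        return "0-to-positive" if last > 0 else "0-to-negative"
    if last == 0:
        return "positive-to-0" if first > 0 else "negative-to-0"
    if first >= 1 and last > first:
        return "1 (or higher)-to-positive"
    if last >= 1 and first > last:
        return "positive-to-1 (or higher)"
    # Assumption: anything else falls outside the proposed list.
    raise ValueError("ordering not covered by the proposed distinction")


for ends in [(-5, 5), (0, 10), (7, 1)]:
    print(ends, numerical_order(*ends))
```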

Correspondence between numerical and verbal labels
The order of numerical labels is of special relevance when these are combined with verbal labels. Correspondence between numerical and verbal labels refers to the extent to which the order of numerical labels matches the order of verbal labels. Numerical labels should reinforce the meaning and the polarity of verbal labels (Krosnick 1999; Krosnick and Fabrigar 1997; O'Muircheartaigh et al. 1995; Schaeffer 1991; Schwarz et al. 1991). However, it should be considered that a more negative connotation is given to the label related to a negative number (Amoo and Friedman 2001; Schwarz and Hippler 1995). Following Saris and Gallhofer (2007), the level of correspondence is classified into: high correspondence, which refers to combinations of numerical and verbal labels that match perfectly, e.g. a bipolar scale where numbers are ordered from -5 to +5 and verbal labels range from "Extremely bad" to "Extremely good", or a unipolar scale where numbers range from 0 to 10 and labels from "Not at all" to "Completely"; low correspondence, which refers to combinations where the lower numbers are related to positive verbal labels or vice versa, e.g. a scale numbered from 0 to 10 and labelled from "Good" to "Bad"; and medium correspondence, which refers to any other combination where the order of the numerical labels matches the order of the verbal labels (negative/low and positive/high) but not perfectly. Among the limited empirical evidence found, only one study concludes that low correspondence does not impact the distribution of responses (Christian et al. 2007a), while two conclude that reliability improves with high correspondence between the verbal and the numerical labels in the scale (Rammstedt and Krebs 2007; Saris and Gallhofer 2007), i.e. there is an impact.
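The three levels of correspondence can be sketched as a simple decision rule. The Python fragment below is an interpretation, not the coding scheme of Saris and Gallhofer (2007): the function name `correspondence`, its parameters, and in particular the rule that a "perfect" match means symmetric negative-to-positive numbers for a bipolar scale or a zero starting point for a unipolar scale are assumptions derived from the examples given in the text.

```python
def correspondence(polarity: str, verbal_ascending: bool,
                   first: float, last: float) -> str:
    """Classify the correspondence between numerical and verbal labels.

    polarity: "bipolar" or "unipolar" (polarity of the verbal labels).
    verbal_ascending: True if verbal labels run from negative/low to
        positive/high, False if they run from positive/high to negative/low.
    first, last: numbers attached to the first and last scale points.
    """
    if polarity not in ("bipolar", "unipolar"):
        raise ValueError(f"unknown polarity: {polarity!r}")
    if first == last:
        raise ValueError("first and last numbers must differ")
    numeric_ascending = last > first
    # Low numbers paired with positive labels, or vice versa.
    if numeric_ascending != verbal_ascending:
        return "low"
    low, high = min(first, last), max(first, last)
    # Assumed meaning of a perfect match, based on the examples in the text.
    if polarity == "bipolar" and low < 0 < high and low == -high:
        return "high"
    if polarity == "unipolar" and low == 0:
        return "high"
    # Same order, but the numbers do not mirror the polarity of the labels.
    return "medium"


# "Good" to "Bad" numbered from 0 to 10: the example of low correspondence.
print(correspondence("bipolar", False, 0, 10))
```

Under these assumptions, a bipolar scale labelled from "Extremely bad" to "Extremely good" and numbered from 0 to 10 would be coded as medium, since the order matches but the numbers do not mirror the two poles.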

Scales' symmetry
Symmetry is a specific characteristic of bipolar scales. Symmetric scales ensure that the number of labels in bipolar scales is the same on the positive and on the negative side. Asymmetric scales assume previous knowledge about the population; otherwise they would be biased (Saris and Gallhofer 2014). However, the impact on measurement error is not clear: while Scherpenzeel and Saris (1997), for symmetric scales, find no effect (or very little) on reliability and validity, Saris and Gallhofer (2007) find a positive effect.

Neutral alternative
The neutral alternative is also a characteristic of bipolar scales; with it, respondents are not forced to make a choice in a specific direction. Neutral alternatives can be provided implicitly or explicitly. Explicit neutral alternatives are usually labelled, e.g. ''neither A nor B'', while implicit neutral alternatives do not need to be labelled for their neutral connotation to be understood, e.g. in a bipolar scale with an uneven number of points, the midpoint will be considered neutral even if it is not labelled. Some argue that providing a neutral alternative can increase the risk of survey satisficing (Bishop 1987; Kulas and Stachowski 2009). Others argue that not providing a neutral point forces respondents to select an option which does not reflect their true attitudinal position (Saris and Gallhofer 2014; Sturgis et al. 2014). Finally, Tourangeau et al. (2004) argue that respondents can interpret the neutral point in a scale as the most typical and use it to make relative judgements. Regarding the impact on response styles, studies find that including a neutral point increases acquiescence and lowers the propensity towards extreme responding (Schuman and Presser 1981; Weijters et al. 2010). In terms of its impact on measurement quality, most evidence suggests that providing a neutral alternative has an impact (Alwin and Krosnick 1991; Malhotra et al. 2009; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997). Only Andrews (1984) finds that the effect was very small.

''Don't know'' option
The ''Don't know'' (or ''No opinion'') option is a non-substantive response alternative. It can also be implicit or explicit. An implicit ''don't know'' option is an admissible answer not explicitly provided to the respondent, which requires an interviewer to record it. An explicit ''don't know'' option is directly provided to the respondent as a separate response alternative. Whether to provide an explicit ''don't know'' option depends on whether researchers believe that respondents truly have no opinion on the issue in question (Dolnicar 2013; Kunz 2015). However, many authors argue that providing the ''don't know'' option leads to incomplete, less valid and less informative data (Alwin and Krosnick 1991; Gilljam and Granberg 1993; Krosnick et al. 2002, 2005; Saris and Gallhofer 2014). Whether providing a ''don't know'' option explicitly or implicitly impacts data quality is not clear: some authors show that providing it explicitly impacts data quality (Andrews 1984; De Leeuw et al. 2016; McClendon 1991; Rodgers et al. 1992), while others conclude that there is no support for this impact (Alwin 2007; McClendon and Alwin 1993; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997).

Types of visual response requirement
The type of visual presentation demands more or less effort from the respondent when responding. The following types of visual response requirements are distinguished in the literature: (1) point-selection is the most standard way to present scales: either a continuous line or categorical options are provided, from which the respondent should point and select the desired choice; (2) a slider is a type of linear implementation in which the respondent should move a marker to give a rating; (3) text-box input is a typing space where respondents can type in their answer; (4) a drop-down menu shows the list of response options after clicking on the rectangular box, i.e. before clicking the respondent does not see the whole list of options, and sometimes respondents have to scroll down to select the desired option; and (5) drag-and-drop refers to the technique where respondents need to drag an element (e.g. the item or the response) to the desired position.
Comparing point-selection to sliders, the former are less demanding but also less fun and engaging (Funke et al. 2011; Roster et al. 2015). In this line, Cook et al. (2001) and Roster et al. (2015) compare sliders with radio buttons and find non-significant differences in reliability and item-nonresponse, respectively. The box format is closer to how questions are asked on the telephone, and does not provide a clear sense of the range of the options (Buskirk et al. 2015; Christian et al. 2009). Comparing the use of text-box input with the use of point-selection or sliders, some demonstrate that item-nonresponse and response style are comparable across the three types (Christian et al. 2007b), while others show that item-nonresponse and response style differ between the three (Buskirk et al. 2015; Christian et al. 2009; Couper et al. 2006). Christian et al. (2007b) argue that drop-down menus are more cumbersome than text-box input when a large number of options is listed. In this line, other authors argue that drop-down menus are more burdensome to respondents because they require an added effort to click and scroll (Dillman and Bowker 2001; De Leeuw et al. 2008; Reips 2002). Liu and Conrad (2016) compare drop-down menus with sliders or text-box input and find that item-nonresponse was not significantly different. Similarly, when drop-down menus are compared to point-selection, comparable results in terms of response style and item-nonresponse are found (Reips 2002). Finally, drag-and-drop provides higher item-nonresponse compared to point-selection, and it is argued to prevent systematic response tendencies since respondents need more time to process the task they are required to do (Kunz 2015).
Overall, the evidence provided by these studies suggests that the type of visual response requirement has no impact on data quality.

Sliders' marker position
Slider marker position is a specific characteristic of sliders. Markers can be placed at the top or left side, at the bottom or right side, in the middle, or outside of the slider. A challenge when designing a slider is how to handle the starting position of the marker and identify non-respondents (Funke 2016). The impact of this characteristic on measurement error is not yet clear, since only one study looks at its effect on data quality; it finds that higher nonresponse and higher response style bias occurred when the marker was placed in the middle or at the right side of the slider compared to when it was placed at the left side (Buskirk et al. 2015).

Scales' illustrative format
Sometimes scales are presented using an illustrative format instead of the traditional format. Usual illustrative formats are ladders (or pyramids), to indicate levels of some aspect, and thermometers, to indicate degrees of feelings. Other illustrative formats can be clocks, to indicate the timing of things, or dials, to enter numerical values. These types of scales usually require lengthy introductions and not all points can be labelled, but they are useful to visually provide numerical scales with many points (Alwin 2007; Krosnick and Presser 2010; Sudman and Bradburn 1983). The few studies available suggest that this characteristic has an impact on data quality: thermometer scales provide lower measurement quality than ladders or radio button scales (Andrews and Withey 1976; Krosnick 1991), ladder scales provide better measurement quality than traditional scales (Levin and Currie 2014) but lower validity compared to other illustrative formats (Andrews and Crandall 1975), and responses are significantly different depending on whether a pyramid or an onion format is used (Schwarz et al. 1998).

Scales' layout display
The layout display of the answer options can be horizontal, vertical or nonlinear. Nonlinear scales can provide, for instance, the answer options in different columns. Tourangeau et al. (2004, p. 372) argue that, in vertically oriented scales, respondents usually expect the positive points to appear first at the top. However, Toepoel et al. (2009, p. 522) argue that respondents read more naturally in a horizontal format. Two studies looked at the effect of scales' layout display on response styles, and both find that presenting the scales in a horizontal, vertical or nonlinear layout produced significant differences in the responses (Christian et al. 2009; Toepoel et al. 2009), i.e. it has an impact.

Overlap between verbal and numerical labels
Overlap between labels is a characteristic considered by Saris and Gallhofer (2014) for which no evidence of relevance has been found while reviewing the literature. This characteristic indicates whether the verbal labels used in a horizontal scale are clearly connected to one nonverbal label or whether they overlap with several of them. More research is needed on this characteristic to assess whether or not it is relevant to consider when designing visually presented scales.

Labels' visual separation
Labels can be visually separated by adding more space between them, adding separating lines or placing the options in boxes. The aim is to provide a visual distinction between the labels in the scale. For instance, researchers may be interested in visually separating the ''don't know'' option from the substantive responses to make a clear differentiation. However, Christian et al. (2009) and Tourangeau et al. (2004) argue that visually separating some of the labels may encourage respondents to select them more often. The impact on data quality is clear: De Leeuw et al. (2016) show that separating the non-substantive option reduces item-nonresponse and provides higher reliability, while Christian et al. (2009) and Tourangeau et al. (2004) show that separating the non-substantive option leads to significant differences in the responses, whereas this does not happen when the midpoint is separated.
The current distinction in Saris and Gallhofer (2014) is whether the labels are separated within different boxes or not. However, given that I found more choices in the literature, I propose to distinguish between visually separating the non-substantive option, the neutral option, the end-points, all points or none of the points in the scale.

Labels' illustrative images
Illustrative nonverbal labels can be used instead of or in combination with verbal and numerical labels when they are provided visually to the respondent. The usual illustrative labels are feeling faces (also called smileys), which attach images of different facial expressions (e.g. from sad to happy). They are easy to format and they attract the attention of the respondents (Emde and Fuchs 2013). Moreover, they have the advantage of being easier for respondents to identify than verbal labels because they eliminate the barrier of mapping feelings into words (Kunin 1998). The evidence on their effect on data quality indicates that there is no impact: while Derham (2011) shows that nonresponse is significantly higher in faces scales compared to sliders and point-selection scales, Andrews and Crandall (1975) and Emde and Fuchs (2013) show that the differences in the responses between smiley scales and radio buttons are non-significant.
For the sake of completeness and to capture the different formats found in the literature, I propose to distinguish two other types of labels' illustrative images: other human symbols, like thumbs and manikins, and other nonhuman symbols, like stars or hearts.

Conclusions
This paper provides a complete and updated classification of the characteristics, and their possible design choices, considered in the literature when designing forced-choice, closed and ordinal response scales. This classification has been summarized in Table 1 together with the main conclusion of the literature review, which indicates whether evidence of each characteristic's impact on data quality has been shown in the literature.
Three main limitations of this study should be kept in mind. First, to assess whether or not there is an impact on data quality, I did not consider the different sample sizes or the power of the studies; I considered the absolute number of studies. Further research could assign weights to the different studies. Second, it is likely that publication bias in favour of studies which found an effect of a certain characteristic is present, i.e. the number of characteristics which have an impact may be overestimated. Third, I did not aim to provide information to improve the design of response scales. Thus, the results on the impact are provided regardless of whether the effect is positive or negative.
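The first limitation refers to counting studies without weighting them. As an illustration only, the following minimal sketch shows one possible unweighted vote count. The decision rule (simple majority, with ties and fewer than two studies treated as inconclusive), the function name and the threshold are hypothetical assumptions; the exact rule applied in this review is not stated as an algorithm.

```python
def vote_count(studies_with_impact, studies_without_impact, min_studies=2):
    """Summarize unweighted evidence on one scale characteristic.

    studies_with_impact: number of studies reporting an impact on data quality.
    studies_without_impact: number of studies reporting no impact.
    min_studies: hypothetical minimum number of studies needed for a conclusion.
    Sample size and statistical power are deliberately ignored, which is
    the limitation discussed in the text.
    """
    if studies_with_impact < 0 or studies_without_impact < 0:
        raise ValueError("study counts cannot be negative")

    total = studies_with_impact + studies_without_impact
    if total < min_studies:
        return "further research needed"
    if studies_with_impact > studies_without_impact:
        return "impact"
    if studies_without_impact > studies_with_impact:
        return "no impact"
    # Equal numbers of studies on each side: treated as inconclusive.
    return "further research needed"


if __name__ == "__main__":
    # Two studies reporting an impact against one reporting none.
    print(vote_count(2, 1))
    # Four studies on each side.
    print(vote_count(4, 4))
```

A weighted variant, as suggested for further research, would replace the raw counts with sums of study-level weights (e.g. based on sample size).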
From Table 1 the following main conclusions can be extracted: (1) Eleven characteristics have an impact on data quality: the scales' evaluative dimension, the type of scale, the length of the scales, the use of verbal labels, the use of fixed reference points, the order of numerical labels, the correspondence between numerical and verbal labels, the use of a neutral alternative, the scales' illustrative format, the visual layout display of the scales, and the labels' visual separation. (2) Four characteristics do not have an impact on data quality: the order of the verbal labels, the use of nonverbal labels, the type of visual response requirement, and the labels' illustrative images. (3) Further research is needed to know whether or not eight characteristics have an impact on data quality: the scales' polarity, the agreement between the concept and the scale's polarity, the information provided by verbal labels, the quantifier labels, the scales' symmetry, the use of a ''don't know'' option, the slider marker position, and the overlap between verbal and numerical labels.
What is clear from the large body of research presented here and its often mixed results is that characteristics interact with each other, e.g. scales with more points are usually partially labelled. Thus, researchers should account for the effects driven by the overall design of the survey question when deciding upon a characteristic. This is in line with what Cox III (1980, p. 418) already concluded for the optimal number of categories: ''there is no single number of response alternatives for a scale which is appropriate under all circumstances''.
The results presented in this paper provide, on the one hand, a source for researchers who want a complete list of characteristics and their possible design choices for closed and ordinal scales and, on the other hand, a detailed summary of the literature that refers to the impact of each characteristic on data quality.
Finally, further research should provide the same summary for other characteristics related to the design of survey questions, such as the design of the request for an answer or the overall visual presentation of the survey question.
Acknowledgements I would also like to show my gratitude to Melanie Revilla, Wiebke Weber and Willem E. Saris for their fruitful comments and feedback on an earlier version of the manuscript, although any errors are my own and should not tarnish the reputations of these esteemed persons.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix
See Tables 2 and 3.

Characteristics of the response scales' conceptualization

Table 3 continued. Columns: Characteristics; Design choices; Theoretical arguments; Empirical evidence on data quality

Scales' evaluative dimension. Design choices: Agree-disagree (AD); Item-specific (IS)
(Brown 2004): AD scales are clearer to interpret than vague or closed-range quantifier scales
(Krosnick 1999): people simply choose to agree because it seems like the commanded and polite action to take
(Krosnick et al. 2005): to eliminate acquiescence avoid AD scales
(Kunz 2015): AD scales are more difficult to understand and map the appropriate judgement
(Saris et al. 2010): AD more acquiescence because of its usual presentation in batteries
(Schaeffer and Presser 2003): AD simpler to conduct
(Alwin 2007 NO

(Schwarz et al. 1985): closed-range informs the respondent about the researcher expectations and adds systematic bias in respondent's reports and related judgements compared to absolute open-ended formats
(Sudman and Bradburn 1983): better use open quantifiers than closed quantifiers for numerical answers to avoid misleading the respondent
(Tourangeau et al. 2000): round answers in open-ended quantifiers may be a signal of the unwillingness to come up with a more exact answer and introduce systematic bias, in continuous scales
(Lundmark et al. 2016

(Alwin 2007): the optimal number of points in a scale should be taken into consideration in relation to the polarity of the scale
(Cox III 1980): there is no single number of response alternatives for a scale which is appropriate under all circumstances
(Krosnick and Fabrigar 1997): optimal is a complex decision: too few categories may compromise the information gathered, too long compromises the clarity of meaning
(Reips and Funke 2008): optimal length of continuous scales depends on the size of the device screen
(Aiken 1983 YES
(Alwin and Krosnick 1991): no differences between AD with 2 and 5p, IS reliability increases from 3 to 9p, but no differences between 7 to 9p [Proportion of variance attributed to true attitudes] ?
(Schaeffer and Presser 2003): more categories compromise discrimination and limit the capacity of respondents to make finer distinctions between the options
(Andrews 1984

Design choices: From negative-to-positive (N-P); From positive-to-negative (P-N)
(Christian et al. 2007b): responses vary depending on the order since it provides an additional source of information
(Kunz 2015): P-N scales may tempt respondents to rush through a set of items at a faster pace
(Christian et al. 2007b

(Christian et al. 2009): adding numbers provides an additional source of information to process by the respondents before submitting an answer
(Krosnick and Fabrigar 1997): numeric labels more precise and easier but have no inherent meaning
(Tourangeau et al. 2007): numbers help respondents to decide whether the scale is supposed to be unipolar or bipolar
(Schwarz et al. 1991): use numeric labels to disambiguate the meaning of scale verbal labels. 0 to 10 numbers suggest the absence or presence of an attribute, while -5 to 5 suggest that the absence corresponds to 0 whereas the negative values refer to the presence of its opposite
(Christian et al. 2009

(Dillman and Bowker 2001): respondents are more frustrated with drop-down menus as it requires a two-step process
(Funke et al. 2011): more demanding, requires more hand-eye coordination than point-selection and provides problems to identify non-substantive responses
(Kunz 2015): drag and drop may prevent systematic response tendencies since respondents need to spend more time
(Reips 2002): hand movement is longer than for other types of scales
(Roster et al. 2015): sliders are more fun and engaging and produce better data than point-selection scales
(Buskirk et al. 2015): differences on selecting the lowest, middle or highest options and in missing data between sliders, radio button scales and box format [Satisficing bias and Item-nonresponse] ?