Methods for Estimating Item-Score Reliability

Reliability is usually estimated for a test score, but it can also be estimated for item scores. Item-score reliability can be useful for assessing the item's contribution to the test score's reliability, for identifying unreliable scores in aberrant item-score patterns in person-fit analysis, and for selecting the most reliable item from a test to use as a single-item measure. Four methods were discussed for estimating item-score reliability: the Molenaar–Sijtsma method (method MS), Guttman's method λ6, the latent class reliability coefficient (method LCRC), and the correction for attenuation (method CA). A simulation study was used to compare the methods with respect to median bias, variability (interquartile range [IQR]), and percentage of outliers. The simulation study consisted of six conditions: standard, polytomous items, unequal α parameters, two-dimensional data, long test, and small sample size. Methods MS and CA were the most accurate. Method LCRC showed almost unbiased results, but large variability. Method λ6 consistently underestimated item-score reliability, but showed a smaller IQR than the other methods.


Introduction
Reliability of measurement is often considered for test scores, but some authors have argued that it may be useful to also consider the reliability of individual items (Ginns & Barrie, 2004; Meijer, Sijtsma, & Molenaar, 1995; Wanous & Reichers, 1996; Wanous, Reichers, & Hudy, 1997). Just as test-score reliability expresses the repeatability of test scores in a group of people keeping administration conditions equal (Lord & Novick, 1968, p. 65), item-score reliability expresses the repeatability of an item score. Items having low reliability are candidates for removal from the test. Item-score reliability may be useful in person-fit analysis to identify item scores that contain too little reliable information to explain [...] values of the item-score reliability methods, to establish the relationship between item-score reliability and the other four item indices.
This article is organized as follows. First, a framework for estimating item-score reliability and three of the item-score reliability methods in the context of this framework are discussed. Second, a simulation study, its results with respect to the methods' median bias, IQR, and percentage of outliers, and a real-data example are discussed. Methods to use in practical data analysis are recommended.

A Framework for Item-Score Reliability
The following classical test theory (CTT) definitions (Lord & Novick, 1968, p. 61) were used. Let $X$ be the test score, which is defined as the sum of $J$ item scores, indexed $i$ ($i = 1, \ldots, J$); that is, $X = \sum_{i=1}^{J} X_i$. In the population, test score $X$ has variance $\sigma_X^2$. True score $T$ is the expectation of an individual's test score across independent repetitions, and represents the mean of the individual's propensity distribution (Lord & Novick, 1968, pp. 29-30). The deviation of test score $X$ from true score $T$ is the random measurement error, $E$; that is, $E = X - T$. Because $T$ and $E$ are unobservable, their variances are also unobservable. Using these definitions, test-score reliability is defined as the proportion of observed-score variance that is true-score variance or, equivalently, one minus the proportion of observed-score variance that is error variance. Mathematically, reliability also equals the product-moment correlation between parallel tests (Lord & Novick, 1968, p. 61), denoted by $\rho_{XX'}$; that is,

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}. \tag{1}$$

Next to notation $i$, we need $j$ to index items. Notation $x$ and $y$ denote realizations of item scores, and without loss of generality, it is assumed that $x, y = 0, 1, \ldots, m$. Let $\pi_{x(i)} = P(X_i \ge x)$ be the marginal cumulative probability of obtaining at least score $x$ on item $i$. It may be noted that $\pi_{0(i)} = 1$ by definition. Likewise, let $\pi_{x(i),y(j)} = P(X_i \ge x, X_j \ge y)$ be the joint cumulative probability of obtaining at least score $x$ on item $i$ and at least score $y$ on item $j$.
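As an illustration of the two building blocks defined above, the marginal and joint cumulative probabilities can be estimated from sample data by simple counting. The following is a minimal sketch; the function names and the toy scores are hypothetical and are not taken from the study.

```python
def marginal_cumulative(scores, m):
    """Return [P(X >= 1), ..., P(X >= m)] for one item with scores 0..m."""
    n = len(scores)
    return [sum(s >= x for s in scores) / n for x in range(1, m + 1)]


def joint_cumulative(scores_i, scores_j, m):
    """Return an m-by-m table; entry [x-1][y-1] estimates P(X_i >= x, X_j >= y)."""
    n = len(scores_i)
    return [[sum(a >= x and b >= y for a, b in zip(scores_i, scores_j)) / n
             for y in range(1, m + 1)]
            for x in range(1, m + 1)]


# Hypothetical toy data: two items scored 0, 1, 2 by five respondents.
item_1 = [0, 1, 2, 2, 1]
item_2 = [0, 0, 1, 2, 2]
pi_1 = marginal_cumulative(item_1, m=2)
pi_12 = joint_cumulative(item_1, item_2, m=2)
```

The probability $\pi_{0(i)} = 1$ is omitted from both tables because it is uninformative.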
In what follows, it is assumed that index $i'$ indicates an independent repetition of item $i$. Let $\pi_{x(i),y(i')}$ denote the joint cumulative probability of obtaining at least score $x$ and at least score $y$ on two independent repetitions, denoted by $i$ and $i'$, of the same item in the same group of people. Because independent repetitions are unavailable in practice, the joint cumulative probabilities $\pi_{x(i),y(i')}$ have to be estimated from single-administration data. Molenaar and Sijtsma (1988) showed that reliability (Equation 1) can be written as

$$\rho_{XX'} = \frac{\sum_{i=1}^{J}\sum_{j=1}^{J}\sum_{x=1}^{m}\sum_{y=1}^{m}\left[\pi_{x(i),y(j)} - \pi_{x(i)}\pi_{y(j)}\right]}{\sigma_X^2}, \tag{2}$$

where for $j = i$ the joint probability $\pi_{x(i),y(j)}$ denotes $\pi_{x(i),y(i')}$. Equation 2 can be decomposed into the sum of two ratios:

$$\rho_{XX'} = \frac{\sum\sum_{i \ne j}\sum_{x}\sum_{y}\left[\pi_{x(i),y(j)} - \pi_{x(i)}\pi_{y(j)}\right]}{\sigma_X^2} + \frac{\sum_{i}\sum_{x}\sum_{y}\left[\pi_{x(i),y(i')} - \pi_{x(i)}\pi_{y(i)}\right]}{\sigma_X^2}. \tag{3}$$

Except for the joint cumulative probabilities pertaining to the same item, $\pi_{x(i),y(i')}$, all other terms in Equation 3 are observable and can be estimated from the sample. Van der Ark et al. (2011) showed that for test score $X$, the single-administration reliability methods α, λ2, MS, and LCRC only differ with respect to the estimation of $\pi_{x(i),y(i')}$. To define item-score reliability, Equation 3 can be adapted to accommodate only one item; the first ratio and the first summation sign in the second ratio disappear, and item-score reliability $\rho_{ii'}$ is defined as

$$\rho_{ii'} = \frac{\sum_{x}\sum_{y}\left[\pi_{x(i),y(i')} - \pi_{x(i)}\pi_{y(i)}\right]}{\sigma_{X_i}^2} = \frac{\sigma_{T_i}^2}{\sigma_{X_i}^2}. \tag{4}$$

Methods for Approximating Item-Score Reliability
Three of the four methods that were investigated, methods MS, λ6, and LCRC, use different approximations to the unobservable joint cumulative probability $\pi_{x(i),y(i')}$, and fit into the same reliability framework. Two other well-known methods that fit into this framework, Cronbach's α and Guttman's λ2, cannot be used to estimate item-score reliability (see Appendix). The fourth method, CA, uses a different approach to estimating item-score reliability and conceptually stands apart from the other three methods. All four methods approximate Equation 4, which, in addition to $\rho_{ii'}$, contains two unknowns, bivariate proportion $\pi_{x(i),y(i')}$ (middle) and variance $\sigma_{T_i}^2$ (right), and thus cannot be estimated directly from the data.
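To make the structure of Equation 4 concrete, the sketch below computes the ratio $\sum_x\sum_y[\pi_{x(i),y(i')} - \pi_{x(i)}\pi_{y(i)}] / \sigma^2_{X_i}$ for one item, given some approximation of the joint cumulative probabilities of two repetitions. The function names are hypothetical. The item variance is obtained from the marginals through the identity $\sigma^2_{X_i} = \sum_x\sum_y[\pi_{\max(x,y)(i)} - \pi_{x(i)}\pi_{y(i)}]$, which holds because an item score is the sum of the indicators of $X_i \ge x$.

```python
def item_variance(marg):
    """Item-score variance from marg = [P(X >= 1), ..., P(X >= m)]."""
    m = len(marg)
    return sum(marg[max(x, y)] - marg[x] * marg[y]
               for x in range(m) for y in range(m))


def item_score_reliability(joint_rep, marg):
    """Ratio in Equation 4; joint_rep[x][y] approximates P(X_i >= x+1, X_i' >= y+1)."""
    m = len(marg)
    numerator = sum(joint_rep[x][y] - marg[x] * marg[y]
                    for x in range(m) for y in range(m))
    return numerator / item_variance(marg)


marg = [0.8, 0.4]  # hypothetical marginal cumulative probabilities
# Two limiting cases: perfectly repeatable scores and independent repetitions.
perfect = [[marg[max(x, y)] for y in range(2)] for x in range(2)]
independent = [[marg[x] * marg[y] for y in range(2)] for x in range(2)]
rel_perfect = item_score_reliability(perfect, marg)
rel_independent = item_score_reliability(independent, marg)
```

The two limiting cases bracket the values that any of the approximation methods can produce when the approximated joint probabilities are admissible.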

Method MS
Method MS uses the available marginal cumulative probabilities to approximate $\pi_{x(i),y(i')}$. The method is based on the item response model known as the double monotonicity model (Mokken, 1971; Sijtsma & Molenaar, 2002). This model is based on the assumptions of a unidimensional latent variable; independent item scores conditional on the latent variable, which is known as local independence; response functions that are monotone nondecreasing in the latent variable; and nonintersection of the response functions of different items. The double monotonicity model implies that the observable bivariate proportions $\pi_{x(i),y(j)}$ collected in the P(++) matrix are nondecreasing in the rows and the columns (Sijtsma & Molenaar, 2002, pp. 104-105). An artificial example illustrates the structure of the P(++) matrix. For four items, each having three ordered item scores, Table 1 shows the marginal cumulative probabilities. First, ignoring the uninformative $\pi_{0(i)} = 1$, the authors assume that probabilities can be strictly ordered, and order the eight remaining marginal cumulative probabilities in this example from small to large:

$$\pi_{2(2)} < \pi_{2(1)} < \pi_{2(4)} < \pi_{2(3)} < \pi_{1(4)} < \pi_{1(3)} < \pi_{1(2)} < \pi_{1(1)}. \tag{5}$$

Van der Ark (2010) discussed the case in which Equation 5 contains ties. Second, the P(++) matrix is defined, which has order $Jm \times Jm$ and contains the joint cumulative probabilities. The rows and columns are ordered reflecting the ordering of the marginal cumulative probabilities, which are arranged from small to large along the margins of the matrix; see Table 2. The ordering of the marginal cumulative probabilities determines where each of the joint cumulative probabilities is located in the matrix. For example, the entry in cell (4,7) is $\pi_{2(3),1(2)}$, which equals .81. Mokken (1971, pp. 132-133) proved that the double monotonicity model implies that the rows and the columns in the P(++) matrix are nondecreasing. This is the property on which method MS rests.
In Table 2, entry NA (i.e., not available) refers to the joint cumulative probabilities of the same item, which are unobservable. For example, in cell (5,3), the proportion $\pi_{1(4),2(4')}$ is NA and hence cannot be estimated numerically. Method MS uses the adjacent, observable joint cumulative probabilities of different items to estimate the unobservable joint cumulative probabilities $\pi_{x(i),y(i')}$ by means of eight approximation methods (Molenaar & Sijtsma, 1988). For test scores, Molenaar and Sijtsma (1988) explained that method MS attempts to approximate the item response functions of an item and for this purpose uses adjacent items, because when item response functions do not intersect, adjacent functions are more similar to the target item response function, thus approximating repetitions of the same item, than item response functions further away. When an adjacent probability is unavailable, for example, in the first and last rows and the first and last columns in Table 2, only the available estimators are used. For example, $\pi_{1(1),2(1')}$ in cell (8,2) does not have lower neighbors. Hence, only the proportions .32, cell (8,1); .51, cell (7,2); and .70, cell (8,3) are available for approximating $\pi_{1(1),2(1')}$. For further details, see Molenaar and Sijtsma (1988) and Van der Ark (2010).
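The layout of the P(++) matrix described above can be sketched as follows: item steps are ordered by their marginal cumulative proportions, and cells that pertain to two steps of the same item are marked as unavailable. This is only an illustration of the matrix structure, not of the eight approximation methods of method MS; the function name, the zero-based item indices, and the toy data are assumptions.

```python
def ppp_matrix(data, m):
    """data: one list of J item scores (0..m) per respondent.
    Returns the ordered item steps, their marginals, and the P(++) matrix."""
    n, n_items = len(data), len(data[0])
    steps = [(i, x) for i in range(n_items) for x in range(1, m + 1)]
    marg = {(i, x): sum(row[i] >= x for row in data) / n for i, x in steps}
    # Order item steps from small to large; ties keep their input order.
    order = sorted(steps, key=lambda s: marg[s])
    matrix = []
    for i, x in order:
        row = []
        for j, y in order:
            if i == j:
                row.append(None)  # same item: unobservable (NA)
            else:
                row.append(sum(r[i] >= x and r[j] >= y for r in data) / n)
        matrix.append(row)
    return order, [marg[s] for s in order], matrix


# Hypothetical toy data: four respondents, three dichotomous items.
toy = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 0]]
order, marginals, p_matrix = ppp_matrix(toy, m=1)
```

In this toy example, the steps are ordered as item 2, item 1, item 0 (zero-based), and the diagonal of the matrix is unavailable because each item has only one step.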
Hence, following Molenaar and Sijtsma (1988), the joint cumulative probability $\pi_{x(i),y(i')}$ is approximated by the mean of at most eight approximations, resulting in $\tilde{\pi}^{MS}_{x(i),y(i')}$. When the double monotonicity model does not hold, item response functions adjacent to the target item response function may intersect and not approximate the target very well, so that $\tilde{\pi}^{MS}_{x(i),y(i')}$ may be a poor approximation of $\pi_{x(i),y(i')}$. The approximation of $\pi_{x(i),y(i')}$ by method MS is used in Equation 4 to estimate the item-score reliability.
Method MS is equal to item-score reliability $\rho_{ii'}$ when $\sum_x\sum_y \tilde{\pi}^{MS}_{x(i),y(i')} = \sum_x\sum_y \pi_{x(i),y(i')}$. A sufficient condition is that all the entries in the P(++) matrix are equal; equality of entries requires item response functions that coincide. Further study of this topic is beyond the scope of this article but should be taken up in future research.

Table 2. P(++) Matrix With Joint Cumulative Probabilities $\pi_{x(i),y(j)}$ and Marginal Cumulative Probabilities $\pi_{x(i)}$.

Method λ6
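An item-score reliability method based on Guttman's λ6 (Guttman, 1945) can be derived as follows. Let $\epsilon_i^2$ denote the variance of the estimation or residual error of the multiple regression of item score $X_i$ on the remaining $J - 1$ item scores, and determine $\epsilon_i^2$ for each of the $J$ items. Guttman's λ6 is defined as

$$\lambda_6 = 1 - \frac{\sum_{i=1}^{J}\epsilon_i^2}{\sigma_X^2}. \tag{6}$$

It may be noted that Equation 6 resembles the right-hand side of Equation 1. Let $\boldsymbol{\Sigma}_{ii}$ denote the $(J - 1) \times (J - 1)$ inter-item variance-covariance matrix for the $(J - 1)$ items except item $i$. Let $\boldsymbol{\sigma}_i$ be a $(J - 1) \times 1$ vector containing the covariances of item $i$ with the other $(J - 1)$ items. Jackson and Agunwamba (1977) showed that the variance of the estimation error equals

$$\epsilon_i^2 = \sigma_{X_i}^2 - \boldsymbol{\sigma}_i'\boldsymbol{\Sigma}_{ii}^{-1}\boldsymbol{\sigma}_i. \tag{7}$$

When estimating the reliability of an item score, Equation 6 can be adapted to

$$\lambda_{6i} = 1 - \frac{\epsilon_i^2}{\sigma_{X_i}^2} = \frac{\boldsymbol{\sigma}_i'\boldsymbol{\Sigma}_{ii}^{-1}\boldsymbol{\sigma}_i}{\sigma_{X_i}^2}. \tag{8}$$

It can be shown that method λ6 fits into the framework of Equation 4. Let $\tilde{\pi}^{\lambda_6}_{x(i),y(i')}$ be an approximation of $\pi_{x(i),y(i')}$ based on observable proportions, such that replacing $\pi_{x(i),y(i')}$ in the right-hand side of Equation 4 by $\tilde{\pi}^{\lambda_6}_{x(i),y(i')}$ results in $\lambda_{6i}$. Hence,

$$\lambda_{6i} = \frac{\sum_{x}\sum_{y}\left[\tilde{\pi}^{\lambda_6}_{x(i),y(i')} - \pi_{x(i)}\pi_{y(i)}\right]}{\sigma_{X_i}^2}. \tag{9}$$

Equating Equations 8 and 9 shows that

$$\tilde{\pi}^{\lambda_6}_{x(i),y(i')} = \frac{\boldsymbol{\sigma}_i'\boldsymbol{\Sigma}_{ii}^{-1}\boldsymbol{\sigma}_i}{m^2} + \pi_{x(i)}\pi_{y(i)}. \tag{10}$$

Inserting $\tilde{\pi}^{\lambda_6}_{x(i),y(i')}$ in Equation 4 yields method λ6 for item-score reliability. Replacing parameters by sample statistics produces an estimate.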
Preliminary computations suggest that only highly contrived conditions produce the equality $\lambda_{6i} = \rho_{ii'}$, but conditions more representative of what one may find with real data produce negative item true-score variance, also known as Heywood cases. Because this work is preliminary, the authors tentatively conjecture that in practice, method λ6 is a strict lower bound to the item-score reliability, a result that is consistent with simulation results discussed elsewhere (e.g., Oosterwijk, Van der Ark, & Sijtsma, 2017).
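The item version of method λ6 can be illustrated with a short sketch. It relies on the standard regression result that the residual variance of item $i$ regressed on the other items equals the item variance minus $\boldsymbol{\sigma}_i'\boldsymbol{\Sigma}_{ii}^{-1}\boldsymbol{\sigma}_i$, so that the item-level coefficient is the squared multiple correlation of item $i$ with the remaining items. The function names, the small linear solver, and the example covariance matrix are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [list(row) + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def lambda6_item(cov, i):
    """cov: J-by-J covariance matrix as nested lists; i: zero-based item index."""
    others = [j for j in range(len(cov)) if j != i]
    sigma_i = [cov[i][j] for j in others]                      # covariances with item i
    sigma_ii = [[cov[j][k] for k in others] for j in others]   # matrix without item i
    weights = solve(sigma_ii, sigma_i)                         # Sigma_ii^{-1} sigma_i
    explained = sum(w * s for w, s in zip(weights, sigma_i))
    return explained / cov[i][i]


# Hypothetical covariance matrix: unit variances, all covariances .5.
cov = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
lam6_first_item = lambda6_item(cov, 0)
```

In practice, the population covariance matrix would be replaced by the sample covariance matrix of the item scores.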

Method LCRC
Method LCRC is based on the unconstrained latent class model (LCM; Hagenaars & McCutcheon, 2002; Lazarsfeld, 1950; McCutcheon, 1987). The LCM assumes local independence, meaning that item scores are independent given class membership. Two different probabilities are important, which are the latent class probabilities that provide the probability to be in a particular latent class $k$ ($k = 1, \ldots, K$), and the latent response probabilities that provide the probability of a particular item score given class membership. For local independence given a discrete latent variable $\xi$ with $K$ classes, the unconstrained LCM is defined as

$$P(X_1 = x_1, \ldots, X_J = x_J) = \sum_{k=1}^{K} P(\xi = k)\prod_{i=1}^{J} P(X_i = x_i \mid \xi = k). \tag{11}$$

The LCM (Equation 11) decomposes the joint probability distribution of the $J$ item scores into the sum across $K$ latent classes of the product of the probability to be in class $k$ and the conditional probability of a particular item score $X_i$. Let $\tilde{\pi}^{LCRC}_{x(i),y(i')}$ be the approximation of $\pi_{x(i),y(i')}$ using the parameters of the unconstrained LCM at the right-hand side of Equation 11, such that

$$\tilde{\pi}^{LCRC}_{x(i),y(i')} = \sum_{u=x}^{m}\sum_{v=y}^{m}\sum_{k=1}^{K} P(\xi = k)P(X_i = u \mid \xi = k)P(X_i = v \mid \xi = k). \tag{12}$$

Approximation $\tilde{\pi}^{LCRC}_{x(i),y(i')}$ can be inserted in Equation 4 to obtain method LCRC. After insertion of sample statistics, an estimate of method LCRC is obtained. Method LCRC equals $\rho_{ii'}$ if $\sum_x\sum_y \tilde{\pi}^{LCRC}_{x(i),y(i')} = \sum_x\sum_y \pi_{x(i),y(i')}$. A sufficient condition for method LCRC to equal $\rho_{ii'}$ is that $K$ has been correctly selected and all estimated parameters $P(\xi = k)$ and $P(X_i = x \mid \xi = k)$ equal the population parameters. This condition is unlikely to be true in practice. In samples, LCRC may either underestimate or overestimate $\rho_{ii'}$.
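Given the parameters of a latent class model, the approximation of the joint cumulative probabilities of two repetitions of an item follows from local independence: within a class, two repetitions are independent, so the joint cumulative probability is the class-weighted product of the cumulative response probabilities. The sketch below assumes that the class probabilities and response probabilities have already been estimated (e.g., by an EM algorithm); the function name and the parameter values are hypothetical.

```python
def lcrc_joint_cumulative(class_probs, resp_probs):
    """class_probs[k] = P(xi = k); resp_probs[k][u] = P(X_i = u | xi = k), u = 0..m.
    Returns an m-by-m table; entry [x-1][y-1] approximates P(X_i >= x, X_i' >= y)."""
    m = len(resp_probs[0]) - 1
    # Cumulative response probabilities P(X_i >= x | class k) for x = 1..m.
    cum = [[sum(p[x:]) for x in range(1, m + 1)] for p in resp_probs]
    return [[sum(w * c[x] * c[y] for w, c in zip(class_probs, cum))
             for y in range(m)]
            for x in range(m)]


# Hypothetical two-class model for one item with three score categories.
class_probs = [0.5, 0.5]
resp_probs = [[0.6, 0.3, 0.1],   # class 1: mostly low scores
              [0.1, 0.3, 0.6]]   # class 2: mostly high scores
approx = lcrc_joint_cumulative(class_probs, resp_probs)
```

The resulting table can be inserted in the numerator of Equation 4 in place of the unobservable joint cumulative probabilities.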

Method CA
The CA (Lord & Novick, 1968, pp. 69-70; Nunnally & Bernstein, 1994, p. 257; Spearman, 1904) can be used for estimating item-score reliability (Wanous & Reichers, 1996). Let $Y$ be a random variable, which preferably measures the same attribute as item score $X_i$ but does not include $X_i$. Likely candidates for $Y$ are the rest score $R_{(i)} = X - X_i$ or the test score on another, independent test that does not include item score $X_i$ but measures the same attribute. Let $\rho_{T_{X_i}T_Y}$ be the correlation between true scores $T_{X_i}$ and $T_Y$, let $\rho_{X_iY}$ be the correlation between $X_i$ and $Y$, let $\rho_{ii'}$ be the item-score reliability of $X_i$, and let $\rho_{YY'}$ be the reliability of $Y$. Then, method CA equals

$$\rho_{T_{X_i}T_Y} = \frac{\rho_{X_iY}}{\sqrt{\rho_{ii'}\rho_{YY'}}}. \tag{13}$$

It follows from Equation 13 that the item-score reliability equals

$$\rho_{ii'} = \frac{\rho_{X_iY}^2}{\rho_{T_{X_i}T_Y}^2\,\rho_{YY'}}. \tag{14}$$

Let $\tilde{\rho}^{CA}_{ii'}$ denote the item-score reliability estimated by method CA. Method CA is based on two assumptions. First, true scores $T_{X_i}$ and $T_Y$ correlate perfectly; that is, $\rho_{T_{X_i}T_Y} = 1$, reflecting that $T_{X_i}$ and $T_Y$ measure the same attribute. Second, $\rho_{YY'}$ equals the population reliability. Because many researchers use coefficient alpha ($\alpha_Y$) to approximate $\rho_{YY'}$, in practice, it is assumed that $\alpha_Y = \rho_{YY'}$. Using these two assumptions, Equation 14 reduces to

$$\tilde{\rho}^{CA}_{ii'} = \frac{\rho_{X_iY}^2}{\alpha_Y}. \tag{15}$$

Comparing $\tilde{\rho}^{CA}_{ii'}$ and $\rho_{ii'}$, one may notice that $\tilde{\rho}^{CA}_{ii'} = \rho_{ii'}$ if the denominators in Equations 15 and 14 are equal, that is, if $\alpha_Y = \rho_{T_{X_i}T_Y}^2\,\rho_{YY'}$. When does this happen? Assume that $Y = R_{(i)}$.
Then, if the $J - 1$ items on which $Y$ is based are essentially τ-equivalent, meaning that [...] (Lord & Novick, 1968, p. 50), then $\alpha_Y = \rho_{YY'}$. This results in $\rho_{T_{X_i}T_Y}^2\,\rho_{YY'} = \rho_{YY'}$, implying that $\rho_{T_{X_i}T_Y}^2 = 1$, hence $\rho_{T_{X_i}T_Y} = 1$, and this is true if $T_{X_i}$ and $T_Y$ are linearly related: [...] Because it is already assumed that items are essentially τ-equivalent and because the linear relation has to be true for all $J$ items, $b_i = 0$ for all $i$ and $\tilde{\rho}^{CA}_{ii'} = \rho_{ii'}$ if all items are essentially τ-equivalent. Further study of the relation between $\tilde{\rho}^{CA}_{ii'}$ and $\rho_{ii'}$ is beyond the scope of this article and is left for future research.
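Method CA with the rest score as $Y$ amounts to dividing the squared item-rest correlation by coefficient alpha of the remaining items. The following sketch illustrates this computation under the two assumptions stated above; the function names and the toy data are hypothetical, and population variances are used throughout for simplicity.

```python
from statistics import mean, pvariance


def pearson(u, v):
    """Product-moment correlation between two equally long score lists."""
    mu, mv = mean(u), mean(v)
    cross = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    ss_u = sum((a - mu) ** 2 for a in u)
    ss_v = sum((b - mv) ** 2 for b in v)
    return cross / (ss_u * ss_v) ** 0.5


def cronbach_alpha(columns):
    """columns: one list of scores per item (at least two items)."""
    k = len(columns)
    total = [sum(vals) for vals in zip(*columns)]
    item_var = sum(pvariance(c) for c in columns)
    return k / (k - 1) * (1 - item_var / pvariance(total))


def ca_item_reliability(columns, i):
    """Squared item-rest correlation divided by alpha of the rest score."""
    rest_cols = [c for j, c in enumerate(columns) if j != i]
    rest = [sum(vals) for vals in zip(*rest_cols)]
    return pearson(columns[i], rest) ** 2 / cronbach_alpha(rest_cols)


# Hypothetical toy data: three dichotomous items, four respondents.
columns = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
ca_first_item = ca_item_reliability(columns, 0)
```

For the toy data, the squared item-rest correlation equals 3/11 and alpha of the two remaining items equals 8/11, so the estimate for the first item is 3/8.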

Simulation Study
A simulation study was performed to compare median bias, IQR, and percentage of outliers produced by item-score reliability methods MS, λ6, LCRC, and CA. Joint cumulative probability $\pi_{x(i),y(i')}$ was estimated using methods MS, λ6, and LCRC. For these three methods, the estimates of the joint cumulative probabilities $\pi_{x(i),y(i')}$ were inserted in Equation 4 to estimate the item-score reliability. For method CA, Equation 15 was used.

Method
Dichotomous or polytomous item scores were generated using the multidimensional graded response model (De Ayala, 1994). Let $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_Q)$ be the $Q$-dimensional latent variable vector, which has a $Q$-variate standard normal distribution. Let $\alpha_{iq}$ be the discrimination parameter of item $i$ relative to latent variable $q$, and let $\delta_{ix}$ be the location parameter for category $x$ ($x = 1, 2, \ldots, m$) of item $i$. The multidimensional graded response model (De Ayala, 1994) is defined as

$$P(X_i \ge x \mid \boldsymbol{\theta}) = \frac{\exp\left[\sum_{q=1}^{Q}\alpha_{iq}(\theta_q - \delta_{ix})\right]}{1 + \exp\left[\sum_{q=1}^{Q}\alpha_{iq}(\theta_q - \delta_{ix})\right]}. \tag{16}$$

The design for the simulation study was based on the design used by Van der Ark et al. (2011) for studying test-score reliability. A standard condition was defined for six dichotomous items ($J = 6$, $m + 1 = 2$), one dimension ($Q = 1$), equal discrimination parameters ($\alpha_{iq} = 1$ for all $i$ and $q$) and equidistantly spaced location parameters $\delta_{ix}$ ranging from −1.5 to 1.5 (Table 3), and sample size $N = 1{,}000$. The other conditions differed from the standard condition with respect to one design factor. Test length, sample size, and item-score format were considered extensions of the standard condition, and discrimination parameters and dimensionality were considered deviations, possibly affecting methods the most.
Test length ($J$): The test consisted of 18 items ($J = 18$). For this condition, the six items from the standard condition were copied twice.
Sample size ($N$): The sample size was small ($N = 200$).
Item-score format ($m + 1$): The $J$ items were polytomous ($m + 1 = 5$).
Discrimination parameters ($\alpha$): Discrimination parameters differed across items ($\alpha = .5$ or $2$). This constituted a violation of the assumption of nonintersecting item response functions needed for method MS.
Dimensionality ($Q$): The items were two-dimensional ($Q = 2$) with latent variables correlating .5. The location parameters alternated between the two dimensions. This condition is more realistic than the condition chosen in Van der Ark et al. (2011), representing two subscale scores that are combined into an overall measure, whereas Van der Ark et al. (2011) used orthogonal dimensions.

Van der Ark et al. (2011) found that item format and sample size did not affect bias of test-score reliability, but these factors were included in this study to find out whether results for individual items were similar to results for test scores.
Data sets were generated as follows. For every replication, $N$ latent variable vectors, $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N$, were randomly drawn from the $\boldsymbol{\theta}$ distribution. For each set of latent variable scores, for each item, the $m$ cumulative response probabilities were computed using Equation 16. Using the $m$ cumulative response probabilities, item scores were drawn from the multinomial distribution. In each condition, 1,000 data sets were drawn.
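The data-generation steps can be sketched as follows. The sketch assumes the logistic form of the multidimensional graded response model, $P(X_i \ge x \mid \boldsymbol{\theta}) = 1 / (1 + \exp[-\sum_q \alpha_{iq}(\theta_q - \delta_{ix})])$, and draws uncorrelated latent dimensions, so it does not reproduce the correlation of .5 used in the two-dimensional condition. The function names, the seed, and the six equidistant location values are hypothetical choices that mimic the standard condition; they are not the code used in the study.

```python
import math
import random


def cumulative_probs(theta, alphas, deltas):
    """P(X_i >= x | theta) for x = 1..m; deltas must be in ascending order."""
    probs = []
    for delta in deltas:
        z = sum(a * (t - delta) for a, t in zip(alphas, theta))
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


def draw_item_score(cum, rng):
    """One multinomial draw over scores 0..m: the score is the number of
    cumulative probabilities that exceed a single uniform draw."""
    u = rng.random()
    return sum(u < p for p in cum)


def generate_data(n, items, q=1, seed=1):
    """items: list of (alphas, deltas) pairs; returns n rows of item scores."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        theta = [rng.gauss(0.0, 1.0) for _ in range(q)]  # uncorrelated dimensions
        data.append([draw_item_score(cumulative_probs(theta, a, d), rng)
                     for a, d in items])
    return data


# Six dichotomous items, alpha = 1, locations equidistant from -1.5 to 1.5.
items = [([1.0], [d]) for d in (-1.5, -0.9, -0.3, 0.3, 0.9, 1.5)]
data = generate_data(1000, items, q=1, seed=1)
```

Repeating the call with different seeds would yield the replicated data sets of one condition.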
Population item-score reliability $\rho_{ii'}$ was approximated by generating item scores for 1 million simulees (i.e., sets of item scores). For each item, the variance of the item true scores computed from the $\theta$s of the 1 million simulees was divided by the variance of the item score $X_i$ to obtain the population item-score reliability. It was found that $.05 \le \rho_{ii'} \le .41$. Let $s_r$ be the estimate of $\rho_{ii'}$ in replication $r$ ($r = 1, \ldots, R$) by means of methods MS, λ6, LCRC, and CA. For each method, difference $(s_r - \rho_{ii'})$ is displayed in boxplots. For each item-score reliability method, median bias, IQR, and percentage of outliers were recorded. An overall measure reflecting estimation quality based on the three quantities was not available, and in cases where a qualification of a method's estimation quality was needed, the authors indicated how the median bias, IQR, and percentage of outliers were weighted. The computations were done using R (R Core Team, 2015). The code is available via https://osf.io/e83tp/. For the computation of method MS, the package mokken was used (Van der Ark, 2007, 2012). For the computation of the LCM used for estimating method LCRC, the package poLCA was used (Linzer & Lewis, 2011).
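The ratio of true-score variance to item-score variance can also be approximated without simulation. The sketch below replaces the 1 million simulees by numerical integration over a grid of $\theta$ values, which is a substitution made here for illustration and not the procedure used in the study. It covers the unidimensional case only, assumes the logistic graded response model, and uses a hypothetical function name and grid settings.

```python
import math


def population_item_reliability(alpha, deltas, grid=4001, bound=8.0):
    """Var(T_i) / Var(X_i) for one item, theta ~ N(0, 1), by grid integration."""
    step = 2.0 * bound / (grid - 1)
    total_w = e_t = e_t2 = e_x2 = 0.0
    for g in range(grid):
        theta = -bound + g * step
        w = math.exp(-0.5 * theta * theta)   # unnormalized normal density
        cum = [1.0 / (1.0 + math.exp(-alpha * (theta - d))) for d in deltas]
        t = sum(cum)                         # true item score given theta
        # E(X^2 | theta) = sum over x of (2x - 1) * P(X >= x | theta)
        x2 = sum((2 * x - 1) * p for x, p in enumerate(cum, start=1))
        total_w += w
        e_t += w * t
        e_t2 += w * t * t
        e_x2 += w * x2
    e_t, e_t2, e_x2 = e_t / total_w, e_t2 / total_w, e_x2 / total_w
    return (e_t2 - e_t ** 2) / (e_x2 - e_t ** 2)


rho_standard = population_item_reliability(1.0, [0.0])
rho_high_alpha = population_item_reliability(2.0, [0.0])
```

As expected, a larger discrimination parameter produces a larger population item-score reliability for the same location parameter.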

Results
For each condition, Figure 1 shows the boxplots for the difference $(s_r - \rho_{ii'})$. In general, differences across items in the same experimental condition were negligible; hence, the results were aggregated not only across replications but also across the items in a condition, so that each condition contained $J \times 1{,}000$ estimated item-score reliabilities. The bold horizontal line in each boxplot represents median bias. The dots outside the whiskers are outliers, defined as values that lie more than 1.5 times the IQR below the first quartile or above the third quartile. For unequal $\alpha$s and for $Q = 2$, results are presented separately for high and low $\alpha$s and for each $\theta$, respectively.
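The three summary quantities can be computed from the differences between estimates and the population value as sketched below. The quartile convention is an assumption: the sketch uses the inclusive method of Python's statistics module, which may differ slightly from the hinges used by boxplot routines in other software. The function name and the example values are hypothetical.

```python
from statistics import median, quantiles


def summarize_differences(diffs):
    """Median bias, IQR, and percentage of outliers (1.5 * IQR rule)."""
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = sum(d < low or d > high for d in diffs)
    return {"median_bias": median(diffs),
            "iqr": iqr,
            "pct_outliers": 100.0 * n_out / len(diffs)}


# Hypothetical differences with one extreme value.
summary = summarize_differences([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this example, the value 100 lies above the upper fence of 14.5 and is the only outlier.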
In the standard condition (Figure 1), median bias for methods MS, LCRC, and CA was close to 0. For method LCRC, 6.4% of the differences $(s_r - \rho_{ii'})$ qualified as outliers. Hence, compared with methods MS and CA, method LCRC had a large IQR. Method λ6 consistently underestimated item-score reliability. In the long-test condition (Figure 1), for all methods, the IQR was smaller than in the standard condition. For the small-$N$ condition (Figure 1), for all methods, IQR was a little greater than in the standard condition. In the polytomous item condition (Figure 1), median bias and IQR results were comparable with results in the standard condition, but method LCRC showed fewer outliers (i.e., 1.2%).
Results for high-discrimination items and low-discrimination items can be found in Figure 1, unequal α-parameters condition panel. Median bias was smaller for low-discrimination items. For both high- and low-discrimination items, method LCRC produced median bias close to 0. Compared with the standard condition, IQR was greater for high-discrimination items and the percentage of outliers was higher for both high- and low-discrimination items. For high-discrimination items, methods MS, λ6, and CA showed greater negative median bias than for low-discrimination items. For low-discrimination items, method MS had a small positive bias and for methods λ6 and CA, the results were similar to the standard condition. For the two-dimensional data condition (Figure 1), methods MS and CA produced larger median bias compared with the standard condition. Methods LCRC and CA also produced larger IQR than in the standard condition. Method λ6 showed smaller IQR than in the standard condition.
A simulation study performed for six items with equidistantly spaced location parameters ranging from −2.5 to 2.5 showed that the number of outliers was larger for all methods, ranging from 0% to 9.6%. This result was also found when the items having the highest and lowest discrimination parameter were omitted.
Depending on the starting values, the expectation maximization (EM) algorithm estimating the parameters of the LCM may find a local optimum rather than the global optimum of the loglikelihood. Therefore, for each item-score reliability coefficient, the LCM was estimated 25 times using different starting values. The best-fitting LCM was used to compute the item-score reliability coefficient. This produced the same results, and left the former conclusion unchanged.

Figure 1. Difference $(s_r - \rho_{ii'})$, where $s_r$ represents an estimate of methods MS, λ6, LCRC, and CA, for six different conditions (see Table 3 for the specifications of the conditions).
Note. The bold horizontal line represents the median bias. The numbers in the boxplots represent the percentage of outliers in that condition. MS = Molenaar-Sijtsma method; λ6 = Guttman's method λ6; LCRC = latent class reliability coefficient; CA = correction for attenuation.

Real-Data Example
A real-data set illustrated the most promising item-score reliability methods. Because method LCRC had large IQR and a high percentage of outliers and because results were better and similar for the other three methods, methods MS, λ6, and CA were selected as the three most promising methods. The data set ($N = 425$) consisted of 0/1 scores on 12 dichotomous items measuring transitive reasoning (Verweij, Sijtsma, & Koops, 1999). The corrected item-total correlation, the item-factor loading based on a confirmatory factor model, the item-scalability coefficient (denoted $H_i$; Mokken, 1971, pp. 151-152), and the item-discrimination parameter (based on a two-parameter logistic model) were also estimated. The latter four measures provide an indication of item quality from different perspectives, and use different rules of thumb for interpretation. De Groot and Van Naerssen (1969, p. 351) suggested .3 to .4 as minimally acceptable corrected item-total correlations for maximum-performance tests. For the item-factor loading, values of .3 to .4 are most commonly recommended (Gorsuch, 1983, p. 210; Nunnally, 1978, pp. 422-423; Tabachnick & Fidell, 2007, p. 649). Sijtsma and Molenaar (2002, p. 36) suggested accepting only items having $H_i \ge .3$ in a scale. Finally, Baker (2001, p. 34) recommended a lower bound of 0.65 for item discrimination.
Using these rules of thumb yielded the following results (Table 4). Only Item 3 met the rule-of-thumb values for all four item indices. Item 3 also had the highest estimated item-score reliability, exceeding .3 for all three methods. Items 2, 4, 7, and 12 did not meet the rule of thumb for any of the item indices. These items had the lowest item-score reliability, not exceeding .3 for any method.

Discussion
Methods MS, λ6, and LCRC were adjusted for estimating item-score reliability. Method CA was an existing method. The simulation study showed that methods MS and CA had the smallest median bias. Method λ6 estimated $\rho_{ii'}$ with the smallest variability, but this method underestimated item-score reliability in all conditions, probably because it is a lower bound to the reliability, rendering it highly conservative. The median bias of method LCRC across conditions was almost 0, but the method showed large variability and produced many outliers overestimating item-score reliability. It was concluded that in the unequal α-parameters condition and in the two-dimensional condition, the methods do not estimate item-score reliability very accurately (based on median bias, IQR, and percentage of outliers). Compared with the standard condition, for unequal α-parameters, for high-discrimination items, median bias is large, variability is larger, and percentage of outliers is smaller. The same conclusion holds for the multidimensional condition. In practice, unequal α-parameters across items and multidimensionality are common, implying that $\rho_{ii'}$ is underestimated. In the other conditions, methods MS and CA produced the smallest median bias and the smallest variability, while method λ6 produced small variability but showed larger negative median bias, which rendered it conservative. Method LCRC showed small median bias, but large variability.
The authors conjecture that the way the fit of the LCM is established causes the large variability, and provide some preliminary thoughts for dichotomous items. For the population probabilities $\pi_{1(i)}$ and $\pi_{1(i),1(i')}$ defined earlier, let $\hat{\pi}_{1(i)} = \sum_k P(\hat{\xi} = k)P(X_i = 1 \mid \hat{\xi} = k)$ and $\hat{\pi}_{1(i),1(i')} = \sum_k P(\hat{\xi} = k)\left[P(X_i = 1 \mid \hat{\xi} = k)\right]^2$ be their latent class estimates based on sample data, and let $p_{1(i)}$ denote the sample proportion of respondents that have score 1 on item $i$. For dichotomous items, the item-score reliability (Equation 4) reduces to

$$\rho_{ii'} = \frac{\pi_{1(i),1(i')} - \pi_{1(i)}^2}{\pi_{1(i)}\left(1 - \pi_{1(i)}\right)}. \tag{17}$$

In samples, method LCRC estimates Equation 17 by means of

$$\hat{\rho}_{ii'} = \frac{\hat{\pi}_{1(i),1(i')} - p_{1(i)}^2}{p_{1(i)}\left(1 - p_{1(i)}\right)}. \tag{18}$$

The fit of a LCM is based on a distance measure between $\hat{\pi}_{1(i)}$ and $p_{1(i)}$. However, the fit of the LCM is not directly relevant for Equation 18, because $\hat{\pi}_{1(i)}$ does not play a role in this equation. A more relevant fit measure for Equation 18 would be based on a distance measure between $\hat{\pi}_{1(i),1(i')}$ and an observable quantity, but such a fit measure is unavailable. The impact of $\hat{\pi}_{1(i),1(i')}$ not being considered in the model fit is illustrated by means of the following example. Table 5 shows the parameter estimates of LCMs with two and three classes that both produce perfect fit, that is, one can derive from the parameter estimates that for both models $\hat{\pi}_{1(i)} = p_{1(i)} = .68$. In addition, one can also derive from the parameter estimates that for the two-class model, $\hat{\pi}_{1(i),1(i')} = .484$ and $\hat{\rho}_{ii'} = .099$, whereas for the three-class model, $\hat{\pi}_{1(i),1(i')} = .508$ and $\hat{\rho}_{ii'} = .210$. This example shows that, although the two LCMs both show perfect fit, the resulting values of $\hat{\rho}_{ii'}$ vary considerably. Hence, the variability of the LCRC estimate is larger than the fit of the LCM suggests, and this may explain the large variability of method LCRC in the simulation study.

Table 5. Parameter estimates of latent class models with two and three classes.
Class k    Two-class model                         Three-class model
           P(ξ̂ = k)    P(X_i = 1 | ξ̂ = k)          P(ξ̂ = k)    P(X_i = 1 | ξ̂ = k)
1          .4          .5                          .4          .5
2          .6          .8                          .3          .6
3                                                  .3          1.0
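The values reported for the two latent class models in Table 5 can be checked with a few lines of code. The sketch below evaluates the dichotomous-item formula (estimated joint probability minus the squared marginal, divided by the item variance) using the latent class estimate of the marginal in place of the sample proportion, which is harmless here because both models fit perfectly. The function name is hypothetical.

```python
def lcrc_dichotomous(class_probs, p_one):
    """class_probs[k] = P(class k); p_one[k] = P(X_i = 1 | class k).
    Returns (marginal, joint probability for two repetitions, reliability)."""
    pi_1 = sum(w * p for w, p in zip(class_probs, p_one))
    pi_11 = sum(w * p * p for w, p in zip(class_probs, p_one))
    rho = (pi_11 - pi_1 ** 2) / (pi_1 * (1.0 - pi_1))
    return pi_1, pi_11, rho


two_class = lcrc_dichotomous([0.4, 0.6], [0.5, 0.8])
three_class = lcrc_dichotomous([0.4, 0.3, 0.3], [0.5, 0.6, 1.0])
```

Both models reproduce the marginal .68, yet they imply joint probabilities of .484 and .508 and reliability values of about .099 and .210, respectively.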
Values for item-score reliability ranging from .05 to .41 were used. These values are small compared with values suggested in the literature. For example, Wanous and Reichers (1996) suggested a minimally acceptable item reliability of .70 in the context of overall job satisfaction, and Ginns and Barrie (2004) suggested values in excess of .90. It was believed that for most applications, such high values may not be realistic. In the real-data example, item-score reliability estimates ranged from <.01 to .47. Further research is required to determine realistic values of item reliability. In this study, the range of investigated values for $\rho_{ii'}$ was restricted. The item-score reliability methods' behavior should be investigated under different conditions for a broader range of values for $\rho_{ii'}$. This research is now under way.

Coefficient Alpha
An item-score reliability coefficient based on coefficient α can be constructed as follows. Let $\tilde{\pi}^{\alpha}_{x(i),y(i')}$ be an approximation of $\pi_{x(i),y(i')}$ based on observable probabilities, such that replacing $\pi_{x(i),y(i')}$ in the right-hand side of Equation 3 by $\tilde{\pi}^{\alpha}_{x(i),y(i')}$ results in coefficient α, that is, [...] Van der Ark et al. (2011) showed that the numerator of the ratio on the right-hand side equals [...] where $\bar{\pi}$ is the mean of the $J(J - 1)m^2$ observable terms in the numerator of the first ratio in [...] Hence, coefficient α equals [...] Let $w_i$ be an arbitrary weight with $w_i \ge 0$ and $\sum_i w_i = 1$. Coefficient α in Equation A4 can also be written as [...] The aim of including $w_i$ in the definition of α is to demonstrate identifiability problems in α for item scores. Consistent with Equation 4, for an item score $i$, Equation A5 may be reduced to [...] Because $w_i$ is arbitrary, coefficient α for item scores is unidentifiable, which makes this item-score reliability coefficient unsuited for estimating item-score reliability. Note that a natural choice would be to have $w_i = 1$ for all $i$. In that case, the numerator of Equation A6 is a constant and coefficient α for item scores is completely determined by the variance of the item.

Coefficient λ2
A line of reasoning similar to that for coefficient α can be applied to coefficient λ2. Let $\tilde{\pi}^{\lambda_2}_{x(i),y(i')}$ be an approximation of $\pi_{x(i),y(i')}$ based on observable probabilities, such that replacing $\pi_{x(i),y(i')}$ in Equation A3 by $\tilde{\pi}^{\lambda_2}_{x(i),y(i')}$ results in coefficient λ2; that is, [...] Van der Ark et al. (2011) showed that [...] Hence, coefficient λ2 equals [...] Let $w_{ixy}$ be an arbitrary weight with $w_{ixy} \ge 0$ and $\sum_i\sum_x\sum_y w_{ixy} = m^2J$. Using weights $w_i$, coefficient λ2 in Equation A9 can also be written as [...] Consistent with Equation 4, for an item score $i$, based on Equation A10, consider [...] Similar to the item version of coefficient α, the item version of coefficient λ2 is unidentifiable because $w_i$ can have multiple values, which renders this version of coefficient λ2 not a candidate to estimate $\rho_{ii'}$. Setting $w_i$ to 1 results in a coefficient that depends on the item variance, making it unsuited as a coefficient for item-score reliability.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.