Two-stage procedures for selecting the best diagnostic biomarkers

Considered in the paper is the problem of selecting a diagnostic biomarker that has the highest classification rate among several candidate markers with dichotomous outcomes. The probability of correct selection depends on a number of nuisance parameters from the joint distribution of the biomarkers and thus can be substantially affected if these nuisance parameters are misspecified. A two-stage procedure is proposed to compute the needed sample size that achieves the desired level of correct selection, as so confirmed by simulation results.


Introduction
In the area of diagnostic medicine, a biomarker is used to identify subjects that have a certain condition, e.g. a disease or risk status for the disease, so that proper treatments can be followed (Zhou et al. 2002). Often, multiple candidate biomarkers are developed simultaneously and medical investigators are interested in selecting the one with the highest diagnostic accuracy among these newly identified biomarkers so that future studies can be focused on the selected biomarker.
One approach to achieving this goal is as follows. A number of subjects are randomly selected from a population with the condition and one without the condition, hereafter referred to as 'diseased' and 'non-diseased' populations, respectively. The candidate biomarkers are then tested on each subject and their diagnostic accuracy is evaluated from the test results. The biomarker that yields the highest (estimated) accuracy will then be identified as the best biomarker.
A key feature associated with such a selection process is the probability of correct selection (PCS), that is, the probability that the selected biomarker indeed has the highest diagnostic accuracy. The number of diseased and nondiseased subjects needs to be sufficiently large so that the PCS can be maintained at a certain desired level. Since the biomarkers are tested on the same subject, their test outcomes from the same subject are correlated, following a multivariate distribution. Thus, the PCS depends on a number of parameters from the joint distribution which have to be specified in order to have an estimate of the sample size. Unfortunately, however, the PCS can be substantially adversely affected if these parameters are not correctly quantified.
To relax the dependence of the PCS on these parameters, we propose a twostage procedure. First, we randomly select a number of diseased and nondiseased subjects and use the test results from these subjects to estimate the corresponding parameters in the PCS function. Then, the total number of diseased and non-diseased subjects is computed using these estimates that achieve the desired PCS level when the diagnostic accuracy of the best biomarker differs from the other biomarkers by a specified amount. Simulation results are presented to demonstrate the effectiveness of the proposed procedure in achieving the PCS. This paper ends with some discussions in §6.

Formulation of the problem
Suppose there are K candidate biomarkers under consideration, and each yields a binary test result on a subject, 1 or 0, representing the subject being classified as diseased or non-diseased, respectively, by the biomarker. To assess their diagnostic accuracy, these biomarkers are tested on a random sample of m diseased and n nondiseased subjects. For the kth biomarker, its test result is denoted by X ik from the ith diseased subject and Y jk from the j th non-diseased subject. With these notations, the sensitivity (the probability of correctly classifying a diseased subject) of the kth biomarker is p k Z PfX ik Z 1g and can be estimated byp k Z P m iZ1 X ik =m. Its specificity (the probability of correctly classifying a non-diseased subject) is PfY j k Z 0gZ 1K q k , with q k Z PfY jk Z 1g and can be estimated by 1Kq k , witĥ q k Z P n jZ1 Y jk =n. The classification rate q k of the kth biomarker is the sum of the two quantities, i.e. q k Z p k K q k C 1 Z PfX ik Z 1g KPfY jk Z 1gC 1, and can be estimated subsequently byq k Z P m iZ1 X ik ð Þ =m K P n jZ1 Y jk À Á =nC 1. The best biomarker is the one with the largest q, and the one that yields the largest q-estimate will be chosen as the 'best' biomarker. Selection error occurs when the largest q-estimate is not from the biomarker with the largest q.
Without loss of generality, we assume that the first biomarker is the best among the candidate markers, i.e. q 1 Z max kR1 q k . The l th biomarker is selected as the best ifq l Z max kR1qk : Therefore, the PCS is : The key issue is to determine m and n so that the PCS will be at least g, a specified level of probability, under certain parametric configuration of the qs. Usually we require that rRg when q 1 Kmax kR2 q k R D, where DO0 is also a prespecified constant. For sufficiently large m and n,q k ðk Z 1; .; KÞ follows asymptotically a multivariate normal distribution. Thus if rZg for q 1 Kmax kR2 q k Z D, then rRg for q 1 Kmax kR2 q k R D. Therefore, the sample sizes m and n are computed by solving the equation In order to compute the sample sizes, parameters other than D which appear in the equation are given values based on 'educational guess' or other data sources if possible.

Computation of PCS
The test outcomes of the K biomarkers from a diseased or non-diseased subject follow a multinomial distribution. Exact calculation of the PCS involves the joint probabilities of {X i1 Z0 or 1, ., X iK Z0 or 1} and {Y j 1 Z0 or 1, ., Y jK Z0 or 1}, which becomes extremely complicated when K is relatively large. Instead, we may use normal approximation to relax the computational complexity, assuming that the sample sizes are relatively large. Write qZ ðq 1 ; .; q K Þ andqZ ðq 1 ; .;q K Þ. Then by classical asymptotic theory (Bickel & Doksum 1977),q follows asymptotically a multivariate normal distribution with mean vector q and a variance-covariance matrix given by Define d k Z q kC1 K q 1 andd k Zq kC1 Kq 1 , for k Z 1; .; K K1, and write dZ ðd 1 ; .; d KK1 Þ anddZ ðd 1 ; .;d KK1 Þ. Thend is also normally distributed with mean vector d and variance-covariance matrix SZ ðs kl Þ, where Let Fðx 1 ; .; x KK1 ; d; SÞ be the multivariate normal distribution function with mean d and variance-covariance matrix S, then the PCS is approximately rðd; SÞ Z Pd k % 0; k Z 1; .; K K 1 n o Z Fð0; .; 0; d; SÞ: ð3:1Þ The sample sizes m and n are computed from the equation Fð0; .; 0; d; SÞZ g under the constraints that d k ZD for all k Z 1; .; KK1.

A two-stage procedure
The asymptotic formula (3.1) of the PCS involves K(KC1) parameters fp k ; q k ; p k;l ; q k;l : 1% k; l % K; l skg from the multinomial distributions. With the KK1 constraints d k ZD, there are K 2 C1 distributional parameters (referred to as nuisance parameters) left to be specified in order to solve for sample sizes, assuming that m/nZl is a known constant. If these nuisance parameters are misspecified, then the PCS can be substantially adversely affected, resulting in an unacceptably high level of selection error; see §5 for numerical demonstration. In order to maintain the desired level of the PCS, we propose, following Stein's (1945) idea, a two-stage selection procedure to compute the required sample size NZmCnZ(1Cl)n with ratio lZm/n fixed.
Step 1. Randomly select m 1 diseased and n 1 non-diseased subjects from the target populations and test the biomarkers on these subjects. Use the test outcomes X ik , iZ1, ., m 1 and Y jk , jZ1, ., n 1 to obtain empirical estimates of the nuisance parameters, given as follows: Y jk Y jl : ð4:2Þ Step 2. Plug these estimates into the PCS formula and then solve for sample size N 0 from the equation rZg. The required sample size N is then set to max(N 1 , N 0 ), with N 1 Zm 1 Cn 1 .
The sample size N computed in this way is no longer a fixed quantity. Rather, it is a random variable, a statistic of the data from the first stage. Note that the estimates in (4.1) and (4.2) are consistent estimators of the corresponding parameters. For sufficient sample size N 1 , the PCS is expected to be approximately g regardless of the true values of the nuisance parameters; see simulation results below.

Numerical evaluation
Numerical studies are conducted to investigate the effects of the nuisance parameters on PCS and the performance of the proposed two-stage procedure in maintaining the level of PCS. To reduce the computational burden, we use three biomarkers (KZ3) for illustration. Table 1 presents the values of PCS under various configurations of the parameter space which determine the joint distribution of the three biomarkers, that is, the joint probability of p ijk Z PfX 11 Z i; X 12 Z j; X 13 Z kg and q ijk Z PfY 11 Z i; Y 12 Z j; Y 13 Z kg, for i, j, kZ0, 1. These joint probabilities Table 1. PCS in fixed-size and two-stage designs. thus give the marginal probabilities, for example, p 1;2 Z PfX 11 Z 1; X 12 Z 1g Z p 111 C p 110 p 1 Z PfX 11 Z 1g Z p 111 C p 110 C p 101 C p 100 ; and so on. All configurations in table 1 correspond to a fixed value of 0.1 for DðZq 1 K q 2 Z q 1 K q 3 Þ. The first row in the table specifies the parameters used to compute the sample size, for a single-stage procedure, assuming that lZm/nZ1. It turns out that, with this configuration, 102 diseased and 102 nondiseased subjects are needed to achieve a PCS of at least 0.95.
The r-values, computed using formula (3.1), in the remaining rows of table 1 are the PCS with 102 diseased and 102 non-diseased subjects under other specifications of the parameter values. We see that the PCS can be substantially lower than the desired level of 0.95 if the parameters differ from the specified values in the first row of the table. For example, if the corresponding parameters from the joint distribution of the three biomarkers take values from the second row rather than as being specified in the first row of the table, then the PCS can be decreased by more than 20%! To investigate whether the proposed two-stage procedure can relax the dependence on nuisance parameters and maintain the PCS at the desired level, we conducted 10 000 simulations for each configuration of the parameter space in table 1. The simulated values of PCS are presented as r Ã in the last column of the table. For the first-stage sample size, we used m 1 Z75 diseased and n 1 Z75 nondiseased subjects to estimate the corresponding nuisance parameters.
The results in table 1 clearly show that the PCS is maintained at the desired level of 0.95, regardless of the parametric configuration under which the data are generated. Thus the two-stage procedure is proved to be efficient in controlling selection error rates. Numerical results (not shown here) for other configurations and ratios of l reveal similar findings. With such a multidimensional parameter space, however, it is difficult to characterize any configurations for which the two-stage procedure would perform relatively better.

Discussion
In this paper, we proposed a two-stage procedure to select the best diagnostic biomarker among a number of candidate markers, and numerically demonstrated its effectiveness in controlling the selection error. While there is no universal standard on how the first-stage sample size is determined, it should be relatively large so that accurate estimates of the nuisance parameters can be obtained. On the other hand, it should not be too large when taking the cost effectiveness of a study into consideration.
We confine our attention to biomarkers with dichotomous outcomes. However, the two-stage approach proposed in this paper can be easily extended to biomarkers with ordinal or continuous test results. In these two cases, an appealing selection criterion is to select the marker that has the largest area under its receiver operating characteristic curve (Hanley 1989).
Various selection problems and related procedures have been discussed extensively in the statistical literature (see Bechhofer et al. 1968;Gibbons 1988), but mostly focusing on multivariate normal distributions. Applications of the selection concepts and theory to the area of diagnostic medicine are relatively new, and many other issues remain to be solved such as selecting a fixed number of best diagnostic biomarkers and selecting the best biomarker (or all biomarkers) that are better than a standard biomarker. Further research efforts are much needed along these lines.
Throughout this paper we assume that the data will be available for all biomarkers from each subject. Very often values on some biomarkers may be missing. One possible approach to dealing with this situation is to use multiple imputation based on the multinomial distributions and then apply the proposed method to the imputed data. The effectiveness of this approach in controlling the PCS needs further investigation. Moreover, selection bias may occur when the subjects are not randomly selected, and correction for such bias in estimating the PCS is desirable.