New patient-oriented summary measure of net total gain in certainty for dichotomous diagnostic tests

Objectives To introduce a new, patient-oriented predictive index as a measure of gain in certainty. Study design Algebraic equations. Results A new measure is suggested based on error rates in a patient population. The new Predictive Summary Index (PSI) reflects the true total gain in certainty obtained by performing a diagnostic test based on knowledge of disease prevalence, i.e., the overall additional certainty. We show that the overall gain in certainty can be expressed in the form of the following expression: PSI = PPV+NPV-1. PSI is a more comprehensive measure than the post-test probability or the Youden Index (J). The reciprocal of J is interpreted as the number of persons with a given disease who need to be examined in order to detect correctly one person with the disease. The reciprocal of PSI is suggested as the number of persons who need to be examined in order to correctly predict a diagnosis of the disease. Conclusion PSI provides more information than J and the predictive values, making it more appropriate in a clinical setting.


Background
The main justification for performing a diagnostic test is to gain new information [1][2][3] beyond the existing (prior) probability. From a positive test, the gain is the positive predictive value (PPV) minus the prevalence; from a negative test, it is the negative predictive value (NPV) minus (1-prevalence). We introduce a predictive summary index (PSI), a new measure that summarizes the total gain in certainty, i.e., the overall additional certainty, expressed as PSI = PPV+NPV-1. We show that the reciprocal of PSI can be interpreted as the number of persons needed to be examined in order to correctly predict a diagnosis of the disease (NNP). We compare the PSI with the Youden Index (J), a less informative summary measure of a test in a limited study population, proposed by Youden [1] as a measure of the goodness of a diagnostic test.

The terminology of diagnostic test characteristics [4-29]
Performance assessment of a dichotomous diagnostic test is usually based on assessing test performances in two different populations.
The first one is a selected study population of persons with and without a disease, in which both the diagnostic test and the definitive test (the gold standard) are evaluated, and for which sensitivity and specificity are calculated using a "sample truth table" [6] (Table 1). The second one is the general patient population to which the diagnostic test is applied. The experience of this population is summarized in a second 2 × 2 table (Table 2), which samples the data according to test status (positive or negative, i.e., pathological or normal result). It is the data in Table 2 that are of interest to the patient and the physician. However, it is often difficult to obtain the information needed to construct Table 2 because it is unfeasible or unethical to perform both the diagnostic tests and an additional definitive test in the general population to determine the true diagnosis according to the gold standard. Therefore, the positive predictive value (PPV) and the negative predictive value (NPV) are calculated from Table 1 using Bayes' theorem.
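The Bayes' theorem step mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `predictive_values` and the input values (sensitivity 0.80, specificity 0.90, prevalence 0.10) are hypothetical and chosen only for demonstration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) via Bayes' theorem.

    sensitivity and specificity come from the study population;
    prevalence is the prior probability of disease in the target population.
    All three are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    # Note: the divisions fail if nobody tests positive (or negative).
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical inputs, for illustration only.
ppv, npv = predictive_values(0.80, 0.90, 0.10)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With a low prevalence the PPV is far below the sensitivity, which is why the study-population measures alone do not describe what a positive result means for a patient.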

Definitions in the study population (Table 1) [4-29]
Sensitivity, specificity, and the likelihood ratio are frequently used measures (Table 1). False positive and false negative rates are often defined with reference to the study population. When diseased and non-diseased subjects are sampled, the false positive rate among persons without the disease is often defined as the α error (hence, specificity = 1 - α) and the false negative rate among persons with the disease is often defined as the β error (hence, sensitivity = 1 - β).

Definitions in the target (patient) population (Table 2) [4-29]
The measure of interest for physicians and patients alike is usually the positive predictive value (PPV) and the negative predictive value (NPV), (Table 2). When the prevalence P is known, the PPV can be derived from sensitivity and specificity.
In a previous publication [29] we proposed two new ratio measures in the patient population: the Positive Predictive Ratio (PPR), which is analogous to the PLR, and the Negative Predictive Ratio (NPR), which is analogous to the NLR. We do not discuss these ratio measures in the present article; they are mentioned only to aid understanding of the new difference measure proposed here.
Following Fleiss [27], Reid et al. [28], Knottnerus [22], and Riffenburgh [6], we use uppercase notations to define error rates in the patient (target) population (Table 2), in addition to the commonly used fnr and fpr defined above. Note that the interpretation of alpha and beta errors in the target population (Table 2) is different from that in the study population (Table 1); we therefore use the subscript p for errors in the target (patient) population.

The Youden index
In 1950, Youden [1] proposed the Youden Index as a measure of the goodness of a diagnostic test, using alpha and beta errors: $J = 1 - (\alpha + \beta)$. If the test has no diagnostic value, the sensitivity = 1 - β equals the fpr = α, i.e., J = 0 (i.e., equal probability for disease among people with positive and negative test results).
If the test is always correct, all errors equal 0 and J = 1. Negative values of J (between -1 and 0) occur if the test is misleading, that is, if test results are negatively associated with the true diagnosis. Thus, as Pepe noted [26], J is a one-dimensional summary of test accuracy in the study population when the human and monetary costs associated with false positive errors in persons without the disease are similar in magnitude to those associated with false negative errors in persons with the disease [26, p. 80].

Interpretation of the Youden index (J) in the study population
Another way of expressing this statistic in the study population is by means of sensitivity and specificity: $J = \text{sensitivity} + \text{specificity} - 1$. Assuming that sensitivity and specificity are equally important in determining the expected gain, the above equation implies that when sensitivity+specificity = 1, the test provides no overall information. Thus, if the test has no diagnostic value, e.g., the sensitivity and specificity are both 0.5, J = 0. If the test is always correct, the sensitivity and specificity total 2, and J = 1.
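The boundary cases described above (no diagnostic value, always correct, misleading) can be checked numerically. The sketch below is illustrative only; the function names `youden_index` and `youden_from_errors` are hypothetical, and the two forms are the standard ones, J = sensitivity + specificity - 1 and J = 1 - (α + β).

```python
def youden_index(sensitivity, specificity):
    """Youden Index from sensitivity and specificity (proportions in [0, 1])."""
    return sensitivity + specificity - 1.0


def youden_from_errors(alpha, beta):
    """Same index from the study-population error rates.

    alpha is the false positive rate among persons without the disease,
    beta is the false negative rate among persons with the disease.
    """
    return 1.0 - (alpha + beta)


print(youden_index(0.5, 0.5))   # test with no diagnostic value
print(youden_index(1.0, 1.0))   # test that is always correct
print(youden_index(0.0, 0.0))   # test that is always wrong (misleading)
```

Both functions return the same value whenever alpha = 1 - specificity and beta = 1 - sensitivity.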

Interpretation of the Youden index (J) as an average total net gain in certainty in the study population
As originally proposed by Youden [1], the index can be derived in another way: the net gain for persons with the disease can be defined as the difference between the percentage of persons diagnosed correctly (i.e., the sensitivity) and the fnr.
Similarly, the net gain for persons without the disease can be defined as the difference between the percentage of persons diagnosed correctly without a disease (i.e., the specificity) and the fpr.
Assuming that the value gain in certainty for these two populations (persons with and without the disease) is similar, and that false positives are as undesirable as false negatives, the unweighted average of the two measures is J (see appendix A): $J = \frac{(\text{sensitivity} - \mathrm{fnr}) + (\text{specificity} - \mathrm{fpr})}{2}$

Analogies to a cohort study
J can also be interpreted as the difference between the true and false positive rates.
Thus, J reflects the excess of the proportion of a positive result among patients with vs. patients without the disease [24][25][26].
This interpretation of J is analogous to a commonly used measure in cohort studies, the rate (risk) difference (RD). Counter-intuitively, the analogy is to a cohort study rather than to a case control study, although in a study population the compared study groups are persons with and without the disease because the "causative" variable (i.e., the "exposure") is the fact that a person does or does not have the disease. The diagnostic test results (positive or negative) are the "outcome" of the disease.
J can also be written as: $J = \mathrm{tnr} - \mathrm{fnr} = \text{specificity} - (1 - \text{sensitivity})$. Thus, J also reflects the excess in the proportion of a negative result among patients without vs. patients with the disease [25].
This interpretation of J is analogous to a rate difference of having no disease, when the focus of the investigation is the absence of a disease. Alternatively, it can be analogous to a follow-up study that defines better health as the outcome. In other words, it is analogous to the rate difference of a protective agent, such as vaccination, when better health and disease are compared as outcomes.
A further analogy to the well-known measure of the "number needed to treat" (NNT = 1/RD) is in order. Thus, 1/J may be interpreted as the number of patients needed to be examined in order to correctly detect (NND, Table 3) one person with the disease in a study population (Table 1) of persons with and without the known disease.

$g = \frac{a}{a+c} - \frac{c}{a+c} = \mathrm{tpr} - \mathrm{fnr} = \frac{a-c}{a+c}$
The commonly used PLR is a ratio measure, analogous to the risk ratio (RR) in a follow-up study. Thus, J and PLR describe a diagnostic test in the study population of Table 1 only, in two different dimensions.
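As a rough illustration of these two dimensions, the following sketch computes the difference measure (J, with its reciprocal NND) and the ratio measure (PLR) from the same pair of inputs. The function name `study_population_measures` is hypothetical; the PLR is taken in its usual form, sensitivity / (1 - specificity), and the inputs are the exercise ECG values quoted later in the paper.

```python
def study_population_measures(sensitivity, specificity):
    """Return (J, NND, PLR) for a dichotomous test.

    J   = sensitivity + specificity - 1   (difference measure)
    NND = 1 / J                           (number needed to diagnose)
    PLR = sensitivity / (1 - specificity) (ratio measure)
    """
    j = sensitivity + specificity - 1.0
    if j <= 0.0:
        # The reciprocal has no useful interpretation for an
        # uninformative or misleading test.
        raise ValueError("NND is only defined here for J > 0")
    if specificity >= 1.0:
        raise ValueError("PLR is undefined when specificity equals 1")
    return j, 1.0 / j, sensitivity / (1.0 - specificity)


j, nnd, plr = study_population_measures(0.6035, 0.9106)
print(f"J = {j:.4f}, NND = {nnd:.2f}, PLR = {plr:.2f}")
```

None of the three quantities takes the prevalence as an input, which is the limitation the next section addresses.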

The new measure in the target population: Predictive Summary Index (PSI, Ψ)
Because the Youden Index (J) is based on the study population (Table 1) [25,26], it does not convey information about the specific clinical setting in which the diagnostic test is being applied. Patients and physicians are more interested in a similar Predictive Summary Index (PSI, Ψ) in the target population (Table 2). Because the interpretation of alpha and beta errors in the two populations is different, J and PSI have different underlying interpretations.
PSI can be derived in the target (patient) population as a measure of the goodness of the predictability in a diagnostic test, using alpha and beta errors in the target population: $\mathrm{PSI} = 1 - (\alpha_p + \beta_p)$. If the test has no predictive value, PPV equals FNR, i.e., there is an equal probability for disease among people with positive and negative test results (Table 2). Hence: $\mathrm{PSI} = 0$. If the test is always correct, all errors equal 0 and PSI = 1. Negative values of PSI (between -1 and 0) occur if the test is misleading, i.e., occurrence of disease is negatively associated with test results.
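The three cases just described can be verified with a short numerical sketch. It assumes PSI = PPV + NPV - 1 (as stated in the abstract) and obtains PPV and NPV via Bayes' theorem; the function name `psi_from_accuracy` and the prevalence of 0.2 are hypothetical choices for illustration.

```python
def psi_from_accuracy(sensitivity, specificity, prevalence):
    """PSI = PPV + NPV - 1, with PPV and NPV obtained via Bayes' theorem."""
    # Overall probability of a positive test in the target population.
    p_pos = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    ppv = sensitivity * prevalence / p_pos
    npv = specificity * (1.0 - prevalence) / (1.0 - p_pos)
    return ppv + npv - 1.0


prevalence = 0.2  # hypothetical prior probability of disease

# Sensitivity + specificity = 1: the test result is unrelated to disease.
print(psi_from_accuracy(0.3, 0.7, prevalence))
# Always correct.
print(psi_from_accuracy(1.0, 1.0, prevalence))
# Always wrong (misleading).
print(psi_from_accuracy(0.0, 0.0, prevalence))
```

In the first case the PPV equals the prevalence and the NPV equals 1 - prevalence, so the test adds nothing to the prior and PSI is zero up to floating-point rounding.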

Interpretation of PSI in the study population as a true (total) gain in certainty
Another way of expressing this statistic in the study population is by PPV and NPV: $\mathrm{PSI} = \mathrm{PPV} + \mathrm{NPV} - 1$. This can be expressed using the data in the study population (Table 1) by means of Bayes' theorem.
PSI in the target population is a generalization of the measure of gain in certainty from a diagnostic test, as proposed by Connell and Koepsell [2].
Physicians can guess the probability of a disease without performing a diagnostic test based on prevalence (the prior probability of a disease). A true gain in the certainty that a disease is present occurs when the posterior probability (the PPV) is greater than the prior probability (the prevalence). A true gain in the certainty that there is no disease occurs when the posterior probability of no disease (the NPV) is greater than the prior probability of no disease (1-prevalence).
Table 3: The use of exercise ECG with three types of patients with prior coronary disease probability of 5%, 90% and 50%, using angiogram as a gold standard.
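The total net gain in certainty is a summation of these gains: $(\mathrm{PPV} - \text{prevalence}) + (\mathrm{NPV} - (1 - \text{prevalence})) = \mathrm{PPV} + \mathrm{NPV} - 1$. Algebraically, it is the PSI. The net gain for persons with a positive test result is the difference between the percentage of persons predicted correctly to have the disease (the PPV) and the FPR. Similarly, the net gain for persons with a negative test result is the difference between the percentage of persons predicted correctly to be without the disease (the NPV) and the FNR.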
Assuming that the value gain in certainty for these two populations (persons with positive and negative test results) is similar, and that false positives (FPR) are as undesirable as false negatives (FNR), the unweighted average of the two gains is PSI (Appendix B): $\mathrm{PSI} = \frac{(\mathrm{PPV} - \mathrm{FPR}) + (\mathrm{NPV} - \mathrm{FNR})}{2}$
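The two readings of PSI given above, a sum of gains over the prior and an unweighted average of net gains, can be compared numerically. This sketch assumes the target-population error rates are FPR = 1 - PPV and FNR = 1 - NPV; the function name `psi_two_ways` and the input values are hypothetical.

```python
def psi_two_ways(ppv, npv, prevalence):
    """Return PSI computed as (total gain over prior, average net gain)."""
    # Gain over the prior probability for each test result.
    gain_positive = ppv - prevalence
    gain_negative = npv - (1.0 - prevalence)
    total_gain = gain_positive + gain_negative

    # Net gain: correct predictions minus errors for each test result.
    net_positive = ppv - (1.0 - ppv)   # PPV - FPR
    net_negative = npv - (1.0 - npv)   # NPV - FNR
    average_net_gain = (net_positive + net_negative) / 2.0

    return total_gain, average_net_gain


# Hypothetical predictive values and prevalence, for illustration only.
total, average = psi_two_ways(ppv=0.87, npv=0.70, prevalence=0.50)
print(f"total gain = {total:.4f}, average net gain = {average:.4f}")
```

The prevalence cancels out of the total gain, which is why PSI can be written purely in terms of PPV and NPV once those are known.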

Analogies to a case control study
The PSI can be interpreted as the difference between the correct prediction of a disease by the test and a false negative result of the test in the target population.
Thus, PSI reflects the excess in the proportion of the disease when the test yields a positive result vs. the proportion of the disease when the test is negative, similar to the Zhou et al. [25] interpretation of J in the study population.
This interpretation of PSI is analogous to the exposure rate difference, an uncommon measure of no interest in case control studies. Counter-intuitively, the analogy is to a case control study rather than to a follow-up study, although the compared study groups are persons with positive vs. negative test results. As mentioned above, the "causative" variable (the "exposure") is the fact that a person does or does not have the disease. The diagnostic test results (positive or negative) are the "outcome" of the disease.
Although in case control studies there is no interest in the exposure rate difference, where we are interested in the association of exposure with a resulting disease, the PSI, analogous to the exposure rate difference, is of interest in clinical epidemiology in the context of the data in Table 2, i.e., in the target population.
We suggest using a new statistic, NNP = 1/PSI, analogously to NND, to estimate the number of patients needed to be examined in the patient population (Table 2) in order to correctly identify (predict) the positive diagnosis of one person. For example, this can be the number of people who would have to undergo exercise ECG to correctly identify one person who would eventually be diagnosed by angiography as having coronary artery disease. This measure may be abbreviated as the "number needed to predict," or NNP.
Similarly, one can also interpret PSI as: $\mathrm{PSI} = \mathrm{NPV} - \mathrm{FPR}$. Thus, PSI reflects the excess in the proportion of no disease when the test yields a negative result vs. no disease when the test is positive, similar to the Zhou et al. [24] interpretation of J.
This interpretation of PSI is analogous to the exposure rate difference when the lack of exposure is the focus of the investigation, and it is compared with exposure to the causative agent. NNP also measures the number of patients needed to be examined in the patient population in order to correctly identify (predict) the negative diagnosis of one person.

Example using published data
Consider the example provided by Sackett et al. [8, pp. 95-98] on the importance of prevalence for the evaluation by exercise ECG of three types of patients with prior coronary disease probability of 5%, 90%, and 50%, using angiogram as a gold standard (Table 3). Originally, the example was designed to demonstrate the importance of prevalence in determining the PPV and NPV of a diagnostic test (exercise ECG). The sensitivity of the ECG was 60.35% and the specificity 91.06%. Thus the Youden Index was J = 51.41% for all three types of patients.
As indicated by Sackett et al. [8], patient C, with a 50% prior probability of the disease, can benefit more from the test than patients A and B. But both the PLR and J statistics (in the study population) are identical for the three types of patients and do not convey this information. However, PSI statistics provide the information relevant to patients and physicians, in this case a PSI of 18.54%, 23.6%, and 56.47% for patient populations with a prevalence of 90%, 5%, and 50%. PSI is a comprehensive measure that conveys information about prior probabilities of a disease (the prevalence) together with the information about the posterior probability, after performing the diagnostic test, the PPV, as well as the probability of no disease (1-prevalence), and the NPV.
While NND = 1/J remains constant irrespective of prevalence, NNP = 1/PSI is dependent on prevalence and yields values of 5.4, 4.2, and 1.8 for patient populations with a prevalence of 90%, 5%, and 50%. The range of NNP demonstrates that the exercise test is most efficient when the prevalence is 50%, as Sackett et al. [8] claimed. Only two patients would be needed to show a valuable information gain from an exercise test for predicting coronary heart disease when the prevalence is 50%, compared with more than 5 patients when the prevalence is 90%.
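The figures in this example can be approximately reproduced from the quoted sensitivity and specificity alone. The sketch below is an illustration under stated assumptions, not the original calculation: it applies Bayes' theorem to the rounded percentages, so its output agrees with the reported PSI and NNP values only to within rounding (the published values were presumably computed from the counts in Table 3). The function name `psi_and_nnp` is hypothetical.

```python
# Sensitivity and specificity of exercise ECG as quoted in the text.
SENSITIVITY = 0.6035
SPECIFICITY = 0.9106


def psi_and_nnp(sensitivity, specificity, prevalence):
    """Return (PSI, NNP) for a given prevalence, via Bayes' theorem."""
    p_pos = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    ppv = sensitivity * prevalence / p_pos
    npv = specificity * (1.0 - prevalence) / (1.0 - p_pos)
    psi = ppv + npv - 1.0
    return psi, 1.0 / psi


# J and NND do not involve the prevalence at all.
j = SENSITIVITY + SPECIFICITY - 1.0
print(f"J = {j:.4f}, NND = {1.0 / j:.2f} (same for every prevalence)")

results = {}
for prevalence in (0.90, 0.05, 0.50):
    results[prevalence] = psi_and_nnp(SENSITIVITY, SPECIFICITY, prevalence)
    psi, nnp = results[prevalence]
    print(f"prevalence = {prevalence:.2f}: PSI = {psi:.4f}, NNP = {nnp:.2f}")
```

The ordering matches the text: PSI is largest and NNP smallest at a prevalence of 50%, while J and NND are unchanged across the three patient types.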

Discussion
Most of the classic epidemiology textbooks do not discuss the Youden Index, probably because its utility is limited, as shown in the above example: it remains unchanged in populations with different prevalence of the disease. Similarly, only a few textbooks discuss measures of gain in certainty despite the fact that the purpose of a diagnostic test is to reach a better diagnosis by a gain in certainty. We suggest a simple and informative summary measure of the total gain in certainty, the PSI, which is readily calculable from 2 × 2 tables that describe the performance of a diagnostic test in the patient (target) population. We suggest using the capital Greek letter psi (Ψ) for this index.
PSI is of interest to patients: it describes how much more likely the patient is to be correctly diagnosed with a disease after a positive test, and how much more likely the patient is not to be incorrectly diagnosed with a disease after a negative test. This information may be critical. PSI can serve as an indicator for the possible use of the results of a specific test, and makes possible comparisons between tests within the context of prior probabilities: the higher the PSI, the more informative the test is for patients and physicians.
J is a descriptor of a diagnostic test among theoretical groups of persons with and without a disease. PSI is a descriptor of test performance for persons who test positive or negative. Thus, PSI reflects the total net gain in certainty resulting from a diagnostic test in clinical conditions, which is of interest to physicians and patients alike. This information is not available through J or any of the diagnostic test characteristics, including sensitivity, specificity (and thereby the likelihood ratios), and PPV.
PSI is similar in form to J but not identical with it. J depends entirely on sensitivity and specificity. By contrast, PSI depends on those parameters and also on the prevalence. Although Connell and Koepsell [2] advocated the use of a measure similar to PSI in the patient population, they explored only a special case in which the study population and the target population are identical and the prevalence is known from the study population.
This is a specific condition that seldom occurs. PSI is not limited to this specific situation.
Increasing diagnostic power through new technologies usually serves to increase the sensitivity of diagnostic procedures, with a potential for higher J. But new technologies can also decrease specificity and create false positive findings, resulting in anxiety and unnecessary costs [30][31][32][33][34]. These are not measurable by J in a limited study population. Therefore, PSI should be evaluated for each new diagnostic test to estimate its diagnostic accuracy in the patient population.
When physicians have data on the patient population and can construct the appropriate Table 2 for the clinical (target) population, a clinic-specific PSI can be calculated directly from the patient population. In a clinical setting, the calculation of PSI does not depend on knowing the prevalence, a statistic that is often not available for specific patient populations. Following up on patients with positive or negative tests will yield directly the PPV and NPV without the need for sensitivity, specificity, or Bayes' theorem. PSI calculations are readily available from the PPV and NPV. For example, sonographers can follow up on fetuses who test positive or negative for malformations and determine the PPV and NPV in their prenatal clinic [29]. Their PSI would serve to indicate the net gain in certainty for their patients (equation 4.13) and the average net gain in information for patients who test positive or negative (4.23). When sensitivity, specificity, and external knowledge of the prevalence are available, Bayes' theorem can be used to calculate the PPV, the NPV, and thereby the PSI.
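A clinic-specific calculation of this kind needs only the four follow-up counts. The sketch below assumes the conventional 2 × 2 cell layout (a = test positive with disease, b = test positive without disease, c = test negative with disease, d = test negative without disease); that layout, the function name `psi_from_counts`, and the counts themselves are assumptions made for illustration, since Table 2 is not reproduced here.

```python
def psi_from_counts(a, b, c, d):
    """Return (PPV, NPV, PSI, NNP) from follow-up counts in a 2 x 2 table.

    Assumed layout:
      a: test positive, disease confirmed
      b: test positive, disease excluded
      c: test negative, disease confirmed
      d: test negative, disease excluded
    """
    ppv = a / (a + b)   # proportion of positive tests that were correct
    npv = d / (c + d)   # proportion of negative tests that were correct
    psi = ppv + npv - 1.0
    # The reciprocal is only reported for an informative test (PSI > 0).
    nnp = 1.0 / psi if psi > 0.0 else float("inf")
    return ppv, npv, psi, nnp


# Hypothetical clinic follow-up data, for illustration only.
ppv, npv, psi, nnp = psi_from_counts(a=90, b=10, c=30, d=170)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}, PSI = {psi:.2f}, NNP = {nnp:.2f}")
```

No sensitivity, specificity, or external prevalence estimate enters the calculation; the rows of the table are conditioned on the test result, which is what the patient and physician actually observe.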
Moons and Harrell have recently criticized the use of sensitivity or specificity, maintaining that "...sensitivity and specificity are not proper parameters for characterizing diagnostic accuracy research... these parameters are of limited relevance to practice, and their estimation should not necessarily be pursued in diagnostic research" [33]. Note that the diagnostic "gold standard" is itself imperfect [31][32][33]. Our approach emphasizes the need to evaluate test characteristics in the patient population.
Neither the PLR nor J is calculable in the patient (target) population, and they do not convey any additional information beyond sensitivity and specificity. In a previous publication we suggested new ratio measures in the patient population [29]. In this manuscript we recommend the use of a new measure, PSI, as a difference measure that measures the overall clinical utility of a test.
The present paper suggests a similarity between J and the rate difference in a cohort study. Similarly, we suggest an analogy between PSI and the exposure rate difference in a case control study. These analogies have seldom been discussed in the epidemiological literature despite the obvious implications for teaching and understanding the different uses of the 2 × 2 table in etiological and clinical epidemiology.
Thus, analogously to calculations of NNT from the rate difference (RD) in follow-up studies, the reciprocal of J, i.e., 1/J, can be interpreted as the number of patients needed to be examined in order to correctly detect (NND) one person with the disease [34] in a study population (Table 1) of persons with and without the known disease. NND can be helpful whenever the sensitivity and specificity, and thus J or PLR, are used. But NND is insensitive to variation in prevalence, as shown in the example in Table 3: approximately two examinations are needed among persons with and without a diagnosis of a disease to detect correctly one person who has the disease, irrespective of patient characteristics and the prevalence of the disease. Patients and physicians have little interest in a statistic that is unaffected by patient characteristics. Thus, NND should have limited clinical utility.
We suggest instead using NNP = 1/PSI to estimate the number of patients needed to be examined in the patient population (Table 2) in order to correctly predict the diagnosis of one person with a positive test result. This measure, the "number needed to predict," better describes the use of a diagnostic test in patient populations with a different prevalence of the disease (Table 3), and is much more meaningful for patients and physicians. Based on Sackett's example [8] and the data in Table 3, we suggest that NNP can be useful for cost assessment and policy making for specific patients at a specific risk. For example, exercise ECG could be used for patients at high risk for coronary heart disease but not for patients at low risk.
Similarly, NNP estimates the number of patients who need to be tested to identify, with a negative test result, one patient who does not have the disease.
The interpretation of J is based on the assumption that the human and monetary costs associated with a false positive and a false negative in the study population (Table 1) are equal [1,25,26]. Similarly, the interpretation of PSI is based on the assumption that the human and monetary costs associated with a false positive and a false negative in the target population (Table 2) are equal.
There are excellent computer programs available for computing J, notably PEPI [38]. PEPI programs also calculate the gain in certainty in the study population. Similar programs that calculate PSI could be developed with great benefit.

Appendix A
J is the total net gain in certainty when a test is applied to the study population, and is also the average net gain in certainty in the study population for persons with or without a disease.
Assuming that the value gain in certainty for the two study populations (persons with and without the disease) is similar, and that the false positives, fpr, are as undesirable as the false negatives, fnr, the unweighted average of the two measures is J: $J = \frac{(\text{sensitivity} - \mathrm{fnr}) + (\text{specificity} - \mathrm{fpr})}{2}$

Proof
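Rearranging the algebraic terms: $\frac{(\text{sensitivity} - \mathrm{fnr}) + (\text{specificity} - \mathrm{fpr})}{2} = \frac{(\text{sensitivity} + \text{specificity}) - (\mathrm{fnr} + \mathrm{fpr})}{2}$. Adding and subtracting 1: $\frac{(\text{sensitivity} + \text{specificity} - 1) + (1 - (\mathrm{fnr} + \mathrm{fpr}))}{2} = \frac{J + J}{2} = J$, since $\mathrm{fnr} = \beta$, $\mathrm{fpr} = \alpha$, and $J = 1 - (\alpha + \beta)$.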

Appendix B
PSI is the total net gain in certainty when a test is applied to the target population and is also the average net gain in certainty for persons with a positive or negative test.
Assuming that the value gain in certainty for the two target populations (persons with a positive test and without a positive test) is similar, and that the false positives (FPR) are as undesirable as the false negatives (FNR), the unweighted average of the two measures is PSI: $\mathrm{PSI} = \frac{(\mathrm{PPV} - \mathrm{FPR}) + (\mathrm{NPV} - \mathrm{FNR})}{2}$

Proof
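Rearranging the algebraic terms: $\frac{(\mathrm{PPV} - \mathrm{FPR}) + (\mathrm{NPV} - \mathrm{FNR})}{2} = \frac{(\mathrm{PPV} + \mathrm{NPV}) - (\mathrm{FPR} + \mathrm{FNR})}{2}$. Adding and subtracting 1: $\frac{(\mathrm{PPV} + \mathrm{NPV} - 1) + (1 - (\mathrm{FPR} + \mathrm{FNR}))}{2} = \frac{\mathrm{PSI} + \mathrm{PSI}}{2} = \mathrm{PSI}$, since $\mathrm{FPR} = 1 - \mathrm{PPV}$ and $\mathrm{FNR} = 1 - \mathrm{NPV}$.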