Classification and Error Estimation for Discrete Data

Discrete classification is common in Genomic Signal Processing applications, in particular in classification of discretized gene expression data, and in discrete gene expression prediction and the inference of boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on the study of performance in small sample scenarios, as well as asymptotic behavior.


INTRODUCTION
In high-throughput Genomics applications, the objective is often to classify different phenotypes based on a panel of gene expression biomarkers, or to infer underlying gene regulatory networks from gene expression data. It is often advantageous to discretize gene expression data, for data efficiency and classification accuracy reasons. Classification of discrete data is a subject with a long history in Pattern Recognition [1][2][3][4][5][6][7]. In Genomics applications, this methodology has been applied both in classification of discretized gene expression data [8,9], and in discrete gene expression prediction and the inference of boolean genomic regulatory networks, via the binary coefficient of determination (CoD) [10][11][12].
The most often employed discrete classification rule is the discrete histogram rule [1,3,4,6,13]. This classification rule has many desirable properties. For example, it can be shown that it is strongly universally consistent, meaning that, regardless of the particular distribution of the data, this rule can eventually learn the optimal classifier from the data, as the sample size increases, with probability one. In addition, the discrete histogram rule is simple enough to allow the exact analytical study of many of its properties.
Once a classifier is obtained from sample data, its performance must be evaluated. The most important criterion for performance is the classification error, which is the probability of making an erroneous classification on a future sample. The classification error can be computed exactly only if the underlying distribution of the data is known, which is almost never the case in practice. Robust error estimation methods must then be employed to obtain reliable estimates of the classification error based on the available data. An error estimator is a sample-based statistic, the bias and variance (and thus the root mean square error, RMS) properties of which determine how consistently the error estimator is near the true classification error, considering all possible sample training data sets from a given population. More generally, all statistical questions regarding the accuracy of the error estimator can be answered through the joint sampling distribution of the error estimator and true probability of error [14]. From an epistemological perspective, error estimation has to do with the fundamental question of the validity of scientific knowledge [15]. The quality of the error estimate determines the accuracy of the predictions that can be performed by the inferred model, and thus its scientific content.
*Address correspondence to this author at the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77845, USA; E-mail: ulisses@ece.tamu.edu
Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. With a small training sample set, the designed classifier will be, on average, more dissimilar to the optimal classifier, and thus have a larger classification error. In addition to that, in a small-sample setting, one must use the same data to both design the classifier and assess its validity, which requires data-efficient error estimators, and this in turn calls for careful study of performance.
It is the goal of the present paper to present a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics. The paper is organized as follows. Section 2 illustrates the application of discrete classification in Genomics through a pair of simple examples. Section 3 formalizes the problem, with particular emphasis on the discrete histogram classification rule. Section 4 reviews the most common error estimators used in discrete classification, commenting briefly on their properties. Sections 5 through 7 contain the bulk of the literature review on the subject. Section 5 reviews results on the small-sample performance of discrete classification; these are analyses that must hold for a given finite number of samples. This section reviews exact and approximate expressions for performance metrics of the actual and estimated errors for the discrete histogram rule; complete enumeration methods that can deal with intractable cases such as conditional performance metrics; distribution-free results on small-sample performance, with emphasis on the pioneering work of G.F. Hughes; as well as recent analytical results that indicate that ensemble classification methods may be largely ineffective in discrete classification. Section 6, by contrast, focuses on the large-sample performance of discrete classification; this is a more technical section, which reviews asymptotic results on whether optimal performance is reached, and how fast, as the sample size increases. Finally, Section 7 reviews the binary coefficient of determination (CoD).

DISCRETE CLASSIFICATION IN GENOMICS
The objective of classification is to employ a set of training data, consisting of independently observed known cases, and obtain a fixed rule to classify, as accurately as possible, unknown cases from the same population. The training data consists of carefully measured values of predictive variables and a response variable for each case. The response variable in classification is always discrete, i.e., it assumes a finite number of values; in fact, it is often binary, indicating one of two states, such as distinct cell phenotypes, disease severity, and so on.
If the predictor variables are also discrete, then one is in the context of discrete classification, also known as multinomial classification [13] and discrete discriminant analysis [7]. Additionally, in Statistics, the term categorical data analysis is often employed to refer to the statistical analysis of discrete data [16]. In Genomics, the predictor variables often correspond to the expression of a set of suitably selected genes; for discrete classification, gene expression must first be discretized into a finite number of intervals (methods to accomplish this are described in [8,9]). Note that the finite values taken on by the discrete predictors could be numeric (e.g., the mid-point value of an expression range), or purely categorical, as is often done by alluding to "up-regulated" and "down-regulated" genes. This distinction is immaterial in the case of the most commonly used discrete classification rule, known as the discrete histogram rule [1,3,4,6,13]. The discrete histogram rule simply assigns to each combination of possible values of the predictor variables a response value that is decided by majority voting among the observed response values. As will be seen in this paper, this simple rule has many desirable and interesting properties.
Fig. (1) depicts an example of how the discrete histogram rule would function in the case of cell phenotype classification based on the discretized expression values of two genes. Classification is between the phenotypes of "treated" and "untreated" cells (e.g., presence or absence of some drug in the culture, presence of enough nutrients vs. starvation, normal vs. abnormal cells, and a host of other possible conditions), and gene expression is discretized in ternary values, corresponding to down-regulated, basal, and up-regulated values. There are therefore $9 = 3^2$ possible combinations of observable values, or "bins", which can be organized in this case into a $3 \times 3$ matrix.
In this example, the observed training data set contains a total of 40 cases, with an equal number of cases in each of the "treated" and "untreated" categories (sometimes called a "balanced experimental design" in Statistics). The counts of observed response values over each of the bins are shown in the figure. The majority class is underlined in each case, and this would be the class assigned to a future case by the discrete histogram rule if the observed gene expression values fall into that particular bin. Note that there are two particular cases that require attention: there could be a tie between the counts over a bin, and no values might be observed in the training data over a bin (missing data). These cases can be resolved by randomly picking one of the classes or, if one wants to avoid random classifiers, one can break ties, in a fixed manner, in favor of one of the classes; for example, one might classify such cases as "untreated". Based on the resulting discrete classifier in this particular example, one might posit that up-regulation of both genes is associated with treatment of the cells.
Fig. (2) depicts another example, which illustrates how the discrete histogram rule would be applied in the case of discrete gene expression prediction; this constitutes the basic building block for the inference of gene regulatory networks [10,11]. Gene expression values have been discretized into binary values, indicating activation or not of the particular gene, and the expression of three genes (the predictor variables) is used to predict the expression of a fourth gene (the response, or target, variable). Note that the number of bins is $2^3 = 8$. The bins are represented side by side in Fig. (2), rather than organized into a matrix as in Fig. (1). This clearly makes no difference to the discrete histogram rule (an important point we will return to in the next Section).
Note that the values of all variables (predictors and target) are coded into 0 and 1, and that ties in this example are broken in favor of class 1, that is, high expression. As can be seen, prediction is based on a total of 40 cases, i.e., 40 instances of the 4-tuple consisting of the three predictor genes and the target gene. Note that the values 0 and 1 for the target gene are not represented equally, so the design is "unbalanced." It will rarely be the case in gene prediction that the design is balanced, since here one cannot possibly or meaningfully specify in advance the target classes for the observations; this is a very important difference with respect to the previous case of phenotype classification, where it is often possible and meaningful to do so. The validity of any scientific conclusions based on the previous classification models depends, naturally, on the accuracy of the obtained classifiers. In addition, it critically depends on the reliable estimation of said accuracy, based on the available data. These issues will be approached in the sequel.
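The majority-voting procedure described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `fit_histogram_rule`, the coding of ternary expression values as -1, 0, +1, and the training counts are all hypothetical and are not taken from Fig. (1) or Fig. (2). Ties and empty bins are resolved in a fixed manner in favor of one class, as discussed above.

```python
from collections import Counter

def fit_histogram_rule(samples, tie_label=0):
    """Fit the discrete histogram rule by majority voting within each bin.

    samples: iterable of (bin_key, label) pairs, with label in {0, 1}.
    tie_label: class assigned to tied bins and to bins with no data.
    """
    counts = Counter(samples)  # maps (bin_key, label) -> number of cases

    def classify(bin_key):
        n0 = counts[(bin_key, 0)]
        n1 = counts[(bin_key, 1)]
        if n0 == n1:  # tie, or empty bin (both counts are zero)
            return tie_label
        return 1 if n1 > n0 else 0

    return classify

# Hypothetical training data for two ternary genes
# (-1 = down-regulated, 0 = basal, +1 = up-regulated); label 1 = "treated".
train = ([((+1, +1), 1)] * 5 + [((+1, +1), 0)] * 1 +
         [((0, 0), 0)] * 4 + [((0, 0), 1)] * 2 +
         [((-1, +1), 0)] * 2 + [((-1, +1), 1)] * 2)

classify = fit_histogram_rule(train, tie_label=0)
labels = [classify(key) for key in [(+1, +1), (0, 0), (-1, +1), (+1, -1)]]
print(labels)
```

The third bin is a tie and the fourth bin was never observed, so both fall back to `tie_label`.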

DISCRETE CLASSIFICATION RULES
Examples of discrete classification rules include the discrete histogram rule, mentioned in the previous section, as well as the maximum-mean-accuracy rule of [5], and many other examples of discrete rules used in Data Mining [17]. Among these, the discrete histogram rule is by far the most widely used in practice. The discrete histogram rule is "natural" for categorical problems, not only due to its simplicity as majority voting over the bins, but also because it corresponds to the plug-in rule for approximating the optimal Bayes classifier, as we discuss below. In this section, we formalize the problem of discrete classification, which allows us to examine the properties of the discrete histogram rule, including classification accuracy and its estimation from data.
The predictor variables take values in a finite feature space (such as the matrix of bins in Fig. 1). As remarked in connection with Fig. (2), for the discrete histogram rule, the space can be reorganized in any way one likes. Therefore, we adopt a (bijective) mapping between the original feature space and the sequence of integers $1, \ldots, b$, and may equivalently assume, without loss of generality, a single predictor variable $X$ taking on values in the set $\{1, \ldots, b\}$. The value $b$ is the total number of bins into which the data are categorized; this parameter provides a direct measure of the complexity of discrete classification.
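One simple way to realize such a bijective mapping is a mixed-radix encoding of the feature tuple. The sketch below is only one possible choice (any bijection would do), and the names `bin_index` and `num_bins`, as well as the convention that each feature is coded as an integer starting at 0, are assumptions made for illustration.

```python
from itertools import product

def num_bins(levels):
    """Total number of bins b for features with the given numbers of levels."""
    b = 1
    for m in levels:
        b *= m
    return b

def bin_index(values, levels):
    """Map a tuple of feature values (feature k coded in 0..levels[k]-1)
    to a single bin index in 1..b using a mixed-radix encoding."""
    index = 0
    for v, m in zip(values, levels):
        if not 0 <= v < m:
            raise ValueError("feature value out of range")
        index = index * m + v
    return index + 1

# Two ternary genes give b = 9 bins; three binary genes give b = 8 bins.
levels = (3, 3)
all_indices = sorted(bin_index(x, levels) for x in product(range(3), repeat=2))
print(num_bins(levels), all_indices)
```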
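The properties of the discrete classification problem are completely determined by the (discrete) joint probability distribution between the predictor $X$ and the target $Y \in \{0, 1\}$. This distribution is specified by the class prior probabilities $c_0 = P(Y = 0)$ and $c_1 = P(Y = 1) = 1 - c_0$, and by the class-conditional bin probabilities $p_i = P(X = i \mid Y = 0)$ and $q_i = P(X = i \mid Y = 1)$, for $i = 1, \ldots, b$. Therefore, the classifier $\psi^*$ that achieves the minimum probability of misclassification $P(\psi(X) \neq Y)$, known as the Bayes classifier [13], is given by
$$\psi^*(i) = \begin{cases} 1, & c_1 q_i > c_0 p_i, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, b. \qquad (1)$$
It can be shown that if there are two or more discrete features in the original feature space (such as in Fig. 1), and these features are conditionally independent given $Y$, i.e., independent within each class, then the Bayes classifier $\psi^*$ is a linear function of those features [13, p.466].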
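The minimum probability of misclassification, or Bayes error, achieved by the Bayes classifier, can be written in terms of the class priors $c_0, c_1$ and the class-conditional bin probabilities $p_i, q_i$ as
$$\varepsilon^* = \sum_{i=1}^{b} \left[ c_0 p_i \, I_{c_1 q_i > c_0 p_i} + c_1 q_i \, I_{c_0 p_i \geq c_1 q_i} \right] = \sum_{i=1}^{b} \min\{c_0 p_i, c_1 q_i\}. \qquad (2)$$
Here, $I_A$ is an indicator variable, which is 1 when condition $A$ happens, and 0 otherwise. Since $\min\{x, y\} = \frac{1}{2}(x + y) - \frac{1}{2}|x - y|$ and $\sum_{i=1}^{b} (c_0 p_i + c_1 q_i) = 1$, the Bayes error can also be written as
$$\varepsilon^* = \frac{1}{2} - \frac{1}{2} \sum_{i=1}^{b} |c_0 p_i - c_1 q_i|.$$
The Bayes error is therefore a measure of distance between the classes, and it provides a lower bound on classification performance. For discrete histogram classification, the predictor variables in the original feature space should be chosen so that the Bayes error is as small as possible.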
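As a small numerical illustration, the sketch below computes the Bayes classifier and the Bayes error for a hypothetical four-bin model. It assumes the model is parameterized by the prior `c0` of class 0 and by the class-conditional bin probabilities `p` (class 0) and `q` (class 1); the function names and the numbers are illustrative only. The Bayes error is computed both as a sum of minima and in its distance form, and the two values should agree.

```python
def bayes_classifier(c0, p, q):
    """Label of each bin under the Bayes classifier (1 if c1*q_i > c0*p_i)."""
    c1 = 1.0 - c0
    return [1 if c1 * qi > c0 * pi else 0 for pi, qi in zip(p, q)]

def bayes_error(c0, p, q):
    """Bayes error as the sum over bins of min(c0*p_i, c1*q_i)."""
    c1 = 1.0 - c0
    return sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))

def bayes_error_distance_form(c0, p, q):
    """Same quantity written as 1/2 - 1/2 * sum |c0*p_i - c1*q_i|."""
    c1 = 1.0 - c0
    return 0.5 - 0.5 * sum(abs(c0 * pi - c1 * qi) for pi, qi in zip(p, q))

# Hypothetical model with b = 4 bins and equally likely classes.
c0 = 0.5
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]

psi_star = bayes_classifier(c0, p, q)
eps_star = bayes_error(c0, p, q)
print(psi_star, eps_star, bayes_error_distance_form(c0, p, q))
```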
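In practice, one almost never knows the model parameters completely, and therefore one does not know the Bayes classifier. One must rely instead on designing a classifier from sample training data; one hopes that such a sample-based classifier is close in some sense to the Bayes classifier. The classifier produced by the discrete histogram rule does become very close to the Bayes classifier, as sample size increases, in a few important senses; this will be discussed in Section 6.
Given training data $S_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, let $U_i$ and $V_i$ denote the number of sample points in bin $i$ that belong to class 0 and class 1, respectively. The discrete histogram rule assigns to each bin the majority class among the sample points observed in it; with ties broken in favor of class 0, the designed classifier is
$$\psi_n(i) = \begin{cases} 1, & U_i < V_i, \\ 0, & \text{otherwise}, \end{cases} \qquad i = 1, \ldots, b. \qquad (3)$$
The standard maximum-likelihood (ML) estimators of the model parameters are
$$\hat{c}_0 = \frac{N_0}{n}, \quad \hat{c}_1 = \frac{N_1}{n}, \quad \hat{p}_i = \frac{U_i}{N_0}, \quad \hat{q}_i = \frac{V_i}{N_1}, \qquad (4)$$
where $N_0 = \sum_{i=1}^{b} U_i$ and $N_1 = \sum_{i=1}^{b} V_i$ are the class sample sizes. The discrete histogram rule is the "plug-in" rule for discrete classification, that is, if one plugs the ML estimators of the unknown model parameters into the expression for the Bayes classifier in (1), one obtains precisely the histogram classifier in (3). Since the standard ML estimators in (4) are consistent, meaning that they converge to the true values of the parameters as the sample size increases, one would expect the discrete histogram classifier to approach the optimal Bayes classifier as more samples are acquired, which is indeed the case; we come back to this issue in Section 6.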
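The most important performance criterion for the designed classifier $\psi_n$ is its accuracy on independent (e.g., future) data, which are assumed to come from the same population as the given training data. This accuracy is measured by the probability of misclassification given the training data $S_n$,
$$\varepsilon_n = P(\psi_n(X) \neq Y \mid S_n) = \sum_{i=1}^{b} \left[ c_0 p_i \, I_{V_i > U_i} + c_1 q_i \, I_{U_i \geq V_i} \right], \qquad (5)$$
where $U_i$ and $V_i$ are the numbers of training points from class 0 and class 1 in bin $i$. This is known as the classification error. It is clear that $\varepsilon_n \geq \varepsilon^*$, where $\varepsilon^*$ is the Bayes error. Being a function of the random variables $U_i$ and $V_i$, the classification error $\varepsilon_n$ is itself a random variable. Its expected value $E[\varepsilon_n]$ has an important meaning in the context of classification rules. It does not depend on a particular set of training samples, but it gives the average classification error over all possible training data; therefore it is an intrinsic performance measure of the classification rule for the particular problem (i.e., joint distribution of $X$ and $Y$) and sample size $n$.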
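When the model is known, the actual classification error of a designed histogram classifier can be evaluated directly from the bin counts. The sketch below assumes the model is described by the class-0 prior `c0` and the class-conditional bin probabilities `p` and `q`, that `u[i]` and `v[i]` are the class-0 and class-1 training counts in bin `i`, and that ties go to class 0. The model, the counts, and the function names are hypothetical.

```python
def histogram_classifier(u, v):
    """Majority vote per bin; ties and empty bins are assigned to class 0."""
    return [1 if vi > ui else 0 for ui, vi in zip(u, v)]

def actual_error(c0, p, q, u, v):
    """Probability of misclassification of the designed classifier.

    A bin labeled 1 misclassifies class-0 points (mass c0*p_i);
    a bin labeled 0 misclassifies class-1 points (mass c1*q_i).
    """
    c1 = 1.0 - c0
    psi = histogram_classifier(u, v)
    return sum(c0 * pi if label == 1 else c1 * qi
               for pi, qi, label in zip(p, q, psi))

# Hypothetical model (its Bayes error is 0.3) and hypothetical counts, n = 12.
c0 = 0.5
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
u = [3, 1, 2, 0]
v = [1, 2, 1, 2]

eps_n = actual_error(c0, p, q, u, v)
print(histogram_classifier(u, v), eps_n)
```

For these counts the designed classifier disagrees with the Bayes classifier in two bins, so its error exceeds the Bayes error of the assumed model.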

ERROR ESTIMATION FOR DISCRETE CLASSIFICATION
In practice, the underlying probability model is unknown, and the classification error $\varepsilon_n$ has to be estimated from the sample data using an error estimator $\hat{\varepsilon}_n$. An error estimator is a function of the classification rule $\Psi_n$ and the sample data $S_n$. Therefore, it is a random variable through dependency on the random training data $S_n$. If the error estimator depends on any additional random factors, sometimes called internal factors, it is called randomized; otherwise it is said to be nonrandomized. Examples of the latter include the apparent error or resubstitution [18], and leave-one-out [19] error estimators, whereas popular examples of randomized error estimators include cross-validation [19][20][21][22] and all bootstrap-based error estimators [23][24][25].
Like the classification error $\varepsilon_n$ itself, a nonrandomized error estimator $\hat{\varepsilon}_n$ produces a fixed value once the training data set $S_n$ is specified ("running the estimator again" on the data never alters the result), which is not the case for a randomized error estimator. Internal random factors introduce internal variance that adds to the total variance of an error estimator, which measures how dispersed its estimates can be for varying training data from the same population. Note that the internal variance is zero for nonrandomized estimators. Randomized estimators typically reduce the unwanted extra internal variance through averaging based on intensive computation. See [26,27] for a detailed discussion of issues regarding randomized and nonrandomized error estimators, and internal and full variance.
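The variance of the error estimator, by itself, does not address its relationship to the quantity to be estimated, namely, the actual classification error. Relevant performance metrics that do so are discussed next. The bias, $\mathrm{Bias}(\hat{\varepsilon}_n) = E[\hat{\varepsilon}_n - \varepsilon_n]$, indicates whether the estimator is on average above or below the actual error, whereas the deviation variance,
$$\mathrm{Var}_d(\hat{\varepsilon}_n) = \mathrm{Var}(\hat{\varepsilon}_n - \varepsilon_n) = \mathrm{Var}(\hat{\varepsilon}_n) + \mathrm{Var}(\varepsilon_n) - 2 \rho(\varepsilon_n, \hat{\varepsilon}_n) \sqrt{\mathrm{Var}(\hat{\varepsilon}_n) \mathrm{Var}(\varepsilon_n)}, \qquad (6)$$
where $\rho(\varepsilon_n, \hat{\varepsilon}_n)$ is the correlation coefficient between actual and estimated errors, measures the spread of the deviation between the two. Small bias is of little significance if the deviation variance is large; this would mean that on average the error estimator is close to the true error, but that in fact the estimate for any particular sample set is likely to be far away from the true error. The root mean-square error, $\mathrm{RMS}(\hat{\varepsilon}_n) = \sqrt{E[(\hat{\varepsilon}_n - \varepsilon_n)^2]} = \sqrt{\mathrm{Bias}(\hat{\varepsilon}_n)^2 + \mathrm{Var}_d(\hat{\varepsilon}_n)}$, conveniently combines both the bias and the deviation variance into a single measure, and is widely adopted for comparison of error estimator performance. Additional performance measures include the tail probabilities $P(|\hat{\varepsilon}_n - \varepsilon_n| > \tau)$, for $\tau > 0$, which concern the likelihood of outliers, as well as the consistency of the error estimator; the conditional bias $E[\hat{\varepsilon}_n - \varepsilon_n \mid \hat{\varepsilon}_n]$; and confidence intervals for the true error conditioned on the estimate, which give bounds on the true error corresponding to a given precision $\alpha$, the observed error estimate, and the sample size. Confidence intervals express statistical power in error estimation: more powerful methods will produce shorter confidence intervals for the true error at the same sample size. A very important fact is that all of the aforementioned performance metrics, and in fact any others, can be determined if one has knowledge of the joint sampling distribution of the vector of actual and estimated errors $(\varepsilon_n, \hat{\varepsilon}_n)$. Section 5 reviews exact analytical methods to compute these performance metrics, as well as complete enumeration methods that allow the computation of the joint sampling distribution of actual and estimated errors.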
The resubstitution error estimator $\hat{\varepsilon}^{\,r}_n$ [18] is the simplest data-efficient alternative; it is simply the apparent error, or the proportion of errors the designed classifier makes on the training data itself. Clearly,
$$\hat{\varepsilon}^{\,r}_n = \frac{1}{n} \sum_{i=1}^{b} \left[ U_i \, I_{V_i > U_i} + V_i \, I_{U_i \geq V_i} \right] = \frac{1}{n} \sum_{i=1}^{b} \min\{U_i, V_i\}, \qquad (8)$$
where $U_i$ and $V_i$ are the numbers of training points from class 0 and class 1 in bin $i$. For example, in Fig. (2), the resubstitution estimate for the classification error is 0.3. It is easy to see that plugging the ML estimators of the model parameters in (4) into the formula for the Bayes error (2) results in expression (8). Therefore, resubstitution for the discrete histogram rule is the plug-in estimator of the Bayes error in discrete classification. The resubstitution estimator is clearly nonrandomized, and it is very fast to compute. This estimator is however always optimistically biased in the case of the discrete histogram rule, in the sense that
$$E[\hat{\varepsilon}^{\,r}_n] \leq \varepsilon^* \leq E[\varepsilon_n], \qquad (9)$$
so that the average resubstitution estimate bounds from below even the Bayes error; this fact seems to have been demonstrated for the first time in [1] (see also [2]). Observe though that this is not guaranteed to apply to any individual training data and classifier, but only to the average over all possible training data. The optimistic bias of resubstitution tends to be larger when the number of bins is large compared to the sample size; in other words, there is more overfitting of the classifier to the training data in such cases. On the other hand, resubstitution tends to have small variance. In cases where the bias is not too large, this makes resubstitution a very competitive option as an error estimator. In fact, the next Section contains results that show that resubstitution can have smaller RMS than even complex error estimators such as the bootstrap, provided that sample size is large compared to number of bins. In addition, it can be shown that as the sample size increases, both the bias and variance of resubstitution vanish (see Section 6).
Finally, it is important to emphasize that these observations hold for the discrete histogram rule; for example, the resubstitution estimator is not necessarily optimistically-biased for other (continuous or discrete) classification rules.
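For the discrete histogram rule, resubstitution reduces to counting the minority class in each bin. The sketch below assumes `u[i]` and `v[i]` are the class-0 and class-1 training counts in bin `i`; the counts and function names are hypothetical. It also checks numerically that plugging the relative-frequency (maximum-likelihood) estimates into the Bayes error formula gives the same value, which is the plug-in interpretation mentioned above.

```python
def resubstitution(u, v):
    """Proportion of training points misclassified by the histogram classifier.

    In each bin the classifier errs on the minority class, so the error
    count is min(u_i, v_i) regardless of how ties are broken.
    """
    n = sum(u) + sum(v)
    return sum(min(ui, vi) for ui, vi in zip(u, v)) / n

def plug_in_bayes_error(u, v):
    """Bayes error formula evaluated at the ML parameter estimates
    (assumes both classes are present in the sample)."""
    n0, n1 = sum(u), sum(v)
    n = n0 + n1
    c0_hat, c1_hat = n0 / n, n1 / n
    p_hat = [ui / n0 for ui in u]
    q_hat = [vi / n1 for vi in v]
    return sum(min(c0_hat * ph, c1_hat * qh) for ph, qh in zip(p_hat, q_hat))

# Hypothetical bin counts for b = 4 bins and n = 12 training points.
u = [3, 1, 2, 0]
v = [1, 2, 1, 2]
resub = resubstitution(u, v)
print(resub, plug_in_bayes_error(u, v))
```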
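The leave-one-out error estimator $\hat{\varepsilon}^{\,l}_n$ [19] removes the optimistic bias from resubstitution by counting errors committed by $n$ classifiers, each designed on $n - 1$ points and tested on the remaining left-out point, and dividing the total count by $n$. A little reflection shows that, with ties broken in favor of class 0,
$$\hat{\varepsilon}^{\,l}_n = \frac{1}{n} \sum_{i=1}^{b} \left[ U_i \, I_{V_i \geq U_i} + V_i \, I_{U_i \geq V_i - 1} \right], \qquad (10)$$
where $U_i$ and $V_i$ are the numbers of training points from class 0 and class 1 in bin $i$. For example, in Fig. (2), the leave-one-out estimate for the classification error is higher than the resubstitution estimate of 0.3. In fact, by comparing (8) and (10), one sees that $\hat{\varepsilon}^{\,r}_n \leq \hat{\varepsilon}^{\,l}_n$ for every training sample. Moreover, $E[\hat{\varepsilon}^{\,l}_n] = E[\varepsilon_{n-1}]$, making this estimator almost unbiased. As it turns out, this bias reduction is accomplished at the expense of an increase in variance [26]. The leave-one-out estimator is however nonrandomized.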
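The leave-one-out count can be obtained directly from the bin counts, without redesigning the classifier $n$ times. The sketch below assumes `u[i]` and `v[i]` are the class-0 and class-1 training counts in bin `i` and that ties are broken in favor of class 0; it compares the closed-form count against a brute-force version that actually removes each point and retrains. The counts and function names are hypothetical.

```python
def leave_one_out(u, v):
    """Closed-form leave-one-out estimate for the histogram rule (ties -> class 0)."""
    n = sum(u) + sum(v)
    errors = 0
    for ui, vi in zip(u, v):
        if vi >= ui:        # removing a class-0 point gives V_i > U_i - 1: label 1
            errors += ui
        if ui >= vi - 1:    # removing a class-1 point gives U_i >= V_i - 1: label 0
            errors += vi
    return errors / n

def leave_one_out_brute_force(u, v):
    """Remove each training point in turn, retrain, and test on the removed point."""
    n = sum(u) + sum(v)
    errors = 0
    for i in range(len(u)):
        for label, count in ((0, u[i]), (1, v[i])):
            for _ in range(count):
                ui, vi = u[i], v[i]
                if label == 0:
                    ui -= 1
                else:
                    vi -= 1
                predicted = 1 if vi > ui else 0   # only bin i matters for this point
                errors += int(predicted != label)
    return errors / n

# Hypothetical bin counts for b = 4 bins and n = 12 training points.
u = [3, 1, 2, 0]
v = [1, 2, 1, 2]
loo = leave_one_out(u, v)
print(loo, leave_one_out_brute_force(u, v))
```

For these counts the leave-one-out estimate (5/12) is larger than the resubstitution estimate (3/12), as expected.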
A randomized estimator is obtained by selecting randomly $k$ folds of size $n - n/k$, counting the errors committed by $k$ classifiers, each designed on one of the folds and tested on the remaining $n/k$ points not in the fold, and dividing the total count by $n$. This yields the well-known $k$-fold cross-validation estimator [19][20][21][22]. The process can be repeated several times and the results averaged, in order to reduce the internal variance associated with the random choice of folds. The leave-one-out estimator is a cross-validation estimator with $k = n$; therefore, cross-validation is not randomized in this special case.
Bootstrap error estimators [23][24][25] are based instead on resampling the training data with replacement. The zero bootstrap estimator $\hat{\varepsilon}^{\,0}_n$ counts the errors that classifiers designed on bootstrap samples commit on the training points left out of those samples, and it is combined with resubstitution in the convex combination $\hat{\varepsilon}^{\,b632}_n = 0.368 \, \hat{\varepsilon}^{\,r}_n + 0.632 \, \hat{\varepsilon}^{\,0}_n$. This is known as the 0.632 bootstrap error estimator, and is quite popular in Machine Learning applications [17]. It has small variance, but can be very slow to compute. In addition, it will fail when the resubstitution estimator is too optimistic. A variant called the 0.632+ bootstrap error estimator was introduced in [25], in an attempt to correct this problem. All cross-validation and bootstrap error estimators tend to be computationally intensive, due to the large number of classifier design steps involved and the need to reduce internal variance by averaging over a large number of iterations.
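A minimal sketch of both randomized estimators for the histogram rule follows. It makes several assumptions that are not taken from the text: bins are indexed from 0, ties go to class 0, folds are formed by dealing out a shuffled index list, the 0.632 bootstrap is taken in its usual form (0.368 times resubstitution plus 0.632 times the zero bootstrap estimate), and the zero bootstrap estimate pools the left-out errors over all replicates. All names and data are hypothetical.

```python
import random

def train(data, b):
    """Histogram classifier from (bin, label) pairs; ties go to class 0."""
    u, v = [0] * b, [0] * b
    for x, y in data:
        if y == 0:
            u[x] += 1
        else:
            v[x] += 1
    return [1 if vi > ui else 0 for ui, vi in zip(u, v)]

def error_count(psi, data):
    return sum(1 for x, y in data if psi[x] != y)

def kfold_cv(data, b, k, rng, repeats=1):
    """k-fold cross-validation, averaged over `repeats` random partitions."""
    n = len(data)
    total = 0.0
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        errors = 0
        for f in range(k):
            test_idx = set(idx[f::k])                       # held-out points
            design = [data[i] for i in idx if i not in test_idx]
            held_out = [data[i] for i in test_idx]
            errors += error_count(train(design, b), held_out)
        total += errors / n
    return total / repeats

def bootstrap_632(data, b, rng, replicates=100):
    """0.632 bootstrap: 0.368 * resubstitution + 0.632 * zero bootstrap."""
    n = len(data)
    resub = error_count(train(data, b), data) / n
    errors = tested = 0
    for _ in range(replicates):
        picks = [rng.randrange(n) for _ in range(n)]        # sample with replacement
        chosen = set(picks)
        left_out = [data[i] for i in range(n) if i not in chosen]
        errors += error_count(train([data[i] for i in picks], b), left_out)
        tested += len(left_out)
    zero = errors / tested if tested else resub
    return 0.368 * resub + 0.632 * zero

# Hypothetical data: b = 4 bins, n = 12 points.
data = ([(0, 0)] * 3 + [(0, 1)] * 1 + [(1, 0)] * 1 + [(1, 1)] * 2 +
        [(2, 0)] * 2 + [(2, 1)] * 1 + [(3, 1)] * 2)
rng = random.Random(0)
cv_n = kfold_cv(data, 4, len(data), rng)      # k = n reduces to leave-one-out
cv_4 = kfold_cv(data, 4, 4, rng, repeats=20)
b632 = bootstrap_632(data, 4, rng)
print(cv_n, cv_4, b632)
```

With `k` equal to the sample size every fold holds out a single point, so the result no longer depends on the random shuffle.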

SMALL-SAMPLE PERFORMANCE OF DISCRETE CLASSIFICATION
The fact that the distribution of the vectors of bin counts is multinomial (see Section 3), and thus easily computable, along with the simplicity of, and parallelism among, equations (2), (5), (8), and (10), for the Bayes error, actual error, resubstitution error, and leave-one-out error, respectively, allows the detailed analytical study of the small-sample performance of the discrete histogram classification rule and the associated resubstitution and leave-one-out error estimators.

Analytical Study of Actual Classification Error
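From (5) it follows that the expected error over the sample is given by
$$E[\varepsilon_n] = \sum_{i=1}^{b} \left[ c_0 p_i \, P(V_i > U_i) + c_1 q_i \, P(U_i \geq V_i) \right].$$
The computation of the probability $P(V_i > U_i)$ depends on the sampling scheme. In the full sampling case, where the $n$ training points are drawn from the mixture of the two populations, the vector $(U_i, V_i, n - U_i - V_i)$ is trinomially distributed with parameters $(n; c_0 p_i, c_1 q_i, 1 - c_0 p_i - c_1 q_i)$, so that
$$P(V_i > U_i) = \sum_{\substack{k < l \\ k + l \leq n}} \frac{n!}{k! \, l! \, (n - k - l)!} (c_0 p_i)^k (c_1 q_i)^l (1 - c_0 p_i - c_1 q_i)^{n - k - l}, \qquad (13)$$
whereas in the stratified sampling case, where the class sample sizes $n_0$ and $n_1$ are fixed in advance, $U_i$ is independent of $V_i$, and each is binomially distributed with parameters $(n_0, p_i)$ and $(n_1, q_i)$, respectively, so that
$$P(V_i > U_i) = \sum_{k=0}^{n_0} \sum_{l=k+1}^{n_1} \binom{n_0}{k} p_i^k (1 - p_i)^{n_0 - k} \binom{n_1}{l} q_i^l (1 - q_i)^{n_1 - l}. \qquad (14)$$
To obtain the variance $\mathrm{Var}(\varepsilon_n)$, note that (5) can be written as $\varepsilon_n = c_1 + \sum_{i=1}^{b} (c_0 p_i - c_1 q_i) I_{V_i > U_i}$, so that
$$\mathrm{Var}(\varepsilon_n) = \sum_{i=1}^{b} (c_0 p_i - c_1 q_i)^2 P(V_i > U_i) [1 - P(V_i > U_i)] + 2 \sum_{i < j} (c_0 p_i - c_1 q_i)(c_0 p_j - c_1 q_j) \left[ P(V_i > U_i, V_j > U_j) - P(V_i > U_i) P(V_j > U_j) \right]. \qquad (15)$$
This expression involves second-order bin probabilities, e.g., $P(V_i > U_i, V_j > U_j)$, which can be computed in a similar fashion to the first-order bin probability in (13) and (14), by using the fact that, in the full sampling case, the vector $(U_i, V_i, U_j, V_j, n - U_i - V_i - U_j - V_j)$ is multinomially distributed. An approximation is obtained by neglecting the covariance terms in (15), leading to a very simple expression for the variance, which involves only first-order bin probabilities:
$$\mathrm{Var}(\varepsilon_n) \approx \sum_{i=1}^{b} (c_0 p_i - c_1 q_i)^2 P(V_i > U_i) [1 - P(V_i > U_i)]. \qquad (16)$$
It is proved in [28] that, under a mild distributional assumption, the expression in (16) is asymptotically exact as the number of bins grows to infinity, for fixed sample size.
[...] This is an example of the "peaking phenomenon" that affects the expected classification error (see Section 5.4). As for the variance, one can see that it also decreases with increasing sample size, as expected. Except for the anomalous case $b = 2$, the variance seems to be insensitive to bin size. One can also appreciate that the approximation to the variance given by (16) is quite accurate, particularly at larger sample sizes. The good accuracy of the approximation is obtained at a huge savings in computation time. As an example, for $b = 16$ and $n = 60$, it takes more than 30 minutes and less than 1 second to compute the exact and approximate expressions for the variance, respectively, using state-of-the-art computing technology.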
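The first-order computation can be sketched as follows for the full sampling case. The sketch assumes that the counts of class-0 and class-1 points in a given bin, together with the remaining points, follow a trinomial distribution with cell probabilities `c0*p_i`, `c1*q_i` and the remainder, and that ties go to class 0. The variance function is the simple approximation obtained by ignoring covariances between the bin indicators, which is how the first-order-only expression is read here. The model and all names are hypothetical.

```python
from math import comb

def prob_v_gt_u_full(n, a, c):
    """P(V > U) when (U, V, n - U - V) is trinomial with cell probabilities
    a (class 0 in the bin), c (class 1 in the bin) and 1 - a - c (elsewhere)."""
    rest = 1.0 - a - c
    total = 0.0
    for k in range(n + 1):                    # U = k
        for l in range(k + 1, n - k + 1):     # V = l with l > k and k + l <= n
            total += (comb(n, k) * comb(n - k, l)
                      * a ** k * c ** l * rest ** (n - k - l))
    return total

def expected_error_full(n, c0, p, q):
    """Exact expected classification error of the histogram rule."""
    c1 = 1.0 - c0
    expected = 0.0
    for pi, qi in zip(p, q):
        w = prob_v_gt_u_full(n, c0 * pi, c1 * qi)   # bin is labeled 1
        expected += c0 * pi * w + c1 * qi * (1.0 - w)
    return expected

def approx_variance_full(n, c0, p, q):
    """Variance approximation that neglects covariances between bins."""
    c1 = 1.0 - c0
    var = 0.0
    for pi, qi in zip(p, q):
        w = prob_v_gt_u_full(n, c0 * pi, c1 * qi)
        var += (c0 * pi - c1 * qi) ** 2 * w * (1.0 - w)
    return var

# Hypothetical model with b = 4 bins (Bayes error 0.3).
c0 = 0.5
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
e1 = expected_error_full(1, c0, p, q)
e20 = expected_error_full(20, c0, p, q)
print(e1, e20, approx_variance_full(20, c0, p, q))
```

The double loop costs on the order of $n^2$ operations per bin, which is what makes the first-order quantities cheap compared with the second-order terms of the exact variance.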

Analytical Study of Error Estimators
Similar exact expressions can be derived for the expectation and variance of the resubstitution and leave-one-out error estimators, as well as their correlation with the actual error; see [29,30]. These exact expressions allow one to compute exactly the bias, deviation variance, and RMS of both resubstitution and leave-one-out. This is illustrated in Fig. (4), where results for resubstitution (resub) and leave-one-out are compared with those for cross-validation and 0.632 bootstrap error estimators. The results show that resubstitution is the most optimistically biased estimator, with bias that increases with complexity, but it is also much less variable than all other estimators, including the bootstrap ones. The cross-validation estimators are the most variable, but are nearly unbiased. The bootstrap estimator is affected by the bias of resubstitution when complexity is high, since it incorporates the resubstitution estimate in its computation, but it is clearly superior to the cross-validation estimators in RMS. Perhaps the most remarkable observation is that, for very low complexity classifiers (around $b = 4$), the simple resubstitution estimator becomes more accurate than the cross-validation error estimators, and as accurate as the 0.632 bootstrap error estimator, according to RMS, despite the fact that resubstitution is typically much faster to compute than those other error estimators (in some cases considered in [26], hundreds of times faster). In our experiments, we observed that this is true for small sample sizes ($n < 30$), low complexity, and low to moderate expected classification errors. This has an important consequence for the inference of genomic Boolean regulatory networks: if the number of Boolean predictors for a particular gene is small (on the order of 2 or 3), then it is more advantageous to use resubstitution to estimate prediction accuracy than more complicated error estimation schemes.
Analytical exact expressions for the correlation between actual and estimated errors can also be derived [30]. This is illustrated in Fig. (5), where the correlation for resubstitution and leave-one-out error estimators is plotted versus sample size, for different bin sizes. In this example, we assume full sampling and the Zipf parametric model mentioned earlier. We can observe that the correlation is generally low (below 0.3). We can also observe that at small sample sizes, correlation for resubstitution is larger than for leave-one-out cross-validation, and, with a larger difficulty of classification, this is true even at moderate sample sizes. Correlation generally decreases with increasing bin size; in one striking case, the correlation for leave-one-out becomes negative, at the critical small-sample situation of $n = 20$ and $b = 32$. This behavior of the correlation for leave-one-out mirrors the behavior of the deviation variance of this error estimator, which is known to be large under complex models and small sample sizes [13,26,31], and is in accord with (6).

Complete Enumeration Methods
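As mentioned previously, all the performance metrics of interest for the actual error $\varepsilon_n$ and any given error estimator $\hat{\varepsilon}_n$ can be obtained from their joint sampling distribution. This distribution is not readily obtained with the analytical approach of the previous subsections, due to the complexity of the expressions involved.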
However, due to the finiteness of the discrete problem, it turns out that the joint sampling distribution of actual and estimated errors in the discrete case can be computed exactly by means of complete enumeration. Such methods have been extensively used in categorical data analysis [16][32][33][34][35]; complete enumeration has been particularly useful in the computation of exact distributions and critical regions for tests based on contingency tables, as in the case of the well-known Fisher exact test and the chi-square approximate test [32,33].
Basically, complete enumeration relies on intensive computational power to list all possible configurations of the discrete data and their probabilities, and from this exact statistical properties of the methods of interest are obtained. The availability of efficient algorithms to enumerate all possible cases on fast computers has made possible the use of complete enumeration in an increasingly wider variety of settings.
In the case of discrete classification, recall that the random sample is specified by the vector of bin counts $(U_1, \ldots, U_b, V_1, \ldots, V_b)$, which takes values in a finite configuration space $D_n$. The joint sampling distribution of the actual and estimated errors is obtained by adding up the probabilities of all configurations in $D_n$ that produce a given pair of values. Even though the configuration space $D_n$ is finite, it quickly becomes huge with increasing sample size $n$ and bin size $b$. In [29] an algorithm is given to traverse $D_n$ efficiently, which leads to reasonable computational times to evaluate the joint sampling distribution when $n$ and $b$ are not too large. This is illustrated in Fig. (6) for resubstitution and leave-one-out, under a Zipf probability model of intermediate difficulty (Bayes error = 0.2). One can observe that the joint distribution for resubstitution is much more compact than for leave-one-out cross-validation, which explains in part its larger correlation in small-sample cases.
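For very small problems, complete enumeration can be written naively by listing every vector of bin counts that sums to $n$. The sketch below is not the efficient traversal algorithm of [29]; it is a brute-force illustration that assumes full sampling (so that the $2b$ counts are multinomial with cell probabilities `c0*p_i` and `c1*q_i`), ties broken in favor of class 0, and resubstitution as the error estimator. The model and all names are hypothetical.

```python
from collections import defaultdict
from math import factorial

def compositions(n, parts):
    """Yield every tuple of `parts` nonnegative integers that sums to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest

def joint_distribution(n, c0, p, q):
    """Exact joint distribution of (actual error, resubstitution estimate)."""
    b = len(p)
    cell = [c0 * pi for pi in p] + [(1.0 - c0) * qi for qi in q]
    joint = defaultdict(float)
    for counts in compositions(n, 2 * b):
        prob = float(factorial(n))            # multinomial probability
        for k, pr in zip(counts, cell):
            prob *= pr ** k / factorial(k)
        u, v = counts[:b], counts[b:]
        # A bin labeled 1 (v > u) errs on class 0; otherwise it errs on class 1.
        actual = sum(cell[i] if v[i] > u[i] else cell[b + i] for i in range(b))
        resub = sum(min(ui, vi) for ui, vi in zip(u, v)) / n
        joint[(round(actual, 12), resub)] += prob
    return dict(joint)

# Hypothetical model with b = 2 bins (Bayes error 0.25) and n = 6.
joint = joint_distribution(6, 0.5, [0.7, 0.3], [0.2, 0.8])
total = sum(joint.values())
mean_actual = sum(e * pr for (e, r), pr in joint.items())
mean_resub = sum(r * pr for (e, r), pr in joint.items())
print(total, mean_resub, mean_actual)
```

Any performance metric (bias, deviation variance, RMS, correlation, conditional quantities) can then be computed as a weighted sum over the entries of `joint`. The number of configurations grows very quickly with $n$ and $b$, which is why this naive version is only practical for tiny cases.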
This approach can be easily modified to compute the conditional sampling distribution of the actual error given the estimated error, $P(\varepsilon_n \mid \hat{\varepsilon}_n)$. This was done in [14] in order to find exact conditional metrics of performance for resubstitution and leave-one-out error estimators. Those included the conditional expectation $E[\varepsilon_n \mid \hat{\varepsilon}_n]$ and conditional upper confidence bounds for the actual error. This is illustrated in Fig. (7), where the aforementioned conditional metrics of performance for resubstitution and leave-one-out are plotted against the estimated error. The curves for the conditional expectation rise with the estimated error; they also exhibit the property that the conditional expected actual error is larger than the estimated error for small estimated errors and smaller than the estimated error for large estimated errors. A point to be noted is the flatness of the leave-one-out curves. This reflects the high variance of the leave-one-out estimator. Note that the 95% upper confidence bounds are nondecreasing with respect to increasing estimated error, as expected. The flat spots observed in the bounds result from the discreteness of the estimation rule (this phenomenon is more pronounced when the number of bins is smaller).

Distribution-Free Analysis of Performance
Note that the model parameters $p_i$ and $q_i$ must be nonnegative and satisfy the constraints $\sum_{i=1}^{b} p_i = 1$ and $\sum_{i=1}^{b} q_i = 1$. [...] so that the accuracy margin over the no-information value of 0.5 vanishes as $1/b$. This implies that the decrease is exponential in the number of features, as can be gleaned from Fig. (8).
Note that peaking ceases to occur as $n \to \infty$, which corresponds to the Bayes accuracy (see the next Section). This must be the case, since the Bayes accuracy is known to be nondecreasing in the number of features. The expression for the average Bayes accuracy in the case $c_0 = 0.5$ is simple; as shown in [4], this is given by $(3b - 2)/(4b - 2)$, with an asymptotic value (as $b \to \infty$) of 0.75 (it is shown in [4] that, for general $c_0$, this asymptotic value is equal to $1 - c_0(1 - c_0)$). This relatively small value highlights the conservative character of Hughes' distribution-free approach; for example, in practice, where one deals with a fixed distribution of the data, the optimal number of features would typically be larger than the ones observed in Fig. (8), so that sample size recommendations based on this analysis tend to be pessimistic, a fact that was pointed out in [37]. Nevertheless, the qualitative behavior of the analysis is entirely correct.
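The average Bayes accuracy over the model space can be checked by simulation. The sketch below assumes that "equally likely models" means that the bin probability vectors of the two classes are drawn independently and uniformly from the probability simplex; under that assumption, a short calculation gives $(3b - 2)/(4b - 2)$ for $c_0 = 0.5$, whose limit is the value 0.75 quoted above. The function names, the seed and the number of trials are arbitrary choices made for illustration.

```python
import random

def uniform_simplex(b, rng):
    """Draw a probability vector uniformly from the simplex
    (normalized independent exponential variables)."""
    e = [rng.expovariate(1.0) for _ in range(b)]
    s = sum(e)
    return [x / s for x in e]

def average_bayes_accuracy(b, c0, trials, rng):
    """Monte Carlo average of the Bayes accuracy over random models."""
    c1 = 1.0 - c0
    total = 0.0
    for _ in range(trials):
        p = uniform_simplex(b, rng)
        q = uniform_simplex(b, rng)
        total += sum(max(c0 * pi, c1 * qi) for pi, qi in zip(p, q))
    return total / trials

def closed_form(b):
    """Average Bayes accuracy for c0 = 0.5 under the uniform-model assumption."""
    return (3 * b - 2) / (4 * b - 2)

rng = random.Random(0)
estimate = average_bayes_accuracy(4, 0.5, 20000, rng)
print(estimate, closed_form(4), closed_form(10 ** 6))
```

With a single bin the closed form gives 0.5, the no-information value, and it increases monotonically toward 0.75 as the number of bins grows.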

Performance of Ensemble Methods in Discrete Classification
In [38], Braga-Neto and Dougherty carried out an analysis of the performance of ensemble classification methods [39,40] when applied to the discrete histogram rule, which provided evidence that such ensemble methods may be largely ineffective in discrete classification. Part of the analysis is similar to the work of Hughes, discussed in the previous subsection, in the sense that it examines the average performance over the model space, assuming equally-likely models. Two methods were considered, namely, the jackknife and bagging ensemble classification rules obtained from the discrete histogram rule. Briefly, ensemble methods are based on perturbing the training data, designing an ensemble of classifiers based on the perturbed data sets using a given base classification rule (in this case, the discrete histogram rule), and aggregating the individual decisions to obtain the final classifier. Data perturbation is often accomplished by resampling methods such as the jackknife [41] and bootstrap [23] (the latter case being known as "bagging" [40]), whereas aggregation is done by means of majority voting among the individual classifier decisions. For the jackknife majority-vote classification rule, it was shown in [38] that, under full sampling and equally-likely classes, the best gain in performance (i.e., decrease in expected classification error) over all models in the model space (for the given prior $c_0$) is smaller than the worst deficit (i.e., increase in expected classification error).
Any discrepancy in performance, however, disappears as sample size increases; in particular, the following bound is shown to hold: [...]
Regarding the bagging case, it is shown in [38] that, given the training data, and for any sample size, number of cells, or distribution of the data, the random bagging classifier converges to the original discrete histogram classifier with probability 1 as the number of classifiers in the ensemble $m$ increases; furthermore, [38] also gives the following exponential bound on the absolute difference [...] Repository. The expected classification error for the bagging classifier is found by means of a Monte-Carlo computation using 100,000 simulated training sets, assuming full sampling. The Monte-Carlo computation introduces the wobble visible in the plots (even at this very large number of simulated training sets). Also indicated, by means of dashed horizontal lines, are the exact expected errors of the base discrete histogram classification rule. We can see that in all cases bagging leads to a larger expected classification error than the base classification rule, although the deviation quickly converges to zero in each case, in agreement with equation (20) above.
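The bagging construction described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in [38]: the data are represented as `(bin, label)` pairs, ties in bin counts go to class 0, and voting ties (possible for even `m`) also go to class 0. All function names are hypothetical.

```python
import random

def histogram_classifier(data, b):
    """Discrete histogram rule: majority label in each of the b bins (ties to class 0)."""
    U, V = [0] * b, [0] * b
    for x, y in data:
        if y == 0:
            U[x] += 1
        else:
            V[x] += 1
    return [0 if U[i] >= V[i] else 1 for i in range(b)]

def bagging_classifier(data, b, m, rng):
    """Majority vote over m histogram classifiers designed on bootstrap samples."""
    n = len(data)
    votes_for_1 = [0] * b
    for _ in range(m):
        # Bootstrap replicate: n draws with replacement from the training data.
        boot = [data[rng.randrange(n)] for _ in range(n)]
        psi = histogram_classifier(boot, b)
        for i in range(b):
            votes_for_1[i] += psi[i]
    return [1 if 2 * votes_for_1[i] > m else 0 for i in range(b)]

if __name__ == "__main__":
    data = [(0, 0)] * 20 + [(0, 1)] * 2 + [(1, 0)] * 3 + [(1, 1)] * 18
    rng = random.Random(0)
    print(histogram_classifier(data, 2), bagging_classifier(data, 2, 101, rng))
```

When the bin counts of the two classes are well separated, almost every bootstrap replicate reproduces the base decision, so the vote simply returns the base classifier; this is the intuition behind the convergence result quoted above.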

LARGE-SAMPLE PERFORMANCE OF DISCRETE CLASSIFICATION
Large-sample analysis of performance has to do with the behavior of the classification error and error estimators as sample size increases without bound, i.e., as $n \to \infty$. From a practical perspective, one expects performance to improve, and eventually reach an optimum, as more time and cost are devoted to obtaining an increasingly large number of samples. It turns out that not only is this true for the discrete histogram rule, but it is also possible in several cases to obtain fast (exponential) rates of convergence. Critical results in this area are due to Cochran and Hopkins [1], Glick [6,42], and Devroye, Gyorfi and Lugosi [13]. We briefly review these results in this Section.
Recall the bin counts $U_i$ and $V_i$ introduced in Section 3.
By a straightforward application of the Strong Law of Large Numbers (SLLN) [43], we obtain that [...]; that is, the discrete histogram classifier designed from sample data converges to the optimal classifier over each bin, with probability 1. This is a distribution-free result, so it is true regardless of the joint distribution of predictors $X$ and target $Y$, as the SLLN itself is distribution-free. One says then that the discrete histogram rule is universally strongly consistent [13].
The exact same argument, in connection with eqs. (2), (5) and (8), shows that
$$\lim_{n \to \infty} \varepsilon_n = \lim_{n \to \infty} \hat{\varepsilon}_n^{\,r} = \varepsilon^* \quad \text{with probability 1,} \tag{22}$$
so that the classification error, and also the apparent error, converge to the optimal Bayes error as sample size increases. These results are all based on the SLLN (and are thus distribution-free). The question arises as to the speed with which the limits are attained, as the SLLN can yield notoriously slow rates of convergence. This is not only a theoretical question, as the usefulness in practice of such results may depend on how large the sample size needs to be to guarantee that the discrete classifier or error estimator is close enough to optimality. The answer is that exponential rates of convergence can be obtained, if one is willing to drop the distribution-free requirement. Otherwise, polynomial rates of convergence can be established. These results are briefly reviewed below.
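The consistency results above can be observed in a small simulation. The sketch below assumes a fixed model with class prior `c0` and bin probabilities `p` (class 0) and `q` (class 1), with ties in bin counts broken toward class 0; the function names and the particular model in the usage loop are illustrative assumptions.

```python
import random

def sample_counts(n, c0, p, q, rng):
    """Draw n labelled points; return per-bin counts U (class 0) and V (class 1)."""
    b = len(p)
    bins = list(range(b))
    U, V = [0] * b, [0] * b
    for _ in range(n):
        if rng.random() < c0:
            U[rng.choices(bins, weights=p)[0]] += 1
        else:
            V[rng.choices(bins, weights=q)[0]] += 1
    return U, V

def histogram_classifier(U, V):
    """Label assigned to each bin by the discrete histogram rule (ties to class 0)."""
    return [0 if u >= v else 1 for u, v in zip(U, V)]

def bayes_classifier(c0, p, q):
    """Optimal label for each bin when the model is known."""
    c1 = 1.0 - c0
    return [0 if c0 * pi >= c1 * qi else 1 for pi, qi in zip(p, q)]

def classification_error(psi, c0, p, q):
    """Exact error of a bin-labelling psi under the model."""
    c1 = 1.0 - c0
    return sum(c1 * qi if label == 0 else c0 * pi
               for label, pi, qi in zip(psi, p, q))

def resubstitution_error(U, V):
    """Apparent error: fraction of training points the designed classifier misses."""
    return sum(min(u, v) for u, v in zip(U, V)) / (sum(U) + sum(V))

if __name__ == "__main__":
    c0, p, q = 0.5, [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]
    bayes_err = classification_error(bayes_classifier(c0, p, q), c0, p, q)
    rng = random.Random(0)
    for n in (20, 200, 2000, 20000):
        U, V = sample_counts(n, c0, p, q, rng)
        psi = histogram_classifier(U, V)
        print(n, classification_error(psi, c0, p, q),
              resubstitution_error(U, V), bayes_err)
```

For small `n` the designed classifier may differ from the optimal one in bins where the two class probabilities are close; for large `n` the actual error and the resubstitution estimate both settle at the Bayes error.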
Regarding the discrete histogram rule, with the proviso that ties in bin counts are assigned a class randomly (with equal probability), it is shown in [...] where the constant $c > 0$ is distribution-dependent: [...] Interestingly, the number of bins does not figure in this bound. The speed of convergence of the bound is determined by the minimum (nonzero) difference between the probabilities $c_0 p_i$ and $c_1 q_i$ over any one cell. The larger this difference is, the larger $c$ is, and the faster convergence is.
Conversely, the presence of a single cell where these probabilities are close slows down convergence of the bound.
On the other hand, a distribution-free bound is provided by [13, [...]] [...] (26) provided that there is no cell over which [...].
The previous results on the discrete histogram rule concern expectation and bias. In [13], (distribution-free) results on variance and RMS are also given, both for resubstitution and leave-one-out (here, the convention we have adopted of breaking ties in the direction of class 0 is again in effect). For the resubstitution error estimator, one has the following bounds [...] (27) and
$$\mathrm{RMS}(\hat{\varepsilon}_n^{\,r}) \le \sqrt{\frac{6b}{n}} \tag{28}$$
In particular, both quantities converge to zero as sample size increases. For the leave-one-out error estimator, one has the following bound [...] (29) This guarantees, in particular, convergence to zero as sample size increases.
An important factor in the comparison of the resubstitution and leave-one-out error estimators for discrete histogram classification resides in the different speeds of convergence of the RMS to zero. Convergence of the RMS bound for the resubstitution estimator is [...] on the RMS of leave-one-out. Therefore, in the worst case, the RMS of leave-one-out is guaranteed to decrease to zero as $n^{-1/4}$, and is therefore certain to decrease more slowly than the RMS of resubstitution. Note that the poor RMS of leave-one-out is due almost entirely to its large variance, typical of the cross-validation approach, since this estimator is essentially unbiased.
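The RMS comparison above can be explored empirically for a given model. The following sketch computes both estimators directly from the bin counts and estimates their RMS by Monte Carlo. It assumes ties are broken toward class 0 and reads the distribution-free resubstitution bound as $\sqrt{6b/n}$; the function names and the model used in any call are illustrative assumptions.

```python
import math
import random

def sample_counts(n, c0, p, q, rng):
    """Draw n labelled points; return per-bin counts U (class 0) and V (class 1)."""
    b = len(p)
    bins = list(range(b))
    U, V = [0] * b, [0] * b
    for _ in range(n):
        if rng.random() < c0:
            U[rng.choices(bins, weights=p)[0]] += 1
        else:
            V[rng.choices(bins, weights=q)[0]] += 1
    return U, V

def resubstitution_error(U, V):
    return sum(min(u, v) for u, v in zip(U, V)) / (sum(U) + sum(V))

def leave_one_out_error(U, V):
    errors = 0
    for u, v in zip(U, V):
        if u <= v:       # deleting a class-0 point leaves u-1 < v: bin predicts 1
            errors += u
        if v <= u + 1:   # deleting a class-1 point leaves v-1 <= u: bin predicts 0
            errors += v
    return errors / (sum(U) + sum(V))

def true_error(U, V, c0, p, q):
    c1 = 1.0 - c0
    return sum(c1 * qi if u >= v else c0 * pi
               for u, v, pi, qi in zip(U, V, p, q))

def monte_carlo_rms(n, c0, p, q, reps=2000, seed=0):
    """Estimated RMS of (resubstitution, leave-one-out) around the actual error."""
    rng = random.Random(seed)
    sq_r = sq_l = 0.0
    for _ in range(reps):
        U, V = sample_counts(n, c0, p, q, rng)
        err = true_error(U, V, c0, p, q)
        sq_r += (resubstitution_error(U, V) - err) ** 2
        sq_l += (leave_one_out_error(U, V) - err) ** 2
    return math.sqrt(sq_r / reps), math.sqrt(sq_l / reps)

def resub_rms_bound(b, n):
    """Distribution-free bound on the RMS of resubstitution (assumed form)."""
    return math.sqrt(6.0 * b / n)
```

Because the bound is distribution-free, it is typically loose for any particular model; which estimator has the smaller simulated RMS depends on the model, the number of bins, and the sample size.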

BINARY COEFFICIENT OF DETERMINATION (COD)
In classical regression analysis, the coefficient of determination (CoD) gives the relative decrease in unexplained variability when entering a variable $X$ into the regression of the dependent variable $Y$, in comparison with the total unexplained variability when entering no variables:
$$\mathrm{CoD} = \frac{SS_Y - SS_{XY}}{SS_Y} \tag{30}$$
where $SS_Y$ and $SS_{XY}$ are the sums of squared errors associated with entering no variables and entering variable $X$ to predict $Y$, respectively. The term $SS_Y$ is proportional to the total variance $\sigma_Y^2$, which is the error around the mean $\mu_Y$ (so that entering no variables in the regression corresponds to using the mean as the predictor).
In classification, a very similar concept was introduced in [...] (31) [...] using feature vector $X$ to predict $Y$. By convention, one assumes $0/0 = 1$ in the above definition. This binary coefficient of determination measures the relative decrease in prediction error of a target variable when using predictor variables, relative to using no predictor variables; notice the remarkable similarity between (30) and (31).
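As a concrete illustration, the binary CoD can be computed for a known discrete model. The sketch below assumes the definition has the same form as (30), with the sums of squares replaced by the Bayes error without predictors and the Bayes error with predictors; the symbols `eps_y` and `eps_xy` and the function names are assumptions introduced here.

```python
def bayes_error_no_predictors(c0):
    """Error of the best constant prediction of Y: the smaller class prior."""
    return min(c0, 1.0 - c0)

def bayes_error_with_predictors(c0, p, q):
    """Bayes error when the (discrete) predictor bin is observed."""
    c1 = 1.0 - c0
    return sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))

def binary_cod(eps_y, eps_xy):
    """Relative decrease in prediction error, with the convention 0/0 = 1."""
    if eps_y == 0.0:
        return 1.0
    return (eps_y - eps_xy) / eps_y

if __name__ == "__main__":
    c0, p, q = 0.5, [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]
    eps_y = bayes_error_no_predictors(c0)
    eps_xy = bayes_error_with_predictors(c0, p, q)
    print(eps_y, eps_xy, binary_cod(eps_y, eps_xy))
```

A CoD near 1 means the predictors remove almost all of the error incurred when predicting the target with no observations, while a CoD near 0 means they add essentially nothing.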
The binary CoD was perhaps the first predictive paradigm utilized in the context of microarray data, the goal being to provide a measure of nonlinear interaction among genes [10]. Even though the binary CoD, as defined in (31), has general application in classification, it has been extensively used in the case of discrete classification and prediction, particularly in problems dealing with gene expression quantized into discrete levels [8,44] (see the examples given in Section 2), and in the inference of gene regulatory networks [11,12]. Like its classical counterpart, the binary CoD is a goodness-of-fit statistic that can be used to assess the relationship between predictor and target variables (e.g., how tight the association between a set of predictor genes and a target gene is).
Even though the definition above employs Bayes errors, the CoD can be likewise defined in terms of the classification error of predictors designed from sample data, using for example the discrete histogram rule. In addition, the actual classification errors will typically need to be computed through error estimation techniques; e.g., one may speak of resubstitution and leave-one-out CoD estimates. All the issues discussed in previous sections regarding classification and error estimation for discrete data generally apply here.
A recent paper [45] defined and studied the concept of intrinsically multivariate predictive (IMP) genes using the binary CoD. Briefly, IMP genes are those the expression of which cannot be predicted well by any subset of binary predicting gene expressions, but is predicted very well by the entire set. In [45], the properties of IMP genes were characterized analytically, and it was shown that high predictive power, small covariance among predictors, a large entropy of the joint probability distribution of predictors, and certain logics, such as XOR in the 2-predictor case, are factors that favor the appearance of IMP. In addition, quantized gene-expression microarray data were employed to show that the gene DUSP1, which exhibits control over a central, process-integrating signaling pathway, exhibits IMP behavior, thereby providing preliminary evidence that IMP can be used as a criterion for discovery of canalizing genes, i.e., master genes that constrain ("canalize") large gene-expression pathways [46].
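The XOR case mentioned above makes the IMP idea easy to verify numerically. The sketch below assumes the binary CoD is the relative decrease in Bayes error (with the convention 0/0 = 1) and represents the joint distribution as a dictionary mapping `(predictor_tuple, target)` to a probability; the representation and function names are illustrative assumptions, not the formulation of [45].

```python
from itertools import combinations

def cod_for_subset(joint, subset):
    """Binary CoD of the target given only the predictors indexed by `subset`."""
    prior = [0.0, 0.0]
    mass = {}
    for (x, y), pr in joint.items():
        prior[y] += pr
        key = tuple(x[i] for i in subset)       # marginalize the other predictors
        mass.setdefault(key, [0.0, 0.0])[y] += pr
    eps_y = min(prior)                           # error with no predictors
    eps_xy = sum(min(m) for m in mass.values())  # Bayes error with the subset
    return 1.0 if eps_y == 0.0 else (eps_y - eps_xy) / eps_y

def all_subset_cods(joint, num_predictors):
    """CoD for every nonempty subset of predictors."""
    cods = {}
    for k in range(1, num_predictors + 1):
        for subset in combinations(range(num_predictors), k):
            cods[subset] = cod_for_subset(joint, subset)
    return cods

if __name__ == "__main__":
    # Two uniform binary predictors; the target is their XOR.
    xor_joint = {((a, c), a ^ c): 0.25 for a in (0, 1) for c in (0, 1)}
    print(all_subset_cods(xor_joint, 2))
```

In this example each predictor alone is uninformative about the target, whereas the pair determines it completely, which is the signature of intrinsically multivariate prediction.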

CONCLUSION
The importance of discrete classification in Genomics lies in its broad application in problems of phenotype classification based on panels of gene-expression biomarkers and inference of gene regulatory networks from gene-expression data, where data discretization is often employed for data efficiency and classification accuracy reasons. This paper presented a broad review of methods of classification and error estimation for discrete data, focusing for the most part on the discrete histogram rule, which is the classification rule most employed in practice for discrete data, due to its excellent properties, such as low complexity and small data requirement (under small number of cells), and universal consistency. The most important criterion for performance is the classification error, which can be computed exactly only if the underlying distribution of the data is known. In practice, robust error estimation methods must be employed to obtain reliable estimates of the classification error based on available sample data. This paper reviewed analytical and empirical results concerning the performance of discrete classifiers (in terms of the classification error) as well as of error estimators for discrete classification. Those results were categorized into small-sample results (small-sample data being prevalent in Genomics applications) and large-sample (i.e., asymptotic) results. The binary Coefficient of Determination was also reviewed briefly; it provides a measure of nonlinear interaction among genes and is therefore very useful in the inference of gene regulatory networks. Progress in classification and error estimation for discrete data, particularly the analysis of performance in small-sample cases, has a clear potential to lead to genuine advances in Genomics and Medicine, and therefore the study of such methods is a topic of considerable research interest at present.