Overview of Machine Learning Process Modelling

Much research has been conducted in the area of machine learning algorithms; however, the question of a general description of an artificial learner’s (empirical) performance has mainly remained unanswered. A general, restrictions-free theory on its performance has not been developed yet. In this study, we investigate which function most appropriately describes learning curves produced by several machine learning algorithms, and how well these curves can predict the future performance of an algorithm. Decision trees, neural networks, Naïve Bayes, and Support Vector Machines were applied to 130 datasets from publicly available repositories. Three different functions (power, logarithmic, and exponential) were fit to the measured outputs. Using rigorous statistical methods and two measures for the goodness-of-fit, the power law model proved to be the most appropriate model for describing the learning curve produced by the algorithms in terms of goodness-of-fit and prediction capabilities. The presented study, first of its kind in scale and rigour, provides results (and methods) that can be used to assess the performance of novel or existing artificial learners and forecast their ‘capacity to learn’ based on the amount of available or desired data.


Introduction
Ever since the advent of machine-stored data, there have been problems with the amount of data and the ability to store and process the data. In his seminal paper, E. F. Codd introduced the concept of relational databases because it was needed for "protecting users of formatted data systems from the potentially disruptive changes in data representation caused by growth in the data bank and changes in traffic" [1]. Back in the 1970s, Codd defined a "large" database as one having tables with 30 or more attributes.
Twenty years later, the concept of 'data mining' was introduced as a method of knowledge discovery in databases [2,3]. There was a general recognition that there is untapped value in greater collections of data and that such structures are indeed useful not only as repositories of atomic pieces of information, but rather that the database as a whole provides a lot of information, which can be used to guide business decisions and ultimately lead to a competitive advantage.
The general recognition was that novel approaches need to be implemented to mine the value from the data vaults. Typically, machine learning and artificial intelligence tools were employed. Researchers and practitioners examined various methods that were available and used them on a single dataset, namely the one they were trying to conquer. Very little research was done on the general applicability of these methods, which is why researchers were choosing the appropriate method by a trial-and-error approach. Once the problem was successfully solved, authors rarely investigated further possible improvements in a systematic way.
The possible improvements of a machine learner's performance could come in several ways. Firstly, by improving and optimizing the algorithm itself. Secondly, by changing the internal parameters of a selected algorithm. Thirdly, the performance could be further improved by employing larger amounts of data. Ideally, the algorithm's output could be analytically determined as a function of these three tactics. Different theoretical approaches provide estimates for the size of the confidence interval on the training error under various settings of the learning-from-examples problem. Vapnik-Chervonenkis (VC) theory [4] is the most comprehensive description of learning from examples. VC-theory provides guaranteed bounds on the difference between the training and generalization error. However, it has serious limitations, such as that it is applicable only to simple algorithms with a fixed 'capacity', and requires an oracle that is never wrong. Hence, it was never used in real-life implementations. On the other hand, standard numerical (and other statistical) methods become unstable when using large datasets [5]. Theoretical approaches are unable to provide answers as to how learning algorithms learn with a given input, thus creating a research gap. Additionally, there are no methods developed to describe an algorithm's performance on unseen data.
In this paper, we systematically explore the influence of the amount of data on the output of several machine learning algorithms and give a comprehensive description of their general performance. The research question is formulated as follows: (a) which of the models, in general, best describes the learning process of artificial learning algorithms, and (b) which of the models can most accurately predict the future performance?
The results of our study are significant for practitioners and developers of machine learning algorithms alike. Practitioners can use the results to verify if there is room for improvement of the generated model's performance if more data were available and estimate the costs associated with additional data acquisition and preparation. Developers can use the methodology presented in this paper when they compare their novel algorithm's performance with the existing ones in a systematic and rigorous way. To the best of our knowledge, the present study is the first one using several machine learning algorithms on such a large array of different datasets to obtain their performance envelopes.
The rest of the paper is organized as follows: the following section summarizes the existing research done in the field of learning curve approximation and prediction. In Section 3, the experimental setup is presented, while the results are discussed in Section 4. The paper is concluded with final remarks in Section 5.

Related Works
Mathematical descriptions of human cognitive abilities have already been the subject of substantial research. The idea behind this approach is based on the assumption that an existing mathematical function can be used to describe an individual learning curve, obtained from a given dataset. That is achieved by fitting the underlying parametric model to the learning curve in order to estimate it. Various mathematical functions have been studied extensively in literature in order to find the best parametric model to (a) interpolate the learning curve over the span of observed values, and (b) extrapolate the remainder of the curve beyond the range of known values.
Existing studies largely disagree on the most appropriate parametric model to describe and predict the learning process. Earlier studies often employed linear functions as benchmarks in their comparisons against other potential mathematical functions [6][7][8]. Although linear functions were largely considered insufficient in their ability to describe the acquisition of new knowledge, there were nonetheless isolated cases in which they were demonstrated to provide the best goodness-of-fit. The logarithmic function was shown to be more promising. In [7], it achieved the best fit on four datasets. However, the measure was bound to the first portion of the learning curve, and was expected to perform worse for new points of data due to the function's inflexibility.
The power function is often considered to be an established way of describing the acquisition of new knowledge [7][8][9]. As a result, the term power law has appeared [10]. In the last twenty years, however, some studies [11][12][13] have suggested that the power law arose as a result of averaging exponential curves. Using various simulations, the authors of the mentioned papers showed that if we monitor the progress of several students, their collective learning curve will be more similar to the power law, even if individuals learn according to the exponential law [11]. The same applies to non-trivial learning tasks, which can be divided into several differently demanding sub-tasks (e.g., when learning a foreign language, we are dealing with words and grammatical concepts of varying complexity). The mentioned research papers claim that while the progress of individual sub-tasks corresponds to the exponential law, the final progress of the entire learning task is in accordance with the power law, due to the effect of averaging.
In the existing literature on machine learners, the power law has most commonly been considered the parametric model that offers the best fit. Frey and Fisher trained decision trees and found that on 12 out of 14 datasets, the power law achieved the highest goodness-of-fit [6]. In [9,14], a three-parameter power function was compared to several simple and complex mathematical functions and was found to perform best in most comparisons. An extended power law has been empirically shown to yield a well-fitting learning curve for the analysis of various parameters such as error and data-reliance in deep networks [15]. Recently, the power law has been employed for learning curve fitting in the deep learning [16], natural language [17], medicine [18], and renewable energy [19] domains.
However, other mathematical functions have also been successfully utilized in the literature. An inverse power law was fit to a learning curve constructed on a small amount of data [20]. The authors then explored how well the estimated learning curve fit the entire learning curve on three large, imbalanced datasets, showing that the inverse power law is a suitable fitting method for big data. An exponential model has been used to follow and predict the spread of COVID-19 [21]. A weighted probabilistic learning curve model composed of several individual parametric models (including exponential and logarithmic) was empirically demonstrated to successfully extrapolate the performance for the purpose of deep neural network hyperparameter optimization [22].
Such cases suggest that the power law is not necessarily the most appropriate parametric model in all settings. The empirical evidence suggests that the choice may be dependent on the dataset and its properties, the classification learning method [23], the learning curve construction and fitting parameters, and other activities, such as pretraining and fine-tuning [15]. Generally speaking, the defined problem is the one that determines the shape of the learning curve; and while for most problems it is possible to determine the best-fitting model, there are special cases for which the shape of the curve is difficult to characterize [24]. For the time being, there is no ideal parametric model that would be generally applicable in all situations, particularly in such cases as ill-behaved learning curves. However, it might be possible to identify a parametric model with sufficient flexibility and predictive ability [25].
One branch of research focuses on how a chosen parametric model can be adjusted in order to better fit the learning curve. For example, Jaber et al. improve the traditional power law model by taking into account the variable degree of memory interference that occurs across the repetitions that represent the learning-forgetting process [26]. In a study by Tae and Whang, a framework called Slice Tuner was proposed, which iteratively updates learning curves with the acquisition of new data in order to improve model accuracy and fairness [27].
Another approach is to empirically analyse the performance of an individual learning algorithm on as many datasets as possible. A series of statistical analyses can then be performed on the obtained results so that conclusions can be drawn from them. Frey and Fisher [6] measured the performance of decision trees and found that the shape of the learning curve can be described by the power law. Although many authors are in agreement with their findings [8,28], there are some that reject these claims [7]. In this paper, we improve and extend the empirical research carried out in [29]. The description, implementation and results of the experiment are described in the following sections.

Experiment Design
As evident from the related work section of the paper, the two most commonly used models for describing learning curves are the power and exponential models. In [29], four models were used in total, namely the power, linear, logarithmic, and exponential models. However, the results showed that the linear model was not appropriate for describing the learning curve due to the basic shape of the linear function. Based on that, the linear model was excluded from this experiment. This experimental decision also reduced the overall complexity of the experiment, as fewer pairwise comparisons had to be conducted in the statistical analysis.
A few important improvements were introduced to the original experimental design [29]. The initial collection of datasets was expanded to 130 in total. The full list of datasets can be found in Appendix A. The work in [29] focused on finding the best-fit learning curve for the C4.5 algorithm, which is a well-known implementation of decision trees. In this study, however, three additional classification algorithms were utilized: neural networks, Naïve Bayes, and support vector machines (SVM).
A more appropriate filtering of the constructed learning curves was also introduced. Since many learning curves were ill-behaved as a result of too fine a granularity, a larger step increase had to be used. However, a coarser divide decreases the number of data points, which can be problematic when fitting learning curves constructed from smaller datasets. A balance between a fine and a coarse divide was sought. The initial step of 10 instances [30] added to each subsequent fold was shown to produce learning curves that were too fine-grained, hence it was increased to 25. This choice reduced the number of ill-behaved learning curves to a minimum, while at the same time allowing for smaller datasets to be employed in the experiment. A coarser divide into folds also resulted in a slightly lower computational complexity when generating the learning curves.
Next, a modified version of the coefficient of determination $R^2$ was employed. The coefficient of determination $R^2$ measures the goodness-of-fit of a statistical model. Its value determines the proportion of variance of a dependent variable that can be explained or predicted by the independent variables. A higher value means a higher goodness-of-fit [31]. This coefficient, sometimes referred to as the R-square, is usually used to fit linear models, but it can also be used to fit nonlinear models. Depending on the purpose of use, the procedures for calculating its value also differ, and in some cases the value of $R^2$ does not necessarily represent the square of a given value. Consequently, the values of this metric can also be negative.
Since different nonlinear functions were being fit with a different number of parameters, an adjusted coefficient of determination $R^2$, initially proposed by Theil [32], was chosen instead. The equation for calculating the coefficient is as follows:

$$R^2 = 1 - \frac{SS_{res} / df_e}{SS_{tot} / df_t}$$

The values $df_t$ and $df_e$ represent degrees of freedom: $df_t = n - 1$ and $df_e = n - p$, where $n$ represents the number of instances in the population, and $p$ represents the number of parameters of the fitted mathematical function (including the constant). The value $SS_{res}$ represents the sum of squares of residuals, and the value $SS_{tot}$ represents the total sum of squares. The two values can be calculated as:

$$SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad SS_{tot} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

where $Y_i$ represents the actual value, $\bar{Y}$ represents the average of the actual values, and $\hat{Y}_i$ represents the predicted value within the given model.
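As a minimal sketch, the adjusted coefficient can be computed directly from its definition, 1 - (SS_res / df_e) / (SS_tot / df_t). The function name `adjusted_r2` and its arguments are illustrative and are not taken from the experiment's Java code.

```python
def adjusted_r2(actual, predicted, n_params):
    """Adjusted R^2 = 1 - (SS_res / df_e) / (SS_tot / df_t).

    n_params is p, the number of fitted parameters including the constant.
    """
    n = len(actual)
    if n != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    df_t = n - 1          # total degrees of freedom
    df_e = n - n_params   # residual degrees of freedom
    if df_e <= 0:
        raise ValueError("need more points than parameters")
    mean_y = sum(actual) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - (ss_res / df_e) / (ss_tot / df_t)
```

A model that predicts no better than the mean while spending three parameters is penalised below zero, which illustrates why this metric can be negative: `adjusted_r2([1, 2, 3, 4, 5], [3, 3, 3, 3, 3], 3)` evaluates to -1.0.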
The use of the mean squared error (MSE) remained unchanged. The MSE is a measure for estimating the differences between the true value $Y_i$ and the predicted value $\hat{Y}_i$. It is defined as the mean of the squared differences between the two values [33]:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

The key difference between the metrics MSE and $R^2$ is that the former measures the exact deviation between the true and the predicted value, while the latter merely estimates the proportion of variance. It is recommended to use MSE for pairwise comparisons and statistical analyses, while $R^2$ is easier to understand and is more suitable for interpretation and presentation of the results.

Several changes were also made to the process of learning curve construction. The most important was the introduction of stratification. In this sampling method, the share of individual classes is calculated for the entire dataset. These proportions must then be maintained when creating subsets. Throughout the incremental addition of new instances, stratification ensures that each fold is a good representation of the entire dataset. It avoids an uneven distribution of instances into classes, which can happen in some cases when random sampling is employed.
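The idea of stratification can be illustrated with a short sketch that splits a dataset while keeping the class shares. It is an illustration only: the experiment itself relied on Weka, and the function name `stratified_split`, the `train_share` default and the seed are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_share=0.8, seed=0):
    """Split instance indices so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within a class
        cut = round(len(indices) * train_share)   # per-class share kept for training
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test
```

With 50, 30 and 20 instances in three classes, the training part receives 40, 24 and 16 instances respectively, so the 5:3:2 class ratio is preserved in both parts.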
In order to measure the accuracy, each dataset had to be divided into learning and test sets. Earlier studies employed k-fold cross-validation [28,29,34], with the number of folds typically set to 10. This approach measures the error rate in a 10-fold run, and averages the result over all 10 folds. In this study, the datasets were divided in the 80/20 ratio. This is the simplest and least computationally demanding approach, which has also proven to be considerably more stable compared to the k-fold CV [35], as shown in Figure 1.

Individual learning curves were fitted to the following parameterized mathematical functions: power ($f_{pow}(x)$), logarithmic ($f_{log}(x)$), and exponential ($f_{exp}(x)$):

$$f_{pow}(x) = p_1 + p_2 x^{p_3}$$

$$f_{log}(x) = p_1 + p_2 \log x$$

$$f_{exp}(x) = p_1 + p_2 e^{p_3 x}$$
It can be seen from the equations that the number of parameters $p_i$ differs between individual functions. All of them have the intercept parameter $p_1$ and the linear parameter $p_2$, while the exponential parameter $p_3$ is present only for the power and exponential functions. An example of fitting a learning curve with a power function is shown in Figure 2.
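The parameter structure described above can be sketched in code as follows. The functional forms are an assumption inferred from the description of the parameters (intercept $p_1$, linear coefficient $p_2$, exponent $p_3$), and the natural logarithm is assumed for the logarithmic model. Because the logarithmic model is linear in its two parameters, it also admits a closed-form least-squares fit, shown here as `fit_log` (a hypothetical helper, not part of the original tooling).

```python
import math


def f_pow(x, p1, p2, p3):
    return p1 + p2 * x ** p3


def f_log(x, p1, p2):
    return p1 + p2 * math.log(x)


def f_exp(x, p1, p2, p3):
    return p1 + p2 * math.exp(p3 * x)


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((y - f) ** 2 for y, f in zip(actual, predicted)) / len(actual)


def fit_log(xs, ys):
    """Ordinary least squares for f_log, using t = ln(x) as the regressor."""
    ts = [math.log(x) for x in xs]
    n = len(xs)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    p2 = s_ty / s_tt
    p1 = mean_y - p2 * mean_t
    return p1, p2
```

The power and exponential models are nonlinear in $p_3$, so they have no such closed form and need an iterative optimiser.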
In terms of statistical analysis of data, more appropriate statistical methods were employed compared to [29]. Initially, the distribution of the data was verified using Kolmogorov-Smirnov and Shapiro-Wilk normality tests, which showed that most datasets were not normally distributed. Instead of the classic t-tests and ANOVA, which assume normal distribution of data, we opted for their nonparametric alternatives, namely the Wilcoxon signed-rank test and Friedman's test. We decided against using Pearson's $\chi^2$ test to determine the goodness-of-fit because this statistical test is not suitable for noncategorical data. Holm-Bonferroni correction was used instead of Bonferroni correction to correct for type I errors [36].
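For readers unfamiliar with the Holm-Bonferroni procedure, the following minimal sketch computes Holm-adjusted p-values. The function name is illustrative; the statistical software actually used in the study is not stated in the text.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

Unlike the plain Bonferroni correction, which multiplies every p-value by m, only the smallest p-value receives the full factor here, which makes the procedure less conservative while still controlling the family-wise type I error rate.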

Experiment Execution
The first step in our experiment was to build the learning curves. Due to the large number of datasets and machine learning algorithms used, the learning curve construction process had to be fully automated.
For this purpose, a dedicated Java application that employed machine learning using the Weka Java API [37,38] was created. The application also took care of the preparation (stratification) and division of datasets into smaller (incremental) folds. Each individual learning curve, built for a specific dataset and a specific machine learning algorithm, was saved to a CSV file.
The construction of an individual learning curve was carried out according to the following procedure:

1. All instances in a given dataset are randomly rearranged.
2. The dataset is stratified before it can be divided in the 80/20 ratio.
3. The first 80% of the instances are separated from the main set to become the learning set. The remaining 20% of the instances comprise the test set.
4. All instances in the learning set are randomly rearranged.
5. The learning set is stratified before it can be divided into k folds. The number k is obtained by dividing the number of instances in the learning set by 25 and rounding the result down.
6. For each fold i ∈ {2, 3, 4, ..., k}, the following is executed:
   (a) The first n = 25 · i instances are separated from the learning set and named the learning subset.
7. All recorded values are saved to a CSV file.
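The construction procedure above can be sketched as follows. This is not the authors' Java/Weka application: `evaluate` is a hypothetical callback that would train a classifier on the given learning subset and return its measured value on the test set, and `stratified_order` is one assumed way of making every prefix of the learning set keep the class proportions.

```python
import random
from collections import defaultdict


def stratified_order(labels, rng):
    """Order positions so that every prefix roughly keeps the class shares."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    keyed = []
    for indices in by_class.values():
        rng.shuffle(indices)
        m = len(indices)
        # Spread each class evenly over the interval (0, 1).
        keyed.extend(((pos + 0.5) / m, idx) for pos, idx in enumerate(indices))
    keyed.sort()
    return [idx for _, idx in keyed]


def build_learning_curve(labels, evaluate, step=25, train_share=0.8, seed=0):
    rng = random.Random(seed)
    order = stratified_order(labels, rng)             # steps 1-2
    cut = round(len(order) * train_share)             # step 3: 80/20 split
    learning, test = order[:cut], order[cut:]
    positions = stratified_order([labels[i] for i in learning], rng)
    learning = [learning[j] for j in positions]       # steps 4-5
    k = len(learning) // step                         # number of folds, rounded down
    curve = []
    for i in range(2, k + 1):                         # step 6
        subset = learning[: step * i]                 # first n = 25 * i instances
        curve.append((len(subset), evaluate(subset, test)))
    return curve                                      # step 7 would write this to CSV
```

For a dataset of 500 instances, the learning set holds 400 instances, k is 16, and the curve contains 15 points at subset sizes 50, 75, ..., 400.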
After successfully creating all of the learning curves, the process of fitting the curves could begin. Another dedicated Java application was developed so that the entire process could be fully automated. It relied on the Apache Commons Mathematics Library, whose implementation of nonlinear curve fitting is based on the Levenberg-Marquardt algorithm, which works on the least squares principle [39].
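To convey the idea behind Levenberg-Marquardt fitting, the following is a deliberately small, self-contained sketch for a three-parameter power model of the assumed form p1 + p2 * x^p3. It is not the Apache Commons Math implementation: the damping schedule, the stopping rule and all function names are assumptions chosen for brevity.

```python
import math


def f_pow(x, p):
    return p[0] + p[1] * x ** p[2]


def sse(xs, ys, p):
    return sum((y - f_pow(x, p)) ** 2 for x, y in zip(xs, ys))


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for r in reversed(range(3)):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def fit_power_lm(xs, ys, start, max_iter=200, tol=1e-12):
    p, lam = list(start), 1e-3
    best = sse(xs, ys, p)
    for _ in range(max_iter):
        # Partial derivatives of the model with respect to p1, p2 and p3.
        jac = [(1.0, x ** p[2], p[1] * x ** p[2] * math.log(x)) for x in xs]
        res = [y - f_pow(x, p) for x, y in zip(xs, ys)]
        jtj = [[sum(r[i] * r[j] for r in jac) for j in range(3)] for i in range(3)]
        jtr = [sum(r[i] * e for r, e in zip(jac, res)) for i in range(3)]
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0)
                   for j in range(3)] for i in range(3)]
        try:
            step = solve3(damped, jtr)
            trial = [pi + si for pi, si in zip(p, step)]
            trial_sse = sse(xs, ys, trial)
        except (ZeroDivisionError, OverflowError):
            trial_sse = math.inf          # treat numerical failure as a bad step
        if trial_sse < best:              # accept the step and relax the damping
            improvement = best - trial_sse
            p, best, lam = trial, trial_sse, lam / 10
            if improvement < tol:
                break
        else:                             # reject the step and increase the damping
            lam *= 10
            if lam > 1e12:
                break
    return p, best
```

Because a step is accepted only when it lowers the sum of squared residuals, the returned error never exceeds that of the starting point. Convergence to the global optimum is not guaranteed, and the result depends on the starting values.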
Fitting of the individual learning curves was performed according to the following procedure:

1. The learning curve is read from the CSV file.
2. For each section of the learning curve i ∈ {1, 2, 3, 4}, the following is performed:
3. All recorded values are saved to a CSV file.
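The incremental, quartile-wise evaluation can be sketched as follows. To keep the example self-contained, only the logarithmic model (assumed form p1 + p2 * ln x) is used, since it has a closed-form least-squares fit; section i is assumed to mean the first i quarters of the curve. All names are illustrative.

```python
import math


def fit_log(xs, ys):
    """Closed-form least squares for y = p1 + p2 * ln(x)."""
    ts = [math.log(x) for x in xs]
    n = len(xs)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    p2 = s_ty / s_tt
    return mean_y - p2 * mean_t, p2


def mse(actual, predicted):
    return sum((y - f) ** 2 for y, f in zip(actual, predicted)) / len(actual)


def quartile_report(xs, ys):
    """Fit on the first i quarters, then score the fit and the whole curve."""
    n = len(xs)
    report = []
    for i in (1, 2, 3, 4):
        m = n * i // 4                              # points in the fitted section
        p1, p2 = fit_log(xs[:m], ys[:m])
        preds = [p1 + p2 * math.log(x) for x in xs]
        report.append({
            "quartile": i,
            "points": m,
            "mse_fit": mse(ys[:m], preds[:m]),      # quality of the fit
            "mse_predict": mse(ys, preds),          # quality of the extrapolation
        })
    return report
```

For a curve with 20 points, the four sections contain 5, 10, 15 and 20 points, and in the last section the two metrics coincide because the fitted portion is the whole curve.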
It is apparent from the above procedure that each learning curve was fitted in quartiles, thus simulating the incremental addition of knowledge in four major steps. For each quartile, the metrics MSE and $R^2$ were calculated twice. The first calculation was performed on the same points that were fitted, thus measuring the quality of the fit; the second calculation was performed on the entire learning curve, thus measuring the quality of the extrapolation of the learning curve (i.e., the prediction of the rest of the learning curve). It is necessary to point out that in the fourth quartile, the calculations are performed on the whole learning curve, which means that the prediction of the remaining learning curve was not feasible. In such cases, prediction could hypothetically be performed for scenarios in which we would like to know the future performance of the classifier if more data had been available. Based on that, it is possible to estimate the amount of data required to get the desired performance.

Due to the division into four quartiles, additional requirements regarding the choice of the learning curves were set. Each learning curve had to contain at least 20 points; in other words, the dataset needed to have at least 500 instances in total. In this case, the learning curves in the first quartile would have at least five points, which is two more than the absolute minimum necessary to fit a mathematical function with three parameters (such as the exponential and power functions). Due to this limitation, the number of datasets employed in the experiment ultimately varied between 79 and 130. The largest number of datasets was employed when an entire dataset was used for calculating the learning curve, without having to be split into quartiles (Filter = none).
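Estimating the amount of data required for a desired performance amounts to inverting the fitted curve. The sketch below does this for a power model of the assumed form p1 + p2 * x^p3; the function name and the example parameter values are hypothetical and do not come from the study's results.

```python
def required_instances(target, p1, p2, p3):
    """Solve p1 + p2 * x**p3 = target for x; return None if unreachable."""
    if p2 == 0 or p3 == 0:
        return None                       # the curve is flat
    ratio = (target - p1) / p2
    if ratio <= 0:
        return None                       # target lies beyond the asymptote p1
    return ratio ** (1.0 / p3)
```

With an illustrative error curve of p1 = 0.1, p2 = 2.0 and p3 = -0.5, a target error of 0.2 would require about 400 instances, whereas a target of 0.05 is below the asymptote and cannot be reached under this fitted curve, regardless of how much data is added.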
Nonetheless, the incremental fit of learning curves in quartiles is not always successful. The algorithm used is not exact and may terminate in an error due to parameter limitations and exceeding the maximum number of iterations. As a result, the data to be used in the statistical analyses is further reduced. Because of that, the number of individual learning curves employed in statistical analyses was marked accordingly (column 'N' in Tables 1 and 3). When no filter was applied (Filter = none), all 130 datasets were used.
However, when the filters were applied, some datasets did not have the required number of instances to produce enough data points for building the learning curve. We ended up with 79 datasets that provided enough data points for all filters, and provided answers for all four algorithms. Due to each mathematical model's specifics, some calculations of the models' parameters diverged and no learning curve was produced. Such a case is presented in Table A3, for algorithm A in quartile 2 (see row 2), where the power curve was not calculated. In some cases, we were unable to calculate any model for a specific algorithm. For example, when no filter was applied, there were a total of 130 datasets × 3 models = 390 potential learning curves. However, due to the fitting diverging and/or terminating, only 352 learning curves were successfully calculated. The datasets for which one or more algorithms could not be evaluated were still used in the analysis in order to produce as many learning curves as possible, thus allowing multiple comparisons.
The complete data for the incremental fitting of learning curves for all algorithms and quartiles is given for a selected few datasets in the table in Appendix B. The table shows the raw values of the metrics. Missing entries indicate that the fitting was not successful for that configuration.

Results
In terms of fitting a single learning curve, the fit results of the selected mathematical models are interdependent. In other words, the results of the obtained metrics (MSE and $R^2$) are interdependent within one learning curve and can be compared using pairwise (dependent) tests.
Since the obtained results do not satisfy the assumptions required for parametric tests, nonparametric tests were used in the statistical analyses. Friedman's test was employed for simultaneous comparison of all three mathematical models, followed by pairwise post-hoc tests using the Wilcoxon signed-rank test.
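The Friedman test for three related samples can be sketched in a few lines. Each row holds the value of one metric for the three models on the same learning curve. The sketch uses average ranks for ties but omits the tie correction of the statistic, and it exploits the fact that for three models the chi-square reference distribution has two degrees of freedom, whose survival function is exp(-Q / 2). Function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def friedman_three(rows):
    """rows: one (power, logarithmic, exponential) triple per learning curve."""
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    for row in rows:
        for j, r in enumerate(average_ranks(list(row))):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    p_value = math.exp(-q / 2.0)   # chi-square survival function for df = 2
    return q, p_value
```

If one model has the lowest MSE on every one of ten curves and another always the highest, the statistic is 20 and the p-value is exp(-10), so the hypothesis that the three models perform equally is rejected.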
Fitting the learning curves with different mathematical models was observed from two different perspectives. First, we examined the goodness-of-fit, which shows how well a particular model can describe a part or the entirety of the examined learning curve. Then, we investigated the ability to predict, which shows how well a particular model can predict (or extrapolate) the remainder of the learning curve. Figure 3 shows the extrapolation of the learning curve using the power and exponential models. Both models were fitted on the first quarter of the learning curve, while the remainder was extrapolated; the boundary between interpolation and extrapolation is marked by a vertical line. It can be seen from the figure that the power model proved to be better at predicting the remainder of the learning curve.

Goodness of Fit
When comparing the selected models in terms of goodness-of-fit, the $MSE^{(i)}$ was compared first, followed by $R^{2(i)}$. A more favorable value of an individual metric (lower $MSE^{(i)}$ and higher $R^{2(i)}$) means higher goodness-of-fit.
The Friedman test was used to compare the $MSE^{(i)}$ of all three models simultaneously. The results are shown in Table 1. Comparisons were performed for all four quartiles (Filter = i), as well as the full dataset (Filter = none). "N" represents the number of instances used in statistical comparisons. Due to limitations outlined in the previous section, the incremental fit of learning curves by quartiles was conducted on a limited number of instances. Conversely, the fitting on the full dataset was carried out on all available instances. Type I error corrections were performed for all five p-values in the table.

After analyzing the $MSE^{(i)}$, we proceeded with analyzing the $R^{2(i)}$. Following the same procedure as before, the Friedman test was performed first, followed by pairwise comparisons using the Wilcoxon signed-rank test. The results of both statistical procedures are shown in Tables 1 and 2. Since the value of $R^{2(i)}$ is generally restricted to the interval [0, 1], the statistical distribution of its values is also shown on a box-and-whisker plot (see Figure 4). The power model had the highest median, followed by the exponential model, and finally the logarithmic model. With the exception of the first quartile (quartile = 1), the power model proved to be the most appropriate, followed by the exponential and, finally, the logarithmic model.
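A pairwise comparison with the Wilcoxon signed-rank test can be sketched as follows. The p-value uses the normal approximation without continuity or tie correction, which is an assumption made for brevity; exact tables or a statistics package would be preferable for small samples. The function name is illustrative.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:                                    # average ranks for ties
        end = pos
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))        # two-sided normal tail
    return w, p_value
```

In the study's setting, `a` and `b` would hold the same metric (for example, the MSE) of two models over the same set of learning curves, and the resulting p-values would then be corrected for multiple comparisons.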

Prediction
To compare the mathematical models in terms of their prediction capabilities, a statistical analysis of the $MSE^{(i)}_{predict}$ was performed first, followed by the $R^{2(i)}_{predict}$. A more favorable value of an individual metric (lower $MSE^{(i)}_{predict}$ and higher $R^{2(i)}_{predict}$) means a greater ability to predict unknown data.
The statistical analyses in this subsection are analogous to the ones performed in the previous subsection, so they are not described in more detail. The results of the Friedman comparison tests for $MSE^{(i)}_{predict}$ and $R^{2(i)}_{predict}$ can be found in Table 3. Table 4 contains pairwise comparisons of $MSE^{(i)}_{predict}$ and $R^{2(i)}_{predict}$ using the Wilcoxon signed-rank test. The statistical distribution of the $R^{2(i)}_{predict}$ for the selected mathematical models is shown on the box-and-whisker plot in Figure 5. Similarly to the goodness-of-fit measure, the power model had the highest median, followed by the exponential and the logarithmic model. With the exception of the first quartile (quartile = 1) in the analysis of the metric $MSE^{(i)}_{predict}$, the power model again proved to be the most appropriate model for the prediction (extrapolation) of learning curves. It was followed by the exponential and, finally, the logarithmic model.
Since the model was fit to a portion of the learning curve, only the data for the first three quarters is shown in the mentioned figures and tables. That is because the fourth quartile represents the entire learning curve, for which any further predictions can no longer be validated using the existing data.
Type I error correction was performed on all three p-values for Table 3, and all nine p-values for Table 4.

Conclusions
When presenting the results from both aspects (fit quality and ability to predict), it was apparent that, in general, the power model proved to be the most appropriate choice for describing learning curves and thus machine learning algorithms' performance. The results of the conducted research are consistent with the findings of authors in the area of machine learning, e.g., Frey and Fisher [6], Last [8] and Provost et al. [28].
Interestingly, the results contradict the findings of Heathcote et al. [11] who were modeling and observing human cognitive performance and found out that the exponential law is the best to describe an individual learner and that the power law may be observed only at the generalization level. However, the power law was again better at describing a combined motor-cognitive task [40]. There is additional research needed to explain why and when human and machine learners might be different in their performances.
The novelty of our research is in providing a systematic and concise answer regarding the shape of learning curves produced by artificial learning algorithms. No previous study has utilized a broad set of datasets and statistically validated the results. The studies mentioned here and in the related works have mostly worked with a single machine learner and, at best, with a few datasets. As opposed to other studies, we have systematically investigated the performance and predictive ability of four commonly used machine learning algorithms over a substantial number of datasets, employing rigorous statistical methods.
The prevailing power law should be researchers' first choice when measuring the performance of a learner at the individual level (a single machine learning algorithm) or at the generalized level (several algorithms). However, consistent with the observations of [15,23], a combination of decisions taken during the machine learning process (e.g., combination of datasets, selected classifiers, fitting parameters, pretraining, fine-tuning, etc.) determine the shape of the learning curve.
Our results can serve as important input to the practitioners who try to improve their results by changing the internal parameters of the machine learning algorithm used. The question for the practitioners is whether these changes lead to shifting the learning curve, or to a better generalization. Determining whether or not the change(s) affect the power-law exponent can lead to immense accuracy improvements. These can be implemented early in the process.
We have shown that for most problems it is possible to determine the best-fitting model and the best predicting model, but that there are special cases where the learning curve is difficult to characterize.
Future work should examine these cases in greater detail with the intention to identify and describe combinations of characteristics for which the power law is not the most suitable descriptor. A prominent area of future studies is the impact of using data processing techniques (e.g., filtering, augmentation, cleaning) on the learning curves. Additionally, further studies should seek to find out which model is best for a specific algorithm.

Data Availability Statement:
Publicly available datasets were analyzed in this study. This data can be found here: https://archive.ics.uci.edu/.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
The datasets used in our experiment were obtained from the UCI Machine Learning Repository [41]. There was a total of 184 datasets that focused on multivariate classification problems. However, some of the identified datasets were not appropriate for the task of constructing and fitting the learning curves, which is why they had to be removed. The primary criteria to determine the suitability of an individual dataset included: the number of instances, the availability of the dataset, and the format of the data.
If the dataset was already divided into learning and test sets, both sets were combined prior to the experimental procedure. Some of the published datasets were not datasets at all, but data created by random generators. Such cases were excluded as well.
Data format was also important. The tool that was used to implement machine learning, Weka, supports several forms of input data; however, its native ARFF format proved to be the best for our purposes. Since the vast majority of datasets in the UCI repository were not available in this format, alternative solutions were sought. Several third-party repositories were found online, containing most of the collections from the UCI repository. The remaining datasets that could not be found in online repositories were manually converted. The few datasets that could not be converted to the desired format were discarded.
Finally, since datasets were split into quartiles, it was important to ensure that the learning curves contained enough data points. For that purpose, the required minimum number of instances in the collection was set to 500. The final number of datasets that met all the requirements was 79. Table A1 displays a list of all datasets that provided enough data points for analyses to be conducted on individual quartiles. The meaning of the columns is as follows. The Dataset column indicates the name of the ARFF file, which in most cases matches the name of the dataset uploaded to the UCI repository [41]. The Number of attributes column indicates the number of attributes that represent potential decision criteria for classification. The Number of instances column marks the number of valid instances included in a given dataset.

Additional analyses were carried out for learning curves that were constructed from the full datasets (i.e., when no filters were applied). Since the full datasets contained enough data points in all cases, the minimum number of instances requirement was not relevant. Table A2 lists the remaining 51 datasets that contained fewer than 500 instances. The analyses conducted on the datasets that were not split employed all datasets listed in Tables A1 and A2 (79 + 51 = 130 datasets).

Dataset            Number of attributes   Number of instances
-tumor             18                     339
prnn_fglass        10                     214
prnn_synth         3                      250
rmftsa_propores    5                      289
schizo             15                     340
seeds              8                      210
sonar              61                     208
spect              23                     267
spectf             45                     349
usp05              17                     203
V1                 16                     435
VO                 17                     435
vote               17                     435
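The minimum-size rule described above amounts to a simple partition of the collection. The sketch below assumes the instance counts are available as a dictionary; the function name and the two entries marked as hypothetical are introduced for illustration only, while the counts for sonar, seeds, and vote are taken from the table rows above.

```python
MIN_INSTANCES_FOR_QUARTILES = 500  # threshold stated in the text


def partition_by_size(instance_counts, minimum=MIN_INSTANCES_FOR_QUARTILES):
    """Split dataset names into those usable for quartile analyses
    (at least `minimum` instances) and those used only in full."""
    quartile_ready = {name for name, n in instance_counts.items() if n >= minimum}
    full_only = set(instance_counts) - quartile_ready
    return quartile_ready, full_only


instance_counts = {
    "sonar": 208,
    "seeds": 210,
    "vote": 435,
    "example_large_a": 1200,  # hypothetical dataset, not from the study
    "example_large_b": 500,   # hypothetical dataset at the threshold
}

quartile_ready, full_only = partition_by_size(instance_counts)
print(sorted(quartile_ready))
print(sorted(full_only))
```

Both groups together form the set used for the analyses on unsplit data, mirroring the 79 + 51 = 130 datasets reported above.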

Appendix B
Table A3 shows the experimental results of a comparison of eight selected mathematical models used to describe the shape of the learning curves. The meaning of the columns is as follows. The Dataset row preceding the tables and the Algorithm and Quartile columns show a combination of the selected dataset, the classification algorithm, and the size of the learning set. For each of the selected metrics (MSE (i) , R 2 (i) , MSE