A Comprehensive Machine-Learning-Based Software Pipeline to Classify EEG Signals: A Case Study on PNES vs. Control Subjects

The diagnosis of psychogenic nonepileptic seizures (PNES) by means of electroencephalography (EEG) is not a trivial task during clinical practice for neurologists. No clear PNES electrophysiological biomarker has yet been found, and the only tool available for diagnosis is video EEG monitoring with recording of a typical episode and clinical history of the subject. In this paper, a data-driven machine learning (ML) pipeline for classifying EEG segments (i.e., epochs) of PNES and healthy controls (CNT) is introduced. This software pipeline consists of a semiautomatic signal processing technique and a supervised ML classifier to aid clinical discriminative diagnosis of PNES by means of an EEG time series. In our ML pipeline, statistical features like the mean, standard deviation, kurtosis, and skewness are extracted from a power spectral density (PSD) map split into five conventional EEG rhythms (delta, theta, alpha, beta, and the whole band, i.e., 1–32 Hz). Then, the feature vector is fed into three different supervised ML algorithms, namely, the support vector machine (SVM), linear discriminant analysis (LDA), and Bayesian network (BN), to perform EEG segment classification tasks for CNT vs. PNES. The performance of the pipeline algorithm was evaluated on a dataset of 20 EEG signals (10 PNES and 10 CNT) that was recorded in an eyes-closed resting condition at the Regional Epilepsy Centre, Great Metropolitan Hospital of Reggio Calabria, University of Catanzaro, Italy. The experimental results showed that PNES vs. CNT discrimination tasks performed via the ML algorithms and validated with random split (RS) achieved an average accuracy of 0.97 ± 0.013 (RS-SVM), 0.99 ± 0.02 (RS-LDA), and 0.82 ± 0.109 (RS-BN). Meanwhile, with leave-one-out (LOO) validation, an average accuracy of 0.98 ± 0.0233 (LOO-SVM), 0.98 ± 0.124 (LOO-LDA), and 0.81 ± 0.109 (LOO-BN) was achieved. Our findings showed that BN was outperformed by SVM and LDA.
The promising results of the proposed software pipeline suggest that it may be a valuable tool to support existing clinical diagnosis.


Introduction
Psychogenic nonepileptic seizures (PNES) are sudden behavioral changes mimicking epileptic seizures without ictal electroencephalography (EEG) changes [1,2]. PNES have been linked to dysfunction in the processing of psychological or social distress and to abuse during childhood.
The resting EEG was recorded 10 to 20 min before the diagnostic EEG. Recordings were performed with a Micromed Brain Quick system (Micromed SpA, Mogliano Veneto (TV), Italy) at a sampling rate of 512 Hz, with a high-pass filter at 0.5 Hz, a low-pass filter at 70 Hz, and a 50 Hz notch filter with a slope of 12 dB/oct. All of the EEG signals were recorded using a montage with the following channel layout: Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6, Fz, Cz, and Pz, with the reference at G2 (located between electrodes Fz and Cz). All electrode-skin impedance values were kept below 5 kΩ. The EEG data were recorded in a resting condition for 20 min.
Participants were comfortably seated in a reclining chair with their eyes closed, and the technicians kept the subjects alert to prevent drowsiness. Afterwards, the EEGs were downsampled to 256 Hz, segmented into 20 min long records, and stored on an optical disc in the American Standard Code for Information Interchange (ASCII) format for further processing. The EEG recordings were later manually reviewed by experts in order to remove the segments affected by artifacts.
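As an illustrative sketch of this preprocessing stage (downsampling from 512 to 256 Hz and cutting each record into fixed-length epochs), the following Python fragment can be used. The 5 s epoch length comes from the Results section; the function name and the synthetic input are purely illustrative, not the authors' code:

```python
import numpy as np
from scipy.signal import resample_poly

FS_RAW, FS_TARGET = 512, 256   # original and target sampling rates (from the text)
EPOCH_SEC = 5                  # 5 s epochs, i.e., 1280 samples at 256 Hz

def preprocess(eeg, fs_raw=FS_RAW, fs_target=FS_TARGET, epoch_sec=EPOCH_SEC):
    """Downsample a (channels x samples) EEG array and cut it into
    nonoverlapping fixed-length epochs (channels x epochs x samples)."""
    eeg = resample_poly(eeg, up=fs_target, down=fs_raw, axis=1)  # 512 -> 256 Hz
    n_samp = fs_target * epoch_sec
    n_epochs = eeg.shape[1] // n_samp
    eeg = eeg[:, : n_epochs * n_samp]          # drop the trailing partial epoch
    return eeg.reshape(eeg.shape[0], n_epochs, n_samp)

epochs = preprocess(np.random.randn(19, 512 * 60))  # 1 min of 19-channel EEG
print(epochs.shape)  # → (19, 12, 1280)
```

Artifact rejection would then be applied per epoch, as described above.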

EEG Software Pipeline
In this section, the proposed ML pipeline is briefly described, and its architecture is depicted in Figure 1. Our database included PNES and CNT subjects whose data were processed in multiple steps as follows.
In each submap, the mean (m), standard deviation (d), skewness (v), and kurtosis (k) were estimated. The selected features have been successfully validated in [16][17][18] and have been extensively described in the previous sections. The resulting vector size is 5 (number of sub-bands) × 4 (number of features) = 20 features per channel and epoch. Stacked over channels and epochs, each subject yielded a feature set X^T of size channels × epochs × features = 19 × 120 × (4 × 5) = 45,600 features.
The power spectral density is defined as the Fourier transform (FT) of the signal's autocorrelation function. In this paper, the Welch method was applied along with a Hamming window [19]. The Welch approach splits the time series into overlapping chunks, computes a modified periodogram of each chunk, and averages the resulting PSD estimates. The PSD estimates were obtained in delta, theta, alpha, beta, and the whole frequency band for each EEG sensor, and m, d, v, and k were obtained from the spectral map for each band. Given the input EEG sequence x(n), we chose the following segment representation:
x_i(n) = x(n + iD), 0 ≤ n ≤ h − 1, 0 ≤ i ≤ M − 1,
where D is the offset between successive windows, h is the window's length, and M represents the maximum number of segments. The modified periodogram of each segment can be represented as follows:
P_i(f) = (1 / (hU)) |Σ_{n=0}^{h−1} x_i(n) w(n) e^{−j2πfn}|²,
where w(n) is the Hamming window and U = (1/h) Σ_{n=0}^{h−1} w²(n) gives the normalization factor. The Welch power spectrum is as follows:
P_xx(f) = (1/M) Σ_{i=0}^{M−1} P_i(f).
P_xx is intended for continuous spectra; the PSD carries the average energy embedded in the signal in each frequency band under analysis, and the power in each band is then obtained by integrating the PSD. To do this, we used the Matlab psd function:
[P_xx, f] = psd(x, nfft, fs, window), where "x" specifies the input sequence and "nfft" specifies the length of the fast Fourier transform (FFT) performed on each EEG segment; nfft was set to 256 samples. The length of the "window" must be less than or equal to "nfft"; the Hamming window was set to the same length as "nfft". The function output returns the power spectrum of the input signal "x".
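For readers without Matlab, an equivalent Welch-based band-feature extraction can be sketched in Python with `scipy.signal.welch`. The exact band edges below are conventional assumptions, not values stated by the authors, and the function name is illustrative:

```python
import numpy as np
from scipy.signal import welch
from scipy.stats import kurtosis, skew

FS = 256
# Conventional band edges (assumed; the paper only names the rhythms)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 32), "whole": (1, 32)}

def epoch_features(x, fs=FS):
    """Welch PSD of one single-channel epoch, then mean, std,
    skewness, and kurtosis of the PSD in each sub-band (20 values)."""
    f, pxx = welch(x, fs=fs, window="hamming", nperseg=256)
    feats = []
    for lo, hi in BANDS.values():
        p = pxx[(f >= lo) & (f < hi)]
        feats += [p.mean(), p.std(), skew(p), kurtosis(p)]
    return np.asarray(feats)

x = np.random.randn(1280)          # one 5 s epoch at 256 Hz
print(epoch_features(x).shape)     # → (20,)
```

Applying this to all 19 channels of all 120 epochs reproduces the per-subject feature set described above.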

EEG Feature Classification
Given the ith EEG epoch, in order to classify it as CNT or PNES, three different ML classifiers, namely, SVM, LDA, and BN, were used; their parameters and results are presented in the next section.

Support Vector Machine
SVM [12] is a flexible and powerful statistical learning tool for binary classification. SVM works by constructing an N-dimensional hyperplane that optimally separates the data into two categories: it maps the data into a higher dimensional space and then constructs an optimal separating hyperplane in this space. SVM can deal with large feature spaces. Each input row of the feature set is called a vector. The most important training stage in SVM is the hyperplane definition.
The aim of classification by the SVM algorithm is to find an optimal hyperplane that separates clusters of vectors into two nonoverlapping classes. Training an SVM is relatively easy [20], and it works relatively well for high-dimensional feature sets. The complexity and error trade-off can be controlled manually. The choice of kernel is most important in the trade-off between computational performance and speed of execution: an SVM kernel function maps the data into a different nonlinear region separated by a more complex hyperplane. The kernel separates two input vectors by projecting them into a higher dimensional space. Later, the support vector method was extended for solving function estimation problems. Given a training set of N data points {(x_k, y_k)}_{k=1}^{N}, where x_k ∈ R^n is the k-th input pattern and y_k ∈ {−1, +1} is the k-th output label, the SVM classifier model can be expressed as follows:
y(x) = sign[ Σ_{k=1}^{N} α_k y_k K(x, x_k) + b ],
where the α_k are positive real constants and b is a real constant.
In this study, we used a commonly adopted kernel called the radial basis function (RBF). The scalar product of two data points x and y under the feature map implied by the RBF kernel is computed as follows:
K(x, y) = exp( −‖x − y‖² / (2σ²) ),
where σ is a free parameter. Once a kernel function is selected, the SVM algorithm works by identifying a hyperplane in feature space that optimally separates the two classes in the training data, giving the maximum margin between the images in feature space of the points in the two classes. Often it is desirable to allow a few misclassifications in order to achieve a wider margin of separation; this trade-off is controlled by another parameter called the training error cost, which is usually denoted by C.
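In scikit-learn, which the authors used, the RBF kernel is parameterized as gamma = 1/(2σ²). A minimal sketch with the paper's gamma = 0.01 on synthetic feature vectors (the data and labels below are illustrative, not the study's dataset):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))            # 200 epochs x 20 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy CNT (0) vs. PNES (1) labels

# gamma = 1 / (2 * sigma^2) relates sklearn's parameter to sigma above;
# C is the training-error cost discussed in the text.
clf = SVC(kernel="rbf", gamma=0.01, C=1.0).fit(X, y)
print(clf.score(X, y))   # training accuracy
```

Increasing C narrows the margin and reduces training misclassifications at the risk of overfitting.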

Bayesian Networks
A Bayesian network is a directed graph in which each node is associated with a random variable with a finite number of states and an attached probability function. The edges represent the relationships between the nodes. A node with no incoming edge has no parent and is characterized only by its prior probability.
We can represent the learning problem as follows. Let S be a directed acyclic graph over the variable set X = {X_1, . . . , X_n}; the joint distribution then factorizes as
p(X | θ_s, S^h) = Π_{i=1}^{n} p(X_i | Pa_i, θ_s, S^h),
where Pa_i denotes the set of parents of X_i in S. If S^h is a structure hypothesis and θ_s are the hypothesized parameters, then, given a random sample D = {x_1, . . . , x_N} drawn from p(X | θ_s, S^h), where θ_s and S^h are the true parameters and structure hypothesis, respectively, we can compute the posterior probability as follows:
p(θ_s | D, S^h) = c · p(D | θ_s, S^h) · p(θ_s | S^h),
where c is a normalization constant. This Bayesian network method has been discussed by several authors [16]. Each node in the graph represents a random variable, whereas the edges between the nodes represent probabilistic dependencies among the corresponding random variables; these dependencies are estimated using known statistical and computational methods. Hence, BNs provide a simple definition of independence between any two distinct nodes. A BN is a directed acyclic graph (DAG) whose structure is defined by two sets: the set of nodes (vertices), which represent random variables and are drawn as circles labelled with the variable names, and the set of directed edges.
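As a hedged illustration of this factorization: scikit-learn, which the authors used for classification, ships no general Bayesian network learner, so the sketch below uses naive Bayes, the simplest Bayesian network, in which the class node is the only parent of every feature node. It is not necessarily the BN structure the authors trained:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 20)),    # class 0 (CNT-like, synthetic)
               rng.normal(1, 1, (100, 20))])   # class 1 (PNES-like, synthetic)
y = np.repeat([0, 1], 100)

# With the class as sole parent, the DAG factorization collapses to
# p(y | x) ∝ p(y) * prod_i p(x_i | y), each p(x_i | y) here a Gaussian.
bn = GaussianNB().fit(X, y)
print(bn.predict_proba(X[:1]))   # posterior p(y | x) for one epoch
```

Richer DAGs (extra edges between feature nodes) relax the independence assumption at the cost of more parameters to estimate.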

Linear Discrimination Analysis
LDA is a well-known algorithm for feature extraction and dimension reduction. It has been widely used in many applications, such as face recognition, image retrieval, and microarray data classification. LDA draws a decision region between the given classes by finding an optimal linear transformation that maximizes class separability: it creates a linear combination of the data that yields the largest mean difference between the desired classes. It performs well when the feature vector is multivariate normally distributed in each class group and the different groups share a common covariance.
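A minimal LDA sketch on synthetic data with a shared covariance, the setting in which LDA is optimal; the sample sizes and mean shift are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1, (100, 20)),    # class 0: mean 0
               rng.normal(1.5, 1, (100, 20))])   # class 1: shifted mean, same covariance
y = np.repeat([0, 1], 100)

# LDA fits one shared covariance and per-class means, then projects
# onto the direction that maximizes between-class vs. within-class scatter.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.score(X, y))   # training accuracy
```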

Use of Classifiers for the Discrimination between CNT and PNES
In the proposed classification framework, we compared PNES patients with healthy controls. The following classification algorithms were applied and compared: (i) LDA, (ii) SVM, and (iii) BN.
The basic idea of the SVM is to construct a hyperplane that has the maximum margin between CNT and PNES samples.
For each of these classification methods, performance estimators, namely, accuracy, sensitivity, specificity, and the receiver operating characteristic (ROC), were computed through two methods: (i) random split (RS) and (ii) leave-one-out (LOO) cross-validation. LOO cross-validation is often used to estimate the generalization ability of classifiers; we used it to achieve greater classifier robustness and to improve the classifier's ability to generalize to new data [15]. With multiple data splits, the likelihood of erroneous results was reduced. Let us define M as the number of chunks and N as the number of epochs per subject (N = 120 epochs = 1 subject). Cross-validation was performed by dividing the data into M = 17 splits.
This approach repeatedly trains the classifier on M − 1 splits and then tests on the remaining one. In the random split approach, our database was divided into a training set (70%) and a test set (30%). The training set was designed as stacked PNES and CNT epochs: M = 7 chunks of N epochs each were randomly extracted from the PNES and CNT cohorts, giving a training set of 2 (number of classes) × 7 × N = 1,680 epochs. Separate SVM, LDA, and BN models were then trained. The remaining trials were used as a blind test: the outcome is the accuracy of classifying the respective unseen 6 × N epochs, and the score was averaged across the 6 test splits.
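The two validation schemes can be sketched as follows; the subject counts and the separable synthetic data are illustrative, and the leave-one-out is done at the subject level (all epochs of one subject held out), matching the per-subject LOO described in the text:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n_subj, n_ep = 6, 40                        # toy numbers, not the paper's 20 x 120
X = rng.normal(size=(n_subj * n_ep, 20))
y = np.repeat([0, 1], n_subj // 2 * n_ep)   # first half CNT, second half PNES
X[y == 1] += 1.0                            # make the toy classes separable
subj = np.repeat(np.arange(n_subj), n_ep)   # subject id of every epoch

# (i) random split: 70% train / 30% test, as in the text
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
rs_acc = SVC(kernel="rbf").fit(Xtr, ytr).score(Xte, yte)

# (ii) leave-one-out at the subject level: hold out all epochs of one subject
loo_acc = cross_val_score(SVC(kernel="rbf"), X, y,
                          groups=subj, cv=LeaveOneGroupOut()).mean()
print(rs_acc, loo_acc)
```

Grouping by subject prevents epochs of the same subject from appearing in both training and test sets, which would inflate the estimate.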
All classification outcomes are reported here as the average over all M cross-validation splits. Classification performance is expressed as the percentage of correctly classified epochs per class. A ROC plot [21] is commonly used as a summary for assessing the trade-off between sensitivity and specificity, and the area under the curve (AUC) is used as an indicator of the quality of separation. In this study, at each step, the classification outcome was compared to a threshold, and the subject's epoch was classified as CNT or PNES as a function of that threshold. Using simulated data, Figure 2 shows a comparison of the ROC curves and AUC values obtained from the three different classifiers for PNES and CNT. The ROC of a perfect classifier goes from the bottom left corner via the top left to the top right corner.
LDA, SVM, and BN require many observations relative to the number of variables. They implement a shallow representation with low computational cost. SVM is one of the most powerful classification methods [22] but is prone to overfitting. For the SVM classifier, the following parameters were chosen: kernel = radial, degree = 3, gamma = 0.01, coef0 = 0, epsilon = 0.1. The default parameters were used for LDA and BN. All classification stages were implemented in Python 3.7 with support from the sklearn library.
The LDA classifier generated probability values for discriminating PNES patients from CNT. To test the proposed ML approach, we used the AUC, sensitivity, specificity, and accuracy. We followed a leave-one-out cross-validation approach for each subject, and we used random split validation to randomly divide the feature database into a training set and a test set with proportions of 70% and 30%, respectively. The training set was used to determine the classification model, and the test set was used to evaluate it.
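A sketch of this ROC/AUC evaluation on a held-out test set (synthetic data, not the study's recordings):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (150, 20)),
               rng.normal(1, 1, (150, 20))])   # synthetic CNT / PNES features
y = np.repeat([0, 1], 150)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# The ROC sweeps a decision threshold over the class-1 probabilities
scores = LinearDiscriminantAnalysis().fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, scores)
print(f"AUC = {auc(fpr, tpr):.3f}")
```

Plotting `tpr` against `fpr` gives the ROC curve; an AUC of 1.0 corresponds to the perfect classifier described above.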

Results
In this study, a dataset of 20 EEG signals (10 PNES and 10 CNT) was used. We classified the EEGs as either CNT or PNES based on three different ML models. Given the ith EEG time series, it was first preprocessed, split into 5 s nonoverlapping EEG epochs (i.e., 1280 samples each), and then transformed into a PSD map. Then, given one PSD map under analysis, 20 statistical features were extracted (as described in Section 2). Afterwards, the whole dataset was designed as number of epochs × number of features = 9600 × 304, with each group of 120 epochs corresponding to one subject.
For both cohorts (CNT and PNES), the classification performance of each classifier was defined as the percentage of correctly and incorrectly classified patients and healthy controls, considering sensitivity, specificity, F1, and recall, as summarized in Table 1.
Table 1. The classification performance of psychogenic nonepileptic seizures (PNES) vs. healthy controls (CNT) in terms of precision (PR), recall (RC), and F1 score.
We tested the discrimination performance between the classes labeled CNT and PNES. The results obtained are quite comparable between the two validation methods. The classifier performance provides values for each class under analysis. Our results demonstrated that all the classification algorithms performed nearly equally well, with a remarkable sensitivity of up to 80% in the discrimination task between CNT and PNES. In CNT vs. PNES class separation, the BN with LOO, as shown in Table 1, was outperformed by LDA and SVM. Indeed, the AUC in Figure 2E is slightly lower than that of the ROC curve obtained by means of a random split (see Figure 2B). The ROC curve was formed by averaging the model's outcome at each iterative step. However, the low misclassification rate is noteworthy.

In this study, a binary epoch-level CNT vs. PNES separation by means of ML algorithms was implemented. Classification performance was evaluated with the accuracy (ACC), F1 score, recall (RC), and precision (PR) metrics, which are defined as follows:
Accuracy = (TP + TN) / (TP + FP + TN + FN);
Precision = TP / (TP + FP);
Recall = TP / (TP + FN);
F1 = 2 · (Precision · Recall) / (Precision + Recall);
where true positives (TP) and true negatives (TN) indicate test samples correctly classified as PNES or CNT subjects, whereas false positives (FP) and false negatives (FN) indicate the number of test examples that were wrongly detected as subjects with disease and without disease, respectively. Precision is a measure that quantifies the number of correct identifications of PNES patients. Recall is a measure of the ability of the classifier to find all the CNT samples. Accuracy is the average of the sensitivity and specificity of the classification. We compared the performance of the previously described classifiers through two validation methods, i.e., LOO and random split validation. Table 1 reports the evaluation scores of the classifier performance. Indeed, SVM reported quite similar scores for the two validation methods. As can be seen in Table 1, the SVM with random split achieved a PR of 0.95 ± 0.03, RC of 0.99 ± 0.001, and ACC of 0.97 ± 0.013. Meanwhile, the LOO SVM showed a PR of 0.98 ± 0.061, RC of 0.98 ± 0.126, and ACC of 0.98 ± 0.0233. We statistically evaluated the significant differences in class discrimination among the three classifiers, and the results are reported here as mean ± standard deviation. The SVM classifier showed a likelihood ratio in accuracy for LOO and random split validation equal to 0.975 ± 0.018. Similar evaluations were obtained for LDA, with 0.99 ± 0.008, and BN, with 0.83 ± 0.05. Furthermore, in order to assess and confirm the ability to correctly detect the EEG epochs of PNES and CNT, the AUC was also estimated. Specifically, Figure 2A-F reports the ROC curves and AUC values of each discrimination technique.
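The four metrics can be computed directly from confusion-matrix counts; a small self-contained example with illustrative counts:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp)
    rc = tp / (tp + fn)
    f1 = 2 * pr * rc / (pr + rc)
    return acc, pr, rc, f1

# e.g., 90 PNES hits, 85 CNT hits, 10 false alarms, 15 misses
print(metrics(tp=90, tn=85, fp=10, fn=15))  # → (0.875, 0.9, 0.8571..., 0.8780...)
```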
The best performance was observed with the LDA classifier (AUC score of 0.990 ± 0.002), as shown in Figure 2C.

Discussion and Conclusion
In this work, we propose a novel ML pipeline to automatically classify the resting-state EEG data of PNES patients and healthy subjects. We carried out many tests in order to determine whether the proposed machine learning tool is robust and flexible. In EEG classification, SVM, LDA, and BN have previously been shown to perform well in many contexts [14]. For this reason, we sought to determine whether the EEG data in our experiment could be automatically classified using the proposed ML pipeline. We exploited the potential of three different machine learning algorithms to differentiate EEG segments (i.e., epochs) of CNT and PNES. A data-driven ML framework based on a PSD representation of the EEG recordings was proposed. Here, the PSD was computed on segmented EEGs (1280 samples in each segment), and features such as the mean (m), standard deviation (d), skewness (v), and kurtosis (k) were extracted from five EEG sub-bands. The extracted feature vector was input to the proposed shallow ML algorithms (SVM, LDA, and BN) to perform binary EEG epoch classification of CNT vs. PNES. Experimental results showed that LDA achieved better classification performance than SVM and BN. This was possible because feature extraction was used to reduce the redundancy of the high-dimensional input data: an input vector of only 380 features for each EEG partition was used to feed the ML classifiers. However, it has to be noted that it is difficult to compare our results with other outcomes due to the paucity of recent studies on EEG epoch-based classification of PNES data. The proposed feature extraction method and the developed ML classifiers achieved very good discrimination accuracy. Our outcomes are quite similar to the results shown in [15], in which Gasparini et al. used a deep learning model to achieve around 90% sensitivity and specificity in PNES discrimination.
However, their model is computationally more complex and time-consuming, and their stacked multilayer architecture needs larger datasets for validation to avoid overfitting and reduced performance. If the ultimate goal is to transfer an algorithm for early and effective diagnosis of PNES to chips, then shallow ML architectures can be used. They avoid computational bottlenecks and reduce cost and calculation time, guaranteeing lower energy expenditure for diagnostic devices that support standalone clinical activity. Nevertheless, the proposed pipeline has some limitations. One of the main limitations is the limited patient cohort, which constrains classification training/testing performance. A second limitation may be the limited number of features used. A possible improvement could be to increase the dataset dimension and the number of features, for example, by extracting many other features, such as the magnitude-squared cross power spectral density and the mutual information between pairs of sensors, in all the frequencies under analysis. Moreover, this is a cross-sectional study, as each subject underwent a single EEG. Longitudinal studies could be performed to test the robustness, stability, and performance of our ML pipeline.

Funding: Regione Calabria, 2017 budget, chapter n. U6101041401. The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Conflicts of Interest:
All the authors declare no conflict of interest.