Automatic Evaluation of the Lung Condition of COVID-19 Patients Using X-ray Images and Convolutional Neural Networks

COVID-19 represents one of the greatest challenges in modern history. Its impact is most noticeable in the healthcare system, mostly due to the accelerated and increased influx of patients with more severe clinical pictures, which increases the pressure on health systems. For this reason, the aim is to automate the process of diagnosis and treatment. The research presented in this article examined the possibility of classifying the clinical picture of a patient using X-ray images and convolutional neural networks (CNNs). The research was conducted on a dataset of 185 images divided into four classes. Due to the small number of images, a data augmentation procedure was performed. In order to define the CNN architecture with the highest classification performance, multiple CNNs were designed and compared. The results show that the best classification performance is achieved with ResNet152. This CNN achieved AUC macro and AUC micro values of up to 0.94, suggesting that CNNs can be applied to the classification of the clinical picture of COVID-19 patients from lung X-ray images. When higher layers are frozen during the training procedure, higher AUC macro and AUC micro values are achieved: with ResNet152, values of up to 0.96 are reached if all layers except the last 12 are frozen during training.


Introduction
The Coronavirus disease 2019 (COVID-19) caused by Severe Acute Respiratory Syndrome virus 2 (SARS-CoV-2) is a viral, respiratory lung disease [1]. The spread of COVID-19 has been rapid, and it has affected the daily lives of millions across the globe. The dangers it poses are well-known [2], with the most important ones being its relatively high severity and mortality rate [3] and the strain it places on the healthcare systems of countries worldwide [4,5]. Another problematic characteristic of COVID-19 is the wide variation in severity across patients, which can cause issues for healthcare workers who wish to determine an appropriate individual treatment plan [6]. Early determination of the severity of COVID-19 may be vital in securing the needed resources, such as planning the location for the patient's hospitalization or securing respiratory aids in case they become necessary.
There is a dire need for systems that will alleviate the strain put on the resources of healthcare systems, as well as on healthcare workers, by allowing easier classification of patient case severity in the early stages of hospitalization. Artificial intelligence (AI) techniques have already proven to be a useful tool in the fight against COVID-19 [7,8], so the possibility exists of them being applied in this area as well. Existence of such algorithms may lower the strain on potentially scarce resources by allowing early planning and allocation. Additionally, they may provide decision support to overworked healthcare professionals.
Internet of Medical Things (IoMT) is a medical paradigm that allows for integration of modern technologies in the existing healthcare system [9]. The algorithms developed as a part of the presented research can be made available to health professionals using IoMT [10]. Models obtained using the described methodology can be integrated inside a pipeline system in which an X-ray image will automatically be processed using the developed models, and the predicted class of the patient whose image has been obtained will immediately be delivered to the medical professional examining the X-ray. Such automated diagnosis methods have already been applied in many studies, such as in histopathology [11], neurological disorders [12], urology [13], and retinology [14]. All the researchers agree that not only can such AI-based support systems provide an extremely precise diagnosis, but can also be integrated in automatic systems to provide assistance to medical experts in determining the correct diagnosis. The obtained models are suited for such an approach. While the training of the models is slow due to the backpropagation process, the classification (using forward propagation) is fast and computationally moderate [15,16], allowing for easy integration into existing in-hospital systems.
The machine learning diagnostic approach has been successfully applied to X-ray images a number of times in the past. For example, Lujan-Garcia et al. (2020) [17] demonstrated the application of CNNs for the detection of pneumonia from chest X-ray images using the Xception CNN, pre-trained on the ImageNet dataset for initial weights. The evaluation was performed using precision, recall, F1 score, and AUROC, with the achieved scores being 0.84, 0.99, 0.91, and 0.97, respectively. Kieu et al. (2020) [18] demonstrated a Multi-CNN approach to the detection of abnormalities in chest X-ray images. The approach presented in the paper uses multiple CNNs to determine the class of the input image, with the presented hybrid system achieving an accuracy of 96%. Bullock et al. (2019) [19] presented XNet, a CNN solution designed for medical X-ray image segmentation. The presented solution is suitable for small datasets and achieves high scores (92% accuracy, an F1 score of 0.92, and an AUC of 0.98) on the used dataset. Takemiya et al. (2019) [20] demonstrated the use of R-CNNs (Region with Convolutional Neural Network) in the detection of pulmonary nodules in chest X-ray images. The proposed method utilizes the Selective Search algorithm to determine potential candidate regions of chest X-rays and applies a CNN to classify the selected regions into two classes: nodule opacities and non-nodule opacities. The presented approach achieved high classification accuracy. Another example is by Stirenko et al. (2018) [21], in which the authors applied a deep learning CNN approach to the X-ray images of patients with tuberculosis. The CNN was applied to a small and non-balanced dataset with the goal of segmenting chest X-ray images, allowing for classification of images with higher precision in comparison to non-segmented images; in combination with data augmentation techniques, even better results were achieved. The authors conclude that data augmentation and segmentation, combined with dataset stratification and removal of outliers, may provide better results in cases of small, poorly balanced datasets.
Several studies have utilized transfer learning methodologies in order to recognize respiratory diseases from chest X-ray images. In [22], the authors proposed a transfer learning approach in order to recognize pneumonia from X-ray images. The proposed approach, based on the utilization of ImageNet weights, resulted in high pneumonia recognition accuracy (96.4%). Another transfer learning approach was implemented for pneumonia detection in [23]. By utilizing such an approach, highly accurate multi-class classification can be achieved, with accuracy ranging from 93.3% to 98%.
Wong et al. [24] (2020) noted that radiographic findings do indicate positivity in COVID-19 patients, a conclusion further supported by Orsi et al. [25] (2020) and Cozzi et al. [26] (2020). Borghesi and Maroldi [27] (2020) defined a scoring system for X-ray COVID-19 monitoring, concluding that there is a definite possibility of determining the severity of the disease through the observation of X-ray images. Research has also been done on the application of AI in the detection of COVID-19 in patients. Recently, preliminary diagnosis has been performed by classifying cough samples: Imran et al. [28] (2020) implemented this approach in an app called AI4COVID-19.
Bragazzi et al. [29] (2020) demonstrated the possible uses of information and communication technologies, artificial intelligence, and big data in order to handle the large amount of data that may be generated by the ongoing pandemic. Further reviews and comparisons of mathematical modeling, artificial intelligence, and datasets for prediction were done by multiple authors, such as Mohamadou et al. [30] (2020), Raza [31] (2020), and Adly et al. [32] (2020). All the aforementioned authors concluded that AI can be applied in the current and possibly forthcoming pandemics. AI applications in this field have shown the most promise in modeling epidemiological spread. Zheng et al. [33] (2020) applied a hybrid model for a 14-day period prediction, Hazarika et al. [34] (2020) applied wavelet-coupled random vector functional neural networks, while Car et al. [35] (2020) applied a multilayer perceptron neural network with the goal of regressing the epidemiology curve components. Ye et al. [36] (2020) demonstrated α-Satellite, an AI-driven system for risk assessment at a community level. The authors demonstrate the usability of such a system in the fight against COVID-19, as a system that displays a risk index and the number of cases for all larger locations across the United States. The authors in [37] proposed a method for forecasting the impact of COVID-19 on stock prices. The approach, based on the stationary wavelet transform and bidirectional long short-term memory, showed high estimation performance.
Still, a large amount of work has also been done on image classification and detection of COVID-19 in patients. Wang et al. [38] (2020) demonstrated the use of high-complexity convolutional neural networks for COVID-19 diagnosis. Their COVID-Net custom architecture reached high sensitivity scores (above 90%) in the detection of COVID-19 in comparison to other infections and a normal lung state. Narin et al. [39] (2020) also demonstrated a high-quality solution using deep convolutional neural networks on X-ray images. Through the application of five different architectures (ResNet50, ResNet101, ResNet152, Inception V3, and Inception-ResNetV2), high scores were achieved (accuracy of 95% or higher). Ozturk et al. [40] (2020) developed a classification network for classifying the inflammation, named DarkCovidNet. DarkCovidNet reached an impressive score of 98.08% in binary classification, but a significantly lower 87.02% in multi-class classification. In the presented case, multi-class classification was conducted with the aim of differentiating X-ray images of the lungs of healthy patients, patients with COVID-19, and patients with pneumonia. Abdulaal et al. [41] (2020) demonstrated an AI-based prognostic model of COVID-19, achieving an accuracy of 86.25% and an AUC ROC of 90.12% for UK patients.
There have been studies proposing a transfer-learning approach to COVID-19 diagnosis from X-ray images of the chest. The study presented in [42] used pre-trained CNNs in order to automatically recognize COVID-19 infection. Such an approach enabled high classification performance, with an accuracy of up to 99%. The research presented in [43] proposed a similar approach in order to differentiate pneumonia from COVID-19 infection. Transfer learning enabled higher classification accuracy with the utilization of a simpler CNN architecture, such as VGG-16.
While a lot of work suggests that neural networks may be used for the detection of COVID-19 infection, there is an apparent lack of work that tests the possibility of finding the severity of COVID-19 through patients' lung X-rays. Such an approach would allow for automatic detection and prediction of case severity, allowing healthcare professionals to determine the appropriate approach and to leverage available resources in the treatment of that individual patient. Development of an AI basis for such a novel system is the goal of this paper. From a literature overview, it can be noticed that all presented research has been based on a binary classification of X-ray images (infected/not infected) or differentiating COVID-19 infection and other respiratory diseases.
To summarize the novelty: this article, unlike the articles presented in the literature review, deals with a multi-class classification of X-ray images of positive COVID-19 patients with the aim of estimating the clinical picture. All the presented examples used large numbers of images (more than 1000) in the training and testing of the neural networks. While the number of COVID-19 patients is high, data collection, especially in countries with lower-quality healthcare systems, may be problematic due to the strain caused by the coronavirus. Because of this, it is important to test the possibility of algorithm development combined with data augmentation operations, which is the secondary goal of the presented research.
According to the presented facts and the literature overview, the following questions arise:
• Is it possible to utilize CNNs in order to classify COVID-19 patients according to X-ray images of the lungs?
• Which CNN architecture achieves the highest classification performance?
• Which are the best-performing configurations with regard to the solver, number of iterations, and batch size?
• How do transfer learning and layer freezing influence the performance of the best configurations?

Dataset Construction
In this section, a brief description of the used dataset will be provided, together with examples of each class. Furthermore, the data augmentation technique will be presented. At the end, the division into training, validation, and testing sets will be presented.

Dataset Description
The dataset used in this research was obtained from the Clinical Centre in Kragujevac [44] and consists of 185 X-ray images that represent the lungs of 21 patients diagnosed with COVID-19. The dataset consists of 7 female and 18 male patients, and the age of the patients (mean ± standard deviation) was 58.9 ± 11.1 years. Images have been divided into four groups according to the clinical picture of the patient. Classification to a clinical picture was performed according to clinical data containing a number of parameters. An overview of image classes is presented in Figure 1, where each class is represented by an X-ray image. For the purposes of this research, X-ray images collected during treatment were used to create the dataset. The dataset was created with respect to the clinical picture of the patient, where each X-ray image was classified to the appropriate class. The data distribution according to classes is presented in Figure 2.

Description of Data Augmentation Technique and Resulting Dataset
Due to the small number of images in the dataset, a process called augmentation was utilized in order to increase classification performance [45]. The augmentation procedure was performed with the aim of artificially increasing the training dataset [46], while the testing dataset remained the same. This procedure is often used in fields such as bio-medicine, due to the fact that a large amount of bio-medical data is often unavailable [47]. In this particular case, a set of geometrical operations was utilized in order to increase the dataset: rotations by 90, 180, and 270 degrees, and mirroring, each described in detail below. In addition to these geometrical operations, brightness augmentation was also performed: all the images obtained by the geometrical transformations were further modified by multiplying all image pixel values by factors of 0.8, 0.9, 1.1, and 1.2, in addition to the original brightness.
The 90-degree rotation is an operation that rotates the original image (presented in Figure 3a) by 90 degrees in a clockwise direction around the sagittal axis, as presented in Figure 3b. Following the presented logic, rotations by 180 and 270 degrees were performed as well, as presented in Figure 3c,d. Images rotated by 90 and 270 degrees were rescaled in order to have the same dimensions as the original image. Image generation by 180-degree rotation around the longitudinal axis was performed in such a way that the new image represents a mirrored projection of the original image, as presented in Figure 3e. The mirrored image was then rotated around the sagittal axis, forming three new variations, as presented in Figure 3f-h. As the final approach to image augmentation, a process of multiplying all image pixels by a certain factor was applied. In this case, four different factors (0.8, 0.9, 1.1, and 1.2) were used. The described transformations are presented on an original image in Figure 3i-l. It is important to notice that these transformations were applied to the augmented set created by using all the described geometrical transformations. By using such an approach, the new augmented dataset was four times larger than the dataset created by using just the geometrical transformations.
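As an illustration, the rotation, mirroring, and brightness transformations described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; it assumes a square grayscale image, so the rescaling of the 90- and 270-degree rotations is omitted:

```python
import numpy as np

def augment(image: np.ndarray, factors=(0.8, 0.9, 1.1, 1.2)) -> list:
    """Generate geometric variants (rotations and mirroring) of a grayscale
    X-ray image, then brightness variants of each geometric variant."""
    geometric = []
    for base in (image, np.fliplr(image)):      # original and mirrored image
        for k in range(4):                      # 0, 90, 180, 270 degree turns
            geometric.append(np.rot90(base, k))
    out = list(geometric)                       # keep the original brightness
    for img in geometric:
        for f in factors:                       # multiply all pixel values
            out.append(np.clip(img.astype(np.float32) * f, 0, 255).astype(np.uint8))
    return out

img = np.random.randint(0, 255, (64, 64), dtype=np.uint8)
variants = augment(img)
print(len(variants))  # 8 geometric variants x (1 original + 4 factors) = 40
```

One source image thus yields 40 variants, illustrating how a small dataset can be expanded without discarding any image content.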
Only geometrical transformations and multiplication of all image pixels by a certain factor were used for data augmentation, in order to keep the entire content of the image. Other techniques, such as scaling, could remove parts of the image, so they were considered inappropriate due to the nature of the problem, which observes the entire image as it is delivered from the hospital X-ray system. By using the augmentation process described in the previous paragraphs, a new augmented dataset of 5400 images was constructed. The class distribution of the new set is presented in Figure 4a. It is important to notice that only images contained in the original training set were used for the creation of the augmented dataset; in other words, images used for classifier testing were not used for the creation of the augmented set. Accordingly, the training set of 881 images was divided into training and validation sets in a 75:25 ratio, common in machine-learning practice. The presented sets were used for the training of the CNNs, while the original testing set was used for the evaluation of their classification performance. The above-described dataset division is presented in Figure 4b.

Description of Used Convolutional Neural Networks
In this subsection, an overview of the CNN-based methods for image classification will be presented. The CNNs used in this research are standard CNN architectures widely used for solving various computer vision and image recognition problems [48]. Such algorithms, alongside their variations, are widely used for various tasks of medical image recognition [49]. For the case of this research, the following CNN architectures were used: AlexNet, VGG-16, and ResNet (in its ResNet50, ResNet101, and ResNet152 variants).
All of the above-listed CNN architectures have predefined layers and activation functions, while other hyper-parameters, such as batch size, solver, and number of epochs, can be varied. The above-listed architectures were chosen due to the history of their high classification performance on similar problems. It has been shown that ResNet architectures have achieved high classification performance when used for multi-class classification of chest X-ray images [50]. Furthermore, ResNet architectures have been used in various tasks of medical data classification, ranging from tumor classification [51,52], through recognition of respiratory diseases [53], to fracture diagnosis [54,55].
An extensive search for the optimal solution through the hyper-parameter space can also be called a grid-search procedure. The variations of hyper-parameters used during the grid-search procedure for the CNN-based models are presented in Table 1. In order to determine the influence of overfitting, the number of epochs was varied with the aim of determining the number with the highest performance on the test dataset. Theoretically, when training with a large number of epochs, the model often becomes over-fitted; for this reason, it is necessary to find the optimal number of training epochs [15]. The solvers used in this research were selected due to their performance on multiple multi-label datasets [59]. In the following paragraphs, a brief description and the mathematical model of each solver will be provided.

Adam Solver
The Adam optimization algorithm represents one of the most-used algorithms for tasks of image recognition and computer vision. By using the Adam optimizer, weights are updated by following [56]:

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t,$$

where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

$m_t$ is defined as a running average of the gradients, and it can be described with:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t.$$

Furthermore, $v_t$ is defined as the running average of the squared gradients, or:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2.$$

$g_t$ can be defined with:

$$g_t = \nabla_w C(w_t),$$

where $C(\cdot)$ represents a cost function. The parameters of the Adam solver used in this research are presented in Table 2.
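For illustration, a single Adam update can be written directly from the equations above. This is a minimal NumPy sketch of the update rule, not the training code used in the research:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam weight update: running averages of the gradient and its
    square, bias-corrected, then used to scale the step."""
    m = beta1 * m + (1 - beta1) * g           # running average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# A single step on a scalar weight with gradient g = 1.0 moves the weight by
# roughly -lr, since both bias-corrected averages reduce to the gradient.
w, m, v = adam_step(w=0.0, g=1.0, m=0.0, v=0.0, t=1)
print(round(w, 6))  # -0.001
```
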
AdaMax Solver

The AdaMax solver follows logic similar to the Adam solver; in this case, the weight update is performed as [58]:

$$w_{t+1} = w_t - \frac{\eta}{1 - \beta_1^t}\,\frac{m_t}{v_t},$$

where $v_t$ can be defined as:

$$v_t = \max\left(\beta_2 v_{t-1}, |g_t|\right),$$

and $m_t$ is defined as:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t.$$

As in the case of the Adam solver, the parameters used in this research are presented in Table 2.

Nadam Solver
The third optimizer used in this research is Nadam. As with the AdaMax algorithm, Nadam is also based on Adam. Weights in this case are updated as [58]:

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\left(\beta_1 \hat{m}_t + (1 - \beta_1)\,\hat{g}_t\right),$$

where $\hat{v}_t$ is defined with:

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$

and $\hat{m}_t$ and $\hat{g}_t$ are defined as:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{t+1}}, \qquad \hat{g}_t = \frac{g_t}{1 - \beta_1^t}.$$

As in the case of the Adam and AdaMax optimizers, the parameters of the Nadam solver are presented in Table 2. The presented parameters will be used for training the CNNs, and the classification performance of all trained models will be evaluated using the testing dataset. In the following paragraphs, a brief overview of the used CNN architectures will be presented.
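As an illustration of how Nadam modifies the Adam step, the update can be sketched in NumPy. This follows one common formulation of Nadam (a Nesterov-style look-ahead applied to the momentum term) and is not the exact implementation used in the research:

```python
import numpy as np

def nadam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam update: an Adam-style step whose momentum term mixes the
    bias-corrected running average with the current (corrected) gradient."""
    m = beta1 * m + (1 - beta1) * g           # running average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2      # running average of squared gradients
    m_hat = m / (1 - beta1 ** (t + 1))        # look-ahead bias correction
    g_hat = g / (1 - beta1 ** t)              # bias-corrected current gradient
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (beta1 * m_hat + (1 - beta1) * g_hat) / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = nadam_step(w=0.0, g=1.0, m=0.0, v=0.0, t=1)
print(w < 0)  # the step moves the weight against the gradient
```
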

AlexNet
AlexNet represents one of the classical CNN architectures used for various tasks of image recognition and computer vision. This architecture is one of the first CNNs based on a deeper configuration [60]. AlexNet won the ImageNet competition in 2012. The success of such a deep architecture introduced a trend of designing even deeper CNNs that can still be noticed today [61]. AlexNet is based on a configuration of nine layers, where the first five layers are convolutional and pooling layers, and the last four are fully connected layers [62]. A detailed description of the AlexNet architecture is provided in Table 3.

VGG-16
The described trend of deeper CNN configurations resulted in improvements of the original AlexNet architecture. One such architecture is VGG-16, presented in the following years. VGG-16 represents a deeper version of AlexNet, where the nine-layer configuration is replaced with a 16-layer configuration, from which the name is derived [63]. A main advantage of VGG-16 is the introduction of smaller kernels in the convolutional layers, in comparison with AlexNet [64]. A detailed description of the VGG-16 layers is provided in Table 4.

ResNet
From the presented networks, a trend of designing ever-deeper networks can be noticed [65]. This approach can be utilized only up to a certain level, due to the vanishing gradient problem [66]: beyond it, deeper configurations show no significant improvements in classification performance, and in some cases, deeper CNNs can even show lower classification performance than CNNs designed with a smaller number of layers. For these reasons, an approach based on residual blocks was proposed. The residual block represents a variation of a CNN layer in which the layer is bypassed with an identity connection [67]. The block scheme of such an approach is presented in Figure 5. By using the presented residual approach, significantly deeper networks can be used without the vanishing gradient problem. This characteristic is a consequence of the identity bypass, because identity connections do not hinder the CNN training procedure [68]. For these reasons, deeper CNNs designed with residual blocks will not produce a higher error in comparison with shallower architectures; in other words, by stacking residual layers, significantly deeper architectures can be designed. For the case of this research, three different architectures based on the residual block were used: ResNet50 [69], ResNet101 [70], and ResNet152 [71]. These are pre-defined ResNet architectures mainly used for image recognition and computer vision problems that require deeper CNN configurations.
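The key property of the residual block, namely that the identity shortcut lets the block fall back to the identity mapping, can be illustrated with a minimal fully connected sketch (a hypothetical toy example, not the convolutional blocks used in the actual ResNet architectures):

```python
import numpy as np

def residual_block(x, weights1, weights2):
    """A minimal fully-connected residual block: two weight layers with ReLU,
    plus an identity shortcut added before the final activation."""
    h = np.maximum(0, x @ weights1)      # first layer + ReLU
    h = h @ weights2                     # second layer, no activation yet
    return np.maximum(0, h + x)          # identity shortcut, then ReLU

# With zero weights the block reduces to the identity for non-negative inputs,
# which is why stacking residual blocks cannot hurt a shallower solution.
x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))  # [1. 2. 3.]
```
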

Research Methodology
As presented in the previous sections, this research is based on a comparison of multiple methods of image recognition that will be used in order to estimate the severity of COVID-19 symptoms according to X-ray images of patients' lungs. All methods have been compared and evaluated from a standpoint of classification performances. In this case, AUC micro and AUC macro are used.

Description of AUC micro and AUC macro
Image classifiers are evaluated using standard classification measures, such as the area under the ROC curve (AUC). Such an approach is based on the construction of the ROC curve using the false-positive rate (FPR) and true-positive rate (TPR). TPR can be described as the ratio between the number of correct classifications in one class ($A_C$) and the total number of members of that class, which is the sum of the number of correct classifications and the number of incorrect classifications ($A_I$). The aforementioned ratio can be defined as:

$$TPR = \frac{A_C}{A_C + A_I}.$$

On the other hand, FPR can be defined as the ratio of the number of incorrect classifications in the first class ($B_I$) and the total number of members of the second class ($B_I + B_C$). The aforementioned ratio can be written as:

$$FPR = \frac{B_I}{B_I + B_C}.$$

By using TPR and FPR, the ROC curve can be constructed and the AUC value determined. The challenge, in this case, lies in the fact that this measure is designed to evaluate binary classifiers. In the case of this research, the classification is performed on four classes. For this reason, the standard ROC-AUC procedure must be adapted to evaluate multi-class classification performance [49]. This is achieved by using the AUC micro and AUC macro measures.

AUC micro
The definition of AUC micro is based on the calculation of TPR micro and FPR micro. TPR micro can be calculated as the ratio between the number of correct classifications and the total number of samples. This relation can be written as:

$$TPR_{micro} = \frac{A_C + B_C + C_C + D_C}{N},$$

where $A_C$ represents the number of correct classifications in class A, $B_C$ the number of correct classifications in class B, $C_C$ the number of correct classifications in class C, and $D_C$ the number of correct classifications in class D, while $N$ represents the total number of samples. Following a similar methodology, FPR micro can be calculated as the ratio between the total number of incorrect classifications and the total number of samples. In the above-stated notation, this ratio can be written as:

$$FPR_{micro} = \frac{A_I + B_I + C_I + D_I}{N}.$$

When the described TPR micro and FPR micro are used for ROC curve construction, the area underneath is called AUC micro. This area represents a discrete micro-average value for the evaluation of multi-class classifier performance.

AUC macro
Similar to AUC micro, AUC macro can be used for the performance evaluation of a multi-class classifier. In this case, the average TPR is calculated as an average of the TPR values of the individual classes. For example, the TPR value for class A can be calculated as the ratio between the number of correct classifications in class A ($A_C$) and the total number of class A members ($N_A$). Such a ratio can be written as:

$$TPR_A = \frac{A_C}{N_A}.$$

When the presented formalism is applied to all classes, TPR macro can be calculated as follows:

$$TPR_{macro} = \frac{1}{M}\sum_{i=1}^{M} TPR_i,$$

where $M$ represents the total number of classes. Following the presented procedure, FPR macro can be calculated as an average of the individual FPR values:

$$FPR_{macro} = \frac{1}{M}\sum_{i=1}^{M} FPR_i,$$

where an individual value can be calculated as the ratio between the number of images incorrectly classified as members of a particular class and the total number of images that are members of that class. By using these measures, AUC macro can be calculated.
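The micro- and macro-averaged rates defined above can be computed directly from a confusion matrix. The following sketch uses a hypothetical four-class confusion matrix for illustration:

```python
import numpy as np

def micro_macro_rates(conf):
    """Compute TPR_micro, FPR_micro, and TPR_macro from a square confusion
    matrix conf, where conf[i, j] counts class-i samples predicted as class j."""
    n = conf.sum()                                     # total number of samples N
    correct = np.trace(conf)                           # A_C + B_C + C_C + D_C
    tpr_micro = correct / n
    fpr_micro = (n - correct) / n                      # all misclassifications over N
    per_class_tpr = np.diag(conf) / conf.sum(axis=1)   # A_C / N_A for each class
    tpr_macro = per_class_tpr.mean()                   # average over the M classes
    return tpr_micro, fpr_micro, tpr_macro

# Hypothetical confusion matrix for four classes of unequal size.
conf = np.array([[9, 1, 0, 0],
                 [2, 6, 1, 1],
                 [0, 0, 4, 1],
                 [1, 0, 0, 4]])
rates = micro_macro_rates(conf)
print(rates)
```

Note that with unbalanced classes, as in the dataset used here, the micro and macro averages differ: the micro average weights every sample equally, while the macro average weights every class equally.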

Overfitting Issue
Due to the large CNN models used in this research, it is necessary to include steps to overcome overfitting. An over-fitted CNN shows high classification performance on the training dataset, while its performance on the testing dataset is quite poor. In order to prevent overfitting, some steps must be taken. According to [58], there are several mechanisms used to overcome overfitting in image classifiers. The mechanisms used in this research are:
• Image augmentation; and
• Early stopping.
Image augmentation, as one of the key techniques for handling the overfitting issue, was addressed earlier in the article. In order to perform early stopping, an analysis of the change of AUC micro and AUC macro over the number of epochs was performed. The data obtained from this analysis will be used to determine the optimal number of training epochs for each CNN architecture. By using this approach, the selected networks will be trained for the number of epochs that allows for full training while avoiding overfitting.
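The early-stopping criterion can be illustrated with a small sketch that selects the epoch with the best validation AUC and stops once the score has not improved for a fixed number of epochs (the patience value here is a hypothetical choice, not taken from the paper):

```python
def best_epoch(val_auc_per_epoch, patience=3):
    """Pick the training epoch with the best validation AUC, stopping once the
    score has not improved for `patience` consecutive epochs (early stopping)."""
    best, best_idx, waited = float("-inf"), 0, 0
    for epoch, auc in enumerate(val_auc_per_epoch, start=1):
        if auc > best:
            best, best_idx, waited = auc, epoch, 0   # new best: reset patience
        else:
            waited += 1
            if waited >= patience:                   # no improvement: stop
                break
    return best_idx, best

# Validation AUC rises, then falls as the model starts to overfit.
print(best_epoch([0.71, 0.80, 0.86, 0.85, 0.84, 0.83]))  # (3, 0.86)
```
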

Freezing Layers
In order to increase the classification performance of the proposed networks, an approach of freezing layers during the training procedure will be used. Such an approach will be performed on the CNN configurations that have already achieved the highest performance. The procedure of freezing layers will be performed in an iterative manner, from the bottom of the network towards the higher layers, until maximal classification performance is achieved. Such a procedure is selected in order to fine-tune only specific layers of a CNN architecture pre-trained with ImageNet, while the other layers remain frozen during the training procedure. By using such an approach, the issues arising from an insufficient dataset are overcome to some extent. An example of the layer-freezing methodology is presented on a ResNet architecture in Figure 6, where the first, second, third, and half of the fourth block are frozen during training, while the other layers remain unfrozen.
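In a Keras-style API, freezing a layer amounts to clearing its trainable flag. The following sketch uses a stand-in DummyLayer class for illustration (a hypothetical substitute for a real model's layer list), freezing all but the last 12 layers as in the best-performing ResNet152 configuration:

```python
class DummyLayer:
    """Stand-in for a framework layer exposing a `trainable` flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_all_but_last(layers, n_unfrozen):
    """Freeze every layer except the last `n_unfrozen`, so only the top of an
    ImageNet-pretrained network is fine-tuned on the X-ray data."""
    to_freeze = layers[:-n_unfrozen] if n_unfrozen else layers
    for layer in to_freeze:
        layer.trainable = False
    return layers

model_layers = [DummyLayer(f"layer_{i}") for i in range(152)]
freeze_all_but_last(model_layers, 12)
print(sum(layer.trainable for layer in model_layers))  # 12 layers remain trainable
```

The iterative procedure described above would repeat this with progressively smaller frozen portions, retraining and evaluating at each step.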

Results Representation
In order to define the network that achieves the best classification performances, maximal AUC micro and AUC macro achieved with all networks will be compared. As a first step, the influence of the number of epochs and the batch size on maximal AUC micro and AUC macro will be examined. Furthermore, the configuration that produces the highest result will be presented for all CNNs. As a final step, all maximal AUC micro and AUC macro values achieved with each CNN will be compared in order to determine the architecture with the highest classification performances. A schematic representation of the research methodology is presented in Figure 7.

Results and Discussion
In this section, an overview of the results achieved with each of the proposed CNN architectures will be presented. For each architecture, diagrams describing the change of the maximal AUC micro and AUC macro values as a function of the number of epochs and batch size will be provided. At the end of the section, a comparison of the achieved results will be presented and discussed.

Results Achieved with AlexNet
As the first of the results achieved with the AlexNet architecture, the change of AUC macro over the number of training epochs is presented in Figure 8. When the results are compared, it can be noticed that AUC macro achieves its maximum at 50 and 75 training epochs, regardless of the solver utilized. Furthermore, it can be noticed that at higher numbers of epochs, a significant fall in the AUC macro value occurs for all solvers. Such a fall in classification performance can be recognized as a consequence of overfitting.
The change of AUC micro is presented in Figure 9, where a trend similar to that of AUC macro can be noticed. In this case, maximal performances are also achieved with 50 and 75 consecutive training epochs. Furthermore, it can be noticed that the AUC micro values are, at the same points, slightly higher than AUC macro. The trend of overfitting at a larger number of epochs is also noticeable.
When the influence of batch size on AUC macro is observed, it can be noticed that no configuration achieves an AUC macro value higher than 0.8. This property is in correlation with the case described in Figure 8. It is interesting to notice a significant fall of AUC macro values when batches of size 16 are utilized; in this case, AUC macro settles around a value of 0.7. This characteristic can be noticed for all three solvers utilized, as presented in Figure 10. The presented results are in correlation with previous knowledge regarding the regularizing effect of smaller batch sizes [72]; such an approach helps overcome the overfitting issue.
As the final evaluation of AlexNet's classification performances, the influence of batch size on AUC micro is observed. Similar to the case presented in Figure 9, an AUC micro slightly higher than 0.8 was achieved if batches of four and eight were used. In the case of a batch size of 16, a significantly lower AUC micro of around 0.7 was achieved. The described property can be noticed regardless of the solver utilized, as presented in Figure 11.

Results Achieved with VGG-16
Results similar to those achieved with AlexNet are achieved with VGG-16, as presented in Figure 12. Similar behavior of AUC micro is presented in Figure 13, where maximal performances can be noticed when the network is trained for 50 or 75 consecutive epochs. When the network is trained for a higher number of epochs, a significant fall of AUC micro can be noticed. Such a result is a consequence of overfitting at a larger number of epochs.
The influence of batch size on AUC macro for the case of VGG-16 is presented in Figure 14: the highest values are achieved when batches of four and eight are used. When larger batches of 16 are used, the AUC micro value is positioned around 0.7, as presented in Figure 15. In this case, a gap between AUC micro and AUC macro can also be noticed.

Results Achieved with ResNet Architectures
In the following sub-section, an overview of results achieved by using ResNet architectures will be presented. All results will be presented and described in a similar manner as in the case of AlexNet and VGG-16.

Results Achieved with ResNet50
The change of AUC macro over the number of epochs is presented in Figure 16. From the presented results, it can be noticed that the maximal AUC macro values are achieved when the network is trained for 100 epochs. This characteristic holds for the Adam and AdaMax solvers, while for the Nadam solver, the maximal AUC macro is achieved when the network is trained for 50 consecutive epochs. If the CNN is trained for a larger number of epochs, a significant drop of AUC macro can be noticed. Such a result points towards the fact that the overfitting issue occurs if ResNet50 is trained for a larger number of epochs.
Furthermore, when Figures 16 and 17 are observed, a similar trend can be noticed for the case of AUC micro . A significant drop of AUC micro occurs if ResNet50 is trained for a higher number of consecutive epochs, while the AUC micro value peaks when the network is trained for 75 epochs with the Adam solver or 100 epochs with the AdaMax and Nadam solvers. The lower performances at a higher number of epochs point towards overfitting.

When the influence of batch size on AUC micro and AUC macro is examined, the results presented in Figures 18 and 19 are achieved. It can be noticed that the highest results are achieved when larger batches of 16 are used. In this case, the maximal AUC macro goes up to 0.9 only if the AdaMax solver is utilized. In the case of smaller batches, AUC macro values between 0.7 and 0.8 are achieved, regardless of the solver utilized. Similar conclusions can be drawn when AUC micro values are compared; the only notable difference is a significant underperformance of the networks trained with the Adam solver using smaller batches, as presented in Figure 19.

Results Achieved with ResNet101
The change of AUC macro over the number of epochs achieved with ResNet101 is presented in Figure 20. From the presented results, it can be noticed that the highest performances are achieved when the CNN is trained for 150 epochs. Such a property can be noticed for all solvers, with the exception of Nadam, which achieves similar results at 50 epochs. Furthermore, a significant drop of the AUC macro value can be noticed at a higher number of epochs; such a fall can be attributed to overfitting.

The influence of batch size on AUC macro is presented in Figure 22, and a similar conclusion can be reached if classification performances are evaluated by using AUC micro . It can be noticed that the highest AUC macro values are achieved if the network is trained by using larger batches, as presented in Figure 23.

Results Achieved with ResNet152
The last CNN used in this research is ResNet152. The change of AUC macro over the number of epochs is presented in Figure 24. From the presented results, it can be noticed that the highest AUC macro value is achieved when ResNet152 is trained for 125 epochs, regardless of the solver utilized. In the case of the AdaMax solver, AUC macro values over 0.9 are also achieved when the network is trained for 75 epochs. Furthermore, an influence of overfitting can be noticed when the network is trained for a larger number of epochs. Due to this property, it is important to train the network for a lower number of consecutive epochs in order to prevent overfitting and, consequently, lower classification performances.
Similar results are achieved when the change of AUC micro over a different number of epochs is observed, as presented in Figure 25. In this case, the highest AUC micro values are achieved when the network is trained for 100 and 125 consecutive epochs. It is important to notice that a significant fall of AUC micro occurs when the CNN is trained for 175 and 200 consecutive epochs. Such a trend can be noticed regardless of the solver utilized, and it points toward an occurrence of overfitting. Due to these results, it can be concluded that training for such a large number of epochs should be avoided in order to prevent overfitting and to achieve higher classification performances.
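The observed drop in performance at 175 and 200 epochs is the classic motivation for early stopping, one of the countermeasures named in the conclusions. The following framework-agnostic sketch shows the idea: `run_epoch` is a hypothetical callback that trains one epoch and returns the validation AUC, and training halts once the AUC has not improved for a chosen number of epochs. The simulated AUC curve below is illustrative only.

```python
# Minimal early-stopping sketch: stop once validation AUC has not
# improved for `patience` consecutive epochs.
def train_with_early_stopping(run_epoch, max_epochs=200, patience=25):
    best_auc, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        auc = run_epoch(epoch)
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: likely overfitting
    return best_epoch, best_auc

# Simulated validation-AUC curve that peaks at epoch 125 and then degrades,
# mimicking the overfitting trend described above (illustrative values).
curve = lambda e: 0.96 - abs(e - 125) / 500.0

best_epoch, best_auc = train_with_early_stopping(curve)
print(best_epoch, best_auc)  # stops shortly after the peak at epoch 125
```

With such a stopping rule, the degraded models trained for 175 and 200 epochs would never be reached, and the model from the best-performing epoch is kept.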
When the influence of batch size on AUC macro is observed, it can be noticed that a higher AUC macro is achieved by using a larger batch size of 16. Significantly lower AUC macro values are achieved when ResNet152 is trained by using smaller batches of four. This property can be noticed for all three solvers, as presented in Figure 26. Similar results can be noticed when the influence of batch size on AUC micro is observed; the only significant difference lies in the fact that, for a batch size of four, somewhat higher values are achieved, as presented in Figure 27. Regardless of the higher value, AUC micro in this case is still too low to be taken into consideration for practical application.

Comparison of Achieved Results
When the results achieved with all CNN architectures are compared, it can be noticed that, in the case of AlexNet and VGG-16, the highest AUC macro values are achieved if the networks are trained by using smaller batches for a lower number of epochs. On the other hand, the ResNet architectures show better performances when trained by using larger batches for a higher number of consecutive epochs. The configurations that achieved the highest AUC macro values are presented in Table 5. Similarly, the highest AUC micro values are achieved when AlexNet and VGG-16 are trained by using smaller batches for a lower number of consecutive epochs, while for the ResNet architectures, the highest AUC micro values are achieved when the CNNs are trained by using larger batches for a larger number of epochs. The described configurations are presented in Table 6.

Finally, when the highest AUC macro and AUC micro achieved with each CNN architecture are compared, it can be noticed that the ResNet architectures achieve dominantly higher classification performances. Moreover, a rising trend of AUC macro and AUC micro can be noticed with deeper ResNet architectures, as presented in Figure 28. The achieved results point to the conclusion that the highest AUC macro and AUC micro values of 0.93 and 0.94, respectively, are achieved by using the ResNet152 architecture. Given these results, the possibility of using a CNN for automatic classification of patients with COVID-19 with respect to lung status should be considered.

Furthermore, when layer freezing is considered, it can be noticed that higher classification performances are achieved by freezing the higher layers of the CNNs during the training procedure. The distribution of frozen and unfrozen layers is presented in Table 7 for each CNN architecture utilized.
When the achieved classification performances are compared with those obtained when all layers are fine-tuned, it can be noticed that slightly higher performances are achieved with each of the proposed CNN architectures when layers are frozen, as presented in Figure 29. Furthermore, it can be noticed that the ordering of the architectures from the best to the worst classification performance is the same as in the previous case.
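The layer-freezing scheme can be illustrated with a small framework-agnostic sketch. The `Layer` class with a `trainable` flag is a stand-in for the corresponding attribute in a real deep-learning framework, and the choice of 12 unfrozen layers follows the best-performing ResNet152 configuration reported in the abstract; the 50-layer model below is hypothetical.

```python
# Stand-in for a framework layer with a trainable flag.
class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_all_but_last(layers, n_unfrozen):
    """Freeze every layer except the last `n_unfrozen` ones, so only the
    task-specific higher layers are updated during fine-tuning."""
    cut = max(len(layers) - n_unfrozen, 0)
    for layer in layers[:cut]:
        layer.trainable = False

# Hypothetical 50-layer model; only the last 12 layers remain trainable.
model = [Layer(f"layer_{i}") for i in range(50)]
freeze_all_but_last(model, 12)
trainable = [layer.name for layer in model if layer.trainable]
print(len(trainable))  # 12
```

Freezing the early, generic feature extractors while fine-tuning only the last layers reduces the number of trainable parameters, which is particularly helpful on a small dataset such as the one used here.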

When the results achieved with transfer learning are compared to the results achieved on similar problems, it can be noticed that higher classification performances are achieved if transfer learning is utilized. Such a correlation can be noticed when the achieved results are compared with the results of research dealing with both COVID-19 [42,43] and other respiratory issues [22,23]. These results are pointing towards the utilization of transfer learning in order to increase the accuracy of evaluation of the clinical picture of COVID-19 patients from X-ray lung images.
These results show that ResNet152, in combination with transfer learning, is the network that achieves the best results in the case of evaluation of the clinical picture of COVID-19 patients using X-ray lung images.

Conclusions
The results achieved within this research point towards the conclusion that CNN-based architectures could be used in the estimation of the clinical picture of a COVID-19 patient according to X-ray lung images. It is important to notice that deep CNN architectures have a tendency to overfit when they are trained for a higher number of consecutive epochs. Due to this property, it is concluded that steps such as early stopping and image augmentation must be used in order to prevent overfitting. According to the presented results and the stated research hypothesis, the following conclusions can be drawn:

•	It is possible to utilize a CNN for the automatic classification of COVID-19 patients according to X-ray lung images;
•	The best results are achieved if the ResNet152 architecture is utilized;
•	The best results are achieved if the aforementioned architecture is trained by using larger batches of data for an intermediate number of consecutive epochs with the Nadam solver; and
•	By utilizing transfer learning and freezing layers, higher classification performances are achieved.
Due to the presented results and conclusions, the possibility of utilizing such an algorithm in the battle against COVID-19 and its application in clinical practice should be taken into account. The main limitations of this research were the small number of X-ray images, which could be overcome, to some extent, by augmentation techniques, and the class imbalance of the dataset. Regardless of these limitations, the presented approach has shown promising results, which point to further research on a larger and more balanced data set.