A Method for Detecting and Analyzing Facial Features of People with Drug Use Disorders

Drug use disorders caused by illicit drug use are significant contributors to the global burden of disease, and early detection of people with drug use disorders (PDUD) is vital. However, primary care clinics and emergency departments lack simple and effective tools for screening PDUD. This study proposes a novel method for detecting PDUD from facial images. Various experiments were designed to obtain a convolutional neural network (CNN) model by transfer learning on a large-scale dataset (9870 images of PDUD and 19,567 images of the general population (GP)). The model achieved 84.68% accuracy, 87.93% sensitivity, and 83.01% specificity on this dataset. To verify its effectiveness, the model was evaluated on external datasets based on real scenarios, where it still achieved high performance (accuracy > 83.69%, sensitivity > 80.00%, specificity > 84.25%). Our results also show differences between PDUD and the GP across facial areas. Compared with the GP, the facial features of PDUD were mainly concentrated in the left cheek, right cheek, and nose areas (p < 0.001), which also reveals a potential relationship between the mechanisms of drug action and changes in facial tissues. This is the first study to apply a CNN model to screen PDUD in clinical practice, and the first attempt to quantitatively analyze the facial features of PDUD. The model could be quickly integrated into existing clinical workflows and medical care to provide screening capabilities.


Introduction
A drug use disorder, including drug abuse and drug dependence, is the persistent use of drugs despite substantial mental, physical, or behavioral harm. These disorders lead to adverse consequences, most commonly caused by illicit drugs (including stimulants, depressants, and hallucinogens), physiological withdrawal symptoms, and the inability to reduce or stop consuming drugs [1]. Drug use disorders caused by illicit drug use are significant contributors to the global burden of disease, and directly led to 20 million disability-adjusted life-years (DALYs) in 2010, accounting for 0.8% of global all-cause DALYs [2]. The Global Burden of Disease Study showed that, in 2017, 35 million people suffered from drug use disorders and required treatment services, and 750,000 people died as a result of illicit drug use [3]. It is therefore vital to recognize the early signs of drug use disorders and provide early intervention before addiction takes hold, which is essential to ensure the strongest chances of successful recovery.
With the increase in patients using illicit drugs, primary care clinics and emergency departments (EDs) are facing challenges. Less than 20% of primary care physicians claimed

Related Works
Although little research has been completed on the detection of PDUD using deep learning, related research on drug use has provided new ideas. Snorting illicit drugs can cause permanent damage to a person's nose [10]. Some illegal drugs, such as cocaine, act as powerful stimulants that suppress appetite and lead to undernourishment over long periods [11]. Rapid weight loss can cause the body to begin consuming muscle tissue and facial fat, accelerating biological aging and distorting the face [12,13]. Thus, abnormalities in some or all facial areas might also be indicators of PDUD. A previous study pointed out a significant increase in facial asymmetry in methamphetamine abusers [14].
Deep learning has been used actively in medical imaging, for tasks such as disease detection and medical image segmentation. As traditional methods reach their performance limits on images, CNNs have started to dominate because of their good results on a variety of image classification tasks [15]. Shankar [16] proposed a deep learning algorithm based on the assessment of color fundus photographs to predict diabetic retinopathy (DR) progression in patients, and the clinical trial showed its potential for early identification of patients at the highest risk of DR, allowing timely referral to retina specialists and initiation of treatment. William [17] trained a CNN model to improve breast cancer detection on screening mammography, achieving an area under the curve (AUC) of 0.927 on the training dataset. The model could accurately locate clinically significant lesions and base predictions on the corresponding portions of the mammograms. It was also effective in reducing false positives and false negatives in clinical screening. With end-to-end training, CNN models can directly convert input data into an output prediction without constructing complicated hand-crafted features: the parameters of the intermediate layers are learned automatically, and feature extraction is performed during training. In this paper, a large-scale image dataset was prepared, and a CNN model with high accuracy for screening patients with drug use disorders was proposed, making it promising for clinical applications.

Study Design and Procedure
Our study consisted of three main processes, as shown in Figure 1. First, 2416 images and 256 videos of 71 PDUD, and 103 videos of 103 GP, were collected. The PDUD data were collected from a mobile health (mHealth) app (detailed information on the app can be found in Appendix A); the data in the mHealth app covered 30 October 2017 to 31 January 2020. The videos of GP were collected from the Internet. One frame was captured from each video every 3 s and saved as an image. Second, the images of PDUD (10,447) and the GP (21,666) were preprocessed to obtain clear facial images, and invalid or blurred ones were removed. Third, to eliminate distracting external information, such as backgrounds, clothes, or accessories, face cropping was performed, and images with facial occlusion were removed. To facilitate batch processing by the CNN model, all images were resized to 224 × 224 pixels. After the above preprocessing, the images of PDUD (9870) and the GP (19,567) were merged (Figure 2). Following the 70/30 principle, these images were shuffled and randomly divided into a training dataset and a test dataset. The CNN model was trained on the training dataset, and its accuracy, sensitivity, and specificity were calculated on the test dataset (Appendix B Figure A1). Before considering using the model for clinical prediction, it is essential that its performance be empirically evaluated on datasets that were not used to develop the model [18]. Therefore, external validation datasets were prepared to evaluate the trained model. Another nine videos of nine PDUD and 50,000 images of 50,000 GP were collected; the PDUD data were provided by the local administrative department of Jinan City, Shandong Province, China, and the GP data were collected from a public database [19]. The videos of PDUD underwent a similar video processing flow.
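The shuffled 70/30 split described above can be sketched in a few lines of Python; the helper name and the fixed seed are illustrative, not from the paper:

```python
import random

def split_dataset(items, train_frac=0.7, seed=42):
    """Shuffle and split items into training and test subsets (70/30 principle)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]
```

Applied to the 29,437 merged images, this yields 20,605 training and 8832 test images, matching the dataset sizes reported in the results.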
After the above video processing, there were 1925 images of PDUD and 50,000 images of GP. Unclear or blurred images were filtered out; when preprocessing was complete, 1677 images of PDUD and 50,000 images of GP remained. The external validation datasets comprised validation datasets 1 through 7. For validation dataset 1, the data distribution was consistent with the training/test dataset, and it was used to evaluate the performance of the trained model on new data. In addition, considering the prevalence of drug use disorders in the clinic, the number of images in validation dataset 2 was calculated as the minimum required sample size based on the prevalence in China (1.80%, power = 0.90, α = 0.05) [20]. Moreover, to further evaluate the performance of the trained model under real-world scenarios, the sample size of validation dataset 2 was expanded by factors of 1.5, 2, 2.5, 3, and 5 to obtain validation datasets 3-7, respectively (Figure 3). According to the number of images required in each validation dataset, images were randomly selected from the 51,677 images of PDUD and GP. Finally, the performance of the trained model was evaluated on these seven validation datasets. The above image preprocessing was performed with the Dlib library, and the sample size was calculated with PASS version 11.0 [21][22][23][24].

CNN Construction and Training
CNN models were trained on the training dataset and tested on the test dataset to extract valid facial information from a large sample of images. Since the dataset had binary labels, the task was designed as binary classification.
To find an appropriate model architecture, we analyzed three mainstream CNN models with transfer learning: Vgg-19, Inception, and Resnet-18 [25][26][27]. Then, the attention technique and a pre-trained model were introduced during training to improve the accuracy of the CNN model. The attention technique was used to make the CNN learn and focus more on the important information in images [28]. A pre-trained model is a saved network that was previously trained on a large dataset, typically an image dataset similar to the training target [29]. Therefore, a Resnet-18 model pre-trained on MS-Celeb-1M, a database for large-scale face recognition, was chosen [30]. In addition, we tried different layer-freezing configurations to compare the performance of models during transfer learning: (1) training a CNN from scratch; (2) freezing all layers except the last fully connected layer; and (3) freezing all layers except the last five. Next, the training of the CNN model involved multiple hyperparameters, and the performance of the CNN models on the test dataset could be improved by adjusting them. To obtain better parameters, different experiments were designed to adjust various parameters while avoiding over-fitting and under-fitting on the dataset (Table 1). The adjustment strategy included: (1) different learning rates (LR); (2) whether to use batch normalization (BN); (3) whether to use a pre-trained model; and (4) whether to initialize the weights in the layers of the models. Moreover, the optimization algorithms stochastic gradient descent (SGD) and adaptive moment estimation (Adam) were applied to select the better training algorithm [31] (detailed training information can be found in Appendices B and C). Training ended when the loss of the models on the training dataset no longer decreased.
By comparing the accuracy of models with different parameters on the test dataset, the model with the best accuracy was chosen as the final CNN model. The sensitivity and specificity on the test dataset were calculated to evaluate the model comprehensively. The best-performing model was then used to calculate the accuracy, sensitivity, and specificity on the seven external validation datasets. All image analysis code was written in open-source Python 3.6, and the CNN networks were implemented in PyTorch 1.3 [32] (Appendix B Algorithm A1). All networks were trained on an NVIDIA GeForce RTX 2080 Ti.
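The three reported metrics follow directly from the confusion matrix; a self-contained sketch (treating PDUD as the positive class 1 and GP as 0) might be:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall on PDUD = 1), and specificity (recall on GP = 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # PDUD correctly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # GP correctly cleared
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    return accuracy, sensitivity, specificity
```

For a screening tool, sensitivity (not missing PDUD) and specificity (not mislabeling the GP) matter as much as overall accuracy, which is why all three are reported.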

Quantitative Analysis of Facial Features and Visualization
The interpretability of the CNN model is useful for explaining why it predicts what it predicts. A feature map is the output produced by a filter applied to the output of the previous layer of the network. The gradient-weighted class activation mapping (Grad-CAM) technique was applied to visualize the high-dimensional information of the CNN model [33]. We then quantitatively analyzed whether there were significant differences in features between PDUD and the GP in each facial area. An analysis pipeline was constructed that automatically counted the number of facial features in different facial areas of the input images. The complete analysis process was divided into the following six steps (Figure 4). First, a facial landmark detector was applied to produce 68 coordinates mapped to the structure of the face, and the entire face was divided into six areas: the left and right eyes, the nose, the left and right cheeks, and the mouth (Figure 4B and Appendix B Figure A2). Second, the image with its heatmap overlay was converted into a binary image. Third, in the binary image, the heatmap areas tended toward white and the remaining positions toward black (Figure 4C). Fourth, a Gaussian blur with a 3 × 3 kernel was applied to the binary images, followed by a threshold operation; the contours in the binary images were then marked as the facial features (Figure 4D). Contours whose length or width was less than 10 pixels were discarded, because they were too small to be valid facial features. Fifth, the number of times each facial feature appeared in the six facial areas was counted, and the proportions for the different areas were calculated (Figure 4E). The sixth step shows the result of the demonstration (Figure 4F). These steps were carried out with OpenCV-Python [34].
Finally, the characteristics of PDUD and the GP in different facial areas were compared by the chi-square test using SPSS version 22.0 (IBM Corporation, Armonk, NY, USA).
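For a single facial area, the chi-square comparison reduces to a 2 × 2 table (feature present/absent × PDUD/GP); a dependency-free sketch of Pearson's statistic, with purely illustrative counts echoing the reported proportions, might be:

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]] (df = 1, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With df = 1, a statistic above 10.828 corresponds to p < 0.001.
```

For example, hypothetical counts of 449/1000 PDUD images versus 290/1000 GP images showing a feature in one cheek area give a statistic far above the 10.828 critical value, i.e., p < 0.001.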

Figure 4. The quantitative facial feature analysis process. Note: the picture is for demonstration purposes only; the person in the picture has no connection with any use of illicit drugs. The data are from the BP4D dataset [35,36].

CNN Model Training and Performance
After excluding 2676 images because of the lack of clear or complete facial images, the training dataset consisted of 6871 PDUD images and 13,734 GP images, and the test dataset consisted of 2999 PDUD images and 5833 images of the GP (Figure 2).
For the three models, Resnet-18 achieved the best result when only the last five convolutional layers were trained (Table 2). Compared with Vgg-19 and Inception, Resnet-18 improved scores by 2-23 percentage points. Therefore, Resnet-18 was the suitable architecture in this study. In addition, the results showed that both the attention technique and the pre-trained model helped Resnet-18 achieve better performance, with increases of about 4 and 10 percentage points, respectively (Appendix B Table A1). The pre-trained model therefore offered more advantages for transfer learning. Regarding the different optimization algorithms, our results show that SGD was better than Adam at improving the model scores (Appendix B Table A2). Finally, the Resnet-18 model with the best accuracy of 84.68% was selected; its parameters included a learning rate of 0.1, batch normalization, and the model pre-trained on the MS-Celeb-1M dataset (Figure 5). On the test dataset, the sensitivity and specificity of the model were 87.93% and 83.01%, respectively (Table 3 and Appendix B Figure A1). According to the sensitivity and specificity on the test dataset, the minimum sample size was 13 images of PDUD and 7209 images of GP. We then selected the corresponding amounts of data and prepared the seven external validation datasets (Figure 3). Across the seven external validation datasets, accuracy was higher than 83.69% (highest 90.10%), sensitivity was higher than 80.00% (highest 86.67%), and specificity was higher than 84.25% (highest 90.10%) (Table 3, Appendix B Figures A1 and A4).

Typical Facial Features of PDUD and Visualization
In the activation heatmaps on the images, colors highlight the salient facial features extracted by the CNN model. In the examples of visualized images of PDUD, rows A, B, and C represent the output features in the cheek, nose, and mouth areas, respectively (Appendix B Figure A4). The concentration of facial features recognized by the CNN model differed between PDUD and the GP. The proportions for the GP were similar across the six facial areas (35.92% left eye, 43.31% right eye, 40.97% mouth, 29.02% nose, 34.36% left cheek, 35.47% right cheek). However, the recognizable facial features of drug users were more distinctive and mainly concentrated in the nose (42.98%), left cheek (44.91%), and right cheek (44.85%) areas, proportions much higher than those of the GP (p < 0.001) (Table 4). Note: * indicates a significant difference (p < 0.001) between people with drug use disorders and the general population, with a higher proportion in the general population; ** indicates a significant difference (p < 0.001), with a higher proportion in people with drug use disorders.

Discussions
This study developed and validated an image-based CNN model for screening PDUD. As the most popular CNN architecture in computer vision, the Resnet network showed higher performance with its simple but effective residual blocks. Freezing all but the last five layers also benefited transfer learning and reduced computation time. In addition, the attention technique and a pre-trained model were introduced in the transfer learning experiments. Overall, the pre-trained model contributed more to the final scores than the attention module: the attention module still needed iterative training to extract the relevant feature information, whereas the pre-trained model already contained rich facial information, which enabled the model to quickly extract facial feature cues of PDUD. Therefore, Resnet-18 with a pre-trained model was selected as the transfer learning scheme. On this basis, the model achieved a high accuracy of 84.68% on the test dataset after fine-tuning the parameters.
The external validation datasets were inspired by the authors of [37] and thus based on real scenarios. These scenarios were built to evaluate the performance of the model, and the prevalence of PDUD in them was consistent with that of the real world. The results showed that our CNN model still maintained good scores, which means the model is promising for practical clinical screening. Rapid screening, a simple operating process, and low medical cost would enable our model to be quickly integrated into existing clinical workflows and medical care. A related study found that most primary care physicians were not yet prepared to screen for drug abuse [38]. Our method can be applied in primary care clinics to provide screening services for patients, especially at a first visit; screening can be done prior to the medical encounter or in the waiting room. In addition, our model can be flexibly deployed in mobile apps. Screening can be done through the patient portal while the patient is at home, and the results can be integrated into the electronic health record to assist primary care physicians in providing appropriate preventive care. This not only alleviates patients' discomfort during face-to-face screening but also protects individual privacy; such electronic screening is also supported by patients [8]. On the other hand, our model can be embedded in the ED admission system to detect drug use disorders quickly. In the routine emergency treatment process, this screening capability can help doctors understand a patient's drug use and determine further intervention or referral to drug use treatment.
The visualization of the feature maps of our CNN model showed that drug use affected the face of the patients, which was consistent with previous case report studies [39].
The statistical significance for PDUD in the nose and cheek areas reveals a potential relationship between drug use patterns, the mechanisms of action of drugs, and changes in facial tissues; this is the first quantitative analysis of facial features in studies of PDUD. On the one hand, the characteristics of the nose area suggest a relation to specific drug use patterns. Snorting, sniffing (intranasal delivery), or smoking drugs is a pattern often chosen by drug users to avoid injection. Because the mucosa inside the nose is easily accessible, drugs in powder, liquid, or aerosol form are quickly absorbed, which can irritate or infect the nasal tissue [40]. Frequent snorting, sniffing, or smoking of illicit drugs deprives the nasal passages of oxygen and nutrients. The resulting death of nasal tissue cells can damage the nose of drug users, producing changes in this facial area [41]; our model captures this local feature. On the other hand, the facial features of PDUD in the cheek areas were directly related to the rapid loss of facial fat caused by illicit drug use, which is consistent with previous research [11,42]. The inhibitory effect of drugs on appetite can lead to malnutrition, and the superficial fat of the face is distributed mainly in the medial and middle cheek fat pads [9]. This change in facial fat distribution caused by drug use was therefore also extracted and recognized by our model. The discovery of these facial features provides leads for basic medical research, including the mechanisms of drug action, facial anatomy characteristics, and the physiological mechanisms of PDUD.
There are several limitations to this approach. Limited by our research data, our study did not meticulously categorize PDUD in terms of the two attributes of illicit drug type and the time of suffering from drug use disorders. Moreover, the image information collected through the mHealth app would be affected by the hardware of mobile devices. The stability of the mHealth app will be optimized in further study. Nevertheless, our work is highly innovative in related fields with high feasibility and accessibility, especially in detecting and analyzing facial features in PDUD.

Conclusions
Drug use disorders continue to attract attention; however, there is a lack of simple and efficient tools for clinical screening, especially in primary care clinics. This paper is, to the best of our knowledge, the first study to apply a CNN model with transfer learning to screen PDUD using facial images. Large-scale datasets were prepared, and various experiments were designed to optimize the model. The performance of the model was evaluated in real scenarios, and the results maintained high accuracy, sensitivity, and specificity. This study is therefore promising for clinical practice and would help clinicians find potential PDUD and provide timely intervention and targeted treatment. This is also the first study to quantitatively analyze the facial features of PDUD, which contributes to the exploration of the facial anatomy characteristics and physiological mechanisms of PDUD.
Author Contributions: Z.J. and Y.L. designed the study. Y.L. and Z.J. collected the data. X.Y., B.Z., and H.S. cleaned the data. Y.L., X.Y., and Z.W. analyzed the data. Y.L., X.Y., B.Z., and Z.W. explained the results. Y.L. and X.Y. wrote the initial draft of the manuscript. Y.L., X.Y., and Z.W. revised the report from preliminary draft to submission. All authors read and approved the final manuscript.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Written informed consent has also been obtained from the patient(s) to publish this paper.

Data Availability Statement:
The datasets during this study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Learning rates that are too small make the training process too slow, whereas learning rates that are too large can result in skipping the ideal minimum or an unstable training process (Figure A6). We chose 0.10 and 0.01 as initial values for the LR and compared training speed and accuracy on the dataset. Moreover, during training, the LR was adjusted dynamically: it was decayed by a multiplicative factor of 0.10 after every ten epochs. The model's parameters were therefore updated with a larger learning rate in the early stage of training, while a lower learning rate in the later stage helped the optimization process converge. BN was designed to automatically standardize the inputs to a layer of a deep neural network for each mini-batch. BN stabilizes and accelerates the learning process and can improve the performance of a CNN model; the accuracy of the three models with and without BN was compared on the dataset. The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during a forward pass through a deep neural network. If the weights of each layer are initialized with arbitrary random numbers, the loss gradient can become too large or too small during training, making it difficult for the model's accuracy to keep improving as training proceeds. Kaiming initialization was therefore chosen as the initialization method; compared with random weight initialization, it ensured that the loss of the model continued to decrease during training. Finally, the cross-entropy function was configured as the loss function for the models.
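The LR schedule, Kaiming initialization, and loss configuration described above can be sketched in PyTorch; the toy model here is an illustrative stand-in for Resnet-18:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))  # toy stand-in

def kaiming_init(m):
    """Kaiming (He) initialization for linear layers, instead of arbitrary random weights."""
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

model.apply(kaiming_init)
optimizer = optim.SGD(model.parameters(), lr=0.10, momentum=0.90)
# decay the LR by a multiplicative factor of 0.10 after every ten epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.10)
criterion = nn.CrossEntropyLoss()  # the configured loss function
```

In a real training loop, `scheduler.step()` would be called once per epoch after the optimizer updates, so the LR drops from 0.10 to 0.01 at epoch ten.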

Appendix B.3. The Optimization Algorithms Analysis
In the training process, the Adam and SGD (with momentum 0.90) algorithms were chosen as the optimizer to reduce the loss and accelerate the convergence speed of three CNN models. The accuracy of models with different optimizers on the test dataset was also compared.

Appendix B.4. The Data Augmentation of Images
During the training process, three data augmentation techniques were applied to the images to improve model robustness and avoid over-fitting: image color jitter, image flipping, and image standardization. The brightness, contrast, and saturation of the images in the dataset were randomly changed; images were then horizontally flipped with a probability of 0.50; finally, a random affine transformation was applied to the images (Figure A7).
In order to better understand the entire training process, the pseudocode was provided in Algorithm A1.

The Data Augmentation of Images
During the training process, three data augmentation techniques were applied to the images to improve model robustness and avoid over-fitting the model, which included image color jitter, image flipping, and image standardization. The brightness, contrast, and saturation of the images in the dataset were randomly changed. Then, the images randomly with 0.50 probability were horizontally flipped. Finally, a random affined transformation was applied to the images. ( Figure B7).
Green is the eye area, red is the cheek, yellow is the mouth, and purple is the nose. Note. The picture is for demonstration purposes only. The data are from the BP4D dataset [35,36].

Figure A3. The results of sensitivity and specificity of the seven external validation datasets.

Resampling data were obtained by examining only some of the samples from the category with a large number of images, in order to balance the number of images of the two categories.

Figure A6. The influence of different learning rates on CNN model training. Learning rates that are too small can cause the training process to be too slow, making it difficult to reach the minimum point of the loss function, whereas learning rates that are too large can result in the minimum point of the loss function being skipped, making the training process unstable.

Appendix C.

Appendix C.1. Convolutional Neural Network
A convolutional neural network (CNN) is an implementation of a neural network for machine learning that specializes in processing large-scale data, such as images, and is widely used in medical image applications. A typical CNN consists of input, feature extraction and classification, and output stages. Among them, feature extraction is the core component; it includes convolutional layers, pooling layers, and non-linear activation units. Each convolutional layer contains various filters called kernels. A filter is a matrix of weights applied to a subset of the input pixel values of the same size as the kernel. Each pixel is multiplied by the corresponding value in the kernel, and the results are summed into a single value that represents the grid unit (such as a pixel) in the output channel/feature map [44].
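The kernel operation described above can be sketched as a minimal 2-D convolution (stride 1, no padding), where each output cell is the sum of the element-wise products between the kernel and the image patch it covers:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; each output value is the sum of the
    element-wise products over one patch (one cell of the feature map)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))  # a simple summing 2x2 filter
fmap = conv2d(image, kernel)
print(fmap.shape)  # (3, 3): a 4x4 input with a 2x2 kernel gives a 3x3 map
```

In a trained CNN, the kernel weights are learned, and many kernels run in parallel to produce multiple feature maps per layer.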