Stochastic Decision Fusion of Convolutional Neural Networks for Tomato Ripeness Detection in Agricultural Sorting Systems

Advances in machine learning and artificial intelligence have led to many promising solutions for challenging issues in agriculture. One of the remaining challenges is to develop practical applications, such as an automatic sorting system for after-ripening crops such as tomatoes, according to ripeness stages in the post-harvesting process. This paper proposes a novel method for detecting tomato ripeness by utilizing multiple streams of convolutional neural network (ConvNet) and their stochastic decision fusion (SDF) methodology. We have named the overall pipeline as SDF-ConvNets. The SDF-ConvNets can correctly detect the tomato ripeness by following consecutive phases: (1) an initial tomato ripeness detection for multi-view images based on the deep learning model, and (2) stochastic decision fusion of those initial results to obtain the final classification result. To train and validate the proposed method, we built a large-scale image dataset collected from a total of 2712 tomato samples according to five continuous ripeness stages. Five-fold cross-validation was used for a reliable evaluation of the performance of the proposed method. The experimental results indicate that the average accuracy for detecting the five ripeness stages of tomato samples reached 96%. In addition, we found that the proposed decision fusion phase contributed to the improvement of the accuracy of the tomato ripeness detection.


Introduction
The quality of tomatoes depends on appearance (color, size, texture, etc.) and nutritional value (minerals, acidity, antioxidants, etc.). These properties are commonly related to ripeness [1]. As tomatoes ripen, glucose and fructose accumulate, and therefore, antioxidants (e.g., ascorbate, lycopene, β-carotene, rutin, and caffeic acid) increase [2,3], organic acids (e.g., malic acid and citric acid) decrease, and sweetness increases [4]. Furthermore, the surface color changes to red owing to a decrease in chlorophyll and an increase in lycopene, and the flesh firmness decreases owing to a decrease in pectic substances [4,5]. Therefore, determining the appropriate ripening stages of tomatoes for sale before packaging is very important. For example, let us suppose that tomatoes in different ripeness stages are packaged into the same bundle. The tomatoes have different respiration rates, thereby resulting in the acceleration of ripening due to ethylene production. This effect makes quality management a challenge. On the other hand, the commercial value of tomatoes can be maintained longer if they are sorted and packed into the proper ripening stages [6]. The classification of ripening stages is generally conducted by trained laborers. The manual sorting process follows standard guidelines, such as those prescribed by the United States Department of Agriculture (USDA) chart [7]. This manual-based approach has drawbacks, such as relying on the competence of laborers and adding additional costs for training sorters. The biggest challenge that the agricultural industry faces is a decrease in the number of laborers due to an aging population and the rise of labor costs. These problems show that time-consuming and labor-intensive work should be replaced with automated systems in future farm environments.
In recent years, artificial intelligence technologies have become popular in the food and agriculture research field. In several recent studies related to food and agricultural production, machine learning and deep learning applications have shown great success. In the food domain, deep learning has shown promising performance in a variety of tasks, such as food recognition [8], calorie estimation [9], sentiment analysis for cookery channels [10], and fruit quality detection. In this paper, we narrow this scope to the core task of developing smart technology for tomato farms: tomato ripeness detection consisting of the localization of the tomato sample and its ripening stage classification. Fast and accurate detection of ripe tomatoes is an important task in replacing manual laborers with automatic systems. Zhao et al. [11] developed a machine vision system to detect ripe tomato samples in a greenhouse scene by combining the AdaBoost classifier and a contour analysis method. Liu et al. [12] studied an algorithm combining a coarse-to-fine scanning method and a false-color removal method to detect mature tomatoes. To achieve accurate ripe tomato detection, Hu et al. [13] suggested a method that combines a deep learning algorithm and an edge-contour analysis method. Sun et al. [14] proposed an improved feature-pyramid-network-based tomato organ recognition method. These results demonstrate that machine learning, including deep learning, contributes to improving tomato detection and can be further used in commercial applications.
Previous related works on fruit defect or grade detection in computer vision can be categorized into two types: approaches based on hand-crafted features and those using deeply learned features. Most existing studies belong to the former class [11,12,[15][16][17].
Hand-crafted features have the advantages of locality and simplicity, but may lack the semantic and discriminative capacity of extracted features in changing environments, as appropriate features are generally selected based on experience. For example, as the number of ripening stages of a tomato sample increases, the difficulty in designing proper descriptors sufficient to classify these classes using raw images also increases. Furthermore, in the case of classification, it is very time-consuming to determine an optimum combination of the feature extractor and classifiers. In contrast, a deeply learned feature is extracted from the training dataset itself using an end-to-end learning model architecture, so it comes up with a reasonable descriptor for ripe tomato detection. Besides, the time-consuming procedure necessary to find the optimum combination of feature extractor and classifier is not required. For instance, a convolution network has abstracted feature maps that vary depending on the depth of the corresponding layers, so that any feature map can enable the representation of a data-driven descriptor [18]. Kamilaris et al. [19] found that deep learning models achieved higher accuracy compared with those using hand-crafted and shallow approaches. The fine-grained ripeness classification in practical scenarios is another challenge. Previous research has mainly focused on hand-crafted color features on the surface of tomatoes. Li et al. [20] proposed a dominant color histogram matching method to analyze the shape, ripeness level, size, and surface defects of tomatoes. Arakeri et al. [21] developed a tomato sorting software combining a preprocessor for noise filtering of raw RGB images and a color feature extractor to detect surface defects and the ripeness stage of tomatoes. In recent years, researchers have developed machine-learning-based approaches to classify the ripeness of tomatoes. For example, Goel and Sehgal [15] converted color features in RGB space into R-G features and conducted a sorting task using a fuzzybased classifier. El-Bendary et al. [16] determined the ripeness degree using color features in the HSI space, a PCA-based feature extractor, and supervised-learning-model-based classifiers. Furthermore, recent research has shown promising performance for the same task [17]. However, these approaches may lead to intra-class variations at the same stage, such as dynamic viewpoints, illumination conditions, and atypical shapes of surface color distribution. These attributes are similar to the problem of human action recognition in videos. The two-stream convolutional neural networks (ConvNets) outperformed in the task of action recognition [22]. This scheme can be adapted to our problem of tomato ripeness detection by observing a tomato from multiple viewpoints, rather than from a single viewpoint. However, the late fusion of ConvNet streams may lead to performance decay if the fusion strategy is inadequate or the proper parameter settings are omitted. To solve this issue, we propose a novel tomato ripeness detection pipeline based on multi-ConvNet streams with a stochastic decision fusion (SDF) that can not only precisely classify the ripening stage of a tomato sample but also localize the sample in real time.
This paper aims to develop an accurate tomato ripeness detection method based on deep learning, called stochastic decision fusion of convolutional neural networks (SDF-ConvNets). The proposed method is expected to be applied in the form of a sorting system that classifies fruits according to those ripeness degrees in the post-harvest stage. Since the ripeness detection process is conducted by observing images from various viewpoints, excellent synergy can be expected if the sorting module can be designed based on a conveyor structure capable of rotating fruit objects and transporting them. To train and evaluate the model, we constructed a large-scale image dataset collected by our customized image acquisition system. Experiments for evaluating our system have shown superior performance to other methods. In summary, the key contributions of our work are: (a) developing a deep-learning-based robust and accurate ripe tomato detector, (b) increasing the accuracy of ripening stage classification using the stochastic decision fusion method, and (c) collecting a large-scale image dataset that captures the five ripening stages of tomatoes from different viewpoints.

Tomato Image Acquisition
The tomato images used in this study were captured using a JAI 3CCD camera and a light source chamber, as shown in Figure 1. The CCD camera is a 3 × 1/3" CCD color progressive scan camera (up to 120 frames/s with full resolution). The camera was coupled with a C-mount lens module whose focal length is 35 mm and the min/max operation range of the iris is f2.0/f22.0.
(ConvNets) outperformed in the task of action recognition [22]. This scheme can be adapted to our problem of tomato ripeness detection by observing a tomato from multiple viewpoints, rather than from a single viewpoint. However, the late fusion of ConvNet streams may lead to performance decay if the fusion strategy is inadequate or the proper parameter settings are omitted. To solve this issue, we propose a novel tomato ripeness detection pipeline based on multi-ConvNet streams with a stochastic decision fusion (SDF) that can not only precisely classify the ripening stage of a tomato sample but also localize the sample in real time.
This paper aims to develop an accurate tomato ripeness detection method based on deep learning, called stochastic decision fusion of convolutional neural networks (SDF-ConvNets). The proposed method is expected to be applied in the form of a sorting system that classifies fruits according to those ripeness degrees in the post-harvest stage. Since the ripeness detection process is conducted by observing images from various viewpoints, excellent synergy can be expected if the sorting module can be designed based on a conveyor structure capable of rotating fruit objects and transporting them. To train and evaluate the model, we constructed a large-scale image dataset collected by our customized image acquisition system. Experiments for evaluating our system have shown superior performance to other methods. In summary, the key contributions of our work are: (a) developing a deep-learning-based robust and accurate ripe tomato detector, (b) increasing the accuracy of ripening stage classification using the stochastic decision fusion method, and (c) collecting a large-scale image dataset that captures the five ripening stages of tomatoes from different viewpoints.

Tomato Image Acquisition
The tomato images used in this study were captured using a JAI 3CCD camera and a light source chamber, as shown in Figure 1. The CCD camera is a 3 × 1/3″ CCD color progressive scan camera (up to 120 frames/s with full resolution). The camera was coupled with a C-mount lens module whose focal length is 35 mm and the min/max operation range of the iris is f2.0/f22.0. We designed the customized image acquisition system in conjunction with an annotation labeling software that can manually generate ground truth labels for training the proposed deep learning model at the same time as image acquisition. The ground truth labels consisted of the spatial information of tomatoes in the image space as well as the ripening stages. We selected tomatoes of the "Dafnis" variety for building a large-scale image dataset. These were classified into five stages according to the USDA color chart. We designed the customized image acquisition system in conjunction with an annotation labeling software that can manually generate ground truth labels for training the proposed deep learning model at the same time as image acquisition. The ground truth labels consisted of the spatial information of tomatoes in the image space as well as the ripening stages. We selected tomatoes of the "Dafnis" variety for building a large-scale image dataset. These were classified into five stages according to the USDA color chart. The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1. The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase ( Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase ( Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase ( Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase ( Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase ( Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase ( Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase (Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase (Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase (Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase (Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase (Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints. The dataset was collected from a minimum of 500 samples for each ripeness stage to contain as many atypical features as possible, as shown in Table 1.

Accurate Tomato Ripeness Detection Using the SDF-ConvNets
We concentrated on an approach for classifying the ripening stage of tomatoes, considering practical scenarios such as an automated sorting application in the post-harvesting process.
We built a sequential process consisting of an initial tomato ripeness detection phase and a stochastic decision fusion phase (Figure 2). A standard one-stage detector YOLOv3 [23] was used for the initial ripeness detection stage based on tomato images viewed at the stem-/flower-end, respectively. Then, the estimated results were transferred to the stochastic decision fusion phase to improve the final ripening stage classification result of the target tomato sample. This approach was inspired by how human laborers generally judge the ripening stage of tomatoes by observing them from multiple viewpoints.

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main

Initial Tomato Ripeness Detection Based on YOLOv3
The initial tomato ripeness detection task consisted of two main parts: localizing the spatial region of the target tomato sample and classifying those ripening stages from tomato images viewed from the stem-/flower-end. Even when the observed environment is constrained, classifying ripening stages is a difficult fine-grained problem, in which the variation between consecutive stages is low and the variation between tomatoes belonging to the same group is relatively high. Recently, deep-learning-based approaches have been used as a solution to this type of problem. The typical ConvNet architecture consists of two main parts: a set of convolutional layers that perform feature extraction, and classification layers. The frontal layers of the network mainly focus on obtaining deeper domain features of the input, and the extracted features are transferred to the classification layers to discriminate between classes, such as the ripening stages by using fully connected [24] or global average pooling layers [25]. The parameters of the convolutional and classification layers can be trained end-to-end. Several studies on object classification and detection based on ConvNets have already achieved great success in various computer vision areas [26][27][28][29].
We applied a one-stage object detector based on YOLOv3 to resolve the tomato ripeness detection problem. The YOLOv3 starts with an assumption that the entire input image can be divided into S × S grid cells, and B proposal regions are located on each cell. The detector generates an output tensor consisting of five elements, such as the spatial information and class score for each region. If there is a jth bounding box in the ith grid cell, then the spatial information of the target object is depicted as, b ij = b x , b y , b w , b h ∈ R 1×4 and the object class score for the bounding box is depicted as C. The spatial information is the coordinate offset and the size of the bounding box. The scalar variable C refers to whether the confidence score of the predicted box contains an object. Each box also predicts the multi-label score vector P = [p(c m )] m=1,...,M ∈ R 1×M as a result of independent logistic classifiers. The vector represents the conditional probability distribution of M classes, given that an object is contained in the predicted box. Therefore, the total size of the output tensor can be computed as S × S × B × (5 + M). A post-processing step, such as a non-maximum suppression algorithm, is required to obtain the final result based on the output tensor. In the training phase, a loss function based on binary cross-entropy was applied. The loss function L consisted of sub-loss functions L bbox , L con f , and L cls . First, L bbox was obtained by comparing the predicted bounding box b ij with the ground truth boxb, as shown in L con f represents the difference between the predicted confidence score C and the ground truthĈ among {0, 1}, as defined in Equation (2).
Equation (3) defines the loss L cls used in the general multi-label classification problem. y m is the ground truth label, so that if the class is correct it is 1 and if not, 0. If the correct one is the mth class, then y m = 1 and otherwise it is 0.
Finally, the sub-loss functions are defined in Equation (4).
The weights applied to losses (1) and (2) were set as λ bbox = 5 and λ nobj = 0.1, respectively [29]. By optimizing this loss function according to the mini-batch-based stochastic gradient descent algorithm, each ConvNet can be trained.
We tried to arrange the backbone network architecture by piling multiple residual modules with a simple shortcut connection [27]. The ConvNet performance is strongly related to the hyperparameters of the backbone architecture. It is widely known that deeper models decrease bias and increase variance [30]. Considering the bias-variance tradeoff, we mainly focused on the number of convolution layers and scales of the final feature maps. There were 23 shortcut-connection-based residual modules and three heads conducting initial ripeness detection using the global average pooling layer, and each head received a different scaled feature map as input. The first head was directly derived from the backbone and utilize the smallest-scale feature map. The second head branched by using the low-level feature map of the backbone and the convolution layer output of the first head as input. The third head also branched by using the lowest-level feature map of the backbone and the convolution layer output of the second head. The detailed backbone ConvNet architecture of the ripeness detection phase is presented in Figure 3.
We trained the model with the hyperparameter configuration of the YOLOv3 [23]. The optimizer of the training process was a mini-batch stochastic gradient descent with momentum.

•
Learning  Figure 4 represents examples of the initial ripeness detection result based on the deep learning model. In case 1, the predicted stages at both viewpoints are matched, while the result at the flower-end viewpoint in case 2 is different from that at the other viewpoint. Therefore, in case 2, it is difficult to determine which is the correct ripening stage of the target tomato sample. To overcome this limitation, we propose a stochastic decision fusion method.

Stochastic Decision Fusion
To accurately classify the final ripening stages of target tomato samples, we tried to apply two types of weighted-fusion-based approaches. The first was to assign equal weight to both ConvNet stream results. We assigned the equivalent scalar value as the weight for each stream, as shown in Equation (5): where P n ∈ R 1×M is a multi-label score vector representing discrete probability distributions for tomato ripening stages when viewing a tomato sample from the n-th viewpoint. Second, we hypothesized that weighting the superior one among the streams would increase the accuracy of the final decision. In this paper, a multi-label confusion matrix A n = a n1 , . . . , a nm , . . . , a nM ∈ R M×M was used to reflect the performance value of each ConvNet stream in the weight decision process. The column vector a nm ∈ R M×1 of the confusion matrix A n was normalized by the total number of samples belonging to the m-th class. This implies that each element of the a nm ratio of the number of samples classified as each class to the total number of samples belonging to the m-th class. The m-th element is regarded as the precision of the classification result. Precision is an appropriate performance metric for each ConvNet stream for classifying the ripening stages of tomatoes, as decreasing the number of false-positive samples is important for practical applications. Subsequently, the proposed weight decision process was conducted by combining the score vectors P 1 , P 2 and multi-label confusion matrices A 1 , A 2 . The details are described based on examples of score vectors and confusion matrices. First, let us suppose that score vectors obtained from both ConvNet streams for the k-th tomato sample are set   • Scales of final feature map: 8, 16, 32. Figure 4 represents examples of the initial ripeness detection result based on the deep learning model. In case 1, the predicted stages at both viewpoints are matched, while the result at the flower-end viewpoint in case 2 is different from that at the other viewpoint. Therefore, in case 2, it is difficult to determine which is the correct ripening stage of the target tomato sample. To overcome this limitation, we propose a stochastic decision fusion method.

Stochastic Decision Fusion
To accurately classify the final ripening stages of target tomato samples, we tried to apply two types of weighted-fusion-based approaches. The first was to assign equal weight to both ConvNet stream results. We assigned the equivalent scalar value as the weight for each stream, as shown in Equation (5): where ∈ ℝ × is a multi-label score vector representing discrete probability distributions for tomato ripening stages when viewing a tomato sample from the -th viewpoint. Second, we hypothesized that weighting the superior one among the streams would increase the accuracy of the final decision. In this paper, a multi-label confusion matrix = ⌊ , … , , … , ⌋ ∈ ℝ × was used to reflect the performance value of each ConvNet stream in the weight decision process. The column vector ∈ ℝ × of the confusion matrix was normalized by the total number of samples belonging to theth class. This implies that each element of the ratio of the number of samples classified as each class to the total number of samples belonging to the -th class. The -th element is regarded as the precision of the classification result. Precision is an appropriate performance metric for each ConvNet stream for classifying the ripening stages of tomatoes, as decreasing the number of false-positive samples is important for practical applications. Subsequently, the proposed weight decision process was conducted by combin- For the element with P 1 (k), the largest score value belongs to the first ripening stage "T", so the first column vector a 11 of A 1 is responsible for determining the weight of Con-vNet stream 1. In contrast, the element of P 2 (k) belongs to the second stage "P" showing the largest score, so the second column vector a 22 of A 2 is responsible for determining the weight of ConvNet stream 2. Therefore, new weight vectors α 1 , α 2 ∈ R 1×M were computed using Equation (6). α 1 = a 11 a 11 + a 21 , α 2 = a 22 a 12 + a 22 (6) The final ripening stage of the input tomato sample was computed as shown in Equation (7): where ⊗ is element-wise multiplication. These are given by the classification results of both ConvNet streams. Therefore, the final result depends on the configuration of the weight vectors, α 1 , α 2 ∈ R M×1 which are, respectively, responsible for the stem-end and flowerend viewpoints of the tomato. This approach improves the accuracy of the ripening stage classification result by biasing the superior stream. The proposed algorithm is summarized in Figure 5. The results of both decision fusion processes are transformed into an L2normalized vector like the Softmax function. Therefore, it is possible to determine the final ripening stage of the target tomato sample with the index of the maximum valued element of the output vector. In the next chapter, we describe several experiments we conducted to evaluate our proposed approach by comparing it with existing state-of-the-art approaches.

Results
The proposed SDF-ConvNets was verified with our tomato image d experiments in this section. The experiments were conducted on a comp with an Intel ® Core™ i7-4790K 4.00 GHz CPU, 32 GB of RAM, and an NV GTX Titan Xp GPU processor. We utilized the deep learning framework Three metrics were used to evaluate the experimental results: precision, score. In the multi-class classification problem, we calculated the precision score per class in a one-versus-rest manner.   Table 3 represents the experimental results obtained with our test dat age F1-scores of 94.2% and 93.04% were achieved by the single ConvN flower-/stem-end images, respectively. It seems that the single ConvNet ripeness detector without any decision fusion steps performed well for ou

Results
The proposed SDF-ConvNets was verified with our tomato image dataset through experiments in this section. The experiments were conducted on a computer equipped with an Intel ® Core™ i7-4790K 4.00 GHz CPU, 32 GB of RAM, and an NVIDIA GeForce GTX Titan Xp GPU processor. We utilized the deep learning framework Darknet [31]. Three metrics were used to evaluate the experimental results: precision, recall, and F1 score. In the multi-class classification problem, we calculated the precision, recall, and F1-score per class in a one-versus-rest manner.
where TP/FP/FN is the number of true-positive/false-positive/false-negative samples of class c. Then, the per-class F1-score can be computed by Equation (9). Table 3 represents the experimental results obtained with our test dataset. The average F1-scores of 94.2% and 93.04% were achieved by the single ConvNet stream with flower-/stem-end images, respectively. It seems that the single ConvNet-based tomato ripeness detector without any decision fusion steps performed well for our dataset.

Experiments for Stochastic Decision Fusion
In this paper, we used two decision fusion approaches based on stochastic metrics for the final decision on the tomato ripeness stage of the target sample. As a result of comparing the ripeness detection performance according to the proposed decision fusion method with the results in Table 3, it can be seen that the decision fusion strategies contributed to improving the ripeness detection accuracy, as shown in Table 4. In addition, the proposed stochastic decision fusion technique was superior to the simple method of assigning equal weights.

Comparison of State-of-the-Art Algorithms
We also used the precision-recall (PR) curve to compare the SDF-ConvNets to other recent models, such as SVM [32] and YOLOv5 [33]. Three PR curve graphs are plotted in Figure 6. The first and second curves are the ripeness detection results using YOLOv3 and v5, respectively, and the last graph shows the SDF-ConvNets-based detection result. We can verify that the difference in performance between YOLOv3 and YOLOv5 was not noticeable, whereas the detection performance of the SDF-ConvNets was improved through the area under each PR curve.
These experimental results demonstrate that the proposed SDF-ConvNets outperformed other methods. Table 5 compares the recent achievement of related works for the fruit ripeness detection with the performance of the SDF-ConvNets to prove the result. We can see that our approach is superior given the number of classes that need to be classified or the number of images for testing.
We also used the precision-recall (PR) curve to compare the SDF-ConvNets to other recent models, such as SVM [32] and YOLOv5 [33]. Three PR curve graphs are plotted in Figure 6. The first and second curves are the ripeness detection results using YOLOv3 and v5, respectively, and the last graph shows the SDF-ConvNets-based detection result. We can verify that the difference in performance between YOLOv3 and YOLOv5 was not noticeable, whereas the detection performance of the SDF-ConvNets was improved through the area under each PR curve.
(a) PR curve for YOLOv3; (b) PR curve for YOLOv5; (c) PR curve for the SDF-ConvNets These experimental results demonstrate that the proposed SDF-ConvNets outperformed other methods. Table 5 compares the recent achievement of related works for the

Conclusions
In this paper, we proposed an accurate tomato ripeness detection methodology called SDF-ConvNets. The overall ripeness detection pipeline consisted of two major steps: the initial tomato ripeness detection phase based on ConvNet streams, and the stochastic decision fusion phase to obtain a more precise ripeness classification result. Even if the initial ripeness classification fails for the stem-end or flower-end tomato image, the proposed decision fusion phase can compensate for the misclassified stage into the correct stage. To train, test, and verify the proposed method, a large-scale image dataset was collected and labeled. The scale of the tomato image dataset is larger than any dataset used in recent related works. The dataset consisted of 2166 tomato samples for training ConvNets and 546 tomato samples used to evaluate the SDF-ConvNets. The experimental results were obtained by averaging 5-fold cross-validation and evaluated in terms of the three statistical metrics (precision, recall, and F1-score) of the tomato ripeness detection task. The SDF-ConvNets successfully achieved accurate and fine-grained tomato ripeness detection compared with other deep-learning-based approaches. The F1-score of the tomato ripeness detection using the SDF-ConvNets was 96.5%. The proposed method was compared with the recent achievement of related ripeness detection tasks and its superiority was demonstrated.
In future work, a follow-up study will be conducted to develop an integrated framework that can determine the appropriate harvest time and monitor the crop growth status by recognizing and estimating the ripening stage in real-time through the observation of tomatoes before harvest.