Multi-Information Flow CNN and Attribute-Aided Reranking for Person Reidentification

This paper presents a multi-information flow convolutional neural network (MiF-CNN) model for person reidentification (re-id). It contains several multilayer convolutional structures in which the input and output of each convolutional layer are concatenated along the channel dimension. With this design, the model can be made deeper and the feature maps of each layer can be reused by every subsequent layer. Inspired by image captioning, a person attribute recognition network is proposed based on a long short-term memory network and an attention mechanism. By fusing the identification results of MiF-CNN with the attribute recognition results, this paper introduces an attribute-aided reranking algorithm to further improve the accuracy of person re-id. Experiments on the VIPeR, CUHK01, and Market1501 datasets verify that the proposed MiF-CNN can be trained sufficiently on small-scale datasets and achieves high re-id accuracy. Comparative experiments also confirm the effectiveness of the attribute-aided reranking algorithm.


Introduction
Person reidentification (re-id) refers to matching and recognizing the identities of pedestrians captured by multiple cameras with nonoverlapping views, which is important for improving the efficiency of security systems. Owing to the low resolution of cameras, it is hard to obtain discriminative face features, so current person re-id methods are mainly based on visual features of pedestrians, such as color and texture [1]. In practice, changes in viewpoint, pose, and illumination among different camera views, as well as partial occlusion and background clutter, pose a great challenge to person re-id [2].
The two principal families of person re-id methods are feature representation and metric learning. Feature representation seeks features with stronger discrimination and better robustness to represent pedestrians. Many kinds of features have been used for this purpose, among which appearance features are the simplest and most popular. Color, texture, and shape features can be extracted from human appearance [3], such as HSV color histograms, LBP texture, and Gabor features, and are then used to reidentify people according to the similarity among pedestrian features. Attribute features are also widely used in person re-id. Common attributes include gender, length of hair, and clothing. These attributes are highly intuitive and understandable descriptors which have proved successful in several tasks, such as face recognition and activity recognition [4]. Although attribute features are complicated to extract and express, they contain rich semantic information and are more robust to illumination and viewpoint changes. Therefore, combining attribute features with low-level features can effectively improve the accuracy of person re-id [5]. Metric learning methods employ machine learning algorithms to learn a good similarity metric, which makes the feature similarity of the same pedestrian greater than that of different pedestrians.
In recent years, deep learning has shown great success in a variety of tasks in image classification and the frequency domain [6], where CNNs are particularly outstanding. Compared with traditional methods, a CNN has stronger feature learning ability, and the learned features are more intrinsically representative of the original data, so it performs better in extracting image features. Two types of CNN models are commonly employed in the community. The first type is the classification model as used in image classification, and the second is the Siamese model using image pairs or triplets as input [7]. Most of the existing public person re-id datasets contain only thousands of pedestrian image samples; such a small number of training samples can easily lead to overfitting, which limits the performance of the person re-id model. In addition, the deep neural networks for person re-id are similar in structure; that is, the feature maps extracted by a convolutional layer are fed directly into the next convolutional layer [8][9][10][11][12]. Such a structure usually ignores the correlation among the features of each layer, thus reducing the flow of feature information to some extent. During back propagation, as the number of layers in the neural network grows, the gradient update information may attenuate exponentially and cause the vanishing gradient problem.
This work develops a modified deep neural network model for person re-id that reduces the overfitting caused by the lack of training samples. Moreover, this work aims to improve the identification accuracy of the person re-id network with the assistance of pedestrian attribute recognition. To this end, the contribution of this paper is threefold. First, this paper designs a multi-information flow convolutional neural network (MiF-CNN) to solve the person re-id problem. The network contains a series of multi-information flow convolutional structures which concatenate the input and output of each convolutional layer, realizing the reuse of features and enhancing the feature information flow and gradient back propagation of the entire network. Second, this paper designs a person attribute recognition network (PARN) based on a long short-term memory (LSTM) network and an attention mechanism. The PARN decodes the pedestrian visual features extracted by MiF-CNN into attribute features and outputs the attribute words of each person. Third, this paper presents an attribute-aided reranking algorithm which rematches attribute features among samples to help more positive samples rank higher in the rank list, further improving the identification accuracy.
The rest of this paper is organized as follows. Section 2 reviews the state of the art for person re-id. Section 3 introduces the details of MiF-CNN. Section 4 describes the principle of the PARN. The proposed attribute-aided reranking algorithm is detailed in Section 5. The experimental results and analysis are given in Section 6. Finally, the conclusion and future work are discussed in Section 7.

Related Works
The early person re-id methods extracted manually designed features to represent pedestrians. Farenzena et al. divided pedestrian images into multiple areas and extracted three complementary kinds of features: weighted color histograms, maximally stable color regions, and recurrent high-structured patches. These features were then matched to measure the similarity between a pair of pedestrians [13]. Yang et al. proposed a novel salient color names based color descriptor (SCNCD), which guarantees that a higher probability is assigned to the color name nearer to the color. Based on SCNCD, color distributions in different color spaces were fused into the feature representation of a person [14]. Bazzani et al. proposed the asymmetry-based HPE descriptor, which accumulated the HSV histograms of multiple pedestrian images as a global appearance feature and detected patches portraying highly informative recurrent ingredients in local regions as local features [15]. Wu et al. designed a novel gradient self-similarity (GSS) feature based on HOG to capture the patterns of pairwise similarities of local gradient patches. The combination of HOG and GSS improved person re-id accuracy [16].
Apart from manually designed low-level features, attribute features that represent mid-level semantic information apply to person re-id as well. Compared with low-level descriptors, attributes are more robust to image translations [7]. Layne et al. labeled 15 binary attributes for the VIPeR dataset and trained SVMs to detect the attributes. They also learned a weighted L2-norm distance metric over the attributes and fused them with low-level visual features [17]. Wang et al. predicted a complete attribute vector by exploiting both visual features and marked attributes and obtained the overall ranking list by fusing the rank results from visual features and attribute vectors separately [18]. Chen et al. learned person attributes with a part-specific CNN and merged them with another identification CNN embedded in a triplet structure for the person re-id task [19]. Wang et al. proposed a deep neural network containing an auto-encoder model to learn hidden person attributes from visual features in an unsupervised manner, which alleviated the requirement for massive annotation [20].
Deep learning has become popular for solving person re-id problems in recent years. Ahmed et al. presented a deep CNN architecture for person re-id. The architecture computed differences in feature values across the two views around a neighborhood of each feature location to add robustness to positional differences in corresponding features of the two input images [12]. Cheng et al. proposed a novel multichannel CNN. After the first layer of the CNN, features were divided into four equal parts, each aimed at learning features for the respective body part. The proposed CNN was trained with an improved triplet loss function [21]. Lin et al. used ResNet [22] as the base network to learn low-level features and attributes jointly and trained the network by combining the person re-id loss and the attribute prediction loss [23]. Yan et al. proposed an attention block which learned part-level attention on different local regions and integrated the proposed block into existing CNN structures for training with the identification loss [24]. Inspired by the above works, this paper proposes a multi-information flow convolutional neural network to extract discriminative pedestrian features. In addition, this paper designs a person attribute recognition network based on LSTM and an attention mechanism to improve person re-id results with the assistance of attribute recognition.

Multi-Information Flow Convolutional Neural Network
The proposed MiF-CNN treats person re-id as a classification problem. The overall network structure is shown in Figure 1. The structure includes 2 shallow convolutional layers, 3 multi-information flow convolutional structures with a novel connection pattern, fully connected layers, max pooling layers, and classification output layers. Low-level features of pedestrian images are first extracted by the 2 shallow convolutional layers. The deeper multi-information flow convolutional structures then extract higher level features. The final discriminative pedestrian feature vectors are obtained after the pooling layers reduce the dimensions and the fully connected layers integrate the features.

Features Extraction.
In MiF-CNN, all convolutional filters are 3 × 3 with stride 1. Batch normalization and the ReLU activation function are applied after each convolutional layer. The operation of a convolutional layer can be formulated as
$$x^{(l)}_j = \sigma\Big(\sum_{i} x^{(l-1)}_i \otimes w^{(l)}_j\Big) \tag{1}$$
where $x^{(l)}_j$ is the j-th feature map of the l-th convolutional layer, $w^{(l)}_j$ is the filter on the j-th feature map in the l-th convolutional layer, and $\otimes$ represents the convolution operation. In other words, each neuron on the j-th feature map of the l-th convolutional layer convolves the concatenated input feature maps with the filter $w^{(l)}_j$, sums the results, and maps the extracted features onto the j-th feature map of the l-th convolutional layer. $\sigma(\cdot)$ is the ReLU activation function, $\sigma(x) = \max(0, x)$. Because of batch normalization, the bias is omitted.

Multi-Information Flow Convolutional Structure.
In this structure, both the output and the input of the current convolutional layer are concatenated and fed into the next convolutional layer; i.e., the input of each layer is the concatenation of the outputs of all previous layers. The details of the multi-information flow convolutional structure are shown in Figure 2.
This connection pattern allows the feature maps of each layer to be reused by all subsequent layers during forward propagation, so the whole CNN model learns more feature information from pedestrian images. It can be regarded as a special kind of "data augmentation" on feature maps that enhances the information flow of the network. During back propagation, the gradient at the input of each layer contains the derivative of the loss function with respect to that input, which makes gradient propagation more effective and the network easier to train.
In a multi-information flow convolutional structure, each layer produces a constant number $\rho$ of new feature maps, so the l-th layer receives $\rho_0 + \rho(l-1)$ feature maps as input, where $\rho_0$ is the number of feature maps in the initial layer. Supposing the feature map of the i-th channel in the initial layer is $x^{(0)}_i$, where $i \in \{1, \ldots, \rho_0\}$, the feature map of the j-th channel output by the initial layer can be expressed as
$$z^{(1)}_j = \sum_{i=1}^{\rho_0} x^{(0)}_i \otimes w^{(1)}_j \tag{2}$$
where $w^{(1)}_j$ is the weight of the initial layer. The output after the activation function is
$$a^{(1)}_j = \sigma\big(z^{(1)}_j\big) \tag{3}$$
where $j \in \{1, \ldots, \rho\}$. The feature map output by the l-th layer can be expressed as
$$z^{(l)}_q = \sum_{p} x^{(l-1)}_p \otimes w^{(l)}_q, \qquad a^{(l)}_q = \sigma\big(z^{(l)}_q\big) \tag{4}$$
where $p \in \{1, \ldots, \rho_0 + \rho(l-1)\}$ and $q \in \{1, \ldots, \rho\}$. The output of the l-th layer after the concatenation operation is
$$x^{(l)}_r = \big[x^{(l-1)}_p,\ a^{(l)}_q\big] \tag{5}$$
where $r \in \{1, \ldots, \rho_0 + \rho \cdot l\}$ and $[\cdot, \cdot]$ represents concatenation along the channel dimension.
In back propagation, let $\Delta x^{(l)}_r$ be the derivative of the loss function with respect to $x^{(l)}_r$. Because $x^{(l)}_r$ contains $x^{(l-1)}_p$ and $a^{(l)}_q$, it splits into two parts of gradient:
$$\Delta x^{(l)}_r = \big[\Delta x^{(l-1)}_p,\ \Delta a^{(l)}_q\big] \tag{6}$$
where $\Delta a^{(l)}_q$ is the gradient of the output of the l-th layer after the activation function and $\Delta x^{(l-1)}_p$ is the gradient of the output of the (l − 1)-th layer. The gradient of the weights in the l-th layer is
$$\Delta z^{(l)}_q = \Delta a^{(l)}_q \odot \sigma'\big(z^{(l)}_q\big), \qquad \Delta w^{(l)}_q = \sum_{p} x^{(l-1)}_p \otimes \Delta z^{(l)}_q \tag{7}$$
where $\Delta z^{(l)}_q$ is the gradient of the convolution result in the l-th layer, $\sigma'(z^{(l)}_q)$ is the derivative of the activation function with respect to $z^{(l)}_q$, and $\Delta w^{(l)}_q$ is the gradient of the weights in the l-th layer.
The network uses $\Delta w^{(l)}_q$ to update the weights of each layer, which is formulated as
$$w^{(l)}_q \leftarrow w^{(l)}_q - \eta\, \Delta w^{(l)}_q \tag{8}$$

Computational Intelligence and Neuroscience
where η is the learning rate. The gradient then continues to propagate back to the (l − 1)-th layer. The gradient of $x^{(l-1)}_p$ is
$$\Delta x^{(l-1)}_p \leftarrow \Delta x^{(l-1)}_p + \sum_{q=1}^{\rho} \Delta z^{(l)}_q \otimes \operatorname{rot180}\big(w^{(l)}_q\big) \tag{9}$$
where the first term on the right-hand side is the part received directly through the concatenation, the second term is the part propagated back through the convolution, and $\operatorname{rot180}(\cdot)$ denotes rotating the filter by 180°. As shown in equations (6) and (9), the loss function produces two flows of gradient with respect to the output of each convolutional layer, which makes error information propagate more effectively through the network and restrains the vanishing gradient to a certain extent. The proposed MiF-CNN includes three multi-information flow convolutional structures. Between every two multi-information flow convolutional structures, a middle pooling layer is applied to compress the redundant features. The hyperparameter ρ is set to a small value and is doubled in each new multi-information flow convolutional structure. Such a design makes each convolutional layer learn a small number of features and reduces redundant features, which improves the efficiency of the network. With deeper layers, the network can learn more high-level and complex pedestrian features and improve the final identification accuracy.
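The channel bookkeeping of one multi-information flow convolutional structure can be illustrated with a minimal Python sketch. The function names and the values ρ0 = 16 and ρ = 4 are assumptions chosen for illustration and are not taken from the paper; the sketch only tracks how many feature maps each layer receives and emits after concatenation.

```python
def dense_channel_counts(rho0, rho, num_layers):
    """Return (input_channels, channels_after_concat) for layers l = 1..num_layers."""
    counts = []
    for l in range(1, num_layers + 1):
        in_ch = rho0 + rho * (l - 1)  # feature maps fed into layer l
        out_ch = in_ch + rho          # after concatenating rho new maps: rho0 + rho * l
        counts.append((in_ch, out_ch))
    return counts


def concat_channels(x_prev, a_new):
    """Channel-wise concatenation [x^(l-1), a^(l)], with feature maps held in plain lists."""
    return list(x_prev) + list(a_new)


# Illustrative values only (not from the paper).
counts = dense_channel_counts(rho0=16, rho=4, num_layers=3)
stacked = concat_channels(["x"] * 16, ["a"] * 4)
```

With these illustrative values the input width grows by ρ per layer, which is why a small ρ keeps each layer cheap even when the structure is deep.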

Loss Function.
Current deep learning algorithms usually use the cross-entropy loss as the cost function, which is formulated as
$$L_s = -\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{k} \delta\big(y^{(i)} = j\big)\,\log\frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}} \tag{10}$$
where $y^{(i)}$ is the ground truth pedestrian category in the training set, θ is the parameter of the last fully connected layer, $x^{(i)}$ is the feature vector of a training sample, k is the number of pedestrian categories in the training set, M is the batch size, and $\delta(\cdot)$ is 1 when its condition holds and 0 otherwise. In practice, however, when only the cross-entropy loss is used and the quality of the extracted features is not good enough, the intraclass distance can become greater than the interclass distance. To address this problem, Wen et al. proposed the center loss in 2016 [25]. Combining the cross-entropy loss with the center loss can enhance the discrimination and generalization ability of the network. The center loss is defined as follows:
$$L_c = \frac{1}{2}\sum_{i=1}^{M} \big\lVert x_i - c_{y_i} \big\rVert_2^2 \tag{11}$$
where $c_j$ is the feature center of the j-th pedestrian identity and $x_i$ is the feature vector of a pedestrian. The center loss minimizes the distance between each feature and its center in order to reduce the intraclass distance. Center $c_j$ is updated with equation (12):
$$c_j \leftarrow c_j - \beta\,\frac{\sum_{i=1}^{M}\delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{M}\delta(y_i = j)} \tag{12}$$
where β is the update rate and $\delta(y_i = j)$ is 1 if the label $y_i$ of the i-th sample equals j and 0 otherwise. That is to say, center $c_j$ is updated only by the samples that belong to the j-th identity.
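The center loss and its center update can be sketched in plain Python as follows. This is a minimal illustration that assumes the formulation of Wen et al. [25], in which δ(y_i = j) selects the samples labeled j; the function names and the toy batch are hypothetical, and a real implementation would operate on tensors inside the training loop.

```python
def center_loss(features, labels, centers):
    """L_c = 1/2 * sum_i ||x_i - c_{y_i}||^2 over one batch."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


def update_centers(features, labels, centers, beta):
    """One update c_j <- c_j - beta * delta_c_j, using only the samples labeled j."""
    new_centers = {}
    for j, c in centers.items():
        members = [x for x, y in zip(features, labels) if y == j]
        # delta_c_j = sum_i (c_j - x_i) / (1 + count); it is zero when class j is absent.
        delta = [sum(ck - x[k] for x in members) / (1 + len(members))
                 for k, ck in enumerate(c)]
        new_centers[j] = [ck - beta * dk for ck, dk in zip(c, delta)]
    return new_centers


# Toy batch (illustrative values): two samples of identity 0, none of identity 1.
features = [[1.0, 0.0], [3.0, 0.0]]
labels = [0, 0]
centers = {0: [0.0, 0.0], 1: [5.0, 5.0]}
loss = center_loss(features, labels, centers)
updated = update_centers(features, labels, centers, beta=0.5)
```

In this toy batch the center of identity 0 moves toward its two samples, while the center of identity 1 stays where it was because no sample of that identity appears in the batch.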

Person Attributes Recognition Network
The rank list of MiF-CNN is shown in Figure 3. In the incorrect rank 1 identification results (pedestrian B and pedestrian C), there is a big difference in attribute features between the top-ranking negative samples and the query image, including gender, clothing, and whether a handbag is carried. Hence, this paper recognizes person attributes to improve the accuracy of person re-id.
Based on the encoder-decoder idea, Xu et al. [26] proposed a neural network model that can learn and generate descriptions of image content. The model used a CNN as the encoder to extract image features, which were then fed into a recurrent neural network (RNN) for decoding into language captions of the images. Inspired by that, this paper presents a person attribute recognition network (PARN) with LSTM and an attention mechanism. The proposed PARN takes the pedestrian features extracted by MiF-CNN as input and outputs the attribute information of pedestrian images.
The architecture of PARN is shown in Figure 4.

Input of PARN.
In PARN, the input is the feature maps that precede the last fully connected layer in the MiF-CNN. The input feature is split into n feature vectors, each of which corresponds to a part of the image. Each feature vector is a $D_X$-dimensional vector, which is represented as
$$X = \{x_1, x_2, \ldots, x_n\}, \quad x_i \in \mathbb{R}^{D_X} \tag{13}$$
Following natural language processing practice, PARN transforms the words in person attribute labels into word embeddings. As a part of the input of PARN, each word embedding is a $D_Y$-dimensional vector:
$$Y = \{y_1, y_2, \ldots, y_m\}, \quad y_i \in \mathbb{R}^{D_Y} \tag{14}$$
where m is the number of attribute words in each pedestrian image and $y_i$ is the word embedding corresponding to each attribute word. For attribute words, the common one-hot encoding treats each word as an isolated symbol, which ignores the correlation among words. Word embedding, in contrast, represents each word as a continuous dense vector, which places correlated words closer together in the embedding space.

Attention Mechanism.
Attention mechanism has been widely used in natural language processing and computer vision. By measuring the correlation between the output and different parts of the input, attention mechanism gives different weights to different parts of the input, enabling the network to use more important feature information for prediction and reduce the dimension of input data [27].
In practice, a given pedestrian attribute corresponds only to a certain part of the image. For example, when recognizing whether a pedestrian is wearing a hat, people usually pay attention only to the area around the head of the pedestrian instead of other areas irrelevant to the attribute. Therefore, before the image features are fed into the LSTM network, an attention mechanism is introduced to calculate the correlation between different positions of the image features and the hidden state of the LSTM at the previous time step.
The schematic diagram of the attention mechanism is shown in Figure 5.
At time t, a fully connected layer and the tanh function are used to integrate the information of the input feature vector and the hidden state of the LSTM at the previous time step, which is formulated as
$$v_i = \tanh\big(W_{attX}\, x_i + W_{atth}\, h_{t-1}\big) \tag{15}$$
where $W_{attX}$ and $W_{atth}$ represent the fully connected layer weights of $x_i$ and $h_{t-1}$, respectively. The attention weight $\alpha_i$ is obtained by applying the softmax function to the score vector $v_i$:
$$\alpha_i = \frac{\exp\big(W_{attv}\, v_i\big)}{\sum_{k=1}^{n} \exp\big(W_{attv}\, v_k\big)} \tag{16}$$
where $W_{attv}$ represents the fully connected layer weights of $v_i$. The output Z is the sum of all $x_i$ weighted by $\alpha_i$:
$$Z = \sum_{i=1}^{n} \alpha_i\, x_i \tag{17}$$
Feature Z highlights the local features that are helpful for prediction and suppresses the local features that contribute little to it. This reduces the dimension of the features to some extent and makes the LSTM network focus on the parts of the input features that are most correlated with the prediction when recognizing person attributes, which improves the efficiency and prediction accuracy of the network.
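The attention step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the weights are passed in as small nested lists rather than learned layers, and the scoring weights are taken to map each score vector to a single scalar before the softmax.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention(X, h_prev, W_attX, W_atth, w_attv):
    """Return attention weights alpha over the n parts and the attended feature Z."""
    h_term = matvec(W_atth, h_prev)
    scores = []
    for x in X:
        x_term = matvec(W_attX, x)
        v = [math.tanh(a + b) for a, b in zip(x_term, h_term)]  # score vector v_i
        scores.append(sum(w * vi for w, vi in zip(w_attv, v)))  # scalar score for part i
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]                  # numerically stable softmax
    total = sum(exps)
    alpha = [e / total for e in exps]
    Z = [sum(a * x[d] for a, x in zip(alpha, X)) for d in range(len(X[0]))]
    return alpha, Z


# Toy example (illustrative values): two image parts with 2-dimensional features.
X = [[1.0, 0.0], [0.0, 1.0]]
alpha, Z = attention(X, h_prev=[0.0], W_attX=[[1.0, 0.0]], W_atth=[[0.0]], w_attv=[1.0])
```

In the toy example the first part receives the larger weight because its feature produces the larger score, and Z is pulled toward that part accordingly.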

LSTM.
A normal RNN updates network parameters with back propagation, which easily suffers from the vanishing gradient when there is a long gap between relevant information and the current position to predict. The LSTM network solves this problem well because of its structural advantage. The architecture of the LSTM unit is shown in Figure 6.
Here, c is the cell state for storing and transferring information, f is the forget gate which decides what should be abandoned in the cell state, i is the input gate which decides what should be stored in the cell state, g represents a vector of new candidate values that should be added to the cell state, o is the output gate which decides what parts of the cell state should be output to the next time, h is the hidden state of LSTM, x is the input of LSTM, and t denotes the current time.
In PARN, at any time t, the input of the LSTM consists of two parts: the word embedding $y_{t-1}$ from the previous time step and the image feature $Z_t$ at the current time step. The operation of the LSTM in PARN is formulated as
$$\begin{aligned}
f_t &= \operatorname{sigmoid}\big(W_f\,[h_{t-1}, y_{t-1}, Z_t] + b_f\big) \\
i_t &= \operatorname{sigmoid}\big(W_i\,[h_{t-1}, y_{t-1}, Z_t] + b_i\big) \\
g_t &= \tanh\big(W_g\,[h_{t-1}, y_{t-1}, Z_t] + b_g\big) \\
o_t &= \operatorname{sigmoid}\big(W_o\,[h_{t-1}, y_{t-1}, Z_t] + b_o\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{18}$$
where W and b represent the weights and biases of each gate, respectively. The sigmoid function outputs a number between 0 and 1 that decides how much of each value passes through a gate. The tanh function is the activation function of the input. This design lets the LSTM weight information from different time steps differently, so that it can choose which parts of the information to remember or forget in a long sequence. The number of time steps of the LSTM in PARN is set to the number of attributes m of each pedestrian image. At each time step, the output hidden state $h_t$ is passed through a fully connected layer and a softmax function to predict the pedestrian attribute word.
It is worth noting that, at each time step, the word embedding fed into the LSTM is the one obtained at the previous time step, which exploits the strength of the LSTM on long sequences. Because pedestrian attributes are correlated (for example, a pedestrian with the attribute "male" generally also has the attributes "short hair" and "pants"), the LSTM can use the previous attribute result to predict the current attribute more accurately.
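One LSTM time step with the two-part input described above can be sketched as follows. This is a minimal sketch assuming the standard LSTM gate equations, with the previous hidden state, the previous word embedding, and the attended image feature concatenated into one input vector; the function names, the dictionary layout of the weights, and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(y_prev, Z_t, h_prev, c_prev, W, b):
    """One LSTM step; W[name] is a list of rows and b[name] a list of biases
    for each gate name in 'f', 'i', 'g', 'o'."""
    inp = list(h_prev) + list(y_prev) + list(Z_t)  # concatenated input [h, y, Z]

    def affine(name):
        return [sum(w * v for w, v in zip(row, inp)) + bias
                for row, bias in zip(W[name], b[name])]

    f = [sigmoid(v) for v in affine("f")]    # forget gate
    i = [sigmoid(v) for v in affine("i")]    # input gate
    g = [math.tanh(v) for v in affine("g")]  # candidate values
    o = [sigmoid(v) for v in affine("o")]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy example with hidden size 1 (illustrative values only).
zero_W = {name: [[0.0, 0.0, 0.0]] for name in "figo"}
zero_b = {name: [0.0] for name in "figo"}
h1, c1 = lstm_step([1.0], [1.0], [0.0], [0.0], zero_W, zero_b)

open_b = {"f": [-100.0], "i": [100.0], "g": [100.0], "o": [100.0]}
h2, c2 = lstm_step([1.0], [1.0], [0.0], [0.0], zero_W, open_b)
```

The second call saturates the gates through the biases: the forget gate is closed and the input and output gates are open, so the cell state takes the candidate value.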

Network Optimization.
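The loss function of PARN includes two parts. The first is the cross-entropy between the prediction of the network and the ground truth, which is expressed as
$$L_u = -\sum_{t=1}^{m}\sum_{k=1}^{n_w} u'_{t,k}\,\log u_{t,k} \tag{19}$$
where u is the prediction value of the network, u′ is the ground truth of the pedestrian labels, m is the number of LSTM time steps, and $n_w$ is the total number of attribute words in each dataset. As shown in equation (19), the loss of a single image in a single epoch is the loss accumulated over the m time steps. The second is the loss function of the attention mechanism as described in [40]:
$$L_\alpha = \sum_{i=1}^{n}\Big(1 - \sum_{j=1}^{m} \alpha^{j}_{i}\Big)^{2} \tag{20}$$
where $\alpha^{j}_{i}$ represents the attention weight of the image feature $x_i$ at time step j and n is the number of parts in the image feature. The objective function that PARN optimizes is
$$L = \frac{1}{M}\sum_{s=1}^{M}\big(L_u^{(s)} + \lambda\, L_\alpha^{(s)}\big) \tag{21}$$
where M is the batch size, the superscript (s) indexes the samples in the batch, and λ is the weight of $L_\alpha$ in the objective function.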
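The two loss terms can be sketched in plain Python as follows. The sketch assumes that the attention term takes the form of the doubly stochastic attention penalty used in the image-captioning model that inspired PARN, which encourages every image part to receive a total attention of about 1 across the time steps; the function names and the toy values are hypothetical.

```python
import math


def attribute_cross_entropy(pred, target_ids):
    """Cross-entropy accumulated over the m time steps.
    pred[t] is a probability distribution over the attribute vocabulary and
    target_ids[t] is the index of the ground truth word at step t."""
    return -sum(math.log(pred[t][k]) for t, k in enumerate(target_ids))


def attention_regularizer(alpha):
    """sum_i (1 - sum_t alpha[t][i])^2, where alpha[t][i] is the weight of part i at step t."""
    n = len(alpha[0])
    return sum((1.0 - sum(step[i] for step in alpha)) ** 2 for i in range(n))


def parn_objective(batch, lam):
    """Mean over the batch of cross-entropy + lam * attention penalty.
    Each batch item is a tuple (pred, target_ids, alpha)."""
    total = sum(attribute_cross_entropy(p, y) + lam * attention_regularizer(a)
                for p, y, a in batch)
    return total / len(batch)


# Toy example (illustrative values): 2 time steps, 2 attribute words, 2 image parts.
pred = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
alpha = [[0.5, 0.5], [0.5, 0.5]]
objective = parn_objective([(pred, targets, alpha)], lam=1.0)
```

Here the attention penalty vanishes because each part accumulates a total weight of exactly 1 over the two steps, so the objective equals the cross-entropy term alone.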

Attribute-Aided Reranking Algorithm
Given a probe image $q_0$ and a gallery image set $G = \{g_1, g_2, \ldots, g_N\}$ including N pedestrian images, the initial distance between $q_0$ and $g_i$ is computed with the Euclidean distance as
$$d(q_0, g_i) = \big\lVert x_{q_0} - x_{g_i} \big\rVert_2 \tag{22}$$
where $x_{q_0}$ and $x_{g_i}$ represent the features of $q_0$ and $g_i$, respectively, extracted by MiF-CNN. The initial rank list $R_0 = \{g^{0\prime}_1, g^{0\prime}_2, \ldots, g^{0\prime}_N\}$ is obtained by sorting the gallery images in ascending order of this distance. If the top-1 gallery image $g^{0\prime}_1$ is the positive sample, rank 1 is correct. If the positive sample is not the top-1 gallery image but is among the top 5, rank 5 is correct. The attribute-aided reranking algorithm reranks the rank list $R_0$ according to the attribute feature similarity between $q_0$ and $g_i$ so that more positive gallery images obtain a higher ranking in $R_0$, which improves the performance of person re-id. Specifically, for the case in which rank 5 is correct but rank 1 is not, the proposed algorithm assigns an initial feature score $s_f$ to the top-5 gallery images $g^{0\prime}_1 \sim g^{0\prime}_5$ according to their ranking in $R_0$: the higher the ranking, the higher $s_f$. Then, the proposed algorithm assigns an attribute score $s_a$ to $g^{0\prime}_1 \sim g^{0\prime}_5$ according to the number of their attributes that are the same as those of $q_0$: the more attributes in common, the higher $s_a$. The total score of each gallery image is $s = s_f(1 - c) + s_a c$, where c is the score weight. The reranked list $R^{*}_0$ is obtained by sorting $g^{0\prime}_1 \sim g^{0\prime}_5$ in descending order of their total score s. For M query images, the whole process of the attribute-aided reranking algorithm is shown in Algorithm 1.
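The reranking of the top-5 gallery images for one query can be sketched as follows. The text does not give the exact score values, so this sketch makes two labeled assumptions: the feature score decreases linearly with the initial rank, and the attribute score is the fraction of attributes shared with the query. The function name, the default c = 0.5, and the toy attribute lists are hypothetical.

```python
def attribute_aided_rerank(ranked_ids, query_attrs, gallery_attrs, c=0.5, top_k=5):
    """Rerank the top_k entries of an initial rank list by s = s_f * (1 - c) + s_a * c."""
    top = ranked_ids[:top_k]
    k = len(top)
    scored = []
    for pos, gid in enumerate(top):
        s_f = (k - pos) / k  # assumption: linear in the initial rank, top-1 gets 1.0
        shared = sum(1 for a, b in zip(query_attrs, gallery_attrs[gid]) if a == b)
        s_a = shared / len(query_attrs)  # assumption: fraction of matching attributes
        scored.append((s_f * (1 - c) + s_a * c, -pos, gid))
    scored.sort(reverse=True)  # ties fall back to the initial order through -pos
    return [gid for _, _, gid in scored] + ranked_ids[top_k:]


# Toy example (illustrative attribute words only).
query = ("male", "short hair", "backpack")
gallery = {
    "g1": ("male", "short hair", "backpack"),
    "g2": ("male", "long hair", "no backpack"),
    "g3": ("female", "long hair", "no backpack"),
    "g4": ("female", "short hair", "backpack"),
    "g5": ("female", "long hair", "no backpack"),
    "g6": ("male", "short hair", "backpack"),
}
initial = ["g3", "g1", "g2", "g4", "g5", "g6"]
reranked = attribute_aided_rerank(initial, query, gallery)
```

In the toy example the gallery image g1, which shares all three attributes with the query, moves from second place to first, while entries beyond the top 5 keep their initial positions.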

Experiments
This paper first evaluates the attribute recognition accuracy of PARN on public person re-id datasets and compares the results with other person attribute recognition methods. Then, this paper evaluates the performance of MiF-CNN on three person re-id datasets and the effectiveness of the proposed attribute-aided reranking algorithm for improving the accuracy of person re-id. An analysis of the experimental results is also given after comparing the results with state-of-the-art person re-id methods.

Datasets and Evaluation Protocol.
This paper evaluates the proposed methods on three challenging public person re-id datasets: VIPeR [28], CUHK01 [29], and Market1501 [30].
Evaluation Protocol. For the single-shot datasets VIPeR and CUHK01, the cumulative match characteristic (CMC) is used to record the ranks of correct identifications [31]. For the multishot dataset Market1501, apart from CMC, mean Average Precision (mAP) is used to evaluate the performance of the proposed methods.
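The two evaluation measures can be sketched in plain Python as follows. This is a minimal sketch that assumes each query comes with a gallery rank list sorted by ascending distance and a set of correct gallery identifiers; the function names and the toy rank lists are hypothetical, and dataset-specific rules such as excluding same-camera matches are not modeled.

```python
def cmc(rank_lists, positives, max_rank):
    """Fraction of queries whose first correct match appears within rank 1..max_rank."""
    hits = [0] * max_rank
    for ranked, pos in zip(rank_lists, positives):
        first = next((r for r, gid in enumerate(ranked) if gid in pos), None)
        if first is not None and first < max_rank:
            for r in range(first, max_rank):
                hits[r] += 1
    return [h / len(rank_lists) for h in hits]


def mean_average_precision(rank_lists, positives):
    """Mean over queries of the average precision of each rank list."""
    aps = []
    for ranked, pos in zip(rank_lists, positives):
        found, precisions = 0, []
        for r, gid in enumerate(ranked, start=1):
            if gid in pos:
                found += 1
                precisions.append(found / r)  # precision at each correct match
        aps.append(sum(precisions) / len(pos) if pos else 0.0)
    return sum(aps) / len(aps)


# Toy example: two queries over a gallery of three images.
rank_lists = [["a", "b", "c"], ["b", "a", "c"]]
positives = [{"a"}, {"a", "c"}]
cmc_curve = cmc(rank_lists, positives, max_rank=3)
m_ap = mean_average_precision(rank_lists, positives)
```

CMC only looks at the first correct match per query, which suits single-shot datasets, whereas mAP also rewards retrieving the remaining correct matches early, which is why it is reported for the multishot setting.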

Experimental Results on Attribute Recognition.
The attributes in PARN refer to pedestrian identity-level attributes. For the VIPeR dataset, attributes include "gender," "length of hair," "lower clothing," "upper clothing," "backpack or not," and "carrying anything or not." For the CUHK01 dataset, attributes include "gender," "length of hair," "backpack or not," "handbag or not," "color of upper," and "color of lower." For the Market1501 dataset, attributes include "gender," "length of hair," "wearing hat or not," "lower clothing," "backpack or not," "handbag or not," "length of sleeve," and "length of lower clothing." Each person in each dataset owns an attribute word list as shown in Figure 7.
This paper evaluates the attribute recognition accuracy of PARN on the Market1501 dataset and compares the results with the person attribute recognition method APR [23]. The experimental results are reported in Table 1, where "L.slv" represents the "length of sleeve" and "L.low" represents the "length of lower clothing." It can be observed from Table 1 that PARN obtains high recognition accuracy on the 8 attributes: its accuracy on "length of hair" and "handbag or not" and its mean accuracy are higher than those of APR, whereas its accuracy on the other attributes is close to that of APR. The comparison with APR demonstrates the strong attribute recognition performance of PARN, which plays an important role in improving the accuracy of person re-id.

Experimental Results on Person re-id.
This paper evaluates the performance of the proposed methods on three datasets. The experimental results of the proposed methods and other state-of-the-art methods are presented in Table 2.
As shown in Table 2, the proposed MiF-CNN model performs well among state-of-the-art methods. In addition, the comparison between MiF and MiF + PARN demonstrates that the proposed attribute-aided reranking algorithm helps to increase person re-id accuracy. The improvement is especially obvious on the VIPeR and CUHK01 datasets, where the identification accuracy at rank 1, rank 5, and rank 10 improves by 6.02%, 6.33%, and 2.22% and by 4.94%, 4.94%, and 2.26%, respectively. The proposed MiF-CNN model with the attribute-aided reranking algorithm achieves the best rank 5 accuracy on the VIPeR dataset, the best rank 1 and rank 5 accuracy on the CUHK01 dataset, and the best mAP on the Market1501 dataset among the compared methods.

Analysis of Experimental Results.
Because of the small number of training samples in the VIPeR and CUHK01 datasets, a deep CNN model is hard to train on them and easily suffers from overfitting. To handle this, the FT-JSTL + DGD method, which is based on a deep CNN, learned deep features from multiple domains jointly by merging all the datasets together and fine-tuned the pretrained model on VIPeR and CUHK01 separately. In the M3TCP method, the structure and hyperparameters of the deep CNN model need to be adjusted manually to adapt to the scale of different datasets so as to overcome the training problem of the deep neural network.
In contrast, the proposed MiF-CNN has a fixed structure and can be trained on smaller datasets directly without any fine-tuning. In spite of its deep structure, the model converges quickly and well. MiF-CNN without the attribute-aided reranking algorithm outperforms FT-JSTL + DGD and M3TCP by 0.27% and 13.17%, respectively, in rank 1 accuracy on the CUHK01 dataset. It also outperforms FT-JSTL + DGD by 2.22% in rank 1 accuracy on the VIPeR dataset, which indicates that the proposed MiF-CNN model reduces overfitting, generalizes well, and performs strongly in person re-id. Figure 8 shows the rank lists of MiF and MiF + PARN. It can be seen that the attribute-aided reranking method moves lower ranked positive samples in the rank list of MiF-CNN to a higher ranking and thereby improves the accuracy of person re-id, which verifies the effectiveness of the attribute-aided reranking algorithm for improving the performance of the person re-id model.

Conclusion
This paper studies the training problem of deep neural networks in the person re-id task and the use of pedestrian attributes to further improve the accuracy of person re-id. The proposed MiF-CNN model realizes the reuse of feature maps and gradient information, which enhances the feature flow of the network and improves the efficiency of gradient propagation. The designed person attribute recognition network uses an attention mechanism to measure the correlation between the input feature maps and the hidden state of the LSTM at the previous time step in order to reduce the dimension of the features. It also employs the LSTM to decode image features into pedestrian attributes. On the basis of the attribute recognition results, the attribute-aided reranking algorithm is presented, which rematches attribute features among samples to help more positive samples rank higher in the rank list and so further improves the identification accuracy. The experimental results on three public person re-id datasets indicate the strong performance of the MiF-CNN model in person re-id. The attribute-aided reranking algorithm makes a major contribution to improving the accuracy of person re-id. In the future, the pedestrian attribute recognition network will be further improved and optimized. Moreover, captions of person images or videos could be obtained by the LSTM model using ideas from natural language processing, which would make person re-id methods more meaningful.

Data Availability
The data used to support the findings of this study (attribute recognition accuracy on the Market1501 dataset and person re-id accuracy on the VIPeR, CUHK01, and Market1501 datasets) are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.