3D Pose Estimation for Object Detection in Remote Sensing Images

3D pose estimation remains an active and challenging task for object detection in remote sensing images. In this paper, we present a new algorithm for predicting an object's 3D pose in remote sensing images, called Anchor Points Prediction (APP). Compared to previous methods, such as RoI Transform, the final output of our method also carries direction information. We use a neural network to predict multiple feature points of the object and obtain the homography transformation between object coordinates and image coordinates. The resulting 3D pose accurately describes the three-dimensional position and attitude of the object. We also define a new measure, IoU_APP, for evaluating the direction and posture of the object. We tested our algorithm on the HRSC2016 and DOTA datasets and obtained accuracy rates of 0.863 and 0.701, respectively. The experimental results show that the APP algorithm significantly improves accuracy. In addition, the algorithm performs one-stage prediction, which makes the calculation simpler and more efficient.


Introduction
In recent years, with the deepening of research and the improvement of computing power, deep learning has become more and more widely used in various fields, and object detection algorithms have made great progress. In particular, object detection in remote sensing images has been a specific but active topic in computer vision [1,2]. Recent progress in object detection in aerial images has benefited greatly from the R-CNN frameworks [1,3-6]. These methods use horizontal bounding boxes as regions of interest (RoIs) and then rely on region-based features for category identification [2,7,8]. Faster-RCNN [4,5] offers an elegant and effective solution in which proposal computation is nearly cost-free given the detection network's computation. Cascade R-CNN, a multi-stage object detection framework, was proposed for the design of high-quality object detectors [6,9]. Additionally, FPN uses feature pyramids for object detection [10]; Yolt detects objects in high-resolution remote sensing images based on Yolo v3 [11,12]; and Yolo v3 is significantly faster than other methods at the same accuracy [13]. These classic algorithms suit different scenarios and have greatly promoted the development of the field. However, objects in remote sensing images are often oriented obliquely, so an inclined box fits the scene better. Because the algorithms above detect the object with a horizontal rectangular box, they do not accurately reflect the pose of the object in the remote sensing image. In addition, horizontal RoIs typically lead to misalignments between the bounding boxes and the objects [8,14,15]. The RoI Transform algorithm locates the inclined box by predicting the rotation angle of the object box [8,16]. However, this algorithm has two problems. The first is that the regressed rotation angle θ of the inclined box is ambiguous in most cases.
This means that boxes with θ equal to 0° and 180° yield the same IoU, so an algorithm that carries no direction information treats them as the same result. The second problem is efficiency. RoI Transform is a two-stage algorithm whose localization relies entirely on Faster-RCNN [5,8]. It can only use the rectangle obtained by Faster-RCNN and cannot use the feature information of the object region. CornerNet is a one-stage approach that detects objects by predicting the coordinates of the top-left and bottom-right points and does away with anchor boxes, which makes it more accurate and efficient [17-20]. However, two predicted points do not fully describe an inclined box [5,21,22].
A 3D pose describes the three-dimensional pose of the camera relative to the object's own coordinate system, not the pose of the object relative to the ground plane. Whether the reference object has z-coordinates and whether 3D information can be estimated are two separate questions. The fact that the object lies on a plane does not constrain the rotation and translation of the observing camera about the three axes in three-dimensional space. Therefore, even an object on a two-dimensional plane is observed from a viewpoint outside that plane, and the 3D pose problem still arises.
To solve the above problems, we propose the Anchor Points Prediction (APP) algorithm. Unlike other methods, we predict the position and attitude of the object from at least four corner points using a fully convolutional network, and obtain the 3D pose by decomposing the homography matrix, which also makes the algorithm more efficient. The corner pooling layer used in the algorithm greatly improves the accuracy of point prediction [17].
We give the correspondence between the predicted points and the available object information in Table 1, and a comparison of the traditional method and our method is shown in Figure 1. We have reason to believe that object detection by point prediction will become a new trend in the future.
Table 1. Correspondence between predicted points and available object information [17,23].
Figure 1. Object detection comparison. (a,c) Traditional inclined box. The inclined box is symmetric, so it cannot uniquely describe the direction of the object in the 2D image space, and the object has four possible directions. (b,d) The 3D pose diagram obtained from APP. The X-axis of the object is marked in red, the Y-axis in green, and the Z-axis in blue. The X-axis points to the right side of the object; the negative direction of the Y-axis indicates the orientation of the object; and the Z-axis points to the ground.

Object Detection Based on APP
Any object detection problem can be reduced to the prediction of key points. Traditional rectangle-based detection methods, such as Faster-RCNN, YOLO, and SSD, reduce to the prediction of two key points (the upper-left and lower-right corners of the rectangle) [5,21,24,25]; the 3D pose of a general object reduces to the prediction of eight points; and human pose estimation with OpenPose reduces to the prediction of 18 key points of the human body [26]. We reduce the prediction of an inclined object box to the prediction of four points. Traditional inclined box detection methods carry no direction information, which can make the reported accuracy artificially high. In addition, as shown in Figure 2, when two objects have close center points but opposite directions, one of them may be lost in the NMS operation. NMS stands for non-maximum suppression [27]. It searches for local maxima and suppresses non-maximum candidates in order to eliminate redundant boxes and find the best location for each detected object. Unlike traditional methods, we define a new measure that takes into account both the overlap rate and the direction consistency between the tilted boxes. Assuming there are two sets of object feature points, $\{P_{11}, \ldots, P_{1n}\}$ and $\{P_{21}, \ldots, P_{2n}\}$, we define the IoU_APP between the two sets of points as follows:
We use $d_{12} = \sum_{i=1}^{n} \lVert P_{1i} - P_{2i} \rVert^2$, $d_1 = \sum_{i=1}^{n} \lVert P_{1i} - \bar{P}_1 \rVert^2$, and $d_2 = \sum_{i=1}^{n} \lVert P_{2i} - \bar{P}_2 \rVert^2$, where $\bar{P}_1$ and $\bar{P}_2$ are the coordinates of the center points of point set 1 and point set 2, respectively. The definition mainly considers the offset between corresponding points relative to the size of the object itself. This deviation is relative: the larger the deviation, the smaller the IoU_APP. The range of IoU_APP is the same as that of the original IoU, namely [0, 1].
The larger the IoU_APP, the closer the two point sets are: IoU_APP equals 1 when the two point sets are completely identical and approaches 0 when they are very different. The two sets of object feature points may be two prediction units to be merged during the NMS process, or a ground truth and a predicted value whose similarity is being computed.
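Equation (1) itself is not reproduced in this text, so the following Python sketch only illustrates the three quantities $d_{12}$, $d_1$, and $d_2$ defined above. The function `similarity_score` uses an assumed form, 1 / (1 + d12 / (d1 + d2)), chosen only because it has the stated properties (value 1 for identical point sets, approaching 0 for very different ones). It is a hypothetical stand-in and should not be read as the definition of IoU_APP.

```python
from math import fsum


def point_set_terms(p1, p2):
    """Return (d12, d1, d2) for two point sets listed in the same order."""
    n = len(p1)
    if n == 0 or n != len(p2):
        raise ValueError("point sets must be non-empty and of equal length")
    # Center points of each set.
    c1 = (fsum(x for x, _ in p1) / n, fsum(y for _, y in p1) / n)
    c2 = (fsum(x for x, _ in p2) / n, fsum(y for _, y in p2) / n)
    # Squared offsets between corresponding points.
    d12 = fsum((x1 - x2) ** 2 + (y1 - y2) ** 2
               for (x1, y1), (x2, y2) in zip(p1, p2))
    # Squared spread of each set around its own center (object size).
    d1 = fsum((x - c1[0]) ** 2 + (y - c1[1]) ** 2 for x, y in p1)
    d2 = fsum((x - c2[0]) ** 2 + (y - c2[1]) ** 2 for x, y in p2)
    return d12, d1, d2


def similarity_score(p1, p2):
    """ASSUMED stand-in for Equation (1): 1 / (1 + d12 / (d1 + d2))."""
    d12, d1, d2 = point_set_terms(p1, p2)
    size = d1 + d2
    if size == 0:
        return 1.0 if d12 == 0 else 0.0
    return 1.0 / (1.0 + d12 / size)


# A square box and the same box with its corner order rotated by 180 degrees.
box = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
flipped = box[2:] + box[:2]

same = similarity_score(box, box)
opposite = similarity_score(box, flipped)
```

The two boxes cover exactly the same area, so an area-based IoU cannot tell them apart, while any score built on ordered corresponding points drops for the flipped box. That is the behavior the text attributes to IoU_APP.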
According to Figure 3, we can obtain the calculation formulas for IoU_APP, IoU, and IoU_RBox (Equation (2)), and their relationship curves are shown in Figure 4.
It can be seen from Figure 4 that when the object position and scale are constant, only the IoU_APP is significantly affected by the object direction angle θ, so only the IoU_APP can describe the accuracy of the object direction angle. The IoU and IoU_RBox are not affected by θ and cannot describe the accuracy of the object direction angle.
The mAP (mean Average Precision) reported in the RoI Transform experiments is based on IoU_RBox [8]. The mAP, which is used to evaluate the accuracy of object detection methods, is computed from the IoU between prediction boxes and ground truth boxes [28]. As can be seen from Figure 4, as long as the horizontal box proposed by the neural network is aligned with the center point, the IoU_RBox is always greater than 0.5. That is, even if the direction angle prediction is wrong (it may be predicted as any angle from 0° to 360°), the prediction still counts as a hit when computing mAP, so the resulting mAP is artificially high. We therefore propose IoU_APP. Because the object is detected by regressing point coordinates, evaluating mAP with IoU_APP is more reasonable.

Neural Network Design
To predict the inclined box of the remote sensing object, we built a fully convolutional network that makes predictions at three scales in three different layers; at each scale, the output is the APP array of three different anchors. Different anchors are used to detect objects with different aspect ratios in the image domain, as shown in Figure 5. Our custom region layer outputs the relative coordinates, the categories, and the information about whether an object exists. In the region layer, we used yolov3's definition of anchors to implement n anchor predictions based on the width and height of the object on the imaging surface. Each anchor represents an object with a specific 2D width and height, and each such width and height corresponds to a different distribution of APP ranges. Each cell of the output array contains n × (4 + 1 + c + 2 × 4) output neurons. The meaning of the parameters is shown in Table 2.
Table 2. Meaning of the parameters.
Parameter    Description
4            bounding box coordinates
n            number of anchors
1            existing object
c            number of classes
We define the offset coordinate of point i in the range of a specific region as $(p_w \times \Delta x_i,\ p_h \times \Delta y_i)$. As shown in Figure 6, $p_w$ and $p_h$ are the width and height of a particular anchor. We can use Equation (3) to calculate $P_i$.
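As a small illustration of the output layout and offset scaling described above, the Python sketch below counts the output neurons per cell and scales raw offsets by the anchor size. Reading the 2 × 4 term as the (x, y) coordinates of four anchor points, using n = 3 anchors per scale, and using the 15 categories of DOTA are assumptions made for the example; the function names are hypothetical, and Equation (3) is not reproduced here.

```python
def outputs_per_cell(num_anchors, num_classes, num_points=4):
    """Neurons per output cell: n * (4 + 1 + c + 2 * 4) for four points."""
    box_coords = 4   # horizontal bounding box coordinates
    objectness = 1   # whether an object exists
    # Assumption: the 2 * 4 term holds (x, y) for each of the four points.
    return num_anchors * (box_coords + objectness + num_classes + 2 * num_points)


def scale_offsets(raw_offsets, anchor_w, anchor_h):
    """Scale raw (dx, dy) offsets by the anchor size: (p_w * dx, p_h * dy)."""
    return [(anchor_w * dx, anchor_h * dy) for dx, dy in raw_offsets]


# Example: 3 anchors per scale and 15 object classes (assumed values).
channels = outputs_per_cell(num_anchors=3, num_classes=15)

# Example: one raw offset scaled by a 40 x 20 pixel anchor.
scaled = scale_offsets([(0.5, -0.25)], anchor_w=40.0, anchor_h=20.0)
```

With these assumed values, each cell carries 84 outputs, and the example offset becomes (20.0, -5.0) pixels.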
Then, the actual pixel offset coordinate of point i relative to the anchor box is $P_i = (u_i, v_i)$. In addition, the loss function used in the training procedure is defined in Equation (4), and the meaning of each parameter is given in Table 3.
We focus on learning the location of the inclined box determined by APP, so we reduce the weight of the horizontal box. We use $\lambda_{ROI} = 0.01$ to assist learning, and it can even be set to 0 to ignore the horizontal box entirely. $Loss_{ROI}$ is the regression error of the center point and width of the object RoI, following yolo v2 [12].
Whether an object exists in a local range is decided by computing the maximum IoU between the default anchor boxes and all the ground truth boxes in that range. If the IoU exceeds a threshold, an object is considered to be present. This criterion is consistent with yolo v2 and yolo v3 [12,13]. Therefore, the sum of the differences between the three-layer APP prediction results and the ground truth is:

Training Procedure
Training datasets. We experimented with the DOTA dataset. The original DOTA images are high-resolution remote sensing images, which are not convenient to process directly with the neural network. Therefore, the raw data first needs to be standardized. We randomly select an image point and apply a random affine transformation that maps this point to the center point (W/2, H/2) of the transformed image of size (W, H). The scale of the transformation is 0.5 to 1.5 times the original image. This yields the sample images.
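The sampling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the transformation is taken to be a similarity (uniform scale plus rotation), the rotation angle is drawn uniformly from the full circle, and the function names are hypothetical. The text only specifies the random center point and the 0.5 to 1.5 scale range.

```python
import math
import random


def random_crop_affine(img_w, img_h, out_w, out_h, rng, scale_range=(0.5, 1.5)):
    """Build a 2x3 matrix mapping a random image point to the output center."""
    px, py = rng.uniform(0, img_w), rng.uniform(0, img_h)  # random image point
    s = rng.uniform(*scale_range)                          # scale in [0.5, 1.5]
    phi = rng.uniform(0.0, 2.0 * math.pi)                  # assumed random rotation
    a, b = s * math.cos(phi), s * math.sin(phi)
    # Choose the translation so that (px, py) lands on (out_w / 2, out_h / 2).
    tx = out_w / 2.0 - (a * px - b * py)
    ty = out_h / 2.0 - (b * px + a * py)
    return [[a, -b, tx], [b, a, ty]], (px, py), s


def apply_affine(matrix, point):
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


rng = random.Random(0)
matrix, picked, scale = random_crop_affine(4000, 3000, 416, 416, rng)
center = apply_affine(matrix, picked)
```

The same matrix can also be applied to the labeled corner points so that the annotations follow the image. A 2 × 3 matrix in this layout is what image warping routines such as OpenCV's `warpAffine` accept.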
Training and testing. For training, we used 80% of the DOTA images; all processed images were resized to 416 × 416 and fed to the neural network. After training, the remaining 20% of the images were used for testing. During training, the choice of multiple anchors followed the strategy of yolo v3, and backpropagation in the loss layer was divided into two phases.
Phase 1: Scan each output of the output layer array. For each output region, compute the maximum IoU between the ground truth set and the box determined by the predicted APP coordinates. If IoU_max is less than ε, the expected output value for object presence is set to 0 for backpropagation correction.
Phase 2: For each GTBox, locate its row and column in the output array and select, among the n anchors at this position, the anchor whose default rectangle has the largest IoU with the GTBox. Then, set the expected value of the object field to 1, calculate the loss of APP according to Equation (19), and set the expected value of the softmax segment to perform backpropagation correction.
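The two phases can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the anchor and GTBox rectangles are compared by width and height only (as if they shared a center), the anchor sizes are example values, and the function names are hypothetical.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two rectangles that share the same center, given (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union if union > 0 else 0.0


def presence_target(iou_max, eps):
    """Phase 1: the expected presence output is 0 when the best IoU is below eps.

    Returns None when no correction is applied (a convention of this sketch).
    """
    return 0.0 if iou_max < eps else None


def best_anchor(gt_wh, anchors):
    """Phase 2: index and IoU of the anchor that best matches a GTBox."""
    ious = [wh_iou(gt_wh, anchor) for anchor in anchors]
    index = max(range(len(anchors)), key=ious.__getitem__)
    return index, ious[index]


# Example anchors (width, height) in pixels; values are illustrative only.
anchors = [(10, 13), (30, 61), (116, 90)]
index, iou = best_anchor((32, 60), anchors)
```

For the example GTBox of 32 × 60 pixels, the second anchor is selected. Its presence target would then be set to 1 and the APP loss evaluated for that anchor only.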
Application. From the quadrilateral bounding box formed by the four output APP coordinates, the four points of the inclined box are obtained by Equation (19), and the point coordinates are converted to the large image according to Equation (5).

Calculation of the 3D Pose of the Object
Conventional methods often calculate the 3D pose of the object by matching local features extracted from the 2D image with features of the 3D model of the object to be detected, but these methods are not accurate enough [29]. Therefore, based on the key point coordinates of the object output by the region layer, we use the perspective transformation method to calculate the 3D pose. Figure 7 shows the computation of the object's 3D pose. We use two methods, PnP [30] and homography, which are described below.

PnP Method
As shown in Figure 7, we obtain the coordinates $\{P_1, P_2, P_3, P_4\}$ of the four feature points of the object in the training part. These feature point coordinates are relative to the complete satellite image. If the image is cropped and resized, the coordinates need to be transformed back to the coordinate system of the original satellite image. The inference part obtains the 3D pose of the object from the correspondence between the four feature points and the body coordinates of the four or eight corner points of the object. Assume that the length and width of the bounding box of the object are W and H, and that the height of the object above the ground is $H_g$. We define the eight points of the bounding box in the object's own coordinate system as:
Since the distance from the camera to the ground object is much larger than the object's own height $H_g$, two feature points at the same latitude and longitude but at different heights of the object have very close image coordinates. Therefore, the correspondence between the eight points of the object's bounding box and the key points $\{P_1, P_2, P_3, P_4\}$ in image coordinates is:
According to this correspondence, we call the solvePnP function in OpenCV to solve the camera extrinsic parameters R and T, where R is the attitude matrix of the camera relative to the object, and T is the position of the object relative to the camera.

Homography Method
We define the object as an inclined box with a width W and a length H. The origin of the object's coordinate system is defined at the center of the box. Considering the particularity of remote sensing images, the four vertices of the object lie approximately on a ground plane π, and the four vertices of the object in the object coordinate system are marked as:
According to the DOTA data format, the four points are sorted clockwise from the upper left corner. Assuming that the aspect ratio of the object is also unknown and set to α, the four object points can be written as:
According to the principle of satellite remote sensing imaging, the line of sight of the imaging camera is perpendicular to the ground plane π, so the conversion of the four points on the object plane to the image plane follows the rotation transformation matrix:
Conversely, the conversion of the four points on the image plane to the four points on the object plane follows the inverse of Equation (8):
According to the basic principle of affine transformation, the inverse has the same structure (the last row [0 0 1] is implicit):
$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ -a_{12} & a_{11} & a_{23} \end{bmatrix}^{-1} = \frac{1}{a_{11}^2 + a_{12}^2} \begin{bmatrix} a_{11} & -a_{12} & a_{12}a_{23} - a_{13}a_{11} \\ a_{12} & a_{11} & -(a_{11}a_{23} + a_{13}a_{12}) \end{bmatrix}$$
In order to solve the four parameters $a_{11}$, $a_{12}$, $a_{13}$, $a_{23}$ of the affine transformation, we obtain the formulas for the four parameters and α according to Equation (7):
The four point coordinates in Equation (9) are substituted in turn for X and Y in the above equation. Then, we can get
After solving Equation (14) by the least squares method, the result is substituted into Equation (16) to get $a_{11}$, $a_{12}$, $a_{13}$, and $a_{23}$. According to the principle of perspective transformation, we can get
Columns 1 and 2 of $K^{-1}H$ are then normalized to unit length to obtain the matrix:
The attitude matrix is $R = [c_1, c_2, c_1 \times c_2]$, where columns 1 and 2 of $K^{-1}H_1$ form the columns $c_1$ and $c_2$ of R. The third column of $K^{-1}H_1$ is the offset T of the object relative to the camera. Based on the above, we summarize the calculation process of the 3D pose for remote sensing image objects.
(1) Predict the APP coordinates of each object through the neural network.
(2) According to Equation (15), we can get the inverse affine transformation parameters $a_{11}$, $a_{12}$, $a_{13}$, $a_{23}$ and the object width-to-length ratio α.
(3) According to Equation (12), we can get the affine transformation matrix from the object coordinate system to the image coordinate system:
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ -a_{12} & a_{11} & a_{23} \end{bmatrix} \quad (18)$$
(4) Matrix $K^{-1}H_1$ is obtained from Equation (17).
(5) Attitude matrix R and displacement T can be obtained by decomposing matrix $K^{-1}H_1$.
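The steps above can be sketched with NumPy as follows. Several equations are not reproduced in this text, so this is a reconstruction under stated assumptions rather than the authors' implementation: the object has unit width, its corners are ordered clockwise from the upper left with the signs given in `SIGN_X` and `SIGN_Y`, α is taken as the ratio of the extent along Y to the extent along X, and T is scaled by the mean norm of the first two columns of $K^{-1}H$. All function names and the camera matrix are hypothetical.

```python
import numpy as np

# Assumed corner layout: unit-width box, clockwise from the upper left.
SIGN_X = (-0.5, 0.5, 0.5, -0.5)
SIGN_Y = (-0.5, -0.5, 0.5, 0.5)


def fit_inverse_similarity(img_pts):
    """Least-squares fit of (a11, a12, a13, a23, alpha), image -> object plane."""
    rows, rhs = [], []
    for (x, y), ex, ey in zip(img_pts, SIGN_X, SIGN_Y):
        rows.append([x, y, 1.0, 0.0, 0.0])   # X =  a11*x + a12*y + a13 = ex
        rhs.append(ex)
        rows.append([y, -x, 0.0, 1.0, -ey])  # Y = -a12*x + a11*y + a23 = ey*alpha
        rhs.append(0.0)
    sol, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return sol


def invert_similarity(a11, a12, a13, a23):
    """Parameters of the inverse transformation, which has the same structure."""
    det = a11 * a11 + a12 * a12
    return np.array([a11, -a12,
                     a12 * a23 - a13 * a11,
                     -(a11 * a23 + a13 * a12)]) / det


def pose_from_similarity(params, K):
    """Decompose K^-1 H into attitude R = [c1, c2, c1 x c2] and offset T."""
    b11, b12, b13, b23 = params
    H = np.array([[b11, b12, b13], [-b12, b11, b23], [0.0, 0.0, 1.0]])
    M = np.linalg.inv(K) @ H
    n1, n2 = np.linalg.norm(M[:, 0]), np.linalg.norm(M[:, 1])
    c1, c2 = M[:, 0] / n1, M[:, 1] / n2
    R = np.column_stack([c1, c2, np.cross(c1, c2)])
    T = M[:, 2] / (0.5 * (n1 + n2))  # assumed scale for the translation
    return R, T


# Synthetic check: a box rotated by 30 degrees, 40 pixels per unit width.
phi, pixels_per_unit, alpha_true = np.deg2rad(30.0), 40.0, 3.0
forward_true = np.array([pixels_per_unit * np.cos(phi),
                         -pixels_per_unit * np.sin(phi), 300.0, 200.0])
corners = np.array([[sx, sy * alpha_true] for sx, sy in zip(SIGN_X, SIGN_Y)])
f11, f12, f13, f23 = forward_true
img_pts = np.column_stack([f11 * corners[:, 0] + f12 * corners[:, 1] + f13,
                           -f12 * corners[:, 0] + f11 * corners[:, 1] + f23])

a11, a12, a13, a23, alpha = fit_inverse_similarity(img_pts)
forward = invert_similarity(a11, a12, a13, a23)
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
R, T = pose_from_similarity(forward, K)
```

In this synthetic case the recovered depth `T[2]` equals the focal length divided by the image scale (1000 / 40 = 25 object widths), which is what one expects for a camera looking straight down at the ground plane.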

Object Spatial Location Using Remote Sensing Image
As shown in Figure 8, the geometric transformation of the satellite relative to the earth, $\begin{bmatrix} R_s & t_s \\ 0 & 1 \end{bmatrix}$, can be obtained accurately; the rigid connection transformation of the satellite camera relative to the satellite, $\begin{bmatrix} R_c & t_c \\ 0 & 1 \end{bmatrix}$, can also be measured. Then, through the APP-based method, the transformation of the object relative to the camera, $\begin{bmatrix} R & T \\ 0 & 1 \end{bmatrix}$, can be obtained. To sum up, the 3D transformation of the object relative to the earth can be calculated by chaining these three transformations.
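Chaining the three transformations can be illustrated with 4 × 4 homogeneous matrices in NumPy. The sketch assumes that each matrix maps coordinates from the child frame to the parent frame (object to camera, camera to satellite, satellite to earth), so the product is taken in that order; the numeric values and function names are made up for the example.

```python
import numpy as np


def homogeneous(R, t):
    """Pack a 3x3 rotation R and a translation t into a 4x4 matrix [R t; 0 1]."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M


def rot_z(degrees):
    """Rotation about the z-axis by the given angle in degrees."""
    a = np.deg2rad(degrees)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Illustrative values only.
earth_from_sat = homogeneous(rot_z(90.0), [0.0, 0.0, 500.0])   # [R_s t_s; 0 1]
sat_from_cam = homogeneous(np.eye(3), [1.0, 0.0, 0.0])         # [R_c t_c; 0 1]
cam_from_obj = homogeneous(rot_z(-90.0), [0.0, 2.0, 10.0])     # [R   T;  0 1]

# Object pose relative to the earth: apply the object-to-camera transform
# first, then camera-to-satellite, then satellite-to-earth.
earth_from_obj = earth_from_sat @ sat_from_cam @ cam_from_obj
object_origin_on_earth = earth_from_obj[:3, 3]
```

If the available matrices are defined in the opposite direction (parent to child), each one would need to be inverted before being multiplied.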

Experimental Details
Experimental data. We conducted experiments on the DOTA and HRSC2016 datasets. For the DOTA dataset, we cut the images into sub-images with a resolution of 1024 × 1024. For training, we used a batch size of 64. The learning rate was 0.0001 for the first 10,000 training iterations; after every further 20,000 iterations, it was multiplied by 0.1 until it reached 0.000001. Our fully convolutional network supports input images of different resolutions. We tested two input resolutions, 416 × 416 and 1024 × 1024. The 416 × 416 model produces outputs at 13 × 13, 26 × 26, and 52 × 52 scales, and the 1024 × 1024 model produces outputs at 32 × 32, 64 × 64, and 128 × 128 scales, which correspond to large, medium, and small objects, respectively. During detection, we first divided the DOTA image into blocks. The blocks needed to overlap to avoid losing objects that cross block borders. Finally, we computed the mAP (mean Average Precision), that is, the average of the AP over all object categories. Tables 4 and 5 give the statistics on the HRSC2016 and DOTA datasets, respectively. As can be seen from these two tables, the APP algorithm has the highest mAP.
We also tested five sets of models with different parameters. According to Equation (8), we could get the IoU between the predicted object point set and the ground truth point set, and calculate the corresponding mAP of each of these models. Tables 6 and 7 give the model parameters and the experimental results, respectively. By comparing the experimental results, we found that the mAP of the first set of model parameters was the highest.
The efficiency of an algorithm is one of the most important indicators of its quality. Our model accepts input images at two different resolutions. As shown in Table 8, we used a one-stage process that includes the NMS operation, so it was more efficient.
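The output scales quoted above for the two input resolutions follow from the downsampling strides of the network. The short Python sketch below reproduces them, assuming the standard yolo v3 strides of 32, 16, and 8; the strides are not stated in the text, and the function name is hypothetical.

```python
# Assumption: the three output layers use the standard yolo v3 strides.
YOLOV3_STRIDES = (32, 16, 8)


def grid_sizes(input_size, strides=YOLOV3_STRIDES):
    """Side length of the output grid at each scale for a square input."""
    if any(input_size % stride for stride in strides):
        raise ValueError("input size must be divisible by every stride")
    return [input_size // stride for stride in strides]


small_input = grid_sizes(416)    # coarse grid first (large objects)
large_input = grid_sizes(1024)
```

The coarsest grid has the largest receptive field per cell and is therefore the one associated with large objects, while the finest grid handles small objects.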
Table 9 gives the mean length of the objects. Assuming that the width W of all objects is equal to 1, the average length of each type of object can also be solved.
Experimental results. We improved yolo v3 by adding the prediction of the four APP corner points of the object to the original region layer. The local object was predicted according to Equation (19), and then the coordinates were converted to the large image to obtain the homography transformation. We then decomposed it into the three-dimensional attitude R and displacement T of the object relative to the camera. Figure 9 shows the experimental results. Figure 10 shows an incorrectly labeled image in which our trained model detects the object correctly.
For large images such as those in DOTA, we used the method of block synthesis. The large image is divided into a number of sub-blocks, each of which has exactly the standard size of the neural network input layer $[S_w, S_h]$, and the neural network is used to detect the object APP coordinates $(u_i, v_i)$ within each sub-block. Assuming that the coordinate of the upper-left corner of a sub-block is (left, top), the coordinate of the point in the full image is given by Equation (5).
All the objects of the sub-images are converted to the large image, and the NMS operation is performed at the end. To avoid objects being cut by block borders, the blocks need to overlap, and the overlap length was longer than the minimum object length. The overlap length of the horizontal and vertical sides was initially set to 20% of the basic length, as shown in Figure 11.
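The block synthesis step can be sketched in Python as follows. The sketch assumes that $(u_i, v_i)$ are pixel coordinates inside a sub-block that has not been resized, so that moving a point to the full image is a pure translation by (left, top). The tiling rule, including the final block that is shifted to end exactly at the image border, and the function names are choices of this sketch.

```python
def tile_origins(length, tile, overlap=0.2):
    """Start offsets of overlapping tiles that cover one image dimension."""
    if length <= tile:
        return [0]
    stride = max(1, int(tile * (1.0 - overlap)))  # 20% overlap by default
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile ends exactly at the border
    return origins


def to_full_image(points, left, top):
    """Translate sub-block pixel coordinates (u, v) into full-image coordinates."""
    return [(left + u, top + v) for u, v in points]


# Example: a 1000-pixel-wide image covered by 416-pixel tiles.
origins = tile_origins(1000, 416)

# Example: one predicted point in the tile whose corner is (332, 584).
full_points = to_full_image([(10.5, 20.0)], left=332, top=584)
```

After all sub-block detections have been translated in this way, a single NMS pass over the full image removes the duplicates produced in the overlapping strips.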

Error Analysis
We used the method of projecting pixel points to comprehensively evaluate the prediction accuracy of the 3D pose. The process is shown in Figure 7. The pixel projection error makes it possible to consider the position and attitude of the object in three-dimensional space together when evaluating the accuracy of the 3D pose [36]. By the nature of the attitude matrix R, its three rows represent the unit vectors of the x, y, and z axes of the camera coordinate system relative to the object, and its three columns, in turn, represent the unit vectors of the x, y, and z axes of the object relative to the camera coordinate system. Since the y-axis of the object's body coordinate system points in the negative direction of the object orientation, the negated components $(-r_{12}, -r_{22})$ of the second column of R describe the direction of the object. Therefore, the azimuth angle of the object relative to the camera coordinate system is $\theta = \arctan(-r_{12}, -r_{22})$, a two-argument arctangent. The coordinate calculation equation is
where K is the intrinsic parameter matrix of the camera, R and T are the attitude matrix and the displacement obtained according to the algorithm of Section 4.1, and $X_i$ are the coordinates of the four projection points of the object box close to the ground. The pixel error is
According to this formula, we could get the angular error and pixel error of the predicted object. We tested five different sets of parameters to get different models. It can be seen from Table 10 that the error of the first set of parameters is the smallest.
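The two error measures can be illustrated with NumPy. The projection and pixel error equations are not reproduced in this text, so the sketch makes explicit assumptions: points are projected with the standard pinhole model K(RX + T), the pixel error is the mean Euclidean distance between projected and observed points, and the two-argument arctangent is evaluated as `arctan2(-r12, -r22)` with the first argument in the numerator role. The function names and numeric values are hypothetical.

```python
import numpy as np


def azimuth(R):
    """Azimuth from the negated second column of R (assumed argument order)."""
    return float(np.arctan2(-R[0, 1], -R[1, 1]))


def project(K, R, T, pts3d):
    """Pinhole projection of object points X_i to pixels: K (R X_i + T)."""
    cam = pts3d @ R.T + T          # object frame -> camera frame
    pix = cam @ K.T                # camera frame -> homogeneous pixels
    return pix[:, :2] / pix[:, 2:3]


def mean_pixel_error(K, R, T, pts3d, observed):
    """Assumed pixel error: mean distance between projected and observed points."""
    diff = project(K, R, T, pts3d) - observed
    return float(np.mean(np.linalg.norm(diff, axis=1)))


# Illustrative pose: rotation of 30 degrees about the z-axis.
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a), np.cos(a), 0.0],
              [0.0, 0.0, 1.0]])
T = np.array([0.2, -0.1, 25.0])
K = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])

# Four ground-plane corners of a 1 x 3 box in the object frame.
corners = np.array([[-0.5, -1.5, 0.0], [0.5, -1.5, 0.0],
                    [0.5, 1.5, 0.0], [-0.5, 1.5, 0.0]])

theta = azimuth(R)
observed = project(K, R, T, corners)
exact_error = mean_pixel_error(K, R, T, corners, observed)
shifted_error = mean_pixel_error(K, R, T, corners, observed + np.array([3.0, 4.0]))
```

The angular error of a prediction would then be the difference between the azimuths computed from the predicted and ground truth attitude matrices, wrapped to the range of one full turn.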

Conclusions
In this paper, we proposed a new Anchor Points Prediction algorithm that can accurately determine the position and attitude of the object in three-dimensional space. Unlike traditional methods that predict the object RoI or an inclined box, we used a neural network to predict multiple feature points to detect the objects. The algorithm is one-stage, and both its accuracy and efficiency are greatly improved. It not only uniquely determines the direction of the object, but also calculates the 3D pose of the object from the APP coordinates. We believe that the APP algorithm can be applied effectively to object detection. Moreover, point prediction algorithms have broad application prospects and may become a new trend in the future.
Our method also has some shortcomings. For a slender object such as those in the Harbor category, the bounding box is relatively large, and the object occupies only a small part of it. As a result, the features extracted from the RoI region are inaccurate, which reduces the accuracy of the predicted key points of the object.