Saliency Detection with Bilateral Absorbing Markov Chain Guided by Depth Information

The effectiveness of depth information in saliency detection has been well established. However, how to utilize depth information more efficiently is still worth exploring. Erroneous depth information may cause detection failure, and non-salient objects that are closer to the camera may lead to an erroneous emphasis on non-salient regions. Moreover, most existing RGB-D saliency detection models have poor robustness when the salient object touches the image boundaries. To mitigate these problems, we propose a multi-stage saliency detection model with a bilateral absorbing Markov chain guided by depth information. The proposed model progressively extracts saliency cues in three stages (low-, mid-, and high-level). First, we generate low-level saliency cues by explicitly combining color and depth information. Then, we design a bilateral absorbing Markov chain to calculate mid-level saliency maps. In the mid-level stage, to suppress the boundary touch problem, we present the background seed screening mechanism (BSSM), which improves the construction of the two-layer sparse graph and selects better background-based absorbing nodes. Furthermore, the cross-modal multi-graph learning model (CMLM) is designed to fully explore the intrinsic complementary relationship between color and depth information. Finally, to obtain a more highlighted and homogeneous saliency map in the high-level stage, we construct a depth-guided optimization module that combines cellular automata and a suppression-enhancement function pair. This optimization module refines the saliency map in color space and depth space, respectively. Comprehensive experiments on three challenging benchmark datasets demonstrate the effectiveness of our proposed method both qualitatively and quantitatively.


Introduction
Salient object detection (SOD) is a fundamental task in computer vision that attempts to imitate the human visual attention mechanism to locate and segment the interesting or attractive regions in a scene. It has been widely applied to a variety of vision tasks, such as image segmentation [1], resizing [2], enhancement [3], quality assessment [4], recognition [5], and matching [6]. The human visual system not only captures the appearance of objects intuitively, but also perceives depth information from the scene. Benefiting from the development of 3D sensing technology, depth information can now be captured more conveniently and accurately. Therefore, RGB-D saliency detection using depth information is attracting more and more attention. Moreover, the effectiveness of depth information has been demonstrated in other computer vision tasks, such as motion segmentation [7] and people re-identification [8].
Given a pair of RGB-D (RGB + depth) images, the task of the RGB-D saliency detection aims to predict a saliency map and extract the salient regions by exploring the complementary information between color image and depth data. Furthermore, existing RGB-D saliency detection models mainly use depth information in two ways. One is based on depth features [9][10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25], which focuses on taking depth information as an explicit supplementary feature of color features. In [12], Cheng et al. calculate the saliency map with additional depth information through color contrast, depth contrast, and spatial bias extended from 2D to 3D, which also proves that depth information is beneficial to visual saliency analysis in complex scenes. In order to fully explore the potential color and depth cues in the whole saliency processing process, Peng et al. [16] propose an evolution strategy to introduce depth information into super-pixel generation, initial saliency map generation, and saliency propagation. In [24], Fang et al. propose a united stereoscopic saliency model, which combines depth-guided background prior, boundary background, and compactness based on disparity to estimate the initial saliency map. The map is refined by using the spatial dissimilarity features under reduced dimensions and central preference. Zhu et al. [17,18] directly use the depth map to generate the depth feature saliency and merge it with the color features saliency, then optimize the saliency map by combining the center dark channel prior (CDCP) or background elimination model. In [21], Song et al. generate different saliency measures based on multi-level features at different scales and perform discriminative saliency fusion through a random forest regressor to obtain the final saliency result. 
To address the unsatisfactory robustness of saliency detection algorithms in complex situations containing multiple objects or complex backgrounds, Zhu et al. [20] propose a multilayer backpropagation algorithm based on depth mining, which extracts depth cues from four different saliency layers to improve performance.
The other is based on depth measurement [26][27][28][29][30][31][32][33][34][35][36], which aims to obtain implicit attributes such as shape and contour from the depth map by designing depth measurement algorithms. Ren et al. [27] propose the normalized depth prior and the global-context surface orientation prior. These priors can highlight near objects, weaken distant objects, and reduce the saliency of severely inclined surfaces (such as the ground plane or ceilings). In [26], instead of using absolute depth, Ju et al. propose an anisotropic center-surround difference (ACSD) measure that considers the global depth structure to calculate the depth saliency map. However, the background usually contains regions with a large change in depth compared to their neighborhood, which leads to a high contrast in these regions. In response to this problem, Feng et al. [28] design a local background enclosure (LBE) feature to capture the spread of angular directions, which quantifies the proportion of the object boundary that is in front of the background in the depth map. In [33], Wang et al. propose a multi-stage salient object detection framework based on minimum barrier distance transformation and multi-layer cellular automata (MCA). The framework integrates multiple visual features and priors, including the background prior, the 3-D spatial prior, and the depth bias. In general, the depth-feature based method is an intuitive and simple way to achieve RGB-D saliency detection, but it ignores the potential attributes in the depth map. By contrast, the depth-measurement based method aims to refine the saliency results by using implicit information.
However, limited by depth sensor technology, not all depth information is accurate and usable. In other words, accurate depth maps provide precise depth information that facilitates saliency detection, whereas poor depth maps may cause detection failure. To handle this problem, Cong et al. [37] present a depth confidence measure to assess the reliability of the depth map and control the fusion ratio of depth features and color features in the saliency model. In addition, a saliency detection model that combines the implicit and explicit features of the depth map is proposed in [38]. Its main idea is to transfer an existing RGB saliency detection model to RGB-D images with the help of a depth constraint, so that it inherits the performance of the RGB saliency model. This improves the utilization efficiency of depth information to a certain extent, but the algorithm relies heavily on the performance of the RGB saliency detection algorithm. Therefore, how to effectively fuse depth information to enhance the detection of salient objects is still challenging. Moreover, the detection results of the above algorithms are mostly not ideal for scenes where the object touches the image boundary.
To tackle these problems, we propose a saliency detection model with a bilateral absorbing Markov chain guided by depth information. The model includes three progressive processing stages. In the first stage, we explicitly combine depth features with color features to calculate the low-level saliency information based on the background prior and the contrast prior. In the second stage, we design a bilateral absorbing Markov chain model based on the background seed screening mechanism (BSSM) and the cross-modal multi-graph learning model (CMLM). In this stage, we obtain mid-level foreground-based and background-based saliency maps by using the low-level saliency cues of the first stage. In the final stage, to further improve the performance of our algorithm, we propose a depth-guided optimization module to obtain a more homogeneous salient region.
The main contributions of our paper can be summarized as:

1. A multi-stage RGB-D saliency detection framework with the bilateral absorbing Markov chain model is proposed. The framework can make full use of the explicit and implicit information in the depth map and explore the complementary relationship between the modes.
2. The background seed screening mechanism is designed to solve the boundary touch problem. Moreover, the cross-modal multi-graph learning model is designed to implicitly fuse color and depth information through learning.
3. To better highlight the salient regions, we design a depth-guided optimization module which combines cellular automata and a suppression-enhancement function pair.

Methodology
This section describes the proposed method in detail, and the overall framework is shown in Figure 1. The algorithm mainly consists of four subsections: pre-processing, low-level saliency cues calculation, mid-level saliency maps generation and high-level saliency optimization.
Figure 1. Flowchart of the proposed method. BSSM: background seed screening mechanism; CMLM: cross-modal multi-graph learning model; bgAMC and fgAMC denote the background-based and foreground-based saliency maps based on the absorbing Markov chain, respectively; SE function pair: suppression-enhancement function pair.

Initial Two-Layer Sparse Graph Construction
Given an RGB image and an aligned depth map, we first convert the RGB image to the CIELAB color space and segment it into $N$ superpixels using the mean shift algorithm [39]. A superpixel is a small image region composed of adjacent pixels with similar features, e.g., color, brightness, and texture. Then, following [40], we construct an initial two-layer sparse graph $G = (V, E)$, where $V = \{v_i \mid 1 \le i \le N\}$ denotes the nodes and $E = \{e_{ij} \mid 1 \le i, j \le N\}$ denotes the edges between nodes. The graph is generated by connecting each node to its neighboring nodes and to the most similar node sharing a common boundary with its neighboring nodes. Note that the nodes on the four boundaries of the image are connected to each other to reduce the geodesic distance between background nodes. As shown in [40], compared with the ordinary two-layer graph, the two-layer sparse graph can effectively avoid interference from surrounding redundant nodes.
In this work, we utilize the pre-trained FCN-32s network [41] to extract the color feature vectors. The Euclidean distance $c_{ij}$ in RGB color space and the depth difference $d_{ij}$ between superpixels $i$ and $j$ are defined as
$$c_{ij} = \| x_i - x_j \|_2 \qquad (1)$$
and
$$d_{ij} = | d_i - d_j | \qquad (2)$$
where $x_i$ is the mean color feature vector of superpixel $i$, and $d_i$ denotes the mean depth value of superpixel $i$. The similarity $a_{ij}$ between superpixels $i$ and $j$ (Formula (3)) combines the color similarity $a^c_{ij}$ and the depth similarity $a^d_{ij}$, where the coefficient $\varepsilon$ adjusts the weight of depth information and is set to 0.5. The similarities $a^c_{ij}$ and $a^d_{ij}$ are computed from $c_{ij}$ and $d_{ij}$, respectively, with a parameter $\sigma^2$ that controls the strength of the similarity and is set to 0.1. The affinity matrix $W = [w_{ij}]_{N \times N}$ of the graph is defined by the similarity between two superpixels, where $\Omega_i$ is the set of neighbors of superpixel $i$ in the initial two-layer sparse graph.
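The affinity construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Gaussian-style kernels `exp(-c / sigma2)` and `exp(-d / sigma2)`, the convex combination weighted by `eps`, and the rescaling of distances to [0, 1] are all assumptions, since the exact formulas are not recoverable from the text. The function name `affinity_matrix` and its arguments are hypothetical.

```python
import numpy as np

def affinity_matrix(features, depths, neighbors, eps=0.5, sigma2=0.1):
    """Sketch of a sparse affinity matrix W over N superpixels.

    features : (N, F) mean feature vector x_i per superpixel
    depths   : (N,)   mean depth value d_i per superpixel
    neighbors: list of index lists; neighbors[i] plays the role of Omega_i
    """
    features = np.asarray(features, dtype=float)
    depths = np.asarray(depths, dtype=float)
    n = len(depths)
    # Pairwise Euclidean feature distance c_ij and depth difference d_ij.
    c = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    d = np.abs(depths[:, None] - depths[None, :])
    # ASSUMPTION: rescale both distances to [0, 1] before applying sigma2.
    c = c / c.max() if c.max() > 0 else c
    d = d / d.max() if d.max() > 0 else d
    # ASSUMPTION: exponential kernels and a convex combination weighted by eps.
    a_c = np.exp(-c / sigma2)
    a_d = np.exp(-d / sigma2)
    a = (1.0 - eps) * a_c + eps * a_d
    # Keep the similarity only for graph neighbors; all other entries stay zero.
    w = np.zeros((n, n))
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            w[i, j] = a[i, j]
    return w

# Toy example: superpixels 0 and 1 are alike, superpixel 2 differs.
feats = np.array([[0.1, 0.2], [0.1, 0.25], [0.9, 0.8]])
deps = np.array([0.2, 0.25, 0.9])
nbrs = [[1], [0, 2], [1]]
W = affinity_matrix(feats, deps, nbrs)
```

In this toy graph the edge between the two similar superpixels receives a much larger weight than the edge to the dissimilar one, and non-neighbors keep a zero weight.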

Low-Level Saliency Cues Calculation Using Color and Depth Cues
In this part, explicitly combining color and depth cues, we calculate low-level saliency information based on background prior and contrast prior. The saliency prior maps are shown in Figure 1.

Background Prior Calculation
We adopt boundary connectivity [42] to generate the background prior map. The prior is defined in terms of $\mathrm{BndCon}(i)$, the boundary connectivity value of superpixel $i$, and $\sigma_{bndCon}$, a weighting factor for boundary connectivity, which we empirically set as $\sigma^2_{bndCon} = 1$. This background measure is robust in normal cases and can effectively eliminate most background regions.
Sensors 2021, 21, 838
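As an illustration of how a boundary connectivity value can be mapped to a background probability, the sketch below uses the form commonly associated with the boundary connectivity measure of [42], namely `1 - exp(-BndCon^2 / (2 * sigma^2))`. Treating this as the formula used here is an assumption, and the function name `background_prior` is hypothetical. The boundary connectivity values themselves are taken as given inputs.

```python
import numpy as np

def background_prior(bnd_con, sigma2_bndcon=1.0):
    """Map boundary connectivity values to a background probability in [0, 1).

    ASSUMPTION: uses 1 - exp(-BndCon^2 / (2 * sigma^2)), the form commonly
    used with boundary connectivity; the source formula is not recoverable.
    """
    bnd_con = np.asarray(bnd_con, dtype=float)
    return 1.0 - np.exp(-bnd_con ** 2 / (2.0 * sigma2_bndcon))

# Superpixels with larger boundary connectivity look more like background.
bnd_con = np.array([0.0, 0.5, 1.0, 3.0])
s_bp = background_prior(bnd_con)
```

With this mapping, a superpixel with zero boundary connectivity gets a background probability of zero, and the probability saturates towards one as the connectivity grows.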

Region Contrast Prior Calculation
Human attention tends to focus on image regions that contrast strongly with their surroundings. Therefore, we calculate a region contrast similar to [43], which integrates depth features and rich color features. For each region, we compute its saliency value by measuring its combined depth and color contrast against all other regions. In this measure, $D_o(i, j)$ represents the Euclidean spatial distance between superpixels $i$ and $j$, and $\mathrm{Area}(v_j)$ is the area ratio of superpixel $j$ relative to the whole image.
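A region contrast of this kind can be sketched as an area-weighted, spatially weighted sum of feature contrasts. The exact weighting in the paper is not recoverable from the text, so the exponential spatial falloff, the parameter `sigma_s2`, the convex combination with `eps`, and the final min-max normalization below are assumptions; `region_contrast` is a hypothetical name.

```python
import numpy as np

def region_contrast(color_dist, depth_dist, centers, areas, eps=0.5, sigma_s2=0.4):
    """Sketch of a combined color/depth region contrast per superpixel.

    color_dist, depth_dist : (N, N) pairwise feature distances
    centers                : (N, 2) normalized superpixel centroids
    areas                  : (N,)   area ratio of each superpixel
    """
    color_dist = np.asarray(color_dist, dtype=float)
    depth_dist = np.asarray(depth_dist, dtype=float)
    centers = np.asarray(centers, dtype=float)
    areas = np.asarray(areas, dtype=float)
    # D_o(i, j): Euclidean spatial distance between superpixel centroids.
    d_o = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    # ASSUMPTION: nearby regions contribute more, via an exponential falloff.
    spatial_w = np.exp(-d_o / sigma_s2)
    # ASSUMPTION: color and depth contrasts are mixed with the weight eps.
    contrast = (1.0 - eps) * color_dist + eps * depth_dist
    np.fill_diagonal(contrast, 0.0)
    s = (spatial_w * contrast * areas[None, :]).sum(axis=1)
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)

# Toy example: superpixel 3 differs in color and depth from the other three.
f = np.array([0.0, 0.0, 0.0, 1.0])
dist = np.abs(f[:, None] - f[None, :])
centers = np.array([[0.2, 0.2], [0.8, 0.2], [0.2, 0.8], [0.5, 0.5]])
areas = np.array([0.3, 0.3, 0.3, 0.1])
s_rc = region_contrast(dist, dist, centers, areas)
```

The small, distinct central region receives the highest contrast value, while the three mutually similar regions receive the lowest.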

Mid-Level Saliency Maps Generation by Bilateral Absorbing Markov Chain
Inspired by [44], we design a bilateral absorbing Markov chain model, which combines multi-layer color features and depth features to obtain learned transition probability matrices and generate mid-level saliency maps. Most saliency models produce poor detection results when the salient object is not in the center of the image, especially when some salient regions touch the image boundary. To handle this situation in our model, we propose a background seed screening mechanism (BSSM) to improve the graph model and better select background-based absorbing nodes. Moreover, we present a cross-modal multi-graph learning model (CMLM) to obtain the learned affinity and transition probability matrices, which can make full use of the complementarity of color and depth information.

Absorbing Markov Chain for Saliency Detection
To facilitate understanding, we briefly introduce the principle of the absorbing Markov chain [45,46]. For a given set of states $S = \{s_1, s_2, \ldots, s_k\}$, the probability of moving from state $s_i$ to the next state $s_j$ is expressed as the transition probability $p_{ij}$, which does not depend on the chain before the current state. An absorbing Markov chain contains at least one absorbing state ($p_{ii} = 1$), and from every transient state some absorbing state can be reached. For an absorbing chain with $r$ absorbing states and $t$ transient states, the canonical form of the transition matrix $P$ is
$$P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix} \qquad (9)$$
where $Q \in [0, 1]^{t \times t}$ represents the transition probability between any pair of the $t$ transient states, while $R \in [0, 1]^{t \times r}$ represents the transition probability between any transient state and any absorbing state. $0$ is the $r \times t$ zero matrix and $I$ is the $r \times r$ identity matrix. Furthermore, the fundamental matrix $N$ is computed as [45]
$$N = (I - Q)^{-1} \qquad (10)$$
where $I$ here is the $t \times t$ identity matrix and the entry $n_{ij}$ of $N$ is the expected number of times the chain visits transient state $s_j$ when it starts from transient state $s_i$. Then the absorption probability for each transient state to reach any absorbing state can be defined as [46]
$$B = N R \qquad (11)$$
where the entry $b_{ij}$ of $B$ indicates the probability that the chain starting from transient state $s_i$ is absorbed in absorbing state $s_j$. Traditional saliency detection models based on the absorbing Markov chain generally mirror the image boundary superpixels as absorbing nodes (or states) and treat all others as transient nodes. Then, the transition matrix $P$ is constructed according to the similarity (the transition probability) between nodes. The saliency value is measured by the absorption probability: the higher the absorption probability of a node, the more similar it is to the absorbing nodes.
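The fundamental matrix and absorption probabilities can be computed directly from a transition matrix in canonical form. The following is a minimal sketch on a hand-made four-state chain; the function name `absorbing_chain` is hypothetical, and the absorbed time (row sums of the fundamental matrix, i.e., the expected number of steps before absorption) is included only as a standard by-product.

```python
import numpy as np

def absorbing_chain(P, t):
    """Return the fundamental matrix N and absorption probabilities B.

    P : (t + r, t + r) transition matrix in canonical form [[Q, R], [0, I]],
        with the t transient states listed first.
    """
    P = np.asarray(P, dtype=float)
    Q = P[:t, :t]                      # transient -> transient
    R = P[:t, t:]                      # transient -> absorbing
    N = np.linalg.inv(np.eye(t) - Q)   # N = (I - Q)^-1
    B = N @ R                          # B = N R
    return N, B

# Two transient states (0, 1) and two absorbing states (2, 3).
P = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
N, B = absorbing_chain(P, t=2)
absorbed_time = N.sum(axis=1)  # expected number of steps before absorption
```

Each row of `B` sums to one, since every transient state is eventually absorbed, and state 0 is absorbed by its directly connected absorbing state with probability 2/3.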

Background Seed Screening Mechanism
Traditional saliency detection models based on the absorbing Markov chain [44][45][46] usually mirror the image edge superpixels as absorbing nodes and simply connect all edge superpixels in pairs. However, as shown in Figure 2, when the salient object touches the image boundary, the mirroring mistakenly treats foreground nodes as background-based absorbing nodes, thus suppressing the saliency of the foreground regions or causing detection failure. Similarly, if the edge nodes contain foreground nodes, fully connecting them may reduce robustness. To overcome these problems, we propose a background seed screening mechanism (BSSM) for improving the two-layer sparse graph and selecting better background-based absorbing nodes. This mechanism removes the nodes that may belong to the foreground from the edge nodes. Furthermore, in order to increase the diversity of the background and suppress the background regions, a small number of random non-edge background nodes are selected to form a new edge node set and a background-based absorbing node set. Moreover, to obtain more homogeneous salient regions, we design a non-local connection similar to [47]. Next, we introduce the construction of the background seed screening mechanism and the non-local connections in detail.
Figure 2. (c) Saliency maps generated by [45]. (d) Saliency maps generated by [46]. (e) Saliency maps generated by [44].
To facilitate understanding, we provide a schematic diagram in Figure 3, which describes the main screening process of the background seeds. First, according to the attributes of saliency, position, and depth, all nodes are classified into three categories. As shown in Figure 3a, based on the low-level background prior $S_{bp}$, we divide all nodes into the background seed set $\Omega_{BG}$, the foreground seed set $\Omega_{FG}$, and others, where $S_{fp}$ represents the foreground prior.
According to the position attribute, the nodes can be classified as the edge node set $\Omega_{edge} = \{i \mid i \in \text{edge}\}$ and the non-edge node set $\Omega_{non\_edge} = \{i \mid i \notin \text{edge}\}$, as shown in Figure 3b.
Considering that objects far away from the camera are likely to belong to the background, as shown in Figure 3c, we use a depth threshold to divide the nodes into the depth-based background seed set $\Omega_{Dep}$ and others.
To alleviate the boundary touch problem and select background seeds more accurately, we utilize the k-means algorithm to filter out the foreground nodes in the background seed set $\Omega_{BG}$ and the edge node set $\Omega_{edge}$. More specifically, we cluster the sets $\Omega_{BG}$, $\Omega_{edge}$, and $\Omega_{FG}$ to find the nodes that are similar to the foreground seeds $\Omega_{FG}$. Figure 3d shows the filtered result: the new edge node set $\Omega'_{edge}$ and the new background seed set $\Omega'_{BG}$. In Figure 3e, we take depth information into consideration in the process of background seed screening. In Figure 3f, non-edge background seeds and depth-based background seeds are further divided into three sub-sets: $\Omega_A$, $\Omega_B$, and $\Omega_C$. The seeds in $\Omega_A$ have both a high background probability and a high depth value, while the seeds in $\Omega_B$ and $\Omega_C$ satisfy only the requirement of high background probability or high depth value, respectively.
Then, to guarantee the diversity of the background and suppress the background more effectively, we combine a small number of non-edge nodes with $\Omega'_{edge}$ to form the final edge nodes $\Omega_{f\_edge}$. These non-edge nodes are randomly composed of 50% $\Omega_A$, 10% $\Omega_B$, and 50% $\Omega_C$. In the initial two-layer sparse graph, all edge nodes are simply connected together to reduce the geodesic distances between nodes. However, this may be poorly robust when salient objects touch the image boundaries. Therefore, instead of these rough connections, we connect the final edge nodes $\Omega_{f\_edge}$ in pairs to obtain a new two-layer sparse graph $G_{new}$. In addition, to obtain more consistent salient regions, we introduce the non-local connection into the graph. Specifically, we first sort the foreground prior $S_{fp}$ and the region contrast prior $S_{rc}$ of all nodes; the top 50% of both are selected as foreground seeds, and the bottom 50% are selected as background seeds. For each superpixel, we connect it to two nodes that are randomly chosen from the two seed sets, respectively. This connection mechanism helps to highlight the foreground objects and suppress the background regions. The improved two-layer sparse graph with the non-local connection is visualized in Figure 4e. Moreover, Figure 5 demonstrates the effects of the proposed background seed screening mechanism (BSSM) and the non-local connection. In Figure 5e, it is clearly observed that the background is well suppressed by the improved two-layer sparse graph based on the BSSM. Figure 5g illustrates that the non-local connection can achieve more complete and consistent salient regions.
Figure 4. (c) Ground truth. (d) A diagram of the connections of one of the nodes based on the initial two-layer sparse graph. A node (illustrated by a pink dot) connects to its adjacent nodes (blue dots and connections) and to the most similar node (dark green dots and connections) sharing common boundaries with its adjacent nodes. All edge nodes are connected in pairs (yellow dots and local connections). (e) A diagram of the connections of one of the nodes based on the improved two-layer sparse graph. Different from the initial graph, the new edge nodes first remove some foreground nodes that lie on the image boundary (the nodes at the bottom edge of the image), and further include a small number of non-edge background nodes (black nodes). Each pair of the new edge nodes is connected (yellow and black dots and connections). Additionally, each node connects to the background seeds (light green dots and connections) and the foreground seeds (purple dots and connections).
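The non-local connection step described in this section can be sketched as below. The reading of "the top 50% of both" as the intersection of the two rankings is an assumption, as are the fixed random seed and the function name `non_local_edges`; the k-means based screening of edge nodes is not reproduced here.

```python
import numpy as np

def non_local_edges(s_fp, s_rc, seed=0):
    """Pick foreground/background seeds and add two random non-local edges per node."""
    rng = np.random.default_rng(seed)
    s_fp = np.asarray(s_fp, dtype=float)
    s_rc = np.asarray(s_rc, dtype=float)
    n = len(s_fp)
    half = n // 2
    order_fp = np.argsort(-s_fp)  # descending foreground prior
    order_rc = np.argsort(-s_rc)  # descending region contrast prior
    # ASSUMPTION: a node is a seed only if both rankings agree.
    fg = np.intersect1d(order_fp[:half], order_rc[:half])
    bg = np.intersect1d(order_fp[n - half:], order_rc[n - half:])
    edges = []
    for i in range(n):
        # One random foreground seed and one random background seed per node.
        for seeds in (fg, bg):
            candidates = seeds[seeds != i]
            if candidates.size:
                edges.append((i, int(rng.choice(candidates))))
    return fg, bg, edges

s_fp = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
s_rc = [0.8, 0.9, 0.6, 0.3, 0.2, 0.1]
fg_seeds, bg_seeds, extra_edges = non_local_edges(s_fp, s_rc)
```

In this toy case the three highest-ranked nodes become foreground seeds, the three lowest become background seeds, and every node gains exactly two extra edges.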

Cross-Modal Multi-Graph Learning Model
The two-layer sparse graph constructs the connections among the local regions, which will restrict the range of random walk to the local regions. Therefore, the absorption time may be inaccurate, especially when the long-range smooth background distributes near the center of image. To overcome it, we have improved the graph model from the connection relationship in the above section. However, in the absorbing Markov chain model, another key influencing factor is the weight of the edges between nodes. Similar to Formula (3), most of the existing graph models directly weight depth and color cues to measure the similarity between nodes. However, the models do not consider the effect of color and depth information on saliency detection in different scenarios. For example, in some scenes, color is more reliable than depth, so a larger weight of color is needed. Conversely, if depth is more reliable, we need to strengthen the weight of depth. Therefore, we propose a cross-modal multi-graph learning model (CMLM), which fully explores the complementary relationship between color and depth in different scenarios. The learning model constructs a more accurate affinity matrix and captures the optimal fusion state of color and depth information.
Some algorithms [44,48] construct the affinity matrix by learning. In [48], a learning model based on a single graph is proposed, which constructs an approximately full affinity matrix by optimizing Equation (18), where $Y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{N \times N}$ is an affinity matrix optimized by unsupervised learning based on the original sparse affinity matrix. $y_i = [y_{i1}, y_{i2}, \ldots, y_{iN}]$ is a column vector indicating the degree of affinity between node $i$ and all other nodes, and $i_i$ is the $i$-th column of the identity matrix $I$, which indicates the similarity of node $i$ with itself. In Equation (18), the first term is a smoothness constraint that measures the difference between $y_i$ and $y_j$: the more similar the two nodes are, the lower the value of this term. The second term is a self-constraint, which emphasizes that no matter how we update the value of $y_i$ for node $i$, it should not deviate too much from its initial value. $\mu > 0$ is a parameter that balances the two terms.
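For intuition, a single-graph objective of this kind has a closed-form minimizer. The sketch below assumes the objective $\mathrm{Tr}(Y L Y^T) + \mu \|Y - I\|_F^2$ with $L$ the graph Laplacian of the sparse affinity matrix; the constant factors in Equation (18) are not recoverable from the text, so this scaling is an assumption and `learn_affinity` is a hypothetical name. Setting the gradient $2YL + 2\mu(Y - I)$ to zero gives $Y = \mu (L + \mu I)^{-1}$.

```python
import numpy as np

def learn_affinity(W, mu=0.001):
    """Closed-form minimizer of Tr(Y L Y^T) + mu * ||Y - I||_F^2 (assumed objective)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    return mu * np.linalg.inv(L + mu * np.eye(n))

# Path graph 0 - 1 - 2: nodes 0 and 2 are not directly connected.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
mu = 0.5  # larger than the paper's 0.001 so the effect is visible on a toy graph
Y = learn_affinity(W, mu=mu)
```

Although nodes 0 and 2 share no edge in `W`, the learned matrix `Y` assigns them a positive affinity, which is the sense in which the sparse graph is turned into an approximately full one.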
Formula (18) describes the learning process under a single-layer graph. To make full use of the complementarity of color and depth information, we explore the feature spaces of multiple modes and develop a cross-modal multi-graph model to learn an affinity matrix (objective function (19)), where the parameter $\gamma$ controls the weight distribution over all affinity matrices, ensuring that features of different modes can be fully utilized. Without this parameter, it is possible in some cases that only part of the features participate in the learning of the affinity matrix, so that the complementarity between different features is insufficiently exploited. The parameters $\mu$ and $\gamma$ are set to 0.001 and 4, respectively. To facilitate the derivation, we rewrite the objective function (19) in matrix form (Equation (20)), where $L_c^{(\nu)}$ is the graph Laplacian matrix of the $\nu$-th color feature, $D_c^{(\nu)}$ is the corresponding degree matrix, and $\mathrm{Tr}(\cdot)$ and $\|\cdot\|_F$ compute the trace and the Frobenius norm of a matrix, respectively. There are two unknowns, $\beta$ and $Y$, to be solved in Equation (20), so we decompose the problem into two sub-problems and solve them iteratively, alternately fixing $\beta$ to update $Y$ and fixing $Y$ to update $\beta$. To obtain the optimal solution of each sub-problem, we utilize partial derivatives and the Lagrange multiplier method. The specific derivation process can be found in [48]. With the learned affinity matrix $Y$, we can calculate the transition matrices of the absorbing Markov chain. The final learned affinity matrix is obtained by normalization (Formula (24)). Figure 6d shows the effects of the proposed cross-modal multi-graph learning model (CMLM). As illustrated, compared to the single-mode multi-graph learning model (color mode), the proposed model highlights the salient regions more precisely.
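The alternating optimization can be sketched as follows, under an assumed objective of the common multi-graph form $\sum_\nu \beta_\nu^{\gamma} \mathrm{Tr}(Y L^{(\nu)} Y^T) + \mu \|Y - I\|_F^2$ subject to $\sum_\nu \beta_\nu = 1$. The paper's exact Equations (19) and (20) are not recoverable from the text, so this form, the update rules derived from it, and the names `laplacian` and `multi_graph_learning` are assumptions rather than the authors' method.

```python
import numpy as np

def laplacian(W):
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def multi_graph_learning(affinities, mu=0.001, gamma=4.0, n_iter=20):
    """Alternate between Y and beta for an ASSUMED multi-graph objective."""
    laps = [laplacian(W) for W in affinities]
    m, n = len(laps), laps[0].shape[0]
    beta = np.full(m, 1.0 / m)  # start from uniform graph weights
    for _ in range(n_iter):
        # Fix beta, update Y: closed form Y = mu * (sum_v beta_v^gamma L_v + mu I)^-1.
        L_mix = sum(b ** gamma * L for b, L in zip(beta, laps))
        Y = mu * np.linalg.inv(L_mix + mu * np.eye(n))
        # Fix Y, update beta: the Lagrange multiplier solution on the simplex is
        # beta_v proportional to (1 / Tr(Y L_v Y^T))^(1 / (gamma - 1)).
        t = np.array([np.trace(Y @ L @ Y.T) for L in laps])
        t = np.maximum(t, 1e-12)  # guard against division by zero
        w = (1.0 / t) ** (1.0 / (gamma - 1.0))
        beta = w / w.sum()
    return Y, beta

# Toy color and depth affinities over three nodes.
W_color = np.array([[0.0, 0.9, 0.1], [0.9, 0.0, 0.2], [0.1, 0.2, 0.0]])
W_depth = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
Y, beta = multi_graph_learning([W_color, W_depth])
```

Because the exponent $\gamma > 1$ spreads the weights, both graphs keep a strictly positive weight, which matches the stated role of $\gamma$ in preventing a single mode from dominating.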

Background-Based Saliency Map via Absorbing Markov Chain
In this part, we select background-based absorbing nodes based on the above background seed screening mechanism. As presented in Figure 7a, we mirror the edge nodes $\Omega'_{edge}$ and some non-edge background nodes as virtual absorbing nodes, and treat all nodes in the image as transient nodes. The non-edge background nodes are randomly composed of 50% $\Omega_A$, 50% $\Omega_B$, and 50% $\Omega_C$. The number of absorbing nodes is $r$. Then, the background-based affinity matrix of size $N \times r$ can be obtained with Formula (24). Furthermore, the learned transition matrix is defined using a degree matrix that is the sum of two matrices $D_1$ and $D_2$, where $D_1$ is the degree matrix of $W_L$. Based on the above work, the saliency of each node $i$ is computed, and the resulting background-based saliency map $S_{bg}$ is shown in Figure 1. Then, we mirror the nodes whose saliency value is greater than the threshold $th$ as the foreground-based absorbing nodes, as illustrated in Figure 7b. The number of these absorbing nodes is $k$.

Foreground-Based Saliency via Absorbing Markov Chain
Similarly, the foreground-based affinity matrix of size N×k can be obtained with Formula (24), and the learned transition matrix is defined analogously, again involving the sum of the matrices $D_1$ and $D_2$. According to Formula (11), we then obtain the absorption probability matrix $B_F$. To calculate the foreground-based saliency more accurately and eliminate the interference of weakly correlated nodes, we sort each row of $B_F$ and use only the top 60% of the nodes to calculate the final saliency value, that is, c = 0.6 × k entries per row. The foreground-based saliency map $S_{fg}$ is shown in Figure 1.
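The row-wise selection described above can be sketched as follows. The absorption probability matrix is taken as a given input (a list of rows, one per transient node, with k entries each). Aggregating the retained entries by a plain sum and rounding c to an integer are assumptions, since the exact saliency formula is not reproduced here; the function name is hypothetical.

```python
def foreground_saliency(b_f, ratio=0.6):
    """Sum the largest c = ratio * k absorption probabilities of each row.

    b_f   : absorption probability matrix, one row per transient node
    ratio : fraction of foreground absorbing nodes kept per row (0.6 here)
    """
    k = len(b_f[0])
    # Rounding c to an integer (and keeping at least one entry) is an assumption.
    c = max(1, round(ratio * k))
    return [sum(sorted(row, reverse=True)[:c]) for row in b_f]


# Toy matrix with k = 5 foreground absorbing nodes (hypothetical numbers).
b_toy = [
    [0.5, 0.1, 0.1, 0.2, 0.1],       # strongly tied to a few seeds
    [0.05, 0.05, 0.05, 0.05, 0.05],  # weakly tied to every seed
]
scores = foreground_saliency(b_toy)
```

Discarding the smallest entries of each row keeps weakly correlated absorbing nodes from contributing to the score of a node.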

High-Level Saliency Map Optimization via Depth Guidance
To further highlight the salient regions and to explore the inner relationship between depth information and saliency information, we design a depth-guided optimization module that combines cellular automata with a suppression-enhancement function pair.

Optimization via Cellular Automata
We first perform a primary fusion of the saliency maps produced by the bilateral absorbing Markov chain model, which yields the fused map $S_{fb}$ of Equation (30). Based on the improved two-layer sparse graph, we then use the cellular automata [49] propagation mechanism to further optimize the fused saliency map. First, based on the learned affinity matrix $W_L$ and the color similarity matrix $A_c = [a^c_{ij}]_{N \times N}$, we construct an impact factor matrix $F = [f_{ij}]_{N \times N}$. All superpixel nodes (cells) are then updated simultaneously through an iteration rule in which I is the identity matrix, $F^*$ and $C^* = \mathrm{diag}\{c_1^*, c_2^*, \ldots, c_N^*\}$ are the normalized impact factor matrix and the coherence matrix, respectively, and $\mathrm{diag}\{d_1^f, \ldots, d_N^f\}$ is the degree matrix of $F$ with $d_i^f = \sum_j f_{ij}$. The constant coefficients a and b are set to 0.6 and 0.2, respectively, and norm(·) denotes the normalization function. Each cell automatically evolves into a more accurate and stable state, and under the influence of its neighborhood the salient regions become easier to detect. The initial state $S^h$ at h = 0 is $S_{fb}$ in Equation (30), and the final saliency map after h = 10 time steps is denoted as $S_{CA}$, which is visualized in Figure 8g.
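The synchronous update can be sketched as below. The sketch assumes the commonly used cellular automata rule of [49], $S^{h+1} = C^* S^h + (I - C^*) F^* S^h$, with $F^*$ obtained by row-normalizing $F$ and $c_i^* = a \cdot \mathrm{norm}(1 / \max_j f_{ij}) + b$. Whether this matches the exact definitions used in this work is not confirmed by the text, so the impact factor matrix is treated as a given input and all names are hypothetical.

```python
def normalize(values):
    """Min-max normalization to [0, 1]; a constant input maps to zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def cellular_automata(f, s0, a=0.6, b=0.2, steps=10):
    """Refine a saliency vector s0 with a synchronous cellular automata update.

    f  : impact factor matrix (N x N, non-negative, assumed given)
    s0 : initial saliency of the N cells, values in [0, 1]
    """
    n = len(f)
    # F* = D^-1 F, where d_i = sum_j f_ij is the degree of cell i.
    f_star = []
    for row in f:
        d = sum(row)
        f_star.append([v / d if d > 0 else 0.0 for v in row])
    # Coherence of each cell towards its own state (assumed form from [49]).
    c = [1.0 / max(row) if max(row) > 0 else 0.0 for row in f]
    c_star = [a * v + b for v in normalize(c)]
    s = list(s0)
    for _ in range(steps):
        # Both terms read the previous state, so all cells update at once.
        neigh = [sum(f_star[i][j] * s[j] for j in range(n)) for i in range(n)]
        s = [c_star[i] * s[i] + (1.0 - c_star[i]) * neigh[i] for i in range(n)]
    return s


# Toy example with three cells (hypothetical numbers).
f_toy = [[0.0, 0.8, 0.2],
         [0.8, 0.0, 0.4],
         [0.2, 0.4, 0.0]]
s_ca = cellular_automata(f_toy, [1.0, 0.6, 0.0])
```

Because $F^*$ is row-stochastic and each $c_i^*$ lies in [b, a + b] = [0.2, 0.8], every step is a convex combination of previous states, so the values stay within the range of the initial map.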

Refinement via Depth Information
Cellular automata mainly explores the neighborhood relationship between the nodes in the color feature space, but ignores the spatial position information in the scene. Therefore, we use depth cues to enhance and refine the salient regions and suppress the background regions. In this work, we design a depth selective refinement mechanism by a suppression-enhancement function pair: the suppression function is used to suppress the background, and then an enhancement function is used to emphasize the salient regions through high-confidence depth seeds.
Suppression function: The regions far away from the camera have a higher probability of being background and need to be suppressed. We therefore define a suppression function $S_{SF}$, in which $th_{CA}$ is the adaptive threshold of the saliency map $S_{CA}$ obtained by the Otsu algorithm [50], and $S_d(i) = \mathrm{norm}(d_i)$ is the depth prior. After filtering $S_{SF}$ with the Otsu algorithm, the suppressed saliency map $S_1$ is obtained.
Enhancement function: Although the suppression function inhibits background information to a certain extent, it may lose some saliency information. The enhancement function plays a complementary role. First, we need to determine which depth information is reliable and should be retained. Here we combine three saliency maps to filter out the potential depth seed set $\Omega_D$ with high confidence: the depth seeds are the nodes that are salient in all of the saliency maps $S_{fg}$, $S_{bg}$ and $S_{CA}$. The enhancement function is then defined on these seeds. After applying the suppression-enhancement function pair, we obtain the final saliency map $S_{EF}$, which is shown in Figure 8h.
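The seed selection step can be sketched as follows. The suppression and enhancement formulas themselves are not reproduced here; the sketch only shows an Otsu threshold on a saliency map and the intersection that defines the depth seed set. Treating a node as "salient" when it reaches the Otsu threshold of each map is an assumption, and all function names are hypothetical.

```python
def otsu_threshold(values, bins=256):
    """Otsu threshold for values in [0, 1], maximizing between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(bins - 1, int(v * bins))] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w0, sum0 = 0, 0.0
    best_var, best_t = -1.0, 0
    for t in range(bins):
        w0 += hist[t]            # number of samples in bins 0..t
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0          # number of samples in bins t+1..bins-1
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    # Values in bins above best_t form the salient class.
    return (best_t + 1) / bins


def depth_seeds(s_fg, s_bg, s_ca):
    """Indices of nodes that are salient in all three saliency maps."""
    maps = (s_fg, s_bg, s_ca)
    thresholds = [otsu_threshold(m) for m in maps]
    return [i for i in range(len(s_ca))
            if all(m[i] >= t for m, t in zip(maps, thresholds))]


# Toy saliency values for four nodes (hypothetical numbers).
seeds = depth_seeds([0.9, 0.8, 0.1, 0.2],
                    [0.9, 0.1, 0.1, 0.8],
                    [0.8, 0.9, 0.2, 0.1])
```

Requiring agreement across all three maps keeps only nodes with consistent evidence, which is what makes their depth values trustworthy as seeds.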

Datasets
To demonstrate the effectiveness of the proposed algorithm, we evaluate the model on three popular datasets: NLPR [13], NJU2K [26], and STERE [9]. The NLPR dataset includes 1000 RGB-D images, whose depth maps are captured by Microsoft Kinect. The NJU2K dataset contains 1985 RGB-D images collected from the Internet, 3-D movies and photographs taken by a stereo camera, and its depth maps are estimated by the optical-flow method. The STERE dataset contains 1000 stereoscopic images with the corresponding pixel-level ground truth.

Evaluation Metrics
Following [51], we use five popular evaluation metrics to evaluate the performance of the saliency detection methods comprehensively.
MAE estimates the mean absolute error between a predicted saliency map S and the ground-truth map GT. It is defined as
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - GT(x, y) \right|,$$
where H and W are the height and the width of the saliency map.
PR curve is formed by a series of precision and recall pairs calculated at fixed thresholds ranging from 0 to 255, and describes the model performance under different operating points.
F-measure is a weighted harmonic mean of average precision and recall, which is defined as
$$F_\beta = \frac{(1 + \beta^2) \, \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \, \mathrm{Precision} + \mathrm{Recall}}.$$
We empirically set $\beta^2 = 0.3$.
S-measure [52] is used to measure the spatial structure information, which is defined as
$$S_\alpha = \alpha S_o + (1 - \alpha) S_r,$$
where α is a balance parameter between the object-aware structural similarity $S_o$ and the region-aware structural similarity $S_r$, and it is set to 0.5.
E-measure [53] evaluates the foreground map (FM) and noise. It combines local pixel values with image-level mean values to jointly capture image-level statistics and local pixel matching information:
$$E_m = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi(x, y),$$
where φ is an enhanced alignment matrix for the two properties of a binary map.
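The simpler metrics can be sketched as below, assuming saliency maps flattened to lists of values in [0, 1] and a ground truth whose foreground pixels are those with value at least 0.5. The function names are hypothetical, and the S-measure and E-measure are omitted because their component terms are defined in [52] and [53].

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def precision_recall(pred, gt, threshold):
    """Precision and recall of the map binarized at `threshold`."""
    binary = [p >= threshold for p in pred]
    tp = sum(1 for b, g in zip(binary, gt) if b and g >= 0.5)
    n_pred = sum(binary)
    n_gt = sum(1 for g in gt if g >= 0.5)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    return precision, recall


def f_measure(precision, recall, beta2=0.3):
    """Weighted harmonic mean of precision and recall with beta^2 = 0.3."""
    denom = beta2 * precision + recall
    return (1.0 + beta2) * precision * recall / denom if denom else 0.0


def pr_curve(pred, gt):
    """Precision-recall pairs at the 256 thresholds 0/255, 1/255, ..., 255/255."""
    return [precision_recall(pred, gt, t / 255.0) for t in range(256)]


# Toy map with four pixels (hypothetical numbers).
pred_toy = [0.9, 0.6, 0.4, 0.1]
gt_toy = [1.0, 1.0, 0.0, 0.0]
error = mae(pred_toy, gt_toy)
p, r = precision_recall(pred_toy, gt_toy, 0.3)
score = f_measure(p, r)
```

With $\beta^2 = 0.3$ the F-measure weights precision more heavily than recall, so a map that over-segments the foreground is penalized more than one that misses part of it.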

Ablation Study
Our algorithm combines the background seed screening mechanism, the non-local connection, the cross-modal multi-graph learning model, and the depth-guided optimization module. To demonstrate the effectiveness of each component, we carry out a series of experiments, whose results are summarized in Figure 9. The baseline model, combination 1 in Figure 9, combines the two-layer graph with the bilateral absorbing Markov chain based on single-modal multi-graph learning. As illustrated by combinations 1, 2 and 3 in Figure 9, the two-layer sparse graph and the background seed screening mechanism greatly improve the performance of our algorithm. Compared to the two-layer graph, the two-layer sparse graph suppresses most of the background better, as shown in Figure 5d. Figure 5d also shows that, with the background seed screening mechanism, the background is further weakened and the foreground is further strengthened. Compared with combination 3, the cross-modal multi-graph learning model improves the precision-recall and S-measure results, but the other evaluation metrics may be slightly lower. Overall, the cross-modal multi-graph learning model and the depth-guided optimization module together achieve the best results, as shown by combinations 5 and 6. As Figure 6d shows, compared to the single-mode multi-graph learning model (color mode), the cross-modal multi-graph learning model makes foreground objects stand out better in various scenes. Figure 8h displays the effect of the depth-guided optimization module. Finally, combinations 7 and 8 make it clear that the non-local connection effectively improves the overall performance of the algorithm. The saliency maps with the non-local connection are more precise, as shown in Figure 5g.

Comparisons with State-of-the-Art Methods
We compare our proposed algorithm with 10 state-of-the-art RGB-D saliency detection models, including ACSD [26], DESM [12], LHM [13], GP [27], DCMC [37], LBE [28], SE [16], CDCP [18], CDB [24], and DTM [38]. For a fair comparison, we use the saliency maps provided by [51]. Table 1 and Figure 10 show the quantitative results of the different RGB-D saliency detection models. We also report saliency maps for various scenes in Figure 11.
Table 1. Quantitative comparisons of different RGB-D saliency detection methods on three popular datasets. Red, green and blue indicate the best, second and third performances. ↑ denotes larger is better, and ↓ denotes smaller is better.

We report PR curves of the three datasets in Figure 10 and list $S_\alpha$, $E_m$, $F_\beta$ and MAE in Table 1. As shown in Figure 10, our method achieves better PR curves on the three datasets, especially on the NLPR and STERE datasets. This indicates that our method can obtain higher precision and recall than the other methods. On the NJU2K dataset, although the end of our PR curve drops faster than that of some methods, our curve remains robust on each dataset and keeps a good overall balance between precision and recall.
As listed in Table 1, our method obtains the best results among all the compared methods on all three datasets. This demonstrates that our algorithm generates more accurate salient regions and adapts better to various scenes than the others.
In addition to the quantitative comparisons, we display some saliency maps in Figure 11 to show the effectiveness of our model visually. Most saliency detection methods can effectively handle cases with relatively simple backgrounds and homogeneous objects, but they fail on complicated cases. In contrast, our method deals with these intricate scenarios more effectively. To make this more convincing, we compare the methods in the following four aspects: (1) the effectiveness of dealing with boundary touch issues; (2) the effectiveness of the background suppression; (3) the effectiveness of handling similar appearances; and (4) the effectiveness of detection with a poor depth map. We illustrate these four aspects with examples. First, as shown in the 7th and 8th rows of Figure 11a, the 3rd, 5th, and 7th rows of Figure 11b, and the 8th row of Figure 11c, only the GP algorithm has a certain resistance to the boundary touch problem, but when the background is complex and the depth map is poor, as in the 3rd row of Figure 11b, its detection fails. In contrast, our algorithm achieves better results in various scenes in this situation. Second, from the 3rd, 4th, and 6th rows of Figure 11a, we can see that most of the algorithms cannot effectively remove the background in front of the salient objects, due to the interference of depth values near the camera. Our method, however, can effectively eliminate it by using learning fusion. Third, as shown in the 7th row of Figure 11b and the 8th and 10th rows of Figure 11c, our method works well when the color appearance of the salient object is similar to the background. Finally, our model remains robust when the depth map quality is poor, as demonstrated in the 3rd and 4th rows of Figure 11b and the 1st, 2nd, 3rd, and 6th rows of Figure 11c.
In general, our algorithm is more robust in various complex scenarios. In particular, when the salient objects touch the image boundary or the depth map quality in the dataset is uneven, our method still performs well and produces uniform and highlighted salient objects.
Computational complexity. We use computational complexity to demonstrate the advantages of our proposed method over other methods, both traditional and deep learning-based. In this paper, we adopt floating point operations (FLOPs) to measure the computational complexity of the models. For fair comparisons, we obtain the deployment code released by the authors and use the same configuration as far as possible to estimate their computational complexity. As illustrated in Table 2, compared with the latest deep learning-based methods such as D3Net [51], BBS-Net [54], and UC-Net [55], our computational complexity is only one tenth or even one hundredth of theirs. Moreover, taken together with Table 1, our model achieves clearly higher performance than traditional methods such as DCMC [37], CDCP [18], and DTM [38] at a relatively low computational complexity.

Conclusions and Future Work
In this paper, we propose an RGB-D saliency detection model with the bilateral absorbing Markov chain guided by depth information. Using the explicit combination of depth and color information, we first generate the low-level saliency cues based on the background prior and the contrast prior. Then, to overcome the existing drawbacks of the absorbing Markov chain model, we propose a series of methods: the background seed screening mechanism (BSSM) for boundary touch cases and the cross-modal multi-graph learning model for multi-modal fusion. Moreover, considering the limitation of local intrinsic correlation, a non-local intrinsic correlation is introduced to improve the two-layer sparse graph. Based on the optimized bilateral absorbing Markov chain model, we obtain the mid-level saliency maps. Finally, we design a depth-guided optimization module to obtain a more accurate high-level saliency map. The optimization module consists of two submodules: the cellular automata, which optimizes the integrated saliency map in the color space, and the suppression-enhancement function pair, which refines the saliency map in the depth space. Compared with most of the algorithms mentioned in this article, our method alleviates the boundary touch problem well and greatly suppresses the background. The comprehensive comparisons and the ablation study on three RGB-D saliency detection datasets demonstrate, both qualitatively and quantitatively, that the proposed method is effective and robust in various scenarios.
The work in [51] builds a new and quite challenging salient person (SIP) dataset, which covers diverse real-world scenes with various viewpoints, poses, occlusions, illumination conditions, and backgrounds. Moreover, deep learning-based RGB-D saliency detection methods [51,54,55] have developed vigorously and achieved a qualitative leap. Therefore, we look forward to extending our work to deep learning in the future, exploring the complementarity of depth and color information more fully, and dedicating ourselves to the study of saliency detection algorithms in real-world scenes.