Computation of Kullback–Leibler Divergence in Bayesian Networks

Kullback–Leibler divergence KL(p, q) is the standard measure of the error made when a true probability distribution p is approximated by another probability distribution q. Its efficient computation is essential in many tasks, for example in approximate inference or as a measure of error when learning a probability distribution. For high-dimensional distributions, such as those associated with Bayesian networks, a direct computation can be unfeasible. This paper considers the problem of efficiently computing the Kullback–Leibler divergence between two probability distributions, each of them coming from a different Bayesian network, possibly with different structures. The procedure is based on an auxiliary deletion algorithm that computes the necessary marginal distributions, using a cache of operations with potentials in order to reuse past computations whenever possible. The algorithms are tested with Bayesian networks from the bnlearn repository. Computer code in Python is provided, built on pgmpy, a library for working with probabilistic graphical models.


Introduction
When experimentally testing Bayesian network learning algorithms, in most cases the performance is evaluated by looking at structural differences between the graph of the original Bayesian network and that of the learned one [1], as when using the structural Hamming distance. This measure is used in recent contributions such as [2][3][4]. A study and comparison of the different metrics used to measure the structural differences between two Bayesian networks can be found in [1].
However, in most cases the aim of learning a Bayesian network is to estimate a joint probability distribution for the variables in the problem. In that situation the error of a learning procedure should be computed by measuring the difference between the probability distribution associated with the learned network and the original joint probability distribution. It can even be advantageous to estimate a network that is less dense than the original one, but whose parameters can be estimated more accurately. This is the case of the Naive Bayes classifier, which obtains very good results in classification problems despite the fact that its structure is usually not the correct one. In this situation, structural graphical differences are not a good measure of performance.
The basic measure to determine the divergence between an estimated distribution and a true one is the so-called Kullback-Leibler divergence [5]. Some papers use this way of assessing the quality of a learning procedure, as in [6][7][8]. A direct computation of the divergence is unfeasible if the number of variables is high. However, some basic decomposition properties [9] (Theorem 8.5) can be applied to reduce the cost of computing the divergence. This is the basis of the procedure implemented in the Elvira system [10], which is the one used in [6]. The methods in [7,8] are also based on the same basic decomposition. Kullback-Leibler divergence is not only meaningful for measuring the divergence between a learned network and a true one, but also for other tasks, as for example the approximation of a Bayesian network by a simpler one [11][12][13] by removing some of the existing links.

In what follows, a Bayesian network N is defined on a set of variables X = {X_1, ..., X_n}; the parent set of X_i in the network is denoted by pa_i, and its family is f_i = {X_i} ∪ pa_i. A configuration or assignment of values to a set of variables X, {X_1 = x_1, ..., X_n = x_n}, can be abbreviated with (x_1, ..., x_n) and is denoted as x. If the set of possible values for each variable in the previous example is {0, 1}, then a configuration can be x = (0, 0, 1), representing the assignment {X_1 = 0, X_2 = 0, X_3 = 1}.
A partial configuration involving a subset of variables Y ⊆ X is denoted as y. If the set of variables is f_i or pa_i, then the partial configuration will be denoted by x_{f_i} or x_{pa_i}, respectively. In our example, if f_2 = {X_2, X_1}, an example of a partial configuration for these variables is x_{f_2} = (0, 0). The set of configurations for variables Y is denoted by Ω_Y. If x is an assignment and Y ⊆ X, then the configuration y obtained by deleting the values of the variables in X \ Y is denoted by x^{↓Y}. If x_{f_2} = (0, 0) is a partial configuration for variables {X_2, X_1} and we consider Y = {X_2}, then x_{f_2}^{↓Y} is the configuration obtained by removing the value of X_1, i.e., (0).
If w and z are configurations for W ⊆ X and Z ⊆ X respectively, and W ∩ Z = ∅, then (w, z) is a configuration for W ∪ Z, called the composition of w and z. For example, if w is the configuration (0) for variable X_1 and z is the configuration (0, 1) defined on {X_2, X_3}, then their composition is the configuration (0, 0, 1) for variables {X_1, X_2, X_3}.
The conditional probability distribution for X_i given its parents will be denoted as φ_i; it is a potential defined on the set of variables f_i. In general, a potential φ for variables Y ⊆ X is a mapping from Ω_Y into the set of real numbers: φ : Ω_Y → R. The set of variables of a potential φ will be denoted as v(φ). If Φ is a set of potentials, v(Φ) will denote ∪_{φ∈Φ} v(φ).
There are three basic operations that can be performed on potentials:

• Multiplication. If φ, φ′ are potentials, then their multiplication is the potential φ · φ′, with set of variables v(φ · φ′) = v(φ) ∪ v(φ′), obtained by pointwise multiplication:

(φ · φ′)(x) = φ(x^{↓v(φ)}) · φ′(x^{↓v(φ′)}).

In our example, the combination of φ_2 and φ_3 is the potential φ_2 · φ_3 defined on {X_1, X_2, X_3}.

• Marginalization. If φ is a potential defined for variables Y and Z ⊆ Y, then the marginalization of φ on Z is denoted by φ^{↓Z} and is obtained by summing over the variables in Y \ Z:

φ^{↓Z}(z) = Σ_{y ∈ Ω_{Y\Z}} φ(z, y).

When Z is equal to Y minus a variable W, then φ^{↓Z} will be called the result of removing W in φ and also denoted as φ^{−W}. In the example, a marginalization of φ_3 is obtained by removing X_3, producing φ_3^{−X_3}, defined on X_2.

• Selection. If φ is a potential defined for variables Y and z is a configuration for variables Z, then the selection of φ for the configuration z is the potential φ^{Z=z} defined on variables W = Y \ Z and given by

φ^{Z=z}(w) = φ(w, z^{↓Y}).

In this expression (w, z^{↓Y}) is the composition of configurations w and z^{↓Y}, which is a configuration for variables v(φ). Going back to the example, assume that we want to perform the selection of φ_3 for the configuration z = (0, 1) of variables {X_1, X_2}; then φ_3^{Z=z} is a potential defined for variable X_3, as we are reducing φ_3(X_3, X_2) to a configuration z in which X_2 takes the value 1.

The family of all the conditional distributions is denoted as Φ = {φ_1, ..., φ_n}. It is well known that given a Bayesian network N, its joint probability distribution p is a potential that decomposes as the product of the potentials included in Φ:

p = ∏_{i=1}^{n} φ_i.    (1)

Considering the example, Φ = {φ_1, φ_2, φ_3} and p = φ_1 · φ_2 · φ_3. The marginal distribution of p for a set of variables Y ⊆ X is equal to p^{↓Y}. When Y contains only one variable X_i, then to simplify the notation p^{↓Y} will be denoted as p_i.
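The three operations can be sketched in Python with a minimal dict-based representation of potentials. This is an illustration only (the representation and the names combine, remove and select are ours, not the paper's code); variables are assumed binary, and φ_2 = P(X_2 | X_1) and φ_3 = P(X_3 | X_2) follow the running example with invented numbers:

```python
from itertools import product

# A potential is a pair (scope, table): 'scope' is a tuple of variable names
# and 'table' maps each configuration (a tuple of values, one per scope
# variable) to a real number. Variables are assumed binary here.
DOMAINS = {"X1": (0, 1), "X2": (0, 1), "X3": (0, 1)}

def combine(p, q):
    """Pointwise multiplication: the scope is the union of both scopes."""
    (vp, tp), (vq, tq) = p, q
    scope = vp + tuple(v for v in vq if v not in vp)
    table = {}
    for conf in product(*(DOMAINS[v] for v in scope)):
        x = dict(zip(scope, conf))
        table[conf] = (tp[tuple(x[v] for v in vp)] *
                       tq[tuple(x[v] for v in vq)])
    return scope, table

def remove(p, var):
    """Marginalization phi^{-var}: sum var out of the potential's scope."""
    vp, tp = p
    scope = tuple(v for v in vp if v != var)
    table = {}
    for conf, value in tp.items():
        key = tuple(x for v, x in zip(vp, conf) if v != var)
        table[key] = table.get(key, 0.0) + value
    return scope, table

def select(p, evidence):
    """Selection phi^{Z=z}: keep entries consistent with evidence, drop Z."""
    vp, tp = p
    scope = tuple(v for v in vp if v not in evidence)
    table = {}
    for conf, value in tp.items():
        x = dict(zip(vp, conf))
        if all(x[v] == val for v, val in evidence.items() if v in vp):
            table[tuple(x[v] for v in scope)] = value
    return scope, table

# phi2 = P(X2 | X1) and phi3 = P(X3 | X2), with invented numbers
phi2 = (("X2", "X1"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6})
phi3 = (("X3", "X2"), {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8})
scope23, _ = combine(phi2, phi3)   # defined on {X1, X2, X3}
```

With this representation, combine(phi2, phi3) has scope {X_1, X_2, X_3}, remove(phi3, "X3") is a potential on X_2, and select(phi3, {"X2": 1}) is a potential on X_3, mirroring the three examples above.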
Sometimes it will be necessary to make reference to the Bayesian network containing a potential or family. In these cases we will use a superscript. For example, f^A(X_i) and pa^A(X_i) refer to the family and the parent set of X_i in a Bayesian network N^A, respectively.

The aim of this paper is to compute the Kullback-Leibler divergence (termed KL) between the joint probability distributions, p^A and p^B, of two different Bayesian networks N^A and N^B defined on the same set of variables X but possibly having different structures. This divergence, denoted as KL(N^A, N^B), can be computed considering the probabilities for each configuration x in both distributions as follows:

KL(N^A, N^B) = Σ_{x ∈ Ω_X} p^A(x) log ( p^A(x) / p^B(x) ).    (2)

However, the direct computation of the joint probability distributions may be unfeasible for complex models, as the number of configurations x is exponential in the number of variables. If p, q are probability distributions on X, then the expected log-likelihood (LL) of q with respect to p is:

LL(p, q) = Σ_{x ∈ Ω_X} p(x) log q(x);

then, from Equation (2) it is immediate that:

KL(N^A, N^B) = LL(N^A, N^A) − LL(N^A, N^B).    (3)

The probability distribution p^B decomposes as considered in Equation (1). Therefore, the term LL(N^A, N^B) in Equation (3) can be obtained as follows, considering the families of the variables in N^B and their corresponding configurations, x^{↓f^B_i}:

LL(N^A, N^B) = Σ_{x ∈ Ω_X} p^A(x) Σ_{i=1}^{n} log φ^B_i(x^{↓f^B_i}).    (4)

Interchanging additions and reorganizing the terms in Equation (4):

LL(N^A, N^B) = Σ_{i=1}^{n} Σ_{x_{f^B_i} ∈ Ω_{f^B_i}} (p^A)^{↓f^B_i}(x_{f^B_i}) log φ^B_i(x_{f^B_i}).    (5)

Equation (5) allows LL(N^A, N^B) to be computed from a reduced set of values, as it is only necessary to compute the marginal probability distribution for the variables in f^B_i, the family of X_i in the Bayesian network N^B, but using the joint probability distribution p^A associated with the Bayesian network N^A.
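The identity KL(N^A, N^B) = LL(N^A, N^A) − LL(N^A, N^B) can be checked numerically on a toy pair of networks. The sketch below is a brute-force check with invented CPT numbers: p^A factorizes as X_1 → X_2, while p^B treats the two variables as independent.

```python
from itertools import product
from math import log

# Invented CPTs over binary X1, X2 (for illustration only)
pA1 = {0: 0.3, 1: 0.7}
pA2_given_1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # P_A(X2|X1)
pB1 = {0: 0.3, 1: 0.7}
pB2 = {0: 0.41, 1: 0.59}

def pA(x1, x2): return pA1[x1] * pA2_given_1[(x2, x1)]
def pB(x1, x2): return pB1[x1] * pB2[x2]

def LL(p, q):
    """Expected log-likelihood of q under p: sum_x p(x) log q(x)."""
    return sum(p(x1, x2) * log(q(x1, x2))
               for x1, x2 in product((0, 1), repeat=2))

# Direct KL (Equation (2)) versus the LL difference (Equation (3))
kl_direct = sum(p * log(p / q) for p, q in
                ((pA(x1, x2), pB(x1, x2))
                 for x1, x2 in product((0, 1), repeat=2)))
kl_from_ll = LL(pA, pA) - LL(pA, pB)
assert abs(kl_direct - kl_from_ll) < 1e-12
```

For larger networks the direct sum over all configurations becomes infeasible, which is precisely the motivation for the family-wise computation of Equation (5).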

Computation with Propagation Algorithms
In this section we introduce the category of inference algorithms based on deletion of variables and then we show how these algorithms can be applied to compute the Kullback-Leibler divergence using Equation (5).

Variable Elimination Algorithms
To compute (p^A)^{↓f^B_i} we consider Φ^A, the set of potentials associated with the network N^A: the multiplication of all the potentials in Φ^A is equal to p^A. Deletion algorithms [15,16] can be applied to Φ^A to determine the required marginalizations. The basic step of these algorithms is the deletion of a variable from a set of potentials Φ:

• Variable Deletion. If Φ is a set of potentials, the deletion of X_i consists of the following operations:
1. Let Φ_i = {φ ∈ Φ : X_i ∈ v(φ)}, the set of potentials containing variable X_i.
2. Compute φ^{−i} = (∏_{φ ∈ Φ_i} φ)^{−X_i}, i.e., combine all the potentials in Φ_i and remove variable X_i by marginalization.
3. Set Φ ← (Φ \ Φ_i) ∪ {φ^{−i}}, i.e., remove from Φ the potentials containing X_i and add the new potential φ^{−i}, which does not contain X_i.
The main property of the deletion step is the following: if ∏_{φ∈Φ} φ = q before the deletion, then after the deletion of X_i from Φ, ∏_{φ∈Φ} φ = q^{−X_i}. It is easy to see that the deletion of a variable X_i can be computed by operating only with the elements of Φ defined on X_i.
If Φ is the initial set of potentials of a network N, then p = ∏_{φ∈Φ} φ. In order to compute the marginalization of p on a set of variables Y ⊆ X, the deletion procedure is repeated for each variable X_i in X \ Y. If the marginal probability distribution for a single variable X_k is to be calculated, every variable in X different from X_k should be deleted. The order of variable deletion does not affect the final result, but the efficiency may depend on it.
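A sketch of the deletion procedure under a minimal dict-based representation of potentials (all names and numbers are ours, not the paper's code; variables are assumed binary):

```python
from itertools import product

DOMAINS = {"X1": (0, 1), "X2": (0, 1), "X3": (0, 1)}  # assumed binary

def combine(p, q):
    """Pointwise multiplication of two (scope, table) potentials."""
    (vp, tp), (vq, tq) = p, q
    scope = vp + tuple(v for v in vq if v not in vp)
    table = {}
    for conf in product(*(DOMAINS[v] for v in scope)):
        x = dict(zip(scope, conf))
        table[conf] = tp[tuple(x[v] for v in vp)] * tq[tuple(x[v] for v in vq)]
    return scope, table

def remove(p, var):
    """Sum 'var' out of the potential's scope."""
    vp, tp = p
    scope = tuple(v for v in vp if v != var)
    table = {}
    for conf, value in tp.items():
        key = tuple(x for v, x in zip(vp, conf) if v != var)
        table[key] = table.get(key, 0.0) + value
    return scope, table

def eliminate(potentials, var):
    """Deletion step: combine the potentials containing var, then sum it out."""
    touching = [p for p in potentials if var in p[0]]
    rest = [p for p in potentials if var not in p[0]]
    prod = touching[0]
    for p in touching[1:]:
        prod = combine(prod, p)
    return rest + [remove(prod, var)]

def marginal(potentials, keep):
    """Marginal on 'keep': delete every other variable, combine what is left."""
    for var in {v for p in potentials for v in p[0]} - set(keep):
        potentials = eliminate(potentials, var)
    result = potentials[0]
    for p in potentials[1:]:
        result = combine(result, p)
    return result

# Chain X1 -> X2 -> X3 with invented CPT numbers
phi1 = (("X1",), {(0,): 0.3, (1,): 0.7})
phi2 = (("X2", "X1"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6})
phi3 = (("X3", "X2"), {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8})
m3 = marginal([phi1, phi2, phi3], ("X3",))
```

Deleting X_1 and X_2 leaves a single potential on X_3, the marginal p_3; whichever deletion order the set iteration produces, the result is the same, matching the remark above that the order affects efficiency but not the result.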
When there are observed variables, Z = z, a previous step of selection should be carried out: every potential φ ∈ Φ is transformed into φ^{Z=z}. This step will be called evidence restriction. After it, the product q of the potentials in Φ is defined for the variables in Y = X \ Z and its value is q(y) = p(y, z), i.e., the joint probability of obtaining these values and the observations. If a deletion of the variables in W is then carried out, the product of the potentials in Φ is a potential defined for the variables Y = (X \ Z) \ W, and satisfies q(y) = p^{↓Y∪Z}(y, z).
When we have observations and we want to compute the marginal on a variable X i , it is well known that not all the initial potentials in Φ are relevant. A previous pruning step can be done using the Bayes-ball algorithm [17] in order to remove the irrelevant potentials from Φ before restricting to the observations and carrying out the deletion of variables.
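When there is no evidence, the relevance criterion reduces to keeping the CPTs of the target variables and their ancestors; Bayes-ball covers the general case with observations. A hedged sketch, with a graph encoding and names of our own choosing:

```python
# With no evidence, the CPT of X_i is relevant for a marginal on Y exactly
# when X_i is in Y or is an ancestor of a variable in Y.
def relevant_variables(parents, targets):
    """Return the targets together with all their ancestors in the DAG."""
    relevant, stack = set(), list(targets)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    return relevant

# N^A of the running example: X1 is a parent of both X2 and X3
parents_A = {"X1": (), "X2": ("X1",), "X3": ("X1",)}
```

For a marginal on X_2, only the potentials of X_1 and X_2 are relevant: relevant_variables(parents_A, ["X2"]) returns {"X1", "X2"}, so φ^A_3 can be discarded before the deletion of variables.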

Computation of Kullback-Leibler Divergence Using Deletion Algorithms
Our first alternative to compute LL(N^A, N^B) is based on using a simple deletion algorithm to compute the values (p^A)^{↓f^B_i} needed in Equation (5). The basic steps are:

• Given a specific variable X_i, for each possible configuration x_{pa^B_i} of its parent variables in N^B, we include the observation pa^B_i = x_{pa^B_i} and apply a selection operation to the list of potentials associated with the Bayesian network N^A by means of evidence restriction. We also apply a pruning of irrelevant potentials using the Bayes-ball algorithm.

• Then all the variables are deleted except the target variable X_i. The potentials in Φ will all be defined for variable X_i, and their product will be a potential q defined for variable X_i such that q(x_i) = (p^A)^{↓f^B_i}(x_i, x_{pa^B_i}).
• The deletion algorithm is repeated for each variable X_i and each configuration x_{pa^B_i} of its parent variables in the Bayesian network N^B. Therefore, the number of executions of the propagation algorithm on the Bayesian network N^A is equal to Σ_{i=1}^{n} ∏_{X_j ∈ pa^B(X_i)} n_j, where n_j is the number of possible values of X_j. This is immediate taking into account that ∏_{X_j ∈ pa^B(X_i)} n_j is the number of possible configurations x_{pa^B(X_i)} of the variables in pa^B(X_i).
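The count of propagations can be illustrated directly; the cardinalities and the chain structure below are invented, and math.prod over an empty parent set yields 1 for root variables:

```python
from math import prod

# Number of runs of the deletion algorithm required by the first approach:
# one per variable of N^B and per configuration of its parents.
card = {"X1": 2, "X2": 3, "X3": 2}
parents_B = {"X1": (), "X2": ("X1",), "X3": ("X2",)}
runs = sum(prod(card[p] for p in pa) for pa in parents_B.values())
# X1 contributes 1 (empty product), X2 contributes 2, X3 contributes 3
```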
Though this method takes advantage of propagation algorithms to compute marginals in a Bayesian network, and avoids the brute-force computation associated with the use of Equation (2), it is quite time consuming when the structure of the involved Bayesian networks is complex.
Algorithm 1 details the basic steps of the initial proposal for computing the Kullback-Leibler divergence. This algorithm is the one used in [8,10]. Observe that this algorithm computes LL(N^A, N^B); it allows the computation of the KL divergence by using Equation (3).

Algorithm 1 Computation of LL(N^A, N^B) using evidence propagation
 1: function LL(N^A, N^B)
 2:   sum ← 0.0                                        ▷ sets initial value to sum
 3:   for each X_i in N^B do
 4:     for each x_{pa^B_i} do                         ▷ configuration of X_i parents
 5:       Let Φ' be the set of relevant potentials from Φ^A   ▷ applying Bayes-ball
 6:       Restrict the potentials in Φ' to the evidence pa^B_i = x_{pa^B_i}
 7:       Delete from Φ' all the variables except X_i
 8:       Let q be the product of all the potentials in Φ'
 9:       for each x_i in Ω_{X_i} do
10:         sum ← sum + q(x_i) · log φ^B_i(x_i, x_{pa^B_i})
11:       end for
12:     end for
13:   end for
14:   return sum
15: end function

As an example, let us suppose we wish to compute the KL divergence between two Bayesian networks N^A and N^B defined on X = {X_1, X_2, X_3} (see Figure 1). Let us assume N^A is the reference model. The families of the variables in both models are presented in Figure 1. We have to compute LL(N^A, N^A) and LL(N^A, N^B). To compute LL(N^A, N^B), Algorithm 1 is applied, initially with Φ' = Φ^A. The algorithm works as follows:

• The parent set for X_1 is empty. The set of relevant potentials in network N^A to compute the marginal for X_1 is Φ' = {φ^A_1}, which is the desired marginal q.

• The parent set for X_2 in N^B is {X_1}. So, for each value X_1 = x_1 we have to introduce this evidence in the network N^A and compute the marginal on X_2. The set of relevant potentials is Φ' = {φ^A_1, φ^A_2}. These potentials are reduced by selection on the configuration X_1 = x_1. If we call φ_4, φ_5 the results of reducing φ^A_1, φ^A_2, respectively, then φ_4 is a potential defined for the empty set of variables and determined by its value φ_4() for the empty configuration, while φ_5 is a potential defined for variable X_2. The desired marginal q is the multiplication of these potentials: q(x_2) = φ_4() · φ_5(x_2).

• The parent set for X_3 in N^B is {X_2}. So, for each value X_2 = x_2 we have to introduce this evidence in the network N^A and compute the marginal on X_3. In this case, all the potentials are relevant. The first step introduces the evidence X_2 = x_2 in all the potentials containing this variable. Only φ^A_2 contains X_2; therefore the selection φ^A_2{}^{X_2=x_2} is a potential defined on variable X_1, which we will denote as φ_6. To compute the marginal on X_3, we have to delete variable X_1. As all the potentials in Φ' contain this variable, they must be combined before removing X_1, i.e., computing (φ^A_1 · φ^A_3 · φ_6)^{−X_1}. After this operation, this will be the only potential in Φ' and it is the desired marginal q.
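The family-wise accumulation of Equation (5), which Algorithm 1 exploits, can be checked numerically on the structures of Figure 1 (N^A: X_1 → X_2, X_1 → X_3; N^B: X_1 → X_2 → X_3). The CPT numbers below are invented, and brute-force summation stands in for the propagation algorithm:

```python
from itertools import product
from math import log

# Invented CPTs over binary variables
phiA1 = {0: 0.3, 1: 0.7}
phiA2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}    # P_A(X2|X1)
phiA3 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # P_A(X3|X1)
phiB1 = {0: 0.35, 1: 0.65}
phiB2 = {(0, 0): 0.85, (1, 0): 0.15, (0, 1): 0.3, (1, 1): 0.7}  # P_B(X2|X1)
phiB3 = {(0, 0): 0.6, (1, 0): 0.4, (0, 1): 0.2, (1, 1): 0.8}    # P_B(X3|X2)

def pA(x1, x2, x3):
    return phiA1[x1] * phiA2[(x2, x1)] * phiA3[(x3, x1)]

def pB(x1, x2, x3):
    return phiB1[x1] * phiB2[(x2, x1)] * phiB3[(x3, x2)]

confs = list(product((0, 1), repeat=3))

# Direct computation: LL = sum_x p^A(x) log p^B(x)
ll_direct = sum(pA(*x) * log(pB(*x)) for x in confs)

def marg(keep):
    """Marginal of p^A on the given coordinate indices, by enumeration."""
    table = {}
    for x in confs:
        key = tuple(x[i] for i in keep)
        table[key] = table.get(key, 0.0) + pA(*x)
    return table

# One marginal of p^A per family of N^B: {X1}, {X1, X2}, {X2, X3}
m1, m12, m23 = marg((0,)), marg((0, 1)), marg((1, 2))
ll_family = (sum(m1[(x1,)] * log(phiB1[x1]) for x1 in (0, 1))
             + sum(m12[(x1, x2)] * log(phiB2[(x2, x1)])
                   for x1, x2 in product((0, 1), repeat=2))
             + sum(m23[(x2, x3)] * log(phiB3[(x3, x2)])
                   for x2, x3 in product((0, 1), repeat=2)))
assert abs(ll_direct - ll_family) < 1e-12
```

The two computations agree, confirming that accumulating q(x_i) · log φ^B_i over families, as in the inner loop of Algorithm 1, reproduces LL(N^A, N^B).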

Inference with Operations Cache
The approach proposed in this paper is based on the following fact: the computation of the KL divergence using Equations (3) and (5) requires us to obtain the marginals (p^A)^{↓f^A_i} and (p^A)^{↓f^B_i} for each variable X_i. We have designed a procedure to compute each of the required marginals (p^A)^{↓Y}. Marginals are computed by deleting the variables not in Y. The procedure uses a cache of computations which can be reused in the different marginalizations in order to avoid repeated computations. We have implemented a general procedure to calculate the marginals for a family Y of subsets Y of X in a Bayesian network N. In our case, the family Y is {f^A_i : i = 1, ..., n} ∪ {f^B_i : i = 1, ..., n} and the Bayesian network is N^A. A previous step consists of determining the relevant potentials for computing the marginal on a subset Y, as not all the initial potentials are necessary. If Φ is the list of potentials, then the conditional probability potential φ_i for variable X_i is relevant for Y when X_i belongs to Y or is an ancestor of some of the variables in Y. This is a consequence of known relevance properties in Bayesian networks [17]. Let us call Φ_Y the family of relevant potentials for the subset Y.
Our algorithm assumes that the subsets in Y are numbered from 1 to K: {Y_1, ..., Y_K}. The algorithm first carries out the deletion algorithm symbolically, without actually doing numerical computations, in order to determine which operations can be reused. A symbolic combination of φ and φ′ consists of determining a potential φ · φ′ defined for the variables v(φ) ∪ v(φ′) but without computing its numerical values (only the scope of the resulting potential is actually computed). The case of marginalization is handled analogously.
In fact, two repositories are employed: one for potentials (R_Φ) and another for operations (R_O). The entry for each potential in R_Φ contains a value acting as its identifier (id), the potential itself, and the identifier of the last operation for which this potential was required (denoted as the potential time). Initially, R_Φ contains the potentials in Φ, assigning time = 0 to all of them. When a potential is no longer required, it is removed from R_Φ in order to alleviate memory space requirements. The potentials representing the required marginals (the results of the queries) are set with time = −1 in order to avoid their deletion.
The repository R_O contains an entry for each operation (combination or marginalization) with potentials performed during the run of the algorithm in order to compute the required marginals. In this way, if an operation is needed again in the future, its result can be retrieved from R_O, preventing repeated computations. Initially, R_O will be empty. At the end of the analysis, it will include the description of the elementary operations carried out throughout the evaluation of all the queries. Two kinds of operations will be stored in R_O: • the combination of two potentials φ_1 and φ_2, producing a new one, φ_r, as result; • the marginalization of a potential φ_1 in order to sum out a variable, obtaining φ_r as result.
The operation description will be stored as registers (id, type, φ_1, φ_2, φ_r) with the following information: • a unique identifier for the operation (id; an integer); • the type of operation (type): marginalization or combination; • the identifiers of the potentials involved as operands and result (identifiers allow potentials to be retrieved from R_Φ). If the operation is a marginalization, then φ_2 will contain the index of the variable to remove.
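The two repositories can be sketched as plain data structures; the class and field names below are illustrative choices of ours that follow the description in the text, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class PotentialEntry:
    pid: int        # identifier of the potential in R_Phi
    scope: tuple    # variables of the potential (enough for the symbolic phase)
    time: int       # id of the last operation needing it; -1 means never delete

@dataclass
class Operation:
    oid: int        # unique operation identifier
    kind: str       # "comb" or "marg"
    arg1: int       # id of the first operand potential
    arg2: int       # id of the second operand, or the variable index for "marg"
    result: int     # id of the result potential

class Repositories:
    def __init__(self):
        self.R_phi = {}   # pid -> PotentialEntry
        self.R_o = {}     # (kind, arg1, arg2) -> Operation

    def lookup(self, kind, arg1, arg2):
        """Return the previously recorded operation, if any (a cache hit)."""
        return self.R_o.get((kind, arg1, arg2))

# Recording the combination phi_1 * phi_2 as operation 1 with result phi_4
repos = Repositories()
repos.R_phi[1] = PotentialEntry(1, ("X1",), 0)
repos.R_phi[2] = PotentialEntry(2, ("X2", "X1"), 0)
repos.R_o[("comb", 1, 2)] = Operation(oid=1, kind="comb", arg1=1, arg2=2, result=4)
```

Keying R_o by (type, operand ids) makes the detection of a repeated operation a single dictionary lookup.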
The computation of a marginal for a set Y will in general also require a deletion order for the variables. This order is always obtained with a fixed triangulation heuristic, min-weight [18]. However, the procedure described here does not depend on this heuristic and any other one could be used. Algorithm 2 depicts the basic structure of the procedure. The result is LR, an ordered list of K potentials containing the required marginals for Y_1, ..., Y_K. The algorithm is divided into two main parts.
In the first part (lines 2-26), the operations are planned (using symbolic propagation), detecting repeated operations. It is assumed that there are two basic functions, SCOMBINE(φ_1, φ_2) and SMARGINALIZE(φ, i), representing the symbolic operations. We also consider two conditional versions of these operations: if the operation already exists in the repository, only the times are updated; if it does not exist, it is symbolically carried out and added to the repository of operations. The conditional combination will be CONDSCOMBINE(φ_1, φ_2, t) and the conditional marginalization CONDSMARGINALIZE(φ, i, t); they are depicted in Algorithms 3 and 4, respectively. It is assumed that both repositories are global variables for all the procedures. The potentials representing the required marginals are never deleted: a time equal to −1 is assigned to them, so that if time = −1 the potential should not be removed, and this time is never updated. We will assume a function UPDATETIME(φ, t) which does nothing if the time of φ is equal to −1, and otherwise updates the time of φ to t in the repository R_Φ.
Observe that the first part of Algorithm 2 (lines 2-26) just determines the operations necessary for the deletion algorithm for all the marginals, while the second part (lines 27-32) carries out the numerical computations in the order established in the first part. After each operation, the potentials that are no longer necessary are removed from R_Φ and their memory is deallocated. We will assume a function DELETEIF(φ, t) which does this (removes φ from R_Φ if the time of φ is equal to t).
As mentioned above, the analysis of the operation sequence will be carried out using symbolic operations, taking into account the scopes of the potentials but without numerical computations. This allows an efficient analysis. The result of the analysis will be used as an operation plan for the subsequent numerical computation.
Assume the same example considered in the previous sections for the networks in Figure 1. The marginals to compute on N^A (as reference model) correspond to the families Y_1 = {X_1}, Y_2 = {X_1, X_2}, Y_3 = {X_1, X_3} and Y_4 = {X_2, X_3}. Initially, the potentials repository R_Φ contains the potentials of N^A (a conditional probability for each variable given its parents): φ^A_1(X_1), φ^A_2(X_2, X_1), and φ^A_3(X_3, X_1), with time 0. We indicate the variables involved in each potential. The operations repository, R_O, will be empty. Table 1 contains the initial repositories. Notice that the superscript A has been omitted in order to simplify the notation.

Algorithm 2 Computation of the marginals for a family of subsets {Y_1, ..., Y_K}
 1: function MARGINALS(Φ, {Y_1, ..., Y_K})
 2:   LR ← empty list, t ← 1
 3:   for each k = 1, ..., K do
 4:     Let Y be the subset Y_k in Y
 5:     Let Φ_Y be the family of potentials from Φ relevant to the subset Y
 6:     for each variable X_i in v(Φ_Y) \ Y, following the deletion order, do
 7:       Let Φ_i = {φ_1, ..., φ_L} be the potentials in Φ_Y containing X_i
 8:       Remove the potentials of Φ_i from Φ_Y
 9:       ψ ← φ_1
10:       for l = 2, ..., L do
11:         ψ ← CONDSCOMBINE(ψ, φ_l, t)
12:         t ← t + 1
13:       end for
14:       ψ ← CONDSMARGINALIZE(ψ, i, t)
15:       t ← t + 1
16:       Add ψ to Φ_Y
17:     end for
18:     Let {φ_1, ..., φ_L} be the potentials remaining in Φ_Y
19:     ψ ← φ_1
20:     for l = 2, ..., L do
21:       ψ ← CONDSCOMBINE(ψ, φ_l, t)
22:       t ← t + 1
23:     end for
24:     Set the time of ψ to −1
25:     Append ψ to LR
26:   end for
27:   for each register (id, type, φ_1, φ_2, φ_r) in R_O, following the order of id, do
28:     Carry out the operation numerically, storing its result as φ_r in R_Φ
29:     DELETEIF(φ_1, id), DELETEIF(φ_2, id)
30:   end for
31:   return LR
32: end function

Algorithm 3 Conditional symbolic combination
 1: function CONDSCOMBINE(φ_1, φ_2, t)
 2:   if a register (id, comb, φ_1, φ_2, φ_r) is in R_O then
 3:     UPDATETIME(φ_1, t), UPDATETIME(φ_2, t), UPDATETIME(φ_r, t)
 4:   else
 5:     φ_r ← SCOMBINE(φ_1, φ_2)
 6:     Add the register (id, comb, φ_1, φ_2, φ_r) to R_O with a new id as identifier
 7:     UPDATETIME(φ_1, t), UPDATETIME(φ_2, t)
 8:     Add φ_r to R_Φ with t as time
 9:   end if
10:   return φ_r
11: end function

Algorithm 4 Conditional symbolic marginalization
 1: function CONDSMARGINALIZE(φ, i, t)
 2:   if a register (id, marg, φ, i, φ_r) is in R_O then
 3:     UPDATETIME(φ, t), UPDATETIME(φ_r, t)
 4:   else
 5:     φ_r ← SMARGINALIZE(φ, i)
 6:     Add the register (id, marg, φ, i, φ_r) to R_O with a new id as identifier
 7:     UPDATETIME(φ, t)
 8:     Add φ_r to R_Φ with t as time
 9:   end if
10:   return φ_r
11: end function

The first marginal to compute is for Y = {X_1}. In this case, the set of relevant potentials is Φ_Y = {φ_1} and there are no operations to carry out. Therefore the first marginal is Ψ_1 = φ_1, which is appended to LR.
The second marginal to be computed is for Y = {X_1, X_2}. In this case, the relevant potentials are Φ_Y = {φ_1, φ_2} and there are no variables to remove, but it is necessary to carry out the symbolic combination of φ_1 and φ_2 in order to compute Ψ_2 (lines 19-23 of Algorithm 2). If we call φ_4 the result, then the repositories after this operation will be as shown in Table 2.
The third marginal to compute is for the set Y = {X_1, X_3}. Now the relevant potentials are Φ_Y = {φ_1, φ_3}. The situation is analogous to the computation of the previous marginal, with the difference that now the symbolic combination to carry out is φ_1 · φ_3. The status of the repositories after k = 3 is shown in Table 3. The third desired marginal is Ψ_3 = φ_5.
Finally, for k = 4 we have to compute the marginal for Y = {X_2, X_3}. The relevant potentials are now Φ_Y = {φ_1, φ_2, φ_3}. Variable X_1 has to be deleted from this set of potentials. As all the potentials contain this variable, it is first necessary to combine all of them, and afterwards to remove X_1 by marginalizing on {X_2, X_3}. Assume that the order of the symbolic operations is: first combine φ_1 and φ_2; then combine the result with φ_3; then marginalize this result by removing X_1. The repositories after k = 4 are as presented in Table 4. The combination of φ_1 and φ_2 was previously carried out for k = 2, and therefore its result can be retrieved without new computations.
After that, the numerical part of the operations in Table 4 is carried out in the same order in which they are described in that table. In this process, after performing the operation with identifier (id) equal to t, the potentials with time equal to t are removed from the R_Φ table. For example, in this case, potentials φ_2 and φ_3 can be removed from R_Φ after the operation with id = 4, and potential φ_6 can be removed after the operation with id = 5, leaving in R_Φ only the potentials containing the desired marginals needed to compute the KL divergence between both networks (potentials with time = −1).
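The planning phase on this example can be mimicked with a few lines of Python: the symbolic combination φ_1 · φ_2 is requested once while processing k = 2 and again for k = 4, and the second request becomes a cache hit. The scope-only representation and the names R_phi, R_o, cond_scombine are ours:

```python
# Potentials are identified by integers; only their scopes are stored,
# exactly what the symbolic phase needs.
R_phi = {1: ("X1",), 2: ("X2", "X1"), 3: ("X3", "X1")}  # pid -> scope
R_o = {}        # ("comb", a, b) -> pid of the result
ops_log = []    # planned operations, in order
next_pid = 4

def cond_scombine(a, b):
    """Symbolic combination with cache: reuse the planned result if present."""
    global next_pid
    key = ("comb", a, b)
    if key in R_o:                 # cache hit: no new operation is planned
        return R_o[key], True
    scope = R_phi[a] + tuple(v for v in R_phi[b] if v not in R_phi[a])
    pid, next_pid = next_pid, next_pid + 1
    R_phi[pid] = scope             # symbolic result: scope only, no numbers
    R_o[key] = pid
    ops_log.append(key)
    return pid, False

p4, hit1 = cond_scombine(1, 2)     # planned while processing k = 2
p4b, hit2 = cond_scombine(1, 2)    # requested again for k = 4: reused
```

Only one combination ends up in the operations log, so the numerical phase multiplies φ_1 and φ_2 exactly once, as in the walkthrough above.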

Experiments
In order to compare the computation approaches presented in the paper, the experimentation uses a set of Bayesian networks available in the bnlearn [19] repository (https://www.bnlearn.com/bnrepository/, accessed on 24 August 2021). This library provides all the functions required for the process described below. Given a certain Bayesian network as defined in the repository:

• A dataset is generated using the rbn function. As explained in the library documentation, this function simulates random samples from a Bayesian network using forward/backward sampling.

• The dataset is used for learning a Bayesian network. For this step, the tabu function is employed with the default settings (a dataset as its unique argument). It is one of the structural learning methods available in bnlearn. Since the learned model could have unoriented links, the cextend function is required, which produces a Bayesian network consistent with the model passed as argument. Any other learning algorithm could have been used, since the goal is to have an alternative Bayesian network that will later be used to calculate the Kullback-Leibler divergence with the methods described in Algorithms 1 and 2.
For each network, the Kullback-Leibler divergence is computed with the procedures presented: using evidence propagation (see Algorithm 1) and using the operations cache (described in Algorithm 2). The main purpose of the experiment is to obtain an estimation of the computation times required by both approaches. The obtained results are included in Table 5. It can be observed that the calculation with the second method always offers shorter runtimes than the first one. The shortest runtimes are presented in bold in the table. It is noteworthy that for three networks the method based on the use of evidence could not be completed because the available memory capacity was exceeded: water, mildew, and barley. Moreover, the computational overhead required to manage the operations and potentials repositories pays off, as it avoids the repetition of a significant number of operations and enables unnecessary potentials to be released, especially in the most complex networks.

Conclusions
Computing the KL divergence between the joint probability distributions associated with two Bayesian networks is an important task, relevant for many problems, for example assessing the accuracy of Bayesian network learning algorithms. However, this function is generally not implemented in software packages for probabilistic graphical models. In this paper, we provide an algorithm that uses local computation to calculate the KL divergence between two Bayesian networks. The algorithm is based on a procedure with two stages. The first one plans the operations, detecting the repeated operations and the times at which potentials are no longer necessary, while the second one carries out the numerical operations, reusing the results of repeated operations instead of recomputing them and deallocating the memory space associated with potentials that are no longer useful. Experiments show that this strategy saves time and space, especially in complex networks.
The functions have been implemented in Python, taking the pgmpy software package as basis, and are available in the GitHub repository https://github.com/mgomez-olmedo/KL-pgmpy, accessed on 24 August 2021. The README file of the project offers details about the implementation and about the methods available for reproducing the experiments.
In the future, we plan to further improve the efficiency of the algorithms. The main line will be to invest more time in the planning stage, looking for deletion orderings that minimize the total number of operations, or optimizing the order of combinations when several potentials have to be multiplied.