Geodesic Monte Carlo on Embedded Manifolds

Markov chain Monte Carlo methods explicitly defined on the manifold of probability distributions have recently been established. These methods are constructed from diffusions across the manifold and the solution of the equations describing geodesic flows in the Hamilton–Jacobi representation. This paper takes the differential geometric basis of Markov chain Monte Carlo further by considering methods to simulate from probability distributions that themselves are defined on a manifold, with common examples being classes of distributions describing directional statistics. Proposal mechanisms are developed based on the geodesic flows over the manifolds of support for the distributions, and illustrative examples are provided for the hypersphere and Stiefel manifold of orthonormal matrices.


Introduction
Markov chain Monte Carlo (MCMC) methods that originated in the physics literature have caused a revolution in statistical methodology over the last 20 years by providing the means, now in an almost routine manner, to perform Bayesian inference over arbitrary non-conjugate prior and posterior pairs of distributions (Gilks et al., 1996).
A specific class of MCMC methods, originally known as hybrid Monte Carlo (HMC), was developed to more efficiently simulate quantum chromodynamic systems (Duane et al., 1987). HMC goes beyond the random walk Metropolis or Gibbs sampling schemes and overcomes many of their shortcomings. In particular, HMC methods are capable of proposing bold long distance moves in the state space that will retain a very high acceptance probability and thus improve the rate of convergence to the invariant measure of the chain and reduce the autocorrelation of samples drawn from the stationary distribution of the chain. The HMC proposal mechanism is based on simulating Hamiltonian dynamics defined by the target distribution (see Neal (2011) for a comprehensive tutorial). For this reason, HMC is now routinely referred to as Hamiltonian Monte Carlo. Despite the relative strengths and attractive properties of HMC, it has largely been bypassed in the literature devoted to MCMC and Bayesian statistical methodology with very few serious applications of the methodology being published.
More recently, Girolami & Calderhead (2011) defined a Hamiltonian scheme that is able to incorporate geometric structure in the form a Riemannian metric. The Riemannian manifold Hamiltonian Monte Carlo (RMHMC) methodology makes proposals implicitly via Hamiltonian dynamics on the manifold defined by the Fisher-Rao metric tensor and the corresponding Levi-Civita connection. The paper has raised an awareness of the differential geometric foundations of MCMC schemes such as HMC and has already seen a number of methodological and algorithmic developments as well as some impressive and challenging applications exploiting these geometric MCMC methods (Konukoglu et al., 2011;Martin et al., 2012;Raue et al., 2012;Vanlier et al., 2012).
In contrast to Girolami & Calderhead (2011), in this particular paper, we show how Hamiltonian Monte Carlo methods may be designed for and applied to distributions defined on manifolds embedded in Euclidean space, by exploiting the existence of explicit forms for geodesics. This can provide a significant boost in speed, by avoiding the need to solve large linear systems as well as complications arising because of the lack of a single global coordinate system.
By way of specific illustration, we consider two such manifolds: the unit hypersphere, corresponding to the set of unit vectors in R d , and its extension to Stiefel manifolds, the set of p-tuples of orthogonal unit vectors in R d . Such manifolds occur in many statistical applications: distributions on circles and spheres, such as the von Mises distribution, are common in problems dealing with directional data (Mardia and Jupp, 2000). Orthonormal bases arise in dimension reduction methods such as factor analysis (Jolliffe, 1986) and can be used to construct distributions on matrices via eigendecompositions.
The problem of sampling from such distributions has not received much attention. Most methods in wide use, such as those used in directional statistics for sampling from spheres, have been developed for the specific problem at hand, often based on rejection sampling techniques tuned to a specific family. For the various multivariate extensions of these distributions, these techniques are usually embedded in a Gibbs sampling scheme.
There are relatively few works on the general problem of sampling from manifolds. The recent paper by Diaconis et al. (2013) provides a readable introduction to the concepts of geometric measure theory, and practical issues when sampling from manifolds, with the motivation of computing certain sampling distributions for hypothesis testing. Brubaker et al. (2012), somewhat similar to our approach, develop an HMC algorithm using the iterative algorithm for approximating the Hamiltonian paths.
In the next section, we provide a brief overview of the necessary concepts from differential geometry and geometric measure theory, such as geodesics and Hausdorff measures. In section 3, we construct a Hamiltonian integrator that utilizes the explicit form of the geodesics and incorporate this into a general HMC algorithm. Section 4 gives examples of various manifolds for which the geodesic equations are known, and section 5 provides some illustrative applications.

Manifolds and embeddings
In this section, we introduce the necessary terminology from differential geometry and information geometry. A more rigorous treatment can be found in reference books such as do Carmo (1976do Carmo ( , 1992 and Amari & Nagaoka (2000).
An m-dimensional manifold M is a set that locally acts like R m : that is, for each point x 2 M, there is a bijective mapping q, called a coordinate system, from an open set around x to an open set in R m . Our particular focus is on manifolds that are embedded in some higherdimensional Euclidean space R n , (i.e. they are submanifolds of R n ). Note that R d is itself a d -dimensional manifold, which we refer to as the Euclidean manifold.
Example 2.1. A simple example of an embedded manifold is the hypersphere or .d 1/-sphere: This is a .d 1/-dimensional manifold, as there exists an angular coordinate system 2 .0; 2 / .0; / d 2 where x 1 D sin 1 : : : sin n 2 sin n 1 ; x 2 D sin 1 : : : sin n 2 cos n 1 ; x 3 D sin 1 : : : cos n 2 ; : : : x n 1 D sin 1 cos 2 ; x n D cos 1 : Note that this coordinate system excludes some points of S d 1 : such as ı d D .0; : : : ; 0; 1/. As a result, it is not a global coordinate system (in fact, no global coordinate system for S d 1 exists); nevertheless, it is possible to cover all of S d 1 by utilizing multiple coordinate systems known as an atlas.
A tangent at a point x 2 M is a vector v that lies 'flat' on the manifold. More precisely, it can be defined as an equivalence class of the set of functions ¹ W OEa; b ! M W .t 0 / D xº that have the same 'time derivative' d dt q. .t//j t Dt0 in some coordinate system q. For an embedded manifold, however, a tangent can be represented simply as a vector v 2 R n such that The tangent space is the set T x of such vectors and form a subspace of R n : this is equal to the span of the set of partial derivatives @x i =@q j of some coordinate system q.
Example 2.2. A function on the sphere W OEa; b ! S d 1 must satisfy the constraint P i OE i .t / 2 D 1. By taking the time derivative of both sides, we find that Therefore, the tangent space at x 2 S d 1 is the .d 1/-dimensional subspace of vectors orthogonal to x: A Riemannian manifold incorporates a notion of distance, such that for a point q 2 M, there exists a positive-definite matrix G, called the metric tensor, that forms an inner product between tangents u and v hu; vi G D u > G.q/v: Information geometry is the application of differential geometry to families of probability distributions. Such a family ¹p. j Â/ W Â 2 ‚º can be viewed as a Riemannian manifold, using the Fisher-Rao metric tensor @ 2 @Â i @Â j log p.X j Â/ where ı i is the ith coordinate vector, is parametrized by the unit .d 1/-simplex, This is a .d 1/-dimensional manifold embedded in R d and can be parametrized in .d 1/ dimensions by dropping the last element of Â , the set of which we will denote by d 1 . d / . The Fisher-Rao metric tensor in d 1 . d / is then easily shown to be A smooth mapping from a Riemannian manifold to R n is an isometric embedding if the Riemannian inner product is equivalent to the usual Euclidean inner product. That is, or equivalently, The existence of such embeddings is determined by the celebrated Nash (1956) embedding theorem; however, it does not give any guide as how to construct them. Nevertheless, there are some such embeddings we can identify.
Example 2.4. There is a bijective mapping from the simplex d 1 to the positive orthant of the sphere S d 1 by taking the element-wise square root x i D p Â i (Fig. 1). If we consider it as a mapping from d 1 . d / , then the partial derivatives are of the form Note that by (1), this is an isometric embedding (up to proportionality) of the Fisher-Rao metric from example 2.3.

Geodesics
The affine connection of a manifold determines the relationship between tangent spaces of different points on a manifold: interestingly, this depends on the path W OEa; b ! M used to connect the two points, and for a vector field v.t/ 2 T .t/ along the path, we can measure the change by the covariant derivative.
Of course, the time derivative P .t/ D d .t/ dt is itself such a vector field: when this follows the affine connection, the covariant derivative is 0; in which case, is known as a geodesic.
This property can be expressed by the geodesic equation where i jk .x/ are known as the connection coefficients or Christoffel symbols. A Riemannian manifold induces a natural affine connection known as the Levi-Civita connection.
In the Euclidean manifold R n , the Christoffel symbols i jk are zero, and so the geodesic (2) reduces to R .t/ D 0. Hence, the geodesics are the set of straight lines .t/ D at C b.
In a Riemannian manifold, the geodesics are the locally extremal paths (maxima or minima in terms of calculus of variations) of the integrated path length Moreover, the geodesics have constant speed, in that k P .t/k G is constant over t. As the geodesics can be determined by the metric, they are consequently preserved under any metric-preserving transformation, such as an isometric embedding.
Example 2.5. A standard result in differential geometry is that the geodesics of the n-sphere are rotations about the origin, known as great circles (Fig. 2): where x.0/ 2 S n is the initial position, v.0/ is the initial velocity in the tangent space (i.e. such that x.0/ > v.0/ D 0) and˛D kv.0/k is the constant angular velocity.

Fig. 2. A geodesic (
) and great circle ( ) on the sphere S 2 and its path in the spherical polar coordinate system x D .sin 1 sin 2 ; sin 1 cos 2 ; cos 1 /. The ellipses correspond to equi-length tangents from each marked point.
For any geodesic W OEa; b ! M, the geodesic flow describes the path of the geodesic and its tangent . .t /; P .t//. Moreover, it is unique to the initial conditions .x; v/ D . .a/; P .a//, so we can describe any geodesic flow from its starting position x and velocity v: this is also known as the exponential map. If all such pairs .x; v/ describe geodesics, then the manifold is said to be geodesically complete, which is true of the manifolds we consider in this paper.

The Hausdorff measure and distributions on manifolds
As our motivation is to sample from distributions defined on manifolds, we introduce some basic concepts of geometric measure theory that will be useful for this purpose. Geometric measure theory is a large and active topic and is covered in detail in references such as Federer (1969) and Morgan (2009). However, for a more accessible overview with a statistical flavour, we suggest the recent introduction given by Diaconis et al. (2013).
Our key requirement is a reference measure from which we can specify probability density functions, similar to the role played by the Lebesgue measure for distributions on Euclidean space. For this, we use the Hausdorff measure, one of the fundamental concepts in geometric measure theory. This can be defined rigorously in terms of a limit of coverings of the manifold (see the aforementioned references); however, for a manifold embedded in R n , it can be heuristically interpreted as the surface area of the manifold.
The relationship between H m , the m-dimensional Hausdorff measure and m , the Lebesgue measure on R m , is given by the area formula (Federer, 1969, theorem 3.2

.5). If we parametrize the manifold by a Libschitz function
Here, J m f .x/ is the m-dimensional Jacobian of f : this can be defined as a norm on the matrix of partial derivatives Df .x/ (Federer, 1969, section 3.2.1), and if rank Df .x/ D m, then OEJ m f .x/ 2 is equal to the sum of squares of the determinants of all m m submatrices of Df .x/.
Example 2.6. The square root mapping in example 2.4 from d 1 . d / to S d 1 has .d 1/-dimensional Jacobian The Dirichlet distribution is a distribution on the simplex, with density Â˛i 1 i with respect to the Lebesgue measure on d 1 . d / . Therefore, the corresponding density with respect to the Hausdorff measure on S d 1 is In other words, the uniform distribution on the sphere arises when˛i D 1=2, whereas˛D 1 gives the uniform distribution on the simplex (Fig. 3). The area formula allows the Hausdorff measure to be easily extended to Riemannian manifolds (Federer, 1969 This construction would be familiar to Bayesian statisticians as the Jeffreys prior, in the case where G is the Fisher-Rao metric.
When working with probability distributions on manifolds, the Hausdorff measure forms the natural reference measure and allows for reparametrization without needing to compute any additional Jacobian term. We use H to denote the density with respect to the Hausdorff measure of the distribution of interest.
Example 2.7. The von Mises distribution is a common family of distributions defined on the unit circle (Mardia and Jupp, 2000, section 3.5.4). When parametrized by an angle Â , the density with respect to the Lebesgue measure on OE0; 2 / is The embedding transformation x D .sin Â; cos Â/ has unit Jacobian, so the density with respect to the one-dimensional Hausdorff measure is where c D .Ä sin ; Ä cos / and I k is the modified Bessel function of the first kind. In other words, it is a natural exponential family on the circle. The von Mises-Fisher distribution is the natural extension to higher-order spheres (Mardia and Jupp, 2000, section 9.3.2) with density Attempting to write this as a density with respect to the Lebesgue measure on some parametrization of the surface, such as angular coordinates, would be much more involved, as the Jacobian is no longer constant.

Hamiltonian Monte Carlo on embedded manifolds
RMHMC is an MCMC scheme whereby new samples are proposed by approximately solving a system of differential equations describing the paths of Hamiltonian dynamics on the manifold (Girolami and Calderhead, 2011).
The key requirement for Hamiltonian Monte Carlo is the symplectic integrator. This is a discretization that approximates the Hamiltonian flows yet maintains certain desirable properties of the exact solution, namely time-reversibility and volume preservation that are necessary to maintain the detailed balance conditions. The standard approach is to use a leapfrog scheme, which alternately updates the position and momentum via first order Euler updates (Neal, 2011).
Given a target density .q/ (with respect to the Lebesgue measure) in some coordinate system q, RMHMC, Girolami & Calderhead (2011) utilize a Hamiltonian of the form where G is the metric tensor. This is the negative log of the joint density (with respect to the Lebesgue measure) for .q; p/, where the conditional distribution for the auxiliary momentum variable p is N.0; G.q//. The first two terms can be combined into the negative log of the target density with respect to the Hausdorff measure of the manifold By Hamilton's equations, the dynamics are determined by the system of differential equations As this Hamiltonian is not separable (i.e. it cannot be written as the sum of a function of q and a function of p), we are unable to apply the standard leapfrog integrator.

Geodesic integrator
Girolami & Calderhead (2011) develop a generalized leapfrog scheme, which involves composing adjoint Euler approximations to (4) and (5) in a reversible manner. Unfortunately, some of these steps do not have an explicit form and so need to be solved implicitly by fixed-point iterations. Furthermore, these updates require computation of both the inverse and derivatives of the metric tensor, which are O.m 3 / operations; this limits the feasibility of numerically naive implementations of this scheme for higher-dimensional problems. Finally, such a scheme assumes a global coordinate system, which may cause problems for manifolds for which none exist, such as the sphere, where artificial boundaries may be induced.
In this contribution, we instead construct an integrator by splitting the Hamiltonian (Hairer et al., 2006, section II.5): that is, we treat each term in (3) as a distinct Hamiltonian and alternate simulating between the exact solutions.
Splitting methods have been used in other contexts to develop alternative integrators for Hamiltonian Monte Carlo (Neal, 2011, section 5.5.1) such as extending HMC to infinitedimensional Hilbert spaces (Beskos et al., 2011) and defining schemes that may reduce computational cost (Shahbaba et al., 2011).
We take the first component of the splitting to be the 'potential' term H OE1 .q; p/ D log H .q/: Hamilton's equations give the dynamics P q D @H OE1 @p D 0 and P p D @H OE1 @q D r q log H .q/: Starting at .q.0/; p.0//, this has the exact solution In other words, this is just a linear update to the momentum p. The second component is the 'kinetic' term This is simply a Hamiltonian absent of any potential term, and the solution of Hamilton's equations can be easily shown to be a geodesic flow under the Levi-Civita connection of G (Abraham and Marsden, 1978, theorem 3.7.1) or to be more precise, a co-geodesic flow .q.t /; p.t //, where p.t/ D G.q.t// P q.t/. Thus, if we are able to exactly compute the geodesic flow, then we can construct an integrator by alternately simulating from the dynamics of H OE1 and H OE2 for some time step . Each iteration of the integrator consists of the following steps, starting at position .q; p/ in the phase space: (i) Update according to the solution to H OE1 in (6), for a period of =2 by setting (ii) Update according to H OE2 , by following the geodesic flow starting at .q; p/, for a period of . (iii) Update again according to H OE1 for a period of =2 by (8).
As H OE1 and H OE2 are themselves Hamiltonian systems, their solutions are necessarily both reversible and symplectic. As the integrator is constructed by their symmetric composition, it will also be reversible and symplectic.
Therefore, the overall transition kernel for our Hamiltonian Monte Carlo scheme from an initial position q 0 is as follows: (i) Propose an initial momentum p 0 from N .0; G.q 0 //. As with the RMHMC algorithm, the metric G need only be known up to proportionality: scaling is equivalent to changing the time step . Scand J Statist 40

Embedding coordinates
The algorithm can also be written in terms of an embedding, which avoids altogether the computation of the metric tensor and the possible lack of a global coordinate system.
Given an isometric embedding W M ! R n , then the path x.t/ D .q.t//, such that Therefore, we can transform the phase space .q; p/, where P q D G 1 p, to the embedded phase (1). By substitution, the Hamiltonian (3) can be written in terms of these coordinates as Note that the target density H is still defined with respect to the Hausdorff measure of the manifold, and so no additional log-Jacobian term is introduced. We can rewrite the solution to H OE1 in (6)  Although it is possible to compute this projection using standard least squares algorithms, it can be computationally expensive and prone to numerical instability at the boundaries of the coordinate system (e.g. at the poles of a sphere). However, for all the manifolds that we consider there exists an explicit form for an orthonormal basis N of the normal to the tangent space, in which case we can simply subtract the projection onto the normal: Finally, we require a method for sampling the initial velocity v 0 . Because p 0 N.0; G.q//, it follows that We do not need to compute a Cholesky decomposition here: because .I NN > / is a projection, it is idempotent, so we can draw´from N.0; I n / and project v 0 D .I NN > /´to obtain the necessary sample. The resulting procedure is presented in Algorithm 1. In order to implement it for an embedded manifold M Â R n , we need to be able to evaluate the following at each x 2 M: (i) the log-density with respect to the Hausdorff measure log H , and its gradients; (ii) an orthogonal projection from R n to the tangent space of x 2 M; (iii) the geodesic flow from any v 2 T x M.
Note that by working entirely in the embedded space, we completely avoid the coordinate system q and the related problems where no single global coordinate system exists. The Riemannian metric G only appears in the Jacobian determinant term of the density: in certain examples, this can also be removed, for example by specifying the prior distribution as uniform with respect to the Hausdorff measure, as is performed in section 5.3

Embedded manifolds with explicit geodesics
In this section, we provide examples of embedded manifolds for which the explicit forms for the geodesic flow are known and derive the bases for the normal to the tangent space.

Affine subspaces
If the embedded manifold is flat, for example an affine subspace of R n , then the geodesic flows are the straight lines In the case of the Euclidean manifold R n , then the normal space to the tangent is null, and no projections are required. Hence, the algorithm reduces to the standard leapfrog scheme of HMC.
In standard HMC, it is common to utilize a 'mass' or 'preconditioning' positive-definite matrix M , in order to reduce the correlation between samples, especially where variables are highly correlated or have different scales of variation. This is directly equivalent to using the RMHMC algorithm with constant a Riemannian metric, or our geodesic procedure on the embedding of x D L > q, where L is a matrix square root such that LL > D M (such as the Cholesky factor).

Spheres
Recall from earlier examples that the unit .d 1/-sphere S d 1 is an .d 1/-dimensional manifold embedded in R d , characterized by the constraint Distributions on spheres, particularly S 1 and S 2 , arise in many problems in directional statistics (Mardia and Jupp, 2000): examples include the von Mises-Fisher distribution (example 2.7) and the Bingham-von Mises-Fisher (BVMF) distribution (section 5.1). For many of these distributions, the normalization constants of the density functions are often computationally intensive to evaluate, which makes Monte Carlo methods particularly attractive.
As mentioned in example 2.5, the geodesics of the sphere are the great circle rotations about the origin. The geodesic flows are then where˛D kv.t/k is the (constant) angular velocity. The normal to the tangent space at x is x itself, so .I xx > /u is an orthogonal projection of an arbitrary u 2 R d onto the tangent space.
Other than the evaluation of the log-density and its gradient, the computations only involve vector-vector operations of addition and multiplication, so the algorithm scales linearly in d .

Stiefel manifolds
A Stiefel manifold V d;p is the set of d p matrices X such that X > X D I: In other words, it is the set of matrices with orthonormal column vectors, or equivalently, the set of p-tuples of orthogonal points in S d 1 , and is a OEdp 1 2 p.p C 1/-dimensional manifold, embedded in R d p . In the special case where d D p, the Stiefel manifold is the orthogonal group O d : the set of d d orthogonal matrices.
These arise in the statistical problems related to dimension reduction such as factor analysis and principal component analysis, where the aim is to find a low dimensional subspace that represents the data. They can also arise in contexts where the aim is to identify orientations, such as projections in shape analysis, or the eigendecomposition of covariance matrices.
Previously suggested methods of sampling from distributions on Stiefel manifolds, such as Hoff (2009) and Dobigeon & Tourneret (2010), have relied on column-wise Gibbs updates. Such an approach is limited to cases where the conditional distribution of the column has a conjugate form and requires the computation of an orthonormal basis for the null space of X, requiring O.d 3 / operations. Again, we can find the constraints on the phase space by the time derivative of the constraint for an arbitrary curve That is, the tangent space at X is the set°V If we let Q x denote the matrix X written as a vector in R dp by stacking the columns x 1 ; : : : ; x p , then an orthonormal basis N for the normal to the tangent space has p vectors of the form 2 6 6 6 6 4 x 1 0 : : : For an arbitrary vector Q u 2 R dp , the projection onto the tangent space is then u 2 x 2 x > 2 u 2 1 2 x 1 x > 2 u 1 C x > 1 u 2 : : : : : : This can be more easily written in matrix form: for an arbitrary U 2 R d p , the orthogonal projection onto the Stiefel manifold is The geodesic flows are more complicated than the spherical case. For p > 1, they are no longer simple rotations but can be expressed in terms of matrix exponentials (Edelman et al., 1999, page 310) where A D X.t/ > V .t/ is a skew-symmetric matrix that is constant over the geodesic, and Although matrix exponentials can be quite computationally expensive, we note that the largest exponential of these is of a 2p 2p matrix, which requires O.p 3 / operations. Other than this and the evaluations of the log-density and its gradients, all the other operations are simple matrix additions and multiplications, the largest of which can be performed in O.dp 2 / operations; hence, the algorithm scales linearly with d .
For the orthogonal group O d , the geodesics have the simpler form (Edelman et al., 1999, equation 2.14) As A is skew-symmetric, Rodrigues' formula gives an explicit form of exp¹tAº when d D 3 in terms of simple trigonometric functions, and this can be extended into higher dimensions (Gallier and Xu, 2002;Cardoso and Leite, 2010).

Product manifolds
Given two manifolds M 1 and M 2 , their Cartesian product is also a manifold. Product manifolds arise naturally in many statistical problems; for example, extensions of the von Mises distributions to S 1 S 1 (a torus) have been used to model molecular angles (Singh et al., 2002), and the network eigenmodel in section 5.3 has a posterior distribution on V m;p R p R.
The geodesics of a product manifold are of the form . 1 ; 2 /, where each i is a geodesic of M i . Likewise, the tangent vectors are of the form .v 1 ; v 2 /, where each v i is a tangent to M i . Consequently, for an arbitrary vector .u 1 ; u 2 /, the orthogonal projection onto the tangent space is I N 1 N > 1 u 1 ; I N 2 N > 2 u 2 , where N i is an orthonormal basis of M i . As a result, when implementing our geodesic Monte Carlo scheme on a product manifold, the key operations (addition of gradient, projection and geodesic update) can be essentially performed in parallel, the only operations requiring knowledge of the other variables being the computation of the log-density and its gradient. Moreover, when tuning the algorithm, it is possible to choose different values for each constituent manifold, which can be helpful when variables have different scales of variation.

Bingham-von Mises-Fisher distribution
The BVMF distribution is the exponential family on S d 1 with linear and quadratic terms, with density of the form where c is a vector of length d , and A is a d d symmetric matrix (Mardia and Jupp, 2000, section 9.3.3). The Bingham distribution arises as the special case where c D 0: this is an axially bimodal distribution, with the modes corresponding to the eigenvector of the largest eigenvalue. The BVMF distribution may or may not be bimodal, depending on the parameter values.
Hoff (2009) develops a Gibbs-style method for sampling from BVMF distribution by first transforming y D E > x, where E > ƒE is the eigendecomposition of A. Each element y i of y is updated in random order, conditional u 2 S d 2 , where u j D y j = q 1 y 2 i for j ¤ i . The y i j u are sampled using a rejection sampling scheme with a beta envelope; however, as noted by Brubaker et al. (2012), this can give exponentially poor acceptance probabilities (of the order of 10 100 ) for certain parameter values, particularly when c is large in the direction of the negative eigenspectra.
Implementing our geodesic sampling scheme for the BVMF distribution is straightforward, as the gradient of the log-density is simply c C 2Ax, and extremely fast to run, with run times that are independent of the parameter values. However, as with any gradient-based method, it has difficulty switching between multiple modes (Fig. 4). A common method of alleviating this problem is to utilize tempering schemes (Neal, 2011, section 5.5.7): these operate by sampling from a class of 'higher temperature' distributions with densities of the form where 0 Ä Ä 1: Note that this constitutes a simple linear scaling of the log-density and so can be easily incorporated into our method. Parallel tempering (Geyer, 1991;Liu, 2008, section 10.4) utilizes multiple chains, each targeting a density with a different temperature. The scheme operates by alternately updating the individual chains, which can be performed in parallel, and randomly switching the values of neighbouring chains with a Metropolis-Hastings correction to maintain detailed balance. The results of utilizing such a scheme are shown in Fig. 5

Non-conjugate simplex models
We can use the transformation to the sphere to sample from distributions on the simplex d 1 .
These arise in many contexts, particularly as prior and posterior distributions for discretevalued random variables such as the multinomial distribution. If each observation x from the multinomial is completely observed, then the contribution to the likelihood is then Â x , giving a full likelihood of at most d terms of form which is conjugate to a Dirichlet prior distribution.
Complications arise if observations are only partially observed. For example, we may have marginal observations, which are only observed to a set S , in which case the likelihood term is P s2S Â s , or conditional observations, where the sampling was constrained to occur within a set T , with likelihood term Â x =. P t 2T Â t /. These terms destroy the conjugacy and make computation very difficult.
Such models arise under a case-cohort design (Le Polain deWaroux et al., 2012), the risk factors of a particular disease: for the case sample, the risk factors are observed conditional on the person having the disease, and for the cohort sample, the risk factors are observed marginally (as disease status is unknown). Overall population statistics may provide some further information as to the marginal probability of the disease.
The hyperdirichlet R package (Hankin, 2010) provides an interface and examples for dealing with this type of data. We consider the volleyball data from this package: the data arise from a sports league for nine players, where each match consists of two disjoint teams of players, one of which is the winner. The probability of a team T 1 beating T 2 is assumed to be where p D .p 1 ; : : : ; p 9 / 2 8 . We compare three different methods in sampling from the posterior distribution for p under a Dirichlet .˛1/ prior, for different values of˛. The results are presented in Table 1. The first is a simple random-walk Metropolis-Hastings algorithm. To ensure that the planar constraint P 9 iD1 p i D 1 is satisfied, the proposals are made from a degenerate N.x; 2 OEI nn > /, where n D d 1=2 1 is the normal to the simplex.
The second is a random walk on the sphere, based on the square root transformation to the sphere from example 2.4, using proposals of the form Although this only defines the distribution on the positive orthant, we can extend this distribution to the entire sphere by reflecting about the axes (because we only require knowledge of the density up to proportionality, we can ignore the fact that it is now 2 d times larger). One benefit of this transformation is that the surface is now smooth and without boundaries, so proposals outside the positive orthant can be accepted. The third is the geodesic Monte Carlo algorithm on the simplex. We can ensure that the planar constraint is satisfied via the affine constraint methods in 4.1; however, we need to further ensure that the integration paths satisfy the positivity constraints, which can be achieved by reflecting the path whenever it violates the constraint (see the Appendix for further details).
The fourth is our proposed geodesic scheme based on the spherical transformation. As the integrator does not pass any boundaries, no reflections are required. For small values of˛, both geodesic Monte Carlo samplers perform poorly, due to the concentration of the density at the boundaries. These peaks cause particular problems for the Hamiltonian-type algorithms, as the discontinuous gradients mean that the integration paths give poor approximations to the true Hamiltonian paths, resulting in poor acceptance probabilities. Moreover, for the algorithm on the simplex, the frequent reflections add to the computational cost.
However, when˛D 0:5, the spherical geodesic sampler improves markedly: recall from example 2.6 and Fig. 3 that the Dirichlet .0:5/ prior is uniform on the sphere, giving continuous gradients. On the simplex, however, the density remains peaked at the boundaries. The simplex sampler improves considerably for values of˛ 1 (where the gradient is now flat or negative); however the spherical algorithm still retains a slight edge. Interestingly, the spherical random walk sampler performs poorly in all of the examples.

Eigenmodel for network data
We use the network eigenmodel of Hoff (2009) to demonstrate how Stiefel manifold models can be used for dimension reduction and how our geodesic sampling scheme may be used for Stiefel and product manifolds. This is a model for a graph on a set of m nodes, where for each unordered pair of nodes ¹i; j º, there is a binary observation Y ¹i;j º indicating the existence of an edge between i and j .
The specific example of Hoff (2009) is a protein interaction network, where for m D 270 proteins, the existence of the edge indicates whether or not the pair of proteins interact.
The model represents the network by assuming a low (p D 3) dimensional representation for the probability of an edge P .Y ¹i;j º D 1/ Dˆ OEUƒU > ij C c Á ; whereˆW R ! .0; 1/ is the probit link function, U is an orthonormal m p matrix and ƒ is a p p diagonal matrix. U is assumed to have a uniform prior distribution on V m;p (with respect to the Hausdorff measure), the diagonal elements of ƒ have a N.0; m/ distribution and c N.0; 10 2 /. Hoff (2009) uses column-wise Gibbs updates for sampling U , exploiting the fact that the probit link provides an augmentation that allows these to be sampled as a BVMF distribution. However, as mentioned in section 4.3, this requires computing the full null space of U at each iteration. ) converges to a local mode, with approximately 10 36 of the density. By incorporating this into a parallel tempering scheme ( ), the sampler rapidly finds the higher mode and is able to switch between the various permutations.
We implement geodesic Monte Carlo on the product manifold, details of which are given in the Appendix. Trace plots from two chains of the diagonal elements of ƒ appear in Fig. 6: note that one chain appears to get stuck in a local mode, while the other converges to the same as the method of Hoff (2009).
By incorporating this approach into a parallel tempering scheme, the model is able to find the larger mode with greater reliability. Moreover, unlike the algorithm of Hoff (2009), it is capable of switching between the permutations of ƒ, which would further suggest that this is indeed the global mode.

Conclusion and discussion
We have presented a scheme for sampling from distributions defined on manifolds embedded in Euclidean space by exploiting their known geodesic structure. This method has been illustrated using applications from directional statistics, discrete data analysis and network analysis. This method does not require any conjugacy, allowing greater flexibility in the choice of models: for instance, it would be straightforward to change the probit link in section 5.3 to a logit. Moreover, when used in conjunction with a tempering scheme, it is capable of efficiently exploring complicated multimodal distributions.
Our approach could be widely applicable to problems in directional statistics, such as the estimation of normalization constants that are often otherwise numerically intractable. The method of transforming the simplex to the sphere could be useful for applications dealing with high-dimensional discrete data, such as statistical genetics and language modelling. Stiefel manifolds arise naturally in dimension reduction problems, and our methods could be particularly useful where the data are not normally distributed, for instance the analysis of survey data with discrete responses, such as Likert scale data. Furthermore, this method could be utilized in statistical shape and image analysis for determining the orientation of objects in projected images.
The major constraint of this technique is the requirement of an explicit form for the geodesic flows that can be easily evaluated numerically. These are not often available; for instance, the geodesic paths of ellipsoids require often computationally intensive elliptic integrals.
Of the examples we consider, the geodesics of the Stiefel manifold case are the most demanding, due to the matrix exponential terms. An alternative approach would be to utilize a Metropolis-within-Gibbs style scheme over subsets of columns, for example by updating a pair of columns such that they remain orthogonal to the remaining columns.
However, once the geodesics and the orthogonal tangent projection of the manifold are known, the remaining process of computing the derivatives is straightforward, and could be easily implemented using automatic differentiation tools, as is used in the Stan MCMC library (Stan Development Team, 2012), currently under development.
space. He considers boundaries that are orthogonal to the i th axis, with normals of the form ı i . Whenever such a constraint is violated, the position and velocity are replaced by where b i is the boundary (either upper or lower). As no other coordinates are involved in this reflection, this can be performed in parallel for all constrained coordinates. However, for the simplex d 1 , the normals are not of the form ı i , as this would result in the path being reflected off the plane ¹x W P i x i D 1º. Instead, we need to reflect about the projection of ı i onto the plane, that is, A procedure for performing the position updates is given in Algorithm 2.
Unfortunately, this procedure cannot be applied to the RMHMC integrator proposed by Girolami & Calderhead (2011), as the implicit steps involved make it difficult to calculate the reflections.