Parallel Metropolis-Hastings

This post examines a particular method for parallelizing the Metropolis-Hastings algorithm.

An artistic depiction, generated by Google Gemini using the prompt "generate an artistic depiction of a probability distribution".

In this post, I will discuss a generalization of the Metropolis-Hastings algorithm which is amenable to parallelization on multi-core CPUs or GPUs, etc.

Introduction

In computing, it is often necessary to draw random samples from some probability distribution. For instance, Metropolis light transport is a path-tracing rendering algorithm which samples light paths in a virtual scene according to their distribution of brightness in the rendered image.

As a simple example, suppose that a group of people are surveyed regarding their favorite flavor of ice cream, as in Figure 1.

Figure 1: The results of a simple survey represented as a histogram.

We can then compute a probability distribution (as a "probability mass function") by normalizing the histogram from Figure 1: divide each tally by the total number of people surveyed (10), as in Figure 2.

Figure 2: A probability distribution (represented using a "probability mass function").

Suppose that we wish to draw random samples from this distribution. One method of doing so involves computing the inverse of the cumulative distribution. The cumulative distribution is depicted in Figure 3. If we impose an arbitrary total ordering on the flavors (e.g. chocolate is first, vanilla is second, strawberry is third, as depicted from left to right in the figure), then the cumulative distribution gives, for each flavor, the probability of drawing a flavor less than or equal to it.

Figure 3: The cumulative distribution.

We can then compute the inverse of the cumulative distribution, which accepts a number $u$ in the range $[0,1]$ and outputs a flavor:

$$F^{-1}(u) = \begin{cases} \text{chocolate} & \text{if } 0 \le u \le 0.5, \\ \text{vanilla} & \text{if } 0.5 < u \le 0.8, \\ \text{strawberry} & \text{if } 0.8 < u \le 1.0. \end{cases}$$

If we generate a random number $u \in [0,1]$ (for instance, using a pseudo-random number generator on a computer), then we can sample a flavor using the inverse cumulative distribution function:

$$F^{-1}(u).$$

If the generated random numbers are uniformly distributed, then we will sample flavors proportionally, as is depicted in Figure 4.

Figure 4: The black dots represent uniformly distributed random numbers. The proportion of dots falling within a given range (depicted by the colored rectangles) is proportional to the respective probabilities (the width of the rectangle). For instance, 5 dots fall within the green (50%), 3 within the orange (30%), 2 within the red (20%), etc.
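
To make this concrete, here is a small Python sketch (my own illustration, not from the post) of inverse-CDF sampling for this distribution; the probabilities 0.5, 0.3, and 0.2 are the ones from Figure 2, and the function and variable names are hypothetical.

```python
import numpy as np

# Inverse-CDF sampling for the ice-cream distribution:
# chocolate 0.5, vanilla 0.3, strawberry 0.2.
flavors = ["chocolate", "vanilla", "strawberry"]
pmf = np.array([0.5, 0.3, 0.2])
cdf = np.cumsum(pmf)                                  # [0.5, 0.8, 1.0]

def sample_flavor(rng):
    u = rng.uniform()                                 # u ~ U[0, 1)
    return flavors[int(np.searchsorted(cdf, u))]      # F^{-1}(u)

rng = np.random.default_rng(0)
draws = [sample_flavor(rng) for _ in range(10_000)]
print({f: draws.count(f) / len(draws) for f in flavors})  # ~ 0.5 / 0.3 / 0.2
```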

There are other methods of sampling distributions, such as rejection sampling. In the case of discrete (finite) random variables, sampling a distribution is relatively straightforward. However, in general, sampling from a distribution is a difficult problem.

The Metropolis-Hastings algorithm is an ingenious technique for sampling a wide class of distributions: it constructs a Markov chain one step at a time such that the stationary distribution of the chain equals the target distribution, and it requires only the ability to evaluate a function proportional to the target distribution. At each step, the algorithm maintains the most recent sample, proposes a candidate sample, and decides whether to accept or reject the proposal based on a simple criterion.

However, as the term chain implies, the original algorithm can only generate samples sequentially, which is a serious limitation when implementing the algorithm on parallel hardware such as a GPU. The typical workaround is to run multiple independent chains; however, each chain is then "shorter" (it consists of fewer steps, affording less opportunity to converge to the stationary distribution), and each chain carries its own overhead, such as storage, so this approach has serious drawbacks.

However, there are generalizations of the Metropolis-Hastings algorithm which are amenable to parallelization without these drawbacks. This post will discuss one such generalization, which generates multiple independent proposals at each step (hence permitting the proposals to be generated in parallel) and then utilizes all of the proposals (not just the single accepted proposal, if any) by weighting them in a manner which preserves the stationary distribution and even reduces variance.

Papers

The algorithm which will be discussed in this post is described in the paper "Rao-Blackwellised parallel MCMC" [1] by authors Tobias Schwedes and Ben Calderhead (referred to as "RB-MP-MCMC" there). The earlier paper "A general construction for parallelizing Metropolis−Hastings algorithms" [2] by author Ben Calderhead also describes aspects of this algorithm. The algorithm is based on earlier work described in the paper "Using All Metropolis-Hastings Proposals to Estimate Mean Values" [3] by author Håkon Tjelmeland.

The original algorithm is described in the paper "Equation of State Calculations by Fast Computing Machines" [4] by Nicholas Metropolis and coauthors, and was extended in the paper "Monte Carlo Sampling Methods Using Markov Chains and Their Applications" [5] by W.K. Hastings (hence the name "Metropolis-Hastings").

See the references at the end of the post.

Before proceeding, a brief overview of probability theory is needed in order to establish the basic terminology and concepts.

Measures

Our goal is to assign a probability to certain sets (where we think of the sets as particular events). In order to do so, we must establish which sorts of sets are measurable.

A collection Σ of subsets of a set X is called a σ-algebra if it contains X and is closed under complements and countable unions, i.e.:

  • $X \in \Sigma$,
  • $X \setminus A \in \Sigma$ for all $A \in \Sigma$,
  • $\bigcup \mathcal{F} \in \Sigma$ for any countable family of sets $\mathcal{F} \subseteq \Sigma$.

The elements $A \in \Sigma$ are called measurable sets and the pair $(X, \Sigma)$ is called a measurable space.

Given two measurable spaces $(X, \Sigma)$ and $(Y, T)$, a function $f: X \to Y$ is called a measurable function if $f^{-1}(E) \in \Sigma$ for every $E \in T$.

Given a measurable space $(X, \Sigma)$, a measure is a map $\mu: \Sigma \to [0, \infty]$ (i.e. to the non-negative portion of the extended real line) such that the following properties are satisfied:

  • $\mu(\emptyset) = 0$,
  • countable additivity: $\mu\left(\bigcup \mathcal{F}\right) = \sum_{A \in \mathcal{F}} \mu(A)$ for any countable family $\mathcal{F} \subseteq \Sigma$ of pairwise disjoint sets.

The triple $(X, \Sigma, \mu)$ is called a measure space.

A measure is called σ-finite if $X$ can be written as a countable union of measurable sets, each of which has finite measure.

Given any measurable map $f: X \to Y$ between two measurable spaces $(X, \Sigma)$ and $(Y, T)$, and any measure $\mu: \Sigma \to [0, \infty]$, the pushforward measure $f_*(\mu)$ is the measure on $(Y, T)$ defined for all $B \in T$ by

$$f_*(\mu)(B) = \mu(f^{-1}(B)).$$

Probability Spaces

If μ(X)=1, then the measure space is called a probability space. A probability space is typically denoted (Ω,F,P), i.e. it consists of the following data:

  • A set Ω called the sample space whose elements are called outcomes,
  • A σ-algebra F consisting of subsets of the outcomes called the event space whose elements are called events,
  • A probability measure P.

Thus, each event is a measurable set of outcomes, and the measure assigned to it is called its probability.

For instance, the sample space representing the outcome of the roll of a die could be represented by {1,2,3,4,5,6} and the event {2,4,6} could represent a roll landing on an even-numbered face, whose probability is P({2,4,6})=1/2.

Events are sets because sets represent arbitrary predicates (they are the extension of the predicate, i.e. $\{\omega \in \Omega : \phi(\omega)\}$). Thus, events can encode arbitrary predicates such as "two of the dice land on an even number and one lands on an odd number", etc.

The conditional probability $P(A \mid B)$, which represents the probability of an event $A$ occurring given that an event $B$ has occurred, is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Random Variables

Given a probability space $(\Omega, \mathcal{F}, P)$ and a measurable space $(E, \Sigma)$, a random variable is a measurable function $X: \Omega \to E$.

Random variables are functions, but they are often written in abbreviated notation. For instance, the probability that a random variable takes a value in a measurable set $A \in \Sigma$ is often written as follows:

$$P(X \in A) = P(\{\omega \in \Omega : X(\omega) \in A\}).$$

As another example, the probability that a random variable takes a particular value is often written as follows:

$$P(X = a) = P(\{\omega \in \Omega : X(\omega) = a\}).$$

As another example, if the set $E$ is an ordered set with order relation $\le$, then the following notation is often used:

$$P(X \le a) = P(\{\omega \in \Omega : X(\omega) \le a\}).$$

In general, for any appropriate predicate $\phi$, the following notation is used:

$$P(\phi(X)) = P(\{\omega \in \Omega : \phi(X(\omega))\}).$$

The distribution (or probability distribution) of a random variable $X$ is the pushforward measure with respect to $X$, and is thus defined, for all $A \in \Sigma$, as

$$X_*(P)(A) = P(X^{-1}(A)) = P(\{\omega \in \Omega : X(\omega) \in A\}) = P(X \in A).$$

A real-valued random variable is a random variable whose codomain is $\mathbb{R}$, i.e. a map $X: \Omega \to \mathbb{R}$ (typically, the Borel σ-algebra $\mathcal{B}(\mathbb{R})$ is implied, which is the smallest σ-algebra containing all the open intervals in $\mathbb{R}$).

The cumulative distribution function (CDF) of a real-valued random variable $X$ is the function $F_X$ defined by

$$F_X(x) = P(X \le x) = P(\{\omega \in \Omega : X(\omega) \le x\}).$$

The CDF completely characterizes the distribution of a real-valued random variable (i.e. each distribution has a unique CDF and conversely, each CDF corresponds to a unique distribution).

For random variables on discrete (finite or countable) measurable spaces, the probability mass function (PMF) $p_X$ is defined so that

$$p_X(x) = P(X = x).$$

Each random variable generates a σ-algebra. The σ-algebra generated by a random variable $X$ with values in a measurable space $(E, \Sigma)$ is denoted $\sigma(X)$ and is defined as follows:

$$\sigma(X) = \{X^{-1}(A) : A \in \Sigma\}.$$

Although probability spaces and random variables are important for theoretical purposes, the distribution of random variables is typically what matters, and so the probability space and random variables are suppressed (they are not defined explicitly). There might be many random variables defined on various probability spaces which exhibit the same distribution; it is the distribution itself that is important. There are theoretical results guaranteeing the existence of suitable probability spaces and random variables for a given distribution. Thus, generally, the distribution is used, typically via a CDF or PDF (see below).

Integrals

Given a real-valued measurable function $f: X \to \mathbb{R}$ on a measurable space $(X, \Sigma)$, it is possible to define the (Lebesgue) integral of $f$ over a measurable set $E \in \Sigma$ with respect to a measure $\mu$ on $\Sigma$, denoted

$$\int_E f \, d\mu.$$

The same integral is sometimes denoted

$$\int_E f(x) \, d\mu(x)$$

or as

$$\int_E f(x) \, \mu(dx).$$

Consider Figure 5.

Figure 5: A graphical depiction of the Lebesgue integral. The bold red and green line segments along the horizontal axis depict various (disconnected) measurable sets in the domain, and correspond to a particular choice of indicator functions. The horizontal dashed lines depict the choice of various coefficients in a linear combination of these indicator functions. Each set of rectangles of a given color, collectively, represent the area of the product of a given coefficient (corresponding to the height of the respective dashed horizontal line) with the width (measure) of the respective measurable set. Taken together, the rectangles approximate the area under the curve. The supremum of all such approximations which are bounded by the function is the integral.

An indicator function $1_S: X \to \{0, 1\}$ of a measurable set $S$ is defined such that

$$1_S(x) = \begin{cases} 1 & \text{if } x \in S, \\ 0 & \text{otherwise.} \end{cases}$$

The integral of an indicator function is defined as follows:

$$\int 1_S \, d\mu = \mu(S).$$

A finite linear combination $s$ of indicator functions is called a simple function:

$$s = \sum_k a_k 1_{S_k},$$

where $a_k \in \mathbb{R}$ and the $S_k$ are disjoint measurable sets. When each of the coefficients $a_k$ is non-negative, the integral of a simple function is defined as

$$\int \left( \sum_k a_k 1_{S_k} \right) d\mu = \sum_k a_k \int 1_{S_k} \, d\mu = \sum_k a_k \, \mu(S_k).$$

If $E$ is a measurable subset of $X$, then the integral over this domain is defined as

$$\int_E s \, d\mu = \int 1_E \, s \, d\mu = \sum_k a_k \, \mu(S_k \cap E).$$

Then, the integral of a non-negative measurable function $f: X \to [0, \infty]$ is defined as the supremum of the integrals of all non-negative simple functions bounded by $f$:

$$\int_E f \, d\mu = \sup \left\{ \int_E s \, d\mu : 0 \le s \le f, \ s \text{ simple} \right\}.$$

For a general measurable function $f$, we can then define its non-negative positive and negative parts as follows:

$$f^+(x) = \begin{cases} f(x) & \text{if } f(x) > 0, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$f^-(x) = \begin{cases} -f(x) & \text{if } f(x) < 0, \\ 0 & \text{otherwise.} \end{cases}$$

Then $f = f^+ - f^-$ and $|f| = f^+ + f^-$. If at least one of $\int f^+ \, d\mu$ or $\int f^- \, d\mu$ is finite, then the integral is said to exist, and is then defined as

$$\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu,$$

and if

$$\int |f| \, d\mu < \infty,$$

the function is said to be integrable.

Some details (such as conventions for handling $\infty$) have been ignored, but this gives the general idea of the integral.
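
To make the construction concrete, here is a short Python sketch (my own illustration, with made-up parameters) that approximates a Lebesgue integral over $[0, 1]$ using the layered simple-function picture from Figure 5: each horizontal slice contributes a coefficient (the slice height) times the measure of a level set.

```python
import numpy as np

# Approximate the Lebesgue integral of a non-negative function f over [0, 1]
# (with Lebesgue measure) by a simple function built from horizontal slices
# of the range; each slice contributes (slice height) * mu({f >= slice top}).
def lebesgue_integral(f, n_levels=1000, n_grid=100_000):
    xs = np.linspace(0.0, 1.0, n_grid)            # fine grid of the domain
    fx = f(xs)
    levels = np.linspace(0.0, fx.max(), n_levels + 1)
    dx = 1.0 / n_grid
    total = 0.0
    for lo, hi in zip(levels[:-1], levels[1:]):
        measure = np.count_nonzero(fx >= hi) * dx  # measure of {x : f(x) >= hi}
        total += (hi - lo) * measure               # rectangle area for this slice
    return total

print(lebesgue_integral(lambda x: x ** 2))         # ~ 1/3
```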

Densities

The Radon-Nikodym Theorem is an important result in measure theory which states that, for any measurable space $(X, \Sigma)$ on which two measures $\mu$ and $\nu$ are defined (subject to certain conditions, namely, that both measures are σ-finite and $\nu \ll \mu$, that is, $\nu$ is absolutely continuous with respect to $\mu$), there exists a $\Sigma$-measurable function $f: X \to [0, \infty)$, such that for any measurable set $A \in \Sigma$,

$$\nu(A) = \int_A f \, d\mu.$$

The function f thus represents a density of one measure with respect to another. Because of the formal analogy to the fundamental theorem of calculus, such a density is often called a "derivative" (although it is not related to the usual concept of a derivative, namely, the best linear approximation of a continuous function), and is denoted

$$\frac{d\nu}{d\mu},$$

so that

$$\nu(A) = \int_A \frac{d\nu}{d\mu} \, d\mu.$$

A probability density function (PDF) of a random variable $X$ with values in a measurable space $(E, \Sigma)$ is any density $f$ of the corresponding distribution with respect to a reference measure $\mu$, i.e.

$$f = \frac{d(X_* P)}{d\mu},$$

and hence, for any measurable set $A \in \Sigma$,

$$P(X \in A) = \int_{X^{-1}(A)} dP = \int_A f \, d\mu.$$

The distribution of a real-valued random variable admits a density function if and only if its CDF is absolutely continuous; in this case, the derivative of the CDF is a PDF:

$$\frac{d}{dx} F_X(x) = f(x).$$

Stochastic Processes

Given a probability space $(\Omega, \mathcal{F}, P)$ and a measurable space $(S, \Sigma)$, a stochastic process is a collection of $S$-valued random variables indexed by some set $T$ (typically, $T = \mathbb{N}$, resulting in a sequence of random variables):

$$\{X_t : t \in T\}.$$

The set $T$ is called the index set and the set $S$ is called the state space. The indices $t \in T$ are often referred to as times.

A stochastic process is stationary if each of its random variables has the same distribution.

Filtrations

Given a σ-algebra $\mathcal{F}$ and a totally-ordered index set $I$ (e.g. $\mathbb{N}$), a sequence of σ-algebras

$$(\mathcal{F}_i)_{i \in I}$$

such that $\mathcal{F}_i \subseteq \mathcal{F}$ (i.e. each is a sub-algebra) is called a filtration if $\mathcal{F}_i \subseteq \mathcal{F}_j$ whenever $i \le j$.

A stochastic process $(X_i)_{i \in I}$ defined on a probability space $(\Omega, \mathcal{F}, P)$ with values in a measurable space $(S, \Sigma)$ is adapted to the filtration if each random variable $X_i: \Omega \to S$ is a measurable function with respect to the measurable space $(\Omega, \mathcal{F}_i)$.

Intuitively, a filtration models the information gathered from a stochastic process up to a particular point in the process.

Expectations

If X is an integrable real-valued random variable on a probability space with probability measure P, then the expectation (or mean) of X is denoted E(X) and is defined as follows:

$$E(X) = \int X \, dP.$$

For each event $A$, the mapping $B \mapsto P(B \mid A)$ is also a probability measure. The conditional expectation of a real-valued random variable $X$ with respect to an event $A$ is defined as

$$E(X \mid A) = \int X(\omega) \, P(d\omega \mid A).$$

The conditional expectation of a real-valued random variable $X$ with respect to a σ-algebra $\mathcal{F}$ is a random variable denoted $E(X \mid \mathcal{F})$ which satisfies the following properties:

  • $E(X \mid \mathcal{F})$ is $\mathcal{F}$-measurable,
  • For each $A \in \mathcal{F}$, $E(1_A X) = E(1_A \, E(X \mid \mathcal{F}))$.

The conditional probability of an event $A$ given a σ-algebra $\mathcal{F}$ is defined as follows:

$$P(A \mid \mathcal{F}) = E(1_A \mid \mathcal{F}).$$

Likewise, the conditional expectation of a real-valued random variable $X$ with respect to another real-valued random variable $Y$ is defined as

$$E(X \mid Y) = E(X \mid \sigma(Y)).$$

Markov Processes

A stochastic process $(X_t)_{t \in T}$ indexed on a set $T$ on a probability space $(\Omega, \mathcal{F}, P)$ with values in a measurable space $(S, \Sigma)$, which is adapted to a filtration $(\mathcal{F}_t)_{t \in T}$, is a Markov process if, for all $A \in \Sigma$ and each $s, t \in T$ with $s < t$,

$$P(X_t \in A \mid \mathcal{F}_s) = P(X_t \in A \mid X_s).$$

This property is called the Markov property.

Intuitively, the Markov property indicates that a Markov process only depends on the most recent state, not on the entire history of prior states.

A Markov process is time-homogeneous if, for all $s, t \in T$, $x \in S$, and $A \in \Sigma$,

$$P(X_{s+t} \in A \mid X_s = x) = P(X_t \in A \mid X_0 = x).$$

This property indicates that the transition probability is independent of time (i.e. is the same for all times).

A Markov process is called a discrete-time process if $T = \mathbb{N}$.

Given measurable spaces $(\Omega_1, \mathcal{A}_1)$ and $(\Omega_2, \mathcal{A}_2)$, a transition kernel is a map $\kappa: \Omega_1 \times \mathcal{A}_2 \to [0, \infty]$ such that the following properties hold:

  • $\kappa(\cdot, A_2)$ is a measurable map with respect to $\mathcal{A}_1$ for every $A_2 \in \mathcal{A}_2$,
  • $\kappa(\omega_1, \cdot)$ is a σ-finite measure on $(\Omega_2, \mathcal{A}_2)$ for every $\omega_1 \in \Omega_1$.

If $\kappa(\omega_1, \cdot)$ is a probability measure for every $\omega_1 \in \Omega_1$, then $\kappa$ is called a stochastic kernel or Markov kernel.

Markov Chains on Arbitrary State Spaces

A discrete-time, time-homogeneous Markov process consisting of real-valued (or real-vector-valued) random variables, for our purposes, will be called a Markov chain (although other types of Markov processes are also called Markov chains).

The initial distribution $\pi_0$ of a Markov chain is the distribution of the first random variable $X_0$, i.e., for all $A \in \Sigma$,

$$\pi_0(A) = (X_0)_*(P)(A) = P(X_0 \in A).$$

We can then define a stochastic kernel $p$ as follows:

$$p(x, A) = P(X_{t+1} \in A \mid X_t = x) = P(X_1 \in A \mid X_0 = x).$$

Since a Markov chain is time-homogeneous, this is well-defined.

The distribution $\pi_1$ of $X_1$ can then be expressed in terms of the distribution $\pi_0$ as follows:

$$\pi_1(A) = P(X_1 \in A) = \int_{x \in S} p(x, A) \, d\pi_0(x).$$

Likewise, the distribution $\pi_t$ at each time $t$ can be defined recursively as follows:

$$\pi_t(A) = P(X_t \in A) = \int_{x \in S} p(x, A) \, d\pi_{t-1}(x).$$

If the random variables admit densities, then $X_1$ has probability density $f_1$ such that

$$\pi_1(A) = \int_{x \in A} f_1(x) \, dx.$$

Likewise, the transition kernel can be expressed in terms of a conditional density $f(x_t \mid x_{t-1})$ as follows:

$$p(x_{t-1}, A) = \int_{x_t \in A} f(x_t \mid x_{t-1}) \, dx_t.$$

Then, the density $f_t$ of $X_t$ can be expressed in terms of the density $f_{t-1}$ of $X_{t-1}$ as follows:

$$f_t(x_t) = \int_{x_{t-1} \in S} f(x_t \mid x_{t-1}) \, f_{t-1}(x_{t-1}) \, dx_{t-1}.$$

If the Markov process has a stationary distribution, that is, if each of the random variables has the same distribution π (and hence the same as the initial distribution π0=π), then the distribution at any time t may be expressed as

$$\pi(A) = \int_{x \in S} p(x, A) \, d\pi(x).$$

This equation is called the balance equation.

In terms of PDFs, the balance equation is expressed as

$$f(y) = \int_{x \in S} f(y \mid x) \, f(x) \, dx.$$

A Markov chain satisfies detailed balance if there exists a probability measure $\pi$ with density $f$ such that, for all $x, y \in S$,

$$f(y \mid x) \, f(x) = f(x \mid y) \, f(y).$$

Then, integrating both sides, we obtain

$$\int_{y \in S} f(y \mid x) \, f(x) \, dy = \int_{y \in S} f(x \mid y) \, f(y) \, dy.$$

Thus,

$$f(x) \int_{y \in S} f(y \mid x) \, dy = \int_{y \in S} f(x \mid y) \, f(y) \, dy,$$

and, since

$$\int_{y \in S} f(y \mid x) \, dy = 1,$$

it follows that

$$f(x) = \int_{y \in S} f(x \mid y) \, f(y) \, dy.$$

This is (with an appropriate change of variables) exactly the balance equation. Thus, detailed balance implies balance, i.e. it implies that π is a stationary distribution.

Example Markov Chain on a Finite State Space

It is instructive to consider an example of a Markov chain on a finite state space. We will work with the distributions, via the probability mass function (PMF) and cumulative distribution function (CDF).

Consider the following Markov chain:

  • The state space $S$ consists of three states $S = \{s_1, s_2, s_3\}$,
  • The initial distribution $\pi_0$ is defined such that $\pi_0(s_1) = 1/2$, $\pi_0(s_2) = 1/4$, and $\pi_0(s_3) = 1/4$,
  • The transition function $P$ is defined so that $P(s_j \mid s_i)$ is given by the label on the arrow from state $s_i$ to state $s_j$ in Figure 6.

A Markov chain on a finite state space can be conceived as a state machine with probabilistic transitions, as in Figure 6.

Figure 6: A Markov chain can be conceived as a state machine with probabilistic transitions.

Next, define the inverse CDF $F_0^{-1}$ corresponding to $\pi_0$:

$$F_0^{-1}(u) = \begin{cases} s_1 & \text{if } 0 \le u \le \tfrac{1}{2}, \\ s_2 & \text{if } \tfrac{1}{2} < u \le \tfrac{3}{4}, \\ s_3 & \text{if } \tfrac{3}{4} < u \le 1. \end{cases}$$

We can then sample a state by generating a uniformly distributed random number $u_0 \in [0, 1]$. Suppose that $u_0 = 0.84$. Then $F_0^{-1}(u_0) = s_3$, and so the initial state is $s_3$.

Then, we can define the next distribution $\pi_1$ as follows:

$$\pi_1(s) = \sum_{i=1}^{3} \pi_0(s_i) \, P(s \mid s_i).$$

This uses the Law of Total Probability, i.e. the probability of occupying a given state $s$ at time $t = 1$ is the probability of occupying some state $s_i$ at time $t = 0$ and then transitioning to state $s$. Thus, we compute $\pi_1(s_1) = 11/24$, $\pi_1(s_2) = 1/4$, $\pi_1(s_3) = 7/24$.

We can then define the next inverse CDF $F_1^{-1}$ corresponding to $\pi_1$ as follows:

$$F_1^{-1}(u) = \begin{cases} s_1 & \text{if } 0 \le u \le \tfrac{11}{24}, \\ s_2 & \text{if } \tfrac{11}{24} < u \le \tfrac{17}{24}, \\ s_3 & \text{if } \tfrac{17}{24} < u \le 1. \end{cases}$$

We can then sample the next state by generating a uniformly distributed random number $u_1 \in [0, 1]$. Suppose that $u_1 = 0.08$. Then $F_1^{-1}(u_1) = s_1$, and so the next state is $s_1$.

This process can continue indefinitely. We can define the distribution $\pi_{t+1}$ at time $t+1$ recursively as follows:

$$\pi_{t+1}(s) = \sum_{i=1}^{3} \pi_t(s_i) \, P(s \mid s_i).$$

If the Markov chain possesses a stationary distribution, then it will be the limit as $t \to \infty$. The stationary distribution will satisfy

$$\pi(s) = \sum_{i=1}^{3} \pi(s_i) \, P(s \mid s_i).$$

Each distribution $\pi_t$ can be represented as a row vector. For instance,

$$\pi_0 = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{4} \end{bmatrix}.$$

Likewise, the transition function can be represented as a matrix with entries $P_{ij} = P(s_j \mid s_i)$ as follows:

$$P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{4} \\ \tfrac{1}{3} & 0 & \tfrac{2}{3} \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \end{bmatrix}.$$

Then, each distribution $\pi_{t+1}$ can be computed via the recursive equation

$$\pi_{t+1} = \pi_t P.$$

Expanding this recursive equation yields an iterative equation:

$$\pi_{t+1} = \pi_t P = (\pi_{t-1} P) P = \cdots = \pi_0 P^{t+1}.$$

If the Markov chain possesses a stationary distribution π, then it will satisfy

π=πP.

If we write

$$\pi = \begin{bmatrix} x & y & z \end{bmatrix},$$

then this yields the equation

$$\begin{bmatrix} x & y & z \end{bmatrix} = \begin{bmatrix} x & y & z \end{bmatrix} \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{4} \\ \tfrac{1}{3} & 0 & \tfrac{2}{3} \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \end{bmatrix}.$$

We must additionally impose the constraint that $x + y + z = 1$ since these represent probabilities. Expanding $\pi = \pi P$ column by column, together with this constraint, yields the augmented system of equations

$$\begin{aligned} \tfrac{1}{2}x + \tfrac{1}{3}y + \tfrac{1}{2}z &= x, \\ \tfrac{1}{4}x + \tfrac{1}{2}z &= y, \\ \tfrac{1}{4}x + \tfrac{2}{3}y &= z, \\ x + y + z &= 1. \end{aligned}$$

Solving this yields $x = 16/35$, $y = 9/35$, $z = 10/35$. So, the stationary distribution is $\pi = \begin{bmatrix} \tfrac{16}{35} & \tfrac{9}{35} & \tfrac{10}{35} \end{bmatrix} \approx \begin{bmatrix} 0.457 & 0.257 & 0.286 \end{bmatrix}$.
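
As a quick numerical check (a sketch of my own, not part of the original example), we can iterate $\pi_{t+1} = \pi_t P$ and watch it converge to this stationary distribution:

```python
import numpy as np

# Power iteration for the three-state chain above: pi_{t+1} = pi_t P.
P = np.array([
    [1/2, 1/4, 1/4],
    [1/3, 0.0, 2/3],
    [1/2, 1/2, 0.0],
])
pi = np.array([1/2, 1/4, 1/4])   # initial distribution pi_0

for _ in range(100):
    pi = pi @ P                  # pi_t = pi_0 P^t

print(pi)                        # ~ [0.4571, 0.2571, 0.2857] = (16/35, 9/35, 10/35)
```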

The Rao-Blackwell Theorem

The term parameter refers to a quantity that summarizes some detail of a statistical population, such as the mean or standard deviation of the population, and is often denoted $\theta$. A statistic is a random variable $X$ which represents some observable quantity. An estimator for a parameter is an observable random variable $\hat{\theta}(X)$ used to estimate $\theta$. A sufficient statistic is a statistic $T(X)$ such that the conditional probability of $X$ given $T(X)$ does not depend on $\theta$.

The mean-squared error (MSE) of an estimator $\hat{\theta}$ for a parameter $\theta$ is defined as

$$E\left((\hat{\theta} - \theta)^2\right).$$

The Rao-Blackwell estimator $\tilde{\theta}$ is the conditional expected value of an estimator for a parameter $\theta$ given a sufficient statistic $T(X)$, namely

$$\tilde{\theta} = E\left(\hat{\theta}(X) \mid T(X)\right).$$

One version of the Rao-Blackwell theorem states that the mean-squared error of the Rao-Blackwell estimator does not exceed that of the original estimator, namely

$$E\left((\tilde{\theta} - \theta)^2\right) \le E\left((\hat{\theta} - \theta)^2\right).$$

Another version states that the variance of the Rao-Blackwell estimator is less than or equal to the variance of the original estimator.

The Rao-Blackwell theorem can be applied to produce improved estimators. In particular, it will provide a means of using the rejected samples produced by the Metropolis-Hastings algorithm.

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm, like all MCMC techniques, designs a Markov chain such that the stationary distribution π of the chain is some desired distribution, called the target distribution. The distribution of the chain asymptotically approaches the stationary distribution, and the sequence of states generated by the chain can be used as samples from the target distribution. Here, we represent distributions via PDFs.

A Markov chain is characterized by its transition probability $P(x_i \mid x_j)$. The algorithm proceeds by designing $P(x_i \mid x_j)$ such that the stationary distribution is $\pi$.

First, the condition of detailed balance is imposed, namely

$$P(x_i \mid x_j) \, \pi(x_j) = P(x_j \mid x_i) \, \pi(x_i),$$

which guarantees the balance condition

$$\pi(x_i) = \int P(x_i \mid x_j) \, \pi(x_j) \, dx_j.$$

Then, the transition distribution $P$ is decomposed into two distributions: the acceptance distribution $A(x_i \mid x_j) \in [0, 1]$, which represents the probability of accepting state $x_i$ given state $x_j$, and the proposal distribution $K(x_i \mid x_j)$, which represents the probability of proposing state $x_i$ given the current state $x_j$. The transition distribution is then defined as the product of these distributions:

$$P(x_i \mid x_j) = K(x_i \mid x_j) \, A(x_i \mid x_j).$$

The detailed balance condition then implies that

$$\pi(x_j) \, K(x_i \mid x_j) \, A(x_i \mid x_j) = \pi(x_i) \, K(x_j \mid x_i) \, A(x_j \mid x_i),$$

which further implies that

$$\frac{A(x_j \mid x_i)}{A(x_i \mid x_j)} = \frac{\pi(x_j) \, K(x_i \mid x_j)}{\pi(x_i) \, K(x_j \mid x_i)}.$$

Often, the proposal distribution is designed to be symmetric, meaning that

$$K(x_i \mid x_j) = K(x_j \mid x_i),$$

and so this simplifies to

$$\frac{A(x_j \mid x_i)}{A(x_i \mid x_j)} = \frac{\pi(x_j)}{\pi(x_i)}.$$

The typical choice for the acceptance distribution is

$$A(x_i \mid x_j) = \min\left(1, \frac{\pi(x_i) \, K(x_j \mid x_i)}{\pi(x_j) \, K(x_i \mid x_j)}\right).$$

Then, either $A(x_i \mid x_j) = 1$ or $A(x_j \mid x_i) = 1$, and the detailed balance equation will be satisfied.

Thus, the balance equation is also satisfied, and the stationary distribution of the chain will be the target distribution π.

Note that since the probability of acceptance is $A(x_i \mid x_j)$, the probability of rejection is therefore $1 - A(x_i \mid x_j)$.

Detailed balance guarantees the existence of a stationary distribution; the uniqueness of the stationary distribution is guaranteed under certain conditions (the ergodicity of the chain).

This yields a concrete algorithm:

  1. Choose an initial state $x_0$. Set time $t = 0$. Set $x_t = x_0$.
  2. Iterate:
    1. Generate a random proposal state $x'$ by sampling from the distribution $K(x' \mid x_t)$.
    2. Either accept the proposal or reject the proposal.
      1. Generate a uniformly distributed random number $u$ in the range $[0, 1]$.
      2. If $u \le A(x' \mid x_t)$, then accept the proposal state: set $x_{t+1} = x'$.
      3. If $u > A(x' \mid x_t)$, then reject the proposal state: set $x_{t+1} = x_t$ (i.e. accept the current state).
    3. Continue at time $t+1$.

This algorithm will yield a sequence of states which represent samples from the target distribution.
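
Below is a minimal Python sketch of the algorithm above (my own illustration, not reference code), assuming a symmetric Gaussian random-walk proposal (so the $K$ terms cancel in the acceptance ratio) and a one-dimensional target density known only up to a normalizing constant. The function and parameter names are hypothetical.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_steps, step_size=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_steps)
    x, log_px = x0, log_target(x0)
    for t in range(n_steps):
        x_prop = x + step_size * rng.normal()        # proposal from K(x' | x_t)
        log_pprop = log_target(x_prop)
        # A(x' | x_t) = min(1, pi(x') / pi(x_t)) for a symmetric proposal.
        if np.log(rng.uniform()) <= log_pprop - log_px:
            x, log_px = x_prop, log_pprop            # accept the proposal
        samples[t] = x                               # on rejection, keep x_t
    return samples

# Example: sample a standard normal from its unnormalized log-density.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=50_000)
print(draws.mean(), draws.std())                     # ~ 0.0 and ~ 1.0
```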

Note that the proposal state is not used when it is rejected; it is possible to use it, however. We will return to this issue in the discussion of the Rao-Blackwell theorem.

A New Perspective on Metropolis-Hastings

A change in perspective on the Metropolis-Hastings algorithm will indicate how it can be generalized. The algorithm can be conceived in terms of an auxiliary Markov chain. The proposal state and the current state comprise the states of a two-state auxiliary Markov chain whose transition probabilities are given by the acceptance probabilities $A(x_i \mid x_j)$, as in Figure 7.

Figure 7: The auxiliary Markov chain implicit in the Metropolis-Hastings algorithm. Only the transitions which are actually used have been depicted.

A new auxiliary chain is constructed during each iteration of the algorithm. The chain begins in state $x_t$. Acceptance of the proposal state $x'$ corresponds to a transition $x_t \to x'$, which has probability $A(x' \mid x_t)$, and rejection of the proposal state (or acceptance of the current state) corresponds to a transition $x_t \to x_t$, which implicitly has probability $1 - A(x' \mid x_t)$. The acceptance probability $A$ thus implicitly represents the transition probability of this auxiliary Markov chain. Thus, simulating this auxiliary Markov chain for a single step represents accepting or rejecting the proposal state.

We refer to the original Markov chain described explicitly in the Metropolis-Hastings algorithm as the primary chain to distinguish it from this implicit auxiliary chain.

This shift in perspective indicates several avenues toward generalization:

  1. A different auxiliary Markov chain can be constructed, in particular, one with multiple proposal states.
  2. The auxiliary Markov chain can be simulated for more than one step.

However, it is essential to ensure that the stationary distribution of the Markov chain is preserved.

In this post, we will not consider simulating the auxiliary chain for more than one step (but see references [1] and [2] for more information on this approach). This is less amenable to parallelization since it incurs significant coordination overhead. Instead, we will consider an auxiliary Markov chain with multiple proposal states which is simulated only once, yet each of the proposals is utilized via a Rao-Blackwell estimator. This approach allows the proposals to be computed in parallel, and minimizes the coordination needed.

Multiple Proposal Metropolis-Hastings

As in the single proposal case, we have a continuous target distribution on $\mathbb{R}^n$ with density $\pi$.

Logically, we think of the process as follows: at each iteration of the Metropolis-Hastings algorithm, we produce an auxiliary Markov chain consisting of multiple proposal states, then we simulate the auxiliary chain for a single step to select the next state.

Just as in the single-proposal algorithm, we will maintain a single current state at each step of the process. However, we will propose an entire vector of proposals at each step.

We thus need to define the proposal density of a proposal vector given the current state.

First, denote the number of proposals as $N$, where $N \ge 1$. Then, let there be a sequence of $N+1$ states, which include the proposals and the current state:

$$y_1, y_2, \ldots, y_{N+1}.$$

We then use the notation

$$y = (y_1, y_2, \ldots, y_{N+1}).$$

The notation

$$y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_{N+1})$$

indicates the projection of $y$ which deletes the $i$-th component.

We can define a transition kernel $\tilde{\kappa}$ in terms of the underlying proposal kernel $\kappa$ of the Markov chain as

$$\tilde{\kappa}(y_i, y_{-i}) = \prod_{j \ne i} \kappa(y_i, y_j).$$

We can then define a multiple proposal density $p_i$ for each $1 \le i \le N+1$ in terms of the target distribution $\pi$ as

$$p_i(y) = \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}).$$

Next, define a uniformly-distributed auxiliary random variable $I$ with values in the set $\{1, \ldots, N+1\}$. This random variable indicates which of the proposal densities to use (which index to use). Since it is uniformly distributed, each index has probability $1/(N+1)$.

Next, define a joint distribution as follows:

$$p(y, i) = \frac{1}{N+1} \, p_i(y).$$

Finally, denote the transition matrix of the auxiliary chain as

$$A = [A(i, j)].$$

There will be a different transition matrix for each auxiliary chain, and each auxiliary chain corresponds to a different proposal vector.

We can now formulate the analog of the balance condition for the auxiliary chain:

$$p_i(y) = \sum_{j=1}^{N+1} p_j(y) \, A(j, i).$$

Likewise, there is an analog of the detailed balance condition for the auxiliary chain:

$$p_i(y) \, A(i, j) = p_j(y) \, A(j, i).$$

In analogy to the single-proposal algorithm, the following transition matrix is used (which represents the acceptance distribution):

$$A(i, j) = \begin{cases} \dfrac{1}{N+1} \min\left(1, \dfrac{p_j(y)}{p_i(y)}\right) & \text{if } i \ne j, \\[2ex] 1 - \sum_{k \ne i} A(i, k) & \text{if } i = j. \end{cases}$$

This transition matrix satisfies the balance equation. To see this, note first that, for every $j \ne i$,

$$p_j(y) \, A(j, i) = \frac{1}{N+1} \min\bigl(p_i(y), \, p_j(y)\bigr) = p_i(y) \, A(i, j),$$

i.e. detailed balance holds term by term. Then

$$\sum_{j=1}^{N+1} p_j(y) \, A(j, i) = \sum_{j \ne i} p_j(y) \, A(j, i) + p_i(y) \, A(i, i) = \sum_{j \ne i} p_i(y) \, A(i, j) + p_i(y) \left(1 - \sum_{j \ne i} A(i, j)\right) = p_i(y).$$
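
As a concrete illustration (my own sketch, assuming the unnormalized densities $p_i(y)$ have already been evaluated), one row $A(i, \cdot)$ of this transition matrix can be computed as follows:

```python
import numpy as np

# Build the row A(i, .) of the auxiliary chain's transition matrix from the
# multiple-proposal densities p_1(y), ..., p_{N+1}(y); this row is also used
# later as the Rao-Blackwell weights.
def transition_row(p, i):
    p = np.asarray(p, dtype=float)
    n_plus_1 = len(p)
    row = np.minimum(1.0, p / p[i]) / n_plus_1   # A(i, j) for j != i
    row[i] = 0.0
    row[i] = 1.0 - row.sum()                     # A(i, i) = 1 - sum_{k != i} A(i, k)
    return row

row = transition_row([0.2, 1.0, 0.5], i=0)
print(row, row.sum())                            # a probability vector summing to 1
```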

Preservation of Stationary Distribution

Now that we have defined the setting for multiple-proposal Markov chains, we need to establish several essential properties. First, the stationary distribution of the modified Markov process must be preserved, i.e. must still be equal to the target distribution π.

The following proof is adapted from appendix A of reference [1].

The notation

$$dy = dy_1 \cdots dy_{N+1}$$

and

$$dy_{-i} = dy_1 \cdots dy_{i-1} \, dy_{i+1} \cdots dy_{N+1}$$

is used.

First, define a stochastic kernel as follows:

$$P(y_i, B) = \int_{\mathbb{R}^{Nd}} \tilde{\kappa}(y_i, y_{-i}) \sum_{j=1}^{N+1} A(i, j) \, 1_B(y_j) \, dy_{-i}.$$

This kernel indicates the probability that the accepted sample belongs to a measurable set $B \in \mathcal{B}(\mathbb{R}^d)$. It can be interpreted as the probability of proposing the proposal vector $y_{-i}$ from the current state $y_i$ and then accepting some state $y_j \in B$.

All transitions from the current state to the next have this form. Thus, we compute, for an arbitrary state $x = y_i$,

$$\int_{\mathbb{R}^d} P(x, B) \, \pi(x) \, dx = \int_{\mathbb{R}^d} \left( \int_{\mathbb{R}^{Nd}} \tilde{\kappa}(y_i, y_{-i}) \sum_{j=1}^{N+1} A(i, j) \, 1_B(y_j) \, dy_{-i} \right) \pi(y_i) \, dy_i = \int_{\mathbb{R}^{(N+1)d}} \tilde{\kappa}(y_i, y_{-i}) \sum_{j=1}^{N+1} A(i, j) \, 1_B(y_j) \, \pi(y_i) \, dy.$$

For any constant term $c$, $\frac{1}{N+1} \sum_{i=1}^{N+1} c = c$. Furthermore, since the indices of the formal variables in the integrand are arbitrary (i.e. whether we designate $y_1$ or $y_2$, etc. as the current state is irrelevant), re-arranging them does not affect the value of the integral. Thus, we may introduce a summation without affecting the integral as follows:

$$\int_{\mathbb{R}^{(N+1)d}} \tilde{\kappa}(y_i, y_{-i}) \sum_{j=1}^{N+1} A(i, j) \, 1_B(y_j) \, \pi(y_i) \, dy = \int_{\mathbb{R}^{(N+1)d}} \frac{1}{N+1} \sum_{i=1}^{N+1} \tilde{\kappa}(y_i, y_{-i}) \sum_{j=1}^{N+1} A(i, j) \, 1_B(y_j) \, \pi(y_i) \, dy.$$

We may then re-associate the terms as follows:

$$\int_{\mathbb{R}^{(N+1)d}} \frac{1}{N+1} \sum_{i=1}^{N+1} \tilde{\kappa}(y_i, y_{-i}) \sum_{j=1}^{N+1} A(i, j) \, 1_B(y_j) \, \pi(y_i) \, dy = \int_{\mathbb{R}^{(N+1)d}} \frac{1}{N+1} \sum_{j=1}^{N+1} \sum_{i=1}^{N+1} \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}) \, A(i, j) \, 1_B(y_j) \, dy.$$

Since the balance condition is satisfied, we then substitute as follows:

$$\int_{\mathbb{R}^{(N+1)d}} \frac{1}{N+1} \sum_{j=1}^{N+1} \sum_{i=1}^{N+1} \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}) \, A(i, j) \, 1_B(y_j) \, dy = \int_{\mathbb{R}^{(N+1)d}} \frac{1}{N+1} \sum_{j=1}^{N+1} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j}) \, 1_B(y_j) \, dy.$$

Then, just as before, since each of the $N+1$ summands yields the same value after relabeling the integration variables, the sum and the factor $\frac{1}{N+1}$ cancel, and we may eliminate them as follows:

$$\int_{\mathbb{R}^{(N+1)d}} \frac{1}{N+1} \sum_{j=1}^{N+1} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j}) \, 1_B(y_j) \, dy = \int_{\mathbb{R}^{(N+1)d}} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j}) \, 1_B(y_j) \, dy.$$

Finally, we re-associate the terms and utilize the fact that $\tilde{\kappa}$ is a stochastic kernel, and thus $\int_{\mathbb{R}^{Nd}} \tilde{\kappa}(y_j, y_{-j}) \, dy_{-j} = 1$, as follows:

$$\int_{\mathbb{R}^{(N+1)d}} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j}) \, 1_B(y_j) \, dy = \int_{\mathbb{R}^d} \pi(y_j) \, 1_B(y_j) \left( \int_{\mathbb{R}^{Nd}} \tilde{\kappa}(y_j, y_{-j}) \, dy_{-j} \right) dy_j = \int_{\mathbb{R}^d} \pi(y_j) \, 1_B(y_j) \, dy_j = \int_B \pi(x) \, dx.$$

Thus, the transition kernel does indeed maintain the stationary distribution.

If we denote the distribution corresponding to the density $\pi$ as $\tilde{\pi}$, then the stationarity condition for a Markov process was originally expressed as

$$\tilde{\pi}(B) = \int_S P(x, B) \, d\tilde{\pi}(x).$$

Noting that

$$\pi = \frac{d\tilde{\pi}}{dx},$$

it follows from the Radon-Nikodym theorem that

$$\int_S P(x, B) \, d\tilde{\pi}(x) = \int_S P(x, B) \, \frac{d\tilde{\pi}}{dx} \, dx = \int_S P(x, B) \, \pi(x) \, dx.$$

Likewise, by definition,

$$\tilde{\pi}(B) = \int_B \pi(x) \, dx.$$

Thus, the balance equation is satisfied.

The Rao-Blackwell Estimator

MCMC techniques such as the Metropolis-Hastings algorithm are very often used to generate samples for use in numerical integration. When applying MCMC toward numerical integration, the sample mean $\hat{\mu}$ serves as an estimator for the expected value $E_\pi[f]$:

$$\frac{1}{N} \sum_{i=1}^{N} f(x_i) \approx \int f(x) \, \pi(x) \, dx.$$

This is an unbiased estimator (that is, $E_\pi[\hat{\mu}] = E_\pi[f]$), since

$$E_\pi\left[\frac{1}{N} \sum_{i=1}^{N} f(x_i)\right] = \frac{1}{N} \sum_{i=1}^{N} E_\pi[f(x_i)] = \frac{1}{N} \sum_{i=1}^{N} \int f(x) \, \pi(x) \, dx = \int f(x) \, \pi(x) \, dx = E_\pi[f].$$

This allows the calculation of arbitrary integrals by using $f/\pi$ in place of $f$, in which case

$$\frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{\pi(x_i)} \approx \int \frac{f(x)}{\pi(x)} \, \pi(x) \, dx = \int f(x) \, dx.$$
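
As a quick illustration, here is a sketch that reuses the metropolis_hastings function from the earlier example (again my own illustration, not code from the references): it estimates $E_\pi[f]$ for $f(x) = x^2$ under a standard normal target, whose true value is 1.

```python
# Reuses the metropolis_hastings() sketch defined earlier in this post.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=100_000)
print((draws ** 2).mean())   # sample-mean estimator of E_pi[f]; ~ 1.0
```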

Our goal is to introduce an alternative estimator $\tilde{\mu}$ which utilizes all proposals and serves as an unbiased estimator for $E_\pi[f]$.

We will begin with the sample mean estimator

$$\hat{\mu} = \frac{1}{L} \sum_{l=1}^{L} f(x^{(l)})$$

and enhance it using the Rao-Blackwell estimator

$$\tilde{\mu} = E\left[\hat{\mu} \mid y^{(l)}\right].$$

We will proceed by using weighted samples, weighted according to weights $w_i^{(l)}$, where $1 \le l \le L$ and $L$ denotes the number of iterations of the algorithm, and $1 \le i \le N+1$, where $N$ denotes the number of proposals and hence $N+1$ indicates the total number of samples including the current sample. The weights must sum to 1. Denote the samples (consisting of the current and proposed samples) as $y_i^{(l)}$.

We define the weights using the acceptance distribution as follows (given current index $i$):

$$w_j^{(l)} = A(i, j).$$

The authors in [1] employ the following technique to derive the estimator. First, define the prefix sum of weights $\eta_i^{(l)}$ as

$$\eta_i^{(l)} = \sum_{j=1}^{i} w_j^{(l)},$$

with the convention $\eta_0^{(l)} = 0$.

This is related to the CDF of $A(i, \cdot)$, since the CDF of a PMF is a prefix sum. Therefore, employing the same sampling technique from the introduction, we may sample from $A(i, \cdot)$ by evaluating the inverse of the CDF (the quantile function) at a uniformly distributed random number $u \sim U(0, 1)$.

Next, consider the following indicator function:

$$1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}.$$

Note that, for any $u \in [0, 1]$, $1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}(u)$ will evaluate to 1 for exactly one index $i$. Thus, if we denote the accepted sample as $x^{(l)}$, then, motivated by the inverse CDF sampling technique, we can express $x^{(l)}$ in terms of $u^{(l)} \sim U(0, 1)$ as

$$x^{(l)} = \sum_{i=1}^{N+1} 1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}(u^{(l)}) \, y_i^{(l)}.$$

In fact, this is an expression for the quantile (inverse CDF) function. This is why the sampling technique based on inverse CDFs works: it creates a random variable with the same distribution as another, yet with a different domain.

Thus, the expression is now reformulated on a convenient domain [0,1] with respect to the uniform (constant) PDF p(x)=1. The expected value is preserved, but is much easier to compute.

Also note that

$$f(x^{(l)}) = \sum_{i=1}^{N+1} 1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}(u^{(l)}) \, f(y_i^{(l)}).$$

Since these uniformly distributed random numbers are independent of the proposed samples, conditioning on the proposals leaves only the randomness in $u^{(l)}$; the conditional expectation of the estimator given the proposals is therefore an ordinary expectation over $u^{(l)}$, with the samples $y_i^{(l)}$ (and hence the weights) held fixed.

We then compute

$$\begin{aligned} E\left[f(x^{(l)}) \mid y^{(l)}\right] &= E\left[\sum_{i=1}^{N+1} 1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}(u^{(l)}) \, f(y_i^{(l)})\right] \\ &= \int_{(0,1)} \sum_{i=1}^{N+1} 1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}(u^{(l)}) \, f(y_i^{(l)}) \, du^{(l)} \\ &= \sum_{i=1}^{N+1} \left( \int_{(0,1)} 1_{(\eta_{i-1}^{(l)}, \, \eta_i^{(l)}]}(u^{(l)}) \, du^{(l)} \right) f(y_i^{(l)}) \\ &= \sum_{i=1}^{N+1} \left( \eta_i^{(l)} - \eta_{i-1}^{(l)} \right) f(y_i^{(l)}) \\ &= \sum_{i=1}^{N+1} w_i^{(l)} \, f(y_i^{(l)}). \end{aligned}$$

Each iteration therefore produces the weighted estimate (the Rao-Blackwell estimator)

$$E\left[f(x^{(l)}) \mid y^{(l)}\right] = \sum_{i=1}^{N+1} w_i^{(l)} \, f(y_i^{(l)}).$$

The estimator for the mean is thus

$$\tilde{\mu} = E\left[\hat{\mu} \mid y^{(1)}, \ldots, y^{(L)}\right] = E\left[\frac{1}{L} \sum_{l=1}^{L} f(x^{(l)}) \,\middle|\, y^{(1)}, \ldots, y^{(L)}\right] = \frac{1}{L} \sum_{l=1}^{L} E\left[f(x^{(l)}) \mid y^{(l)}\right] = \frac{1}{L} \sum_{l=1}^{L} \sum_{i=1}^{N+1} w_i^{(l)} \, f(y_i^{(l)}).$$

Unbiased Estimator

Next, we need to demonstrate that this new estimator is unbiased. The following proof is adapted from appendix A of reference [1].

We calculate the expected value with respect to the distribution $\pi(y_1) \, \tilde{\kappa}(y_1, y_{-1})$ (where $y_1$ denotes the starting point) as follows:

$$E\left[\sum_{i=1}^{N+1} w_i^{(l)} \, f(y_i^{(l)})\right] = \int_{\mathbb{R}^{(N+1)d}} \sum_{i=1}^{N+1} w_i^{(l)} \, f(y_i^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy.$$

Then, splitting the sum into two terms, we continue to compute as follows:

$$\int_{\mathbb{R}^{(N+1)d}} \sum_{i=1}^{N+1} w_i^{(l)} \, f(y_i^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy = \int_{\mathbb{R}^{(N+1)d}} \left( \sum_{i \ne 1} w_i^{(l)} \, f(y_i^{(l)}) + w_1^{(l)} \, f(y_1^{(l)}) \right) \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy = \left( \sum_{i \ne 1} \int_{\mathbb{R}^{(N+1)d}} w_i^{(l)} \, f(y_i^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \right) + \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy.$$

Since the balance equation states that

$$\sum_{j=1}^{N+1} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j}) \, A(j, i) = \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}),$$

and since the weights are given by $w_i = A(\cdot, i)$ (the transition probability into state $i$, which here does not depend on the starting index), it follows that

$$w_i = \frac{\pi(y_i) \, \tilde{\kappa}(y_i, y_{-i})}{\sum_{j=1}^{N+1} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j})}.$$

We may then infer that

$$\pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, w_i = \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, \frac{\pi(y_i) \, \tilde{\kappa}(y_i, y_{-i})}{\sum_{j=1}^{N+1} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j})} = \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}) \, \frac{\pi(y_1) \, \tilde{\kappa}(y_1, y_{-1})}{\sum_{j=1}^{N+1} \pi(y_j) \, \tilde{\kappa}(y_j, y_{-j})} = \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}) \, w_1.$$

Substituting this into the prior equation, we obtain

$$\left( \sum_{i \ne 1} \int_{\mathbb{R}^{(N+1)d}} w_i^{(l)} \, f(y_i^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \right) + \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy = \left( \sum_{i \ne 1} \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_i^{(l)}) \, \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}) \, dy \right) + \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy.$$

Noting that

$$w_1 = \frac{\pi(y_1) \, \tilde{\kappa}(y_1, y_{-1})}{\pi(y_i) \, \tilde{\kappa}(y_i, y_{-i})} \, w_i,$$

and that (as before) the labels of the integration variables are arbitrary, so that $y_1$ and $y_i$ may be exchanged within the $i$-th integral, we then substitute and obtain

$$\begin{aligned} &\left( \sum_{i \ne 1} \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_i^{(l)}) \, \pi(y_i) \, \tilde{\kappa}(y_i, y_{-i}) \, dy \right) + \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \\ &= \left( \sum_{i \ne 1} \int_{\mathbb{R}^{(N+1)d}} w_i^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \right) + \int_{\mathbb{R}^{(N+1)d}} w_1^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \\ &= \sum_{i=1}^{N+1} \int_{\mathbb{R}^{(N+1)d}} w_i^{(l)} \, f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \\ &= \int_{\mathbb{R}^{(N+1)d}} \left( \sum_{i=1}^{N+1} w_i^{(l)} \right) f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \\ &= \int_{\mathbb{R}^{(N+1)d}} f(y_1^{(l)}) \, \pi(y_1) \, \tilde{\kappa}(y_1, y_{-1}) \, dy \\ &= \int_{\mathbb{R}^d} f(y_1^{(l)}) \, \pi(y_1) \left( \int_{\mathbb{R}^{Nd}} \tilde{\kappa}(y_1, y_{-1}) \, dy_{-1} \right) dy_1 \\ &= \int_{\mathbb{R}^d} f(y_1^{(l)}) \, \pi(y_1) \, dy_1 \\ &= E_\pi[f(x)]. \end{aligned}$$

Thus, this shows that

$$E[\tilde{\mu}(f)] = E_\pi[f(x)],$$

and that $\tilde{\mu}(f)$ is an unbiased estimator.

The Final Algorithm

The parallel Metropolis-Hastings algorithm is as follows:

  1. Choose an initial sample $y_1$, set the current index $i = 1$, and set $t = 1$.
  2. Iterate:
    1. Generate the $N$ random proposal states $y_j$, $j \ne i$, independently and in parallel by sampling from the proposal kernel $\kappa(y_i, \cdot)$; together with the current state $y_i$ these form the vector $y$.
    2. Record the Rao-Blackwell weights for this iteration by weighting all of the proposed samples and the current sample according to $w_j = A(i, j)$.
    3. Select the state for time $t+1$ by simulating one step of the auxiliary chain (either accepting a single proposal or rejecting all proposals, i.e. keeping the current state):
      1. Generate a uniformly distributed random number $u$ in the range $[0, 1]$.
      2. Compute the prefix sums $\eta_j = \sum_{k=1}^{j} A(i, k)$ and find the index $j$ with $\eta_{j-1} < u \le \eta_j$.
      3. The state at time $t+1$ is $y_j$; if $j = i$, every proposal was rejected and the current state is kept. Set $i = j$.
    4. Continue at time $t+1$.
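
Finally, here is a compact Python sketch of the full procedure (my own illustration under simplifying assumptions: a one-dimensional target known up to a constant, a symmetric Gaussian proposal kernel, and $f(x) = x^2$ as the integrand; none of the names come from the papers). The proposal generation and density evaluations are the parts that would be parallelized across CPU cores or GPU threads.

```python
import numpy as np

def rb_parallel_mh(log_target, x0, n_iters, n_props, step_size=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, i = x0, 0                       # current state and its index within y
    weighted_sum = 0.0                 # accumulates sum_j w_j f(y_j) per iteration
    for _ in range(n_iters):
        # Proposal vector y: the current state plus N independent proposals
        # drawn from kappa(y_i, .). This is the embarrassingly parallel part.
        y = np.empty(n_props + 1)
        y[i] = x
        mask = np.arange(n_props + 1) != i
        y[mask] = x + step_size * rng.normal(size=n_props)
        # log p_j(y) = log pi(y_j) + sum_{k != j} log kappa(y_j, y_k),
        # up to a constant that cancels in the ratios below.
        log_p = np.array([
            log_target(y[j])
            - 0.5 * np.sum((np.delete(y, j) - y[j]) ** 2) / step_size ** 2
            for j in range(n_props + 1)
        ])
        # Row A(i, .) of the auxiliary chain's transition matrix.
        row = np.minimum(1.0, np.exp(log_p - log_p[i])) / (n_props + 1)
        row[i] = 0.0
        row[i] = 1.0 - row.sum()
        # Rao-Blackwell weights w_j = A(i, j); here f(x) = x^2.
        weighted_sum += np.dot(row, y ** 2)
        # One step of the auxiliary chain via the prefix sums (inverse CDF).
        i = min(int(np.searchsorted(np.cumsum(row), rng.uniform())), n_props)
        x = y[i]
    return weighted_sum / n_iters

# Estimate E[x^2] = 1 under a standard normal target, using N = 4 proposals.
print(rb_parallel_mh(lambda x: -0.5 * x * x, x0=0.0, n_iters=20_000, n_props=4))
```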

References

  1. Schwedes, T. & Calderhead, B. (2021). Rao-Blackwellised parallel MCMC . Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3448-3456 Available from https://proceedings.mlr.press/v130/schwedes21a.html.
  2. B. Calderhead, A general construction for parallelizing Metropolis−Hastings algorithms, Proc. Natl. Acad. Sci. U.S.A. 111 (49) 17408-17413, https://doi.org/10.1073/pnas.1408184111 (2014).
  3.  Tjelmeland H (2004) Using All Metropolis-Hastings Proposals to Estimate Mean Values (Norwegian University of Science and Technology, Trondheim, Norway), Tech Rep 4. Available from https://cds.cern.ch/record/848061/files/cer-002531395.pdf.
  4.  Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines". Journal of Chemical Physics. 21 (6): 1087–1092. Bibcode:1953JChPh..21.1087M. doi:10.1063/1.1699114
  5. Hastings, W.K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications". Biometrika. 57 (1): 97–109. Bibcode:1970Bimka..57...97H. doi:10.1093/biomet/57.1.97. JSTOR 2334940