De Metropolis-Hastingis algorithmo

In contrast to OLS and MLE, where parameter estimation essentially boils down to solving something along the lines of \hat\beta = (X^\prime X)^{-1} X^\prime y , or respectively searching for \hat\beta such that the likelihood p(y,X|\hat\beta) is largest, estimating Bayesian models may at first feel like resorting to a black box which mysteriously produces posterior densities p(\beta|y,X) . In an attempt to lift this veil of mystery, we will take a look at the Metropolis-Hastings sampler, one of those black boxes used to estimate posterior densities.

In a nutshell, the Metropolis-Hastings algorithm draws random samples from probability distributions P(\theta) .1 To move through the parameter space, the algorithm requires a function f(\theta) which is proportional to the target distribution P(\theta) . At each iteration t , the algorithm generates a candidate value \theta^* from a proposal distribution J_t(\cdot) . To decide whether to accept this value as a draw from P(\theta ), the algorithm computes the ratio f(\theta^*)/f(\theta^{(t-1)}) of the current candidate value \theta^* and the value obtained at the previous iteration ( \theta^{(t-1)} ), possibly corrected for asymmetric proposal distributions. If the ratio is larger than or equal to one, \theta^* is directly accepted as a draw from P(\theta) . If the ratio is smaller than one, the proposal is rejected with a non-zero probability.

To use the MH algorithm for posterior density estimation, first remember Bayes’ theorem: For any continuous parameter \theta and data y ,

\begin{align*} p(\theta|y) &= \frac{p(y|\theta)p(\theta)}{\int p(y|\theta) p(\theta) d\theta} = \frac{p(y|\theta)p(\theta)}{p(y)}, \end{align*}

The Metropolis-Hastings algorithm samples from p(\theta|y) as follows:

  1. generate a candidate value \theta^* from the proposal distribution J_t(\cdot) . \theta^{(t-1)} denotes the value at the beginning of iteration t .

  2. compute the acceptance probability \alpha :

    \begin{align*} \alpha &= \text{min}(r, 1) \\ r &= \frac{p(y|\theta^*)p(\theta^*)}{p(y|\theta^{(t-1)})p(\theta^{(t-1)})} \frac{P_J(\theta^{(t-1)}|\theta^*)}{P_J(\theta^*|\theta^{(t-1)})} \end{align*}

    with p(\theta|y) \propto p(y|\theta)p(\theta) . P_J(\theta^*|\theta^{(t-1)}) denotes the probability of J(\cdot) producing \theta^* conditional on the draw \theta^{(t-1)} of the previous iteration.2

  3. generate U \sim Unif (0,1) :

    \begin{align*} \theta^{(t)} = \begin{cases} \theta^* & \text{if } U \leq \alpha \\ \theta^{(t-1)} & \text{else} \end{cases} \end{align*}

    and go back to 1. until convergence to p(\theta|y) .3

Random walk Metropolis-Hastings

To see how the algorithm explores a posterior density, let’s implement a Metropolis-Hastings sampler in R. Assume we want to investigate the association between biochemistry doctoral students’ characteristics x_i and the number y_i of journal articles they write.4 The characteristics are gender (woman), marital status (married), number of children aged 5 or younger (kid), the prestige of the PhD department (phd), and the count of articles produced by the student’s mentor (mentor). The dependent variable consists of the student’s count of produced journal articles (art), so we model y_i \sim \text{Pois}(\lambda_i) , \lambda_i = \exp(x_i\beta) , with coefficient vector \beta and the covariates x_i described above.

loglik <- function(betastar, y, X) {
  ll <- sum(dpois(y, lambda = exp(X %*% betastar), log = TRUE))
  return(ll)
}

Or, alternatively, write down the log likelihood function

loglik <- function(betastar, y, X) {
  ll <- sum(y * (X %*% betastar) - exp(X %*% betastar) - log(factorial(y)))
  return(ll)
}

We will assume a (multivariate) normal prior p(\beta) for the coefficients, with prior mean vector b_0 and prior variance B_0 : \beta \sim N(b_0, B_0) .

lprior <- function(betastar, b_0, B_0) {
  lp <- dmvnorm(betastar, mean = b_0, sigma = B_0, log = TRUE)
  # requires the package "mvtnorm"
  return(lp)
}

Quite crucially, our sampler requires the jumping density J(\cdot) . For simplicity, we start with \beta^* \sim N(\beta^{(t-1)}, P) , that is a random-walk with a step size determined by P .

j <- function(beta_t, P) {
  proposal <- mvrnorm(1, beta_t, P)
  return(proposal)
}

With a gaussian jumping distribution centered at the value of the previous proposal, the probability of generating \beta^* from N(\beta^{(t-1)}, P) is equal to generating \beta^{(t-1)} from N(\beta^*, P) . The proportion of these probabilities hence being one, we may omit the correction P_J(\beta^{(t-1)}|\beta^*)/P_J(\beta^*|\beta^{(t-1)}) for asymmetric proposal densities (and in fact use the Metropolis algorithm).

for (t in 2:nsim) {
  # step 1: generate proposal values
  betastar <- j(betasim_rw[t-1,], P)
  # step 2: compute acceptance probability
  r <- exp(loglik(betastar, y, X) -
	   loglik(betasim_rw[t-1,], y, X) +
	   lprior(betastar, b_0, B_0) -
	   lprior(betasim_rw[t-1,], b_0, B_0))
  # step 3: accept/reject
  alpha <- min(r, 1)
  if (runif(1) <= alpha) {
    betasim_rw[t, ] <- betastar
  } else {
    betasim_rw[t, ] <- betasim_rw[t-1, ]
  }
}

Before being able to run the algorithm, we need to specify the starting values of the Markov chain, the mean b_0 and variance B_0 of the prior p(\beta) , and the step size P (the covariance) of the jumping distribution J(\cdot) . For simplicity, the algorithm shall start at the value of the maximum likelihood estimates \hat\beta , and use a vague prior with mean b_0 = 0 and variance B_0 = 10^4\cdot I .

The element that requires some thought is the step size P of the proposal density. If the step size is too large, many candidate values may fall into low-probability regions of the target density, and thus be rejected by the algorithm. Simply put, the Markov chain may not move forward. If the step size is too small however, the algorithm may explore p(\beta|y, X) only very slowly. The authors of the R package MCMCpack (Martin, Quinn, and Park 2011) for instance set P = T(B_0^{-1} + V^{-1})^{-1}T , with T = 1.1\cdot I , and V being the covariance matrix of the MLE \hat\beta . To keep things simple, we will do the same.

Putting the different pieces together in an R-function and running the algorithm 100,000 times:

## ---- Random-walk Metropolis-Hastings algorithm ----
library(MASS)
library(pscl)
library(mvtnorm)

# ---- random-walk MH sampler ----
rw_metropolis_hastings <- function(y, X, betasim_rw,
				   P,
				   b_0, B_0,
				   nsim) {
  # loglikelihood
  loglik <- function(betastar, y, X) {
    ll <- sum(dpois(y, lambda = exp(X %*% betastar), log = TRUE))
    return(ll)
  }
  # log-prior function
  lprior <- function(betastar, b_0, B_0) {
    lp <- dmvnorm(betastar, mean = b_0, sigma = B_0, log = TRUE)
    # requires the package "mvtnorm"
    return(lp)
  }
  # jumping distribution
  j <- function(beta_t, P) {
    proposal <- mvrnorm(1, beta_t, P)
    return(proposal)
  }
  # sampler
  set.seed(100)
  for (t in 2:nsim) {
    # step 1: generate proposal values
    betastar <- j(betasim_rw[t-1,], P)
    # step 2: compute acceptance probability
    r <- exp(loglik(betastar, y, X) -
	     loglik(betasim_rw[t-1,], y, X) +
	     lprior(betastar, b_0, B_0) -
	     lprior(betasim_rw[t-1,], b_0, B_0))
    # step 3: accept/reject
    alpha <- min(r, 1)
    if (runif(1) <= alpha) {
      betasim_rw[t, ] <- betastar
    } else {
      betasim_rw[t, ] <- betasim_rw[t-1, ]
    }
  }
  return(betasim_rw)
}

# ---- data ----
head(bioChemists, 10)
y <- bioChemists$art
X <- model.matrix(art ~ ., data = bioChemists)
dimnames(X)[[2]] <- c("intercept", "woman", "married", "kid", "phd", "mentor")

# ---- prior mean and variance ----
k <- ncol(X)
b_0 <- vector(mode = "numeric", length = k) # vector values automatically initialized to 0.
B_0 <- diag(rep(10^4, k))

# ---- starting values, step size of jumping distribution ----
inits <- glm(art ~ ., data = bioChemists, family = poisson(link = "log"))

# starting values
beta_start <- coef(inits)

# covariance matrix P for step size of jumping distribution
V <- vcov(inits)
Tmat <- diag(rep(1.1, k))
P <- Tmat %*% solve(solve(B_0) + solve(V)) %*% Tmat

# ---- some housekeeping: results object ----
nsim <- 10^5
betasim_rw <- matrix(beta_start, nrow = nsim, ncol = k, byrow = TRUE)
colnames(betasim_rw) <- colnames(X)

# ---- run algorithm ----
betasim_rw_R <- rw_metropolis_hastings(y, X, betasim_rw,
				       P,
				       b_0, B_0,
				       nsim)

Figure 1: Random-walk MH: Markov chain exploring (p(beta|y, X))

Figure 1: Random-walk MH: Markov chain exploring (p(beta|y, X))

Figure 2: Autocorrelation plots of the random-walk MH draws

Figure 2: Autocorrelation plots of the random-walk MH draws

We run the random-walk MH algorithm 100,000 times and discard the first 1,000 draws. To rather quickly assess stationarity of the Markov chains, we plot the values generated at each iteration of the MH algorithm separately for each covarate. As we can see in figure 1, the chains seem to have reached their stationary state quite soon, we don’t see the chains wandering around very much.5 The autocorrelation plots (see figure 2) display a rather high correlation between sample draws, indicating a somewhat inefficient sampling scheme.

Table 1: Random-walk MH: Bayesian Poisson regression estimates, 95% credible intervals, standard deviations, and probability of negative/positive association
estimate 2.5% 97.5% \sigma_{\textbf{estimate}} P(\beta &lt; 0) P(\beta &gt; 0)
intercept 0.305 0.102 0.503 0.102 0.002 0.998
woman -0.224 -0.332 -0.116 0.055 1.000 0.000
married 0.155 0.034 0.278 0.062 0.005 0.995
kid -0.185 -0.266 -0.107 0.040 1.000 0.000
phd 0.013 -0.037 0.065 0.026 0.317 0.683
mentor 0.025 0.021 0.029 0.002 0.000 1.000

Figure 3: Random-walk MH: Posterior density (p(beta|y, X))

Figure 3: Random-walk MH: Posterior density (p(beta|y, X))

Independent Metropolis-Hastings

As we have seen above, the Metropolis algorithm with random-walk jumping distribution provides quite correlated samples from the posterior. As a matter of fact, the jumping distribution of each candidate is determined by the set of values of the previous iteration. Correlated samples are not an issue per se, the algorithm just needs to run for quite a while to produce a sufficient number of independent draws from the posterior. To avoid the inefficiencies associated with the random-walk, we can implement a jumping distribution that produces independent candidate values.

## ---- Independent Metropolis-Hastings algorithm ----

# ---- independent MH sampler ----
indep_metropolis_hastings <- function(y, X, betasim_indep,
				      beta_1, V_1,
				      b_0, B_0,
				      nsim) {
  # loglikelihood
  loglik <- function(betastar, y, X) {
    ll <- sum(dpois(y, lambda = exp(X %*% betastar), log = TRUE))
    return(ll)
  }
  # log-prior function
  lprior <- function(betastar, b_0, B_0) {
    lp <- dmvnorm(betastar, mean = b_0, sigma = B_0, log = TRUE)
    # requires the package "mvtnorm"
    return(lp)
  }
  # jumping distribution
  j <- function(beta_t, P) {
    proposal <- mvrnorm(1, beta_t, P)
    return(proposal)
  }
  # sampler
  set.seed(100)
  for (t in 2:nsim) {
    # step 1: proposal values
    betastar <- j(beta_1, V_1)
    # step 2: compute acceptance probability,
    # including correction for asymmetric proposal distribution
    r <- exp(loglik(betastar, y, X) -
	     loglik(betasim_indep[t-1,], y, X) +
	     lprior(betastar, b_0, B_0) -
	     lprior(betasim_indep[t-1,], b_0, B_0) +
	     dmvnorm(betasim_indep[t-1,], beta_1, V_1, log = TRUE) -
	     dmvnorm(betastar, beta_1, V_1, log = TRUE))
    # step 3: accept/reject
    alpha <- min(r, 1)
    if (runif(1) <= alpha) {
      betasim_indep[t, ] <- betastar
    } else {
      betasim_indep[t, ] <- betasim_indep[t-1, ]
    }
  }
  return(betasim_indep)
}

# ---- proposal density ----
beta_hat <- coef(inits)
V <- vcov(inits)
k <- length(beta_hat)

# proposal mean and variance
beta_1 <- solve(solve(B_0) + solve(V)) %*% (solve(B_0) %*% b_0 + solve(V) %*% beta_hat)
Tmat <- diag(rep(1.1, k))
V_1 <- Tmat %*% solve(solve(B_0) + solve(V)) %*% Tmat

# ---- some housekeeping: results object ----
nsim <- 10^4
betasim_indep <- matrix(beta_hat, nrow = nsim, ncol = k, byrow = TRUE)
colnames(betasim_indep) <- colnames(X)

# ---- run algorithm ----
betasim_indep_R <- indep_metropolis_hastings(y, X, betasim_indep,
					     beta_1, V_1,
					     b_0, B_0,
					     nsim)

To keep things simple, we will generate candidates from a multivariate normal distribution: \beta^* \sim N(\beta_1, V_1) .6 Jackman (2009)7 suggests to set \beta_1 to the precision-weighted average \beta_1 = (B_0^{-1} + V^{-1})^{-1}(B_0^{-1}b_0 + V^{-1} \hat\beta) and V_1 = T(B_0^{-1} + V^{-1})^{-1}T , with T = 1.1 \cdot I . Given that the samples generated by the independent jumping distribution are uncorrelated (see figure 4), we run the algorithm for only 10,000 iterations; the results are (almost) identical (table 2) to those obtained from the random-walk.

Figure 4: Autocorrelation plots of the independent MH draws

Figure 4: Autocorrelation plots of the independent MH draws

Table 2: Independent MH: Bayesian Poisson regression estimates, 95% credible intervals, standard deviations, and probability of negative/positive association
estimate 2.5% 97.5% \sigma_{\textbf{estimate}} P(\beta &lt; 0) P(\beta &gt; 0)
intercept 0.301 0.096 0.504 0.104 0.001 0.999
woman -0.224 -0.334 -0.117 0.056 1.000 0.000
married 0.156 0.037 0.280 0.062 0.006 0.994
kid -0.185 -0.264 -0.107 0.040 1.000 0.000
phd 0.013 -0.038 0.065 0.027 0.311 0.689
mentor 0.025 0.022 0.029 0.002 0.000 1.000

Concluding remarks

In this post, we attempted to shed some light on Bayesian estimation algorithms, which at first glance may seem quite arcane. More specifically, we looked at the Metropolis-Hastings algorithm, which at the end of the day is a surprisingly simple and elegant tool to obtain posterior densities. As a matter of fact, the MH algorithm consists of very few building blocks — a likelihood function, a prior density, and a proposal distribution — and works according to the following scheme:

  1. generate a candidate value.
  2. compute the ratio of the posterior (likelihood \times prior) evaluated at the candidate value and the posterior evaluated at the value of the previous iteration
  3. accept the candidate if the ratio is at least 1, otherwise reject it with a nonzero probability. Continue with step 1.

We have seen that the choice of jumping distribution makes a difference with regard to the sampling efficiency and thus the algorithm’s required run time: Independent proposal densities may for instance allow very efficient (and thus shorter) sampling from the posterior distribution; but only when fitting the target density well. An ill-fitting independent jumping distribution on the other hand might lead to a very high rejection rate and poor exploration of the posterior density. Random-walk proposal distributions alleviate this problem, at the cost of higher autocorrelations.

Last, the above R implementation of the Metropolis-Hastings algorithm takes some time to run the 100,000 respectively 10,000 iterations, even though it already uses vectorized R functions. We will look at a lower-level (and faster) implementation of the MH algorithm in C in another post.

Note: The code used for the above computations can be found at https://git.staudtlex.de/blog/metropolis-hastings-r.


  1. See Chib/Greenberg (1995) for more details on the MH algorithm. ↩︎

  2. With symmetric proposal distributions, P_J(\theta^*|\theta^{(t-1)}) = P_J(\theta^{(t-1)}|\theta^*) ,

    \begin{align*} \alpha &amp;= \text{min}\Bigg(\frac{p(y|\theta^*)p(\theta^*)}{p(y|\theta^{(t-1)})p(\theta^{(t-1)})}, 1\Bigg) \end{align*}

    This corresponds to the transition probabilities of the Metropolis algorithm. ↩︎

  3. Convergence of MCMC samplers may or may not be discussed in a future post. ↩︎

  4. The data frame bioChemists is part of Simon Jackman’s R package pscl↩︎

  5. Usually, we would additionally run stationarity-related tests such as the Geweke diagnostic, Heidelberger-Welch test for nonstationarity, Raftery-Lewis estimate, and the Gelman-Rubin statistic (when running several algorithms in parallel). ↩︎

  6. With this jumping distribution, the probability of obtaining \beta^* from N(\beta_1, V_1) ( \beta_1 ist fixed) is in general different from obtaining \beta^{(t-1)} . We correct for these unequal probabilities by including the ratio P(\beta^{(t-1)})/P(\beta^*) ↩︎

  7. Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. Wiley: Chichester. ↩︎