Bayesian modelling

Reminder: Metropolis–Hastings algorithm

Starting from an initial value $θ_{0}$ :

Draw a proposal value $θ_{t}^{⋆} \sim q (θ ∣ θ_{t - 1})$.
Compute the acceptance ratio $R = \frac{p (θ_{t}^{⋆})}{p (θ_{t - 1})} \frac{q (θ_{t - 1} ∣ θ_{t}^{⋆})}{q (θ_{t}^{⋆} ∣ θ_{t - 1})}$.
With probability $\min \{R, 1\}$, accept the proposal and set $θ_{t} \leftarrow θ_{t}^{⋆}$; otherwise keep the previous state, $θ_{t} \leftarrow θ_{t - 1}$.

Calculations

We compute the log of the acceptance ratio, $\ln R$, to avoid numerical overflow. The log posterior difference is $\ln \left\{\frac{p (θ_{t}^{⋆})}{p (θ_{t - 1})}\right\} = ℓ (θ_{t}^{⋆}) + \ln p (θ_{t}^{⋆}) - ℓ (θ_{t - 1}) - \ln p (θ_{t - 1})$, where $ℓ$ is the log likelihood and $p$ on the right-hand side denotes the prior density.

Accept the proposal if $\ln R > \log (U)$, where $U \sim unif (0, 1)$. The comparison only matters when $\ln R$ is less than zero, since the proposal is always accepted otherwise.
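The accept/reject step on the log scale can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper `mh_step` and its arguments `logtarget`, `rprop` and `dprop` are hypothetical names, and the exponential target with a log-normal multiplicative proposal is chosen only for the example.

```r
# One Metropolis-Hastings update computed on the log scale.
# logtarget(theta): unnormalized log posterior (assumed user supplied)
# rprop(curr): draws a proposal given the current state
# dprop(x, given): log proposal density of x given the conditioning state
mh_step <- function(curr, logtarget, rprop, dprop) {
  prop <- rprop(curr)
  logR <- logtarget(prop) - logtarget(curr) +
    dprop(curr, given = prop) - dprop(prop, given = curr)
  # Accept if log(U) < log(R); always true when log(R) >= 0
  if (!is.na(logR) && log(runif(1)) < logR) prop else curr
}

set.seed(1)
# Toy target (assumption): exponential distribution with rate 2
logtarget <- function(theta) dexp(theta, rate = 2, log = TRUE)
# Asymmetric proposal: multiplicative log-normal random walk
rprop <- function(curr) curr * exp(rnorm(1, sd = 0.8))
dprop <- function(x, given) {
  dlnorm(x, meanlog = log(given), sdlog = 0.8, log = TRUE)
}
draws <- numeric(10000)
theta <- 1
for (b in seq_along(draws)) {
  theta <- mh_step(theta, logtarget, rprop, dprop)
  draws[b] <- theta
}
mean(draws) # should be close to the target mean of 0.5
```

Because the proposal is not symmetric here, the ratio of proposal densities does not cancel and must be included in `logR`.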

What proposal?

The independence Metropolis–Hastings algorithm uses a global proposal $q$ which does not depend on the current state (typically centered at the MAP).

This may be problematic with multimodal targets.

The Gaussian random walk takes $θ_{t}^{⋆} = θ_{t - 1} + σ_{p} Z$ , where $Z \sim Gauss (0, 1)$ and $σ_{p}$ is the proposal standard deviation. Random walks allow us to explore the space.

Burn in

We are guaranteed to reach stationarity with Metropolis–Hastings, but it may take a large number of iterations…

One should discard initial draws during a burn in or warmup period if the chain has not reached stationarity. Ideally, use good starting values to reduce waste.

We can also use the warmup period to adapt the variance of the proposal.

Goldilocks principle and proposal variance

Mixing of the chain requires just the right variance (not too small nor too large).

Figure 1: Example of traceplot with proposal variance that is too small (top), adequate (middle) and too large (bottom).

Correlograms for Goldilocks

Figure 2: Correlogram for the three Markov chains.
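The effect of the proposal variance on mixing can be reproduced with a small experiment. The sketch below assumes a standard Gaussian target and three arbitrary proposal standard deviations; the function `rw_metropolis` is a hypothetical helper written for this illustration.

```r
set.seed(80601)
# Random walk Metropolis for a standard Gaussian target (toy assumption)
rw_metropolis <- function(B, sd_prop, init = 0) {
  chain <- numeric(B)
  curr <- init
  naccept <- 0L
  for (b in seq_len(B)) {
    prop <- curr + sd_prop * rnorm(1)
    # Symmetric proposal: the ratio of proposal densities cancels
    logR <- dnorm(prop, log = TRUE) - dnorm(curr, log = TRUE)
    if (log(runif(1)) < logR) {
      curr <- prop
      naccept <- naccept + 1L
    }
    chain[b] <- curr
  }
  list(chain = chain, accept_rate = naccept / B)
}
# Proposal standard deviations: too small, adequate, too large
sds <- c(small = 0.05, adequate = 2.4, large = 50)
runs <- lapply(sds, function(s) rw_metropolis(B = 5000L, sd_prop = s))
# Acceptance rate and lag-one autocorrelation of each chain
accept <- sapply(runs, function(r) r$accept_rate)
lag1 <- sapply(runs, function(r) {
  acf(r$chain, lag.max = 1, plot = FALSE)$acf[2]
})
rbind(accept, lag1)
```

A very high acceptance rate (tiny steps) and a very low one (mostly rejected steps) both lead to lag-one autocorrelations close to one, whereas the intermediate choice gives the lowest autocorrelation of the three.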

Tuning Markov chain Monte Carlo

Beyond the starting values, the variance of the proposal has a huge impact on the asymptotic variance.
We can adapt the variance during warmup by increasing (decreasing) the proposal variance if the acceptance rate is too large (small).
We can monitor this via the acceptance rate (the proportion of proposals that are accepted).
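One simple way to carry out such an adaptation is sketched below. The multiplicative update rule, the batch size of 100 and the Gaussian toy target are assumptions made for the illustration; this is not the rule implemented in `hecbayes::adaptive`.

```r
set.seed(2023)
# Toy target (assumption): Gaussian with mean 1 and standard deviation 3
logtarget <- function(theta) dnorm(theta, mean = 1, sd = 3, log = TRUE)
warmup <- 5000L
target_rate <- 0.44  # rule of thumb for a one-dimensional update
sd_prop <- 0.01      # deliberately poor starting value
curr <- 0
naccept <- 0L
for (i in seq_len(warmup)) {
  prop <- curr + sd_prop * rnorm(1)
  if (log(runif(1)) < logtarget(prop) - logtarget(curr)) {
    curr <- prop
    naccept <- naccept + 1L
  }
  # Every 100 iterations, compare the batch acceptance rate to the target
  if (i %% 100L == 0L) {
    rate <- naccept / 100
    # Too many acceptances: steps are too small, increase the std. dev.
    # Too few acceptances: steps are too large, decrease the std. dev.
    sd_prop <- sd_prop * exp(rate - target_rate)
    naccept <- 0L
  }
}
sd_prop # proposal standard deviation retained after warmup
```

The adapted value is then held fixed for the sampling phase, so that the draws kept after warmup come from a chain with a fixed transition kernel.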

Optimal acceptance rates

The following rules were derived for Gaussian targets under idealized situations.

In 1D, the rule of thumb is that an acceptance rate of $0.44$ is optimal; the optimal rate decreases toward $0.234$ as the dimension grows, and $0.234$ is the usual target when $D \geq 2$ (Sherlock, 2013) for random walk Metropolis–Hastings.
Proposals for a $D$-variate update should have proposal variance of roughly $({2.38}^{2} / D) \times Σ$, where $Σ$ is the posterior variance.
For MALA (see later), the optimal rate is $0.574$ rather than $0.234$.
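The scaling rule for the proposal variance is easy to compute once an estimate of the posterior variance is available. In the sketch below, `rw_proposal_cov` is a hypothetical helper and `Sigma_hat` is a made-up covariance matrix used only for illustration.

```r
# Proposal covariance (2.38^2 / D) * Sigma for a D-variate random walk update
rw_proposal_cov <- function(Sigma) {
  Sigma <- as.matrix(Sigma)
  D <- ncol(Sigma)
  (2.38^2 / D) * Sigma
}
# Hypothetical estimate of the posterior covariance for D = 2 parameters
Sigma_hat <- matrix(c(1, 0.5, 0.5, 2), nrow = 2)
prop_cov <- rw_proposal_cov(Sigma_hat)
prop_cov
```

In practice, $Σ$ is unknown and could be approximated, for example, by the inverse Hessian at the MAP or by the empirical covariance of warmup draws.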

Block update or one parameter at a time?

As with any accept-reject algorithm, proposals become inefficient when the dimension $D$ increases.

This is the curse of dimensionality.

Updating parameters in turn

increases acceptance rate (with clever proposals),
but also leads to more autocorrelation between parameters

Solutions for strongly correlated coefficients

Reparametrize the model to decorrelate variables (orthogonal parametrization).
Block updates: draw correlated parameters together
- using the chain history to learn the correlation, if necessary

Parameter transformation

Parameters may be bounded, e.g. $θ_{i} \in [a, b]$ .

We can ignore this and simply discard proposals outside of the range, by setting the log posterior at $- \infty$ outside $[a, b]$
We can do a transformation, e.g., $\log θ_{i}$ if $θ_{i} > 0$ and perform a random walk on the unconstrained space: don’t forget Jacobians for $q (\cdot)$ !
Another alternative is to use truncated proposals (useful with more complex algorithms like MALA)
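The log transformation can be sketched as follows for a positive parameter. The Gamma target is an assumption chosen so that the answer is known; the random walk is run on $\phi = \log θ$, and the log Jacobian $\phi$ is added to the log posterior, which is equivalent to accounting for the Jacobian in the proposal density on the original scale.

```r
set.seed(42)
# Toy target (assumption): Gamma(shape = 3, rate = 2) for theta > 0
logpost <- function(theta) dgamma(theta, shape = 3, rate = 2, log = TRUE)
# Log density of phi = log(theta): add log |d theta / d phi| = phi
logpost_unconstrained <- function(phi) logpost(exp(phi)) + phi
B <- 10000L
phi <- numeric(B)
curr <- 0
for (b in seq_len(B)) {
  # Gaussian random walk on the unconstrained (log) scale
  prop <- curr + 0.9 * rnorm(1)
  logR <- logpost_unconstrained(prop) - logpost_unconstrained(curr)
  if (log(runif(1)) < logR) {
    curr <- prop
  }
  phi[b] <- curr
}
# Back-transform the draws to the original scale
theta <- exp(phi)
mean(theta) # should be close to the target mean of 3 / 2
```

No proposal is ever wasted outside of the support, at the cost of working out the Jacobian of the transformation.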

Efficient proposals: MALA

The Metropolis-adjusted Langevin algorithm (MALA) uses a Gaussian random walk proposal $θ_{t}^{⋆} \sim Gauss \{μ (θ_{t - 1}), τ^{2} A\},$ with mean $μ (θ_{t - 1}) = θ_{t - 1} + A η \nabla \log p (θ_{t - 1} ∣ y),$ and variance $τ^{2} A$, for some mass matrix $A$ and tuning parameter $τ > 0$.

The parameter $η < 1$ is a learning rate. This is akin to a Newton algorithm, so beware if you are far from the mode (where the gradient is typically large)!

Higher order proposals

For a single parameter update $θ$, a Taylor series expansion of the log posterior $f$ around the current value suggests using as proposal density a Gaussian approximation with (Rue & Held, 2005)

mean $μ_{t - 1} = θ_{t - 1} - f^{'} (θ_{t - 1}) / f^{″} (θ_{t - 1})$ and
precision $τ^{- 2} = - f^{″} (θ_{t - 1})$

We need $f^{″} (θ_{t - 1})$ to be negative!

This adapts the proposal variance locally, whereas MALA uses a global variance.
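The proposal mean and variance implied by the expansion can be computed as in the sketch below. The helper `taylor_proposal` and the derivative functions `f1` and `f2` are hypothetical names; the Gaussian log density is used as a check because the quadratic expansion is exact in that case.

```r
# Mean and variance of the Gaussian proposal from a second-order expansion
# f1, f2: first and second derivatives of the log posterior (user supplied)
taylor_proposal <- function(theta, f1, f2) {
  hess <- f2(theta)
  if (hess >= 0) {
    stop("the second derivative must be negative at the current value")
  }
  list(mean = theta - f1(theta) / hess, var = -1 / hess)
}
# Illustration: f is the log density of a Gaussian with mean 2, variance 0.25
f1 <- function(theta) -(theta - 2) / 0.25
f2 <- function(theta) -1 / 0.25
pars <- taylor_proposal(theta = 5, f1 = f1, f2 = f2)
pars$mean # 2, regardless of the current value
pars$var  # 0.25
```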

Higher order and moves

For MALA and related methods, the proposal is not symmetric: we also need to evaluate the proposal density of the reverse move, obtained from the expansion centered at the proposal, with mean $μ (θ_{t}^{⋆})$.

These methods are more efficient than random walk Metropolis–Hastings, but they require the gradient and possibly the Hessian of the log posterior (obtained analytically, by automatic differentiation, or numerically).

Modelling individual headlines of Upworthy example

The number of conversions, clicks, is binomial with sample size $n_{i} =$ impressions.

Since $n_{i}$ is large, the sample average clicks/impressions is approximately Gaussian, so write

$\begin{aligned} Y_{i} & \sim Gauss (μ, σ^{2} / n_{i}) \\ μ & \sim trunc . Gauss (0.01, {0.1}^{2}, 0, 1) \\ σ & \sim expo (0.7) \end{aligned}$

MALA: data set-up

data(upworthy_question, package = "hecbayes")
# Select data for a single question
qdata <- upworthy_question |>
  dplyr::filter(question == "yes") |>
  dplyr::mutate(y = clicks/impressions,
                no = impressions)

MALA: define functions

# Create functions with the same signature (...) for the algorithm
logpost <- function(par, data, ...){
  mu <- par[1]; sigma <- par[2]
  no <- data$no
  y <- data$y
  if(isTRUE(any(sigma <= 0, mu < 0, mu > 1))){
    return(-Inf)
  }
  dnorm(x = mu, mean = 0.01, sd = 0.1, log = TRUE) +
  dexp(sigma, rate = 0.7, log = TRUE) + 
  sum(dnorm(x = y, mean = mu, sd = sigma/sqrt(no), log = TRUE))
}

MALA: compute gradient of log posterior

logpost_grad <- function(par, data, ...){
   no <- data$no
  y <- data$y
  mu <- par[1]; sigma <- par[2]
  c(sum(no*(y-mu))/sigma^2 -(mu - 0.01)/0.01,
    -length(y)/sigma + sum(no*(y-mu)^2)/sigma^3 -0.7
  )
}
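Before using an analytical gradient in MALA, it is prudent to compare it with a numerical approximation. The sketch below repeats the two functions above in compact form so that it runs on its own; the data frame `fake` is simulated (it is not the Upworthy data) and `num_grad` is a hypothetical helper based on central finite differences.

```r
set.seed(1234)
# Simulated stand-in for the data (assumption, not the real data set)
fake <- data.frame(no = sample(2000:6000, size = 20))
fake$y <- rnorm(20, mean = 0.012, sd = 0.4 / sqrt(fake$no))
logpost <- function(par, data) {
  mu <- par[1]; sigma <- par[2]
  if (isTRUE(any(sigma <= 0, mu < 0, mu > 1))) return(-Inf)
  dnorm(mu, mean = 0.01, sd = 0.1, log = TRUE) +
    dexp(sigma, rate = 0.7, log = TRUE) +
    sum(dnorm(data$y, mean = mu, sd = sigma / sqrt(data$no), log = TRUE))
}
logpost_grad <- function(par, data) {
  mu <- par[1]; sigma <- par[2]
  c(sum(data$no * (data$y - mu)) / sigma^2 - (mu - 0.01) / 0.01,
    -length(data$y) / sigma +
      sum(data$no * (data$y - mu)^2) / sigma^3 - 0.7)
}
# Central finite differences, one coordinate at a time
num_grad <- function(fn, par, ..., eps = 1e-6) {
  vapply(seq_along(par), function(j) {
    step <- replace(numeric(length(par)), j, eps)
    (fn(par + step, ...) - fn(par - step, ...)) / (2 * eps)
  }, numeric(1))
}
par0 <- c(0.015, 0.5)
analytic <- logpost_grad(par0, data = fake)
numeric_approx <- num_grad(logpost, par0, data = fake)
rbind(analytic, numeric_approx)
```

The two rows should agree to several significant digits; a discrepancy usually points to a sign or scaling error in the analytical derivative.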

MALA: compute maximum a posteriori

# Starting values - MAP
map <- optim(
  par = c(mean(qdata$y), 0.5),
  fn = function(x){-logpost(x, data = qdata)},
  gr = function(x){-logpost_grad(x, data = qdata)},  
  hessian = TRUE,
  method = "BFGS")
# Check convergence 
logpost_grad(map$par, data = qdata)

MALA: starting values and mass matrix

# Set initial parameter values
curr <- map$par 
# Compute a mass matrix
Amat <- solve(map$hessian)
# Cholesky root - for random number generation
cholA <- chol(Amat)

MALA: containers and setup

# Create containers for MCMC
B <- 1e4L # number of iterations
warmup <- 1e3L # adaptation period
npar <- 2L
prop_sd <- rep(1, npar) # tuning parameter
chains <- matrix(nrow = B, ncol = npar)
damping <- 0.8
acceptance <- attempts <- 0 
colnames(chains) <- names(curr) <- c("mu","sigma")
# Proposal variance proportional to inverse hessian at MAP
prop_var <- diag(prop_sd) %*% Amat %*% diag(prop_sd)

MALA: sample proposal with Newton step

for(i in seq_len(B + warmup)){
  ind <- pmax(1, i - warmup)
  # Compute the proposal mean for the Newton step
  prop_mean <- c(curr + damping * 
     Amat %*% logpost_grad(curr, data = qdata))
  # prop <- prop_sd * c(rnorm(npar) %*% cholA) + prop_mean
  prop <- c(mvtnorm::rmvnorm(
    n = 1,
    mean = prop_mean, 
    sigma = prop_var))
#  [...]

MALA: reverse step

  # Compute the reverse step
  curr_mean <- c(prop + damping * 
     Amat %*% logpost_grad(prop, data = qdata))
  # log of ratio of bivariate Gaussian densities:
  # reverse move (prop -> curr) minus forward move (curr -> prop)
  logmh <- mvtnorm::dmvnorm(
    x = curr, mean = curr_mean, 
    sigma = prop_var, 
    log = TRUE) - 
    mvtnorm::dmvnorm(
      x = prop, 
      mean = prop_mean, 
      sigma = prop_var, 
      log = TRUE) + 
  logpost(prop, data = qdata) - 
    logpost(curr, data = qdata)

MALA: Metropolis–Hastings ratio

  if(logmh > log(runif(1))){
    curr <- prop
    acceptance <- acceptance + 1L
  }
  attempts <- attempts + 1L
  # Save current value
  chains[ind,] <- curr

MALA: adaptation

  if(i %% 100 == 0 & i < warmup){
    # Check acceptance rate and increase/decrease variance
    out <- hecbayes::adaptive(
      attempts = attempts, # counter for number of attempts
      acceptance = acceptance, 
      sd.p = prop_sd, #current proposal standard deviation
      target = 0.574) # target acceptance rate
    prop_sd <- out$sd # overwrite current std.dev
    acceptance <- out$acc # if we change std. dev, this is set to zero
    attempts <- out$att # idem, otherwise unchanged
    prop_var <- diag(prop_sd) %*% Amat %*% diag(prop_sd)
  }
} # End of MCMC for loop
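The steps of the loop can be seen together in a compact, self-contained sketch that uses base R only. The bivariate Gaussian target, the values of `tau` and `eta`, and the helper `ldmvn` are assumptions made for this illustration; there is no adaptation, so the tuning parameters stay fixed.

```r
set.seed(2024)
# Toy target (assumption): centered bivariate Gaussian with covariance Sigma
Sigma <- matrix(c(1, 0.8, 0.8, 2), nrow = 2)
Q <- solve(Sigma)
logtarget <- function(theta) -0.5 * sum(theta * (Q %*% theta))
logtarget_grad <- function(theta) -c(Q %*% theta)
# Gaussian log density up to a constant, which cancels in the ratio
# because forward and reverse moves share the same covariance
ldmvn <- function(x, mean, prec) {
  d <- x - mean
  -0.5 * sum(d * (prec %*% d))
}
A <- Sigma   # mass matrix
tau <- 1     # scale of the proposal
eta <- 0.5   # learning rate
prop_var <- tau^2 * A
prop_prec <- solve(prop_var)
cholV <- chol(prop_var) # upper triangular R with t(R) %*% R = prop_var
B <- 5000L
chain <- matrix(NA_real_, nrow = B, ncol = 2)
curr <- c(3, -3)
accepted <- 0L
for (b in seq_len(B)) {
  # Forward move: mean of the proposal starting from the current value
  fwd_mean <- curr + eta * c(A %*% logtarget_grad(curr))
  prop <- fwd_mean + c(crossprod(cholV, rnorm(2)))
  # Reverse move: mean of the proposal starting from the proposed value
  rev_mean <- prop + eta * c(A %*% logtarget_grad(prop))
  logR <- logtarget(prop) - logtarget(curr) +
    ldmvn(curr, rev_mean, prop_prec) - ldmvn(prop, fwd_mean, prop_prec)
  if (log(runif(1)) < logR) {
    curr <- prop
    accepted <- accepted + 1L
  }
  chain[b, ] <- curr
}
accepted / B    # acceptance rate
colMeans(chain) # should be close to zero
```

The density of the reverse move is evaluated at the current value with the mean computed from the proposal, while the forward density is evaluated at the proposal with the mean computed from the current value.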

Gibbs sampling

The Gibbs sampling algorithm builds a Markov chain by iterating through a sequence of conditional distributions.

Figure 3: Sampling trajectory for a bivariate target using Gibbs sampling.

Gibbs sampler

Split the parameter vector $θ \in Θ \subseteq \mathbb{R}^{p}$ into $m \leq p$ blocks, $θ^{[j]}$ ($j = 1, \dots, m$), such that, conditional on the remaining components of the parameter vector $θ^{- [j]}$, the conditional posterior $p (θ^{[j]} ∣ θ^{- [j]}, y)$ is a known distribution from which we can easily simulate.

Gibbs sampling update

At iteration $t$, we update each block in turn: the update of the $k$th block conditions on the partially updated state $\begin{array}{r} θ^{- [k] ⋆} = (θ_{t}^{[1]}, \dots, θ_{t}^{[k - 1]}, θ_{t - 1}^{[k + 1]}, \dots, θ_{t - 1}^{[m]}) \end{array}$ which corresponds to the current value of the remaining blocks after the first $k - 1$ updates.
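The use of the partially updated state is easiest to see with two blocks. The sketch below assumes a bivariate standard Gaussian target with correlation `rho`, for which both conditional distributions are Gaussian; the setup is illustrative only.

```r
set.seed(2024)
rho <- 0.9 # correlation of the bivariate Gaussian target (assumption)
B <- 10000L
chain <- matrix(NA_real_, nrow = B, ncol = 2)
theta <- c(-3, 3) # starting value
for (b in seq_len(B)) {
  # Block 1 given the value of block 2 from the previous iteration
  theta[1] <- rnorm(1, mean = rho * theta[2], sd = sqrt(1 - rho^2))
  # Block 2 given the freshly updated value of block 1
  theta[2] <- rnorm(1, mean = rho * theta[1], sd = sqrt(1 - rho^2))
  chain[b, ] <- theta
}
colMeans(chain)              # close to zero
cor(chain[, 1], chain[, 2])  # close to rho
```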

Notes on Gibbs sampling

Special case of Metropolis–Hastings with conditional density as proposal $q$ .
The benefit is that all proposals get accepted, $R = 1$ !
No tuning parameter, but parametrization matters.
Automatic acceptance does not equal efficiency.

To check the validity of the Gibbs sampler, see the methods proposed in Geweke (2004).

Efficiency of Gibbs sampling

As the dimension of the parameter space increases, and as the correlation between components becomes larger, the efficiency of the Gibbs sampler degrades

Figure 4: Trace plots (top) and correlograms (bottom) for the first component of a Gibbs sampler with $d = 20$ equicorrelated Gaussian variates with correlation $ρ = 0.9$ (left) and $d = 3$ with equicorrelation $ρ = 0.5$ (right).

Gibbs sampling requires work!

You need to determine all of the relevant conditional distributions, which often relies on setting conditionally conjugate priors.
In large models with multiple layers, full conditionals may only depend on a handful of parameters (via directed acyclic graph and moral graph of the model; not covered).

Example of Gibbs sampling

Consider independent and identically distributed observations, with $\begin{aligned} Y_{i} & \sim Gauss (μ, τ), \quad i = 1, \dots, n \\ μ & \sim Gauss (ν, ω) \\ τ & \sim inv . gamma (α, β) \end{aligned}$

The joint posterior is not available in closed form, but the independent priors for the mean and variance of the observations are conditionally conjugate.

Joint posterior for Gibbs sampler

Write the posterior density as usual, $\begin{aligned} p (μ, τ ∣ y) \propto {} & τ^{- α - 1} \exp (- β / τ) \\ & \times τ^{- n / 2} \exp \left\{- \frac{1}{2 τ} \left(\sum_{i = 1}^{n} y_{i}^{2} - 2 μ \sum_{i = 1}^{n} y_{i} + n μ^{2}\right)\right\} \\ & \times \exp \left\{- \frac{(μ - ν)^{2}}{2 ω}\right\} \end{aligned}$

Recognizing distributions from posterior

Consider the conditional densities of each parameter in turn (up to proportionality): $\begin{aligned} p (μ ∣ τ, y) & \propto \exp \left\{- \frac{1}{2} \left(\frac{μ^{2} - 2 μ \bar{y}}{τ / n} + \frac{μ^{2} - 2 ν μ}{ω}\right)\right\} \\ p (τ ∣ μ, y) & \propto τ^{- n / 2 - α - 1} \exp \left[- \frac{1}{τ} \left\{\frac{\sum_{i = 1}^{n} (y_{i} - μ)^{2}}{2} + β\right\}\right] \end{aligned}$

Gibbs sampler

We can simulate in turn $\begin{aligned} μ_{t} ∣ τ_{t - 1}, y & \sim Gauss \left(\frac{n \bar{y} ω + τ_{t - 1} ν}{τ_{t - 1} + n ω}, \frac{ω τ_{t - 1}}{τ_{t - 1} + n ω}\right) \\ τ_{t} ∣ μ_{t}, y & \sim inv . gamma \left\{\frac{n}{2} + α, \frac{\sum_{i = 1}^{n} (y_{i} - μ_{t})^{2}}{2} + β\right\} . \end{aligned}$
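These two updates translate directly into code. The sketch below uses simulated data and arbitrary hyperparameter values (both assumptions); inverse gamma draws are obtained as reciprocals of gamma draws with the same shape and with rate equal to the inverse gamma scale.

```r
set.seed(80601)
# Simulated data and hyperparameters (assumptions for illustration only)
n <- 50L
y <- rnorm(n, mean = 2, sd = 1.5)
nu <- 0; omega <- 100 # prior mean and variance of mu
alpha <- 2; beta <- 2 # shape and scale of the inverse gamma prior of tau
B <- 5000L
draws <- matrix(NA_real_, nrow = B, ncol = 2,
                dimnames = list(NULL, c("mu", "tau")))
ybar <- mean(y)
tau <- var(y) # starting value for the variance
for (b in seq_len(B)) {
  # mu | tau, y is Gaussian
  mu <- rnorm(1,
    mean = (n * ybar * omega + tau * nu) / (tau + n * omega),
    sd = sqrt(omega * tau / (tau + n * omega)))
  # tau | mu, y is inverse gamma: reciprocal of a gamma variate
  tau <- 1 / rgamma(1,
    shape = n / 2 + alpha,
    rate = sum((y - mu)^2) / 2 + beta)
  draws[b, ] <- c(mu, tau)
}
post_mean <- colMeans(draws)
post_mean
```

With a vague prior for $μ$, the posterior mean of $μ$ should be close to the sample mean and that of $τ$ close to the sample variance.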

Data augmentation and auxiliary variables

When the likelihood $p (y; θ)$ is intractable or costly to evaluate (e.g., mixtures, missing data, censoring), auxiliary variables are introduced to simplify calculations.

Consider auxiliary variables $U \in \mathbb{R}^{k}$ such that $\int_{\mathbb{R}^{k}} p (U, θ ∣ y) d U = p (θ ∣ y),$ i.e., the marginal distribution is that of interest, but evaluation of $p (U, θ; y)$ is cheaper.

Bayesian augmentation

The data augmentation algorithm (Tanner & Wong, 1987) consists of running a Markov chain on the augmented state space $Θ \times \mathbb{R}^{k}$, simulating in turn from the conditionals

$p (U ∣ θ, y)$ and
$p (θ ∣ U, y)$

For more details and examples, see van Dyk & Meng (2001) and Hobert (2011).

Data augmentation: probit example

Consider independent binary responses $Y_{i}$ , with $\begin{array}{r} p_{i} = Pr (Y_{i} = 1) = Φ (β_{0} + β_{1} X_{i 1} + \dots + β_{p} X_{i p}), \end{array}$ where $Φ$ is the distribution function of the standard Gaussian distribution. The likelihood of the probit model is $L (β; y) = \prod_{i = 1}^{n} p_{i}^{y_{i}} (1 - p_{i})^{1 - y_{i}},$ and this prevents easy simulation.

Probit augmentation

We can consider a data augmentation scheme with $Y_{i} = I (Z_{i} > 0)$, where $Z_{i} \sim Gauss (x_{i} β, 1)$ and $x_{i}$ is the $i$th row of the design matrix.

The augmented data likelihood is $\begin{aligned} p (z, y ∣ β) \propto {} & \exp \left\{- \frac{1}{2} (z - X β)^{⊤} (z - X β)\right\} \\ & \times \prod_{i = 1}^{n} I (z_{i} > 0)^{y_{i}} I (z_{i} \leq 0)^{1 - y_{i}} \end{aligned}$

Conditional distributions for probit regression

With a flat prior for $β$, the conditional distributions are $\begin{aligned} β ∣ z, y & \sim Gauss \{\hat{β}, (X^{⊤} X)^{- 1}\} \\ Z_{i} ∣ y_{i}, β & \sim \begin{cases} trunc . Gauss (x_{i} β, 1, - \infty, 0) & y_{i} = 0 \\ trunc . Gauss (x_{i} β, 1, 0, \infty) & y_{i} = 1, \end{cases} \end{aligned}$ with $\hat{β} = (X^{⊤} X)^{- 1} X^{⊤} z$ the ordinary least squares estimator.
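A minimal sketch of the resulting Gibbs sampler follows, assuming a flat prior for the coefficients and simulated data. Truncated Gaussian variates are drawn by inversion of the distribution function, which is adequate here but can be numerically fragile when the linear predictor is far from zero.

```r
set.seed(2024)
# Simulated data (assumption): intercept and a single covariate
n <- 500L
X <- cbind(1, rnorm(n))
beta_true <- c(0.5, -1)
y <- as.integer(X %*% beta_true + rnorm(n) > 0)
XtXinv <- solve(crossprod(X))
cholV <- chol(XtXinv) # upper triangular factor of (X'X)^{-1}
B <- 3000L
draws <- matrix(NA_real_, nrow = B, ncol = ncol(X))
beta <- c(0, 0)
for (b in seq_len(B)) {
  m <- c(X %*% beta)
  # Truncated Gaussian by inversion: Z > 0 if y = 1, Z <= 0 if y = 0
  p0 <- pnorm(-m) # Pr(Z <= 0 | beta)
  u <- ifelse(y == 1,
              runif(n, min = p0, max = 1),
              runif(n, min = 0, max = p0))
  z <- m + qnorm(u)
  # beta | z is Gaussian, centered at the least squares estimate
  beta_hat <- c(XtXinv %*% crossprod(X, z))
  beta <- beta_hat + c(crossprod(cholV, rnorm(ncol(X))))
  draws[b, ] <- beta
}
# Posterior means after discarding 500 warmup draws
post_mean <- colMeans(draws[-(1:500), ])
# Maximum likelihood estimates, for comparison
mle <- unname(coef(glm(y ~ X[, 2], family = binomial(link = "probit"))))
rbind(post_mean, mle)
```

With a flat prior and a moderately large sample, the posterior means should be close to the maximum likelihood estimates.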

Data augmentation with scale mixture of Gaussian

The Laplace distribution with mean $μ$ and scale $σ$ has density $\begin{array}{r} f (x; μ, σ) = \frac{1}{2 σ} \exp \left(- \frac{| x - μ |}{σ}\right), \end{array}$ and can be expressed as a scale mixture of Gaussians: if $Y ∣ τ \sim Gauss (μ, τ)$ and $τ \sim expo \{(2 σ)^{- 1}\}$ (rate parametrization), then marginally $Y \sim Laplace (μ, σ^{1 / 2})$.
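The scale mixture representation can be checked by simulation. The sketch below draws the variance from an exponential distribution with rate $(2 σ)^{- 1}$, the rate parametrization used by `dexp` and `rexp`, then a Gaussian variate given that variance; under this parametrization the marginal Laplace scale works out to $σ^{1 / 2}$. The values of `mu` and `sigma` are arbitrary and `plaplace` is a hypothetical helper.

```r
set.seed(1)
mu <- 1; sigma <- 4
nsim <- 1e5L
# Mixing variances: exponential with rate 1 / (2 * sigma), mean 2 * sigma
tau <- rexp(nsim, rate = 1 / (2 * sigma))
# Conditionally Gaussian draws with variance tau
y <- rnorm(nsim, mean = mu, sd = sqrt(tau))
# Distribution function of the Laplace with location mu and scale b
plaplace <- function(x, mu, b) {
  ifelse(x < mu,
         0.5 * exp((x - mu) / b),
         1 - 0.5 * exp(-(x - mu) / b))
}
grid <- mu + c(-4, -1, 0, 1, 4)
empirical <- sapply(grid, function(q) mean(y <= q))
theoretical <- plaplace(grid, mu = mu, b = sqrt(sigma))
round(rbind(empirical, theoretical), 3)
```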

Joint posterior for Laplace model

With $p (μ, σ) \propto σ^{- 1}$, the joint posterior for the i.i.d. sample is $\begin{aligned} p (τ, μ, σ ∣ y) \propto {} & \left(\prod_{i = 1}^{n} τ_{i}\right)^{- 1 / 2} \exp \left\{- \frac{1}{2} \sum_{i = 1}^{n} \frac{(y_{i} - μ)^{2}}{τ_{i}}\right\} \\ & \times \frac{1}{σ^{n + 1}} \exp \left(- \frac{1}{2 σ} \sum_{i = 1}^{n} τ_{i}\right) \end{aligned}$

Conditional distributions

The conditionals for $μ ∣ \dots$ and $σ ∣ \dots$ are, as usual, Gaussian and inverse gamma, respectively. The variances, $τ_{j}$, are conditionally independent of one another, with $\begin{aligned} p (τ_{j} ∣ μ, σ, y_{j}) & \propto τ_{j}^{- 1 / 2} \exp \left\{- \frac{1}{2} \frac{(y_{j} - μ)^{2}}{τ_{j}} - \frac{1}{2} \frac{τ_{j}}{σ}\right\} \end{aligned}$

Inverse transformation

With the change of variable $ξ_{j} = 1 / τ_{j}$, whose Jacobian is $ξ_{j}^{- 2}$, we have $\begin{aligned} p (ξ_{j} ∣ μ, σ, y_{j}) & \propto ξ_{j}^{- 3 / 2} \exp \left\{- \frac{ξ_{j} (y_{j} - μ)^{2}}{2} - \frac{1}{2 σ} \frac{1}{ξ_{j}}\right\} \end{aligned}$ and we recognize the Wald (or inverse Gaussian) density, where $ξ_{j} \sim Wald (ν_{j}, λ)$ with $ν_{j} = \{σ (y_{j} - μ)^{2}\}^{- 1 / 2}$ and $λ = σ^{- 1}$.

Bayesian LASSO

Park & Casella (2008) use this hierarchical construction to define the Bayesian LASSO. With a model matrix $X$ whose columns are standardized to have mean zero and unit standard deviation, we may write $\begin{aligned} Y ∣ μ, β, σ & \sim {Gauss}_{n} (μ 1_{n} + X β, σ I_{n}) \\ β_{j} ∣ σ, τ_{j} & \sim Gauss (0, σ τ_{j}) \\ τ_{j} & \sim expo (λ / 2) \end{aligned}$

Comment about Bayesian LASSO

If we set an improper prior $p (μ, σ) \propto σ^{- 1}$ , the resulting conditional distributions are all available and thus the model is amenable to Gibbs sampling.
The Bayesian LASSO places a Laplace penalty on the regression coefficients; with the rate parametrization of the exponential, larger values of $λ$ yield more shrinkage.
Contrary to the frequentist setting, none of the posterior draws of $β$ are exactly zero.

Summary

Gibbs sampling is a special case of the Metropolis–Hastings algorithm in which every proposal is accepted.
It requires deriving the conditional distribution of each block of parameters.

References

Dyk, D. A. van, & Meng, X.-L. (2001). The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1), 1–50. https://doi.org/10.1198/10618600152418584

Geweke, J. (2004). Getting it right: Joint distribution tests of posterior simulators. Journal of the American Statistical Association, 99(467), 799–804. https://doi.org/10.1198/016214504000001132

Hobert, J. (2011). The data augmentation algorithm: Theory and methodology. In S. Brooks, A. Gelman, G. Jones, & X. L. Meng (Eds.), Handbook of Markov chain Monte Carlo (pp. 253–293). CRC Press. https://doi.org/10.1201/b10905-11

Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681–686. https://doi.org/10.1198/016214508000000337

Rue, H., & Held, L. (2005). Gaussian Markov random fields: Theory and applications (p. 280). CRC Press.

Sherlock, C. (2013). Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. Journal of Applied Probability, 50(1), 1–15. https://doi.org/10.1239/jap/1363784420

Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540. https://doi.org/10.1080/01621459.1987.10478458