With probability , accept the proposal and set , otherwise set the value to the previous state, .
Calculations
We compute the log of the acceptance ratio, , to avoid numerical overflow, with the log posterior difference
Compare the value of (if less than zero) to , where .
What proposal?
The independence Metropolis–Hastings uses a global proposal which does not depend on the current state (typically centered at the MAP)
This may be problematic with multimodal targets.
The Gaussian random walk takes , where and is the proposal standard deviation. Random walks allow us to explore the space.
Burn in
We are guaranteed to reach stationarity with Metropolis–Hastings, but it may take a large number of iterations…
One should discard initial draws during a burn in or warmup period if the chain has not reached stationarity. Ideally, use good starting value to reduce waste.
We can also use the warmup period to adapt the variance of the proposal.
Goldilock principle and proposal variance
Mixing of the chain requires just the right variance (not too small nor too large).
Figure 1: Example of traceplot with proposal variance that is too small (top), adequate (middle) and too large (bottom).
Correlograms for Goldilock
Figure 2: Correlogram for the three Markov chains.
Tuning Markov chain Monte Carlo
Outside of starting values, the variance of the proposal has a huge impact on the asymptotic variance.
We can adapt the variance during warmup by increasing/decreasing proposal variance (if acceptance rate is too large/small).
We can check this via the acceptance rate (how many proposals are accepted).
Optimal acceptance rates
The following rules were derived for Gaussian targets under idealized situations.
In 1D, rule of thumb is an acceptance rate of is optimal, and this ratio decreases to when (Sherlock, 2013) for random walk Metropolis–Hastings.
Proposals for -variate update should have proposal variance of roughly , where is the posterior variance.
For MALA (see later), we get rather than
Block update or one parameter at a time?
As with any accept-reject, proposals become inefficient when the dimension increase.
but also leads to more autocorrelation between parameters
Solutions for strongly correlated coefficients
Reparametrize the model to decorrelate variables (orthogonal parametrization).
Block updates: draw correlated parameters together
using the chain history to learn the correlation, if necessary
Parameter transformation
Parameters may be bounded, e.g. .
We can ignore this and simply discard proposals outside of the range, by setting the log posterior at outside
We can do a transformation, e.g., if and perform a random walk on the unconstrained space: don’t forget Jacobians for !
Another alternative is to use truncated proposals (useful with more complex algorithms like MALA)
Efficient proposals: MALA
The Metropolis-adjusted Langevin algorithm (MALA) uses a Gaussian random walk proposal with mean and variance , for some mass matrix , tuning parameter .
The parameter is a learning rate. This is akin to a Newton algorithm, so beware if you are far from the mode (where the gradient is typically large)!
Higher order proposals
For a single parameter update , a Taylor series expansion of the log posterior around the current value suggests using as proposal density a Gaussian approximation with (Rue & Held, 2005)
mean and
precision
We need to be negative!
This gives local adaption relative to MALA (global variance).
Higher order and moves
For MALA and cie., we need to compute the density of the proposal also for the reverse move for the expansion starting from the proposal .
These methods are more efficient than random walk Metropolis–Hastings, but they require the gradient and the hessian (can be obtained analytically using autodiff, or numerically).
Modelling individual headlines of Upworthy example
The number of conversions nclick is binomial with sample size nimpression.
Since is large, the sample average nclick/nimpression is approximately Gaussian, so write
MALA: data set-up
data(upworthy_question, package ="hecbayes")# Select data for a single questionqdata <- upworthy_question |> dplyr::filter(question =="yes") |> dplyr::mutate(y = clicks/impressions,no = impressions)
MALA: define functions
# Create functions with the same signature (...) for the algorithmlogpost <-function(par, data, ...){ mu <- par[1]; sigma <- par[2] no <- data$no y <- data$yif(isTRUE(any(sigma <=0, mu <0, mu >1))){return(-Inf) }dnorm(x = mu, mean =0.01, sd =0.1, log =TRUE) +dexp(sigma, rate =0.7, log =TRUE) +sum(dnorm(x = y, mean = mu, sd = sigma/sqrt(no), log =TRUE))}
MALA: compute gradient of log posterior
logpost_grad <-function(par, data, ...){ no <- data$no y <- data$y mu <- par[1]; sigma <- par[2]c(sum(no*(y-mu))/sigma^2-(mu -0.01)/0.01,-length(y)/sigma +sum(no*(y-mu)^2)/sigma^3-0.7 )}
MALA: compute maximum a posteriori
# Starting values - MAPmap <-optim(par =c(mean(qdata$y), 0.5),fn =function(x){-logpost(x, data = qdata)},gr =function(x){-logpost_grad(x, data = qdata)}, hessian =TRUE,method ="BFGS")# Check convergence logpost_grad(map$par, data = qdata)
MALA: starting values and mass matrix
# Set initial parameter valuescurr <- map$par # Compute a mass matrixAmat <-solve(map$hessian)# Cholesky root - for random number generationcholA <-chol(Amat)
MALA: containers and setup
# Create containers for MCMCB <-1e4L # number of iterationswarmup <-1e3L # adaptation periodnpar <-2Lprop_sd <-rep(1, npar) # tuning parameterchains <-matrix(nrow = B, ncol = npar)damping <-0.8acceptance <- attempts <-0colnames(chains) <-names(curr) <-c("mu","sigma")# Proposal variance proportional to inverse hessian at MAPprop_var <-diag(prop_sd) %*% Amat %*%diag(prop_sd)
MALA: sample proposal with Newton step
for(i inseq_len(B + warmup)){ ind <-pmax(1, i - warmup)# Compute the proposal mean for the Newton step prop_mean <-c(curr + damping * Amat %*%logpost_grad(curr, data = qdata))# prop <- prop_sd * c(rnorm(npar) %*% cholA) + prop_mean prop <-c(mvtnorm::rmvnorm(n =1,mean = prop_mean, sigma = prop_var))# [...]
MALA: reverse step
# Compute the reverse step curr_mean <-c(prop + damping * Amat %*%logpost_grad(prop, data = qdata))# log of ratio of bivariate Gaussian densities logmh <- mvtnorm::dmvnorm(x = curr, mean = prop_mean, sigma = prop_var, log =TRUE) - mvtnorm::dmvnorm(x = prop, mean = curr_mean, sigma = prop_var, log =TRUE) +logpost(prop, data = qdata) -logpost(curr, data = qdata)
MALA: Metropolis–Hastings ratio
if(logmh >log(runif(1))){ curr <- prop acceptance <- acceptance +1L } attempts <- attempts +1L# Save current value chains[ind,] <- curr
MALA: adaptation
if(i %%100& i < warmup){# Check acceptance rate and increase/decrease variance out <- hecbayes::adaptive(attempts = attempts, # counter for number of attemptsacceptance = acceptance, sd.p = prop_sd, #current proposal standard deviationtarget =0.574) # target acceptance rate prop_sd <- out$sd # overwrite current std.dev acceptance <- out$acc # if we change std. dev, this is set to zero attempts <- out$att # idem, otherwise unchanged prop_var <-diag(prop_sd) %*% Amat %*%diag(prop_sd) }}# End of MCMC for loop
Gibbs sampling
The Gibbs sampling algorithm builds a Markov chain by iterating through a sequence of conditional distributions.
Figure 3: Sampling trajectory for a bivariate target using Gibbs sampling.
Gibbs sampler
Split the parameter vector into blocks, such that, conditional on the remaining components of the parameter vector , the conditional posterior is from a known distribution from which we can easily simulate.
Gibbs sampling update
At iteration , we can update each block in turn: note that the th block uses the partially updated state which corresponds to the current value of the parameter vector after the updates.
Notes on Gibbs sampling
Special case of Metropolis–Hastings with conditional density as proposal .
The benefit is that all proposals get accepted, !
No tuning parameter, but parametrization matters.
Automatic acceptance does not equal efficiency.
To check the validity of the Gibbs sampler, see the methods proposed in Geweke (2004).
Efficiency of Gibbs sampling
As the dimension of the parameter space increases, and as the correlation between components becomes larger, the efficiency of the Gibbs sampler degrades
Figure 4: Trace plots (top) and correlograms (bottom) for the first component of a Gibbs sampler with equicorrelated Gaussian variates with correlation (left) and with equicorrelation (right).
Gibbs sampling requires work!
You need to determine all of the relevant conditional distributions, which often relies on setting conditionally conjugate priors.
In large models with multiple layers, full conditionals may only depend on a handful of parameters (via directed acyclic graph and moral graph of the model; not covered).
Example of Gibbs sampling
Consider independent and identically distributed observations, with
The joint posterior is not available in closed form, but the independent priors for the mean and variance of the observations are conditionally conjugate.
Joint posterior for Gibbs sample
Write the posterior density as usual,
Recognizing distributions from posterior
Consider the conditional densities of each parameter in turn (up to proportionality):
Gibs sample
We can simulate in turn
Data augmentation and auxiliary variables
When the likelihood is intractable or costly to evaluate (e.g., mixtures, missing data, censoring), auxiliary variables are introduced to simplify calculations.
Consider auxiliary variables such that i.e., the marginal distribution is that of interest, but evaluation of is cheaper.
Bayesian augmentation
The data augmentation algorithm (Tanner & Wong, 1987) consists in running a Markov chain on the augmented state space , simulating in turn from the conditionals
and
For more details and examples, see Dyk & Meng (2001) and Hobert (2011).
Data augmentation: probit example
Consider independent binary responses , with where is the distribution function of the standard Gaussian distribution. The likelihood of the probit model is and this prevents easy simulation.
Probit augmentation
We can consider a data augmentation scheme where , where , where is the th row of the design matrix.
The augmented data likelihood is
Conditional distributions for probit regression
with the ordinary least square estimator.
Data augmentation with scale mixture of Gaussian
The Laplace distribution with mean and scale has density and can be expressed as a scale mixture of Gaussians, where is equivalent to and .
Joint posterior for Laplace model
With , the joint posterior for the i.i.d. sample is
Conditional distributions
The conditionals for and are, as usual, Gaussian and inverse gamma, respectively. The variances, , are conditionally independent of one another, with
Inverse transformation
With the change of variable , we have and we recognize the Wald (or inverse Gaussian) density, where with and .
Bayesian LASSO
Park & Casella (2008) use this hierarchical construction to defined the Bayesian LASSO. With a model matrix whose columns are standardized to have mean zero and unit standard deviation, we may write
Comment about Bayesian LASSO
If we set an improper prior , the resulting conditional distributions are all available and thus the model is amenable to Gibbs sampling.
The Bayesian LASSO places a Laplace penalty on the regression coefficients, with lower values of yielding more shrinkage.
Contrary to the frequentist setting, none of the posterior draws of are exactly zero.
Summary
Gibbs sampling is a special case of Metropolis–Hastings algorithm that leads to acceptance
We need to get the conditional distribution
References
Dyk, D. A. van, & Meng, X.-L. (2001). The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1), 1–50. https://doi.org/10.1198/10618600152418584
Geweke, J. (2004). Getting it right: Joint distribution tests of posterior simulators. Journal of the American Statistical Association, 99(467), 799–804. https://doi.org/10.1198/016214504000001132
Hobert, J. (2011). The data augmentation algorithm: Theory and methodology. In S. Brooks, A. Gelman, G. Jones, & X. L. Meng (Eds.), Handbook of Markov chain Monte Carlo (pp. 253–293). CRC Press. https://doi.org/10.1201/b10905-11
Rue, H., & Held, L. (2005). Gaussian Markov random fields: Theory and applications (p. 280). CRC Press.
Sherlock, C. (2013). Optimal scaling of the random walk Metropolis: General criteria for the 0.234 acceptance rule. Journal of Applied Probability, 50(1), 1–15. https://doi.org/10.1239/jap/1363784420
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540. https://doi.org/10.1080/01621459.1987.10478458
Comment about Bayesian LASSO