9  Deterministic approximations

So far, we have focused on stochastic approximations of integrals. In very large models, Markov chain Monte Carlo methods suffer from the curse of dimensionality and it is sometimes useful to resort to cheaper approximations. We begin this review by looking at the asymptotic Gaussian limiting distribution of the maximum a posteriori estimator, the Laplace approximation for integrals (), and its applications to model comparison () and evaluation of the marginal likelihood. We also discuss integrated nested Laplace approximations (; ), used in hierarchical models with Gaussian components to obtain approximations to the marginal distributions. This material also borrows from Section 8.2 and Appendix C.2.2 of Held and Bové ().

We make use of Landau’s notation to describe the growth rate of some functions: we write $x = O(n)$ (big-O) to indicate that the ratio $x/n \to c \in \mathbb{R}$, and $x = o(n)$ when $x/n \to 0$, both as $n \to \infty$.

9.1 Laplace approximation and its applications

Proposition 9.1 (Laplace approximation for integrals) The Laplace approximation uses a Gaussian approximation to evaluate integrals of the form
$$I_n = \int_a^b g(x)\,\mathrm{d}x = \int_a^b \exp\{nh(x)\}\,\mathrm{d}x.$$
Assume that $g(x)$, and thus $h(x)$, is concave and twice differentiable, with a maximum at $x_0 \in [a, b]$. We can Taylor expand $h(x)$ to get
$$h(x) = h(x_0) + h'(x_0)(x - x_0) + h''(x_0)(x - x_0)^2/2 + R,$$
where the remainder is $R = O\{(x - x_0)^3\}$. If $x_0$ is a maximizer, it solves $h'(x_0) = 0$; letting $\tau = -nh''(x_0)$ and ignoring the remainder term, we can write the approximation
$$I_n \approx \exp\{nh(x_0)\} \int_a^b \exp\left\{-\frac{\tau}{2}(x - x_0)^2\right\}\mathrm{d}x = \exp\{nh(x_0)\}\left(\frac{2\pi}{\tau}\right)^{1/2}\left[\Phi\{\tau^{1/2}(b - x_0)\} - \Phi\{\tau^{1/2}(a - x_0)\}\right],$$
upon recovering the unnormalized kernel of a Gaussian random variable centered at $x_0$ with precision $\tau$. The approximation error is $O(n^{-1})$. This quantity reduces to $\exp\{nh(x_0)\}(2\pi/\tau)^{1/2}$ when the integral is evaluated over the real line.

The multivariate analog is similar: for an integral of $\exp\{nh(\boldsymbol{x})\}$ over $\mathbb{R}^d$, we consider the Taylor series expansion
$$h(\boldsymbol{x}) = h(\boldsymbol{x}_0) + (\boldsymbol{x} - \boldsymbol{x}_0)^\top h'(\boldsymbol{x}_0) + \tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}_0)^\top h''(\boldsymbol{x}_0)(\boldsymbol{x} - \boldsymbol{x}_0) + R.$$
We obtain the Laplace approximation at the mode $\boldsymbol{x}_0$ satisfying $h'(\boldsymbol{x}_0) = \boldsymbol{0}_d$,
$$I_n \approx \left(\frac{2\pi}{n}\right)^{d/2} |\mathbf{H}(\boldsymbol{x}_0)|^{-1/2} \exp\{nh(\boldsymbol{x}_0)\},$$
where $|\mathbf{H}(\boldsymbol{x}_0)|$ is the determinant of the negative Hessian matrix of $h(\boldsymbol{x})$ evaluated at the mode $\boldsymbol{x}_0$.
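
As a quick numerical check of these formulas, the following sketch uses an arbitrary illustrative choice of $h$ and $n$ (not taken from the text): with $h(x) = \log(x) - x$ on $(0, \infty)$, the exact integral is available in closed form, so we can compare it with the truncated Laplace approximation of Proposition 9.1.

# Illustrative check of the Laplace approximation with h(x) = log(x) - x on (0, Inf);
# for this choice, the exact integral is gamma(n + 1) / n^(n + 1).
n <- 20
h <- function(x){ log(x) - x }
x0 <- 1                      # mode: solves h'(x) = 1/x - 1 = 0
tau <- n / x0^2              # tau = -n * h''(x0), since h''(x) = -1/x^2
laplace <- exp(n * h(x0)) * sqrt(2 * pi / tau) *
  (1 - pnorm(sqrt(tau) * (0 - x0)))   # truncation term for the lower bound a = 0
exact <- gamma(n + 1) / n^(n + 1)     # closed form for this integrand
c(laplace = laplace, exact = exact, rel_error = laplace / exact - 1)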

The Laplace approximation uses a Taylor series to approximate the density; since the latter must be non-negative, it performs the approximation on the log scale and back-transforms the result. It is important to understand that we can replace $nh(x)$ by any $O(n)$ term.

Corollary 9.1 (Laplace approximation for marginal likelihood) Consider a simple random sample $\boldsymbol{Y}$ of size $n$ from a distribution with parameter vector $\boldsymbol{\theta} \in \mathbb{R}^p$. We are interested in approximating the marginal likelihood. Write
$$p(\boldsymbol{y}) = \int_{\mathbb{R}^p} p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}$$
and take $nh(\boldsymbol{\theta}) = \log p(\boldsymbol{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$ in Proposition 9.1. Then, evaluating at the maximum a posteriori estimate $\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}$, we get
$$p(\boldsymbol{y}) = p(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})\, p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})\,(2\pi)^{p/2}\, |\mathbf{H}(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})|^{-1/2} + O(n^{-1}),$$
where $\mathbf{H}$ is the negative of the Hessian matrix of second partial derivatives of the unnormalized log posterior. We get the same relationship on the log scale, whence
$$\log p(\boldsymbol{y}) = \log p(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}) + \log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}) + \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})| + O(n^{-1}).$$
If $p(\boldsymbol{\theta}) = O(1)$ and $p(\boldsymbol{y} \mid \boldsymbol{\theta}) = O(n)$, and provided the prior does not impose unnecessary support constraints, we get the same limiting approximation if we replace the maximum a posteriori estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ by the maximum likelihood estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}$, and $\mathbf{H}(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})$ by $n\imath$, where $\imath$ denotes the Fisher information matrix for a sample of size one. We can write the determinant of the $n$-sample Fisher information as $n^p|\imath|$.

If we use this approximation instead, we get
$$\log p(\boldsymbol{y}) = \log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) - \frac{p}{2}\log n + \log p(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) - \frac{1}{2}\log|\imath| + \frac{p}{2}\log(2\pi) + O(n^{-1/2}),$$
where the error is now $O(n^{-1/2})$ due to replacing the true information by its evaluation at the MLE. The likelihood term is $O(n)$, the second term is $O(\log n)$ and the other three are $O(1)$. If we take the prior to be a multivariate Gaussian with mean $\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}$ and variance $\imath^{-1}$, then the approximation error is $O(n^{-1/2})$, whereas the marginal likelihood has error $O(1)$ if we only keep the first two terms. This gives the approximation
$$-2\log p(\boldsymbol{y}) \approx \mathrm{BIC} = -2\log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) + p\log n.$$
If the likelihood contribution dominates the posterior, the BIC approximation will improve with increasing sample size, so $\exp(-\mathrm{BIC}/2)$ is an approximation to the marginal likelihood sometimes used for model comparison via Bayes factors, although this derivation shows that the latter neglects the impact of the prior.
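
As a sanity check that can be run directly in R, the following sketch verifies on an arbitrary linear model (the built-in mtcars data, chosen purely for illustration) that the BIC reported by R matches $-2\log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) + p\log n$, where $p$ counts all estimated parameters, including the error variance.

# BIC check on an illustrative linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- attr(logLik(fit), "df")   # number of estimated parameters (coefficients + variance)
c(BIC = BIC(fit), manual = -2 * c(logLik(fit)) + p * log(n))
# exp(-BIC/2) is then proportional to the approximate marginal likelihood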

Example 9.1 (Bayesian model averaging approximation) Consider the diabetes model from Park and Casella (). We fit linear regression models, considering the best models of each size with at most the 10 predictors plus the intercept, and weight them using the BIC approximation to the marginal likelihood, as sketched below. In practice, we typically restrict attention to models within some distance of the lowest BIC value, as the weights of the remaining models are otherwise negligible.
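
A minimal sketch of the weight computation follows, assuming the diabetes data of Park and Casella as packaged in the lars package and a best-subset search via the leaps package; both are illustrative choices and the preprocessing of the original analysis may differ.

# Best-subset search and approximate posterior model weights via exp(-BIC/2)
data(diabetes, package = "lars")
db <- data.frame(y = diabetes$y, unclass(diabetes$x))   # response and the 10 predictors
search <- leaps::regsubsets(y ~ ., data = db, nvmax = 10, nbest = 10)
bic <- summary(search)$bic   # reported relative to the null model; only differences matter
delta <- bic - min(bic)
weights <- exp(-delta / 2) / sum(exp(-delta / 2))
head(sort(weights, decreasing = TRUE))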

Figure 9.1: BIC as a function of the linear model covariates (left) and Bayesian model averaging approximate weights (in percentage) for the 10 models with the highest posterior weights according to the BIC approximation.

Most of the weight is on a handful of complex models, and the best-fitting model only carries around 30% of the posterior mass.

Remark 9.1 (Parametrization for Laplace). Compared to sampling-based methods, the Laplace approximation requires optimization to find the maximum of the function. The Laplace approximation is not invariant to reparametrization: in practice, it is best to perform it on a scale $g(\boldsymbol{\theta})$ on which the log-likelihood is as close to quadratic as possible, and back-transform the result using a change of variables.

We can also use Laplace approximation to obtain a crude second-order approximation to the posterior. We suppose that the prior is proper.

We can Taylor expand the log prior and the log-likelihood around their respective modes, say $\widehat{\boldsymbol{\theta}}_0$ and $\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}$, with $\jmath_0(\widehat{\boldsymbol{\theta}}_0)$ and $\jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})$ denoting the negatives of the corresponding Hessian matrices evaluated at these modes, the latter being the observed information matrix of the likelihood component. Together, these yield
$$\begin{aligned}
\log p(\boldsymbol{\theta}) &\approx \log p(\widehat{\boldsymbol{\theta}}_0) - \tfrac{1}{2}(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_0)^\top \jmath_0(\widehat{\boldsymbol{\theta}}_0)(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_0),\\
\log p(\boldsymbol{y} \mid \boldsymbol{\theta}) &\approx \log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) - \tfrac{1}{2}(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})^\top \jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}).
\end{aligned}$$

In the case of a flat prior, the curvature is zero and the prior contribution vanishes altogether. If we now apply the Laplace approximation to this unnormalized kernel, we get that the approximate posterior must be Gaussian with precision matrix $\jmath_n$ (covariance $\jmath_n^{-1}$) and mean $\widehat{\boldsymbol{\theta}}_n$, where
$$\jmath_n = \jmath_0(\widehat{\boldsymbol{\theta}}_0) + \jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}), \qquad \widehat{\boldsymbol{\theta}}_n = \jmath_n^{-1}\{\jmath_0(\widehat{\boldsymbol{\theta}}_0)\widehat{\boldsymbol{\theta}}_0 + \jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}\},$$
and note that $\jmath_0(\widehat{\boldsymbol{\theta}}_0) = O(1)$, whereas $\jmath_n$ is $O(n)$.
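
A minimal numeric sketch of these formulas for a scalar parameter, using an illustrative binomial likelihood with a Gaussian prior on the log-odds (all numbers are made up for illustration):

# Gaussian posterior approximation from prior and likelihood curvature
# Model: y successes out of m trials, Gauss(0, 4) prior on the log-odds theta
y <- 35; m <- 50
theta_0 <- 0; j_0 <- 1/4                   # prior mode and precision (variance 4)
theta_mle <- qlogis(y / m)                 # MLE of the log-odds
p_hat <- y / m
j_mle <- m * p_hat * (1 - p_hat)           # observed information at the MLE
j_n <- j_0 + j_mle                         # approximate posterior precision
theta_n <- (j_0 * theta_0 + j_mle * theta_mle) / j_n   # approximate posterior mean
c(mean = theta_n, sd = 1 / sqrt(j_n))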

Theorem 9.1 (Bernstein–von Mises theorem) Consider any estimator asymptotically equivalent to the maximum likelihood estimator, and suppose that the prior is continuous and positive in a neighborhood of the maximum. Assume further that the regularity conditions for maximum likelihood estimation hold. Then, in the limit as $n \to \infty$, the posterior is approximately
$$\boldsymbol{\theta} \mid \boldsymbol{y} \sim \mathsf{Gauss}\{\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}, \jmath^{-1}(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})\}.$$

The conclusion from this result is that, in large samples, inferences drawn from likelihood-based and Bayesian methods will be equivalent: credible intervals will also have asymptotically valid frequentist coverage.

We can use the statement with the maximum likelihood estimator and the observed information matrix replaced by variants thereof ($\widehat{\boldsymbol{\theta}}_n$ and $\jmath_n$, the Fisher information, or any Monte Carlo estimate of the posterior mean and covariance). The differences will be noticeable for small samples, but will vanish as $n$ grows.

Example 9.2 (Gaussian approximations to the posterior) To assess the performance of the Laplace approximation, we consider an exponential likelihood $Y_i \mid \lambda \sim \mathsf{expo}(\lambda)$ with conjugate gamma prior $\lambda \sim \mathsf{gamma}(a, b)$. The exponential model has information $i(\lambda) = n/\lambda^2$ and the mode of the posterior is
$$\widehat{\lambda}_{\mathrm{MAP}} = \frac{n + a - 1}{\sum_{i=1}^n y_i + b}.$$

Figure 9.2: Gaussian approximation (dashed) to the posterior density (full line) of the exponential rate λ for the waiting dataset with an exponential likelihood and a gamma prior with a=0.01 and b=0.01. The plots are based on the first 10 observations (left) and the whole sample of size n=62 (right).
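
A sketch of how a comparison like Figure 9.2 can be drawn, assuming the waiting vector used throughout these notes is available in the workspace; here the Gaussian approximation is centred at the posterior mode with precision given by the negative Hessian of the log posterior, although the exact construction used for the figure may differ slightly.

# Exact gamma posterior versus Gaussian (Laplace) approximation, full sample
a <- b <- 0.01                        # gamma prior parameters
n <- length(waiting); s <- sum(waiting)
lambda_map <- (n + a - 1) / (s + b)   # posterior mode
prec <- (n + a - 1) / lambda_map^2    # negative Hessian of the log posterior at the mode
curve(dgamma(x, shape = n + a, rate = s + b), from = 0.02, to = 0.05, n = 1001,
      xlab = expression(lambda), ylab = "posterior density")
curve(dnorm(x, mean = lambda_map, sd = 1 / sqrt(prec)), add = TRUE, lty = 2)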

Let us now use the Laplace approximation to obtain an estimate of the marginal likelihood: because the model is conjugate, the true marginal likelihood equals
$$p(\boldsymbol{y}) = \frac{\Gamma(n + a)}{\Gamma(a)}\, \frac{b^a}{(b + \sum_{i=1}^n y_i)^{n + a}}.$$
Recall that while $p(\boldsymbol{y})$ is a function of $\boldsymbol{y}$, we only evaluate it at the observed data, so it becomes a normalizing constant for the problem at hand.

n <- length(waiting); s <- sum(waiting)
log_marg_lik <- lgamma(n+a) - lgamma(a) + a*log(b) - (n+a) * log(b+s)
# Laplace approximation
map <- (n + a - 1)/(s + b)
logpost <- function(x){
  sum(dexp(waiting, rate = x, log = TRUE)) +
    dgamma(x, a, b, log = TRUE)
}
# Negative of the Hessian of the log posterior, evaluated at the MAP
H <- -numDeriv::hessian(logpost, x = map)
# log p(y) is approximately p/2*log(2*pi) - log|H|/2 + log posterior at MAP (here p = 1)
log_marg_laplace <- 1/2*log(2*pi) - 0.5*c(determinant(H)$modulus) + logpost(map)

For the sample of size $n = 62$, the exponential model log marginal likelihood is $-276.5$, and the Laplace approximation is in close agreement, as expected from the $O(n^{-1})$ error of the method.

Proposition 9.2 (Posterior expectation using Laplace method) If we are interested in computing the posterior expectation of a positive real-valued functional $g(\boldsymbol{\theta}): \mathbb{R}^d \to \mathbb{R}_{+}$, we may write
$$\mathsf{E}_{\boldsymbol{\Theta} \mid \boldsymbol{Y}}\{g(\boldsymbol{\theta}) \mid \boldsymbol{y}\} = \frac{\int g(\boldsymbol{\theta})\, p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}}{\int p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}}.$$
We can apply Laplace’s method to both numerator and denominator. Let $\widehat{\boldsymbol{\theta}}_g$ and $\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ denote the maximizers of the integrands of the numerator and denominator, respectively, and write the negatives of the Hessian matrices of the log integrands as
$$\jmath_g = -\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top}\{\log g(\boldsymbol{\theta}) + \log p(\boldsymbol{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\}, \qquad \jmath = -\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top}\{\log p(\boldsymbol{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\}.$$
Putting these together,
$$\mathsf{E}_{\boldsymbol{\Theta} \mid \boldsymbol{Y}}\{g(\boldsymbol{\theta}) \mid \boldsymbol{y}\} = \frac{|\jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})|^{1/2}}{|\jmath_g(\widehat{\boldsymbol{\theta}}_g)|^{1/2}}\, \frac{g(\widehat{\boldsymbol{\theta}}_g)\, p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_g)\, p(\widehat{\boldsymbol{\theta}}_g)}{p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})\, p(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})} + O(n^{-2}).$$
While the Laplace method has an error $O(n^{-1})$, the leading-order error terms of the expansions cancel out in the ratio.

Example 9.3 (Posterior mean for the exponential likelihood) Consider the posterior mean $\mathsf{E}_{\Lambda \mid \boldsymbol{Y}}(\lambda)$ for the exponential model with gamma prior of Example 9.2. Let $s = \sum_{i=1}^n y_i$ and $g(\lambda) = \lambda$. Then,
$$\widehat{\lambda}_g = \frac{n + a}{s + b}, \qquad |\jmath_g(\widehat{\lambda}_g)|^{1/2} = \left(\frac{n + a}{\widehat{\lambda}_g^2}\right)^{1/2} = \frac{s + b}{(n + a)^{1/2}}.$$

Simplification gives the approximation
$$\widehat{\mathsf{E}}_{\Lambda \mid \boldsymbol{Y}}(\Lambda) \approx \frac{\exp(-1)}{s + b}\, \frac{(n + a)^{n + a + 1/2}}{(n + a - 1)^{n + a - 1/2}},$$
which gives 0.03457, whereas the true posterior mean is $(n + a)/(s + b) = 0.03457$. The Laplace approximation is equal to the true value up to five significant digits.
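
The same ratio can be evaluated numerically rather than analytically; the sketch below reuses the objects logpost, map, H, n, s, a and b defined in the marginal likelihood code above.

# Ratio-of-Laplace approximation to the posterior mean, computed numerically
logpost_g <- function(x){ log(x) + logpost(x) }   # numerator integrand, g(lambda) = lambda
map_g <- (n + a) / (s + b)                        # maximizer of the numerator integrand
H_g <- -numDeriv::hessian(logpost_g, x = map_g)
post_mean_laplace <- exp(
  0.5 * c(determinant(H)$modulus) - 0.5 * c(determinant(H_g)$modulus) +
  logpost_g(map_g) - logpost(map))
c(laplace = post_mean_laplace, exact = (n + a) / (s + b))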

9.2 Integrated nested Laplace approximation

In many high-dimensional models, the use of MCMC is prohibitively expensive, and fast yet accurate calculations are important. One class of models whose special structure is particularly amenable to deterministic approximations is described next.

Consider a model with response $\boldsymbol{y}$ which depends on covariates $\mathbf{x}$ through a latent Gaussian process, typically via Gaussian priors on the coefficients $\boldsymbol{\beta} \in \mathbb{R}^p$. In applications with splines or space-time processes, the prior precision matrix of $\boldsymbol{\beta}$ will be sparse, with a Gaussian Markov random field structure. The dimension $p$ can be substantial (several thousand), with a comparatively low-dimensional hyperparameter vector $\boldsymbol{\theta} \in \mathbb{R}^m$. Interest typically then lies in the marginal posteriors
$$p(\beta_i \mid \boldsymbol{y}) = \int p(\beta_i \mid \boldsymbol{\theta}, \boldsymbol{y})\, p(\boldsymbol{\theta} \mid \boldsymbol{y})\,\mathrm{d}\boldsymbol{\theta}, \qquad p(\theta_i \mid \boldsymbol{y}) = \int p(\boldsymbol{\theta} \mid \boldsymbol{y})\,\mathrm{d}\boldsymbol{\theta}_{-i},$$
where $\boldsymbol{\theta}_{-i}$ denotes the vector of hyperparameters excluding the $i$th element $\theta_i$. The INLA method builds Laplace approximations to the integrands $p(\beta_i \mid \boldsymbol{\theta}, \boldsymbol{y})$ and $p(\boldsymbol{\theta} \mid \boldsymbol{y})$, and evaluates the integrals using quadrature rules over a coarse grid of values of $\boldsymbol{\theta}$.

Write the joint posterior as $p(\boldsymbol{\beta}, \boldsymbol{\theta} \mid \boldsymbol{y}) = p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})\, p(\boldsymbol{\theta} \mid \boldsymbol{y})$, so that the marginal posterior of interest is $p(\boldsymbol{\theta} \mid \boldsymbol{y}) = p(\boldsymbol{\beta}, \boldsymbol{\theta} \mid \boldsymbol{y})/p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})$, and perform a Laplace approximation, for fixed value of $\boldsymbol{\theta}$, of the term $p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})$, whose mode we denote by $\widehat{\boldsymbol{\beta}}$. This yields
$$\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y}) \propto \frac{p(\widehat{\boldsymbol{\beta}}, \boldsymbol{\theta} \mid \boldsymbol{y})}{p_{\mathsf{G}}(\widehat{\boldsymbol{\beta}} \mid \boldsymbol{y}, \boldsymbol{\theta})} \propto p(\widehat{\boldsymbol{\beta}}, \boldsymbol{\theta} \mid \boldsymbol{y})\, |\mathbf{H}(\widehat{\boldsymbol{\beta}})|^{-1/2},$$
where the Laplace approximation has kernel $p_{\mathsf{G}}(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{\theta}) \propto |\mathbf{H}(\widehat{\boldsymbol{\beta}})|^{1/2}\exp\{-(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}})^\top\mathbf{H}(\widehat{\boldsymbol{\beta}})(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}})/2\}$; since it is evaluated at $\widehat{\boldsymbol{\beta}}$, only the determinant of the negative Hessian of $\log p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})$, namely $\mathbf{H}(\widehat{\boldsymbol{\beta}})$, remains. Note that the latter is a function of $\boldsymbol{\theta}$.

To obtain $p(\theta_i \mid \boldsymbol{y})$, we then proceed as follows:

  1. Find the mode of $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ using Newton’s method, approximating the gradient and Hessian via finite differences.
  2. Compute the negative Hessian at the mode to get an approximation to the covariance of $\boldsymbol{\theta}$. Use an eigendecomposition to get the principal directions $\boldsymbol{z}$.
  3. In each direction of $\boldsymbol{z}$, consider drops in $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ as we move away from the mode and define a coarse grid based on these, keeping points where the difference in $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ relative to the mode is less than some numerical tolerance $\delta$.
  4. Retrieve the marginal by numerical integration using the central composite design outlined above; a toy illustration is sketched below. Alternatively, we can avoid the integration altogether and use the approximation evaluated at the posterior mode of $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$.
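
To make the construction of $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ concrete, here is a small self-contained sketch (not the INLA implementation) for a toy model with conditionally independent Poisson counts, independent Gaussian latent effects with precision $\theta$, and a gamma(1, 1) prior on $\theta$; a plain uniform grid replaces the central composite design, and all model choices are illustrative.

# Toy INLA-style approximation of p(theta | y): Laplace in beta for each theta
set.seed(80601)
n <- 50
beta_true <- rnorm(n, sd = 0.5)
y <- rpois(n, lambda = exp(beta_true))
log_ptilde <- function(theta){
  # Newton-Raphson for the conditional mode of beta given theta (diagonal Hessian)
  beta <- rep(0, n)
  for(iter in 1:25){
    grad <- y - exp(beta) - theta * beta
    hess <- exp(beta) + theta          # negative Hessian, elementwise
    beta <- beta + grad / hess
  }
  # log p(beta, theta | y) at the mode, minus half the log-determinant of the negative Hessian
  sum(dpois(y, lambda = exp(beta), log = TRUE)) +
    sum(dnorm(beta, sd = 1/sqrt(theta), log = TRUE)) +
    dgamma(theta, shape = 1, rate = 1, log = TRUE) -
    0.5 * sum(log(exp(beta) + theta))
}
theta_grid <- seq(0.1, 10, by = 0.1)
log_post <- sapply(theta_grid, log_ptilde)
post <- exp(log_post - max(log_post))
post <- post / sum(post * 0.1)          # normalize over the grid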

To approximate $p(\beta_i \mid \boldsymbol{y})$, Rue, Martino, and Chopin () proceed instead by building an approximation based on maximizing $p(\boldsymbol{\beta}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y})$ to yield $\widehat{\boldsymbol{\beta}}^{(i)}$, whose $i$th element is $\beta_i$, and setting
$$\widetilde{p}(\beta_i \mid \boldsymbol{\theta}, \boldsymbol{y}) \propto \frac{p(\widehat{\boldsymbol{\beta}}^{(i)}, \boldsymbol{\theta} \mid \boldsymbol{y})}{\widetilde{p}(\widehat{\boldsymbol{\beta}}^{(i)}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y})},$$
with a suitable renormalization of $\widetilde{p}(\widehat{\boldsymbol{\beta}}^{(i)}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y})$. Such approximations are reminiscent of profile likelihood.

While we could use the Laplace approximation $p_{\mathsf{G}}(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{\theta})$ and marginalize it directly, this leads to evaluating the Laplace approximation to the density far from the mode, which is often inaccurate. Another challenge is that $p$ is often very large, so the Hessian $\mathbf{H}$ is costly to evaluate. Having to evaluate it repeatedly for each marginal $\beta_i$, $i = 1, \ldots, p$, is prohibitive since it involves factorizations of $p \times p$ matrices.

To reduce the computational costs, Rue, Martino, and Chopin () propose using the approximate mean to avoid optimizing, and consider the conditional distribution derived from the Gaussian approximation with mean $\widehat{\boldsymbol{\beta}}$ and covariance $\boldsymbol{\Sigma} = \mathbf{H}^{-1}(\widehat{\boldsymbol{\beta}})$,
$$\boldsymbol{\beta}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y} \sim \mathsf{Gauss}_{p-1}\{\widetilde{\boldsymbol{\beta}}_{(-i)}, \mathbf{M}_{-i,-i}^{-1}\}, \qquad \widetilde{\boldsymbol{\beta}}_{(-i)} = \widehat{\boldsymbol{\beta}}_{-i} + \boldsymbol{\Sigma}_{-i,i}\Sigma_{i,i}^{-1}(\beta_i - \widehat{\beta}_i);$$
this only requires a rank-one update. Wood () suggests using a Newton step to correct $\widetilde{\boldsymbol{\beta}}_{(-i)}$, starting from the conditional mean. The second step is to exploit the local dependence on $\boldsymbol{\beta}$, using the Markov structure to build an improvement to the Hessian. Further improvements are proposed in Rue, Martino, and Chopin (), who use a simplified Laplace approximation to correct the Gaussian approximation for location and skewness, a necessary step when the likelihood itself is not Gaussian. This leads to a Taylor series approximation to correct the log determinant of the Hessian matrix. Wood () considers a BFGS update to $\mathbf{M}_{-i,-i}^{-1}$ directly, which works less well than the Taylor expansion near $\widehat{\beta}_i$, but improves upon it as we move away from this value. Nowadays, the INLA software uses a low-rank variational correction to the Laplace method, proposed in van Niekerk and Rue ().
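
The conditional mean in the display above is just the usual Gaussian conditioning formula; the tiny simulated check below (all quantities arbitrary) shows its equivalence with the precision-based form that exploits the negative Hessian directly.

# Gaussian conditional mean: covariance form versus precision (negative Hessian) form
set.seed(1)
p <- 4
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A)                     # covariance of the Gaussian approximation
beta_hat <- rnorm(p)                      # mode of the Gaussian approximation
i <- 2; beta_i <- 1.5                     # condition on the value of the i-th component
cond_mean <- beta_hat[-i] + Sigma[-i, i] / Sigma[i, i] * (beta_i - beta_hat[i])
H <- solve(Sigma)                         # precision matrix
cond_mean2 <- beta_hat[-i] - solve(H[-i, -i], H[-i, i]) * (beta_i - beta_hat[i])
max(abs(cond_mean - cond_mean2))          # numerically zero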

The INLA R package provides an interface to fit models with Gaussian latent random effects. While the software is particularly popular for spatio-temporal applications using the SPDE approach, we revisit two examples in the sequel where we can exploit the Markov structure.

Example 9.4 (Stochastic volatility model with INLA) Financial returns $Y_t$ typically exhibit time-varying variability. The stochastic volatility model is a parameter-driven model that specifies
$$Y_t = \exp(h_t/2) Z_t, \qquad h_t = \gamma + \phi(h_{t-1} - \gamma) + \sigma U_t,$$
where $U_t \stackrel{\mathrm{iid}}{\sim} \mathsf{Gauss}(0, 1)$ and $Z_t \stackrel{\mathrm{iid}}{\sim} \mathsf{Gauss}(0, 1)$. The INLA documentation provides information about which default priors and hyperparameters are specified. We use a gamma(1, 0.001) prior for the precision.

library(INLA)
# Stochastic volatility model
data(exchangerate, package = "hecbayes")
# Compute response from raw spot exchange rates at noon
y <- 100*diff(log(exchangerate$dexrate))
# 'y' is now a series of percentage of log daily differences
time <- seq_along(y)
data <- data.frame(y = y, time = time)
# Stochastic volatility model
# https://inla.r-inla-download.org/r-inla.org/doc/likelihood/stochvolgaussian.pdf
# The model uses a log link, and a (log)-gamma prior for the precision
f_stochvol <- formula(y ~ f(time, model = "ar1", param = list(prec = c(1, 0.001))))
mod_stochvol <- inla(f_stochvol, family = "stochvol", data = data)
# Obtain summary
summary <- summary(mod_stochvol)
# plot(mod_stochvol)
marg_prec <- mod_stochvol$marginals.hyperpar[[1]]
marg_phi <- mod_stochvol$marginals.hyperpar[[2]]
Figure 9.3: Marginal densities of precision and autocorrelation parameters from the Gaussian stochastic volatility model.

Figure 9.3 shows that the correlation $\phi$ is nearly one, leading to random walk behaviour and high persistence over time (this is also due to the frequency of observations). This strong serial dependence in the variance is in part responsible for the difficulty of fitting this model using MCMC.

We can use the marginal density approximations to obtain quantiles for summaries of interest. The software also includes utilities to transform the parameters using the change of variables formula.

# Compute density, quantiles, etc. via inla.*marginal
## approximate 95% credible interval and marginal post median
INLA::inla.qmarginal(marg_phi, p = c(0.025, 0.5, 0.975))
[1] 0.9706630 0.9847944 0.9929106
# Change of variable to get variance from precision
marg_var <- INLA::inla.tmarginal(
  fun = function(x) { 1 / x }, 
  marginal = marg_prec)
INLA::inla.qmarginal(marg_var, p = c(0.025, 0.975))
[1] 0.2864908 0.7396801
# Posterior marginal mean and variance of phi
mom1 <- INLA::inla.emarginal(
    fun = function(x){x}, 
    marginal = marg_phi)
mom2 <- INLA::inla.emarginal(
    fun = function(x){x^2}, 
    marginal = marg_phi)
c(mean = mom1, sd = sqrt(mom2 - mom1^2))
      mean         sd 
0.98405272 0.00576251 

Example 9.5 (Tokyo binomial time series) We revisit the Tokyo rainfall binomial time series, but this time fit the model with INLA. We specify the mean model without an intercept and fit a logistic regression, with a second-order cyclic random walk prior for the coefficients and the default priors for the other parameters.

data(Tokyo, package = "INLA")
# Formula (removing intercept)
formula <- y ~ f(time, model = "rw2", cyclic = TRUE) - 1
mod <- INLA::inla(
   formula = formula, 
   family = "binomial",
   Ntrials = n, 
   data = Tokyo)
Figure 9.4: Posterior probability per day of the year with posterior median and 95% credible interval for the Tokyo rainfall binomial time series.

Figure 9.4 shows posterior summaries for $\boldsymbol{\beta}$, which align with the results obtained for the probit model.

If we want to obtain predictions, we need to augment the model matrix and set the corresponding response values to missing; these are then imputed alongside the other parameters.
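
A minimal sketch of this mechanism, treating the last week of the Tokyo series as unobserved (an illustrative choice); the control.predictor argument requests the fitted values and uses the first link function for the missing responses.

# Predictions via missing responses: the held-out y values are treated as unknowns
Tokyo_miss <- Tokyo
Tokyo_miss$y[360:366] <- NA
mod_pred <- INLA::inla(
   formula = formula,
   family = "binomial",
   Ntrials = n,
   data = Tokyo_miss,
   control.predictor = list(compute = TRUE, link = 1))
# Posterior summaries of the fitted probabilities for the held-out days
tail(mod_pred$summary.fitted.values, 7)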