9  Deterministic approximations

So far, we have focused on stochastic approximations of integrals. In very large models, Markov chain Monte Carlo methods suffer from the curse of dimensionality and it is sometimes useful to resort to cheaper approximations. We begin this review by looking at the asymptotic Gaussian limiting distribution of the maximum a posteriori estimator, the Laplace approximation for integrals (), and its applications to model comparison () and evaluation of the marginal likelihood. We also discuss integrated nested Laplace approximations (; ), used in hierarchical models with Gaussian components to obtain approximations to the marginal distributions. This material also borrows from Section 8.2 and Appendix C.2.2 of Held and Bové ().

We make use of Landau’s notation to describe the growth rate of some functions: we write $x = O(n)$ (big-O) to indicate that the ratio $x/n \to c \in \mathbb{R}$, and $x = o(n)$ when $x/n \to 0$, both as $n \to \infty$.

9.1 Laplace approximation and its applications

Proposition 9.1 (Laplace approximation for integrals) The Laplace approximation uses a Gaussian approximation to evaluate integrals of the form
$$I_n = \int_a^b g(x)\,\mathrm{d}x = \int_a^b \exp\{nh(x)\}\,\mathrm{d}x.$$
Assume that $g(x)$, and thus $h(x)$, is concave and twice differentiable, with a maximum at $x_0 \in [a, b]$. We can Taylor expand $h(x)$ to get
$$h(x) = h(x_0) + h'(x_0)(x - x_0) + h''(x_0)(x - x_0)^2/2 + R,$$
where the remainder is $R = O\{(x - x_0)^3\}$. If $x_0$ is a maximizer, it solves $h'(x_0) = 0$; letting $\tau = -nh''(x_0)$ and ignoring the remainder term, we can write the approximation
$$I_n \approx \exp\{nh(x_0)\} \int_a^b \exp\left\{-\frac{\tau}{2}(x - x_0)^2\right\}\mathrm{d}x = \exp\{nh(x_0)\}\left(\frac{2\pi}{\tau}\right)^{1/2}\left[\Phi\{\tau^{1/2}(b - x_0)\} - \Phi\{\tau^{1/2}(a - x_0)\}\right],$$
upon recovering the unnormalized kernel of a Gaussian random variable centered at $x_0$ with precision $\tau$. The approximation error is $O(n^{-1})$. This quantity reduces to $\exp\{nh(x_0)\}(2\pi/\tau)^{1/2}$ when the integral is evaluated over the real line.

The multivariate analog is similar: for an integral of $\exp\{nh(\boldsymbol{x})\}$ over $\mathbb{R}^d$, we consider the Taylor series expansion
$$h(\boldsymbol{x}) = h(\boldsymbol{x}_0) + (\boldsymbol{x} - \boldsymbol{x}_0)^\top h'(\boldsymbol{x}_0) + \tfrac{1}{2}(\boldsymbol{x} - \boldsymbol{x}_0)^\top h''(\boldsymbol{x}_0)(\boldsymbol{x} - \boldsymbol{x}_0) + R.$$
We obtain the Laplace approximation at the mode $\boldsymbol{x}_0$ satisfying $h'(\boldsymbol{x}_0) = \boldsymbol{0}_d$,
$$I_n \approx \left(\frac{2\pi}{n}\right)^{d/2} |\mathbf{H}(\boldsymbol{x}_0)|^{-1/2} \exp\{nh(\boldsymbol{x}_0)\},$$
where $|\mathbf{H}(\boldsymbol{x}_0)|$ is the determinant of the negative Hessian matrix of $h(\boldsymbol{x})$ evaluated at the mode $\boldsymbol{x}_0$.
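
As a quick numerical check of these formulas, the following sketch uses an arbitrary illustrative choice of $h$ and $n$ (not taken from the text): with $h(x) = \log(x) - x$ on $(0, \infty)$, the exact integral is available in closed form, so we can compare it with the truncated Laplace approximation of Proposition 9.1.

# Illustrative check of the Laplace approximation with h(x) = log(x) - x on (0, Inf);
# for this choice, the exact integral is gamma(n + 1) / n^(n + 1).
n <- 20
h <- function(x){ log(x) - x }
x0 <- 1                      # mode: solves h'(x) = 1/x - 1 = 0
tau <- n / x0^2              # tau = -n * h''(x0), since h''(x) = -1/x^2
laplace <- exp(n * h(x0)) * sqrt(2 * pi / tau) *
  (1 - pnorm(sqrt(tau) * (0 - x0)))   # truncation term for the lower bound a = 0
exact <- gamma(n + 1) / n^(n + 1)     # closed form for this integrand
c(laplace = laplace, exact = exact, rel_error = laplace / exact - 1)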

The Laplace approximation uses a Taylor series to approximate the density; since the latter must be non-negative, it performs the approximation on the log scale and back-transforms the result. It is important to understand that we can replace $nh(x)$ by any $O(n)$ term.

Corollary 9.1 (Laplace approximation for marginal likelihood) Consider a simple random sample $\boldsymbol{Y}$ of size $n$ from a distribution with parameter vector $\boldsymbol{\theta} \in \mathbb{R}^p$. We are interested in approximating the marginal likelihood. Write
$$p(\boldsymbol{y}) = \int_{\mathbb{R}^p} p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}$$
and take $nh(\boldsymbol{\theta}) = \log p(\boldsymbol{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$ in Proposition 9.1. Then, evaluating at the maximum a posteriori estimate $\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}$, we get
$$p(\boldsymbol{y}) = p(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})\, p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})\,(2\pi)^{p/2}\, |\mathbf{H}(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})|^{-1/2} + O(n^{-1}),$$
where $\mathbf{H}$ is the negative of the Hessian matrix of second partial derivatives of the unnormalized log posterior. We get the same relationship on the log scale, whence
$$\log p(\boldsymbol{y}) = \log p(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}) + \log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}) + \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})| + O(n^{-1}).$$
If $p(\boldsymbol{\theta}) = O(1)$ and $p(\boldsymbol{y} \mid \boldsymbol{\theta}) = O(n)$, and provided the prior does not impose unnecessary support constraints, we get the same limiting approximation if we replace the maximum a posteriori estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ by the maximum likelihood estimator $\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}$, and $\mathbf{H}(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})$ by $n\imath$, where $\imath$ denotes the Fisher information matrix for a sample of size one. We can write the determinant of the $n$-sample Fisher information as $n^p|\imath|$.

If we use this approximation instead, we get
$$\log p(\boldsymbol{y}) = \log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) - \frac{p}{2}\log n + \log p(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) - \frac{1}{2}\log|\imath| + \frac{p}{2}\log(2\pi) + O(n^{-1/2}),$$
where the error is now $O(n^{-1/2})$ due to replacing the true information by its evaluation at the MLE. The likelihood term is $O(n)$, the second term is $O(\log n)$ and the other three are $O(1)$. If we take the prior to be a multivariate Gaussian with mean $\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}$ and variance $\imath^{-1}$, then the approximation error is $O(n^{-1/2})$, whereas the marginal likelihood has error $O(1)$ if we only keep the first two terms. This gives the approximation
$$-2\log p(\boldsymbol{y}) \approx \mathrm{BIC} = -2\log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) + p\log n.$$
If the likelihood contribution dominates the posterior, the BIC approximation will improve with increasing sample size, so $\exp(-\mathrm{BIC}/2)$ is an approximation to the marginal likelihood sometimes used for model comparison via Bayes factors, although this derivation shows that the latter neglects the impact of the prior.
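
As a sanity check that can be run directly in R, the following sketch verifies on an arbitrary linear model (the built-in mtcars data, chosen purely for illustration) that the BIC reported by R matches $-2\log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) + p\log n$, where $p$ counts all estimated parameters, including the error variance.

# BIC check on an illustrative linear model
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- attr(logLik(fit), "df")   # number of estimated parameters (coefficients + variance)
c(BIC = BIC(fit), manual = -2 * c(logLik(fit)) + p * log(n))
# exp(-BIC/2) is then proportional to the approximate marginal likelihood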

Example 9.1 (Bayesian model averaging approximation) Consider the diabetes model from Park and Casella (). We fit linear regression models, considering the best models of each size with at most the 10 predictors plus the intercept, and weight them using the BIC approximation to the marginal likelihood, as sketched below. In practice, we typically restrict attention to models within some distance of the lowest BIC value, as the weights of the remaining models are otherwise negligible.
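
A minimal sketch of the weight computation follows, assuming the diabetes data of Park and Casella as packaged in the lars package and a best-subset search via the leaps package; both are illustrative choices and the preprocessing of the original analysis may differ.

# Best-subset search and approximate posterior model weights via exp(-BIC/2)
data(diabetes, package = "lars")
db <- data.frame(y = diabetes$y, unclass(diabetes$x))   # response and the 10 predictors
search <- leaps::regsubsets(y ~ ., data = db, nvmax = 10, nbest = 10)
bic <- summary(search)$bic   # reported relative to the null model; only differences matter
delta <- bic - min(bic)
weights <- exp(-delta / 2) / sum(exp(-delta / 2))
head(sort(weights, decreasing = TRUE))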

Figure 9.1: BIC as a function of the linear model covariates (left) and Bayesian model averaging approximate weights (in percentage) for the 10 models with the highest posterior weights according to the BIC approximation.

Most of the weight is on a handful of complex models, and the best-fitting model only carries around 30% of the posterior mass.

Remark 9.1 (Parametrization for Laplace). Compared to sampling-based methods, the Laplace approximation requires optimization to find the maximum of the function. The Laplace approximation is not invariant to reparametrization: in practice, it is best to perform it on a scale $g(\boldsymbol{\theta})$ on which the log-likelihood is as close to quadratic as possible, and back-transform the result using a change of variables.

We can also use Laplace approximation to obtain a crude second-order approximation to the posterior. We suppose that the prior is proper.

We can Taylor expand the log prior and the log-likelihood around their respective modes, say $\widehat{\boldsymbol{\theta}}_0$ and $\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}$, with $\jmath_0(\widehat{\boldsymbol{\theta}}_0)$ and $\jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})$ denoting the negatives of the corresponding Hessian matrices evaluated at these modes, the latter being the observed information matrix of the likelihood component. Together, these yield
$$\begin{aligned}
\log p(\boldsymbol{\theta}) &\approx \log p(\widehat{\boldsymbol{\theta}}_0) - \tfrac{1}{2}(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_0)^\top \jmath_0(\widehat{\boldsymbol{\theta}}_0)(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_0),\\
\log p(\boldsymbol{y} \mid \boldsymbol{\theta}) &\approx \log p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}) - \tfrac{1}{2}(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})^\top \jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})(\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}).
\end{aligned}$$

In the case of a flat prior, the curvature is zero and the prior contribution vanishes altogether. If we now apply the Laplace approximation to this unnormalized kernel, we get that the approximate posterior must be Gaussian with precision matrix $\jmath_n$ (covariance $\jmath_n^{-1}$) and mean $\widehat{\boldsymbol{\theta}}_n$, where
$$\jmath_n = \jmath_0(\widehat{\boldsymbol{\theta}}_0) + \jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}), \qquad \widehat{\boldsymbol{\theta}}_n = \jmath_n^{-1}\{\jmath_0(\widehat{\boldsymbol{\theta}}_0)\widehat{\boldsymbol{\theta}}_0 + \jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}\},$$
and note that $\jmath_0(\widehat{\boldsymbol{\theta}}_0) = O(1)$, whereas $\jmath_n$ is $O(n)$.
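
A minimal numeric sketch of these formulas for a scalar parameter, using an illustrative binomial likelihood with a Gaussian prior on the log-odds (all numbers are made up for illustration):

# Gaussian posterior approximation from prior and likelihood curvature
# Model: y successes out of m trials, Gauss(0, 4) prior on the log-odds theta
y <- 35; m <- 50
theta_0 <- 0; j_0 <- 1/4                   # prior mode and precision (variance 4)
theta_mle <- qlogis(y / m)                 # MLE of the log-odds
p_hat <- y / m
j_mle <- m * p_hat * (1 - p_hat)           # observed information at the MLE
j_n <- j_0 + j_mle                         # approximate posterior precision
theta_n <- (j_0 * theta_0 + j_mle * theta_mle) / j_n   # approximate posterior mean
c(mean = theta_n, sd = 1 / sqrt(j_n))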

Theorem 9.1 (Bernstein–von Mises theorem) Consider any estimator asymptotically equivalent to the maximum likelihood estimator, and suppose that the prior is continuous and positive in a neighborhood of the maximum. Assume further that the regularity conditions for maximum likelihood estimation hold. Then, in the limit as $n \to \infty$, the posterior is approximately
$$\boldsymbol{\theta} \mid \boldsymbol{y} \sim \mathsf{Gauss}\{\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}}, \jmath^{-1}(\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}})\}.$$

The conclusion from this result is that, in large samples, inferences drawn from likelihood-based and Bayesian methods will be equivalent: credible intervals will also have asymptotically valid frequentist coverage.

We can use the statement with the maximum likelihood estimator and the observed information matrix replaced by variants thereof ($\widehat{\boldsymbol{\theta}}_n$ and $\jmath_n$, the Fisher information, or any Monte Carlo estimate of the posterior mean and covariance). The differences will be noticeable for small samples, but will vanish as $n$ grows.

Example 9.2 (Gaussian approximations to the posterior) To assess the performance of the Laplace approximation, we consider an exponential likelihood $Y_i \mid \lambda \sim \mathsf{expo}(\lambda)$ with conjugate gamma prior $\lambda \sim \mathsf{gamma}(a, b)$. The exponential model has information $i(\lambda) = n/\lambda^2$ and the mode of the posterior is
$$\widehat{\lambda}_{\mathrm{MAP}} = \frac{n + a - 1}{\sum_{i=1}^n y_i + b}.$$

Figure 9.2: Gaussian approximation (dashed) to the posterior density (full line) of the exponential rate λ for the waiting dataset with an exponential likelihood and a gamma prior with a=0.01 and b=0.01. The plots are based on the first 10 observations (left) and the whole sample of size n=62 (right).
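
A sketch of how a comparison like Figure 9.2 can be drawn, assuming the waiting vector used throughout these notes is available in the workspace; here the Gaussian approximation is centred at the posterior mode with precision given by the negative Hessian of the log posterior, although the exact construction used for the figure may differ slightly.

# Exact gamma posterior versus Gaussian (Laplace) approximation, full sample
a <- b <- 0.01                        # gamma prior parameters
n <- length(waiting); s <- sum(waiting)
lambda_map <- (n + a - 1) / (s + b)   # posterior mode
prec <- (n + a - 1) / lambda_map^2    # negative Hessian of the log posterior at the mode
curve(dgamma(x, shape = n + a, rate = s + b), from = 0.02, to = 0.05, n = 1001,
      xlab = expression(lambda), ylab = "posterior density")
curve(dnorm(x, mean = lambda_map, sd = 1 / sqrt(prec)), add = TRUE, lty = 2)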

Let us now use the Laplace approximation to obtain an estimate of the marginal likelihood: because the model is conjugate, the true marginal likelihood equals
$$p(\boldsymbol{y}) = \frac{\Gamma(n + a)}{\Gamma(a)}\, \frac{b^a}{(b + \sum_{i=1}^n y_i)^{n + a}}.$$
Recall that while $p(\boldsymbol{y})$ is a function of $\boldsymbol{y}$, we only evaluate it at the observed data, so it becomes a normalizing constant for the problem at hand.

n <- length(waiting); s <- sum(waiting)
log_marg_lik <- lgamma(n+a) - lgamma(a) + a*log(b) - (n+a) * log(b+s)
# Laplace approximation
map <- (n + a - 1)/(s + b)
logpost <- function(x){
  sum(dexp(waiting, rate = x, log = TRUE)) +
    dgamma(x, a, b, log = TRUE)
}
# Negative of the Hessian of the log posterior, evaluated at the MAP
H <- -numDeriv::hessian(logpost, x = map)
# log p(y) is approximately p/2*log(2*pi) - log|H|/2 + log posterior at MAP (here p = 1)
log_marg_laplace <- 1/2*log(2*pi) - 0.5*c(determinant(H)$modulus) + logpost(map)

For the sample of size $n = 62$, the exponential model log marginal likelihood is $-276.5$, and the Laplace approximation is in close agreement, as expected from the $O(n^{-1})$ error of the method.

Proposition 9.2 (Posterior expectation using Laplace method) If we are interested in computing the posterior expectation of a positive real-valued functional $g(\boldsymbol{\theta}): \mathbb{R}^d \to \mathbb{R}_{+}$, we may write
$$\mathsf{E}_{\boldsymbol{\Theta} \mid \boldsymbol{Y}}\{g(\boldsymbol{\theta}) \mid \boldsymbol{y}\} = \frac{\int g(\boldsymbol{\theta})\, p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}}{\int p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta}}.$$
We can apply Laplace’s method to both numerator and denominator. Let $\widehat{\boldsymbol{\theta}}_g$ and $\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}}$ denote the maximizers of the integrands of the numerator and denominator, respectively, and write the negatives of the Hessian matrices of the log integrands as
$$\jmath_g = -\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top}\{\log g(\boldsymbol{\theta}) + \log p(\boldsymbol{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\}, \qquad \jmath = -\frac{\partial^2}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^\top}\{\log p(\boldsymbol{y} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\}.$$
Putting these together,
$$\mathsf{E}_{\boldsymbol{\Theta} \mid \boldsymbol{Y}}\{g(\boldsymbol{\theta}) \mid \boldsymbol{y}\} = \frac{|\jmath(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})|^{1/2}}{|\jmath_g(\widehat{\boldsymbol{\theta}}_g)|^{1/2}}\, \frac{g(\widehat{\boldsymbol{\theta}}_g)\, p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_g)\, p(\widehat{\boldsymbol{\theta}}_g)}{p(\boldsymbol{y} \mid \widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})\, p(\widehat{\boldsymbol{\theta}}_{\mathrm{MAP}})} + O(n^{-2}).$$
While the Laplace method has an error $O(n^{-1})$, the leading-order error terms of the expansions cancel out in the ratio.

Example 9.3 (Posterior mean for the exponential likelihood) Consider the posterior mean $\mathsf{E}_{\Lambda \mid \boldsymbol{Y}}(\lambda)$ for the exponential model with gamma prior of Example 9.2. Let $s = \sum_{i=1}^n y_i$ and $g(\lambda) = \lambda$. Then,
$$\widehat{\lambda}_g = \frac{n + a}{s + b}, \qquad |\jmath_g(\widehat{\lambda}_g)|^{1/2} = \left(\frac{n + a}{\widehat{\lambda}_g^2}\right)^{1/2} = \frac{s + b}{(n + a)^{1/2}}.$$

Simplification gives the approximation
$$\widehat{\mathsf{E}}_{\Lambda \mid \boldsymbol{Y}}(\Lambda) \approx \frac{\exp(-1)}{s + b}\, \frac{(n + a)^{n + a + 1/2}}{(n + a - 1)^{n + a - 1/2}},$$
which gives 0.03457, whereas the true posterior mean is $(n + a)/(s + b) = 0.03457$. The Laplace approximation is equal to the true value up to five significant digits.
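
The same ratio can be evaluated numerically rather than analytically; the sketch below reuses the objects logpost, map, H, n, s, a and b defined in the marginal likelihood code above.

# Ratio-of-Laplace approximation to the posterior mean, computed numerically
logpost_g <- function(x){ log(x) + logpost(x) }   # numerator integrand, g(lambda) = lambda
map_g <- (n + a) / (s + b)                        # maximizer of the numerator integrand
H_g <- -numDeriv::hessian(logpost_g, x = map_g)
post_mean_laplace <- exp(
  0.5 * c(determinant(H)$modulus) - 0.5 * c(determinant(H_g)$modulus) +
  logpost_g(map_g) - logpost(map))
c(laplace = post_mean_laplace, exact = (n + a) / (s + b))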

9.2 Integrated nested Laplace approximation

In many high-dimensional models, the use of MCMC is prohibitively expensive, and fast yet accurate calculations are important. One class of models whose special structure is particularly amenable to deterministic approximations is described next.

Consider a model with response $\boldsymbol{y}$ which depends on covariates $\mathbf{x}$ through a latent Gaussian process, typically via Gaussian priors on the coefficients $\boldsymbol{\beta} \in \mathbb{R}^p$. In applications with splines or space-time processes, the prior precision matrix of $\boldsymbol{\beta}$ will be sparse, with a Gaussian Markov random field structure. The dimension $p$ can be substantial (several thousand), with a comparatively low-dimensional hyperparameter vector $\boldsymbol{\theta} \in \mathbb{R}^m$. Interest typically then lies in the marginal posteriors
$$p(\beta_i \mid \boldsymbol{y}) = \int p(\beta_i \mid \boldsymbol{\theta}, \boldsymbol{y})\, p(\boldsymbol{\theta} \mid \boldsymbol{y})\,\mathrm{d}\boldsymbol{\theta}, \qquad p(\theta_i \mid \boldsymbol{y}) = \int p(\boldsymbol{\theta} \mid \boldsymbol{y})\,\mathrm{d}\boldsymbol{\theta}_{-i},$$
where $\boldsymbol{\theta}_{-i}$ denotes the vector of hyperparameters excluding the $i$th element $\theta_i$. The INLA method builds Laplace approximations to the integrands $p(\beta_i \mid \boldsymbol{\theta}, \boldsymbol{y})$ and $p(\boldsymbol{\theta} \mid \boldsymbol{y})$, and evaluates the integrals using quadrature rules over a coarse grid of values of $\boldsymbol{\theta}$.

Write the joint posterior as $p(\boldsymbol{\beta}, \boldsymbol{\theta} \mid \boldsymbol{y}) = p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})\, p(\boldsymbol{\theta} \mid \boldsymbol{y})$, so that the marginal posterior of interest is $p(\boldsymbol{\theta} \mid \boldsymbol{y}) = p(\boldsymbol{\beta}, \boldsymbol{\theta} \mid \boldsymbol{y})/p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})$, and perform a Laplace approximation, for fixed value of $\boldsymbol{\theta}$, of the term $p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})$, whose mode we denote by $\widehat{\boldsymbol{\beta}}$. This yields
$$\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y}) \propto \frac{p(\widehat{\boldsymbol{\beta}}, \boldsymbol{\theta} \mid \boldsymbol{y})}{p_{\mathsf{G}}(\widehat{\boldsymbol{\beta}} \mid \boldsymbol{y}, \boldsymbol{\theta})} \propto p(\widehat{\boldsymbol{\beta}}, \boldsymbol{\theta} \mid \boldsymbol{y})\, |\mathbf{H}(\widehat{\boldsymbol{\beta}})|^{-1/2},$$
where the Laplace approximation has kernel $p_{\mathsf{G}}(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{\theta}) \propto |\mathbf{H}(\widehat{\boldsymbol{\beta}})|^{1/2}\exp\{-(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}})^\top\mathbf{H}(\widehat{\boldsymbol{\beta}})(\boldsymbol{\beta} - \widehat{\boldsymbol{\beta}})/2\}$; since it is evaluated at $\widehat{\boldsymbol{\beta}}$, only the determinant of the negative Hessian of $\log p(\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{y})$, namely $\mathbf{H}(\widehat{\boldsymbol{\beta}})$, remains. Note that the latter is a function of $\boldsymbol{\theta}$.

To obtain $p(\theta_i \mid \boldsymbol{y})$, we then proceed as follows:

  1. Find the mode of $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ using Newton’s method, approximating the gradient and Hessian via finite differences.
  2. Compute the negative Hessian at the mode to get an approximation to the covariance of $\boldsymbol{\theta}$. Use an eigendecomposition to get the principal directions $\boldsymbol{z}$.
  3. In each direction of $\boldsymbol{z}$, consider drops in $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ as we move away from the mode and define a coarse grid based on these, keeping points where the difference in $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ relative to the mode is less than some numerical tolerance $\delta$.
  4. Retrieve the marginal by numerical integration using the central composite design outlined above; a toy illustration is sketched below. Alternatively, we can avoid the integration altogether and use the approximation evaluated at the posterior mode of $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$.
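
To make the construction of $\widetilde{p}(\boldsymbol{\theta} \mid \boldsymbol{y})$ concrete, here is a small self-contained sketch (not the INLA implementation) for a toy model with conditionally independent Poisson counts, independent Gaussian latent effects with precision $\theta$, and a gamma(1, 1) prior on $\theta$; a plain uniform grid replaces the central composite design, and all model choices are illustrative.

# Toy INLA-style approximation of p(theta | y): Laplace in beta for each theta
set.seed(80601)
n <- 50
beta_true <- rnorm(n, sd = 0.5)
y <- rpois(n, lambda = exp(beta_true))
log_ptilde <- function(theta){
  # Newton-Raphson for the conditional mode of beta given theta (diagonal Hessian)
  beta <- rep(0, n)
  for(iter in 1:25){
    grad <- y - exp(beta) - theta * beta
    hess <- exp(beta) + theta          # negative Hessian, elementwise
    beta <- beta + grad / hess
  }
  # log p(beta, theta | y) at the mode, minus half the log-determinant of the negative Hessian
  sum(dpois(y, lambda = exp(beta), log = TRUE)) +
    sum(dnorm(beta, sd = 1/sqrt(theta), log = TRUE)) +
    dgamma(theta, shape = 1, rate = 1, log = TRUE) -
    0.5 * sum(log(exp(beta) + theta))
}
theta_grid <- seq(0.1, 10, by = 0.1)
log_post <- sapply(theta_grid, log_ptilde)
post <- exp(log_post - max(log_post))
post <- post / sum(post * 0.1)          # normalize over the grid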

To approximate $p(\beta_i \mid \boldsymbol{y})$, Rue, Martino, and Chopin () proceed instead by building an approximation based on maximizing $p(\boldsymbol{\beta}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y})$ to yield $\widehat{\boldsymbol{\beta}}^{(i)}$, whose $i$th element is $\beta_i$, and setting
$$\widetilde{p}(\beta_i \mid \boldsymbol{\theta}, \boldsymbol{y}) \propto \frac{p(\widehat{\boldsymbol{\beta}}^{(i)}, \boldsymbol{\theta} \mid \boldsymbol{y})}{\widetilde{p}(\widehat{\boldsymbol{\beta}}^{(i)}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y})},$$
with a suitable renormalization of $\widetilde{p}(\widehat{\boldsymbol{\beta}}^{(i)}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y})$. Such approximations are reminiscent of profile likelihood.

While we could use the Laplace approximation $p_{\mathsf{G}}(\boldsymbol{\beta} \mid \boldsymbol{y}, \boldsymbol{\theta})$ and marginalize it directly, this leads to evaluating the Laplace approximation to the density far from the mode, which is often inaccurate. Another challenge is that $p$ is often very large, so the Hessian $\mathbf{H}$ is costly to evaluate. Having to evaluate it repeatedly for each marginal $\beta_i$, $i = 1, \ldots, p$, is prohibitive since it involves factorizations of $p \times p$ matrices.

To reduce the computational costs, Rue, Martino, and Chopin () propose using the approximate mean to avoid optimizing, and consider the conditional distribution derived from the Gaussian approximation with mean $\widehat{\boldsymbol{\beta}}$ and covariance $\boldsymbol{\Sigma} = \mathbf{H}^{-1}(\widehat{\boldsymbol{\beta}})$,
$$\boldsymbol{\beta}_{-i} \mid \beta_i, \boldsymbol{\theta}, \boldsymbol{y} \sim \mathsf{Gauss}_{p-1}\{\widetilde{\boldsymbol{\beta}}_{(-i)}, \mathbf{M}_{-i,-i}^{-1}\}, \qquad \widetilde{\boldsymbol{\beta}}_{(-i)} = \widehat{\boldsymbol{\beta}}_{-i} + \boldsymbol{\Sigma}_{-i,i}\Sigma_{i,i}^{-1}(\beta_i - \widehat{\beta}_i);$$
this only requires a rank-one update. Wood () suggests using a Newton step to correct $\widetilde{\boldsymbol{\beta}}_{(-i)}$, starting from the conditional mean. The second step is to exploit the local dependence on $\boldsymbol{\beta}$, using the Markov structure to build an improvement to the Hessian. Further improvements are proposed in Rue, Martino, and Chopin (), who use a simplified Laplace approximation to correct the Gaussian approximation for location and skewness, a necessary step when the likelihood itself is not Gaussian. This leads to a Taylor series approximation to correct the log determinant of the Hessian matrix. Wood () considers a BFGS update to $\mathbf{M}_{-i,-i}^{-1}$ directly, which works less well than the Taylor expansion near $\widehat{\beta}_i$, but improves upon it as we move away from this value. Nowadays, the INLA software uses a low-rank variational correction to the Laplace method, proposed in van Niekerk and Rue ().
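
The conditional mean in the display above is just the usual Gaussian conditioning formula; the tiny simulated check below (all quantities arbitrary) shows its equivalence with the precision-based form that exploits the negative Hessian directly.

# Gaussian conditional mean: covariance form versus precision (negative Hessian) form
set.seed(1)
p <- 4
A <- matrix(rnorm(p * p), p, p)
Sigma <- crossprod(A)                     # covariance of the Gaussian approximation
beta_hat <- rnorm(p)                      # mode of the Gaussian approximation
i <- 2; beta_i <- 1.5                     # condition on the value of the i-th component
cond_mean <- beta_hat[-i] + Sigma[-i, i] / Sigma[i, i] * (beta_i - beta_hat[i])
H <- solve(Sigma)                         # precision matrix
cond_mean2 <- beta_hat[-i] - solve(H[-i, -i], H[-i, i]) * (beta_i - beta_hat[i])
max(abs(cond_mean - cond_mean2))          # numerically zero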

The INLA R package provides an interface to fit models with Gaussian latent random effects. While the software is particularly popular for spatio-temporal applications using the SPDE approach, we revisit two examples in the sequel where we can exploit the Markov structure.

Example 9.4 (Stochastic volatility model with INLA) Financial returns $Y_t$ typically exhibit time-varying variability. The stochastic volatility model is a parameter-driven model that specifies
$$Y_t = \exp(h_t/2) Z_t, \qquad h_t = \gamma + \phi(h_{t-1} - \gamma) + \sigma U_t,$$
where $U_t \stackrel{\mathrm{iid}}{\sim} \mathsf{Gauss}(0, 1)$ and $Z_t \stackrel{\mathrm{iid}}{\sim} \mathsf{Gauss}(0, 1)$. The INLA documentation provides information about which default priors and hyperparameters are specified. We use a gamma(1, 0.001) prior for the precision.

library(INLA)
# Stochastic volatility model
data(exchangerate, package = "hecbayes")
# Compute response from raw spot exchange rates at noon
y <- 100*diff(log(exchangerate$dexrate))
# 'y' is now a series of percentage of log daily differences
time <- seq_along(y)
data <- data.frame(y = y, time = time)
# Stochastic volatility model
# https://inla.r-inla-download.org/r-inla.org/doc/likelihood/stochvolgaussian.pdf
# The model uses a log link, and a (log)-gamma prior for the precision
f_stochvol <- formula(y ~ f(time, model = "ar1", param = list(prec = c(1, 0.001))))
mod_stochvol <- inla(f_stochvol, family = "stochvol", data = data)
# Obtain summary
summary <- summary(mod_stochvol)
# plot(mod_stochvol)
marg_prec <- mod_stochvol$marginals.hyperpar[[1]]
marg_phi <- mod_stochvol$marginals.hyperpar[[2]]
Figure 9.3: Marginal densities of precision and autocorrelation parameters from the Gaussian stochastic volatility model.

Figure 9.3 shows that the correlation $\phi$ is nearly one, leading to random walk behaviour and high persistence over time (this is also due to the frequency of observations). This strong serial dependence in the variance is in part responsible for the difficulty of fitting this model using MCMC.

We can use the marginal density approximations to obtain quantiles for summaries of interest. The software also includes utilities to transform the parameters using the change of variables formula.

# Compute density, quantiles, etc. via inla.*marginal
## approximate 95% credible interval and marginal post median
INLA::inla.qmarginal(marg_phi, p = c(0.025, 0.5, 0.975))
[1] 0.9706630 0.9847944 0.9929106
# Change of variable to get variance from precision
marg_var <- INLA::inla.tmarginal(
  fun = function(x) { 1 / x }, 
  marginal = marg_prec)
INLA::inla.qmarginal(marg_var, p = c(0.025, 0.975))
[1] 0.2864908 0.7396801
# Posterior marginal mean and variance of phi
mom1 <- INLA::inla.emarginal(
    fun = function(x){x}, 
    marginal = marg_phi)
mom2 <- INLA::inla.emarginal(
    fun = function(x){x^2}, 
    marginal = marg_phi)
c(mean = mom1, sd = sqrt(mom2 - mom1^2))
      mean         sd 
0.98405272 0.00576251 

Example 9.5 (Tokyo binomial time series) We revisit the Tokyo rainfall binomial time series, but this time fit the model with INLA. We specify the mean model without an intercept and fit a logistic regression, with a second-order cyclic random walk prior for the coefficients and the default priors for the other parameters.

data(Tokyo, package = "INLA")
# Formula (removing intercept)
formula <- y ~ f(time, model = "rw2", cyclic = TRUE) - 1
mod <- INLA::inla(
   formula = formula, 
   family = "binomial",
   Ntrials = n, 
   data = Tokyo)
Figure 9.4: Posterior probability per day of the year with posterior median and 95% credible interval for the Tokyo rainfall binomial time series.

Figure 9.4 shows posterior summaries for $\boldsymbol{\beta}$, which align with the results obtained for the probit model.

If we want to obtain predictions, we need to augment the model matrix and set the corresponding response values to missing; these are then imputed alongside the other parameters.
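
A minimal sketch of this mechanism, treating the last week of the Tokyo series as unobserved (an illustrative choice); the control.predictor argument requests the fitted values and uses the first link function for the missing responses.

# Predictions via missing responses: the held-out y values are treated as unknowns
Tokyo_miss <- Tokyo
Tokyo_miss$y[360:366] <- NA
mod_pred <- INLA::inla(
   formula = formula,
   family = "binomial",
   Ntrials = n,
   data = Tokyo_miss,
   control.predictor = list(compute = TRUE, link = 1))
# Posterior summaries of the fitted probabilities for the held-out days
tail(mod_pred$summary.fitted.values, 7)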