Bayesian modelling
Variational inference
Last compiled Tuesday Apr 15, 2025
Variational inference
Laplace approximation provides a heuristic for large-sample approximations, but it may fail to characterize the posterior $p(\boldsymbol{\theta} \mid \boldsymbol{y})$ well.
We consider instead a setting where we approximate $p(\boldsymbol{\theta} \mid \boldsymbol{y})$ by another distribution $g(\boldsymbol{\theta})$, which we wish to be close to the posterior.
The term variational is a synonym for optimization in this context.
Kullback–Leibler divergence
The Kullback–Leibler divergence between densities $f(\cdot)$ and $g(\cdot)$ is
$$\mathrm{KL}(f \parallel g) = \int \log \left\{\frac{f(\boldsymbol{x})}{g(\boldsymbol{x})}\right\} f(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x} = \mathrm{E}_{f}\{\log f(\boldsymbol{X})\} - \mathrm{E}_{f}\{\log g(\boldsymbol{X})\}.$$
The first term, $\mathrm{E}_{f}\{\log f(\boldsymbol{X})\}$, does not depend on $g$.
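For instance, the divergence between two univariate Gaussian densities $f = \mathsf{Gauss}(\mu_1, \sigma_1^2)$ and $g = \mathsf{Gauss}(\mu_2, \sigma_2^2)$ is available in closed form,
$$\mathrm{KL}(f \parallel g) = \log\left(\frac{\sigma_2}{\sigma_1}\right) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2},$$
which is zero only when the two densities coincide and differs from $\mathrm{KL}(g \parallel f)$ in general.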
Model misspecification
- The divergence is strictly positive unless $g$ equals $f$ almost everywhere.
- The divergence is not symmetric.
The Kullback–Leibler divergence is central to the study of model misspecification:
- if we fit a model $g(\cdot; \boldsymbol{\psi})$ when the data arise from a distribution with density $f$, the maximum likelihood estimator of the parameters will converge to the value of the parameter that minimizes the Kullback–Leibler divergence $\mathrm{KL}\{f \parallel g(\cdot; \boldsymbol{\psi})\}$.
Marginal likelihood
Consider now the problem of approximating the marginal likelihood, sometimes called the evidence,
$$p(\boldsymbol{y}) = \int_{\boldsymbol{\Theta}} p(\boldsymbol{y}, \boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta},$$
where we only have the joint $p(\boldsymbol{y}, \boldsymbol{\theta}) = p(\boldsymbol{y} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta})$, the product of the likelihood times the prior.
Approximating the marginal likelihood
Consider the posterior $p(\boldsymbol{\theta} \mid \boldsymbol{y})$ with an approximating density function $g(\boldsymbol{\theta})$
- whose integral is one over $\boldsymbol{\Theta}$ (normalized density)
- whose support is part of that of $p(\boldsymbol{\theta} \mid \boldsymbol{y})$ (so the KL divergence is not infinite).
Objective: minimize the Kullback–Leibler divergence
$$\mathrm{KL}\left\{p(\boldsymbol{\theta} \mid \boldsymbol{y}) \parallel g(\boldsymbol{\theta})\right\}.$$
Problems ahead
Minimizing the Kullback–Leibler divergence $\mathrm{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{y}) \parallel g(\boldsymbol{\theta})\}$ is not feasible, because we cannot evaluate the posterior.
Taking the expectation with respect to the posterior is not feasible either: we need the marginal likelihood to compute the expectation!
Alternative expression for the marginal likelihood
We consider a different objective to bound the marginal likelihood. Write
$$p(\boldsymbol{y}) = \int_{\boldsymbol{\Theta}} \frac{p(\boldsymbol{y}, \boldsymbol{\theta})}{g(\boldsymbol{\theta})}\, g(\boldsymbol{\theta})\,\mathrm{d}\boldsymbol{\theta} = \mathrm{E}_{g}\left\{\frac{p(\boldsymbol{y}, \boldsymbol{\theta})}{g(\boldsymbol{\theta})}\right\}.$$
Bounding the marginal likelihood
For a convex function $h(\cdot)$, Jensen’s inequality implies that $h\{\mathrm{E}(X)\} \leq \mathrm{E}\{h(X)\}$, and applying this with $h(x) = -\log(x)$ we get
$$\log p(\boldsymbol{y}) = \log \mathrm{E}_{g}\left\{\frac{p(\boldsymbol{y}, \boldsymbol{\theta})}{g(\boldsymbol{\theta})}\right\} \geq \mathrm{E}_{g}\left\{\log p(\boldsymbol{y}, \boldsymbol{\theta}) - \log g(\boldsymbol{\theta})\right\}.$$
Evidence lower bound
We can thus consider the approximating density $g$ that minimizes the reverse Kullback–Leibler divergence $\mathrm{KL}\{g(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta} \mid \boldsymbol{y})\}$.
Since $p(\boldsymbol{\theta} \mid \boldsymbol{y}) = p(\boldsymbol{y}, \boldsymbol{\theta})/p(\boldsymbol{y})$,
$$\mathrm{KL}\left\{g(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta} \mid \boldsymbol{y})\right\} = \mathrm{E}_{g}\{\log g(\boldsymbol{\theta})\} - \mathrm{E}_{g}\{\log p(\boldsymbol{y}, \boldsymbol{\theta})\} + \log p(\boldsymbol{y}).$$
Evidence lower bound
Instead of minimizing the Kullback–Leibler divergence, we can equivalently maximize the so-called evidence lower bound (ELBO)
$$\mathrm{ELBO}(g) = \mathrm{E}_{g}\{\log p(\boldsymbol{y}, \boldsymbol{\theta})\} - \mathrm{E}_{g}\{\log g(\boldsymbol{\theta})\}.$$
The ELBO is a lower bound for the marginal likelihood because a Kullback–Leibler divergence is non-negative and
$$\log p(\boldsymbol{y}) = \mathrm{ELBO}(g) + \mathrm{KL}\left\{g(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta} \mid \boldsymbol{y})\right\}.$$
Use of ELBO
The idea is that we will approximate the posterior density $p(\boldsymbol{\theta} \mid \boldsymbol{y})$ by $g(\boldsymbol{\theta})$;
- the ELBO can be used for model comparison (but we compare bounds…)
- we can sample from $g$ as before.
Heuristics of ELBO
Maximize the evidence, subject to a regularization term:
The ELBO is an objective function comprising:
- the first term will be maximized by taking a distribution placing mass near the maximum a posteriori (MAP) estimate of $\boldsymbol{\theta}$
- the second term can be viewed as a penalty that favours high entropy of the approximating family (higher for distributions which are diffuse).
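To make the bound concrete, here is a minimal sketch for a toy conjugate model, $Y_i \sim \mathsf{Gauss}(\theta, 1)$ with prior $\theta \sim \mathsf{Gauss}(0, 1)$, where the log evidence is available exactly: the Monte Carlo ELBO of any Gaussian $g$ stays below it, with equality when $g$ is the true posterior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)
n = 20
y = rng.normal(loc=1.0, scale=1.0, size=n)  # simulated data

# Exact log evidence: y ~ Gauss_n(0, I_n + 1 1^T) after integrating theta out
log_evidence = stats.multivariate_normal(
    mean=np.zeros(n), cov=np.eye(n) + np.ones((n, n))).logpdf(y)

def elbo(m, s, nsim=100_000):
    """Monte Carlo estimate of the ELBO for g = Gauss(m, s^2)."""
    theta = rng.normal(m, s, size=nsim)
    log_joint = (stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
                 + stats.norm.logpdf(theta, loc=0.0, scale=1.0))
    log_g = stats.norm.logpdf(theta, loc=m, scale=s)
    return np.mean(log_joint - log_g)

# True posterior is Gauss(n * ybar / (n + 1), 1 / (n + 1))
m_post, s_post = n * y.mean() / (n + 1), np.sqrt(1 / (n + 1))
print(f"log evidence:           {log_evidence:.3f}")
print(f"ELBO at true posterior: {elbo(m_post, s_post):.3f}")  # equal to the evidence
print(f"ELBO at a poor guess:   {elbo(0.0, 2.0):.3f}")        # strictly smaller
```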
Laplace vs variational approximation
Figure 1: Skewed density with the Laplace approximation (dashed orange) and variational Gaussian approximation (dotted blue).
Choice of approximating density
In practice, the quality of the approximation depends on the choice of the approximating family for $g$.
- We typically want matching support.
- The approximation will be affected by the correlation between posterior components.
- Derivations can also be done for the augmented posterior $p(\boldsymbol{\theta}, \boldsymbol{u} \mid \boldsymbol{y})$, where $\boldsymbol{u}$ are latent variables from a data augmentation scheme.
Factorization
We can consider densities $g(\boldsymbol{\theta})$ that factorize into blocks with parameters $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M$, where
$$g(\boldsymbol{\theta}) = \prod_{j=1}^{M} g_j(\boldsymbol{\theta}_j).$$
If we assume that each of the parameters is independent (one block per parameter), then we obtain a mean-field approximation.
Maximizing the ELBO one step at a time
Consider the ELBO as a function of a single factor $g_j$, treating the other factors as fixed. Writing $\mathrm{E}_{-j}(\cdot)$ for the expectation with respect to all factors except $g_j$, and $\mathrm{E}_{j}(\cdot)$ for that with respect to $g_j$,
$$\mathrm{ELBO}(g_j) = \mathrm{E}_{j}\left[\mathrm{E}_{-j}\{\log p(\boldsymbol{y}, \boldsymbol{\theta})\}\right] - \mathrm{E}_{j}\{\log g_j(\boldsymbol{\theta}_j)\} + \mathrm{const},$$
which is the negative of a Kullback–Leibler divergence between $g_j(\boldsymbol{\theta}_j)$ and the density proportional to $\exp\left[\mathrm{E}_{-j}\{\log p(\boldsymbol{y}, \boldsymbol{\theta})\}\right]$.
Optimal choice of approximating density
The maximum possible value of zero for the KL divergence is attained when $g_j$ equals that density. The choice of marginal $g_j$ that maximizes the ELBO is thus
$$g^{\star}_j(\boldsymbol{\theta}_j) \propto \exp\left[\mathrm{E}_{-j}\{\log p(\boldsymbol{y}, \boldsymbol{\theta})\}\right].$$
Often, we look at the kernel of $g^{\star}_j$ to deduce the normalizing constant.
Coordinate-ascent variational inference (CAVI)
- We can maximize the ELBO in turn for each factor $g_j$, treating the other parameters as fixed.
- This scheme is guaranteed to monotonically increase the ELBO until convergence to a local maximum.
- Convergence: monitor the ELBO and stop when the change is lower than some preset numerical tolerance.
- The approximation may have multiple local optima: perform random initializations and keep the best one.
Example of CAVI mean-field for Gaussian target
We consider the example from Section 2.2.2 of Ormerod & Wand (2010) for the approximation of a Gaussian distribution, with $Y_i \sim \mathsf{Gauss}(\mu, \tau^{-1})$ for $i = 1, \ldots, n$ and priors $\mu \sim \mathsf{Gauss}(\mu_0, \tau_0^{-1})$ and $\tau \sim \mathsf{Gamma}(a_0, b_0)$. This is an example where the full posterior is available in closed form, so we can compare our approximation with the truth.
Variational approximation to Gaussian — mean
We assume a factorization of the variational approximation $g_{\mu}(\mu)\,g_{\tau}(\tau)$; the factor for $\mu$ is proportional to
$$g^{\star}_{\mu}(\mu) \propto \exp\left[\mathrm{E}_{\tau}\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \mu)^2 - \frac{\tau_0}{2}(\mu - \mu_0)^2\right\}\right],$$
which is quadratic in $\mu$ and thus must be Gaussian with precision $\tau_n = \tau_0 + n\,\mathrm{E}_{\tau}(\tau)$ and mean $\mu_n = \{\tau_0\mu_0 + \mathrm{E}_{\tau}(\tau)\sum_{i=1}^n y_i\}/\tau_n$.
Variational approximation to Gaussian — precision
The optimal precision factor satisfies
$$g^{\star}_{\tau}(\tau) \propto \exp\left[\mathrm{E}_{\mu}\left\{\left(a_0 - 1 + \frac{n}{2}\right)\log \tau - \tau\left(b_0 + \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2\right)\right\}\right],$$
thus a gamma with shape $a_0 + n/2$ and rate $b_n = b_0 + \mathrm{E}_{\mu}\left\{\sum_{i=1}^{n}(y_i - \mu)^2\right\}/2$.
Rate of the gamma for
It is helpful to rewrite the expected value as
$$\mathrm{E}_{\mu}\left\{\sum_{i=1}^{n}(y_i - \mu)^2\right\} = \sum_{i=1}^{n}(y_i - \mu_n)^2 + \frac{n}{\tau_n},$$
so that it depends directly on the parameters $\mu_n$ and $\tau_n$ of the distribution of $\mu$.
CAVI for Gaussian
The algorithm cycles through the following updates until convergence:
$$\tau_n \leftarrow \tau_0 + n\,\mathrm{E}_{\tau}(\tau), \qquad \mu_n \leftarrow \frac{\tau_0\mu_0 + \mathrm{E}_{\tau}(\tau)\sum_{i=1}^n y_i}{\tau_n}, \qquad b_n \leftarrow b_0 + \frac{1}{2}\left\{\sum_{i=1}^{n}(y_i - \mu_n)^2 + \frac{n}{\tau_n}\right\},$$
- where $\mathrm{E}_{\tau}(\tau) = (a_0 + n/2)/b_n$, and $b_n$ is a function of both $\mu_n$ and $\tau_n$.
We only compute the ELBO at the end of each cycle.
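A minimal sketch of these cyclic updates, assuming simulated data and hypothetical vague hyperparameter values, and stopping when the variational parameters stabilize:

```python
import numpy as np

rng = np.random.default_rng(80601)
y = rng.normal(loc=2.0, scale=1.5, size=50)   # simulated data
n, sum_y = y.size, y.sum()

# Hyperparameters of the priors mu ~ Gauss(mu0, 1/tau0) and tau ~ Gamma(a0, b0)
mu0, tau0, a0, b0 = 0.0, 0.01, 0.01, 0.01

a_n = a0 + n / 2               # shape of g_tau, fixed across iterations
b_n = a_n * y.var()            # initialize the rate so that E(tau) is about 1/var(y)
mu_n, tau_n = y.mean(), 1.0    # mean and precision of the Gaussian factor g_mu

for it in range(100):
    E_tau = a_n / b_n                          # E(tau) under g_tau
    tau_n_new = tau0 + n * E_tau               # precision of g_mu
    mu_n_new = (tau0 * mu0 + E_tau * sum_y) / tau_n_new
    # E_mu{sum (y_i - mu)^2} = sum (y_i - mu_n)^2 + n / tau_n
    b_n_new = b0 + 0.5 * (np.sum((y - mu_n_new) ** 2) + n / tau_n_new)
    change = max(abs(mu_n_new - mu_n), abs(tau_n_new - tau_n), abs(b_n_new - b_n))
    mu_n, tau_n, b_n = mu_n_new, tau_n_new, b_n_new
    if change < 1e-8:
        break

print(f"g_mu:  Gauss(mean={mu_n:.3f}, sd={tau_n ** -0.5:.3f})")
print(f"g_tau: Gamma(shape={a_n:.2f}, rate={b_n:.3f})")
```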
Maximization?
Recall that alternating these steps is equivalent to maximization of the ELBO.
- each iteration performs conditional optimization implicitly (as we minimize the reverse KL divergence).
Monitoring convergence
The derivation of the ELBO for this model is straightforward but tedious.
We can also consider relative changes in parameter values as tolerance criterion.
Bivariate posterior density
Figure 2: Bivariate density posterior for the conjugate Gaussian-gamma model (left) and CAVI approximation (right).
Marginal posterior densities
Figure 3: Marginal posterior density of the mean and precision of the Gaussian (full line), with CAVI approximation (dashed).
CAVI for probit regression
A probit regression is a generalized linear model for binary data with probability of success $p_i = \Phi(\mathbf{x}_i\boldsymbol{\beta})$, where $\Phi(\cdot)$ is the cumulative distribution function of a standard Gaussian variable.
We can write the model as $Y_i = \mathrm{I}(Z_i > 0)$ with latent $Z_i \sim \mathsf{Gauss}(\mathbf{x}_i\boldsymbol{\beta}, 1)$, since
$$\Pr(Y_i = 1) = \Pr(Z_i > 0) = \Phi(\mathbf{x}_i\boldsymbol{\beta}).$$
Data augmentation and CAVI
Consider data augmentation with auxiliary variables $\boldsymbol{Z} = (Z_1, \ldots, Z_n)$.
With an improper flat prior $p(\boldsymbol{\beta}) \propto 1$, the model admits conditionals
$$\boldsymbol{\beta} \mid \boldsymbol{z}, \boldsymbol{y} \sim \mathsf{Gauss}\left\{(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{z},\ (\mathbf{X}^\top\mathbf{X})^{-1}\right\}, \qquad Z_i \mid y_i, \boldsymbol{\beta} \sim \mathsf{TruncGauss}(\mathbf{x}_i\boldsymbol{\beta}, 1, l_i, u_i),$$
where the truncation interval $[l_i, u_i)$ is $[0, \infty)$ if $y_i = 1$ and $(-\infty, 0)$ if $y_i = 0$.
CAVI factorization for probit model
We consider a factorization of the form
$$g(\boldsymbol{\beta}, \boldsymbol{z}) = g_{\boldsymbol{\beta}}(\boldsymbol{\beta})\,g_{\boldsymbol{z}}(\boldsymbol{z}).$$
Then, the optimal form of the density further factorizes as
$$g_{\boldsymbol{z}}(\boldsymbol{z}) = \prod_{i=1}^{n} g_{z_i}(z_i).$$
Gibbs, EM and CAVI
- We exploit the conditionals in the same way as for Gibbs sampling
- The only difference is that we substitute unknown parameter functionals by their expectations.
- There are also deep links with the expectation-maximization (EM) algorithm, which at each step optimizes the parameters after replacing the log posterior of the augmented data by its expectation.
- CAVI, however, plugs in fixed expected values for the unknown parameters in each update, which translates into less uncertainty in the posterior approximation.
Updates for CAVI - probit regression
The updates depend on
- $\mathrm{E}(\boldsymbol{\beta})$, the mean parameter of $g_{\boldsymbol{\beta}}$,
- $\mathrm{E}(\boldsymbol{Z})$, the mean of $g_{\boldsymbol{z}}$.
Consider the terms in the log posterior proportional to $z_i$, namely $-\tfrac{1}{2}z_i^2 + z_i\mathbf{x}_i\boldsymbol{\beta}$ (restricted to the relevant half-line), which is linear in $\boldsymbol{\beta}$. Taking the expectation with respect to $g_{\boldsymbol{\beta}}$ shows that the optimal factor $g_{z_i}$ is a Gaussian with location $\mathbf{x}_i\mathrm{E}(\boldsymbol{\beta})$ and unit variance, truncated to $[0, \infty)$ if $y_i = 1$ and to $(-\infty, 0)$ otherwise.
Truncated Gaussian
The expectation of a univariate Gaussian $Z \sim \mathsf{Gauss}(\mu, 1)$ truncated to the interval $[l, u]$ is
$$\mathrm{E}(Z) = \mu + \frac{\phi(l - \mu) - \phi(u - \mu)}{\Phi(u - \mu) - \Phi(l - \mu)},$$
where $\phi(\cdot)$ denotes the standard Gaussian density function.
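As a quick numerical check of this formula, scipy's truncated Gaussian returns the same mean for a unit-variance example truncated below at zero:

```python
import numpy as np
from scipy import stats

mu, l, u = 0.7, 0.0, np.inf   # Gauss(0.7, 1) truncated to [0, infinity)
formula = mu + (stats.norm.pdf(l - mu) - stats.norm.pdf(u - mu)) / \
    (stats.norm.cdf(u - mu) - stats.norm.cdf(l - mu))
scipy_mean = stats.truncnorm(a=l - mu, b=u - mu, loc=mu, scale=1.0).mean()
print(formula, scipy_mean)    # both approximately 1.112
```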
Update for CAVI
If we replace $\mu$ by $\mathbf{x}_i\mathrm{E}(\boldsymbol{\beta})$, we get the update
$$\mathrm{E}(Z_i) = \mathbf{x}_i\mathrm{E}(\boldsymbol{\beta}) + \frac{(2y_i - 1)\,\phi\{\mathbf{x}_i\mathrm{E}(\boldsymbol{\beta})\}}{\Phi\{(2y_i - 1)\,\mathbf{x}_i\mathrm{E}(\boldsymbol{\beta})\}},$$
since $Z_i$ is truncated below at zero if $y_i = 1$ and above at zero if $y_i = 0$.
Update for regression parameters
The optimal form for $g_{\boldsymbol{\beta}}$ is Gaussian and, proceeding similarly,
$$g_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \equiv \mathsf{Gauss}\left\{(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathrm{E}(\boldsymbol{Z}),\ (\mathbf{X}^\top\mathbf{X})^{-1}\right\},$$
where $\mathrm{E}(\boldsymbol{Z}) = \{\mathrm{E}(Z_1), \ldots, \mathrm{E}(Z_n)\}^\top$.
The other parameters of the distribution (here the covariance matrix) are known functions of the covariates and need not be updated.
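A minimal sketch of the resulting CAVI loop, assuming (as in the conditionals above) an improper flat prior on $\boldsymbol{\beta}$ and using simulated data with hypothetical coefficients:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)
n = 500
beta_true = np.array([-0.5, 1.0, 0.8])           # hypothetical coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.binomial(1, 0.5, size=n)])
y = rng.binomial(1, stats.norm.cdf(X @ beta_true))

XtX_inv = np.linalg.inv(X.T @ X)   # covariance of g_beta under the flat prior
E_beta = np.zeros(X.shape[1])      # initial value for the mean of g_beta

for it in range(200):
    eta = X @ E_beta
    sign = 2 * y - 1                             # +1 if y_i = 1, -1 otherwise
    # means of the truncated Gaussian factors g_{z_i}
    E_z = eta + sign * stats.norm.pdf(eta) / stats.norm.cdf(sign * eta)
    E_beta_new = XtX_inv @ X.T @ E_z             # mean of the Gaussian factor g_beta
    if np.max(np.abs(E_beta_new - E_beta)) < 1e-8:
        E_beta = E_beta_new
        break
    E_beta = E_beta_new

print("E(beta) under g_beta:", np.round(E_beta, 3))
print("true beta:           ", beta_true)
```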
Example
We consider for illustration purposes data from Experiment 2 of Duke & Amir (2023) on the effect of sequential decisions and purchasing formats.
We fit a probit model with covariates:
- age of the participant (scaled), and
- format, a binary variable which indicates the experimental condition (sequential vs integrated).
ELBO and marginal density approximation
Figure 4: ELBO (left) and marginal density approximation with true density (full) versus variational approximation (dashed).
Stochastic optimization
We consider alternative numeric schemes which rely on stochastic optimization (Hoffman et al., 2013).
The key idea behind these methods is that
- we can use gradient-based algorithms,
- and approximate the expectations with respect to $g(\cdot; \boldsymbol{\psi})$ by drawing samples from it.
This also allows for minibatch (random subset) selection to reduce computational costs in large samples.
Stochastic gradient descent
Consider a differentiable function $h(\boldsymbol{\psi})$ with gradient $\nabla h(\boldsymbol{\psi})$ and a Robbins–Monro sequence $\{\rho_t\}$ of step sizes.
To maximize $h(\boldsymbol{\psi})$, we construct a series of first-order approximations starting from $\boldsymbol{\psi}^{(0)}$, with
$$\boldsymbol{\psi}^{(t+1)} = \boldsymbol{\psi}^{(t)} + \rho_t\,\widehat{\nabla h}\left\{\boldsymbol{\psi}^{(t)}\right\},$$
where the expected value defining the gradient is evaluated via Monte Carlo, until changes in the objective are less than some tolerance value.
Robbins–Monro sequence
The step sizes $\rho_t$ must satisfy
$$\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty.$$
Parameter-specific scaling helps with updates of parameters on very different scales.
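A toy illustration of the recursion with a Robbins–Monro step size: we maximize $h(\psi) = -\mathrm{E}\{(X - \psi)^2\}$ for $X \sim \mathsf{Gauss}(3, 1)$, whose maximizer is $\psi = 3$, using noisy Monte Carlo gradient estimates $2(X - \psi)$ (a made-up objective chosen only so that the answer is known).

```python
import numpy as np

rng = np.random.default_rng(1)
psi = 0.0                          # starting value psi^(0)
for t in range(1, 5001):
    rho_t = 0.5 / t                # Robbins-Monro: sum rho_t = inf, sum rho_t^2 < inf
    x = rng.normal(3.0, 1.0, size=10)      # minibatch of draws from X
    grad_est = np.mean(2 * (x - psi))      # unbiased estimate of dh/dpsi
    psi = psi + rho_t * grad_est           # ascent step
print(round(psi, 3))               # close to the maximizer 3
```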
Black-box variational inference
Ranganath et al. (2014) show that the gradient of the ELBO reduces to
$$\nabla_{\boldsymbol{\psi}}\,\mathrm{ELBO}(g) = \mathrm{E}_{g}\left[\nabla_{\boldsymbol{\psi}} \log g(\boldsymbol{\theta}; \boldsymbol{\psi})\left\{\log p(\boldsymbol{y}, \boldsymbol{\theta}) - \log g(\boldsymbol{\theta}; \boldsymbol{\psi})\right\}\right],$$
using the chain rule, differentiation under the integral sign (dominated convergence theorem) and the score identity $\mathrm{E}_{g}\{\nabla_{\boldsymbol{\psi}} \log g(\boldsymbol{\theta}; \boldsymbol{\psi})\} = \boldsymbol{0}$.
Black-box variational inference in practice
- Note that the gradient simplifies for in exponential families.
- The gradient estimator is particularly noisy, so Ranganath et al. (2014) provide two methods to reduce the variance of this expression using control variates and Rao–Blackwellization.
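A minimal sketch of the plain score-function estimator for the toy conjugate model used earlier ($Y_i \sim \mathsf{Gauss}(\theta, 1)$ with prior $\theta \sim \mathsf{Gauss}(0, 1)$), without the variance-reduction devices of Ranganath et al. (2014); the approximating family is Gaussian with mean $m$ and log standard deviation $\omega$, so the updates remain noisy but roughly recover the exact posterior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2014)
n = 20
y = rng.normal(1.0, 1.0, size=n)          # data, model y_i ~ Gauss(theta, 1)

def log_joint(theta):
    """log p(y, theta) with prior theta ~ Gauss(0, 1)."""
    return (stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
            + stats.norm.logpdf(theta, loc=0.0, scale=1.0))

m, omega = 0.0, 0.0                       # variational parameters (mean, log sd)
for t in range(1, 3001):
    s = np.exp(omega)
    theta = rng.normal(m, s, size=500)                 # draws from g
    weight = log_joint(theta) - stats.norm.logpdf(theta, m, s)
    score_m = (theta - m) / s**2                       # d log g / d m
    score_omega = ((theta - m) / s) ** 2 - 1.0         # d log g / d omega
    grad_m = np.mean(score_m * weight)                 # noisy ELBO gradients
    grad_omega = np.mean(score_omega * weight)
    rho = 1.0 / (t + 100)                              # Robbins-Monro step size
    m, omega = m + rho * grad_m, omega + rho * grad_omega

# exact posterior is Gauss(n * ybar / (n + 1), 1 / (n + 1))
print(f"BBVI:  mean {m:.3f}, sd {np.exp(omega):.3f}")
print(f"exact: mean {n * y.mean() / (n + 1):.3f}, sd {np.sqrt(1 / (n + 1)):.3f}")
```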
Automatic differentiation variational inference
Kucukelbir et al. (2017) propose a stochastic gradient algorithm, but with two main innovations.
- The first is the general use of Gaussian approximating densities for the factorized density, with a parameter transformation $\boldsymbol{\zeta} = T(\boldsymbol{\theta})$ to map from the support of $\boldsymbol{\theta}$ to $\mathbb{R}^p$.
- The second is to use the resulting location-scale family to obtain an alternative form of the gradient.
Gaussian full-rank approximation
Consider a Gaussian approximation $g(\boldsymbol{\zeta}; \boldsymbol{\psi})$, where $\boldsymbol{\psi}$ consists of
- mean parameters $\boldsymbol{\mu}$, and
- the covariance matrix $\boldsymbol{\Sigma}$, parametrized through a Cholesky decomposition $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$.
The full-rank approximation is of course more flexible, but it is more expensive to compute than the mean-field approximation.
Gaussian entropy
The entropy of the multivariate Gaussian with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$, where $\mathbf{L}$ is a lower triangular matrix, is
$$\mathcal{E}(g) = \frac{p}{2}\{1 + \log(2\pi)\} + \frac{1}{2}\log|\boldsymbol{\Sigma}| = \frac{p}{2}\{1 + \log(2\pi)\} + \sum_{j=1}^{p}\log|L_{jj}|,$$
and only depends on $\mathbf{L}$.
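A quick numerical check of this identity against scipy's multivariate Gaussian entropy, for an arbitrary positive definite covariance matrix:

```python
import numpy as np
from scipy import stats

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
p = Sigma.shape[0]
L = np.linalg.cholesky(Sigma)        # Sigma = L L^T with L lower triangular
entropy_formula = 0.5 * p * (1 + np.log(2 * np.pi)) + np.sum(np.log(np.diag(L)))
entropy_scipy = stats.multivariate_normal(mean=np.zeros(p), cov=Sigma).entropy()
print(entropy_formula, entropy_scipy)   # identical up to rounding
```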
Eigendecomposition
We work with the matrix-log of the covariance matrix, defined through its eigendecomposition (or singular value decomposition)
$$\boldsymbol{\Sigma} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top, \qquad \boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_p),$$
where $\mathbf{Q}$ is a $p \times p$ orthogonal matrix of eigenvectors, whose inverse is equal to its transpose.
Matrix-log
Most operations on the matrix only affect the eigenvalues $\lambda_1, \ldots, \lambda_p$: the matrix-log is
$$\log(\boldsymbol{\Sigma}) = \mathbf{Q}\,\mathrm{diag}\{\log(\lambda_1), \ldots, \log(\lambda_p)\}\,\mathbf{Q}^\top.$$
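A small check that the eigendecomposition-based matrix-log agrees with scipy.linalg.logm for a symmetric positive definite matrix:

```python
import numpy as np
from scipy.linalg import logm

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
eigvals, Q = np.linalg.eigh(Sigma)               # Sigma = Q diag(lambda) Q^T
log_Sigma = Q @ np.diag(np.log(eigvals)) @ Q.T   # matrix-log via the eigenvalues
print(np.allclose(log_Sigma, logm(Sigma)))       # True
```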
Operations on matrices
Other operations on matrices are defined analogously:
- The symmetrization operator is $\mathrm{sym}(\mathbf{A}) = \tfrac{1}{2}(\mathbf{A} + \mathbf{A}^\top)$.
Gaussian scale
Since the Gaussian is a location-scale family, we can write $\boldsymbol{\zeta} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{Z}$ in terms of a standardized Gaussian, for $\boldsymbol{Z} \sim \mathsf{Gauss}_p(\boldsymbol{0}_p, \mathbf{I}_p)$, with $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$.
Gradients of the ELBO
Write the gradient of the joint log posterior density as $\nabla_{\boldsymbol{\zeta}} \log p(\boldsymbol{y}, \boldsymbol{\zeta})$. Then, the gradients of the ELBO with respect to the mean and scale are
$$\nabla_{\boldsymbol{\mu}}\,\mathrm{ELBO}(g) = \mathrm{E}_{\boldsymbol{Z}}\left\{\nabla_{\boldsymbol{\zeta}} \log p(\boldsymbol{y}, \boldsymbol{\zeta})\right\}, \qquad \nabla_{\mathbf{L}}\,\mathrm{ELBO}(g) = \mathrm{E}_{\boldsymbol{Z}}\left\{\nabla_{\boldsymbol{\zeta}} \log p(\boldsymbol{y}, \boldsymbol{\zeta})\,\boldsymbol{Z}^\top\right\} + \nabla_{\mathbf{L}}\,\mathcal{E}(g),$$
evaluated at $\boldsymbol{\zeta} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{Z}$, where $\mathcal{E}(g)$ is the entropy of the Gaussian approximation.
Gradients of ELBO for location-scale families
We can rewrite the expression for the gradient with respect to the matrix-log of the covariance using integration by parts (Stein's lemma): for $\boldsymbol{\zeta} \sim \mathsf{Gauss}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$,
$$\mathrm{E}\left\{\nabla_{\boldsymbol{\zeta}} \log p(\boldsymbol{y}, \boldsymbol{\zeta})\,(\boldsymbol{\zeta} - \boldsymbol{\mu})^\top\right\} = \mathrm{E}\left\{\nabla^2_{\boldsymbol{\zeta}} \log p(\boldsymbol{y}, \boldsymbol{\zeta})\right\}\boldsymbol{\Sigma},$$
so the scale gradient can be expressed either through the score or through the Hessian. The first expression typically leads to a noisier gradient estimator, but the second requires derivation of the Hessian.
Change of variable
The change of variable $\boldsymbol{\zeta} = T(\boldsymbol{\theta})$ introduces a Jacobian term $\mathbf{J}_{T^{-1}}(\boldsymbol{\zeta})$ for the approximation to the density, where
$$p(\boldsymbol{y}, \boldsymbol{\zeta}) = p\left\{\boldsymbol{y}, T^{-1}(\boldsymbol{\zeta})\right\}\left|\det \mathbf{J}_{T^{-1}}(\boldsymbol{\zeta})\right|,$$
and we replace the gradient $\nabla_{\boldsymbol{\zeta}} \log p(\boldsymbol{y}, \boldsymbol{\zeta})$ by
$$\nabla_{\boldsymbol{\zeta}}\left[\log p\left\{\boldsymbol{y}, T^{-1}(\boldsymbol{\zeta})\right\} + \log\left|\det \mathbf{J}_{T^{-1}}(\boldsymbol{\zeta})\right|\right].$$
Chain rule
If $\boldsymbol{\zeta} = T(\boldsymbol{\theta})$ and $\boldsymbol{\zeta} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{z}$, we have, for $\boldsymbol{\psi}$ equal to either $\boldsymbol{\mu}$ or $\mathbf{L}$, using the chain rule,
$$\frac{\partial}{\partial \boldsymbol{\psi}} \log p\left\{\boldsymbol{y}, T^{-1}(\boldsymbol{\zeta})\right\} = \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{y}, \boldsymbol{\theta})^\top\,\frac{\partial T^{-1}(\boldsymbol{\zeta})}{\partial \boldsymbol{\zeta}}\,\frac{\partial \boldsymbol{\zeta}}{\partial \boldsymbol{\psi}}.$$
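To tie the pieces together, here is a minimal sketch of the reparametrized stochastic gradient scheme for an illustrative one-parameter model (not the stochastic volatility example): the posterior of a precision $\tau$ with $y_i \sim \mathsf{Gauss}(0, \tau^{-1})$ and $\tau \sim \mathsf{Gamma}(a_0, b_0)$, mapped to the unconstrained scale via $\zeta = T(\tau) = \log \tau$, so that the Jacobian contributes $+\zeta$ to the log joint. The exact posterior is $\mathsf{Gamma}(a_0 + n/2,\, b_0 + \sum_i y_i^2/2)$, which we use to check the answer.

```python
import numpy as np

rng = np.random.default_rng(2017)
n = 100
y = rng.normal(0.0, scale=1 / np.sqrt(2.0), size=n)   # data with true precision 2
a0, b0 = 1.0, 1.0                                      # Gamma(a0, b0) prior on tau
A, c = a0 + n / 2, b0 + np.sum(y**2) / 2               # exact posterior is Gamma(A, c)

def grad_log_joint(zeta):
    """d/d zeta of log p(y, zeta), with zeta = log(tau), including the Jacobian term."""
    return A - c * np.exp(zeta)

mu, omega = 0.0, 0.0        # Gaussian approximation on zeta: mean and log sd
for t in range(1, 3001):
    z = rng.normal(size=100)                            # standardized Gaussian draws
    zeta = mu + np.exp(omega) * z                       # reparametrization
    g = grad_log_joint(zeta)
    grad_mu = np.mean(g)                                # gradient w.r.t. mu
    grad_omega = np.mean(g * z * np.exp(omega)) + 1.0   # gradient w.r.t. omega (+ entropy)
    rho = 1.0 / (t + 100)                               # Robbins-Monro step size
    mu, omega = mu + rho * grad_mu, omega + rho * grad_omega

# E(tau) = exp(mu + sd^2 / 2) under the log-normal approximation implied on tau
print(f"approx E(tau): {np.exp(mu + np.exp(2 * omega) / 2):.3f}")
print(f"exact  E(tau): {A / c:.3f}")
```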
Quality of approximation
Consider the stochastic volatility model.
Fitting HMC-NUTS to the exchange rate data takes 156 seconds for 10K iterations, vs 2 seconds for the mean-field approximation.
Comments
With vague priors, the coefficients for the mean match the frequentist point estimates of the probit regression to four significant digits.
Convergence is very fast, as shown by the ELBO plot.
The marginal density approximations are underdispersed.