```r
library(cubature)
# Moments of a tilted distribution: a Gaussian density times a logistic factor,
# computed by one-dimensional numerical integration.
#' @param mu_lc mean of the linear combination
#' @param sd_lc std. dev of the linear combination
ep_update <- function(mu_lc, sd_lc){
  # Calculate outside of the loop the cavity
  fn <- function(x){ dnorm(x, mean = mu_lc, sd = sd_lc) * plogis(x) }
  # Compute normalizing constant
  cst <- pcubature(f = fn, lowerLimit = -Inf, upperLimit = Inf)$integral
  mu <- pcubature(f = function(x){ fn(x) * x }, -Inf, Inf)$integral / cst
  va <- pcubature(f = function(x){ fn(x) * (x - mu)^2 }, -Inf, Inf)$integral / cst
  # Return the mean and variance of the tilted distribution
  list(mean = mu, var = va)
}
```
11 Expectation propagation
11.1 Newton smoothing
This section revisits Newton's method for computing the maximum a posteriori estimate and sheds new light on the technique by viewing it as a sequence of Gaussian approximations to the target. For simplicity, write the Gaussian distribution in terms of its canonical parameters \[\begin{align*} q(\boldsymbol{\theta}) \propto \exp \left( - \frac{1}{2} \boldsymbol{\theta}^\top \mathbf{Q}\boldsymbol{\theta} + \boldsymbol{\theta}^\top \boldsymbol{r}\right) \end{align*}\] where \(\mathbf{Q}\) is the precision matrix and \(\boldsymbol{r}=\mathbf{Q}\boldsymbol{\mu}\) the linear shift.
Let \(p(\boldsymbol{\theta} \mid \boldsymbol{y})=\exp\{-\psi(\boldsymbol{\theta})\}\) denote the posterior density. Since the logarithm is a monotonic transform, we can equivalently minimize \(\psi(\boldsymbol{\theta})\) to find the posterior mode. Denote the gradient \(\nabla_{\boldsymbol{\theta}} \psi(\boldsymbol{\theta}) = \partial \psi/\partial \boldsymbol{\theta}\) and the Hessian matrix \(\mathbf{H}(\boldsymbol{\theta}) = \partial^2 \psi/(\partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^\top).\) Starting from an initial value \(\boldsymbol{\theta}_{(0)},\) we consider at step \(i\) a second-order Taylor series expansion of \(\psi(\boldsymbol{\theta})\) around \(\boldsymbol{\theta}_{(i)},\) which gives \[\begin{align*} \psi(\boldsymbol{\theta}) \approx \psi(\boldsymbol{\theta}_{(i)}) + \nabla_{\boldsymbol{\theta}} \psi(\boldsymbol{\theta}_{(i)})^\top(\boldsymbol{\theta}-\boldsymbol{\theta}_{(i)}) + \frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_{(i)})^\top\mathbf{H}(\boldsymbol{\theta}_{(i)})(\boldsymbol{\theta}-\boldsymbol{\theta}_{(i)}). \end{align*}\] The term \(\psi(\boldsymbol{\theta}_{(i)})\) is constant, so if we plug this expansion into the exponential and discard terms that do not depend on \(\boldsymbol{\theta},\) we obtain \[\begin{align*} q_{(i+1)}(\boldsymbol{\theta}) &\propto \exp \left\{ - \frac{1}{2} \boldsymbol{\theta}^\top\mathbf{H}(\boldsymbol{\theta}_{(i)}) \boldsymbol{\theta} + \boldsymbol{\theta}_{(i+1)}^\top\mathbf{H}(\boldsymbol{\theta}_{(i)})\boldsymbol{\theta}\right\} \end{align*}\] where the mean of the approximation is \[\begin{align*} \boldsymbol{\theta}_{(i+1)} = \boldsymbol{\theta}_{(i)} - \mathbf{H}^{-1}(\boldsymbol{\theta}_{(i)}) \nabla_{\boldsymbol{\theta}} \psi(\boldsymbol{\theta}_{(i)}). \end{align*}\] The iterations perform gradient descent with a correction that adjusts for the local curvature. This scheme works provided that \(\mathbf{H}(\boldsymbol{\theta}_{(i)})\) is positive definite (all of its eigenvalues are positive); this may fail for non-convex targets, in which case we can perturb the scheme by adding a ridge penalty (a diagonal matrix with large positive entries) to ensure convergence until we reach a neighborhood of the mode. The new mean vector \(\boldsymbol{\theta}_{(i+1)}\) corresponds to a Newton update, and at the same time we have defined a sequence of Gaussian approximations. The fixed point to which the algorithm converges is the Laplace approximation.
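To make the connection concrete, here is a minimal sketch in R for a made-up one-parameter example (Poisson counts with a Gaussian prior on the log rate): each Newton step defines the mean of a new Gaussian approximation, and the fixed point yields the Laplace approximation.

```r
# Hypothetical toy example: Poisson counts with rate exp(theta) and a
# Gauss(0, 10) prior on theta, so psi(theta) is the negative log posterior.
y <- c(3, 5, 2, 4)
psi <- function(theta){ -sum(dpois(y, lambda = exp(theta), log = TRUE)) + theta^2/20 } # for reference
grad <- function(theta){ -sum(y) + length(y) * exp(theta) + theta/10 }
hess <- function(theta){ length(y) * exp(theta) + 1/10 }
theta <- 0  # initial value theta_(0)
for(i in 1:25){
  # Newton update: mean of the next Gaussian approximation q_(i+1)
  theta <- theta - grad(theta) / hess(theta)
}
# Fixed point: Laplace approximation with mean 'theta' and variance 1/hess(theta)
c(mean = theta, var = 1/hess(theta))
```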
Suppose for simplicity that the domain is \(\boldsymbol{\Theta}=\mathbb{R}^p,\) so that no prior transformation is necessary. For the Gaussian location-scale family, we saw in the section on ADVI that, writing the variational approximation as \(\boldsymbol{\theta} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{Z}\) with \(\mathbf{LL}^\top=\boldsymbol{\Sigma}\) and \(\boldsymbol{Z} \sim \mathsf{Gauss}_p(\boldsymbol{0}_p, \mathbf{I}_p),\) the gradients satisfy \[\begin{align*} \nabla_{\boldsymbol{\mu}}\mathsf{ELBO}(q)&= -\mathsf{E}_{\boldsymbol{Z}}\{\nabla_{\boldsymbol{\theta}}\psi(\boldsymbol{\theta})\} \\ \nabla_{\mathbf{L}}\mathsf{ELBO}(q)&= -\mathsf{E}_{\boldsymbol{Z}}\{\nabla_{\boldsymbol{\theta}}\psi(\boldsymbol{\theta})\boldsymbol{Z}^\top\} + \mathbf{L}^{-\top}. \end{align*}\] If we apply integration by parts (Stein’s lemma, see Proposition 11.1), using the fact that the integral is with respect to a standard Gaussian density \(\phi_p(\boldsymbol{z}),\) we can rewrite the second equation as \[\begin{align*} \nabla_{\mathbf{L}}\mathsf{ELBO}(q)&= -\mathsf{E}_{\boldsymbol{Z}}\left\{ \frac{\partial^2 \psi(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top}\right\}\mathbf{L} + \mathbf{L}^{-\top}. \end{align*}\] At a critical point, both of these derivatives must be zero, whence \[\begin{align*} \mathsf{E}_{\boldsymbol{Z}}\{\nabla_{\boldsymbol{\theta}}\psi(\boldsymbol{\theta})\} &= \boldsymbol{0}_p. \\ \mathsf{E}_{\boldsymbol{Z}}\left\{ \frac{\partial^2 \psi(\boldsymbol{\theta})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top}\right\} &= \boldsymbol{\Sigma}^{-1}. \end{align*}\] Compared to the Laplace approximation, the Gaussian variational approximation returns a vector \(\boldsymbol{\mu}\) around which the expected value of the gradient is zero, and a matrix \(\boldsymbol{\Sigma}\) for which the expected value of the curvature (Hessian) equals the precision. This averaging step is what distinguishes the Laplace and variational approximations.
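As an illustration of the averaging, the sketch below solves these fixed-point equations by crude Monte Carlo for the same made-up Poisson example as above; this naive iteration is only meant to show the mechanics and is not guaranteed to converge in general.

```r
set.seed(80667)
y <- c(3, 5, 2, 4)
grad <- function(theta){ -sum(y) + length(y) * exp(theta) + theta/10 }
hess <- function(theta){ length(y) * exp(theta) + 1/10 }
mu <- 0; sigma2 <- 1
for(i in 1:100){
  theta <- mu + sqrt(sigma2) * rnorm(1e4)   # draws from the current approximation
  sigma2 <- 1 / mean(hess(theta))           # E_Z{Hessian} equals the precision at the optimum
  mu <- mu - sigma2 * mean(grad(theta))     # E_Z{gradient} is zero at the optimum
}
# Gaussian variational approximation: mean and variance differ slightly from Laplace
c(mean = mu, var = sigma2)
```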
Proposition 11.1 (Stein’s lemma) Consider \(h: \mathbb{R}^d \to \mathbb{R}\) a differentiable function and integration with respect to \(\boldsymbol{X} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) such that the gradient is absolutely integrable, \(\mathsf{E}_{\boldsymbol{X}}\{|\nabla_i h(\boldsymbol{X})|\} < \infty\) for \(i=1, \ldots, d.\) Then (Liu 1994), \[\begin{align*} \mathsf{E}_{\boldsymbol{X}}\left\{h(\boldsymbol{X})(\boldsymbol{X}-\boldsymbol{\mu})\right\} = \boldsymbol{\Sigma}\mathsf{E}_{\boldsymbol{X}}\left\{\nabla h(\boldsymbol{X})\right\} \end{align*}\]
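A quick Monte Carlo sanity check of the lemma in the univariate case, taking \(h\) to be the logistic function (an arbitrary choice for illustration):

```r
set.seed(80667)
mu <- 1; sigma <- 2
X <- rnorm(1e5, mean = mu, sd = sigma)
h <- function(x){ plogis(x) }        # differentiable test function
hprime <- function(x){ dlogis(x) }   # its derivative
# Both sides of E{h(X)(X - mu)} = sigma^2 * E{h'(X)} should roughly agree
c(lhs = mean(h(X) * (X - mu)), rhs = sigma^2 * mean(hprime(X)))
```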
11.2 Expectation propagation
Variational inference minimizes the reverse Kullback–Leibler divergence between some approximation \(q\) and the target \(p(\boldsymbol{\theta} \mid \boldsymbol{y});\) because of the log term, it heavily penalizes placing mass in areas of \(\boldsymbol{\Theta}\) where \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\) is small or near zero, unless the approximation also vanishes there. Qualitatively, the approximation will be very different from the one obtained by minimizing the forward Kullback–Leibler divergence \(\mathsf{KL}\{p(\boldsymbol{\theta} \mid \boldsymbol{y}) \parallel q\},\) \[\begin{align*} \mathrm{argmin}_{\boldsymbol{\psi}} \int_{\boldsymbol{\Theta}}\log \left(\frac{p(\boldsymbol{\theta} \mid \boldsymbol{y}) }{q(\boldsymbol{\theta}; \boldsymbol{\psi})}\right) p(\boldsymbol{\theta} \mid \boldsymbol{y}) \mathrm{d} \boldsymbol{\theta}. \end{align*}\] One can show that, if we approximate the posterior \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\) by a Gaussian \(q(\cdot; \boldsymbol{\psi}),\) the parameters that minimize this Kullback–Leibler divergence are the posterior mean and posterior variance of \(p(\boldsymbol{\theta} \mid \boldsymbol{y}).\) Solving this problem exactly is typically infeasible, which is the main reason to consider approximations in the first place.
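The moment-matching claim can be checked numerically in one dimension. The sketch below, with a made-up Gamma target, minimizes \(\mathsf{KL}(p \parallel q)\) over the Gaussian mean and log standard deviation and recovers, up to numerical error, the mean and standard deviation of \(p.\)

```r
# Hypothetical target p: Gamma(shape = 3, rate = 1), with mean 3 and variance 3
kl_pq <- function(par){
  integrate(function(x){
    dgamma(x, shape = 3, rate = 1) *
      (dgamma(x, shape = 3, rate = 1, log = TRUE) -
         dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
  }, lower = 0, upper = Inf)$value
}
opt <- optim(par = c(1, 0), fn = kl_pq)
c(mean = opt$par[1], sd = exp(opt$par[2]))  # close to (3, sqrt(3))
```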
Expectation propagation is an approximation algorithm proposed by Minka (2001) that tackles the problem of minimizing this Kullback–Leibler divergence with distributions from an exponential family; we restrict attention here to Gaussian approximating functions. Expectation propagation is typically more accurate than variational Bayes, but generally slower, although it is quite fast when implemented properly and it is parallelizable. EP builds on a decomposition of the posterior as a product of terms; typically, we have likelihood contributions \(L_i(\boldsymbol{\theta})\) (called “factors” or “sites” in the EP lingo) for independent data, where \[\begin{align*} p(\boldsymbol{\theta} \mid \boldsymbol{y}) \propto p(\boldsymbol{\theta}) \prod_{i=1}^n L_i(\boldsymbol{\theta}) = \prod_{i=0}^n L_i(\boldsymbol{\theta}) \end{align*}\] with the convention that \(L_0(\boldsymbol{\theta})\) equals the prior density. Such a factorization is also feasible in graphical models (e.g., autoregressive processes, Markov fields), but need not be unique. Note that it is not equivalent to the factorization of the posterior (mean-field approximation) used in variational Bayes; every term in the EP approximation is a function of the whole vector \(\boldsymbol{\theta}.\)
Expectation propagation considers an approximation with the same factor structure, in which each \(q_i\) is Gaussian with precision \(\mathbf{Q}_i\) and linear shift \(\boldsymbol{r}_i,\) \[\begin{align*} q(\boldsymbol{\theta}) &\propto \prod_{i=0}^n q_i(\boldsymbol{\theta}) \\& \propto \prod_{i=0}^n \exp \left(-\frac{1}{2} \boldsymbol{\theta}^\top\mathbf{Q}_i\boldsymbol{\theta} + \boldsymbol{\theta}^\top\boldsymbol{r}_i\right) \\ &= \exp \left( - \frac{1}{2} \boldsymbol{\theta}^\top \sum_{i=0}^n\mathbf{Q}_i\boldsymbol{\theta} + \boldsymbol{\theta}^\top \sum_{i=0}^n\boldsymbol{r}_i\right) \end{align*}\] and so the global approximation is also Gaussian. The last line holds more generally for distributions in an exponential family; for the Gaussian, the canonical parameters are \(-\tfrac{1}{2}\boldsymbol{\Sigma}^{-1}\) and the linear shift \(\boldsymbol{r} = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}.\)
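A one-dimensional numerical check of this identity, with arbitrary site parameters chosen for illustration: the product of the site densities is proportional to the Gaussian whose canonical parameters are the sums.

```r
Q1 <- 2; r1 <- 1; Q2 <- 0.5; r2 <- -0.25   # arbitrary site precisions and shifts
Qg <- Q1 + Q2; rg <- r1 + r2               # canonical parameters of the global approximation
x <- seq(-4, 4, length.out = 5)
prod_sites <- dnorm(x, mean = r1/Q1, sd = 1/sqrt(Q1)) *
  dnorm(x, mean = r2/Q2, sd = 1/sqrt(Q2))
global <- dnorm(x, mean = rg/Qg, sd = 1/sqrt(Qg))
prod_sites / global   # constant ratio across x: proportionality holds
```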
Expectation propagation works by starting from the joint approximation and removing one site to get the cavity \[q_{-j}(\boldsymbol{\theta}) = \prod_{\substack{i = 0\\i \neq j}}^n q_{i}(\boldsymbol{\theta});\] in practice, this is most easily done by subtracting the canonical parameters of the site from those of the global approximation. We then replace the missing term by \(L_j(\boldsymbol{\theta})\) to construct the so-called hybrid density \(h_j(\boldsymbol{\theta}) = L_j(\boldsymbol{\theta})q_{-j}(\boldsymbol{\theta}).\) The resulting density is unnormalized, but in a sense closer to the target posterior than \(q(\boldsymbol{\theta}).\) We then compute the Gaussian approximation \(q^*_j\) that minimizes the Kullback–Leibler divergence \(\mathsf{KL}(h_j \parallel q^*_j)\) by matching moments, where \[\begin{align*} c_j &= \int h_j(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta} \\ \boldsymbol{\mu}_j &= c_{j}^{-1} \int \boldsymbol{\theta}\, h_j(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta} \\ \boldsymbol{\Sigma}_j &= c_j^{-1} \int (\boldsymbol{\theta} - \boldsymbol{\mu}_j)(\boldsymbol{\theta} - \boldsymbol{\mu}_j)^\top h_j(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta}. \end{align*}\] The normalizing constant, mean and variance above are written as \(p\)-dimensional integrals, but they can also be approximated easily by Monte Carlo, drawing from the cavity distribution \(q_{-j}.\) We then go back from moments to canonical parameters \(\mathbf{Q}_j\) and \(\boldsymbol{r}_j\) and update the global approximation \(q.\) These updates can be performed sequentially or in parallel, which can lead to massive gains in efficiency.
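The sketch below performs a single site update directly on the canonical parameters; the helper `moments_hybrid`, which would return the mean and covariance of the hybrid \(h_j\) (by quadrature, Monte Carlo or analytically), is hypothetical and left unspecified.

```r
ep_site_update <- function(Q, r, Q_j, r_j, moments_hybrid, j){
  # 1) Cavity: remove site j by subtracting its canonical parameters
  Q_cav <- Q - Q_j
  r_cav <- r - r_j
  Sigma_cav <- solve(Q_cav)
  mu_cav <- c(Sigma_cav %*% r_cav)
  # 2) Moments (mu, Sigma) of the hybrid density h_j
  mom <- moments_hybrid(mu_cav, Sigma_cav, j)
  # 3) Back to canonical parameters; the new site is the difference with the cavity
  Q_new <- solve(mom$Sigma)
  r_new <- c(Q_new %*% mom$mu)
  list(Q_j = Q_new - Q_cav, r_j = r_new - r_cav,  # updated site contribution
       Q = Q_new, r = r_new)                      # updated global approximation
}
```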
Example 11.1 (Logistic regression) Consider a logistic regression model where we code successes as \(Y=1\) and failures as \(Y=-1;\) for a given row vector \(\mathbf{x}\) of length \(p\) and the vector \(\boldsymbol{\beta} \in \mathbb{R}^p\) of coefficients, the probability of success is \[\begin{align*} \Pr(Y=1 \mid \mathbf{x}, \boldsymbol{\beta}) = \left\{1+\exp(-\mathbf{x}\boldsymbol{\beta})\right\}^{-1} = \mathrm{expit}(\mathbf{x}\boldsymbol{\beta}). \end{align*}\] The term on the right is the logistic function. Since \(1-\mathrm{expit}(x) = \mathrm{expit}(-x),\) this coding makes it easy to write the likelihood contribution of observation \(i\) as \[L_i(\boldsymbol{\beta}) = \mathrm{expit}(y_i \mathbf{x}_i\boldsymbol{\beta}).\] If the current approximation is \(\boldsymbol{\beta} \sim \mathsf{Gauss}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}),\) then the integrals for the normalizing constant, mean and variance depend only on the linear combination \(y_i\mathbf{x}_i\boldsymbol{\beta} \sim \mathsf{Gauss}(y_i \mathbf{x}_i\boldsymbol{\mu}, \mathbf{x}_i\boldsymbol{\Sigma}\mathbf{x}_i^\top),\) since \(y_i^2=1.\) This means that all integrals for the tilting problem are one-dimensional and can be obtained by Gaussian quadrature or similar numerical integration schemes.
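The helper `ep_update` listed at the top of this chapter computes these one-dimensional integrals with the `cubature` package, returning the mean and variance of the tilted distribution for given cavity parameters of the linear combination; a usage example with arbitrary values:

```r
# Tilted moments for illustrative cavity values of the linear combination
ep_update(mu_lc = 0.5, sd_lc = 1.2)
```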
Given the mean and variance of the cavity, denoted \(\boldsymbol{\mu}_{-j}\) and \(\boldsymbol{\Sigma}_{-j},\) and with \(s_j = \mathbf{x}_j\boldsymbol{\Sigma}_{-j}\mathbf{x}_j^\top,\) the updated mean and variance are \[\begin{align*} \boldsymbol{\mu}_j^* &= \boldsymbol{\mu}_{-j} + s_j^{-1}\boldsymbol{\Sigma}_{-j}\mathbf{x}_j^\top \{\mathsf{E}(Z_j) -\mathbf{x}_j\boldsymbol{\mu}_{-j}\} \\ \boldsymbol{\Sigma}_{j}^* &=\boldsymbol{\Sigma}_{-j} + s_j^{-2}\boldsymbol{\Sigma}_{-j}\mathbf{x}^\top_j\left\{\mathsf{Var}(Z_j) - s_j\right\}\mathbf{x}_j\boldsymbol{\Sigma}_{-j} \end{align*}\] where \(Z_j\) has density proportional to \(L_j(z)\,\phi(z; \mathbf{x}_j\boldsymbol{\mu}_{-j}, \mathbf{x}_j\boldsymbol{\Sigma}_{-j}\mathbf{x}_j^\top),\) and \(\phi(\cdot; \mu, \sigma^2)\) denotes the density of a Gaussian random variable with mean \(\mu\) and variance \(\sigma^2.\) The updates to the parameters for more general exponential families are given on page 23 of Cseke and Heskes (2011).
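A sketch of this rank-one update in R, assuming `tilted` holds the mean and variance of \(Z_j\) (obtained, for instance, by one-dimensional quadrature as in `ep_update`); the function name and arguments are illustrative.

```r
update_moments <- function(mu_cav, Sigma_cav, x_j, tilted){
  s_j <- c(x_j %*% Sigma_cav %*% x_j)   # s_j = x_j Sigma_{-j} x_j^T
  Sx <- c(Sigma_cav %*% x_j)            # Sigma_{-j} x_j^T
  mu_new <- mu_cav + Sx * (tilted$mean - c(x_j %*% mu_cav)) / s_j
  Sigma_new <- Sigma_cav + (tilted$var - s_j) / s_j^2 * tcrossprod(Sx)
  list(mu = mu_new, Sigma = Sigma_new)
}
```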
In generalized linear models, the linear predictor will always be one-dimensional, and more generally, in exponential families, we can get a similar dimension reduction.
The EP algorithm is particularly well suited to latent Gaussian models (Cseke and Heskes 2011) and generalized linear models with Gaussian priors for the coefficients.
The EP algorithm iterates the following steps until convergence (a minimal sketch in R follows the list):
1. Initialize the site-specific parameters.
2. Loop over each observation of the likelihood factorization:
   2.1 form the cavity and the hybrid distribution;
   2.2 compute the moments \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) of the hybrid;
   2.3 transform back to canonical parameters \(\mathbf{Q}\) and \(\boldsymbol{r}\);
   2.4 update the global approximation.
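Putting the pieces together, a minimal sequential EP loop could look as follows; it relies on the hypothetical `ep_site_update` helper sketched earlier and monitors changes in the canonical parameters to declare convergence.

```r
ep_loop <- function(Q_sites, r_sites, moments_hybrid, maxiter = 50, tol = 1e-6){
  # Global approximation: sum of the site canonical parameters
  Q <- Reduce(`+`, Q_sites)
  r <- Reduce(`+`, r_sites)
  for(b in seq_len(maxiter)){
    Q_old <- Q
    for(j in seq_along(Q_sites)){
      upd <- ep_site_update(Q, r, Q_sites[[j]], r_sites[[j]], moments_hybrid, j)
      Q_sites[[j]] <- upd$Q_j
      r_sites[[j]] <- upd$r_j
      Q <- upd$Q
      r <- upd$r
    }
    # Fixed-point algorithm: stop when the parameters no longer change
    if(max(abs(Q - Q_old)) < tol) break
  }
  list(Q = Q, r = r, Sigma = solve(Q), mu = c(solve(Q, r)))
}
```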
We can monitor convergence of EP by looking at changes in the parameters: it is a fixed-point algorithm. However, EP comes with no general convergence guarantee and may diverge; Dehaene and Barthelmé (2018) explain some of these failures heuristically by drawing an analogy with Newton’s method, which only converges in a neighbourhood of a fixed point and can otherwise diverge. A common remedy is to update the global approximation with damping, replacing the proposed canonical parameters by a convex combination of the old and new values. It is also useful to perform the updates using suitable numerical linear algebra routines, for example by computing the Cholesky root of the covariance and back-solving to get the precision.
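A damped update is simply a convex combination of the current and proposed canonical parameters; the values below are arbitrary toy numbers.

```r
delta <- 0.5                                  # damping factor in (0, 1]; delta = 1 is undamped
Q_curr <- diag(2); r_curr <- c(0, 0)          # current global canonical parameters (toy values)
Q_prop <- matrix(c(2, 0.3, 0.3, 1.5), 2, 2)   # proposed update (toy values)
r_prop <- c(0.4, -0.1)
Q_damped <- (1 - delta) * Q_curr + delta * Q_prop
r_damped <- (1 - delta) * r_curr + delta * r_prop
```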
The expensive steps are the conversions from moments to canonical parameters: the matrix inversions have complexity \(\mathrm{O}(np^3)\) for a single pass over the sites, but the algorithm typically converges quickly. The other bottleneck is the calculation of the moments.
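For the moment-to-canonical conversion, one can avoid a generic matrix inverse by computing the Cholesky root of the covariance and back-solving, as mentioned above; a small illustration with an arbitrary covariance matrix:

```r
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)   # arbitrary covariance matrix
R <- chol(Sigma)                           # upper triangular root, Sigma = t(R) %*% R
Rinv <- backsolve(R, diag(2))              # R^{-1} by back-substitution
Q <- Rinv %*% t(Rinv)                      # precision matrix Sigma^{-1}
max(abs(Q - solve(Sigma)))                 # agreement with a direct inverse
```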