Bayesian modelling

Final review

Léo Belzile

Last compiled Tuesday Apr 15, 2025

Fundamentals

  • Bayesian inference is likelihood-based.
  • It complements the likelihood \(p(\boldsymbol{y} \mid \boldsymbol{\theta})\) with a prior \(p(\boldsymbol{\theta})\).
  • Provided that \(p(\boldsymbol{\theta}, \boldsymbol{y})\) is integrable, we get \[\begin{align*} p(\boldsymbol{\theta} \mid \boldsymbol{y}) \stackrel{\boldsymbol{\theta}}{\propto} p(\boldsymbol{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta}). \end{align*}\]
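
A minimal sketch of this relation, assuming (purely for illustration) a normal data model with known unit variance and a normal prior: evaluate the unnormalized posterior on a grid and normalize it numerically.

```python
import numpy as np
from scipy import stats

# Hypothetical setting: y_i ~ N(theta, 1) with prior theta ~ N(0, 2^2).
y = np.array([1.2, 0.8, 1.5, 0.3, 1.1])
theta = np.linspace(-3, 5, 2001)                      # grid over the parameter

log_lik = stats.norm.logpdf(y[:, None], loc=theta, scale=1.0).sum(axis=0)
log_prior = stats.norm.logpdf(theta, loc=0.0, scale=2.0)
log_unnorm = log_lik + log_prior                      # log p(y | theta) + log p(theta)

# Normalizing turns the right-hand side into a valid density in theta.
unnorm = np.exp(log_unnorm - log_unnorm.max())        # subtract max to avoid overflow
post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))  # approximates p(theta | y) on the grid
```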

Marginal likelihood

The normalizing constant \[\begin{align*} p(\boldsymbol{y}) = \int_{\boldsymbol{\Theta}} p(\boldsymbol{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta} \end{align*}\] that makes the posterior a valid density is termed the marginal likelihood.

Marginal likelihood

Moments of the posterior depend on \(p(\boldsymbol{y})\).

The marginal likelihood is hard to compute because \(\boldsymbol{\Theta} \subseteq \mathbb{R}^p\), so the integral is often high-dimensional.

  • Monte Carlo integration from the prior does not typically work well, because the prior need not align with the likelihood (see the sketch after this list).
  • The performance of numerical integration degrades as \(p\) grows, and naive computation is prone to numerical overflow.
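
A sketch of why naive Monte Carlo integration struggles, assuming (purely for illustration) normal data centred far from a standard normal prior: most prior draws land where the likelihood is negligible, so the estimator has very high variance.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(2025)
y = rng.normal(loc=4.0, scale=1.0, size=50)            # hypothetical data far from the prior

# Naive Monte Carlo: p(y) is approximated by the average of p(y | theta_s)
# over draws theta_s from the prior.
theta_s = rng.normal(loc=0.0, scale=1.0, size=10_000)  # draws from a N(0, 1) prior
log_lik = stats.norm.logpdf(y[:, None], loc=theta_s, scale=1.0).sum(axis=0)
log_marg = logsumexp(log_lik) - np.log(theta_s.size)   # log of the Monte Carlo estimate

# A handful of draws dominate the average, so the estimate is unstable.
```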

Bayes factors

The \(\color{#6e948c}{\text{Bayes factor}}\) is the ratio of marginal likelihoods, \(\mathsf{BF}_{ij} = p(\boldsymbol{y} \mid \mathcal{M}_i)/p(\boldsymbol{y} \mid \mathcal{M}_j)\), where \[\begin{align*} p(\boldsymbol{y} \mid \mathcal{M}_i) = \int p(\boldsymbol{y} \mid \boldsymbol{\theta}^{(i)}, \mathcal{M}_i) p( \boldsymbol{\theta}^{(i)} \mid \mathcal{M}_i) \mathrm{d} \boldsymbol{\theta}^{(i)}. \end{align*}\] Values of \(\mathsf{BF}_{ij}>1\) correspond to model \(\mathcal{M}_i\) being more likely than \(\mathcal{M}_j\); a closed-form example follows the list below.

  • Strong dependence on the prior \(p(\boldsymbol{\theta}^{(i)} \mid \mathcal{M}_i)\).
  • Must use proper priors.
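
A minimal worked example in a setting where both marginal likelihoods are available in closed form; the binomial data, the beta prior for \(\mathcal{M}_1\) and the point null for \(\mathcal{M}_2\) are all hypothetical choices.

```python
import numpy as np
from scipy.special import betaln, comb

n, k = 30, 21                    # hypothetical binomial data: k successes in n trials

# M1: theta ~ Beta(a, b); the marginal likelihood is beta-binomial (closed form).
a, b = 1.0, 1.0
log_m1 = np.log(comb(n, k)) + betaln(k + a, n - k + b) - betaln(a, b)

# M2: point null theta = 1/2.
log_m2 = np.log(comb(n, k)) + n * np.log(0.5)

bf_12 = np.exp(log_m1 - log_m2)  # BF_12 > 1 favours M1 over M2
```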

Predictive distributions

Define the \(\color{#D55E00}{\text{posterior predictive}}\), \[\begin{align*} p(y_{\text{new}}\mid \boldsymbol{y}) = \int_{\boldsymbol{\Theta}} p(y_{\text{new}} \mid \boldsymbol{\theta}) \color{#D55E00}{p(\boldsymbol{\theta} \mid \boldsymbol{y})} \mathrm{d} \boldsymbol{\theta} \end{align*}\]
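
A minimal sketch of sampling from the posterior predictive, assuming we already have posterior draws of a normal mean with known unit variance (the draws below are simulated stand-ins):

```python
import numpy as np

rng = np.random.default_rng(123)

# Stand-in posterior draws of the mean (in practice these come from MCMC).
theta_post = rng.normal(loc=1.0, scale=0.2, size=5_000)

# One predictive draw per posterior draw integrates over parameter uncertainty.
y_new = rng.normal(loc=theta_post, scale=1.0)

# Posterior predictive summaries, e.g., a 90% prediction interval.
lo, hi = np.quantile(y_new, [0.05, 0.95])
```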

Bayesian inference

If we have samples from \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\) or an approximation of the joint/marginals, then we can

  • use the posterior distribution to answer any question that is a function of \(\boldsymbol{\theta}\) alone.
  • use the posterior predictive \(p(y_{\text{new}}\mid \boldsymbol{y})\) for prediction or forecasting, and checks of model adequacy.

Point estimators and credible regions

The interpretation differs from the frequentist one, but the methods are similar:

  • point estimators (MAP, posterior mean and median, etc.) minimize the expected posterior loss for a chosen loss function and return a summary of the posterior (see the sketch after this list).
  • credible intervals or regions (regions in which the parameter lies with a given posterior probability).
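
A sketch of common summaries computed from posterior draws; the gamma draws stand in for samples from a posterior, and the 89% level is an arbitrary choice.

```python
import numpy as np

def equal_tailed(draws, level=0.89):
    """Central (equal-tailed) credible interval from posterior draws."""
    alpha = 1 - level
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])

def hpd(draws, level=0.89):
    """Highest posterior density interval: shortest interval with the given coverage."""
    x = np.sort(draws)
    m = int(np.ceil(level * x.size))
    widths = x[m - 1:] - x[:x.size - m + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + m - 1]

rng = np.random.default_rng(0)
draws = rng.gamma(shape=2.0, scale=1.5, size=10_000)   # stand-in posterior draws
post_mean, post_median = draws.mean(), np.median(draws)
ci_equal, ci_hpd = equal_tailed(draws), hpd(draws)
```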

Stochastic approximations

Stochastic approximations rely on sampling methods (rejection sampling, MCMC).

  • They return (correlated) posterior samples.
  • The Metropolis–Hastings acceptance ratio bypasses the marginal likelihood calculation (see the sketch after this list).
  • Marginalization is straightforward.
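
A minimal random walk Metropolis sketch for a normal mean with a vague normal prior (all settings hypothetical); note that the acceptance step only ever evaluates the unnormalized log posterior.

```python
import numpy as np
from scipy import stats

def log_post_unnorm(theta, y):
    """Unnormalized log posterior: log-likelihood plus log-prior, no p(y) needed."""
    return (stats.norm.logpdf(y, loc=theta, scale=1.0).sum()
            + stats.norm.logpdf(theta, loc=0.0, scale=10.0))

def rw_metropolis(y, n_iter=10_000, step=0.5, init=0.0, seed=1):
    rng = np.random.default_rng(seed)
    draws = np.empty(n_iter)
    cur, cur_lp = init, log_post_unnorm(init, y)
    for i in range(n_iter):
        prop = cur + step * rng.standard_normal()       # symmetric random walk proposal
        prop_lp = log_post_unnorm(prop, y)
        if np.log(rng.uniform()) < prop_lp - cur_lp:    # Metropolis acceptance ratio
            cur, cur_lp = prop, prop_lp
        draws[i] = cur
    return draws

y = np.random.default_rng(7).normal(loc=2.0, scale=1.0, size=40)
draws = rw_metropolis(y)
```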

Markov chains

  • Need to assess convergence to the stationary distribution (traceplots)
  • Autocorrelation reduces precision of Monte Carlo estimates (effective sample size)
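
A crude effective sample size estimate based on the leading positive autocorrelations (a simplified stand-in for the estimators used by MCMC software), applicable to draws such as those from the sampler sketch above.

```python
import numpy as np

def effective_sample_size(draws, max_lag=200):
    """Approximate ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    x = np.asarray(draws) - np.mean(draws)
    n = x.size
    acf = np.array([x[: n - k] @ x[k:] / (x @ x) for k in range(1, max_lag)])
    if np.any(acf < 0):                      # truncate at the first negative autocorrelation
        acf = acf[: int(np.argmax(acf < 0))]
    return n / (1 + 2 * acf.sum())
```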

Markov chain Monte Carlo algorithms

We covered the following in class (in increasing order of sampling efficiency).

  • random walk Metropolis
  • Metropolis-adjusted Langevin algorithm (MALA)
  • Hamiltonian Monte Carlo

The latter two achieve better sampling performance, but they require gradients of the log posterior and are more expensive per iteration.
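
A sketch of a single MALA update for a scalar parameter, assuming user-supplied `log_post` and `grad_log_post` functions (hypothetical names); the gradient shifts the proposal towards higher posterior density, and the Metropolis correction accounts for the asymmetric proposal.

```python
import numpy as np

def mala_step(cur, log_post, grad_log_post, eps, rng):
    """One Metropolis-adjusted Langevin step for a scalar parameter."""
    mean_fwd = cur + 0.5 * eps**2 * grad_log_post(cur)   # gradient-informed proposal mean
    prop = mean_fwd + eps * rng.standard_normal()
    mean_bwd = prop + 0.5 * eps**2 * grad_log_post(prop)
    # log q(cur | prop) - log q(prop | cur) for the asymmetric Gaussian proposal
    log_q_ratio = (-(cur - mean_bwd) ** 2 + (prop - mean_fwd) ** 2) / (2 * eps**2)
    log_alpha = log_post(prop) - log_post(cur) + log_q_ratio
    return prop if np.log(rng.uniform()) < log_alpha else cur
```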

Model selection

  • The Bernstein–von Mises theorem ensures convergence of the posterior, in total variation, to a Gaussian distribution under weak conditions.
  • Distinguish between
    • \(\mathcal{M}\)-closed: the true data-generating model is among those considered, or
    • \(\mathcal{M}\)-open: only misspecified models are considered.
  • The model that gets selected minimizes the Kullback–Leibler divergence from the truth (definition recalled after this list).
  • In discrete parameter settings, we recover the truth with probability 1.
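
For reference, the Kullback–Leibler divergence between the true density \(g\) and a model density \(f\) is \[\begin{align*} \mathsf{KL}(g \,\|\, f) = \int g(\boldsymbol{y}) \log \frac{g(\boldsymbol{y})}{f(\boldsymbol{y})} \mathrm{d} \boldsymbol{y}. \end{align*}\]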

Priors

  • In large samples, priors on the data layer don’t matter much, as the log likelihood contribution is \(\mathrm{O}(n)\) versus \(\mathrm{O}(1)\) for the prior.
  • Support constraints imposed by the prior have an impact regardless of sample size.
  • The impact of a prior depends largely on how far it is from what the data support.
  • Prior sensitivity check: compare the posterior and prior densities (see the sketch below).
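
A sketch of such a comparison in a hypothetical conjugate Poisson–gamma setting, overlaying the prior and posterior densities of the rate:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical Poisson counts with a Gamma(2, 1) prior on the rate (shape, rate).
y = np.array([3, 5, 2, 4, 6, 3, 4, 5])
a, b = 2.0, 1.0
a_post, b_post = a + y.sum(), b + y.size               # conjugate gamma update

lam = np.linspace(0, 10, 500)
plt.plot(lam, stats.gamma.pdf(lam, a, scale=1 / b), label="prior")
plt.plot(lam, stats.gamma.pdf(lam, a_post, scale=1 / b_post), label="posterior")
plt.xlabel("rate")
plt.legend()
plt.show()
```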

Type of priors

  • Priors play different roles (expert opinion, simplification of calculations, regularization).
  • Conditional conjugacy is mostly useful for Gibbs sampling (see the sketch after this list).
  • Be careful with improper priors (unless they are known to yield a proper posterior).
  • Prefer weakly informative priors to nearly improper priors.
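
A sketch of a Gibbs sampler exploiting conditional conjugacy for a normal model with unknown mean and variance; the normal and inverse-gamma prior hyperparameters below are hypothetical choices.

```python
import numpy as np

def gibbs_normal(y, n_iter=5_000, mu0=0.0, tau0=100.0, a0=2.0, b0=2.0, seed=3):
    """Gibbs sampler for y_i ~ N(mu, sigma^2) with mu ~ N(mu0, tau0)
    and sigma^2 ~ Inverse-Gamma(a0, b0) (conditionally conjugate priors)."""
    rng = np.random.default_rng(seed)
    n, ybar = y.size, y.mean()
    mu, sig2 = ybar, y.var()
    out = np.empty((n_iter, 2))
    for i in range(n_iter):
        # mu | sigma^2, y is normal
        prec = n / sig2 + 1 / tau0
        mu = rng.normal((n * ybar / sig2 + mu0 / tau0) / prec, np.sqrt(1 / prec))
        # sigma^2 | mu, y is inverse-gamma
        a_n, b_n = a0 + n / 2, b0 + 0.5 * np.sum((y - mu) ** 2)
        sig2 = 1 / rng.gamma(a_n, 1 / b_n)
        out[i] = mu, sig2
    return out
```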

Prior selection

  • Moment matching
  • Prior predictive distribution: draw parameters from the prior, then new observations from the likelihood, and plot them (see the sketch below).
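
A sketch of a prior predictive check in a hypothetical Poisson–gamma setting: draw rates from the prior, then data from the likelihood, and inspect whether the implied observations are plausible.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw parameters from the prior, then observations from the likelihood.
lam = rng.gamma(shape=2.0, scale=2.0, size=1_000)   # prior draws of the Poisson rate
y_rep = rng.poisson(lam=lam)                        # one simulated observation per prior draw

# Check plausibility of the implied data before fitting the model.
print(np.quantile(y_rep, [0.05, 0.5, 0.95]))
```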