Bayesian modelling
Final review
Last compiled Tuesday Apr 15, 2025
Fundamentals
- Bayesian inference builds on likelihood-based inference.
- It complements the likelihood \(p(\boldsymbol{y} \mid \boldsymbol{\theta})\) with a prior \(p(\boldsymbol{\theta})\).
- Provided that \(p(\boldsymbol{\theta}, \boldsymbol{y})\) is integrable, we get \[\begin{align*}
p(\boldsymbol{\theta} \mid \boldsymbol{y}) \stackrel{\boldsymbol{\theta}}{\propto} p(\boldsymbol{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta}).
\end{align*}\]
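A minimal Python sketch (not from the course; a hypothetical Beta-Bernoulli example with made-up data) illustrating the proportionality: the likelihood times the prior, renormalized numerically over a grid, recovers the known conjugate posterior.

```python
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # made-up Bernoulli observations
a, b = 2.0, 2.0                           # Beta(a, b) prior hyperparameters

theta = np.linspace(0.001, 0.999, 999)
# unnormalized posterior: likelihood x prior evaluated on a grid of theta values
unnorm = stats.bernoulli.pmf(y[:, None], theta).prod(axis=0) * stats.beta.pdf(theta, a, b)
post_grid = unnorm / (unnorm.sum() * (theta[1] - theta[0]))   # renormalize (Riemann sum)
# conjugate result: Beta(a + successes, b + failures)
post_exact = stats.beta.pdf(theta, a + y.sum(), b + len(y) - y.sum())
print(np.abs(post_grid - post_exact).max())                   # small discretization error
```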
Marginal likelihood
The normalizing constant \[\begin{align*}
p(\boldsymbol{y}) = \int_{\boldsymbol{\Theta}} p(\boldsymbol{y} \mid \boldsymbol{\theta})p(\boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta}
\end{align*}\] that makes the posterior a valid density is termed the marginal likelihood.
Marginal likelihood
Moments of the posterior depend on \(p(\boldsymbol{y})\).
It is hard to compute because \(\boldsymbol{\Theta} \subseteq \mathbb{R}^p\), so the integral is often high-dimensional.
- Monte Carlo integration (does not typically work because prior need not align with likelihood)
- Numerical integration: performance degrades with \(p\), and numerical overflow can occur (both approaches are compared in one dimension in the sketch below).
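A minimal Python sketch (hypothetical, not from the course; the data and prior are made up): a Gaussian model with known variance and a conjugate Gaussian prior, where \(p(\boldsymbol{y})\) is available in closed form, compared against one-dimensional quadrature and naive Monte Carlo over prior draws.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(1)
y = rng.normal(loc=0.8, scale=1.0, size=20)   # made-up data, y_i ~ N(theta, 1), theta ~ N(0, 1)

def integrand(theta):
    # likelihood x prior, the integrand of the marginal likelihood
    return np.exp(stats.norm.logpdf(y, theta, 1.0).sum()) * stats.norm.pdf(theta, 0.0, 1.0)

# closed form: marginally y ~ N(0, I + 11^T) under this conjugate model
exact = stats.multivariate_normal.pdf(y, mean=np.zeros(20), cov=np.eye(20) + np.ones((20, 20)))
# one-dimensional quadrature over theta
quadrature, _ = quad(integrand, -10, 10)
# naive Monte Carlo: average the likelihood over draws from the prior
theta_prior = rng.normal(0.0, 1.0, size=5000)
mc = np.mean([np.exp(stats.norm.logpdf(y, t, 1.0).sum()) for t in theta_prior])

print(exact, quadrature, mc)
```

The naive Monte Carlo estimate is typically the noisiest of the three, since draws from the prior need not fall where the likelihood is large; only in one dimension do the other two remain practical.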
Bayes factors
The \(\color{#6e948c}{\text{Bayes factor}}\) is the ratio of marginal likelihoods, \(\mathsf{BF}_{ij} = p(\boldsymbol{y} \mid \mathcal{M}_i)/p(\boldsymbol{y} \mid \mathcal{M}_j)\), where \[\begin{align*}
p(\boldsymbol{y} \mid \mathcal{M}_i) = \int p(\boldsymbol{y} \mid \boldsymbol{\theta}^{(i)}, \mathcal{M}_i) p( \boldsymbol{\theta}^{(i)} \mid \mathcal{M}_i) \mathrm{d} \boldsymbol{\theta}^{(i)}.
\end{align*}\] Values of \(\mathsf{BF}_{ij}>1\) correspond to model \(\mathcal{M}_i\) being more likely than \(\mathcal{M}_j\) (a closed-form example follows the list below).
- Strong dependence on the prior \(p(\boldsymbol{\theta}^{(i)} \mid \mathcal{M}_i)\).
- Must use proper priors.
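A minimal Python sketch (hypothetical binomial example with made-up counts): the marginal likelihood under a Beta prior is beta-binomial in closed form, so the Bayes factor between two models that differ only in their prior can be computed exactly, and it visibly depends on that prior choice.

```python
import numpy as np
from scipy.special import betaln, comb

n, k = 30, 21                     # 21 successes out of 30 trials (made up)

def log_marginal(a, b):
    # log p(y | M) for y ~ Binomial(n, theta), theta ~ Beta(a, b)
    return np.log(comb(n, k)) + betaln(a + k, b + n - k) - betaln(a, b)

# M1: uniform prior Beta(1, 1); M2: prior concentrated near 0.5, Beta(50, 50)
log_bf_12 = log_marginal(1.0, 1.0) - log_marginal(50.0, 50.0)
print(np.exp(log_bf_12))          # BF_12 > 1 favours M1 over M2
```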
Predictive distributions
Define the \(\color{#D55E00}{\text{posterior predictive}}\), \[\begin{align*}
p(y_{\text{new}}\mid \boldsymbol{y}) = \int_{\boldsymbol{\Theta}} p(y_{\text{new}} \mid \boldsymbol{\theta}) \color{#D55E00}{p(\boldsymbol{\theta} \mid \boldsymbol{y})} \mathrm{d} \boldsymbol{\theta}
\end{align*}\]
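A minimal Python sketch (hypothetical conjugate Beta-Bernoulli example, made-up data) of sampling from the posterior predictive: draw \(\boldsymbol{\theta}\) from the posterior, then \(y_{\text{new}}\) from the likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.7, size=50)                   # made-up data
a, b = 1.0, 1.0                                     # Beta prior

# posterior draws from the conjugate Beta posterior
theta_post = rng.beta(a + y.sum(), b + len(y) - y.sum(), size=10_000)
y_new = rng.binomial(1, theta_post)                 # one predictive draw per posterior draw
print(y_new.mean())                                 # Monte Carlo estimate of Pr(y_new = 1 | y)
```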
Bayesian inference
If we have samples from \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\) or an approximation of the joint/marginals, then we can
- use the posterior distribution to answer any question that is a function of \(\boldsymbol{\theta}\) alone.
- use the posterior predictive \(p(y_{\text{new}}\mid \boldsymbol{y})\) for prediction or forecasting, and checks of model adequacy.
Point estimators and credible regions
Interpretation differs from the frequentist one, but the methods are similar:
- point estimators (MAP, posterior mean and median, etc.) derive from loss functions and return a summary of the posterior.
- credible intervals or regions (regions in which the parameter lies with a given posterior probability); both are computed from draws in the sketch below.
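A minimal Python sketch (made-up posterior draws standing in for MCMC output): point estimates and an equal-tailed credible interval computed directly from samples.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_post = rng.beta(36.0, 16.0, size=50_000)      # stand-in posterior draws

post_mean = theta_post.mean()                       # minimizes squared loss
post_median = np.median(theta_post)                 # minimizes absolute loss
cred_int = np.quantile(theta_post, [0.025, 0.975])  # equal-tailed 95% credible interval
print(post_mean, post_median, cred_int)
```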
Stochastic approximations
Stochastic approximations rely on sampling methods (rejection sampling, MCMC).
- They return (correlated) posterior samples.
- The Metropolis–Hastings acceptance ratio bypasses the marginal likelihood calculation (see the sketch after this list).
- Marginalization is straightforward.
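A minimal Python sketch (hypothetical one-dimensional Gaussian example with made-up data) of random walk Metropolis: the acceptance step uses only the unnormalized posterior, so \(p(\boldsymbol{y})\) never appears.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(1.5, 1.0, size=40)                      # made-up data

def log_post(theta):
    # log of likelihood x prior (unnormalized posterior)
    return stats.norm.logpdf(y, theta, 1.0).sum() + stats.norm.logpdf(theta, 0.0, 10.0)

theta, chain = 0.0, []
for _ in range(5000):
    prop = theta + 0.5 * rng.standard_normal()         # symmetric random walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                                   # accept; otherwise keep current value
    chain.append(theta)
print(np.mean(chain[1000:]))                           # posterior mean after burn-in
```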
Markov chains
- Need to assess convergence to the stationary distribution (traceplots)
- Autocorrelation reduces the precision of Monte Carlo estimates (effective sample size); a simple estimator is sketched below.
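A minimal Python sketch (hypothetical AR(1) chain standing in for MCMC output) of a simple effective sample size estimator, \(n / (1 + 2\sum_k \rho_k)\), with the sum truncated at the first non-positive autocorrelation. This is a rough version; software packages use more careful truncation rules.

```python
import numpy as np

rng = np.random.default_rng(5)
n, rho = 20_000, 0.9
chain = np.zeros(n)
for t in range(1, n):                                  # AR(1) chain mimicking correlated MCMC output
    chain[t] = rho * chain[t - 1] + rng.standard_normal()

def eff_sample_size(x, max_lag=1000):
    x = x - x.mean()
    # empirical autocorrelations at lags 1, ..., max_lag - 1
    acf = [np.dot(x[:-k], x[k:]) / np.dot(x, x) for k in range(1, max_lag)]
    s = 0.0
    for r in acf:
        if r <= 0:                                     # truncate at first non-positive lag
            break
        s += r
    return len(x) / (1 + 2 * s)

print(eff_sample_size(chain))                          # roughly n * (1 - rho) / (1 + rho)
```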
Markov chain Monte Carlo algorithms
We covered the following in class (in increasing order of sampling efficiency).
- random walk Metropolis
- Metropolis-adjusted Langevin algorithm (MALA)
- Hamiltonian Monte Carlo
The latter two have better sampling performance, but require gradients of the log posterior and are more expensive per iteration (a MALA sketch follows).
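A minimal Python sketch (hypothetical one-dimensional Gaussian example; data and step size are made up) of one way to implement MALA: the proposal drifts along the gradient of the log posterior, and the acceptance ratio corrects for the proposal's asymmetry.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(1.5, 1.0, size=40)                      # made-up data
eps = 0.2                                              # step size

def log_post(theta):
    return stats.norm.logpdf(y, theta, 1.0).sum() + stats.norm.logpdf(theta, 0.0, 10.0)

def grad_log_post(theta):
    # gradient of the Gaussian log likelihood plus Gaussian log prior
    return (y - theta).sum() / 1.0**2 - theta / 10.0**2

def log_q(to, frm):
    # log density of the Langevin proposal frm -> to
    mean = frm + 0.5 * eps**2 * grad_log_post(frm)
    return stats.norm.logpdf(to, mean, eps)

theta, chain = 0.0, []
for _ in range(5000):
    prop = theta + 0.5 * eps**2 * grad_log_post(theta) + eps * rng.standard_normal()
    log_ratio = log_post(prop) - log_post(theta) + log_q(theta, prop) - log_q(prop, theta)
    if np.log(rng.uniform()) < log_ratio:
        theta = prop
    chain.append(theta)
print(np.mean(chain[1000:]))                           # posterior mean after burn-in
```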
Model selection
- The Bernstein–von Mises theorem ensures convergence in total variation of the posterior to a Gaussian under weak conditions.
- Distinguish between
- \(\mathcal{M}\)-closed: the true data-generating model is part of the set considered, or
- \(\mathcal{M}\)-open: only misspecified models are considered.
- The model that gets selected minimizes the Kullback–Leibler divergence with the truth.
- In discrete parameter settings, we recover the truth with probability 1 (see the sketch below).
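A minimal Python sketch (hypothetical, made-up example): two fully specified Bernoulli models with no free parameters, data generated from one of them (an \(\mathcal{M}\)-closed setting). With equal prior model probabilities, the posterior probability of the model closest to the truth in Kullback–Leibler divergence, here the true model, tends to 1 as \(n\) grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
for n in (10, 100, 1000):
    y = rng.binomial(1, 0.7, size=n)                    # truth: Bernoulli(0.7)
    log_m1 = stats.bernoulli.logpmf(y, 0.7).sum()       # model 1: theta = 0.7
    log_m2 = stats.bernoulli.logpmf(y, 0.5).sum()       # model 2: theta = 0.5
    post_m1 = 1 / (1 + np.exp(log_m2 - log_m1))         # posterior prob. of M1, equal prior odds
    print(n, post_m1)
```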
Priors
- Priors on the data layer don’t matter in large samples, as the log-likelihood contribution is \(\mathrm{O}(n)\) versus \(\mathrm{O}(1)\) for the log prior.
- Support constraints have an impact
- Their impact depends largely on how far they are from the data.
- Prior sensitivity check: compare the posterior and prior densities (as in the sketch below)
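A minimal Python sketch (hypothetical Beta-Bernoulli example with made-up counts) of this check: overlay the prior and posterior densities; a large discrepancy indicates the data, rather than the prior, drive the posterior.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

y = np.repeat([1, 0], [34, 16])                        # 34 successes, 16 failures (made up)
a, b = 2.0, 2.0                                        # Beta prior

theta = np.linspace(0, 1, 500)
plt.plot(theta, stats.beta.pdf(theta, a, b), label="prior")
plt.plot(theta, stats.beta.pdf(theta, a + 34, b + 16), label="posterior")
plt.legend()
plt.show()
```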
Types of priors
- Different roles (expert opinion, simplification of calculations, regularization).
- Conditional conjugacy is mostly useful for Gibbs sampling and the like (a Gibbs sketch follows this list).
- Be careful with improper priors (unless they are known to yield a valid posterior).
- Prefer weakly informative priors to nearly improper ones.
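A minimal Python sketch (hypothetical Gaussian example; data and prior hyperparameters are made up) of Gibbs sampling under conditional conjugacy: with a Gaussian prior on \(\mu\) and an inverse-gamma prior on \(\sigma^2\), each full conditional can be sampled exactly.

```python
import numpy as np

rng = np.random.default_rng(8)
y = rng.normal(2.0, 1.5, size=100)                     # made-up data
n = len(y)
m0, s0 = 0.0, 10.0                                     # Normal prior on mu
a0, b0 = 2.0, 2.0                                      # inverse-gamma prior on sigma^2

mu, sig2 = 0.0, 1.0
draws = []
for _ in range(5000):
    # mu | sigma^2, y is Gaussian (conjugate conditional)
    prec = 1 / s0**2 + n / sig2
    mean = (m0 / s0**2 + y.sum() / sig2) / prec
    mu = rng.normal(mean, np.sqrt(1 / prec))
    # sigma^2 | mu, y is inverse gamma (conjugate conditional)
    sig2 = 1 / rng.gamma(a0 + n / 2, 1 / (b0 + 0.5 * ((y - mu) ** 2).sum()))
    draws.append((mu, sig2))
print(np.mean(draws[1000:], axis=0))                   # posterior means after burn-in
```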
Prior selection
- Moment matching
- Prior predictive distribution: draw parameters from the prior, then new observations from the likelihood, and plot them (both bullets are combined in the sketch below)
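A minimal Python sketch (hypothetical Poisson example; the elicited moments are made up) combining both bullets: the Gamma prior on the rate is moment matched to an elicited mean and standard deviation, then prior predictive counts are simulated and plotted to check that the prior implies plausible data.

```python
import numpy as np
import matplotlib.pyplot as plt

elicited_mean, elicited_sd = 10.0, 5.0                 # elicited prior moments (made up)
shape = elicited_mean**2 / elicited_sd**2              # moment matching for Gamma(shape, scale)
scale = elicited_sd**2 / elicited_mean

rng = np.random.default_rng(7)
lam = rng.gamma(shape, scale, size=10_000)             # rates drawn from the moment-matched prior
y_prior_pred = rng.poisson(lam)                        # counts drawn from the Poisson likelihood

plt.hist(y_prior_pred, bins=50)
plt.xlabel("simulated counts from the prior predictive")
plt.show()
```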