Hamiltonian Monte Carlo and probabilistic programming
Last compiled Thursday Apr 3, 2025
This material is drawn from
Check out these animations by Chi Feng
We are interested in calculating expectations of some function \(g\) against the posterior. \[\begin{align*} \int_{\mathbb{R}^d} g(\boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \boldsymbol{y}) \mathrm{d} \boldsymbol{\theta}. \end{align*}\]
The integral is determined by the product of the "volume" of \(g(\cdot)\) and the density.
As the dimension of the posterior, \(d\), grows, the mass concentrates in a small region, the so-called typical set. The number of regions/directions to consider increases exponentially in \(d\).
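As a quick numerical illustration (a sketch, not part of the original notes), consider draws from a \(d\)-dimensional standard Gaussian: virtually all of the mass lies on a thin shell of radius about \(\sqrt{d}\), far from the mode at the origin.

```python
import numpy as np

# Illustration: the norm of d-dimensional standard Gaussian draws concentrates
# around sqrt(d), so the typical set is a thin shell away from the mode.
rng = np.random.default_rng(seed=2025)
for d in (1, 10, 100, 1000):
    radii = np.linalg.norm(rng.standard_normal(size=(10_000, d)), axis=1)
    print(f"d = {d:4d}: mean radius {radii.mean():6.2f} (sd {radii.std():.2f}), sqrt(d) = {np.sqrt(d):6.2f}")
```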
If we start at the stationary distribution, most proposals from a random walk Metropolis will fall outside of the typical set and get rejected.
This phenomenon also explains the decrease in performance of numerical integration schemes (quadrature).
For differentiable targets, we saw that we can do better than random walk Metropolis–Hastings.
Hamiltonian Monte Carlo borrows ideas from Hamiltonian dynamics.
Consider the evolution over time of a particle characterized by a position \(\boldsymbol{\theta}(t)\) and a momentum vector \(\boldsymbol{s}(t)\).
Write the negative of the log of the joint density as \[H(\boldsymbol{\theta}, \boldsymbol{s}) = -\log p(\boldsymbol{s}) - \log p(\boldsymbol{\theta}) = U(\boldsymbol{\theta}) + K(\boldsymbol{s}),\] where \(U(\boldsymbol{\theta})\) is the potential energy and \(K(\boldsymbol{s})\) the kinetic energy.
The partial derivatives of the Hamiltonian give the evolution over time of the system: \[\begin{align*} \frac{\mathrm{d} \theta_j}{\mathrm{d} t} = \frac{\partial H}{\partial s_j} &= \frac{\partial K}{\partial s_j}\\ \frac{\mathrm{d} s_j}{\mathrm{d} t}= - \frac{\partial H}{\partial \theta_j} &= - \frac{\partial U}{\partial \theta_j}, \quad j =1, \ldots, d. \end{align*}\]
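For intuition, consider a standard textbook example (not part of the original notes): a univariate standard Gaussian target with unit mass, so \(U(\theta) = \theta^2/2\) and \(K(s) = s^2/2\). Hamilton's equations reduce to \[\begin{align*} \frac{\mathrm{d} \theta}{\mathrm{d} t} = s, \qquad \frac{\mathrm{d} s}{\mathrm{d} t} = -\theta, \end{align*}\] whose exact solution \(\theta(t) = r\cos(a + t)\), \(s(t) = -r\sin(a + t)\) is a rotation around the origin: the particle travels along a level set of the Hamiltonian, so the energy \(H(\theta, s) = (\theta^2 + s^2)/2 = r^2/2\) never changes.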
There is no explicit solution to these differential equations in most settings.
The most popular choice of kinetic energy is the Gaussian, \[\begin{align*} K(\boldsymbol{s}) = \frac{1}{2} \boldsymbol{s}^\top \mathbf{M}^{-1}\boldsymbol{s} \end{align*}\] the negative log density (up to a constant) of a mean-zero Gaussian with positive-definite covariance matrix \(\mathbf{M}.\)
Typically, we take \(\mathbf{M}=\mathrm{diag}\{m_1, \ldots, m_d\}\) diagonal, or proportional to the identity, \(\mathbf{M} = m \mathbf{I}_d\).
The mapping \(T_s\) from time \(t\) at \((\boldsymbol{\theta}(t), \boldsymbol{s}(t))\) to time \(t + \varepsilon\), \((\boldsymbol{\theta}(t + \varepsilon), \boldsymbol{s}(t + \varepsilon))\) satisfies the following properties: it is reversible, it conserves the Hamiltonian (the energy), and it preserves volume in \((\boldsymbol{\theta}, \boldsymbol{s})\) space.
Since there is generally no explicit solution to the Hamiltonian differential equations, we must move away from continuous time and discretize.
The leapfrog integrator performs a half step for the momentum, then a full step for the position using the updated momentum, and finally another half step for the momentum.
\[\begin{align*} s_j(t+\varepsilon/2) &= s_j(t) - \frac{\varepsilon}{2} \left.\frac{\partial U(\boldsymbol{\theta})}{\partial \theta_j}\right|_{\boldsymbol{\theta}(t)} \\ \theta_j(t+\varepsilon) &= \theta_j(t) + \varepsilon \frac{s_j(t+\varepsilon/2)}{m_j} \\ s_j(t+\varepsilon) &= s_j(t+\varepsilon/2) - \frac{\varepsilon}{2} \left.\frac{\partial U(\boldsymbol{\theta})}{\partial \theta_j}\right|_{\boldsymbol{\theta}(t + \varepsilon)} \end{align*}\]
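In code, one leapfrog step is only a few lines. The sketch below (Python/NumPy, with a hypothetical callable `grad_U` returning \(\partial U/\partial \boldsymbol{\theta}\) and a vector `m` of diagonal mass entries) is an illustration, not taken from the notes.

```python
import numpy as np

def leapfrog(theta, s, grad_U, eps, m):
    """One leapfrog step for position `theta` and momentum `s`.

    grad_U: callable returning the gradient of U(theta) = -log p(theta | y)
    eps:    step size
    m:      entries of the diagonal mass matrix M
    """
    s = s - 0.5 * eps * grad_U(theta)     # half step for the momentum
    theta = theta + eps * s / m           # full step for the position
    s = s - 0.5 * eps * grad_U(theta)     # half step for the momentum
    return theta, s
```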
Consider the joint distribution with positions \(\boldsymbol{\theta}\) and momentum variables \(\boldsymbol{s}\), \(p(\boldsymbol{\theta}, \boldsymbol{s}) \propto \exp \{- H(\boldsymbol{\theta}, \boldsymbol{s})\}.\)
We start with a position vector \(\boldsymbol{\theta}_{t-1}\) at step \(t-1\): draw a new momentum \(\boldsymbol{s} \sim \mathsf{Gauss}(\boldsymbol{0}_d, \mathbf{M})\), run \(L\) leapfrog steps of size \(\varepsilon\) from \((\boldsymbol{\theta}_{t-1}, \boldsymbol{s})\) to obtain a proposal \((\boldsymbol{\theta}^{\star}, \boldsymbol{s}^{\star})\), and accept \(\boldsymbol{\theta}_t = \boldsymbol{\theta}^{\star}\) with probability \(\min\{1, \exp(H(\boldsymbol{\theta}_{t-1}, \boldsymbol{s}) - H(\boldsymbol{\theta}^{\star}, \boldsymbol{s}^{\star}))\}\), keeping \(\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1}\) otherwise.
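Putting these steps together, a minimal HMC transition might look like the sketch below (Python/NumPy, with user-supplied `log_post` and `grad_log_post` functions and a diagonal mass vector `m`; names and tuning values are illustrative only).

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, eps, n_leapfrog, m, rng):
    """One HMC transition with Gaussian momentum and diagonal mass matrix diag(m)."""
    theta = np.asarray(theta, dtype=float)
    m = np.asarray(m, dtype=float)
    s = rng.normal(scale=np.sqrt(m))                      # refresh momentum, s ~ Gauss(0, M)
    H_curr = -log_post(theta) + 0.5 * np.sum(s**2 / m)    # H = U + K at the current state

    theta_new, s_new = theta.copy(), s.copy()
    for _ in range(n_leapfrog):                                 # L leapfrog steps of size eps
        s_new = s_new + 0.5 * eps * grad_log_post(theta_new)   # half step: ds/dt = -dU/dtheta
        theta_new = theta_new + eps * s_new / m                 # full step: dtheta/dt = s/m
        s_new = s_new + 0.5 * eps * grad_log_post(theta_new)   # half step for the momentum

    H_prop = -log_post(theta_new) + 0.5 * np.sum(s_new**2 / m)
    # Metropolis correction for the discretization error of the leapfrog integrator
    if np.log(rng.uniform()) < H_curr - H_prop:
        return theta_new, True
    return theta, False

# Example: sampling a bivariate standard Gaussian "posterior"
rng = np.random.default_rng(2025)
log_post = lambda th: -0.5 * th @ th
grad_log_post = lambda th: -th
theta, draws = np.zeros(2), []
for _ in range(2_000):
    theta, _ = hmc_step(theta, log_post, grad_log_post,
                        eps=0.1, n_leapfrog=20, m=np.ones(2), rng=rng)
    draws.append(theta)
```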
Hamiltonian Monte Carlo (HMC) has numerous tuning parameters: the step size \(\varepsilon\), the number of leapfrog steps \(L\), and the mass matrix \(\mathbf{M}\).
The Störmer–Verlet (leapfrog) integrator is a second order method, so for step size \(\varepsilon\), the local error is \(\mathrm{O}(\varepsilon^3)\) per step and the global error is \(\mathrm{O}(\varepsilon^2)\).
The leapfrog integrator updates one set of variables at a time; each update is a shear transformation and therefore preserves volume.
The leapfrog step size should be \(\mathrm{O}(d^{-1/4})\) to maintain a stable acceptance rate as the dimension grows (Beskos et al., 2013).
In practice, we use a Metropolis step to adjust for the discretization of the system.
In theory, the energy of the Hamiltonian should stay constant, but the numerical scheme leads to small perturbations (hence the rejection step).
We have seen that for differentiable posterior \(p(\boldsymbol{\theta} \mid \boldsymbol{y})\), using the gradient information can improve convergence by informing about the direction of the mode.
There are several languages and interfaces that implement probabilistic programming where the user has only to specify the likelihood and prior.
Historically, BUGS paved the way for practitioners.
It relies on Gibbs sampling (updating one parameter at a time), but is no longer actively developed. It remains the source of many exercises and the inspiration for the syntax of other implementations (e.g., Nimble, JAGS).
The programming language Stan is written in C++ and offers cross-platform interfaces.
Financial returns \(Y_t\) typically exhibit time-varying variability. The stochastic volatility model is a parameter-driven model that specifies \[\begin{align*} Y_t &= \exp(h_t/2) Z_t \\ h_t &= \gamma + \phi (h_{t-1} - \gamma) + \sigma U_t \end{align*}\] where \(U_t \stackrel{\mathrm{iid}}{\sim} \mathsf{Gauss}(0,1)\) and \(Z_t \stackrel{\mathrm{iid}}{\sim} \mathsf{Gauss}(0,1).\)
It is possible to introduce leverage by specifying \(\mathsf{Cor}(Z_t, U_t) = \rho.\)
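To make the data-generating process concrete, here is a small simulation sketch in Python (not from the notes; parameter values, the initialization from the stationary distribution, and the function name are illustrative assumptions). Leverage is obtained by correlating \(Z_t\) and \(U_t\).

```python
import numpy as np

def simulate_sv(n, gamma, phi, sigma, rho=0.0, rng=None):
    """Simulate returns y and log-volatilities h from the stochastic volatility model.

    rho: correlation between the observation noise Z_t and the innovation U_t (leverage).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw correlated standard Gaussian pairs (Z_t, U_t)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    Z, U = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=n).T
    h = np.empty(n)
    # Assumption: start h_1 from the stationary Gaussian distribution of the AR(1) process
    h[0] = gamma + sigma / np.sqrt(1 - phi**2) * U[0]
    for t in range(1, n):
        h[t] = gamma + phi * (h[t - 1] - gamma) + sigma * U[t]
    y = np.exp(h / 2) * Z
    return y, h

# Illustrative parameter values only
y, h = simulate_sv(n=1000, gamma=-1.0, phi=0.95, sigma=0.2, rho=-0.7,
                   rng=np.random.default_rng(1))
```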