Bayesics
2024
In frequentist statistics, “probability” is synonymous with the long-run frequency of an event under repeated sampling.
In the Bayesian paradigm, probability reflects incomplete information.
Quoting de Finetti (1974):
Probabilistic reasoning — always to be understood as subjective — merely stems from our being uncertain about something.
The joint density of data \(\boldsymbol{Y}\) and parameters \(\boldsymbol{\theta}\) is
\[\begin{align*} p(\boldsymbol{Y}, \boldsymbol{\theta}) = p(\boldsymbol{Y} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta}) = p(\boldsymbol{\theta} \mid \boldsymbol{Y}) p(\boldsymbol{Y}) \end{align*}\] where the marginal \(p(\boldsymbol{Y}) = \int_{\boldsymbol{\Theta}} p(\boldsymbol{Y}, \boldsymbol{\theta}) \mathrm{d} \boldsymbol{\theta}\).
Using Bayes’ theorem, the posterior density is
\[\begin{align*} \color{#D55E00}{p(\boldsymbol{\theta} \mid \boldsymbol{Y})} = \frac{\color{#0072B2}{p(\boldsymbol{Y} \mid \boldsymbol{\theta})} \times \color{#56B4E9}{p(\boldsymbol{\theta})}}{\color{#E69F00}{\int p(\boldsymbol{Y} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta})\mathrm{d} \boldsymbol{\theta}}}, \end{align*}\]
meaning that \[\color{#D55E00}{\text{posterior}} \propto \color{#0072B2}{\text{likelihood}} \times \color{#56B4E9}{\text{prior}}\]
By Bayes’ rule, we can update the posterior by adding terms to the likelihood, noting that for independent \(\boldsymbol{y}_1\) and \(\boldsymbol{y}_2\), \[\begin{align*} p(\boldsymbol{\theta} \mid \boldsymbol{y}_1, \boldsymbol{y}_2) \propto p(\boldsymbol{y}_2 \mid \boldsymbol{\theta}) p(\boldsymbol{\theta} \mid \boldsymbol{y}_1) \end{align*}\] The posterior is thus updated in light of new information.
A binomial variable with probability of success \(\theta \in [0,1]\) has mass function \[\begin{align*} f(y; \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y}, \qquad y = 0, \ldots, n. \end{align*}\] Moments of the number of successes out of \(n\) trials are \[\mathsf{E}(Y \mid \theta) = n \theta, \quad \mathsf{Va}(Y \mid \theta) = n \theta(1-\theta).\]
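A quick simulation check of these moments (a minimal sketch; the values of \(n\), \(\theta\) and the number of replications are arbitrary choices):

set.seed(2024)
n_trials <- 20
theta_true <- 0.3
draws <- rbinom(n = 1e5, size = n_trials, prob = theta_true)
mean(draws) # compare with n * theta
n_trials * theta_true
var(draws) # compare with n * theta * (1 - theta)
n_trials * theta_true * (1 - theta_true)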
The beta distribution with shapes \(\alpha>0\) and \(\beta>0\), denoted \(\mathsf{Be}(\alpha,\beta)\), has density \[f(y) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}y^{\alpha - 1}(1-y)^{\beta - 1}, \qquad y \in [0,1]\]
We write \(Y \sim \mathsf{Bin}(n, \theta)\) for \(\theta \in [0,1]\); the likelihood is \[L(\theta; y) = \binom{n}{y} \theta^y(1-\theta)^{n-y}.\]
Consider a beta prior, \(\theta \sim \mathsf{Be}(\alpha, \beta)\), with density \[ p(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta) }\theta^{\alpha-1}(1-\theta)^{\beta - 1}. \]
The binomial distribution is discrete with support \(0, \ldots, n\), whereas the likelihood is continuous over \(\theta \in [0,1]\).
Any term that is not a function of \(\theta\) can be dropped, since it will be absorbed by the normalizing constant. The posterior density is proportional to
\[\begin{align*} L(\theta; y)p(\theta) & \stackrel{\theta}{\propto} \theta^{y}(1-\theta)^{n-y} \times \theta^{\alpha-1} (1-\theta)^{\beta-1} \\& =\theta^{y + \alpha - 1}(1-\theta)^{n-y + \beta - 1} \end{align*}\] the kernel of a beta density with shape parameters \(y + \alpha\) and \(n-y + \beta\).
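We can verify the conjugacy result numerically (a sketch; the values \(y=6\), \(n=14\), \(\alpha=\beta=1.5\) match the code chunk further below):

# Normalized likelihood x prior should match the Be(y + alpha, n - y + beta) density
y <- 6; n <- 14; alpha <- beta <- 1.5
theta_grid <- seq(0.01, 0.99, by = 0.01)
unnorm <- dbinom(y, size = n, prob = theta_grid) * dbeta(theta_grid, alpha, beta)
nconst <- integrate(function(th) dbinom(y, size = n, prob = th) * dbeta(th, alpha, beta),
                    lower = 0, upper = 1)$value
max(abs(unnorm / nconst - dbeta(theta_grid, y + alpha, n - y + beta))) # numerically zero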
Consider the following sampling mechanisms, each of which leads to \(k\) successes out of \(n\) independent trials with the same probability of success \(\theta\).
Two likelihoods that are proportional, up to a constant not depending on unknown parameters, yield the same evidence.
In all cases, \(L(\theta; y) \stackrel{\theta}{\propto} \theta^k(1-\theta)^{n-k}\), so they yield the same inference for a Bayesian.
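For instance (an illustrative comparison, assuming binomial sampling with \(n\) fixed versus negative binomial sampling that stops at the \(k\)th success), the two likelihoods differ only by a multiplicative constant free of \(\theta\):

# Binomial (fixed n) versus negative binomial (stop at kth success) likelihoods
k <- 6; n <- 14
theta_grid <- seq(0.05, 0.95, by = 0.05)
lik_binom <- dbinom(k, size = n, prob = theta_grid)
lik_negbin <- dnbinom(n - k, size = k, prob = theta_grid) # n - k failures before kth success
range(lik_binom / lik_negbin) # ratio is constant in theta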
We could approximate the \(\color{#E69F00}{\text{marginal likelihood}}\) through either numerical integration or Monte Carlo methods, as in the code below.
In more complicated models, we will instead try to sample from the posterior while bypassing this calculation entirely.
y <- 6L # number of successes
n <- 14L # number of trials
alpha <- beta <- 1.5 # prior parameters
unnormalized_posterior <- function(theta){
  theta^(y + alpha - 1) * (1 - theta)^(n - y + beta - 1)
}
integrate(f = unnormalized_posterior,
          lower = 0,
          upper = 1)
## 1.07e-05 with absolute error < 1e-12
# Compare with known constant
beta(y + alpha, n - y + beta)
## [1] 1.07e-05
# Monte Carlo integration
mean(unnormalized_posterior(runif(1e5)))
## [1] 1.06e-05
We could define the posterior simply as the normalized product of the likelihood and some prior function.
The prior function need not even be proportional to a density function (i.e., integrable as a function of \(\boldsymbol{\theta}\)).
For example, the improper Haldane prior \(p(\theta) \propto \theta^{-1}(1-\theta)^{-1}\) is not integrable, yet it yields a proper \(\mathsf{Be}(y, n-y)\) posterior whenever \(0 < y < n\).
The posterior is beta, with expected value \[\begin{align*} \mathsf{E}(\theta \mid y) &= w\frac{y}{n} + (1-w) \frac{\alpha}{\alpha + \beta}, \\ w&=\frac{n}{n+\alpha+\beta}, \end{align*}\] a weighted average of the sample proportion \(y/n\) (the maximum likelihood estimate) and the prior mean \(\alpha/(\alpha+\beta)\).
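A numerical check of this weighted-average representation (same illustrative values as before):

# Posterior mean as a weighted average of the MLE and the prior mean
y <- 6; n <- 14; alpha <- beta <- 1.5
w <- n / (n + alpha + beta)
w * (y / n) + (1 - w) * alpha / (alpha + beta) # weighted average
(y + alpha) / (n + alpha + beta) # mean of Be(y + alpha, n - y + beta)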
Except for stubborn priors, the likelihood contribution dominates in large samples. The impact of the prior is then often negligible.
The output of Bayesian learning will be either of:
Most of the field revolves around the creation of algorithms that either
Define the \(\color{#D55E00}{\text{posterior predictive}}\), \[\begin{align*} p(y_{\text{new}}\mid \boldsymbol{y}) = \int_{\boldsymbol{\Theta}} p(y_{\text{new}} \mid \boldsymbol{\theta}) \color{#D55E00}{p(\boldsymbol{\theta} \mid \boldsymbol{y})} \mathrm{d} \boldsymbol{\theta} \end{align*}\] and the \(\color{#56B4E9}{\text{prior predictive}}\), \[\begin{align*} p(y_{\text{new}}) = \int_{\boldsymbol{\Theta}} p(y_{\text{new}} \mid \boldsymbol{\theta}) \color{#56B4E9}{p(\boldsymbol{\theta})} \mathrm{d} \boldsymbol{\theta}. \end{align*}\] The latter is useful for determining whether the prior is sensible.
Given the \(\mathsf{Be}(a, b)\) prior or posterior, the predictive for \(n_{\text{new}}\) trials is beta-binomial with density \[\begin{align*} p(y_{\text{new}}\mid y) &= \int_0^1 \binom{n_{\text{new}}}{y_{\text{new}}} \frac{\theta^{a + y_{\text{new}}-1}(1-\theta)^{b + n_{\text{new}} - y_{\text{new}}-1}}{ \mathrm{Be}(a, b)}\mathrm{d} \theta \\&= \binom{n_{\text{new}}}{y_{\text{new}}} \frac{\mathrm{Be}(a + y_{\text{new}}, b + n_{\text{new}} - y_{\text{new}})}{\mathrm{Be}(a, b)}, \end{align*}\] where \(\mathrm{Be}(\cdot, \cdot)\) denotes the beta function.
Replace \(a=y + \alpha\) and \(b=n-y + \beta\) to get the posterior predictive distribution.
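A sketch evaluating this beta-binomial mass function in R (illustrative values; n_new, the number of future trials, is an arbitrary choice and beta_sh plays the role of \(\beta\)):

# Beta-binomial posterior predictive mass function p(y_new | y)
y <- 6; n <- 14; alpha <- 1.5; beta_sh <- 1.5
a <- y + alpha; b <- n - y + beta_sh # posterior shape parameters
n_new <- 10 # assumed number of future trials
dbetabinom <- function(y_new) {
  choose(n_new, y_new) * beta(a + y_new, b + n_new - y_new) / beta(a, b)
}
probs <- dbetabinom(0:n_new)
sum(probs) # mass function sums to one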
The posterior predictive carries over the parameter uncertainty, so it will typically be wider and overdispersed relative to the sampling distribution evaluated at a point estimate of \(\boldsymbol{\theta}\).
Given a draw \(\theta^*\) from the posterior, simulate a new observation from the distribution \(f(y_{\text{new}}; \theta^*)\).
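A minimal sketch of this two-step simulation (posterior draws, then new binomial outcomes; n_new and the number of draws are arbitrary choices):

# Simulate from the posterior predictive: theta from the posterior, then y_new given theta
set.seed(2024)
y <- 6; n <- 14; alpha <- beta <- 1.5
n_new <- 10 # assumed number of future trials
theta_star <- rbeta(1e4, y + alpha, n - y + beta)
y_new <- rbinom(1e4, size = n_new, prob = theta_star)
c(mean(y_new), var(y_new)) # overdispersed relative to a binomial with fixed theta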
The output of a Bayesian procedure is a distribution for the parameters given the data.
We may wish to return different numerical summaries (expected value, variance, mode, quantiles, …)
The question: which point estimator to return?
A loss function \(c(\boldsymbol{\theta}, \boldsymbol{\upsilon}): \boldsymbol{\Theta} \times \boldsymbol{\Theta} \to \mathbb{R}_{+}\) assigns to each pair \((\boldsymbol{\theta}, \boldsymbol{\upsilon})\) a weight corresponding to the regret or loss of reporting \(\boldsymbol{\upsilon}\) when the true value is \(\boldsymbol{\theta}\).
The point estimator \(\widehat{\boldsymbol{\upsilon}}\) is the minimizer of the expected loss \[\begin{align*} \widehat{\boldsymbol{\upsilon}} &= \mathop{\mathrm{argmin}}_{\boldsymbol{\upsilon}}\mathsf{E}_{\boldsymbol{\Theta} \mid \boldsymbol{Y}}\{c(\boldsymbol{\theta}, \boldsymbol{\upsilon})\} \\&=\mathop{\mathrm{argmin}}_{\boldsymbol{\upsilon}} \int_{\boldsymbol{\Theta}} c(\boldsymbol{\theta}, \boldsymbol{\upsilon})p(\boldsymbol{\theta} \mid \boldsymbol{y}) \mathrm{d} \boldsymbol{\theta} \end{align*}\]
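As an illustration (a sketch reusing the beta posterior from before), minimizing a Monte Carlo estimate of the expected loss recovers the posterior mean under quadratic loss and the posterior median under absolute loss:

# Point estimators as minimizers of Monte Carlo estimates of the expected posterior loss
set.seed(2024)
y <- 6; n <- 14; alpha <- beta <- 1.5
theta_post <- rbeta(1e5, y + alpha, n - y + beta)
risk_quad <- function(u) mean((theta_post - u)^2) # quadratic loss
risk_abs <- function(u) mean(abs(theta_post - u)) # absolute loss
optimize(risk_quad, interval = c(0, 1))$minimum # approx. posterior mean
optimize(risk_abs, interval = c(0, 1))$minimum # approx. posterior median
c(mean(theta_post), median(theta_post))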
In a univariate setting, the most widely used point estimators are the posterior mean, the posterior median, and the posterior mode.
The posterior mode \(\boldsymbol{\theta}_{\mathrm{map}} = \mathrm{argmax}_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \boldsymbol{y})\) is the maximum a posteriori or MAP estimator.
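For the conjugate beta posterior, the MAP is available in closed form, \((y + \alpha - 1)/(n + \alpha + \beta - 2)\), when both shape parameters exceed one; a numerical sketch (illustrative values):

# MAP estimate: maximize the log posterior (up to the normalizing constant)
y <- 6; n <- 14; alpha <- beta <- 1.5
log_post <- function(theta) {
  dbinom(y, size = n, prob = theta, log = TRUE) + dbeta(theta, alpha, beta, log = TRUE)
}
optimize(log_post, interval = c(0, 1), maximum = TRUE)$maximum
(y + alpha - 1) / (n + alpha + beta - 2) # closed-form mode of Be(y + alpha, n - y + beta)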
The freshman dream comes true! A \(1-\alpha\) credible region gives a set of parameter values that contains the “true value” of the parameter \(\boldsymbol{\theta}\) with probability \(1-\alpha\).
Caveat: McElreath (2020) suggests the term ‘compatibility’, as it
returns the range of parameter values compatible with the model and data.
There are multiple \(1-\alpha\) credible intervals; the most common are equal-tailed intervals, based on the \(\alpha/2\) and \(1-\alpha/2\) posterior quantiles, and highest posterior density (HPD) intervals.
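A sketch computing both for the running beta posterior (illustrative values; the credibility level is arbitrary); for a unimodal posterior, the HPD interval is the shortest interval with the required coverage:

# Equal-tailed and highest posterior density (HPD) intervals for Be(y + alpha, n - y + beta)
y <- 6; n <- 14; alpha <- beta <- 1.5
a <- y + alpha; b <- n - y + beta
level <- 0.89 # arbitrary credibility level for illustration
qbeta(c((1 - level) / 2, 1 - (1 - level) / 2), a, b) # equal-tailed interval
hpd_len <- function(p_low) qbeta(p_low + level, a, b) - qbeta(p_low, a, b)
p_low <- optimize(hpd_len, interval = c(0, 1 - level))$minimum
qbeta(c(p_low, p_low + level), a, b) # HPD interval (shortest with 'level' coverage)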