Priors
Last compiled Tuesday Apr 15, 2025
The posterior density is
\[\begin{align*} \color{#D55E00}{p(\boldsymbol{\theta} \mid \boldsymbol{Y})} = \frac{\color{#0072B2}{p(\boldsymbol{Y} \mid \boldsymbol{\theta})} \times \color{#56B4E9}{p(\boldsymbol{\theta})}}{\color{#E69F00}{\int p(\boldsymbol{Y} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta})\mathrm{d} \boldsymbol{\theta}}}, \end{align*}\]
where \[\color{#D55E00}{\text{posterior}} \propto \color{#0072B2}{\text{likelihood}} \times \color{#56B4E9}{\text{prior}}\]
We need to determine a suitable prior.
The posterior is a compromise between the prior and the likelihood:
There are infinitely many possible choices, but many defaults exist…
The parameters of the (hyper)priors are termed hyperparameters.
How do we elicit reasonable values for them?
Working with standardized response and inputs \[x_i \mapsto (x_i - \overline{x})/\mathrm{sd}(\boldsymbol{x}),\]
Consider the relationship between height (\(Y,\) in cm) and weight (\(X,\) in kg) among human adults.
Model using a simple linear regression
\[\begin{align*} Y_i &\sim \mathsf{Gauss}(\mu_i, \sigma^2) \\ \mu_i &= \beta_0 + \beta_1(\mathrm{x}_i - \overline{x}) \\ \beta_0 &\sim \mathsf{Gauss}(178, 20^2) \\ \sigma &\sim \mathsf{unif}(0, 50) \end{align*}\] What about the slope parameter prior \(p(\beta_1)\)?
Figure 1: Prior draws of linear regressions with different priors: vague \(\beta_1 \sim \mathsf{Gauss}(0, 100)\) (left) and lognormal \(\ln(\beta_1) \sim \mathsf{Gauss}(0,1)\) (right). Figure 4.5 of McElreath (2020). The Guinness record for the world’s tallest person is 272 cm.
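Prior predictive draws like those in the figure can be simulated in a few lines. A minimal sketch with made-up draw counts, reading \(\mathsf{Gauss}(\text{mean}, \text{variance})\) as in the model above:

```python
import math
import random

random.seed(2025)
n_draws = 100

# beta0 ~ Gauss(178, 20^2): prior for mean height at the average weight
beta0 = [random.gauss(178, 20) for _ in range(n_draws)]

# Vague slope prior beta1 ~ Gauss(0, 100), i.e. standard deviation 10
vague_slopes = [random.gauss(0, 10) for _ in range(n_draws)]
# Lognormal slope prior ln(beta1) ~ Gauss(0, 1): slopes positive by construction
logn_slopes = [math.exp(random.gauss(0, 1)) for _ in range(n_draws)]

# Implied mean height (cm) for a person 30 kg above the average weight
tall_vague = [b0 + b1 * 30 for b0, b1 in zip(beta0, vague_slopes)]
n_absurd = sum(mu > 272 for mu in tall_vague)  # taller than the world record
```

Under the vague prior a sizeable share of draws exceeds the world record (and many slopes are negative), whereas the lognormal prior rules out negative height–weight relationships.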
A prior density \(p(\boldsymbol{\theta})\) is conjugate for likelihood \(L(\boldsymbol{\theta}; \boldsymbol{y})\) if the product \(L(\boldsymbol{\theta}; \boldsymbol{y})p(\boldsymbol{\theta}),\) after renormalization, is of the same parametric family as the prior.
Distributions that belong to the exponential family admit conjugate priors.
A distribution is an exponential family if its density can be written \[\begin{align*} f(y; \boldsymbol{\theta}) = \exp\left\{ \sum_{k=1}^K Q_k(\boldsymbol{\theta}) t_k(y) + D(\boldsymbol{\theta}) + h(y)\right\}. \end{align*}\] The support of \(f\) must not depend on \(\boldsymbol{\theta}.\)
distribution | unknown parameter | conjugate prior |
---|---|---|
\(Y \sim \mathsf{expo}(\lambda)\) | \(\lambda\) | \(\lambda \sim \mathsf{gamma}(\alpha, \beta)\) |
\(Y \sim \mathsf{Poisson}(\mu)\) | \(\mu\) | \(\mu \sim \mathsf{gamma}(\alpha, \beta)\) |
\(Y \sim \mathsf{binom}(n, \theta)\) | \(\theta\) | \(\theta \sim \mathsf{Be}(\alpha, \beta)\) |
\(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) | \(\mu\) | \(\mu \sim \mathsf{Gauss}(\nu, \omega^2)\) |
\(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) | \(\sigma\) | \(\sigma^{-2} \sim \mathsf{gamma}(\alpha, \beta)\) |
\(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) | \(\mu, \sigma\) | \(\mu \mid \sigma^2 \sim \mathsf{Gauss}(\nu, \omega \sigma^2),\) \(\sigma^{-2} \sim \mathsf{gamma}(\alpha, \beta)\) |
If \(Y \sim \mathsf{Poisson}(\mu)\) with density \(f(y) = \mu^y\exp(-\mu)/y!,\) consider the prior \(\mu \sim \mathsf{gamma}(\alpha, \beta)\) with \(\alpha, \beta\) fixed. Given an i.i.d. sample with mean \(\overline{y},\) the posterior density is
\[ p(\mu \mid \boldsymbol{y}) \stackrel{\mu}{\propto} \mu^{n\overline{y}} \exp\left(-n\mu\right) \mu^{\alpha-1} \exp(-\beta \mu), \] so the posterior must be \(\mathsf{gamma}(n\overline{y} + \alpha, n + \beta).\)
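The conjugate update can be verified numerically. A small sketch with made-up counts and hypothetical hyperparameters, comparing the closed-form gamma posterior to a renormalized grid evaluation of likelihood × prior:

```python
import math

alpha, beta = 2.0, 1.0           # hypothetical prior hyperparameters
y = [3, 1, 4, 2, 2]              # made-up Poisson counts
n, s = len(y), sum(y)            # s = n * ybar

# Closed-form conjugate update: gamma(alpha + n*ybar, beta + n)
a_post, b_post = alpha + s, beta + n

def gamma_logpdf(x, a, b):
    """Log density of a gamma variable with shape a and rate b."""
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x

# Grid check: renormalize likelihood x prior and compare to the gamma density
h = 0.001
grid = [i * h for i in range(1, 20001)]                      # mu in (0, 20]
logkern = [(s + alpha - 1) * math.log(m) - (n + beta) * m for m in grid]
c = max(logkern)
w = [math.exp(lk - c) for lk in logkern]
post_grid = [wi / (sum(w) * h) for wi in w]
```

The grid density and the closed-form \(\mathsf{gamma}(14, 6)\) density agree pointwise up to discretization error.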
Parameter interpretation: \(\alpha\) events in \(\beta\) time intervals.
Consider an iid sample, \(Y_i \sim \mathsf{Gauss}(\mu, \sigma^2)\) and let \(\mu \mid \sigma \sim \mathsf{Gauss}(\nu, \sigma^2\tau^2).\) Then, \[\begin{align*} p(\mu, \sigma) &\propto \frac{p(\sigma)}{\sigma^{n+1}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_{i}-\mu)^2\right\} \exp\left\{-\frac{1}{2\sigma^2\tau^2}(\mu - \nu)^2\right\} \\&\propto \frac{p(\sigma)}{\sigma^{n+1}} \exp\left\{\left(\sum_{i=1}^n y_{i} + \frac{\nu}{\tau^2}\right)\frac{\mu}{\sigma^2} - \left( \frac{n}{2} +\frac{1}{2\tau^2}\right)\frac{\mu^2}{\sigma^2}\right\}. \end{align*}\]
The conditional posterior \(p(\mu \mid \sigma)\) is Gaussian with mean \(\left(\sum_{i=1}^n y_i + \nu/\tau^2\right)/\left(n + \tau^{-2}\right)\) and variance \(\sigma^2/\left(n + \tau^{-2}\right).\)
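The completion of the square can be sanity-checked numerically. A sketch with arbitrary toy values, renormalizing the Gaussian kernel in \(\mu\) on a grid:

```python
import math

# Toy data and hyperparameters (arbitrary illustrative values)
y = [1.2, 0.8, 1.5, 1.1]
nu, tau, sigma = 0.0, 2.0, 1.0
n, s = len(y), sum(y)

# Completing the square in the exponent gives these conditional moments
cond_mean = (s + nu / tau**2) / (n + 1 / tau**2)
cond_var = sigma**2 / (n + 1 / tau**2)

# Grid check: renormalize exp{(s + nu/tau^2) mu/sigma^2 - (n/2 + 1/(2 tau^2)) mu^2/sigma^2}
h = 0.001
grid = [i * h for i in range(-5000, 5001)]
logkern = [(s + nu / tau**2) * m / sigma**2
           - (n / 2 + 1 / (2 * tau**2)) * m**2 / sigma**2 for m in grid]
c = max(logkern)
w = [math.exp(lk - c) for lk in logkern]
dens = [wi / (sum(w) * h) for wi in w]

def gauss_pdf(x, m, v):
    """Gaussian density with mean m and variance v."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
```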
We treat \(\texttt{impressions}\) as a known offset.
Consider an A/B test from November 23rd, 2014, that compared four different headlines for a Sesame Street Workshop story featuring interviews of children whose parents were in jail and who visited them in prison. The headlines tested were:
headline | impressions | clicks |
---|---|---|
H1 | 3060 | 49 |
H2 | 2982 | 20 |
H3 | 3112 | 31 |
H4 | 3083 | 9 |
For \(Y \sim \mathsf{gamma}(\alpha, \beta)\) with \(\beta\) the rate parameter, we have \[\begin{align*} \mathsf{E}(Y)=\alpha/\beta, \qquad \mathsf{Va}(Y)=\alpha/\beta^2. \end{align*}\] We can solve for \(\beta =\mathsf{E}_0(\lambda)/\mathsf{Va}_0(\lambda)\) and then use the mean relationship to retrieve \(\alpha = \beta\,\mathsf{E}_0(\lambda).\)
Moment matching gives \(\alpha = 1.65\) and \(\beta = 104.44.\)
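Moment matching and the conjugate update for each headline fit in a few lines. A minimal sketch, treating impressions as the known Poisson offset so that the posterior for a headline with \(y\) clicks out of \(m\) impressions is \(\mathsf{gamma}(\alpha + y, \beta + m)\):

```python
def moment_match(mean, var):
    """Solve E = a/b and Var = a/b**2 for a gamma(shape=a, rate=b) prior."""
    b = mean / var
    return b * mean, b

# Slide values: alpha = 1.65, beta = 104.44
alpha, beta = 1.65, 104.44

# (impressions, clicks) per headline, from the table above
data = {"H1": (3060, 49), "H2": (2982, 20), "H3": (3112, 31), "H4": (3083, 9)}

# Conjugate Poisson-gamma update with impressions as a known offset
posterior = {h: (alpha + c, beta + m) for h, (m, c) in data.items()}
post_mean = {h: a / b for h, (a, b) in posterior.items()}
```

The posterior means immediately rank the headlines by conversion rate, with H1 on top.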
Figure 2: Gamma posteriors of the conversion rate for the Upworthy Sesame street headline.
Theorem 1 A sufficient condition for a prior to yield a proper (i.e., integrable) posterior density function is that it is proportional to a density function.
Consider a Gaussian random effect model with \(n\) independent observations in \(J\) groups
The \(i\)th observation in group \(j\) is \[\begin{align*} Y_{ij} &\sim \mathsf{Gauss}(\mu_{ij}, \sigma^2) \\ \mu_{ij}&= \mathbf{X}_i \boldsymbol{\beta} + \alpha_j, \\ \alpha_j &\sim \mathsf{Gauss}(0, \tau^2)\\ ... \end{align*}\]
As Gelman (2006) states:
in a hierarchical model the data can never rule out a group-level variance of zero, and so [a] prior distribution cannot put an infinite mass in this area
We can view the improper prior as a limiting case \[\sigma \sim \mathsf{unif}(0, t), \qquad t \to \infty.\]
The Haldane prior for \(\theta\) in a binomial model is \(\theta^{-1}(1-\theta)^{-1},\) a limiting \(\mathsf{Be}(0,0)\) distribution.
The improper prior \(p(\sigma) \propto \sigma^{-1}\) is equivalent to an inverse gamma \(\mathsf{inv. gamma}(\epsilon, \epsilon)\) when \(\epsilon \to 0.\)
The limiting posterior is thus improper for random effects scales, so the value of \(\epsilon\) matters.
Let \(Y_i \sim \mathsf{GP}(\sigma, \xi)\) be generalized Pareto with density \[f(x) = \sigma^{-1}(1+\xi x/\sigma)_{+}^{-1/\xi-1}\] for \(\sigma>0\) and \(\xi \in \mathbb{R},\) and \(x_{+} =\max\{0, x\}.\)
Consider the maximum data information (MDI) \[p(\xi) \propto \exp(-\xi).\]
Since \(\lim_{\xi \to -\infty} \exp(-\xi) = \infty,\) the prior density increases without bound as \(\xi\) becomes smaller.
The MDI prior leads to an improper posterior without modification.
Figure 3: Unscaled maximum data information (MDI) prior density.
If we restrict the range of the MDI prior \(p(\xi)\) to \(\xi \geq -1,\) then \(\xi + 1 \sim \mathsf{expo}(1)\) and the posterior is proper.
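The truncation claim is easy to check numerically: with the substitution \(u = \xi + 1,\) the restricted prior \(\exp\{-(\xi+1)\}\) becomes the unit-rate exponential density on \(u > 0,\) so its total mass is one. A quick sketch:

```python
import math

# p(xi) = exp(-(xi + 1)) for xi >= -1; substituting u = xi + 1 gives
# the expo(1) density exp(-u) on u > 0, so the mass should be ~1.
h = 1e-3
u_grid = [i * h for i in range(1, 30001)]   # u = xi + 1 on (0, 30]
mass = sum(math.exp(-u) for u in u_grid) * h
```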
Uniform prior over the support of \(\theta,\) \[p(\theta) \propto 1.\]
Improper prior unless \(\theta \in [a,b]\) for finite \(a, b.\)
Consider a scale parameter \(\sigma > 0.\)
Vague priors are very diffuse proper priors.
For example, a vague Gaussian prior for regression coefficients on standardized data, \[\boldsymbol{\beta} \sim \mathsf{Gauss}_p(\mathbf{0}_p, 100\mathbf{I}_p).\]
In single-parameter models, the Jeffreys prior \[p(\theta) \propto |\imath(\theta)|^{1/2},\] proportional to the square root of the determinant of the Fisher information matrix, is invariant to any (differentiable) reparametrization.
Consider \(Y \sim \mathsf{binom}(1, \theta).\) The negative of the second derivative of the log likelihood with respect to \(\theta\) is \[ \jmath(\theta) = - \partial^2 \ell(\theta; y) / \partial \theta^2 = y/\theta^2 + (1-y)/(1-\theta)^2. \]
Since \(\mathsf{E}(Y)=\theta,\) the Fisher information is \[\imath(\theta) = \mathsf{E}\{\jmath(\theta)\}=1/\theta + 1/(1-\theta) = 1/\{\theta(1-\theta)\}.\] The Jeffreys prior is therefore \(p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2},\) a conjugate \(\mathsf{Be}(0.5,0.5)\) prior.
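The expectation step can be double-checked by summing the observed information over the two Bernoulli outcomes:

```python
def observed_info(theta, y):
    """Observed information j(theta) for a single Bernoulli observation y."""
    return y / theta**2 + (1 - y) / (1 - theta) ** 2

def fisher_info(theta):
    """Fisher information: weight j(theta) by the Bernoulli pmf, since E(Y) = theta."""
    return theta * observed_info(theta, 1) + (1 - theta) * observed_info(theta, 0)

# Matches 1 / {theta (1 - theta)} for any theta in (0, 1)
for theta in (0.1, 0.25, 0.5, 0.9):
    assert abs(fisher_info(theta) - 1 / (theta * (1 - theta))) < 1e-12
```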
For a location-scale family with location \(\mu\) and scale \(\sigma,\) the independent priors \[\begin{align*} p(\mu) &\propto 1\\ p(\sigma) &\propto \sigma^{-1} \end{align*}\] are location-scale invariant.
The results are invariant to affine transformations of the units, \(\vartheta = a + b \theta.\)
Simpson et al. (2017) consider a principled way of constructing priors that penalize model complexity, for stable inference and to limit over-specification.
Compute the Kullback–Leibler divergence between the model density \(f\) and a base model density \(f_0,\) build an exponential prior on the distance scale, and backtransform.
The resulting prior is scale-invariant, but its derivation is nontrivial.
If \(\alpha_j \sim \mathsf{Gauss}(0, \zeta^2),\) the penalized complexity prior for the scale is \(\zeta \sim \mathsf{expo}(\lambda).\)
Elicit \(Q,\) a high quantile of the standard deviation \(\zeta\) with tail probability \(\alpha,\) and set \(\lambda = -\ln(\alpha)/Q.\)
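In code, the elicitation step is one line; the values of \(Q\) and \(\alpha\) below are hypothetical elicited quantities:

```python
import math

def pc_rate(Q, alpha):
    """Rate lambda of the expo prior so that P(zeta > Q) = alpha."""
    return -math.log(alpha) / Q

# Hypothetical elicitation: we believe a priori that P(zeta > 1) = 0.01
lam = pc_rate(Q=1.0, alpha=0.01)

# The exponential survival function exp(-lam * Q) recovers the tail probability
tail = math.exp(-lam * 1.0)
```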
The conjugate inverse gamma prior \(\zeta^2 \sim \mathsf{inv. gamma}(\alpha, \beta)\) is such that the mode of \(\zeta^2\) is \(\beta/(1+\alpha).\)
Often, we take \(\beta=\alpha = 0.01\) or \(0.001,\) but this leads to near-improper priors, so small values of the parameters are not optimal for ‘random effects’.
The inverse gamma prior cannot provide shrinkage or allow for no variability between groups.
A popular suggestion, due to Gelman (2006), is to take a Student-\(t\) distribution centered at zero with \(\nu\) degrees of freedom and scale \(s,\) truncated over \([0, \infty).\)
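Sampling from this truncated (half-) Student-\(t\) prior is straightforward by folding ordinary \(t\) draws onto the positive half-line, using the representation \(T = Z/\sqrt{V/\nu}\) with \(Z\) standard Gaussian and \(V \sim \chi^2_\nu.\) A minimal sketch with arbitrary \(\nu\) and scale:

```python
import math
import random

random.seed(42)

def half_t(nu, scale):
    """One draw from a Student-t(nu) with scale `scale`, folded onto [0, inf)."""
    z = random.gauss(0, 1)
    v = random.gammavariate(nu / 2, 2)   # chi-squared with nu degrees of freedom
    return abs(scale * z / math.sqrt(v / nu))

draws = [half_t(nu=3, scale=5.0) for _ in range(1000)]
```

With \(\nu = 1\) this reduces to the half-Cauchy, another common choice for random effect scales.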
Do the priors matter? As a robustness check, one can fit the model with different priors and compare the resulting posteriors.
Costly, but may be needed to convince reviewers ;)
We consider an experimental study conducted at Tech3Lab on road safety.
We model the number of violations, \(\texttt{nviolation}\), as a function of distraction type (\(\texttt{task}\)) and participant \(\texttt{id}\): \[\begin{align*}
\texttt{nviolation}_{ij} &\sim \mathsf{Poisson}(\mu_{ij})\\
\mu_{ij} &= \exp(\beta_{j} + \alpha_i),\\
\beta_j &\sim \mathsf{Gauss}(0, 100), \\
\alpha_i &\sim \mathsf{Gauss}(0, \tau^2).
\end{align*}\]
Specifically, \(\beta_j\) is the coefficient for \(\texttt{task}\) \(j\) (distraction type). Consider different priors for \(\tau.\)
Figure 4: Posterior density of \(\tau\) for four different priors. The circle denotes the median and the bars the 50% and 95% percentile credible intervals.
The results are essentially indistinguishable for the random effect scale.
Average results of an SAT coaching program for eight schools (Rubin, 1981).
The hierarchical model is
\[\begin{align*} Y_i &\sim \mathsf{Gauss}(\mu + \eta_i, \sigma_i^2)\\ \mu &\sim \mathsf{Gauss}(0, 100)\\ \eta_i & \sim \mathsf{Gauss}(0, \tau^2) \end{align*}\] Given the large sample in each school, we treat \(\sigma_i\) as fixed data by using the sample standard deviation.
Figure 5: Posterior density of the school-specific random effects standard deviation \(\tau\) under different priors.