Priors
Last compiled Tuesday Feb 4, 2025
The posterior density is
\[\begin{align*} \color{#D55E00}{p(\boldsymbol{\theta} \mid \boldsymbol{Y})} = \frac{\color{#0072B2}{p(\boldsymbol{Y} \mid \boldsymbol{\theta})} \times \color{#56B4E9}{p(\boldsymbol{\theta})}}{\color{#E69F00}{\int p(\boldsymbol{Y} \mid \boldsymbol{\theta}) p(\boldsymbol{\theta})\mathrm{d} \boldsymbol{\theta}}}, \end{align*}\]
where \[\color{#D55E00}{\text{posterior}} \propto \color{#0072B2}{\text{likelihood}} \times \color{#56B4E9}{\text{prior}}\]
We need to determine a suitable prior.
The posterior is a compromise between the prior and the likelihood:
There are infinitely many possible choices of prior, but many default options exist…
The parameters of the (hyper)priors are termed hyperparameters.
How to elicit reasonable values for them?
Working with a standardized response and inputs, \[x_i \mapsto (x_i - \overline{x})/\mathrm{sd}(\boldsymbol{x}),\]
Consider the relationship between height (\(Y,\) in cm) and weight (\(X,\) in kg) among human adults.
Model using a simple linear regression
\[\begin{align*} Y_i &\sim \mathsf{Gauss}(\mu_i, \sigma^2) \\ \mu_i &= \beta_0 + \beta_1(\mathrm{x}_i - \overline{x}) \\ \beta_0 &\sim \mathsf{Gauss}(178, 20^2) \\ \sigma &\sim \mathsf{unif}(0, 50) \end{align*}\] What about the slope parameter prior \(p(\beta_1)\)?
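A prior predictive check helps answer this: draw parameters from the candidate priors and inspect the implied regression lines. A minimal sketch in Python, where the \(\mathsf{Gauss}(0, 10^2)\) slope prior is a hypothetical candidate to be checked, not part of the model above:

```python
import numpy as np

rng = np.random.default_rng(42)
n_sim = 100
weight = np.linspace(30, 120, 50)   # kg, a plausible adult range (assumption)
wbar = weight.mean()

beta0 = rng.normal(178, 20, n_sim)  # intercept prior from the model
beta1 = rng.normal(0, 10, n_sim)    # hypothetical diffuse slope prior

# Implied mean heights: many lines are absurd (negative heights, or
# several cm of change per kg), suggesting this slope prior is too diffuse.
mu = beta0[:, None] + beta1[:, None] * (weight - wbar)
print(f"range of implied mean heights: {mu.min():.0f} to {mu.max():.0f} cm")
```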
A prior density \(p(\boldsymbol{\theta})\) is conjugate for likelihood \(L(\boldsymbol{\theta}; \boldsymbol{y})\) if the product \(L(\boldsymbol{\theta}; \boldsymbol{y})p(\boldsymbol{\theta}),\) after renormalization, is of the same parametric family as the prior.
Distributions that belong to an exponential family admit conjugate priors.
A distribution belongs to an exponential family if its density can be written \[\begin{align*} f(y; \boldsymbol{\theta}) = \exp\left\{ \sum_{k=1}^K Q_k(\boldsymbol{\theta}) t_k(y) + D(\boldsymbol{\theta}) + h(y)\right\}. \end{align*}\] The support of \(f\) must not depend on \(\boldsymbol{\theta}.\)
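For example, the Poisson mass function is of this form:
\[\begin{align*} f(y; \mu) = \frac{\mu^y \exp(-\mu)}{y!} = \exp\left\{ y \log \mu - \mu - \log y!\right\}, \end{align*}\]
with \(Q_1(\mu) = \log \mu,\) \(t_1(y) = y,\) \(D(\mu) = -\mu\) and \(h(y) = -\log y!.\)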
distribution | unknown parameter | conjugate prior |
---|---|---|
\(Y \sim \mathsf{expo}(\lambda)\) | \(\lambda\) | \(\lambda \sim \mathsf{gamma}(\alpha, \beta)\) |
\(Y \sim \mathsf{Poisson}(\mu)\) | \(\mu\) | \(\mu \sim \mathsf{gamma}(\alpha, \beta)\) |
\(Y \sim \mathsf{binom}(n, \theta)\) | \(\theta\) | \(\theta \sim \mathsf{Be}(\alpha, \beta)\) |
\(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) | \(\mu\) | \(\mu \sim \mathsf{Gauss}(\nu, \omega^2)\) |
\(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) | \(\sigma\) | \(\sigma^{-2} \sim \mathsf{gamma}(\alpha, \beta)\) |
\(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) | \(\mu, \sigma\) | \(\mu \mid \sigma^2 \sim \mathsf{Gauss}(\nu, \omega \sigma^2),\) \(\sigma^{-2} \sim \mathsf{gamma}(\alpha, \beta)\) |
If \(Y \sim \mathsf{Poisson}(\mu)\) with mass function \(f(y) = \mu^y\exp(-\mu)/y!,\) take \(\mu \sim \mathsf{gamma}(\alpha, \beta)\) with \(\alpha, \beta\) fixed. Consider an i.i.d. sample of size \(n\) with mean \(\overline{y}.\) The posterior density is
\[ p(\mu \mid \boldsymbol{y}) \stackrel{\mu}{\propto} \mu^{n\overline{y}} \exp\left(-n\mu\right) \mu^{\alpha-1} \exp(-\beta \mu), \] so the posterior must be \(\mathsf{gamma}(n\overline{y} + \alpha, n + \beta).\)
Parameter interpretation: \(\alpha\) events in \(\beta\) time intervals.
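A quick numerical sanity check of the conjugate update, comparing the closed-form gamma posterior against a grid approximation; the hyperparameters and simulated data below are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, beta = 2.0, 1.0                    # illustrative hyperparameters
y = rng.poisson(lam=3.0, size=50)         # simulated i.i.d. Poisson sample
n, ybar = y.size, y.mean()

# Analytic conjugate update: gamma(alpha + n*ybar, beta + n)
post = stats.gamma(a=alpha + n * ybar, scale=1 / (beta + n))

# Numerical check on a grid: prior times likelihood, renormalized
mu = np.linspace(0.01, 8, 2000)
log_post = (stats.gamma.logpdf(mu, a=alpha, scale=1 / beta)
            + stats.poisson.logpmf(y[:, None], mu).sum(axis=0))
dens = np.exp(log_post - log_post.max())
dens /= np.trapz(dens, mu)

# Maximum absolute deviation from the analytic posterior: near zero
print(np.abs(dens - post.pdf(mu)).max())
```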
Consider an iid sample, \(Y_i \sim \mathsf{Gauss}(\mu, \sigma^2)\) and let \(\mu \mid \sigma \sim \mathsf{Gauss}(\nu, \sigma^2\tau^2).\) Then, \[\begin{align*} p(\mu, \sigma) &\propto \frac{p(\sigma)}{\sigma^{n+1}} \exp\left\{ -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_{i}-\mu)^2\right\} \exp\left\{-\frac{1}{2\sigma^2\tau^2}(\mu - \nu)^2\right\} \\&\propto \frac{p(\sigma)}{\sigma^{n+1}} \exp\left\{\left(\sum_{i=1}^n y_{i} + \frac{\nu}{\tau^2}\right)\frac{\mu}{\sigma^2} - \left( \frac{n}{2} +\frac{1}{2\tau^2}\right)\frac{\mu^2}{\sigma^2}\right\}. \end{align*}\]
The conditional posterior \(p(\mu \mid \sigma, \boldsymbol{y}),\) obtained by completing the square in \(\mu,\) is Gaussian with mean \((n\overline{y} + \nu/\tau^2)/(n + 1/\tau^2)\) and variance \(\sigma^2/(n + 1/\tau^2).\)
Consider an A/B test from November 23rd, 2014, that compared four different headlines for a story on a Sesame Street workshop with interviews of children whose parents were in jail, visiting them in prison. The headlines tested were:

headline | impressions | clicks |
---|---|---|
H1 | 3060 | 49 |
H2 | 2982 | 20 |
H3 | 3112 | 31 |
H4 | 3083 | 9 |

We treat \(\texttt{impression}\) as a known offset.
For \(Y \sim \mathsf{gamma}(\alpha, \beta)\) with \(\beta\) the rate parameter, we have \[\begin{align*} \mathsf{E}(Y)=\alpha/\beta, \qquad \mathsf{Va}(Y)=\alpha/\beta^2. \end{align*}\] We can solve for \(\beta =\mathsf{E}_0(\lambda)/\mathsf{Va}_0(\lambda)\) and then use the mean relationship to retrieve \(\alpha = \beta\,\mathsf{E}_0(\lambda).\)
Moment matching gives \(\alpha = 1.65\) and \(\beta = 104.44.\)
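In code, the moment matching is a two-liner. The elicited prior mean and standard deviation below are reverse-engineered from the stated result, and hence only illustrative:

```python
def gamma_moment_match(mean, sd):
    """Return (shape, rate) of the gamma distribution with given mean and sd."""
    rate = mean / sd**2          # beta = E0(lambda) / Va0(lambda)
    shape = mean * rate          # alpha = beta * E0(lambda)
    return shape, rate

# Hypothetical elicited values for the click-through rate
alpha, beta = gamma_moment_match(mean=0.0158, sd=0.0123)
print(round(alpha, 2), round(beta, 2))   # roughly 1.65 and 104.44
```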
Theorem 1 A sufficient condition for a prior to yield a proper (i.e., integrable) posterior density function is that it is proportional to a density function.
Consider a Gaussian random effect model with \(n\) independent observations in \(J\) groups
The \(i\)th observation in group \(j\) is \[\begin{align*} Y_{ij} &\sim \mathsf{Gauss}(\mu_{ij}, \sigma^2) \\ \mu_{ij}&= \mathbf{X}_i \boldsymbol{\beta} + \alpha_j, \\ \alpha_j &\sim \mathsf{Gauss}(0, \tau^2)\\ ... \end{align*}\]
As Gelman (2006) states:
> in a hierarchical model the data can never rule out a group-level variance of zero, and so [a] prior distribution cannot put an infinite mass in this area
We can view the improper prior as a limiting case \[\sigma \sim \mathsf{unif}(0, t), \qquad t \to \infty.\]
The Haldane prior for \(\theta\) in a binomial model is \(\theta^{-1}(1-\theta)^{-1},\) a limiting \(\mathsf{Be}(0,0)\) distribution.
The improper prior \(p(\sigma) \propto \sigma^{-1}\) corresponds to the limit \(\epsilon \to 0\) of an inverse gamma prior \(\sigma^2 \sim \mathsf{inv. gamma}(\epsilon, \epsilon).\)
The limiting posterior is thus improper for random effects scales, so the value of \(\epsilon\) matters.
Let \(Y_i \sim \mathsf{GP}(\sigma, \xi)\) be generalized Pareto with density \[f(x) = \sigma^{-1}(1+\xi x/\sigma)_{+}^{-1/\xi-1}\] for \(\sigma>0\) and \(\xi \in \mathbb{R},\) and \(x_{+} =\max\{0, x\}.\)
Consider the maximum data information (MDI) prior \[p(\xi) \propto \exp(-\xi).\]
Since \(\lim_{\xi \to -\infty} \exp(-\xi) = \infty,\) the prior density increases without bound as \(\xi\) becomes smaller.
The MDI prior leads to an improper posterior without modification.
If we restrict the range of the MDI prior to \(\xi \geq -1,\) then \(\xi + 1 \sim \mathsf{expo}(1)\) and the posterior is proper.
Uniform prior over the support of \(\theta,\) \[p(\theta) \propto 1.\]
Improper prior unless \(\theta \in [a,b]\) for finite \(a, b.\)
Consider a scale parameter \(\sigma > 0.\)
Vague priors are very diffuse proper priors.
For example, a vague Gaussian prior for regression coefficients on standardized data, \[\boldsymbol{\beta} \sim \mathsf{Gauss}_p(\mathbf{0}_p, 100\mathbf{I}_p).\]
In single-parameter models, the Jeffreys prior \[p(\theta) \propto |\imath(\theta)|^{1/2},\] proportional to the square root of the determinant of the Fisher information matrix, is invariant to any (differentiable) reparametrization.
Consider \(Y \sim \mathsf{binom}(1, \theta).\) The negative of the second derivative of the log likelihood with respect to \(\theta\) is \[ \jmath(\theta) = - \partial^2 \ell(\theta; y) / \partial \theta^2 = y/\theta^2 + (1-y)/(1-\theta)^2. \]
Since \(\mathsf{E}(Y)=\theta,\) the Fisher information is \[\imath(\theta) = \mathsf{E}\{\jmath(\theta)\}=1/\theta + 1/(1-\theta) = 1/\{\theta(1-\theta)\}.\] The Jeffreys prior is therefore \(p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2},\) a conjugate \(\mathsf{Be}(0.5,0.5)\) prior.
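We can verify the invariance numerically: reparametrize to the log-odds \(\varphi = \log\{\theta/(1-\theta)\},\) compute the Jeffreys prior directly on the \(\varphi\) scale (where \(\imath(\varphi) = \theta(1-\theta)\)), and compare with the change-of-variables transform of \(\mathsf{Be}(0.5, 0.5).\) A minimal sketch:

```python
import numpy as np

phi = np.linspace(-4, 4, 9)                  # log-odds grid
theta = 1 / (1 + np.exp(-phi))               # inverse transform

# Jeffreys prior computed directly on the phi scale:
# i(phi) = theta * (1 - theta), so p(phi) is prop. to its square root
direct = np.sqrt(theta * (1 - theta))

# Change of variables from p(theta) prop. to theta^{-1/2}(1-theta)^{-1/2},
# with Jacobian |d theta / d phi| = theta * (1 - theta)
transformed = theta**-0.5 * (1 - theta)**-0.5 * theta * (1 - theta)

print(np.allclose(direct, transformed))      # True: the two agree
```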
For a location-scale family with location \(\mu\) and scale \(\sigma,\) the independent priors \[\begin{align*} p(\mu) &\propto 1\\ p(\sigma) &\propto \sigma^{-1} \end{align*}\] are location-scale invariant.
The results are invariant to affine transformations of the units, \(\vartheta = a + b \theta.\)
Simpson et al. (2017) consider a principled way of constructing priors that penalize model complexity, for stable inference and to limit over-specification.
Compute the Kullback–Leibler divergence between the model density \(f\) and the base model density \(f_0,\) build an exponential prior on the distance scale, and back-transform.
The resulting prior is scale-invariant, but its derivation is nontrivial.
If \(\alpha_j \sim \mathsf{Gauss}(0, \zeta^2),\) the penalized complexity prior for the scale is \(\zeta \sim \mathsf{expo}(\lambda).\)
Elicit \(Q,\) a high quantile of the standard deviation \(\zeta\) with tail probability \(\alpha,\) and set \(\lambda = -\log(\alpha)/Q\) so that \(\Pr(\zeta > Q) = \exp(-\lambda Q) = \alpha.\)
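For instance, with the illustrative choices \(Q = 1\) and \(\alpha = 0.01\):

```python
import math

def pc_prior_rate(Q, alpha):
    """Rate of the exponential PC prior so that Pr(zeta > Q) = alpha."""
    return -math.log(alpha) / Q

print(pc_prior_rate(Q=1.0, alpha=0.01))   # approximately 4.61
```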
The conjugate inverse gamma prior \(\zeta^2 \sim \mathsf{inv. gamma}(\alpha, \beta)\) is such that the mode for \(\zeta^2\) is \(\beta/(1+\alpha).\)
Often, we take \(\beta=\alpha = 0.01\) or \(0.001,\) but this leads to near-improper priors, so small values of the parameters are not optimal for ‘random effects’.
The inverse gamma prior cannot provide shrinkage or allow for no variability between groups.
A popular suggestion, due to Gelman (2006), is a centered Student-\(t\) distribution with \(\nu\) degrees of freedom, truncated over \([0, \infty),\) with scale \(s.\)
Do the priors matter? As a robustness check, one can fit the model with different priors and compare the conclusions.
Costly, but may be needed to convince reviewers ;)
We consider an experimental study conducted at Tech3Lab on road safety.
We model the number of violations, \(\texttt{nviolation},\) as a function of distraction type (\(\texttt{task}\)) and participant (\(\texttt{id}\)): \[\begin{align*}
\texttt{nviolation}_{ij} &\sim \mathsf{Poisson}(\mu_{ij})\\
\mu_{ij} &= \exp(\beta_{j} + \alpha_i),\\
\beta_j &\sim \mathsf{Gauss}(0, 100), \\
\alpha_i &\sim \mathsf{Gauss}(0, \tau^2).
\end{align*}\]
Specifically, \(\beta_j\) is the effect of \(\texttt{task}\) \(j\) (distraction type) and \(\alpha_i\) is the random effect of participant \(i.\) Consider different priors for \(\tau.\)
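To compare how candidate priors for \(\tau\) behave near zero, we can evaluate their densities on the \(\tau\) scale; a minimal sketch, where the exponential (PC), inverse gamma, and half-\(t\) hyperparameters are illustrative choices:

```python
import numpy as np
from scipy import stats

tau = np.linspace(1e-3, 3, 500)

# Exponential (penalized complexity) prior on tau
pc = stats.expon(scale=1.0).pdf(tau)

# inv-gamma(0.01, 0.01) on tau^2, transformed to the tau scale:
# p(tau) = p(tau^2) * |d tau^2 / d tau| = p(tau^2) * 2 * tau
invgamma = stats.invgamma(a=0.01, scale=0.01).pdf(tau**2) * 2 * tau

# Half-t prior with 3 degrees of freedom and unit scale, truncated at zero
half_t = 2 * stats.t(df=3, scale=1.0).pdf(tau)

# The inverse gamma density vanishes at the origin (no mass near tau = 0),
# whereas the exponential and half-t priors allow shrinkage toward zero.
for name, dens in [("PC/expo", pc), ("inv-gamma", invgamma), ("half-t", half_t)]:
    print(f"{name:>10}: density at tau=0.01 is {np.interp(0.01, tau, dens):.3g}")
```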
The results for the random-effect scale are essentially indistinguishable across these priors.
Average results of an SAT coaching program, for eight schools (Rubin, 1981).
The hierarchical model is
\[\begin{align*} Y_i &\sim \mathsf{Gauss}(\mu + \eta_i, \sigma_i^2)\\ \mu &\sim \mathsf{Gauss}(0, 100)\\ \eta_i & \sim \mathsf{Gauss}(0, \tau^2) \end{align*}\] Given the large sample in each school, we treat \(\sigma_i\) as fixed data by using the sample standard deviation.
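A minimal grid evaluation of the joint posterior of \((\mu, \tau),\) using the classic eight-schools estimates and standard errors; the grid bounds and the uniform treatment of \(\tau\) over the grid are assumptions of this sketch:

```python
import numpy as np
from scipy import stats

# Estimated treatment effects and standard errors for the eight schools
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

# Grid over (mu, tau); bounds are arbitrary but wide enough here
mu_g = np.linspace(-20, 40, 200)
tau_g = np.linspace(0.01, 30, 200)
MU, TAU = np.meshgrid(mu_g, tau_g)

# Marginalizing eta_i analytically: Y_i | mu, tau ~ Gauss(mu, sigma_i^2 + tau^2)
log_post = stats.norm.logpdf(MU, loc=0, scale=10)   # mu ~ Gauss(0, 100)
for yi, si in zip(y, sigma):
    log_post += stats.norm.logpdf(yi, loc=MU, scale=np.sqrt(si**2 + TAU**2))
# Implicit uniform prior on tau over the grid (an assumption of this sketch)

post = np.exp(log_post - log_post.max())
post /= post.sum()

# Posterior mass for tau concentrates at small values, echoing the earlier
# point that the data can never rule out a group-level variance of zero.
print("posterior mean of tau:", np.sum(post * TAU).round(2))
```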