We term hyperparameters the parameters of the (hyper)priors.
How to elicit reasonable values for them?
use moment matching to get sensible values
trial-and-error using the prior predictive
draw parameter values from the prior
for each draw, generate a new observation from the model
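The two steps of a prior predictive check can be sketched in R as follows. The Poisson model and the gamma hyperparameters below are hypothetical choices made for illustration only; they do not come from the examples in these slides.

```r
# Prior predictive simulation: a minimal sketch.
# Assumed (hypothetical) model: Y | mu ~ Poisson(mu), mu ~ gamma(shape = 2, rate = 1)
set.seed(2024)
B <- 1e4L
# Step 1: draw parameter values from the prior
mu <- rgamma(n = B, shape = 2, rate = 1)
# Step 2: for each draw, generate one new observation from the model
ynew <- rpois(n = B, lambda = mu)
# Summaries of the prior predictive: do the simulated data look plausible?
quantile(ynew, probs = c(0.025, 0.5, 0.975))
mean(ynew)
```

If the simulated observations fall far outside the range deemed plausible, the hyperparameters can be adjusted and the simulation repeated.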
Example of simple linear regression
Working with standardized response and inputs
the slope is the correlation between the explanatory variable and the response
the intercept should be mean zero
are there sensible bounds for the range of the response?
Example - simple linear regression
Consider the relationship between height (in cm) and weight (in kg) among adult humans.
Model using a simple linear regression
What about the prior for the slope parameter?
Priors for the slope
Figure 1: Prior draws of linear regressions with different priors: vague (left) and lognormal (right). Figure 4.5 of McElreath (2020). The Guinness record for the world’s tallest person is 272cm.
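A comparison in the spirit of Figure 1 can be sketched numerically. The priors below (Gaussian intercept centered at 178 cm, vague Gaussian slope with standard deviation 10, lognormal slope with unit log-scale) and the weight range are assumptions loosely following McElreath (2020), not values stated on these slides.

```r
# Prior draws of regression lines for height (cm) as a function of weight (kg)
set.seed(1234)
B <- 1000L                        # number of prior draws
weight <- c(30, 60)               # assumed range of adult weights (kg)
wbar <- 45                        # assumed average weight (kg)
intercept <- rnorm(B, mean = 178, sd = 20)        # height at average weight
slope_vague <- rnorm(B, mean = 0, sd = 10)        # vague prior, allows negative slopes
slope_lnorm <- rlnorm(B, meanlog = 0, sdlog = 1)  # lognormal prior, positive slopes only
# Heights implied at both ends of the weight range (B x 2 matrix)
heights <- function(slope) {
  intercept + outer(slope, weight - wbar)
}
# Proportion of prior regression lines leaving the range [0, 272] cm
prop_impossible <- function(h) {
  mean(apply(h, 1, function(x) any(x < 0 | x > 272)))
}
prop_vague <- prop_impossible(heights(slope_vague))
prop_lnorm <- prop_impossible(heights(slope_lnorm))
c(vague = prop_vague, lognormal = prop_lnorm)
```

Under these assumptions, the vague prior puts a large share of its mass on lines that imply negative heights or heights above the world record, whereas the lognormal prior rarely does.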
Conjugate priors
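A prior density $p(\boldsymbol{\theta})$ is conjugate for the likelihood $L(\boldsymbol{\theta}; \boldsymbol{y})$ if the product $L(\boldsymbol{\theta}; \boldsymbol{y})p(\boldsymbol{\theta})$, after renormalization, is of the same parametric family as the prior.
Distributions in the exponential family admit conjugate priors.
A distribution is an exponential family if its density can be written $$f(y; \boldsymbol{\theta}) = \exp\left\{\sum_{k=1}^K Q_k(\boldsymbol{\theta}) t_k(y) + D(\boldsymbol{\theta}) + h(y)\right\}.$$ The support of $f$ must not depend on $\boldsymbol{\theta}$.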
Conjugate priors for common exponential families
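| distribution | unknown parameter | conjugate prior |
|---|---|---|
| binomial | probability of success | beta |
| Poisson | mean | gamma |
| Gaussian (known variance) | mean | Gaussian |
| Gaussian (known mean) | variance | inverse gamma |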
Conjugate prior for the Poisson
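Let $Y \sim \mathsf{Poisson}(\mu)$ with density $f(y) = \mu^y\exp(-\mu)/y!$ and take $\mu \sim \mathsf{gamma}(\alpha, \beta)$ with $\alpha, \beta$ fixed, where $\beta$ is the rate parameter. Consider an i.i.d. sample $y_1, \ldots, y_n$ with mean $\overline{y}$. The posterior density is $$p(\mu \mid \boldsymbol{y}) \propto \mu^{\alpha - 1 + n\overline{y}} \exp\{-\mu(\beta + n)\},$$
so the posterior must be gamma, $\mathsf{gamma}(\alpha + n\overline{y}, \beta + n)$.
Parameter interpretation: $\alpha$ events in $\beta$ time intervals.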
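The gamma-Poisson update amounts to adding the total count to the shape and the sample size to the rate. A minimal sketch, with simulated data and hypothetical hyperparameters (shape 2, rate 1):

```r
# Conjugate update for Y_i ~ Poisson(mu) with prior mu ~ gamma(alpha, beta), beta a rate
post_gamma <- function(y, alpha, beta) {
  c(shape = alpha + sum(y), rate = beta + length(y))
}
set.seed(80601)
y <- rpois(n = 20, lambda = 3)     # simulated data, hypothetical true mean of 3
post <- post_gamma(y, alpha = 2, beta = 1)
# The posterior mean is a weighted average of the sample mean and the prior mean
post_mean <- unname(post["shape"] / post["rate"])
w <- 20 / (20 + 1)                 # weight n / (n + beta) given to the data
all.equal(post_mean, w * mean(y) + (1 - w) * 2 / 1)
# 95% equal-tailed credible interval
qgamma(c(0.025, 0.975), shape = post["shape"], rate = post["rate"])
```

The weight given to the prior mean shrinks as the sample size grows, which matches the reading of the rate as a number of prior time intervals.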
Conjugate prior for Gaussian (known variance)
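Consider an i.i.d. sample $Y_i \sim \mathsf{Gauss}(\mu, \sigma^2)$ ($i = 1, \ldots, n$) with $\sigma^2$ known, and let $\mu \sim \mathsf{Gauss}(\nu, \omega^2)$. Then, $$p(\mu \mid \boldsymbol{y}) \propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right\}\exp\left\{-\frac{1}{2\omega^2}(\mu - \nu)^2\right\} \propto \exp\left[-\frac{1}{2}\left\{\left(\frac{n}{\sigma^2} + \frac{1}{\omega^2}\right)\mu^2 - 2\mu\left(\frac{n\overline{y}}{\sigma^2} + \frac{\nu}{\omega^2}\right)\right\}\right].$$
The conditional posterior is Gaussian with
mean $(n\overline{y}\omega^2 + \nu\sigma^2)/(n\omega^2 + \sigma^2)$ and
precision (reciprocal variance) $n/\sigma^2 + 1/\omega^2$.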
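The Gaussian update can be sketched as a small function: precisions add up, and the posterior mean is a precision-weighted average of the sample mean and the prior mean. The argument names `nu` and `omega` (prior mean and prior standard deviation) are assumptions made for this sketch, and the data are simulated.

```r
# Posterior of mu for Y_i ~ Gauss(mu, sigma^2), sigma known, prior mu ~ Gauss(nu, omega^2)
post_gauss <- function(y, sigma, nu, omega) {
  n <- length(y)
  prec <- n / sigma^2 + 1 / omega^2               # posterior precision
  m <- (sum(y) / sigma^2 + nu / omega^2) / prec   # precision-weighted mean
  c(mean = m, sd = 1 / sqrt(prec))
}
set.seed(1)
y <- rnorm(n = 10, mean = 1, sd = 2)              # simulated data (hypothetical values)
post <- post_gauss(y, sigma = 2, nu = 0, omega = 1)
post
# With a nearly flat prior, the posterior mean approaches the sample mean
post_flat <- post_gauss(y, sigma = 2, nu = 0, omega = 1e6)
c(post_flat["mean"], sample_mean = mean(y))
```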
Upworthy examples
The Upworthy Research Archive (Matias et al., 2021) contains results for 22743 experiments, with a click-through rate of 1.58% on average and a standard deviation of 1.23%.
We consider an A/B test that compared four different headlines for a story.
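We model the conversion rate $\lambda_i$ for each headline using $$Y_i \sim \mathsf{Poisson}(\lambda_i n_i), \qquad i = 1, \ldots, 4,$$ where $Y_i$ is the number of clicks.
We treat the number of impressions $n_i$ as a known offset.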
Headlines
Consider an A/B test from November 23rd, 2014, that compared four different headlines for a story on Sesame Street workshop with interviews of children whose parents were in jail and visiting them in prisons. The headlines tested were:
Some Don’t Like It When He Sees His Mom. But To Him? Pure Joy. Why Keep Her From Him?
They’re Not In Danger. They’re Right. See True Compassion From The Children Of The Incarcerated.
Kids Have No Place In Jail … But In This Case, They Totally Deserve It.
Going To Jail Should Be The Worst Part Of Their Life. It’s So Not. Not At All.
A/B test: Sesame Street example
| headline | impressions | clicks |
|---|---|---|
| H1 | 3060 | 49 |
| H2 | 2982 | 20 |
| H3 | 3112 | 31 |
| H4 | 3083 | 9 |
Moment matching for gamma distribution
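For $Y \sim \mathsf{gamma}(\alpha, \beta)$ with $\beta$ the rate parameter, we have $\mathsf{E}(Y) = \alpha/\beta$ and $\mathsf{Va}(Y) = \alpha/\beta^2$. We can solve for $\beta = \mathsf{E}(Y)/\mathsf{Va}(Y)$ and then use the mean relationship to retrieve $\alpha = \beta\mathsf{E}(Y)$.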
mu <- 0.0158; sd <- 0.0123
(beta <- mu / sd^2)
[1] 104.4352
(alpha <- mu * beta)
[1] 1.650076
Moment matching gives $\alpha = 1.65$ and $\beta = 104.44$.
Posterior distributions for Sesame Street
Figure 2: Gamma posteriors of the conversion rate for the Upworthy Sesame Street headlines.
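The gamma posteriors shown in Figure 2 can be approximated with the sketch below. It assumes that clicks are Poisson with mean equal to the conversion rate times the number of impressions, and it uses the moment-matched gamma prior (shape 1.65, rate 104.44); with the impressions acting as exposure, the posterior for each headline is gamma with shape $\alpha + y_i$ and rate $\beta + n_i$.

```r
# Data from the Sesame Street A/B test
clicks <- c(H1 = 49, H2 = 20, H3 = 31, H4 = 9)
impressions <- c(H1 = 3060, H2 = 2982, H3 = 3112, H4 = 3083)
# Moment-matched gamma prior (rounded hyperparameters)
alpha <- 1.65
beta <- 104.44
# Conjugate update with impressions as known offset
shape <- alpha + clicks
rate <- beta + impressions
post <- rbind(
  mean = shape / rate,
  lower = qgamma(0.025, shape = shape, rate = rate),
  upper = qgamma(0.975, shape = shape, rate = rate)
)
round(100 * post, 2)  # posterior mean and 95% credible interval, in percent
```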
Proper priors
Theorem 1 A sufficient condition for a prior to yield a proper (i.e., integrable) posterior density function is that it is (proportional to) a density function.
If we pick an improper prior, we need to check that the posterior is well-defined.
The answer to this question may depend on the sample size.
Proper posterior in a random effect model
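Consider a Gaussian random effect model with $n$ independent observations in $J$ groups.
The $i$th observation in group $j$ is $$Y_{ij} \sim \mathsf{Gauss}(\mu_{ij}, \sigma^2), \qquad \mu_{ij} = \mathbf{x}_i\boldsymbol{\beta} + \alpha_j, \qquad \alpha_j \sim \mathsf{Gauss}(0, \tau^2).$$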
Conditions for a proper posterior
for a uniform prior on the random effect scale, $\tau \sim \mathsf{U}(0, \infty)$, we need at least $J \geq 3$ ‘groups’ for the posterior to be proper.
“in a hierarchical model the data can never rule out a group-level variance of zero, and so [a] prior distribution cannot put an infinite mass in this area” (Gelman, 2006)
Improper priors as limiting cases
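We can view an improper prior as the limiting case of a sequence of proper priors.
The Haldane prior for $\theta$ in a binomial model, $p(\theta) \propto \theta^{-1}(1-\theta)^{-1}$, is a limiting $\mathsf{beta}(0, 0)$ distribution.
The improper prior $p(\sigma^2) \propto \sigma^{-2}$ (equivalently, $p(\sigma) \propto \sigma^{-1}$) is equivalent to an inverse gamma $\mathsf{IG}(\epsilon, \epsilon)$ for $\sigma^2$ when $\epsilon \to 0$.
The limiting posterior is thus improper for random effects scales, so the value of $\epsilon$ matters.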
MDI prior for generalized Pareto
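Let $Y \sim \mathsf{GP}(\sigma, \xi)$ be generalized Pareto with density $$f(y; \sigma, \xi) = \sigma^{-1}\left(1 + \xi\frac{y}{\sigma}\right)_{+}^{-1/\xi - 1}, \qquad y \geq 0,$$ for $\sigma > 0$ and $\xi \in \mathbb{R}$, and $x_{+} = \max\{0, x\}$.
Consider the maximum data information (MDI) prior $$p(\sigma, \xi) \propto \sigma^{-1}\exp(-\xi).$$
Since $\lim_{\xi \to -\infty} \exp(-\xi) = \infty$, the prior density increases without bound as $\xi$ becomes smaller.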
Truncated MDI for generalized Pareto distribution
The MDI prior leads to an improper posterior without modification.
Figure 3: Unscaled maximum data information (MDI) prior density.
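If we restrict the range of the MDI prior to shape values $\xi \geq -1$, then $p(\sigma, \xi) \propto \sigma^{-1}\exp(-\xi)\mathrm{I}(\xi \geq -1)$ and the posterior is proper.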
Flat priors
Uniform prior over the support of $\theta$, $p(\theta) \propto 1$.
Improper prior unless $\theta \in [a, b]$ for finite $a, b$.
Flat priors for scale parameters
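Consider a scale parameter $\sigma > 0$.
We could truncate the range, e.g., $\sigma \sim \mathsf{U}(0, c)$ for some large constant $c$, but this is not ‘uninformative’, as extreme values of $\sigma$ are as likely as small ones.
These priors are not invariant: $p(\sigma) \propto 1$ implies $p(\sigma^2) \propto \sigma^{-1}$, so a flat prior can be informative on another scale.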
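The lack of invariance is easy to see by simulation. The sketch below assumes a flat prior for the standard deviation on $(0, c)$ with a hypothetical bound $c = 10$; the implied prior for the variance has distribution function $\sqrt{v}/c$ and thus piles up mass near zero.

```r
# A flat prior on the standard deviation is not flat on the variance scale
set.seed(42)
c_max <- 10                        # hypothetical upper bound of the uniform prior
sigma <- runif(1e5, min = 0, max = c_max)
sigma2 <- sigma^2                  # implied draws for the variance, on (0, 100)
# Probability that the variance is below 1: empirical versus theoretical sqrt(1) / c
emp <- mean(sigma2 < 1)
theo <- sqrt(1) / c_max
c(empirical = emp, theoretical = theo)
# A flat prior on the variance over (0, c^2) would instead give 1 / c^2
flat_var <- 1 / c_max^2
```

The probability assigned to variances below one is ten times larger than under a flat prior for the variance over the same range.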
Vague priors
Vague priors are very diffuse proper priors.
For example, a vague Gaussian prior for regression coefficients on standardized data, $\beta \sim \mathsf{Gauss}(0, 100)$.
If we consider a logistic regression with a binary variable $\mathrm{X} \in \{0, 1\}$, then $\beta = 5$ gives odds ratios of 150, and $\beta = 10$ of around 22K…
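The arithmetic behind these odds ratios can be checked directly. The prior standard deviation of 10 used below to compute prior probabilities is an assumption made for this sketch.

```r
# Odds ratios implied by logistic regression coefficients for a binary covariate
coefs <- c(5, 10)
odds_ratio <- exp(coefs)
round(odds_ratio)
# Assumed vague prior: Gaussian with mean 0 and standard deviation 10.
# Prior probability that the coefficient exceeds 5 in absolute value,
# i.e., that the odds ratio is above roughly 150 or below roughly 1/150
prob_extreme <- 2 * pnorm(-5, mean = 0, sd = 10)
prob_extreme
```

Under this assumed prior, more than half of the prior mass corresponds to effect sizes that would be considered implausibly large in most applications.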
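Invariance and Jeffreys’ prior
In single-parameter models, Jeffreys’ prior $p(\theta) \propto |\imath(\theta)|^{1/2}$, proportional to the square root of the determinant of the Fisher information matrix, is invariant to any (differentiable) reparametrization.
Jeffreys’ prior for the binomial distribution
Consider $Y \sim \mathsf{binom}(n, \theta)$. The negative of the second derivative of the log likelihood with respect to $\theta$ is $$\jmath(\theta) = -\frac{\partial^2 \ell(\theta; y)}{\partial \theta^2} = \frac{y}{\theta^2} + \frac{n - y}{(1-\theta)^2}.$$
Since $\mathsf{E}(Y) = n\theta$, the Fisher information is $$\imath(\theta) = \mathsf{E}\{\jmath(\theta)\} = \frac{n}{\theta} + \frac{n}{1-\theta} = \frac{n}{\theta(1-\theta)}.$$ Jeffreys’ prior is therefore $p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, a conjugate Beta prior $\mathsf{beta}(0.5, 0.5)$.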
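Invariance can be verified numerically for a single Bernoulli trial under the logit reparametrization $\phi = \log\{\theta/(1-\theta)\}$. The sketch compares two routes: transforming the prior obtained on the probability scale with the Jacobian, and computing the square root of the Fisher information directly on the logit scale, where the information equals $\theta(1-\theta)$. Both are unnormalized densities.

```r
# Invariance of the square-root Fisher information prior under the logit transform
theta <- seq(0.05, 0.95, by = 0.05)
phi <- qlogis(theta)                               # logit scale
# Route 1: prior on the probability scale, then change of variable
prior_theta <- function(theta) 1 / sqrt(theta * (1 - theta))
jacobian <- dlogis(phi)                            # d theta / d phi = theta * (1 - theta)
route1 <- prior_theta(plogis(phi)) * jacobian
# Route 2: square root of the Fisher information computed on the logit scale
route2 <- sqrt(plogis(phi) * (1 - plogis(phi)))
all.equal(route1, route2)
```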
Invariant priors for location-scale families
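For a location-scale family with location $\mu$ and scale $\sigma$, the independent priors $p(\mu) \propto 1$ and $p(\sigma) \propto \sigma^{-1}$ are location-scale invariant.
The results are invariant to affine transformations of the units, i.e., to replacing $y$ by $a + by$ with $b > 0$.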
Penalized complexity priors
Simpson et al. (2017) consider a principled way of constructing priors that penalize model complexity for stable inference and limit over-specification.
Compute the Kullback–Leibler divergence between the model and base model densities, build an exponential prior on the distance scale and back-transform.
The resulting prior is scale-invariant, but its derivation is nontrivial.
Penalized complexity prior for random effect scale
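If the random effects are $\mathsf{Gauss}(0, \tau^2)$, the penalized complexity prior for the scale is $\tau \sim \mathsf{expo}(\lambda)$.
Elicit a high quantile $Q$ of the standard deviation $\tau$ with tail probability $\alpha$ and set $\lambda = -\log(\alpha)/Q$.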
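The rate of the exponential prior follows from solving the tail condition of the exponential distribution for the rate. A short sketch, using a 0.95 quantile of 5 (tail probability 0.05) as the elicited value; the function name `pc_rate` is introduced here for illustration.

```r
# Rate of the exponential penalized complexity prior for a random effect scale
# Solve P(tau > Q) = exp(-lambda * Q) = alpha for lambda
pc_rate <- function(Q, alpha) {
  -log(alpha) / Q
}
lambda <- pc_rate(Q = 5, alpha = 0.05)
lambda
# Check: the tail probability beyond Q under expo(lambda) equals alpha
tail_prob <- pexp(5, rate = lambda, lower.tail = FALSE)
tail_prob
```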
Priors for scale of random effects
The conjugate inverse gamma prior $\tau^2 \sim \mathsf{IG}(\alpha, \beta)$ is such that the mode for $\tau^2$ is $\beta/(1+\alpha)$.
Often, we take $\alpha = \beta = 0.01$ or $0.001$, but this leads to near-improper priors, so small values of the parameters are not optimal for ‘random effects’.
The inverse gamma prior cannot provide shrinkage or allow for no variability between groups.
Priors for scale of random effects
A popular suggestion, due to Gelman (2006), is to take a centered Student-$t$ distribution with $\nu$ degrees of freedom, truncated over $[0, \infty)$, with scale $s$.
since the mode is at zero, it provides support for the base model
we want small degrees of freedom $\nu$, but which value is preferable? The Cauchy model ($\nu = 1$) is still popular.
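A density for the truncated Student-$t$ prior can be sketched in a few lines of base R: folding a centered Student-$t$ at zero doubles the density on the positive half-line. The function name `dhalft` and the values of the degrees of freedom and scale used in the checks are hypothetical.

```r
# Density of a centered Student-t truncated to [0, Inf), with given scale
dhalft <- function(x, df, scale = 1) {
  ifelse(x >= 0, 2 * dt(x / scale, df = df) / scale, 0)
}
# The density integrates to one
total <- integrate(dhalft, lower = 0, upper = Inf, df = 3, scale = 2)$value
total
# The mode is at zero, so the prior supports the base model without random effects
dhalft(c(0, 1, 5), df = 3, scale = 2)
# Heavier tail with one degree of freedom (half-Cauchy) than with three
tail1 <- integrate(dhalft, lower = 20, upper = Inf, df = 1, scale = 2)$value
tail3 <- integrate(dhalft, lower = 20, upper = Inf, df = 3, scale = 2)$value
c(cauchy = tail1, t3 = tail3)
```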
Prior sensitivity
Do the priors matter? As a robustness check, one can fit the model with
different prior functions
different hyperparameter values
Costly, but may be needed to convince reviewers ;)
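When the posterior is available in closed form, a sensitivity check is cheap. The sketch below revisits the first Sesame Street headline (49 clicks out of 3060 impressions) with a Poisson likelihood and impressions as offset, and compares the moment-matched gamma prior with three alternative sets of hyperparameters. The alternatives are hypothetical values chosen for illustration.

```r
# Prior sensitivity for the conversion rate of one headline
y <- 49
n <- 3060
priors <- data.frame(
  label = c("moment matching", "near-improper", "unit prior", "tighter prior"),
  alpha = c(1.65, 0.01, 1, 16.5),    # gamma shape (last three are hypothetical)
  beta = c(104.44, 0.01, 1, 1044.4)  # gamma rate (last three are hypothetical)
)
# Conjugate gamma posterior: shape alpha + y, rate beta + n
priors$post_mean <- (priors$alpha + y) / (priors$beta + n)
priors$lower <- qgamma(0.025, shape = priors$alpha + y, rate = priors$beta + n)
priors$upper <- qgamma(0.975, shape = priors$alpha + y, rate = priors$beta + n)
priors
```

With thousands of impressions, the posterior summaries change little across these priors; differences would be more pronounced with fewer data.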
Distraction from smartwatch
We consider an experimental study conducted at Tech3Lab on road safety.
In Brodeur et al. (2021), 31 participants were asked to drive in a virtual environment.
The number of road violations was measured for four different types of distraction (phone notification, phone on speaker, texting and smartwatch).
Balanced data, random order of tasks
Poisson mixed model
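We model the number of violations, nviolation, as a function of distraction type (task) and participant id.
Specifically, for participant $i$ and task $j$, $$Y_{ij} \sim \mathsf{Poisson}(\mu_{ij}), \qquad \mu_{ij} = \exp(\beta_j + \alpha_i), \qquad \alpha_i \sim \mathsf{Gauss}(0, \tau^2),$$ where
$\beta_j$ is the coefficient for task $j$ (distraction type),
$\alpha_i$ is the random effect of participant $i$.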
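To get a feel for this kind of model, one can simulate data from a Poisson mixed model with a log link, task effects and Gaussian participant effects. The design (31 participants, 4 tasks) follows the study; the task means and the random effect scale below are hypothetical values, not estimates.

```r
# Simulate counts from a Poisson mixed model with participant random effects
set.seed(2021)
n_id <- 31                         # number of participants
n_task <- 4                        # number of distraction types
beta <- log(c(2, 3, 4, 5))         # hypothetical task effects on the log scale
tau <- 0.5                         # hypothetical random effect scale
alpha <- rnorm(n_id, mean = 0, sd = tau)   # participant random effects
# Balanced design: every participant performs every task
dat <- expand.grid(id = seq_len(n_id), task = seq_len(n_task))
dat$mu <- exp(beta[dat$task] + alpha[dat$id])
dat$nviolation <- rpois(nrow(dat), lambda = dat$mu)
# Average number of violations per task
tapply(dat$nviolation, dat$task, mean)
```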
Priors for random effect scale
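Consider different priors for the random effect scale $\tau$
flat uniform prior
conjugate inverse gamma prior
a truncated Student-$t$ on $[0, \infty)$ with $\nu$ degrees of freedom
a penalized complexity prior such that the 0.95 quantile of the scale is 5, corresponding to $\tau \sim \mathsf{expo}(0.6)$, since $-\log(0.05)/5 \approx 0.6$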
Sensitivity analysis for smartwatch data
Figure 4: Posterior density of the random effect scale $\tau$ for four different priors. The circle denotes the median and the bars the 50% and 95% percentile credible intervals.
Basically indistinguishable results for the random effect scale.
Eight schools example
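Average results of an SAT coaching program, for eight schools (Rubin, 1981).
The hierarchical model is $$Y_i \sim \mathsf{Gauss}(\mu + \eta_i, \sigma_i^2), \qquad \eta_i \sim \mathsf{Gauss}(0, \tau^2), \qquad i = 1, \ldots, 8.$$
Given the large sample in each school, we treat $\sigma_i$ as fixed data by using the sample standard deviation.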
Sensitivity analysis for eight schools example
Figure 5: Posterior density of the school-specific random effects standard deviation $\tau$ under different priors.
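A rough version of such a comparison can be obtained on a grid, without sampling. The sketch assumes a flat prior for the overall mean, which can then be integrated out analytically, and evaluates the marginal posterior of the random effect standard deviation under three priors. The data are the eight schools estimates and standard errors as commonly reported from Rubin (1981); the grid bound of 50 and the prior hyperparameters are hypothetical choices, and this is not necessarily the approach used to produce Figure 5.

```r
# Eight schools: estimated effects and standard errors (as commonly reported)
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
se <- c(15, 10, 16, 11, 9, 11, 10, 18)
# Log marginal likelihood of tau, overall mean integrated out under a flat prior
log_marg_lik <- function(tau) {
  v <- se^2 + tau^2                         # marginal variance of each estimate
  mu_hat <- sum(y / v) / sum(1 / v)         # precision-weighted overall mean
  -0.5 * log(sum(1 / v)) - 0.5 * sum(log(v)) - 0.5 * sum((y - mu_hat)^2 / v)
}
tau <- seq(0.01, 50, by = 0.01)             # grid, truncated at 50 (assumption)
loglik <- vapply(tau, log_marg_lik, numeric(1))
lik <- exp(loglik - max(loglik))            # rescale for numerical stability
priors <- list(
  flat = rep(1, length(tau)),
  halfcauchy = 2 * dcauchy(tau, location = 0, scale = 5),
  expo = dexp(tau, rate = 0.6)
)
# Posterior median of tau on the grid for each prior
post_median <- vapply(priors, function(prior) {
  w <- lik * prior
  tau[which(cumsum(w) / sum(w) >= 0.5)[1]]
}, numeric(1))
post_median
```

With only eight groups, the likelihood is weakly informative about the scale, so priors that concentrate near zero pull the posterior median down noticeably.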
References
Brodeur, M., Ruer, P., Léger, P.-M., & Sénécal, S. (2021). Smartwatches are more distracting than mobile phones while driving: Results from an experimental study. Accident Analysis & Prevention, 149, 105846. https://doi.org/10.1016/j.aap.2020.105846
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3), 515–534. https://doi.org/10.1214/06-ba117a
Matias, J. N., Munger, K., Le Quere, M. A., & Ebersole, C. (2021). The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media. Scientific Data, 8(195). https://doi.org/10.1038/s41597-021-00934-7
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan (2nd ed.). Chapman and Hall/CRC.
Simpson, D., Rue, H., Riebler, A., Martins, T. G., & Sørbye, S. H. (2017). Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science, 32(1), 1–28. https://doi.org/10.1214/16-sts576