2 Bayesics
The Bayesian paradigm is an inferential framework that is widely used in data science. It builds on likelihood-based inference and offers a natural framework for prediction and for uncertainty quantification. Its interpretation is more natural than that of the classical (i.e., frequentist) paradigm, and it is easier to generalize models to complex settings, notably through hierarchical constructions. The main source of controversy is the role of the prior distribution, which allows one to incorporate subject-matter expertise but leads to different inferences being drawn by different practitioners; this subjectivity is not to the taste of many and has been the subject of numerous controversies.
The Bayesian paradigm includes multiple notions that are not covered in undergraduate introductory courses. The purpose of this chapter is to introduce these concepts and put them in perspective; the reader is assumed to be familiar with the basics of likelihood-based inference. We begin with a discussion of the notion of probability, then define priors, posterior distributions, the marginal likelihood and posterior predictive distributions. We focus on the interpretation of posterior distributions and explain how to summarize them, leading to definitions of high posterior density regions, credible intervals and the posterior mode, for cases where we either have a (correlated) sample from the posterior, or else have access to the whole distribution. Several notions, including sequentiality, prior elicitation and estimation of the marginal likelihood, are mentioned in passing. A brief discussion of Bayesian hypothesis testing (and alternatives) is presented.
At the end of the chapter, students should be able to
- define key notions (prior, marginal likelihood, Bayes factor, credible intervals, etc.) of Bayesian inference.
- distinguish between questions that relate to the posterior distribution versus the posterior predictive.
- compute numerical summaries of posterior distributions (point estimators and credible intervals) given a posterior sample.
- explain some of the conceptual differences between frequentist and Bayesian inference.
2.1 Probability and frequency
In classical (frequentist) parametric statistics, we treat observations
The interpretation of probability in the classical paradigm is in terms of long-run frequency, which is why we term this approach frequentist statistics. Think of a fair die: when we state that values
In practice, the true value of the parameter
Probabilistic reasoning — always to be understood as subjective — merely stems from our being uncertain about something. It makes no difference whether the uncertainty relates to an unforeseeable future, or to an unnoticed past, or to a past doubtfully reported or forgotten: it may even relate to something more or less knowable (by means of a computation, a logical deduction, etc.) but for which we are not willing or able to make the effort; and so on […] The only relevant thing is uncertainty — the extent of our knowledge and ignorance. The actual fact of whether or not the events considered are in some sense determined, or known by other people, and so on, is of no consequence.
On page 3, de Finetti continues (Finetti 1974)
only subjective probabilities exist — i.e., the degree of belief in the occurrence of an event attributed by a given person at a given instant and with a given set of information.
2.2 Posterior distribution
We consider a parametric model with parameters
In a Bayesian analysis, observations are random variables but inference is performed conditional on the observed sample values. By Bayes’ theorem, our target is therefore the posterior density
The posterior
For the posterior to be proper, we need the product of the prior and the likelihood on the right hand side to be integrable as a function of
If
To fix ideas, we consider next a simple one-parameter model where the marginal likelihood can be computed explicitly.
Example 2.1 (Binomial model with beta prior) Consider a binomial likelihood with probability of success
If
Another way to see this is to track moments (expectation, variance, etc.). From Definition 1.3, the posterior mean is
Figure 2.1 shows three different posterior distributions obtained with different beta priors: the first prior, which favors values closer to 1/2, leads to a more peaked posterior density, contrary to the second, which is symmetric but concentrated toward more extreme values near the endpoints of the support. The rightmost panel is truncated: as such, the posterior is zero for any value of
Remark (Proportionality). Any term appearing in the product of the likelihood and the prior that does not depend on the parameters can be omitted, since it will be absorbed by the normalizing constant. This is useful when computing normalizing constants or likelihood ratios.
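To illustrate proportionality, here is a minimal sketch in the beta-binomial setting, with hypothetical data of y = 6 successes out of n = 14 trials and a uniform prior: the ratio of the unnormalized posterior (constants dropped) to the normalized beta posterior density is the same at every parameter value.
# Unnormalized posterior (likelihood times prior, constants dropped) divided
# by the normalized beta posterior density is constant in theta (sketch)
y <- 6; n <- 14; a <- 1; b <- 1   # hypothetical data and uniform beta prior
unnorm <- function(theta) theta^(y + a - 1) * (1 - theta)^(n - y + b - 1)
theta <- c(0.2, 0.5, 0.8)
unnorm(theta) / dbeta(theta, shape1 = y + a, shape2 = n - y + b)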
Remark. An alternative parametrization for the beta distribution sets
Remark. A density integrates to 1 over the range of possible outcomes, but there is no guarantee that the likelihood function, as a function of
For example, the binomial likelihood with
Moreover, the binomial distribution is discrete with support
Definition 2.1 (Bayes factor and model comparison) The marginal likelihood enters into the comparison of different models. Suppose that we have models
While Bayes factors are used for model comparison, the answers depend very strongly on the prior
The Bayes factor requires that the models be compared on the same data, but both the likelihood and the prior may differ from one model to the next.
Example 2.2 (Bayes factor for the binomial model) The marginal likelihood for the
Consider three models with
If
# Log marginal likelihood for binomial model with beta prior (default is uniform)
log_marg_post_beta <- function(n, y, alpha = 1, beta = 1){
  lchoose(n, y) + lbeta(alpha + y, beta + n - y) - lbeta(alpha, beta)
}
# Log of Bayes factor
logBF2vs3 <- function(y, n){ # model 2 (beta(1.5, 1.5) prior) vs model 3 (point mass at 0.5)
  log_marg_post_beta(n = n, y = y, alpha = 1.5, beta = 1.5) -
    dbinom(x = y, size = n, prob = 0.5, log = TRUE)
}
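As a usage illustration, with the hypothetical data of Example 2.3 (y = 6 successes out of n = 14 trials), the log Bayes factor and the Bayes factor on the original scale are obtained as
logBF2vs3(y = 6, n = 14)        # log Bayes factor of model 2 versus model 3
exp(logBF2vs3(y = 6, n = 14))   # Bayes factor on the original scale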
Proposition 2.1 (Sequentiality and Bayesian updating) The likelihood is invariant to the order of the observations if they are independent. Thus, if we consider two blocks of observations
By Bayes’ rule, we can consider updating the posterior by adding terms to the likelihood, noting that
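In the conjugate beta-binomial setting, a small sketch (with hypothetical block sizes and counts) shows that updating the prior with the first block and then the second yields the same posterior parameters as a single update with the pooled data.
# Sequential updating in the beta-binomial model (hypothetical blocks)
a0 <- 1; b0 <- 1                   # beta prior parameters
y1 <- 4; n1 <- 10                  # successes and trials, first block
y2 <- 2; n2 <- 4                   # successes and trials, second block
# Single update with the pooled data
c(a0 + y1 + y2, b0 + (n1 + n2) - (y1 + y2))
# Two-step update: the posterior after block 1 is the prior for block 2
a1 <- a0 + y1; b1 <- b0 + n1 - y1
c(a1 + y2, b1 + n2 - y2)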
Example 2.3 (Numerical integration) While we can calculate analytically the value of the normalizing constant for the beta-binomial model, we could also, for arbitrary priors, use numerical integration or Monte Carlo methods in the event that the parameter vector
While estimation of the normalizing constant is possible in simple models, the following highlights some challenges that are worth keeping in mind. In a model for discrete data (that is, one assigning probability mass to a countable set of outcomes), the terms in the likelihood are probabilities, and thus the likelihood becomes smaller as we gather more observations (since we multiply terms between zero and one). The marginal likelihood becomes smaller and smaller, so its reciprocal is large, and this can lead to arithmetic underflow or overflow; working on the log scale mitigates this.
y <- 6L # number of successes
n <- 14L # number of trials
alpha <- beta <- 1.5 # prior parameters
unnormalized_posterior <- function(theta){
  theta^(y + alpha - 1) * (1 - theta)^(n - y + beta - 1)
}
integrate(f = unnormalized_posterior,
          lower = 0,
          upper = 1)
1.066906e-05 with absolute error < 1e-12
# Compare with known constant
beta(y + alpha, n - y + beta)
[1] 1.066906e-05
# Monte Carlo integration
mean(unnormalized_posterior(runif(1e5)))
[1] 1.064067e-05
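With larger samples, the unnormalized posterior can underflow on the original scale. A common workaround (a sketch reusing y, n, alpha and beta from above) is to evaluate the integrand on the log scale and to combine the Monte Carlo draws with the log-sum-exp trick.
# Monte Carlo estimate of the log marginal likelihood using the log-sum-exp
# trick to avoid underflow of the unnormalized posterior (sketch)
log_unnorm_post <- function(theta){
  (y + alpha - 1) * log(theta) + (n - y + beta - 1) * log1p(-theta)
}
lp <- log_unnorm_post(runif(1e5))
max(lp) + log(mean(exp(lp - max(lp))))   # log marginal likelihood estimate
lbeta(y + alpha, n - y + beta)           # exact value, for comparison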
When
Example 2.4 (Importance of selling format) Duke and Amir (2023) consider the difference between integrated and sequential formats for sales. The sellingformat dataset contains the participants' responses: whether they buy, 1, or do not buy, 0, is recorded.
We consider the number of purchases out of the total, treating records as independent Bernoulli observations with a flat (uniform) prior.
With a beta-binomial model, the posterior for the probability of buying is
data(sellingformat, package = "hecbayes")
contingency <- with(sellingformat, table(format, purchased))
# Posterior draws of the parameters
post_p_int <- rbeta(n = 1e4, shape1 = 47, shape2 = 153)
post_p_seq <- rbeta(n = 1e4, shape1 = 24, shape2 = 177)
# Reparametrization
post_odds_int <- post_p_int / (1 - post_p_int)
post_odds_seq <- post_p_seq / (1 - post_p_seq)
post_oddsratio <- post_odds_int / post_odds_seq
Figure 2.5 shows the posterior of the probability of buying for each group, and of the corresponding odds. It is clear that the integrated format leads to many more sales in the experiment, with the posterior odds ratio exceeding 1 with probability
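The posterior probability that the odds ratio exceeds 1 can be estimated directly from the Monte Carlo draws (sketch):
# Monte Carlo estimate of Pr(odds ratio > 1 | data)
mean(post_oddsratio > 1)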
2.3 Posterior predictive distribution
Prediction in the Bayesian paradigm is obtained by considering the posterior predictive distribution,
Given draws from the posterior distribution, say
npost <- 1e4L
# Sample draws from the posterior distribution
post_samp <- rbeta(n = npost, y + alpha, n - y + beta)
# For each draw, sample new observation
post_pred <- rbinom(n = npost, size = n, prob = post_samp)
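As a quick check (a sketch using the draws above), the posterior predictive draws have a larger variance than a binomial distribution with the posterior mean plugged in, reflecting the additional parameter uncertainty.
# Posterior predictive variance versus plug-in binomial variance
var(post_pred)
n * mean(post_samp) * (1 - mean(post_samp))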
Given the
Example 2.5 (Posterior predictive distribution of univariate Gaussian with known mean) Consider an
The posterior predictive is
2.4 Summarizing posterior distributions
The output of the Bayesian learning problem will be one of the following:
- a fully characterized distribution
- a numerical approximation to the posterior distribution (pointwise)
- an exact or approximate sample drawn from the posterior distribution
In the first case, we will be able to evaluate quantities of interest directly if there are closed-form expressions for the latter, or else we can draw samples from the distribution and evaluate them via Monte Carlo. In the case of numerical approximations, we will need to resort to numerical integration or other approximation methods to get our answers.
Often, we will also be interested in the marginal posterior distribution of each component
Most of the field of Bayesian statistics revolves around the creation of algorithms that either circumvent the calculation of the normalizing constant (notably using Monte Carlo and Markov chain Monte Carlo methods) or else provide accurate pointwise numerical approximations of the posterior, including after marginalizing out all but one parameter (integrated nested Laplace approximations, variational inference, etc.). The target of inference is the whole posterior distribution, a potentially high-dimensional object which may be difficult to summarize or visualize. We can thus report only characteristics of the latter.
The choice of point summary to report has its roots in decision theory.
Definition 2.2 (Loss function) A loss function
For example, in a univariate setting, the quadratic loss
For example, consider the quadratic loss function, which is differentiable. Provided we can interchange the differential operator and the integral sign,
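A sketch of the argument, writing \(\delta\) for the candidate point estimate and \(p(\theta \mid y)\) for the posterior density (notation assumed here): differentiating the expected posterior loss and setting the derivative to zero gives
\[
\frac{\partial}{\partial \delta} \int (\theta - \delta)^2\, p(\theta \mid y)\,\mathrm{d}\theta
 = -2\int (\theta - \delta)\, p(\theta \mid y)\,\mathrm{d}\theta = 0
 \quad\Longleftrightarrow\quad
 \delta = \int \theta\, p(\theta \mid y)\,\mathrm{d}\theta,
\]
so the posterior mean is the optimal point estimator under quadratic loss.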
All of these point estimators are measures of central tendency, but some may be more adequate than others depending on the setting, as they can correspond to different values, as shown in the left panel of Figure 2.7. The choice is application-specific: for multimodal distributions, the mode is likely a better choice.
If we know how to evaluate the distribution numerically, we can optimize to find the mode, or else return the value of the pointwise evaluation on a grid at which the density achieves its maximum. The mean and median would have to be evaluated by numerical integration if there is no closed-form expression for the latter.
If we rather have a sample from the posterior with associated posterior density values, then we can obtain the mode as the parameter combination with the highest posterior density, the median from the value at rank
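A minimal sketch of these sample-based summaries, assuming draws is a vector of posterior draws and logpost holds the log posterior density at each draw; both are generated from a beta distribution here purely for illustration.
# Point summaries from a posterior sample (illustration)
draws <- rbeta(1e4, shape1 = 7.5, shape2 = 9.5)     # stand-in posterior sample
logpost <- dbeta(draws, shape1 = 7.5, shape2 = 9.5, log = TRUE)
mean(draws)                  # posterior mean (quadratic loss)
median(draws)                # posterior median (absolute loss)
draws[which.max(logpost)]    # draw with the highest posterior density (mode)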
The loss function is often a functional (meaning a one-dimensional summary) of the posterior. The following example shows how it reduces a three-dimensional problem to a single risk measure.
Example 2.6 (Value-at-risk for Danish insurance losses) In extreme value analysis, we are often interested in assessing the risk of events that are rare enough that they lie beyond the range of observed data. To provide a scientific basis for extrapolation, it is often justified to fit a generalized Pareto distribution to exceedances of
Insurance companies provide coverage in exchange for premiums, but need to safeguard themselves against very high claims by buying reinsurance products. These risks are often communicated through the value-at-risk (VaR), a high quantile exceeded with a given probability. The data are provided in the evir package and were analyzed in Example 7.23 of McNeil, Frey, and Embrechts (2005). These claims are denoted
We fit a generalized Pareto distribution to exceedances above 10 million kroner, keeping 109 observations, or roughly the largest 5% of the original sample. Preliminary analysis shows that we can treat the data as roughly independent and identically distributed, and goodness-of-fit diagnostics (not shown) for the generalized Pareto suggest that the fit is adequate for all but the three largest observations, which are (somewhat severely) underestimated by the model.
The generalized Pareto model only describes the
Suppose for now that we set a
We consider the modelling of insurance losses exceeding 10 million kroner from the danish fire insurance data with some prior; see Definition 1.11 for the model. The model has three parameters: the scale
Our aim is to evaluate the posterior distribution of the value-at-risk. We use a data frame post_samp that contains exact samples from the posterior distribution of
Suppose that we prefer to underestimate the value-at-risk rather than overestimate it: this could be captured by the custom loss function
# Compute value at risk from generalized Pareto distribution quantile fn
VaR_post <- with(post_samp, # data frame of posterior draws
  revdbayes::qgp( # with columns 'probexc', 'scale', 'shape'
    p = 0.01 / probexc,
    loc = 10,
    scale = scale,
    shape = shape,
    lower.tail = FALSE))
# Loss function
loss <- function(qhat, q){
  mean(ifelse(q > qhat,
              0.5 * (0.99 * q - qhat),
              0.75 * (qhat - 1.01 * q)))
}
# Create a grid of values over which to estimate the loss for VaR
nvals <- 101L
VaR_grid <- seq(
  from = quantile(VaR_post, 0.01),
  to = quantile(VaR_post, 0.99),
  length.out = nvals)
# Create a container to store results
risk <- numeric(length = nvals)
for(i in seq_len(nvals)){
  # Compute integral (Monte Carlo average over draws)
  risk[i] <- loss(q = VaR_post, qhat = VaR_grid[i])
}
To communicate uncertainty, we may resort to credible regions and intervals.
Definition 2.3 A
These intervals are not unique, just as confidence sets are not. In the univariate setting, central or equitailed intervals are the most popular, and they are easily obtained by considering the
set.seed(2023)
postsamp <- rbeta(n = 1000, shape1 = 0.5, shape2 = 0.2)
alpha <- 0.11
# Compute equitailed interval bounds
quantile(postsamp, probs = c(alpha/2, 1-alpha/2))
5.5% 94.5%
0.0246807 0.9999980
qbeta(p = c(alpha/2, 1-alpha/2), shape1 = 0.5, shape2 = 0.2)
[1] 0.02925205 0.99999844
# Highest posterior density intervals
hdiD <- HDInterval::hdi(density(postsamp), credMass = 1 - alpha, allowSplit = TRUE)
The equitailed intervals for a known posterior can be obtained directly from the quantile function or via Monte Carlo simply by querying sample quantiles. The HPD region is more complicated to obtain and requires dedicated software, which in the above case may fail to account for the support of the posterior!
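When the posterior is unimodal, so that the HPD region is a single interval (not the case for the bathtub-shaped posterior above), a simple sample-based alternative is to take the shortest interval containing a proportion 1 − alpha of the sorted draws; a minimal sketch:
# Shortest interval containing (1 - alpha) of the draws: a cheap HPD
# approximation, valid only when the HPD region is a single interval
sorted <- sort(postsamp)
m <- length(sorted)
k <- floor((1 - alpha) * m)
widths <- sorted[(k + 1):m] - sorted[seq_len(m - k)]
j <- which.min(widths)
c(lower = sorted[j], upper = sorted[j + k])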
- Bayesians treat both parameters and observations as random, the former due to uncertainty about the true value. Inference is performed conditional on observed data, and is summarized in the posterior.
- Bayesians specify both a prior for the parameters and a likelihood describing the data generating mechanism; the approach is thus an extension of likelihood-based inference.
- Information from the prior is updated in light of new data, which is encoded by the likelihood. Sequential updating leads to the same posterior as processing all of the data at once.
- Under weak conditions on the prior, the large-sample behaviour of Bayesian and frequentist procedures coincides.
- Bayesian inference is complicated by the fact that there is more often than not no closed-form expression for the posterior distribution. Evaluation of the normalizing constant, the so-called marginal likelihood, is challenging, especially in high dimensional settings.
- Rather than hypothesis testing, Bayesian methods rely on the posterior distribution of parameters, or on Bayes factor for model comparisons.
- The posterior predictive distribution is used for model assessment and prediction; it has a higher variance than the data generating distribution from the likelihood, due to parameter uncertainty.
- Bayesians typically will have approximations to the posterior distribution, or samples drawn from it.
- Loss functions can be used to summarize a posterior distribution into a numerical summary of interest, which may vary depending on the objective.
- Uncertainty is reflected by credible sets or credible intervals, which encode the posterior probability that the true value
belongs to the set.