Probabilistic reasoning — always to be understood as subjective — merely stems from our being uncertain about something.
Why opt for the Bayesian paradigm?
Satisfies the likelihood principle
Generative approach naturally extends to complex settings (hierarchical models)
Uncertainty quantification and natural framework for prediction
Capability to incorporate subject-matter expertise
Bayesian versus frequentist
Frequentist
Parameters treated as fixed, data as random
The true value of the parameter is unknown.
Target is a point estimator
Bayesian
Both parameters and data are random
Inference is conditional on the observed data
Target is a distribution
Joint and marginal distribution
The joint density of data and parameters is
$$p(\boldsymbol{y}, \boldsymbol{\theta}) = p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}),$$
where the marginal is $p(\boldsymbol{y}) = \int p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta}$.
Posterior
Using Bayes’ theorem, the posterior density is
$$p(\boldsymbol{\theta} \mid \boldsymbol{y}) = \frac{p(\boldsymbol{y} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\boldsymbol{y})},$$
meaning that
$$\text{posterior} \propto \text{likelihood} \times \text{prior}.$$
Evaluating the marginal likelihood $p(\boldsymbol{y})$ is challenging when $\boldsymbol{\theta}$ is high-dimensional.
Updating beliefs and sequentiality
By Bayes’ rule, we can update the posterior by adding terms to the likelihood: for conditionally independent $\boldsymbol{y}_1$ and $\boldsymbol{y}_2$,
$$p(\boldsymbol{\theta} \mid \boldsymbol{y}_1, \boldsymbol{y}_2) \propto p(\boldsymbol{y}_2 \mid \boldsymbol{\theta})\, p(\boldsymbol{y}_1 \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta}) \propto p(\boldsymbol{y}_2 \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{y}_1).$$
The posterior is thus updated in light of new information.
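As a sanity check, here is a minimal R sketch (using the conjugate beta-binomial updating derived later in these notes, with a hypothetical split of the data into two batches) showing that updating in two steps or all at once gives the same posterior:
a <- b <- 1.5                  # prior shape parameters
y1 <- 4; n1 <- 8               # first batch: successes and trials (hypothetical values)
y2 <- 2; n2 <- 6               # second batch
# Two-step update: the posterior after batch 1 becomes the prior for batch 2
a1 <- a + y1;  b1 <- b + n1 - y1
a2 <- a1 + y2; b2 <- b1 + n2 - y2
# Single-step update with all of the data at once gives the same shape parameters
c(a2, b2) == c(a + y1 + y2, b + (n1 + n2) - (y1 + y2))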
Binomial distribution
A binomial variable $Y \sim \mathsf{binom}(n, \theta)$ with probability of success $\theta$ has mass function
$$P(Y = y) = \binom{n}{y} \theta^y (1-\theta)^{n-y}, \qquad y = 0, 1, \ldots, n.$$
Moments of the number of successes out of $n$ trials are $\mathsf{E}(Y) = n\theta$ and $\mathsf{Va}(Y) = n\theta(1-\theta)$.
The binomial coefficient is $\binom{n}{y} = n!/\{y!(n-y)!\}$, where $n! = n \times (n-1) \times \cdots \times 1$ denotes the factorial function.
Beta distribution
The beta distribution with shapes $\alpha > 0$ and $\beta > 0$, denoted $\mathsf{Beta}(\alpha, \beta)$, has density
$$f(\theta; \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}, \qquad \theta \in [0, 1];$$
expectation: $\alpha/(\alpha+\beta)$;
mode: $(\alpha - 1)/(\alpha + \beta - 2)$ if $\alpha, \beta > 1$; else $0$, $1$, or none;
variance: $\alpha\beta/\{(\alpha+\beta)^2(\alpha+\beta+1)\}$.
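A quick numerical check of these formulas (a small sketch with arbitrary shape parameters):
a <- 1.5; b <- 2.5   # arbitrary shape parameters
# Mean and variance by numerical integration of the beta density
m <- integrate(function(x) x * dbeta(x, a, b), lower = 0, upper = 1)$value
v <- integrate(function(x) (x - m)^2 * dbeta(x, a, b), lower = 0, upper = 1)$value
c(m, v)
# Closed-form expressions for comparison
c(a / (a + b), a * b / ((a + b)^2 * (a + b + 1)))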
Beta-binomial example
We write $Y \sim \mathsf{binom}(n, \theta)$ for the number of successes out of $n$ trials; the likelihood is
$$L(\theta; y) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y}.$$
Consider a beta prior, $\theta \sim \mathsf{Beta}(\alpha, \beta)$, with density
$$p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}.$$
Density versus likelihood
The binomial distribution is discrete with support $\{0, 1, \ldots, n\}$, whereas the likelihood is continuous over $\theta \in [0, 1]$.
Figure 1: Binomial density function (left) and scaled likelihood function (right).
While the density or mass function integrates (or sums) to 1 over the range of $y$, the integral of the likelihood over $\theta$ generally does not equal 1.
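A short R illustration of this point, using the $y = 6$ successes out of $n = 14$ trials of the running example:
# The mass function sums to 1 over y = 0, 1, ..., n for any fixed theta (here 0.3)
sum(dbinom(x = 0:14, size = 14, prob = 0.3))
# The likelihood, viewed as a function of theta for fixed y, does not integrate to 1
integrate(function(theta) dbinom(x = 6, size = 14, prob = theta), lower = 0, upper = 1)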
Posterior density and proportionality
Any term that is not a function of $\theta$ can be dropped, since it will be absorbed by the normalizing constant. The posterior density is proportional to
$$p(\theta \mid y) \propto \theta^{y + \alpha - 1} (1-\theta)^{n - y + \beta - 1},$$
the kernel of a beta density with shape parameters $y + \alpha$ and $n - y + \beta$.
The symbol $\propto$, denoting proportionality, means dropping all terms that are not a function of the argument on the left-hand side.
Marginal likelihood
The marginal likelihood for the binomial model with $\mathsf{Beta}(\alpha, \beta)$ prior is
$$p(y) = \binom{n}{y} \frac{\mathrm{B}(y + \alpha, n - y + \beta)}{\mathrm{B}(\alpha, \beta)},$$
where $\mathrm{B}(\cdot, \cdot)$ is the beta function.
Experiments and likelihoods
Consider the following sampling mechanisms, each of which leads to $y$ successes out of $n$ independent trials with the same probability of success $\theta$.
Bernoulli: sample a fixed number $n$ of binary observations, with $L(\theta; \boldsymbol{y}) = \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i} = \theta^{y}(1-\theta)^{n-y}$
binomial: same, but record only the total number of successes, so $L(\theta; y) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y}$
negative binomial: sample data until you obtain a predetermined number $y$ of successes, whence $L(\theta; n) = \binom{n-1}{y-1}\theta^{y}(1-\theta)^{n-y}$
Likelihood principle
Two likelihoods that are proportional, up to a constant not depending on the unknown parameters, carry the same evidence.
In all three cases, $L(\theta) \propto \theta^{y}(1-\theta)^{n-y}$, so they yield the same inference for Bayesians.
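A small numerical check that the binomial and negative binomial likelihoods of the running example differ only by a multiplicative constant, so their normalized versions (and hence the posteriors) coincide:
# Binomial: y = 6 successes in n = 14 trials; negative binomial: 8 failures before the 6th success
lik_binom <- function(theta) dbinom(x = 6, size = 14, prob = theta)
lik_negbin <- function(theta) dnbinom(x = 8, size = 6, prob = theta)
theta_grid <- seq(0.05, 0.95, by = 0.15)
# The ratio is constant in theta
lik_binom(theta_grid) / lik_negbin(theta_grid)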
For a more in-depth discussion, see Section 6.3.2 of Casella & Berger (2002)
Integration
We could approximate the marginal likelihood $p(\boldsymbol{y})$ through either
numerical integration (cubature)
Monte Carlo simulations
In more complicated models, we will try to sample observations from the posterior while bypassing this calculation entirely.
The likelihood terms can be very small (for discrete data each term is at most one, and their product shrinks as the sample size grows), so watch out for numerical underflow when evaluating normalizing constants.
Numerical example of (Monte Carlo) integration
y <- 6L   # number of successes
n <- 14L  # number of trials
alpha <- beta <- 1.5 # prior parameters
unnormalized_posterior <- function(theta){
  theta^(y + alpha - 1) * (1 - theta)^(n - y + beta - 1)
}
integrate(f = unnormalized_posterior, lower = 0, upper = 1)
1.066906e-05 with absolute error < 1e-12
# Compare with known constant
beta(y + alpha, n - y + beta)
[1] 1.066906e-05
# Monte Carlo integration
mean(unnormalized_posterior(runif(1e5)))
[1] 1.061693e-05
Marginal posterior
In multi-parameter models, additional integration is needed to get the marginal posterior of a component $\theta_j$,
$$p(\theta_j \mid \boldsymbol{y}) = \int p(\boldsymbol{\theta} \mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{\theta}_{-j}.$$
Marginalization is trivial when we have a joint sample: simply keep the column corresponding to $\theta_j$.
Prior, likelihood and posterior
Figure 2: Scaled Binomial likelihood for six successes out of 14 trials, prior and corresponding posterior distribution from a beta-binomial model.
Proper prior
We could define the posterior simply as the normalized product of the likelihood and some prior function.
The prior function need not even be proportional to a density function (i.e., integrable as a function of $\theta$).
For example,
$p(\theta) \propto \theta^{-1}(1-\theta)^{-1}$ is improper because it is not integrable;
$p(\theta) = 1$ for $\theta \in [0, 1]$ is a proper prior over the unit interval (uniform).
Validity of the posterior
The marginal likelihood $p(\boldsymbol{y})$ does not depend on $\boldsymbol{\theta}$
(it is a normalizing constant).
For the posterior density to be proper,
the marginal likelihood must be finite!
In continuous models, the posterior is proper whenever the prior function is proper.
Different priors give different posteriors
Figure 3: Scaled binomial likelihood for six successes out of 14 trials, with (left), (middle) and (right) priors and posterior density.
Role of the prior
The posterior is $\mathsf{Beta}(y + \alpha, n - y + \beta)$, with expected value a weighted average (made explicit below) of
the maximum likelihood estimator and
the prior mean.
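Explicitly, since the posterior is $\mathsf{Beta}(y + \alpha, n - y + \beta)$,
$$\mathsf{E}(\theta \mid y) = \frac{y + \alpha}{n + \alpha + \beta} = \frac{n}{n + \alpha + \beta} \cdot \frac{y}{n} + \frac{\alpha + \beta}{n + \alpha + \beta} \cdot \frac{\alpha}{\alpha + \beta},$$
so the weight given to the maximum likelihood estimate $y/n$ grows with the sample size $n$.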
Posterior concentration
Except for stubborn priors, the likelihood contribution dominates in large samples. The impact of the prior is then often negligible.
Figure 4: Beta posterior and binomial likelihood with a uniform prior for increasing number of observations (from left to right).
Model comparison
Suppose that we have $M$ models $\mathcal{M}_1, \ldots, \mathcal{M}_M$ to be compared, with respective parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_M$, data vector $\boldsymbol{y}$, and prior probabilities $p(\mathcal{M}_i)$.
The posterior odds for models $\mathcal{M}_i$ vs $\mathcal{M}_j$ are equal to the Bayes factor times the prior odds,
$$\frac{p(\mathcal{M}_i \mid \boldsymbol{y})}{p(\mathcal{M}_j \mid \boldsymbol{y})} = \frac{p(\boldsymbol{y} \mid \mathcal{M}_i)}{p(\boldsymbol{y} \mid \mathcal{M}_j)} \times \frac{p(\mathcal{M}_i)}{p(\mathcal{M}_j)}.$$
Bayes factors
The Bayes factor is the ratio of marginal likelihoods,
$$\mathsf{BF}_{ij} = \frac{p(\boldsymbol{y} \mid \mathcal{M}_i)}{p(\boldsymbol{y} \mid \mathcal{M}_j)}.$$
Values of $\mathsf{BF}_{ij} > 1$ correspond to model $\mathcal{M}_i$ being more likely than $\mathcal{M}_j$.
Strong dependence on the prior $p(\boldsymbol{\theta}_i \mid \mathcal{M}_i)$.
Must use proper priors.
Bayes factor for the binomial model
Consider two models for the binomial data that differ only in the prior placed on $\theta$.
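As an illustration, here is a minimal R sketch comparing two hypothetical beta priors, say $\mathsf{Beta}(1, 1)$ versus $\mathsf{Beta}(1.5, 1.5)$, using the closed-form marginal likelihood derived earlier:
# Marginal likelihood of y successes in n trials under a Beta(a, b) prior
marg_lik <- function(y, n, a, b) {
  choose(n, y) * beta(y + a, n - y + b) / beta(a, b)
}
y <- 6L; n <- 14L
# Bayes factor BF_12 for M1: Beta(1, 1) prior versus M2: Beta(1.5, 1.5) prior (hypothetical choices)
marg_lik(y, n, a = 1, b = 1) / marg_lik(y, n, a = 1.5, b = 1.5)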
Summarizing posterior distributions
The output of Bayesian learning will be one of:
a fully characterized distribution (in toy examples).
a numerical approximation to the posterior distribution.
an exact or approximate sample drawn from the posterior distribution.
Bayesian inference in practice
Most of the field revolves around the creation of algorithms that either
circumvent the calculation of the normalizing constant
(Monte Carlo and Markov chain Monte Carlo methods)
or provide accurate numerical approximations, including for marginalizing out all but one parameter.
Define the posterior predictive distribution as $p(y_{\text{new}} \mid \boldsymbol{y}) = \int p(y_{\text{new}} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{y})\, \mathrm{d}\boldsymbol{\theta}$; the analogous prior predictive, obtained by using the prior $p(\boldsymbol{\theta})$ in place of the posterior, is useful for determining whether the prior is sensible.
Analytical derivation of predictive distribution
Given the prior or the posterior, the predictive distribution for $y_{\text{new}}$ successes out of $m$ trials is beta-binomial with mass function
$$p(y_{\text{new}}) = \binom{m}{y_{\text{new}}} \frac{\mathrm{B}(y_{\text{new}} + \alpha, m - y_{\text{new}} + \beta)}{\mathrm{B}(\alpha, \beta)}, \qquad y_{\text{new}} = 0, \ldots, m.$$
Replace $\alpha$ with $y + \alpha$ and $\beta$ with $n - y + \beta$ to get the posterior predictive distribution.
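The following R sketch (the helper dbetabinom is defined here for illustration) evaluates this mass function for both the prior and the posterior predictive of the running example:
# Beta-binomial mass function for ynew successes out of m trials with shapes (shape1, shape2)
dbetabinom <- function(ynew, m, shape1, shape2) {
  choose(m, ynew) * beta(ynew + shape1, m - ynew + shape2) / beta(shape1, shape2)
}
y <- 6L; n <- 14L; alpha <- beta <- 1.5
prior_pred <- dbetabinom(0:n, m = n, shape1 = alpha, shape2 = beta)              # prior predictive
post_pred  <- dbetabinom(0:n, m = n, shape1 = y + alpha, shape2 = n - y + beta)  # posterior predictive
c(sum(prior_pred), sum(post_pred))   # both sum to 1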
Posterior predictive distribution
Figure 5: Beta-binomial posterior predictive distribution with corresponding binomial mass function evaluated at the maximum likelihood estimator.
Posterior predictive distribution via simulation
The posterior predictive carries over the parameter uncertainty, so it will typically be wider and overdispersed relative to the corresponding sampling distribution.
Given a draw $\theta_b$ from the posterior, simulate a new observation from the distribution $p(y_{\text{new}} \mid \theta_b)$.
npost <- 1e4L
# Sample draws from the posterior distribution
post_samp <- rbeta(n = npost, y + alpha, n - y + beta)
# For each draw, sample a new observation
post_pred <- rbinom(n = npost, size = n, prob = post_samp)
The beta-binomial is used to model overdispersion in binary regression models.
Summarizing posterior distributions
The output of a Bayesian procedure is a distribution for the parameters given the data.
We may wish to return different numerical summaries (expected value, variance, mode, quantiles, …)
The question: which point estimator to return?
Decision theory and loss functions
A loss function $c(t, \theta)$ assigns a weight to each candidate value $t$, corresponding to the regret or loss incurred.
The point estimator $\hat{\theta}$ is the minimizer of the expected posterior loss,
$$\hat{\theta} = \mathrm{arg\,min}_{t} \int c(t, \theta)\, p(\theta \mid \boldsymbol{y})\, \mathrm{d}\theta.$$
Point estimators and loss functions
In a univariate setting, the most widely used point estimators are
mean: quadratic loss
median: absolute loss
mode: 0-1 loss
The posterior mode is the maximum a posteriori or MAP estimator.
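A short R sketch for the $\mathsf{Beta}(y + \alpha, n - y + \beta)$ posterior of the running example, computing the three estimators and checking numerically that the posterior mean minimizes the expected quadratic loss:
y <- 6L; n <- 14L; alpha <- beta <- 1.5
a_post <- y + alpha; b_post <- n - y + beta
post_mean <- a_post / (a_post + b_post)            # minimizes quadratic loss
post_median <- qbeta(0.5, a_post, b_post)          # minimizes absolute loss
post_mode <- (a_post - 1) / (a_post + b_post - 2)  # maximum a posteriori (MAP)
# Expected quadratic loss as a function of the point estimate
exp_quad_loss <- function(est) {
  integrate(function(th) (est - th)^2 * dbeta(th, a_post, b_post), 0, 1)$value
}
sapply(c(mean = post_mean, median = post_median, mode = post_mode), exp_quad_loss)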
Measures of central tendency
Figure 6: Point estimators from a right-skewed distribution (left) and from a multimodal distribution (right).
Example of loss functions
Figure 7: Posterior density with mean, mode and median point estimators (left) and corresponding loss functions, scaled to have minimum value of zero (right).
Credible regions
The freshman dream comes true!
A $(1-\alpha)$ credible region gives a set of parameter values which contains the “true value” of the parameter with probability $1-\alpha$.
Caveat: McElreath (2020) suggests the term ‘compatibility’, as it
returns the range of parameter values compatible with the model and data.
Which credible intervals?
There are multiple possible intervals; the most common are
equitailed: region bounded by the $\alpha/2$ and $1 - \alpha/2$ posterior quantiles
highest posterior density interval (HPDI), which gives the smallest interval with probability $1 - \alpha$
If we accept having more than a single interval, the highest posterior density region can be a set of disjoint intervals. The HPDI is more sensitive to the number of draws and more computationally intensive (see the R package HDInterval). See Hyndman (1996) for computations.
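For the conjugate beta posterior of the running example, the equitailed interval is obtained directly from the quantile function (a sketch at the 89% level used in Figure 8):
# 89% equitailed credible interval from the Beta(y + alpha, n - y + beta) posterior
level <- 0.89
qbeta(p = c((1 - level) / 2, 1 - (1 - level) / 2),
      shape1 = 6 + 1.5, shape2 = 14 - 6 + 1.5)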
Illustration of credible regions
Figure 8: Density plots with 89% (top) and 50% (bottom) equitailed or central credible (left) and highest posterior density (right) regions for two data sets, highlighted in grey.
# Highest posterior density intervals - note values are outside of the support!
(hdiD <- HDInterval::hdi(density(postsamp), credMass = 1 - alpha, allowSplit = TRUE))
begin end
[1,] -0.04331573 0.2800577
[2,] 0.47816030 1.1423868
attr(,"credMass")
[1] 0.89
attr(,"height")
[1] 0.3898784
References
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.
Finetti, B. de. (1974). Theory of probability: A critical introductory treatment (Vol. 1). Wiley.