Bayesian modelling
Introduction
Léo Belzile, HEC Montréal
Last compiled Thursday Apr 3, 2025
Distribution and density function
Let $\boldsymbol{X}$ be a random vector with distribution function $F_{\boldsymbol{X}}(\boldsymbol{x}) = \Pr(X_1 \leq x_1, \ldots, X_d \leq x_d)$.
If the distribution of $\boldsymbol{X}$ is absolutely continuous, $$F_{\boldsymbol{X}}(\boldsymbol{x}) = \int_{-\infty}^{x_d} \cdots \int_{-\infty}^{x_1} f_{\boldsymbol{X}}(z_1, \ldots, z_d)\,\mathrm{d} z_1 \cdots \mathrm{d} z_d,$$ where $f_{\boldsymbol{X}}$ is the joint density function.
Mass function
By abuse of notation, we denote by $f_{\boldsymbol{X}}(\boldsymbol{x}) = \Pr(X_1 = x_1, \ldots, X_d = x_d)$ the mass function in the discrete case.
The support $\mathcal{X}$ is the set of non-zero density/probability, $\{\boldsymbol{x}: f_{\boldsymbol{X}}(\boldsymbol{x}) > 0\}$. The total probability over all points in the support is one, with sums replacing integrals in the discrete case.
Marginal distribution
The marginal distribution of a subvector $\boldsymbol{X}_{1:k} = (X_1, \ldots, X_k)$ is $$F_{\boldsymbol{X}_{1:k}}(\boldsymbol{x}_{1:k}) = \Pr(X_1 \leq x_1, \ldots, X_k \leq x_k) = F_{\boldsymbol{X}}(x_1, \ldots, x_k, \infty, \ldots, \infty).$$
Marginal density
The marginal density of an absolutely continuous subvector $\boldsymbol{X}_{1:k}$ is obtained through integration of the joint density over the remaining components, $$f_{\boldsymbol{X}_{1:k}}(\boldsymbol{x}_{1:k}) = \int f_{\boldsymbol{X}}(\boldsymbol{x})\,\mathrm{d} x_{k+1} \cdots \mathrm{d} x_d.$$
Conditional distribution
The conditional distribution of $\boldsymbol{Y}$ given $\boldsymbol{X} = \boldsymbol{x}$ has density (or mass) function $$f_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{y} \mid \boldsymbol{x}) = \frac{f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})}{f_{\boldsymbol{X}}(\boldsymbol{x})}$$ for any value of $\boldsymbol{x}$ in the support of $\boldsymbol{X}$.
Conditional and marginal for contingency table
Consider a bivariate discrete distribution whose joint probability mass function is given in Table 1.
Calculations for the marginal distribution
The marginal distribution of each component is obtained by summing the probabilities over each row (or column) of the table.
Conditional distribution
The conditional distribution is obtained by dividing each joint probability in a row (or column) by the corresponding marginal probability, so that each conditional mass function sums to one.
Independence
Vectors $\boldsymbol{X}$ and $\boldsymbol{Y}$ are independent if, for any value of $\boldsymbol{x}$ and $\boldsymbol{y}$, $F_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y}) = F_{\boldsymbol{X}}(\boldsymbol{x}) F_{\boldsymbol{Y}}(\boldsymbol{y})$.
The joint density, if it exists, also factorizes: $f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y}) = f_{\boldsymbol{X}}(\boldsymbol{x}) f_{\boldsymbol{Y}}(\boldsymbol{y})$.
If two subvectors $\boldsymbol{X}$ and $\boldsymbol{Y}$ are independent, then the conditional density $f_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{y} \mid \boldsymbol{x})$ equals the marginal $f_{\boldsymbol{Y}}(\boldsymbol{y})$.
Expected value
If $\boldsymbol{X}$ has density $f_{\boldsymbol{X}}$, then the expectation of a function $g(\boldsymbol{X})$ is a weighted integral of $g$ with weight $f_{\boldsymbol{X}}$, $$\mathsf{E}\{g(\boldsymbol{X})\} = \int g(\boldsymbol{x}) f_{\boldsymbol{X}}(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}.$$
The identity function $g(\boldsymbol{x}) = \boldsymbol{x}$ gives the expected value $\mathsf{E}(\boldsymbol{X})$.
Covariance matrix
We define the covariance matrix of $\boldsymbol{X}$ as $$\mathsf{Va}(\boldsymbol{X}) = \mathsf{E}\left[\{\boldsymbol{X} - \mathsf{E}(\boldsymbol{X})\}\{\boldsymbol{X} - \mathsf{E}(\boldsymbol{X})\}^\top\right],$$ which reduces in the unidimensional setting to the variance $\mathsf{Va}(X) = \mathsf{E}[\{X - \mathsf{E}(X)\}^2]$.
Law of iterated expectation and variance
Let $\boldsymbol{X}$ and $\boldsymbol{Y}$ be random vectors. The expected value of $\boldsymbol{Y}$ is $$\mathsf{E}(\boldsymbol{Y}) = \mathsf{E}_{\boldsymbol{X}}\left\{\mathsf{E}_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{Y} \mid \boldsymbol{X})\right\}.$$
The tower property gives a law of iterated variance, $$\mathsf{Va}(\boldsymbol{Y}) = \mathsf{E}_{\boldsymbol{X}}\left\{\mathsf{Va}_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{Y} \mid \boldsymbol{X})\right\} + \mathsf{Va}_{\boldsymbol{X}}\left\{\mathsf{E}_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{Y} \mid \boldsymbol{X})\right\}.$$
Poisson distribution
The Poisson distribution with mean $\mu > 0$ has mass function $$f(y) = \Pr(Y = y) = \frac{\exp(-\mu)\mu^y}{\Gamma(y + 1)}, \qquad y = 0, 1, 2, \ldots,$$ where $\Gamma(\cdot)$ denotes the gamma function.
The parameter $\mu$ of the Poisson distribution is both the expectation and the variance of the distribution, meaning $\mathsf{E}(Y) = \mathsf{Va}(Y) = \mu$.
Gamma distribution
A gamma distribution with shape $\alpha > 0$ and rate $\beta > 0$, denoted $\mathsf{Gamma}(\alpha, \beta)$, has density $$f(x) = \frac{\beta^\alpha x^{\alpha - 1}\exp(-\beta x)}{\Gamma(\alpha)}, \qquad x > 0,$$ where $\Gamma(\cdot)$ is the gamma function.
Poisson with random scale
To handle overdispersion in count data, take a hierarchical formulation in which the Poisson mean is itself random, $$Y \mid \Lambda = \lambda \sim \mathsf{Poisson}(\lambda), \qquad \Lambda \sim \mathsf{Gamma}(\alpha, \beta).$$
The joint density of $Y$ and $\Lambda$ on $\{0, 1, \ldots\} \times (0, \infty)$ is $$f(y, \lambda) = f(y \mid \lambda)\, f(\lambda) = \frac{\exp(-\lambda)\lambda^y}{\Gamma(y + 1)} \cdot \frac{\beta^\alpha \lambda^{\alpha - 1}\exp(-\beta\lambda)}{\Gamma(\alpha)}.$$
Conditional distribution
The conditional distribution of $\Lambda \mid Y = y$ can be found by considering only terms that are a function of $\lambda$, whence $$f(\lambda \mid y) \propto \lambda^{y + \alpha - 1}\exp\{-(\beta + 1)\lambda\},$$ so $\Lambda \mid Y = y \sim \mathsf{Gamma}(y + \alpha, \beta + 1)$.
Marginal density of Poisson mean mixture
Marginally, $Y$ follows a negative binomial distribution, $$f(y) = \int_0^\infty f(y \mid \lambda) f(\lambda)\,\mathrm{d}\lambda = \frac{\Gamma(y + \alpha)}{\Gamma(y + 1)\Gamma(\alpha)}\left(\frac{\beta}{\beta + 1}\right)^{\alpha}\left(\frac{1}{\beta + 1}\right)^{y}, \qquad y = 0, 1, \ldots$$
Moments of negative binomial
By the laws of iterated expectation and iterated variance, $$\mathsf{E}(Y) = \mathsf{E}\{\mathsf{E}(Y \mid \Lambda)\} = \mathsf{E}(\Lambda) = \frac{\alpha}{\beta}, \qquad \mathsf{Va}(Y) = \mathsf{E}\{\mathsf{Va}(Y \mid \Lambda)\} + \mathsf{Va}\{\mathsf{E}(Y \mid \Lambda)\} = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2}.$$ The marginal distribution of $Y$, unconditionally, has a variance which exceeds its mean.
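As a quick numerical check (not from the original slides), the Python sketch below simulates the gamma-Poisson hierarchy for arbitrary illustrative values of the shape $\alpha$ and rate $\beta$ and compares the sample mean and variance of $Y$ to the formulas above.

```python
# Sketch: simulate the gamma-Poisson mixture and check the iterated
# expectation/variance formulas numerically; alpha and beta are arbitrary.
import numpy as np

rng = np.random.default_rng(seed=2025)
alpha, beta = 2.0, 0.5           # gamma shape and rate (illustrative values)
n = 100_000                      # number of simulated pairs

lam = rng.gamma(shape=alpha, scale=1 / beta, size=n)  # Lambda ~ Gamma(alpha, beta)
y = rng.poisson(lam)                                  # Y | Lambda ~ Poisson(Lambda)

print("sample mean:", y.mean(), "theory:", alpha / beta)
print("sample variance:", y.var(), "theory:", alpha / beta + alpha / beta**2)
```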
Gaussian location-scale
Consider independent standard Gaussian variates $Z_i \sim \mathsf{Gauss}(0, 1)$ for $i = 1, \ldots, d$, with joint density function $$f(\boldsymbol{z}) = (2\pi)^{-d/2}\exp\left(-\tfrac{1}{2}\boldsymbol{z}^\top\boldsymbol{z}\right).$$ Consider the transformation $\boldsymbol{Y} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{Z}$ with $\mathbf{L}$ an invertible $d \times d$ matrix.
Change of variable for Gaussian
- The inverse transformation is $\boldsymbol{Z} = \mathbf{L}^{-1}(\boldsymbol{Y} - \boldsymbol{\mu})$.
- The Jacobian is simply $|\mathbf{L}^{-1}|$, so the joint density of $\boldsymbol{Y}$ is $$f(\boldsymbol{y}) = (2\pi)^{-d/2}|\mathbf{L}|^{-1}\exp\left\{-\tfrac{1}{2}(\boldsymbol{y} - \boldsymbol{\mu})^\top(\mathbf{L}\mathbf{L}^\top)^{-1}(\boldsymbol{y} - \boldsymbol{\mu})\right\}.$$ Since $\mathsf{E}(\boldsymbol{Y}) = \boldsymbol{\mu}$ and $\mathsf{Va}(\boldsymbol{Y}) = \mathbf{L}\mathbf{L}^\top = \boldsymbol{\Sigma}$, we recover $\boldsymbol{Y} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ (see the sketch below).
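A minimal Python sketch of this location-scale construction, with an illustrative mean vector and covariance matrix (these particular values are assumptions, not from the slides): draws of $\boldsymbol{Y} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{Z}$ built from a Cholesky factor $\mathbf{L}$ of $\boldsymbol{\Sigma}$ have sample moments close to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.

```python
# Sketch of the Gaussian location-scale transform Y = mu + L Z, where L is the
# lower Cholesky factor of Sigma; mu and Sigma below are illustrative only.
import numpy as np

rng = np.random.default_rng(seed=1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)                # Sigma = L @ L.T

Z = rng.standard_normal(size=(100_000, 2))   # independent N(0, 1) components
Y = mu + Z @ L.T                             # each row is a draw from N(mu, Sigma)

print(np.round(Y.mean(axis=0), 2))           # close to mu
print(np.round(np.cov(Y, rowvar=False), 2))  # close to Sigma
```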
Conditional distribution of Gaussian subvectors
Let $\boldsymbol{Y} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and consider the partition $\boldsymbol{Y} = (\boldsymbol{Y}_1, \boldsymbol{Y}_2)$, where $\boldsymbol{Y}_1$ is a $k$-vector and $\boldsymbol{Y}_2$ is a $(d - k)$-vector for some $1 \leq k < d$, with conformal blocks $$\boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{pmatrix}.$$
Then, we have the conditional distribution $$\boldsymbol{Y}_1 \mid \boldsymbol{Y}_2 = \boldsymbol{y}_2 \sim \mathsf{Gauss}_k\left\{\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\boldsymbol{y}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right\}.$$
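The conditional mean and covariance can be computed directly from the partitioned blocks. The sketch below is illustrative, with made-up values for the blocks and for the conditioning value $\boldsymbol{y}_2$.

```python
# Sketch: conditional mean and covariance of Y1 | Y2 = y2 for a partitioned
# Gaussian vector; all numerical values are illustrative.
import numpy as np

mu1, mu2 = np.array([0.0]), np.array([1.0, -1.0])
S11 = np.array([[1.0]])
S12 = np.array([[0.5, 0.3]])
S22 = np.array([[2.0, 0.4],
                [0.4, 1.0]])
y2 = np.array([2.0, 0.0])

W = S12 @ np.linalg.inv(S22)          # Sigma_12 Sigma_22^{-1}
cond_mean = mu1 + W @ (y2 - mu2)      # mu_1 + Sigma_12 Sigma_22^{-1} (y2 - mu_2)
cond_cov = S11 - W @ S12.T            # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21

print(cond_mean, cond_cov)
```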
Likelihood
The likelihood $L(\boldsymbol{\theta})$ is a function of the parameter vector $\boldsymbol{\theta}$ that gives the ‘density’ of a sample under a postulated distribution, treating the observations as fixed, $$L(\boldsymbol{\theta}; \boldsymbol{y}) = f(\boldsymbol{y}; \boldsymbol{\theta}).$$
Likelihood for independent observations
If observations are independent, the joint density factorizes, $L(\boldsymbol{\theta}; \boldsymbol{y}) = \prod_{i=1}^n f_i(y_i; \boldsymbol{\theta})$. The corresponding log likelihood function for independent and identically distributed observations is $$\ell(\boldsymbol{\theta}; \boldsymbol{y}) = \sum_{i=1}^n \log f(y_i; \boldsymbol{\theta}).$$
Score
Let $\ell(\boldsymbol{\theta})$ be the log likelihood function. The gradient of the log likelihood, termed score, is the $p$-vector $$U(\boldsymbol{\theta}) = \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}.$$
Example: random right-censoring
Consider a survival analysis problem for independent time-to-event data subject to (noninformative) random right-censoring. We observe
- failure times $t_1, \ldots, t_n$ drawn from a distribution $F(t; \boldsymbol{\theta})$ supported on $(0, \infty)$;
- independent binary censoring indicators $c_1, \ldots, c_n$, with $c_i = 0$ indicating right-censoring and $c_i = 1$ an observed failure time.
Likelihood contribution with censoring
If individual observation $i$ has not experienced the event at the end of the collection period, then the likelihood contribution is $\Pr(T_i > t_i) = 1 - F(t_i; \boldsymbol{\theta})$, where $t_i$ is the maximum time observed for $i$. We write the log likelihood $$\ell(\boldsymbol{\theta}) = \sum_{i: c_i = 1} \log f(t_i; \boldsymbol{\theta}) + \sum_{i: c_i = 0} \log\{1 - F(t_i; \boldsymbol{\theta})\}.$$
Censoring and exponential data
Suppose for simplicity that $T_i \sim \mathsf{Exp}(\lambda)$ and let $d = \sum_{i=1}^n c_i$ denote the number of observed failure times. Then, the log likelihood and the Fisher information are $$\ell(\lambda) = d\log\lambda - \lambda\sum_{i=1}^n t_i, \qquad \imath(\lambda) = \frac{d}{\lambda^2},$$ and the right-censored observations for the exponential model do not contribute to the information.
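A short Python sketch (with illustrative choices for the rate and censoring time) generates right-censored exponential data and evaluates the closed-form maximum likelihood estimate $\hat{\lambda} = d / \sum_i t_i$ together with the information $d/\hat{\lambda}^2$.

```python
# Sketch: right-censored exponential data, log likelihood and MLE of the rate;
# the true rate and the fixed censoring time are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(seed=42)
n, true_rate, cens_time = 200, 0.3, 4.0
failure = rng.exponential(scale=1 / true_rate, size=n)  # latent failure times
time = np.minimum(failure, cens_time)                   # observed times t_i
event = failure <= cens_time                            # c_i = 1 if failure observed

d = event.sum()                  # number of observed failures
mle = d / time.sum()             # closed-form maximum likelihood estimate

def loglik(lam):
    # ell(lambda) = d * log(lambda) - lambda * sum(t_i); censored observations
    # only enter through the sum of observed times
    return d * np.log(lam) - lam * time.sum()

print("MLE:", mle, "information at MLE:", d / mle**2, "loglik:", loglik(mle))
```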
Example: first-order autoregressive process
Consider an autoregressive model of order 1, $\mathsf{AR}(1)$, of the form $Y_t = \mu + \phi(Y_{t-1} - \mu) + \varepsilon_t$, where

- $\phi$ is the lag-one correlation,
- $\mu$ the global mean, and
- $\varepsilon_t$ is an iid innovation with mean zero and variance $\sigma^2$.

If $|\phi| < 1$, the process is stationary, and the variance does not increase with $t$.
Markov property and likelihood decomposition
The Markov property states that the current realization depends on the past only through the most recent value, $$f(y_t \mid y_{t-1}, \ldots, y_1) = f(y_t \mid y_{t-1}).$$ The log likelihood thus becomes $$\ell(\boldsymbol{\theta}) = \log f(y_1) + \sum_{t=2}^n \log f(y_t \mid y_{t-1}).$$
Marginal of AR(1)
The stationary process has unconditional moments $\mathsf{E}(Y_t) = \mu$ and $\mathsf{Va}(Y_t) = \sigma^2/(1 - \phi^2)$.
The process is first-order Markov since the conditional distribution $f(y_t \mid y_{t-1}, \ldots, y_1)$ equals $f(y_t \mid y_{t-1})$.
Log likelihood of AR(1)
If innovations are Gaussian, we have $Y_t \mid Y_{t-1} = y_{t-1} \sim \mathsf{Gauss}\{\mu + \phi(y_{t-1} - \mu), \sigma^2\}$, so the log likelihood is $$\ell(\mu, \phi, \sigma) = \log f(y_1; \mu, \phi, \sigma) + \sum_{t=2}^n \log f(y_t \mid y_{t-1}; \mu, \phi, \sigma).$$
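The sketch below evaluates this exact Gaussian AR(1) log likelihood, combining the stationary marginal term for $y_1$ with the conditional Gaussian terms; the parameter values and the simulated series are illustrative choices.

```python
# Sketch: exact Gaussian AR(1) log likelihood (stationary marginal for y_1,
# Markov conditional terms afterwards); parameters below are illustrative.
import numpy as np
from scipy.stats import norm

def ar1_loglik(y, mu, phi, sigma):
    # marginal term: Y_1 ~ N(mu, sigma^2 / (1 - phi^2)) under stationarity
    ll = norm.logpdf(y[0], loc=mu, scale=sigma / np.sqrt(1 - phi**2))
    # Markov terms: Y_t | Y_{t-1} ~ N(mu + phi * (y_{t-1} - mu), sigma^2)
    ll += norm.logpdf(y[1:], loc=mu + phi * (y[:-1] - mu), scale=sigma).sum()
    return ll

rng = np.random.default_rng(seed=0)
mu, phi, sigma, n = 2.0, 0.7, 1.0, 500
y = np.empty(n)
y[0] = rng.normal(mu, sigma / np.sqrt(1 - phi**2))
for t in range(1, n):
    y[t] = mu + phi * (y[t - 1] - mu) + rng.normal(0.0, sigma)

print(ar1_loglik(y, mu, phi, sigma))
```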
Estimation of integrals
Suppose we can simulate $B$ i.i.d. variables $X_1, \ldots, X_B$ with the same distribution $F$, with density or mass function $f$.
We want to compute $\mathsf{E}\{g(X)\}$ for some functional $g(\cdot)$, for example

- $g(x) = x$ (mean),
- $g(x) = \mathrm{I}(x \in A)$ (probability of event),
- etc.
Vanilla Monte Carlo integration
We substitute the expected value by the sample average of $g(X_1), \ldots, g(X_B)$, $$\widehat{\mathsf{E}}\{g(X)\} = \frac{1}{B}\sum_{b=1}^B g(X_b).$$

- The law of large numbers guarantees convergence of $\widehat{\mathsf{E}}\{g(X)\}$ to $\mathsf{E}\{g(X)\}$ if the latter is finite.
- Under finite second moments, the central limit theorem gives $\sqrt{B}\left[\widehat{\mathsf{E}}\{g(X)\} - \mathsf{E}\{g(X)\}\right] \stackrel{\mathrm{d}}{\to} \mathsf{Gauss}\left[0, \mathsf{Va}\{g(X)\}\right]$ (see the sketch below).
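A minimal sketch of vanilla Monte Carlo integration in Python, estimating a tail probability of a gamma distribution (the functional and distribution are arbitrary examples) together with its Monte Carlo standard error.

```python
# Sketch: vanilla Monte Carlo estimate of Pr(X > 1) for X ~ Gamma(0.5, rate 2),
# with a standard error; the functional and sample size are illustrative.
import numpy as np

rng = np.random.default_rng(seed=123)
B = 10_000
x = rng.gamma(shape=0.5, scale=1 / 2, size=B)  # rate 2 means scale 0.5

g = (x > 1).astype(float)               # g(X) = I(X > 1), so E{g(X)} = Pr(X > 1)
estimate = g.mean()                     # sample average replaces the expectation
std_error = g.std(ddof=1) / np.sqrt(B)  # Monte Carlo standard error

print(estimate, "+/-", 1.96 * std_error)
```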
Importance sampling
Consider instead a density $q$ whose support includes that of $f$. Then, $$\mathsf{E}_f\{g(X)\} = \int g(x)\frac{f(x)}{q(x)}\,q(x)\,\mathrm{d}x = \mathsf{E}_q\left\{g(X)\frac{f(X)}{q(X)}\right\},$$ and we can proceed similarly by drawing samples from $q$.
Importance sampling estimator
An alternative Monte Carlo estimator uses the weighted average $$\widetilde{\mathsf{E}}_q\{g(X)\} = \frac{\sum_{b=1}^B w_b\, g(X_b)}{\sum_{b=1}^B w_b}$$ with weights $w_b = f(X_b)/q(X_b)$ for $X_b \sim q$. The weights equal 1 on average, so one could omit the denominator without harm.
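The following sketch illustrates importance sampling for a small Gaussian tail probability, $\Pr(X > 3)$ with $X \sim \mathsf{Gauss}(0, 1)$, using a shifted exponential proposal; the threshold and the proposal are illustrative choices, not taken from the slides.

```python
# Sketch: importance sampling estimate of Pr(X > 3) for X ~ N(0, 1), using a
# proposal q that is an exponential shifted to start at 3 (illustrative choice).
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(seed=2024)
B = 10_000
x = 3.0 + rng.exponential(scale=1.0, size=B)   # draws from the proposal q

# importance weights w = f(x) / q(x); here g(x) = I(x > 3) = 1 for all draws
w = norm.pdf(x) / expon.pdf(x, loc=3.0)
estimate = w.mean()
std_error = w.std(ddof=1) / np.sqrt(B)

print(estimate, "exact:", norm.sf(3.0), "std error:", std_error)
```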
Standard errors
If the variance of $g(X)$ is finite, we can approximate the latter by the sample variance of the simple random sample $g(X_1), \ldots, g(X_B)$ and obtain the Monte Carlo standard error of the estimator, $$\mathsf{se}\left[\widehat{\mathsf{E}}\{g(X)\}\right] = \left[\frac{1}{B(B-1)}\sum_{b=1}^B \left\{g(X_b) - \widehat{\mathsf{E}}\{g(X)\}\right\}^2\right]^{1/2}.$$
Precision of Monte Carlo integration
We want to have an estimator as precise as possible.
- but we can’t control the variance of $g(X)$, say $\sigma^2_g$;
- the more simulations $B$, the lower the variance of the mean;
- the sample average for i.i.d. data has variance $\sigma^2_g/B$;
- to reduce the standard deviation by a factor of 10, we need $100$ times more draws!
Remember: the answer is random.
Example: functionals of gamma distribution
Figure 1: Running mean trace plots for three functionals (left, middle, right) of a gamma distribution with shape 0.5 and rate 2, as a function of the Monte Carlo sample size.
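A sketch in the spirit of Figure 1: it draws from a gamma distribution with shape 0.5 and rate 2 and tracks running Monte Carlo estimates of two functionals, the mean and a tail probability. These two functionals are illustrative guesses and may differ from those shown in the original figure.

```python
# Sketch: running-mean trace plots for Monte Carlo estimates of E(X) and
# Pr(X > 1) when X ~ Gamma(shape 0.5, rate 2); functional choices are guesses.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=7)
B = 10_000
x = rng.gamma(shape=0.5, scale=1 / 2, size=B)
steps = np.arange(1, B + 1)

running_mean = np.cumsum(x) / steps        # estimates of E(X) = 0.5 / 2 = 0.25
running_prob = np.cumsum(x > 1) / steps    # estimates of Pr(X > 1)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot(steps, running_mean)
axes[0].axhline(0.25, linestyle="--")      # true mean for reference
axes[0].set_title("E(X)")
axes[1].plot(steps, running_prob)
axes[1].set_title("Pr(X > 1)")
plt.show()
```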
Recap
- We can specify a joint distribution using hierarchies, with joint = marginal $\times$ conditional.
- Most density and mass functions for a random quantity $X$ can be identified from their support and their kernel, i.e., terms that depend on $x$, ignoring normalizing constants. We then match expressions.
- Expectations can be calculated analytically, or approximated via Monte Carlo simulations.