1 Introduction
This section reviews basic concepts in probability theory that will be used throughout the course. The overview begins with basic statistical concepts: random variables, their distribution, density and moments, and likelihood derivations.
1.1 Random vectors
We begin with a characterization of random vectors and their marginal, conditional and joint distributions. A good reference for this material is Chapter 3 of McNeil, Frey, and Embrechts (2005).
Definition 1.1 (Density and distribution function) Let \(\boldsymbol{X}\) denote a \(d\)-dimensional random vector taking values in \(\mathbb{R}^d\). The distribution function of \(\boldsymbol{X}\) is \[\begin{align*} F_{\boldsymbol{X}}(\boldsymbol{x}) = \Pr(\boldsymbol{X} \leq \boldsymbol{x}) = \Pr(X_1 \leq x_1, \ldots, X_d \leq x_d). \end{align*}\]
If the distribution of \(\boldsymbol{X}\) is absolutely continuous, we may write \[\begin{align*} F_{\boldsymbol{X}}(\boldsymbol{x}) = \int_{-\infty}^{x_d} \cdots \int_{-\infty}^{x_1} f_{\boldsymbol{X}}(z_1, \ldots, z_d) \mathrm{d} z_1 \cdots \mathrm{d} z_d, \end{align*}\] where \(f_{\boldsymbol{X}}(\boldsymbol{x})\) is the joint density function. The density function can be obtained as the derivative of the distribution function with respect to all of its arguments.
We use the same notation for the mass function in the discrete case, where \(f_{\boldsymbol{X}}(\boldsymbol{x}) = \Pr(X_1 = x_1, \ldots, X_d = x_d)\) and the integral is understood to mean a summation over all values lower or equal to \(\boldsymbol{x}\) in the support. In the discrete case, \(0 \leq f_{\boldsymbol{X}}(\boldsymbol{x}) \leq 1\) is a probability and the total probability over all points in the support sums to one, meaning \(\sum_{\boldsymbol{x} \in \mathsf{supp}(\boldsymbol{X})} f_{\boldsymbol{X}}(\boldsymbol{x}) = 1.\)
Definition 1.2 (Location and scale distribution) A random variable \(Y\) is said to belong to a location scale family with location parameter \(b\) and scale \(a>0\) if it is equal in distribution to a location and scale transformation of a standard variable \(X\) with location zero and unit scale, denoted \(Y \stackrel{\mathrm{d}}{=} aX + b\) and meaning, \[\Pr(Y \leq y) = \Pr(aX + b \leq y).\] If the density exists, then \(f_Y(y) = a^{-1}f_X\{(y-b)/a\}\).
We can extend this definition to the multivariate setting for location vector \(\boldsymbol{b} \in \mathbb{R}^d\) and positive definite scale matrix \(\mathbf{A}\), such that \[\Pr(\boldsymbol{Y} \leq \boldsymbol{y}) = \Pr(\boldsymbol{b} + \mathbf{A}\boldsymbol{X} \leq \boldsymbol{y}).\]
Definition 1.3 (Exponential family) A univariate distribution is an exponential family if its density or mass function can be written for all \(y \in \mathbb{R}\) as \[\begin{align*} f(y; \boldsymbol{\theta}) = \exp\left\{ \sum_{k=1}^K Q_k(\boldsymbol{\theta}) t_k(y) + D(\boldsymbol{\theta}) + h(y)\right\}, \end{align*}\] where functions \(Q_1(\cdot), \ldots, Q_K(\cdot)\) and \(D(\cdot)\) depend only on \(\boldsymbol{\theta}\) and not on the data, and conversely \(t_1(\cdot), \ldots, t_K(\cdot)\) and \(h(\cdot)\) do not depend on the vector of parameters \(\boldsymbol{\theta}\).
The support of \(f\) must not depend on \(\boldsymbol{\theta}\). The transformed parameters \(Q_k(\boldsymbol{\theta})\) \((k=1, \ldots, K)\) are termed canonical parameters.
1.1.1 Common distributions
Definition 1.4 (Gamma, chi-square and exponential distributions) A random variable follows a gamma distribution with shape \(\alpha>0\) and rate \(\beta>0\), denoted \(Y \sim \mathsf{gamma}(\alpha, \beta)\), if its density is \[\begin{align*} f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x), \qquad x \in (0, \infty), \end{align*}\] where \(\Gamma(\alpha)\coloneqq\int_0^\infty t^{\alpha-1}\exp(-t)\mathrm{d} t\) is the gamma function.
If \(\alpha=1\), the density simplifies to \(\beta \exp(-\beta x)\) and we recover the exponential distribution, denoted \(\mathsf{expo}(\beta)\). The case \(\mathsf{gamma}(\nu/2, 1/2)\) corresponds to the chi-square distribution \(\chi^2_\nu\).
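These special cases are easy to verify numerically; the following is a minimal sketch using base R densities and arbitrary parameter values.

```r
# Check that gamma(1, beta) matches the exponential distribution and that
# gamma(nu/2, 1/2) matches the chi-square distribution with nu degrees of freedom
x <- seq(0.1, 5, by = 0.1)
all.equal(dgamma(x, shape = 1, rate = 2), dexp(x, rate = 2))
all.equal(dgamma(x, shape = 3/2, rate = 1/2), dchisq(x, df = 3))
```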
Definition 1.5 (Beta and uniform distribution) The beta distribution \(\mathsf{beta}(\alpha_1, \alpha_2)\) is a distribution supported on the unit interval \([0,1]\) with shape parameters \(\alpha_1>0\) and \(\alpha_2>0.\) Its density is \[\begin{align*} f(x) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}x^{\alpha_1-1}(1-x)^{\alpha_2-1}, \qquad x \in [0,1]. \end{align*}\] The case \(\mathsf{beta}(1,1)\), also denoted \(\mathsf{unif}(0,1)\), corresponds to a standard uniform distribution.
The beta distribution is commonly used to model proportions, and can be generalized to the multivariate setting as follows.
Definition 1.6 (Dirichlet distribution) Let \(\boldsymbol{\alpha} \in (0, \infty)^d\) denote shape parameters and consider a random vector of size \(d\) with positive components on the simplex \[\mathbb{S}_{d-1} = \{ \boldsymbol{x} \in \mathbb{R}^d: 0 \leq x_j \leq 1, j=1, \ldots, d, \; x_1 + \cdots + x_d=1\}.\] The density of a Dirichlet random vector, denoted \(\boldsymbol{Y} \sim \mathsf{Dirichlet}(\boldsymbol{\alpha})\), is \[\begin{align*} f(\boldsymbol{x}) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_d)}{\prod_{j=1}^{d}\Gamma(\alpha_j)}\prod_{j=1}^{d} x_j^{\alpha_j-1}, \qquad \boldsymbol{x} \in \mathbb{S}_{d-1}. \end{align*}\]
Due to the linear dependence, the \(d\)th component \(x_d = 1- x_1 - \cdots - x_{d-1}\) is fully determined.
Definition 1.7 (Binomial distribution) The density of the binomial distribution, denoted \(Y \sim \mathsf{binom}(n, p)\), is \[\begin{align*} f(x) = \mathsf{Pr}(Y=x) = \binom{n}{x}p^x (1-p)^{n-x}, \quad x=0, 1, \ldots, n. \end{align*}\]
If \(n=1\), we recover the Bernoulli distribution, with mass function \(f(x) = p^{x}(1-p)^{1-x}\) for \(x \in \{0,1\}\). The binomial distribution is closed under convolution, meaning that the number of successes \(Y\) out of \(n\) independent Bernoulli trials with common success probability \(p\) is binomial.
Definition 1.8 (Multinomial distribution) If there are more than two outcomes, say \(d\), we can generalize this mass function. Suppose that \(\boldsymbol{Y}=(Y_1, \ldots, Y_d)\) denotes the number of realizations of each of the \(d\) outcomes based on \(n\) trials, so that \(0 \leq Y_j \leq n\) \((j=1, \ldots, d)\) and \(Y_1 + \cdots + Y_d=n\). The joint mass function of the multinomial vector \(\boldsymbol{Y} \sim \mathsf{multinom}(n, \boldsymbol{p})\) with probability vector \(\boldsymbol{p} \in \mathbb{S}_{d-1}\) is \[\begin{align*} f(\boldsymbol{x}) = \frac{n!}{\prod_{j=1}^d x_j!} \prod_{j=1}^d p_j^{x_j}, \qquad \boldsymbol{x}/n \in \mathbb{S}_{d-1}, \end{align*}\] where \(x! = \Gamma(x+1)\) denotes the factorial function.
Definition 1.9 (Poisson distribution) If the probability of success \(p\) of a Bernoulli event is small in the sense that \(np \to \lambda\) when the number of trials \(n\) increases, then the number of successes follows approximately a Poisson distribution with mass function \[\begin{align*} f(x)=\mathsf{Pr}(Y=x) = \frac{\exp(-\lambda)\lambda^x}{\Gamma(x+1)}, \quad x=0, 1, 2, \ldots \end{align*}\] where \(\Gamma(\cdot)\) denotes the gamma function. The parameter \(\lambda\) of the Poisson distribution is both the expectation and the variance of the distribution, meaning \(\mathsf{E}(Y)=\mathsf{Va}(Y)=\lambda.\) We denote the distribution as \(Y \sim \mathsf{Poisson}(\lambda)\).
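The limiting argument can be checked numerically; the sketch below uses assumed values \(\lambda=2\) and \(n=10^4\).

```r
# Poisson approximation to binomial(n, p) when n is large and p = lambda/n is small
lambda <- 2
n <- 1e4
x <- 0:10
max(abs(dbinom(x, size = n, prob = lambda/n) - dpois(x, lambda = lambda)))
```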
The most frequently encountered location-scale distribution is termed normal, but we use the terminology Gaussian in these notes.
Definition 1.10 (Gaussian distribution) Consider a \(d\) dimensional vector \(\boldsymbol{Y} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{Q}^{-1})\) with density \[\begin{align*} f(\boldsymbol{x}) = (2\pi)^{-d/2} |\boldsymbol{Q}|^{1/2} \exp \left\{ - \frac{1}{2} (\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{Q}(\boldsymbol{x}-\boldsymbol{\mu})\right\}, \qquad \boldsymbol{x} \in \mathbb{R}^d \end{align*}\]
The mean vector \(\boldsymbol{\mu}\) is the vector of expectation of individual observations, whereas \(\boldsymbol{Q}^{-1}\equiv \boldsymbol{\Sigma}\) is the \(d \times d\) covariance matrix of \(\boldsymbol{Y}\) and \(\boldsymbol{Q}\), the canonical parameter, is called the precision matrix.
In the univariate case, the density of \(\mathsf{Gauss}(\mu, \sigma^2)\) reduces to \[\begin{align*} f(x) = (2\pi\sigma^2)^{-1/2} \exp \left\{ - \frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad x \in \mathbb{R}. \end{align*}\]
Proposition 1.1 (Simulation of Gaussian vectors) The Gaussian distribution is an elliptical distribution and a location-scale family: if \(\boldsymbol{L} = \mathrm{chol}(\boldsymbol{Q})\), meaning \(\boldsymbol{Q}=\boldsymbol{LL}^\top\) for some lower triangular matrix \(\boldsymbol{L}\), then \[\boldsymbol{L}^\top(\boldsymbol{Y}-\boldsymbol{\mu}) \sim \mathsf{Gauss}_d(\boldsymbol{0}_d, \mathbf{I}_d).\] Conversely, we can use the Cholesky root to sample multivariate Gaussian vectors by first drawing \(d\) independent standard Gaussians \(\boldsymbol{Z}=(Z_1, \ldots, Z_d)^\top\), then computing \[\boldsymbol{Y} \gets \boldsymbol{\mu} + (\boldsymbol{L}^{\top})^{-1}\boldsymbol{Z},\] which solves the triangular system \(\boldsymbol{L}^\top(\boldsymbol{Y}-\boldsymbol{\mu}) = \boldsymbol{Z}\).
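The following sketch implements this recipe; note that R's chol() returns the upper triangular factor \(\boldsymbol{L}^\top\), so the triangular solve uses backsolve(). The function name rmvgauss_prec and the test matrix are illustrative choices.

```r
# Sketch: simulate Gauss_d(mu, Q^{-1}) from the Cholesky factor of the precision Q
rmvgauss_prec <- function(n, mu, Q) {
  d <- length(mu)
  U <- chol(Q)                      # upper triangular, Q = t(U) %*% U
  Z <- matrix(rnorm(n * d), nrow = d, ncol = n)
  t(mu + backsolve(U, Z))           # solves U %*% (Y - mu) = Z, column by column
}
# Empirical covariance should approximate solve(Q)
Q <- matrix(c(2, -1, -1, 2), nrow = 2)
samp <- rmvgauss_prec(n = 1e4, mu = c(1, -1), Q = Q)
cov(samp)
solve(Q)
```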
Definition 1.11 (Student-\(t\) distribution) The name “Student” comes from the pseudonym used by William Gosset in Gosset (1908), who derived the distribution of the \(t\)-statistic. The density of the standard Student-\(t\) univariate distribution with \(\nu\) degrees of freedom is \[\begin{align*} f(y; \nu) = \frac{\Gamma \left( \frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \sqrt{\nu\pi}}\left(1+\frac{y^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}. \end{align*}\] The density of the random vector \(\boldsymbol{Y} \sim \mathsf{Student}_d(\boldsymbol{\mu}, \boldsymbol{Q}^{-1}, \nu)\), with location vector \(\boldsymbol{\mu}\), scale matrix \(\boldsymbol{Q}^{-1}\) and \(\nu\) degrees of freedom, is \[\begin{align*} f(\boldsymbol{x}) = \frac{\Gamma \left( \frac{\nu+d}{2}\right)|\boldsymbol{Q}|^{1/2}}{\Gamma\left(\frac{\nu}{2}\right) (\nu\pi)^{d/2}}\left(1+\frac{(\boldsymbol{x}-\boldsymbol{\mu})^\top\boldsymbol{Q}(\boldsymbol{x}-\boldsymbol{\mu})}{\nu} \right)^{-\frac{\nu+d}{2}}, \qquad \boldsymbol{x} \in \mathbb{R}^d. \end{align*}\] The Student distribution is a location-scale family and an elliptical distribution. The distribution has polynomial tails, is symmetric around \(\boldsymbol{\mu}\) and is unimodal. As \(\nu \to \infty\), the Student distribution converges to a Gaussian distribution. It has heavier tails than the Gaussian and only moments of order strictly smaller than \(\nu\) exist. The case \(\nu=1\) is termed the Cauchy distribution.
Definition 1.12 (Weibull distribution) The distribution function of a Weibull random variable with scale \(\lambda>0\) and shape \(\alpha>0\) is \[\begin{align*} F(x; \lambda, \alpha) &= 1 - \exp\left\{-(x/\lambda)^\alpha\right\}, \qquad x \geq 0, \lambda>0, \alpha>0, \end{align*}\] while the corresponding density is \[\begin{align*} f(x; \lambda, \alpha) &= \frac{\alpha}{\lambda^\alpha} x^{\alpha-1}\exp\left\{-(x/\lambda)^\alpha\right\}, \qquad x \geq 0, \lambda>0, \alpha>0. \end{align*}\]
1.1.2 Marginal and conditional distributions
Definition 1.13 (Marginal distribution) The marginal distribution of a subvector \(\boldsymbol{X}_{1:k}=(X_1, \ldots, X_k)^\top\), without loss of generality consisting of the \(k\) first components of \(\boldsymbol{X}\) \((1 \leq k < d)\), is \[\begin{align*} F_{1:k}(\boldsymbol{x}_{1:k}) = \Pr(\boldsymbol{X}_{1:k} \leq \boldsymbol{x}_{1:k}) = F_{\boldsymbol{X}}(x_1, \ldots, x_k, \infty, \ldots, \infty), \end{align*}\] and thus the marginal distribution of component \(j\), \(F_j(x_j)\), is obtained by evaluating all components but the \(j\)th at \(\infty\).
We likewise obtain the marginal density \[\begin{align*} f_{1:k}(\boldsymbol{x}_{1:k}) = \frac{\partial^k F_{1:k}(\boldsymbol{x}_{1:k})}{\partial x_1 \cdots \partial x_{k}}, \end{align*}\] or through integration from the joint density as \[\begin{align*} f_{1:k}(\boldsymbol{x}_{1:k}) = \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty f_{\boldsymbol{X}}(x_1, \ldots, x_k, z_{k+1}, \ldots, z_{d}) \mathrm{d} z_{k+1} \cdots \mathrm{d}z_d. \end{align*}\]
Definition 1.14 (Conditional distribution) Let \((\boldsymbol{X}^\top, \boldsymbol{Y}^\top)^\top\) be a \(d\)-dimensional random vector with joint density or mass function \(f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})\) and marginal density \(f_{\boldsymbol{X}}(\boldsymbol{x})\). The conditional density (or mass) function of \(\boldsymbol{Y}\) given \(\boldsymbol{X}=\boldsymbol{x}\) is \[\begin{align*} f_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{y}; \boldsymbol{x}) = \frac{f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})}{f_{\boldsymbol{X}}(\boldsymbol{x})} \end{align*}\] for any value of \(\boldsymbol{x}\) in the support of \(\boldsymbol{X}\), i.e., the set of values with non-zero density or mass, meaning \(f_{\boldsymbol{X}}(\boldsymbol{x})>0\); it is undefined otherwise.
Theorem 1.1 (Bayes’ theorem) Denote by \(p(X) \equiv \Pr(X)\) the marginal density of \(X\), by \(p(X \mid Y)\) the conditional density of \(X\) given \(Y\), and by \(p(X, Y)\) the joint density. Bayes’ theorem states that \[\begin{align*} p(X = x \mid Y=y) = \frac{p(Y = y \mid X = x)p(X=x)}{p(Y=y)}. \end{align*}\]
In the case of discrete random variable \(X\) with support \(\mathcal{X},\) the denominator can be evaluated using the law of total probability as \[\Pr(Y=y) = \sum_{x \in \mathcal{X}}\Pr(Y =y \mid X=x)\Pr(X=x).\]
Example 1.1 (Covid rapid tests) Back in January 2021, the Quebec government was debating whether or not to distribute rapid antigen tests, with strong reluctance from authorities given the paucity of available resources and the tests’ poor sensitivity.
A Swiss study analysed the efficiency of rapid antigen tests, comparing them to repeated polymerase chain reaction (PCR) tests taken as a benchmark (Jegerlehner et al. 2021). The results are presented in Table 1.1.
Table 1.1: Results of rapid antigen tests against the repeated PCR benchmark (Jegerlehner et al. 2021).

| | PCR \(+\) | PCR \(-\) |
|---|---|---|
| rapid \(+\) | 92 | 2 |
| rapid \(-\) | 49 | 1319 |
| total | 141 | 1321 |
The estimated seropositivity at the end of January 2021, according to projections of the Institute for Health Metrics and Evaluation (IHME), was 8.18M out of 38M inhabitants (Mathieu et al. 2020), a prevalence of 21.4%. Assuming the latter holds uniformly over the country, what is the probability of having Covid if I get a negative result to a rapid test?
Let \(R^{-}\) (\(R^{+}\)) denote a negative (positive) rapid test result and \(C^{+}\) (\(C^{-}\)) Covid positivity (negativity). Bayes’ formula gives \[\begin{align*} \Pr(C^{+} \mid R^{-}) & = \frac{\Pr(R^{-} \mid C^{+})\Pr(C^{+})}{\Pr(R^{-} \mid C^{+})\Pr(C^{+}) + \Pr(R^{-} \mid C^{-})\Pr(C^{-})} \\&= \frac{49/141 \cdot 0.214}{49/141 \cdot 0.214 + 1319/1321 \cdot 0.786}, \end{align*}\] so there is a small, but non-negligible probability of 8.66% that the rapid test result is misleading. Jegerlehner et al. (2021) indeed found that the sensitivity was 65.3% among symptomatic individuals, but dropped to 44% for asymptomatic cases. This may have fueled government experts’ skepticism.
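The computation can be reproduced in a couple of lines of R code.

```r
# Bayes' rule calculation of Example 1.1
prev <- 0.214                   # assumed prevalence, Pr(C+)
p_neg_pos <- 49/141             # Pr(R- | C+), from Table 1.1
p_neg_neg <- 1319/1321          # Pr(R- | C-), from Table 1.1
p_neg_pos * prev / (p_neg_pos * prev + p_neg_neg * (1 - prev))
```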
Bayes’ rule is central to updating beliefs: given initial beliefs (priors) and new information in the form of data, we update our beliefs accordingly.
Example 1.2 (Conditional and marginal for contingency table) Consider a bivariate distribution for \((Y_1, Y_2)\) supported on \(\{1, 2, 3\} \times \{1, 2\}\) whose joint probability mass function is given in Table 1.2.

Table 1.2: Joint probability mass function of \((Y_1, Y_2)\).

| | \(Y_1=1\) | \(Y_1=2\) | \(Y_1=3\) |
|---|---|---|---|
| \(Y_2=1\) | 0.20 | 0.30 | 0.10 |
| \(Y_2=2\) | 0.15 | 0.20 | 0.05 |
The marginal distribution of \(Y_1\) is obtained by looking at the total probability for each column, as \[\Pr(Y_1=i) = \Pr(Y_1=i, Y_2=1)+ \Pr(Y_1=i, Y_2=2).\] This gives \(\Pr(Y_1=1)=0.35\), \(\Pr(Y_1=2)=0.5\) and \(\Pr(Y_1=3) = 0.15\). Similarly, we find that \(\Pr(Y_2=1)=0.6\) and \(\Pr(Y_2=2)=0.4\) for the other random variable.
The conditional distribution of \(Y_2\) given \(Y_1=2\) is \[\Pr(Y_2 = i \mid Y_1=2) = \frac{\Pr(Y_1=2, Y_2=i)}{\Pr(Y_1=2)},\] so \(\Pr(Y_2 = 1 \mid Y_1=2) = 0.3/0.5 = 0.6\) and \(\Pr(Y_2=2 \mid Y_1=2) = 0.4\). We can condition on more complicated events, for example \[\begin{align*} \Pr(Y_2 = i \mid Y_1 \ge 2) = \frac{\Pr(Y_1=2, Y_2=i) + \Pr(Y_1=3, Y_2=i)}{\Pr(Y_1=2) + \Pr(Y_1=3)}. \end{align*}\]
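These calculations are easily replicated by storing the joint mass function of Table 1.2 as a matrix (a small sketch; the object name joint is arbitrary).

```r
# Joint mass function of (Y1, Y2), with Y2 in rows and Y1 in columns
joint <- matrix(c(0.20, 0.15, 0.30, 0.20, 0.10, 0.05), nrow = 2,
                dimnames = list(Y2 = 1:2, Y1 = 1:3))
colSums(joint)                # marginal distribution of Y1
rowSums(joint)                # marginal distribution of Y2
joint[, 2] / sum(joint[, 2])  # conditional distribution of Y2 given Y1 = 2
```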
Example 1.3 (Margins and conditional distributions of multinomial vectors) Consider \(\boldsymbol{Y} = (Y_1, Y_2, n-Y_1-Y_2)\) a trinomial vector giving the number of observations in group \(j \in \{1,2,3\}\) with \(n\) trials and probabilities of each component respectively \((p_1, p_2, 1-p_1-p_2)\). The marginal distribution of \(Y_2\) is obtained by summing over all possible values of \(Y_1\), which ranges from \(0\) to \(n-y_2\), so \[\begin{align*} f(y_2) &= \frac{n!p_2^{y_2}}{y_2!}\sum_{y_1=0}^{n-y_2} \frac{p_1^{y_1}(1-p_1 -p_2)^{n-y_1-y_2}}{y_1!(n-y_1-y_2)!}. \end{align*}\]
A useful trick is to complete the expression on the right so that it sums (in the discrete case) or integrates (in the continuous case) to \(1.\) Writing \(p_1^\star=p_1/(1-p_2)\), so that \(1-p_1-p_2 = (1-p_2)(1-p_1^{\star})\), and multiplying and dividing by \((n-y_2)!\), we get
\[\begin{align*} f(y_2) &= \frac{n!p_2^{y_2}(1-p_2)^{n-y_2}}{y_2!(n-y_2)!}\sum_{y_1=0}^{n-y_2} \binom{n-y_2}{y_1} p_1^{\star y_1}(1-p^{\star}_1)^{n-y_2-y_1} \\&= \binom{n}{y_2}p_2^{y_2}(1-p_2)^{n-y_2}, \end{align*}\] so \(Y_2\) is binomial with \(n\) trials and probability of success \(p_2\). We can generalize this argument to multinomials of arbitrary dimensions.
The conditional density of \(Y_2 \mid Y_1=y_1\) is, up to proportionality, \[\begin{align*} f_{Y_2 \mid Y_1}(y_2; y_1) &\propto \frac{p_2^{y_2}(1-p_1 -p_2)^{n-y_1-y_2}}{y_2!(n-y_1-y_2)!} \end{align*}\] If we write \(p_2^\star=p_2/(1-p_1)\), we find that \(Y_2 \mid Y_1 \sim \mathsf{binom}(n-y_1, p_2^\star)\). Indeed, we can see that \[\begin{align*} f_{\boldsymbol{Y}}(\boldsymbol{y}) &= f_{Y_2 \mid Y_1}(y_2; y_1) f_{Y_1}(y_1) \\&= \binom{n-y_1}{y_2} \left( \frac{p_2}{1-p_1}\right)^{y_2}\left(\frac{1-p_1-p_2}{1-p_1}\right)^{n-y_1-y_2}\!\!\cdot\binom{n}{y_1} p_1^{y_1}(1-p_1)^{n-y_1}. \end{align*}\]
Example 1.4 Consider the bivariate density function of the pair \((X, Y)\), where for \(\lambda>0\), \[\begin{align*} f(x,y) = \frac{\lambda y^{1/2}}{(2\pi)^{1/2}}\exp\left\{ - y(x^2/2 + \lambda)\right\}, \qquad x \in \mathbb{R}, y>0. \end{align*}\] We see that the conditional distribution of \(X \mid Y=y\) is \(\mathsf{Gauss}(0, y^{-1})\). The marginal of \(Y\) is \[\begin{align*} f(y) &= \int_{-\infty}^\infty \frac{\lambda y^{1/2}}{(2\pi)^{1/2}}\exp\left\{ - y(x^2/2 + \lambda)\right\} \mathrm{d} x \\&= \lambda \exp(-\lambda y), \end{align*}\] so marginally \(Y\) follows an exponential distribution with rate \(\lambda\). The marginal of \(X\) can be obtained by noting that the joint density, as a function of \(y\), is proportional to the kernel of a gamma distribution with shape \(3/2\) and rate \(x^2/2+\lambda\), with \(Y \mid X=x \sim \mathsf{gamma}(3/2, x^2/2+\lambda)\). If we pull out the normalizing constant, we find \[\begin{align*} f(x) &= \int_{0}^{\infty} \frac{\lambda y^{1/2}}{(2\pi)^{1/2}}\exp\left\{ - y(x^2/2 + \lambda)\right\} \mathrm{d} y\\&= \frac{\lambda \Gamma(3/2)}{(2\pi)^{1/2}(x^2/2+\lambda)^{3/2}} \int_0^\infty f_{Y \mid X}(y \mid x) \mathrm{d} y \\ &= \frac{\lambda}{(x^2+2\lambda)^{3/2}} \end{align*}\] since \(\Gamma(a+1) = a\Gamma(a)\) for \(a>0\) and \(\Gamma(1/2) = \sqrt{\pi}\). We conclude that marginally \(X \sim \mathsf{Student}(0, \lambda, 2)\), a Student distribution with scale \(\lambda\) and two degrees of freedom.
Example 1.5 (Poisson counts conditional on their sum) Suppose that \(Y_j\sim \mathsf{Poisson}(\lambda_j)\) are independent \((j=1, \ldots, J),\) so that the sum \(S = Y_1 + \cdots + Y_J \sim \mathsf{Poisson}(\lambda_1 + \cdots + \lambda_J)\). One may prove this fact using moment generating functions, since the Poisson is infinitely divisible.
Let \(f(y; \lambda_j)\) denote the Poisson mass function with rate \(\lambda_j\). Since the random variables \(Y_1, \ldots, Y_J\) are independent, their joint mass function factorizes, and the conditional mass function given the sum is \[\begin{align*} &f(y_1, \ldots, y_J \mid S=n) = \frac{\prod_{j=1}^J f(y_j; \lambda_j)}{f(n; \lambda_1 + \cdots + \lambda_J)} \\&\quad=\prod_{j=1}^J\frac{\exp(-\lambda_j)\lambda_j^{y_j}}{y_j!}\cdot\frac{n!}{\exp\{-(\lambda_1 + \cdots + \lambda_J)\} (\lambda_1 + \cdots + \lambda_J)^{n}} \\&\quad= \frac{n!}{ (\lambda_1 + \cdots + \lambda_J)^{n}}\prod_{j=1}^J\frac{\lambda_j^{y_j}}{y_j!} \\&\quad=\frac{n!}{y_1! \cdots y_J!}\prod_{j=1}^J \left(\frac{\lambda_j}{\lambda_1 + \cdots + \lambda_J}\right)^{y_j} \end{align*}\] for \(y_j \geq 0\) for all \(j=1, \ldots, J\) such that \(y_1 + \cdots + y_J = n\), and the joint mass is zero otherwise. If we define \(p_j = \lambda_j/(\lambda_1 + \cdots + \lambda_J)\), we recognize a multinomial distribution with \(n\) trials and probability vector \(\boldsymbol{p}\), where \(p_1 + \cdots + p_J=1\).
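A quick Monte Carlo check of this result, with arbitrary rates \(\boldsymbol{\lambda} = (1, 3, 6)\), compares the empirical conditional distribution of \(Y_1\) given the sum with the implied binomial margin.

```r
# Poisson counts conditional on their sum are multinomial, so the margin of Y1
# given S = n is binomial(n, lambda_1 / sum(lambda))
lambda <- c(1, 3, 6)
B <- 1e5
Y <- matrix(rpois(3 * B, lambda = lambda), nrow = 3)
S <- colSums(Y)
prop.table(table(factor(Y[1, S == 10], levels = 0:10)))
dbinom(0:10, size = 10, prob = lambda[1] / sum(lambda))
```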
Example 1.6 (Bivariate geometric distribution of Marshall and Olkin) Consider a couple \((U_1, U_2)\) of Bernoulli random variables whose mass function is \(\Pr(U_1=i, U_2=j)=p_{ij}\) for \((i,j) \in \{0,1\}^2\). The marginal distributions are, by the law of total probability, \[\begin{align*} \Pr(U_1=i) &= \Pr(U_1=i, U_2=0) + \Pr(U_1=i, U_2=1) = p_{i0} + p_{i1} = p_{i\bullet}\\ \Pr(U_2=j) &= \Pr(U_1=0, U_2=j) + \Pr(U_1=1, U_2=j) = p_{0j} + p_{1j} = p_{\bullet j}. \end{align*}\] We consider a joint geometric distribution (Marshall and Olkin (1985), Section 6) and the pair \((Y_1, Y_2)\) giving the number of zeros for \((U_1, U_2)\) before the variable equals one for the first time. The bivariate mass function is (Nadarajah 2008) \[\begin{align*} \Pr(Y_1 = k, Y_2 = l) &= \begin{cases} p_{00}^kp_{01}p_{0 \bullet}^{l-k-1}p_{1 \bullet} & 0 \leq k < l; \\ p_{00}^kp_{11} & k=l; \\ p_{00}^lp_{10}p_{\bullet 0}^{k-l-1}p_{\bullet 1} & 0 \leq l < k. \end{cases} \end{align*}\] We can compute the joint survival function \(\Pr(Y_1 \ge k, Y_2 \ge l)\) by using the tail sum of a geometric series, \(\sum_{i=n}^\infty p^i = p^n/(1-p)\) for \(0<p<1\). Thus, for the case \(0\leq k < l\), we have \[\begin{align*} \Pr(Y_1 \ge k, Y_2 \ge l) &= \sum_{i=k}^\infty \sum_{j=l}^\infty \Pr(Y_1 = i, Y_2 = j)\\ &= \sum_{i=k}^\infty p_{00}^ip_{01}p_{0 \bullet}^{-i-1}p_{1 \bullet} \sum_{j=l}^\infty p_{0 \bullet}^{j} \\&=\sum_{i=k}^\infty p_{00}^ip_{01}p_{0 \bullet}^{-i-1}p_{1 \bullet} \frac{p_{0 \bullet}^{l}}{1-p_{0\bullet}} \\&= p_{0 \bullet}^{l-1}p_{01} \sum_{i=k}^\infty \left(\frac{p_{00}}{p_{0 \bullet}}\right)^{i} \\& = p_{00}^k p_{0 \bullet}^{l-k} \end{align*}\] since \(p_{0\bullet} + p_{1\bullet} = 1\). We can proceed similarly with other subcases to find \[\begin{align*} \Pr(Y_1 \ge k, Y_2 \ge l) = \begin{cases} p_{00}^k p_{0 \bullet}^{l-k} & 0\leq k < l\\ p_{00}^k & 0 \leq k=l \\ p_{00}^l p_{\bullet 0}^{k-l} & 0\leq l < k\\ \end{cases} \end{align*}\] and we can obtain the marginal survival function by considering \(\Pr(Y_1 \ge 0, Y_2 \ge l)\), etc., which yields \(\Pr(Y_2 \ge l) = p_{0 \bullet}^{l}\), whence \[\begin{align*} \Pr(Y_2 = l) &= \Pr(Y_2 \ge l) -\Pr(Y_2 \ge l+1) \\&= p_{0 \bullet}^{l} (1-p_{0 \bullet}) \\&=p_{0 \bullet}^{l}p_{1\bullet} \end{align*}\] and so both margins are geometric.
Definition 1.15 (Independence) We say that \(\boldsymbol{Y}\) and \(\boldsymbol{X}\) are independent if their joint distribution function factorizes as \[\begin{align*} F_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y}) = F_{\boldsymbol{X}}(\boldsymbol{x})F_{\boldsymbol{Y}}(\boldsymbol{y}) \end{align*}\] for any value of \(\boldsymbol{x}\), \(\boldsymbol{y}\). It follows from the definition of the joint density that, should the latter exist, it also factorizes as \[\begin{align*} f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y}) = f_{\boldsymbol{X}}(\boldsymbol{x})f_{\boldsymbol{Y}}(\boldsymbol{y}). \end{align*}\]
If two subvectors \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are independent, then the conditional density \(f_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{y}; \boldsymbol{x})\) equals the marginal \(f_{\boldsymbol{Y}}(\boldsymbol{y})\).
Proposition 1.2 (Gaussian vectors, independence and conditional independence properties)
A unique property of the multivariate normal distribution is the link between independence and the covariance matrix: components \(Y_i\) and \(Y_j\) are independent if and only if the \((i,j)\) off-diagonal entry of the covariance matrix \(\boldsymbol{Q}^{-1}\) is zero.
If \(q_{ij}=0\), then \(Y_i\) and \(Y_j\) are conditionally independent given the other components.
Proposition 1.3 (Law of iterated expectation and variance) Let \(\boldsymbol{Z}\) and \(\boldsymbol{Y}\) be random vectors. The expected value of \(\boldsymbol{Y}\) is \[\begin{align*} \mathsf{E}_{\boldsymbol{Y}}(\boldsymbol{Y}) = \mathsf{E}_{\boldsymbol{Z}}\left\{\mathsf{E}_{\boldsymbol{Y} \mid \boldsymbol{Z}}(\boldsymbol{Y})\right\}. \end{align*}\]
The tower property gives a law of iterated variance \[\begin{align*} \mathsf{Va}_{\boldsymbol{Y}}(\boldsymbol{Y}) = \mathsf{E}_{\boldsymbol{Z}}\left\{\mathsf{Va}_{\boldsymbol{Y} \mid \boldsymbol{Z}}(\boldsymbol{Y})\right\} + \mathsf{Va}_{\boldsymbol{Z}}\left\{\mathsf{E}_{\boldsymbol{Y} \mid \boldsymbol{Z}}(\boldsymbol{Y})\right\}. \end{align*}\]
We can deduce that, in a hierarchical model, the variance of the unconditional distribution is larger than the expected variance of the conditional distribution, since both terms on the right-hand side are nonnegative.
Example 1.7 Let \(Y \mid X\sim \mathsf{Gauss}(X, \sigma^2)\) and \(X \sim \mathsf{Gauss}(0, \tau^2)\). The unconditional mean and variance of \(Y\) are \[\begin{align*} \mathsf{E}(Y) = \mathsf{E}_{X}\{\mathsf{E}_{Y\mid X}(Y)\}= \mathsf{E}_{X}(X) = 0 \end{align*}\] and \[\begin{align*} \mathsf{Va}(Y) &= \mathsf{E}_{X}\{\mathsf{Va}_{Y\mid X}(Y)\} + \mathsf{Va}_{X}\{\mathsf{E}_{Y\mid X}(Y)\} \\&= \mathsf{E}_{X}(\sigma^2) + \mathsf{Va}_{X}(X) \\&= \sigma^2 + \tau^2 \end{align*}\]
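A simulation confirms the calculation (a sketch with arbitrary values \(\sigma=2\) and \(\tau=3\)).

```r
# Monte Carlo check of Example 1.7: Va(Y) = sigma^2 + tau^2
sigma <- 2; tau <- 3
B <- 1e5
X <- rnorm(B, mean = 0, sd = tau)
Y <- rnorm(B, mean = X, sd = sigma)
c(empirical = var(Y), theoretical = sigma^2 + tau^2)
```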
Example 1.8 (Negative binomial as a Poisson mixture)
One limitation of the Poisson model is that the restriction on its moments, namely the equality of the mean and the variance, is often unrealistic. The most frequent problem encountered is that of overdispersion, meaning that the variability in the counts is larger than that implied by a Poisson distribution.
One common framework for handling overdispersion is to take \(Y \mid \Lambda = \lambda \sim \mathsf{Poisson}(\lambda)\), where the mean of the Poisson distribution is itself a positive random variable with mean \(\mu\). If \(\Lambda\) follows a conjugate gamma distribution with shape \(k\mu\) and rate \(k>0\), \(\Lambda \sim \mathsf{gamma}(k\mu, k)\), then the posterior is \(\Lambda \mid Y=y \sim \mathsf{gamma}(k\mu + y, k+1)\).
Since the joint density of \(Y\) and \(\Lambda\) can be written \[ p(y, \lambda) = p(y \mid \lambda)p(\lambda) = p(\lambda \mid y) p(y) \] we can isolate the marginal density \[\begin{align*} p(y) &= \frac{p(y \mid \lambda)p(\lambda)}{p(\lambda \mid y)} \\&= \frac{\frac{\lambda^y\exp(-\lambda)}{\Gamma(y+1)} \frac{k^{k\mu}\lambda^{k\mu-1}\exp(-k\lambda)}{\Gamma(k\mu)}}{ \frac{(k+1)^{k\mu+y}\lambda^{k\mu+y-1}\exp\{-(k+1)\lambda\}}{\Gamma(k\mu+y)}}\\ &= \frac{\Gamma(k\mu+y)}{\Gamma(k\mu)\Gamma(y+1)}k^{k\mu} (k+1)^{-k\mu-y}\\&= \frac{\Gamma(k\mu+y)}{\Gamma(k\mu)\Gamma(y+1)}\left(1-\frac{1}{k+1}\right)^{k\mu} \left(\frac{1}{k+1}\right)^y \end{align*}\] and this is the density of a negative binomial distribution with probability of success \(1/(k+1)\). We can thus view the negative binomial as a Poisson mean mixture.
By the laws of iterated expectation and iterated variance, \[\begin{align*} \mathsf{E}(Y) &= \mathsf{E}_{\Lambda}\{\mathsf{E}(Y \mid \Lambda)\} \\& = \mathsf{E}(\Lambda) = \mu\\ \mathsf{Va}(Y) &= \mathsf{E}_{\Lambda}\{\mathsf{Va}(Y \mid \Lambda)\} + \mathsf{Va}_{\Lambda}\{\mathsf{E}(Y \mid \Lambda)\} \\&= \mathsf{E}(\Lambda) + \mathsf{Va}(\Lambda) \\&= \mu + \mu/k. \end{align*}\] The marginal distribution of \(Y\), unconditionally, has a variance which exceeds its mean, as \[\begin{align*} \mathsf{E}(Y) = \mu, \qquad \mathsf{Va}(Y) = \mu (1+1/k). \end{align*}\] In a negative binomial regression model, the term \(k\) is a dispersion parameter, which is fixed for all observations, whereas \(\mu = \exp(\mathbf{x}^\top\boldsymbol{\beta})\) is a function of covariates \(\mathbf{x}\). As \(k \to \infty\), the distribution of \(\Lambda\) degenerates to a constant at \(\mu\) and we recover the Poisson model.
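The mixture representation is convenient for simulation; the sketch below uses arbitrary values \(\mu=4\) and \(k=2\) and checks the unconditional mean and variance empirically.

```r
# Simulate the Poisson-gamma hierarchy of Example 1.8 and check the moments
mu <- 4; k <- 2
B <- 1e5
Lambda <- rgamma(B, shape = k * mu, rate = k)
Y <- rpois(B, lambda = Lambda)
c(mean = mean(Y), variance = var(Y))   # approximately mu and mu * (1 + 1/k)
```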
Proposition 1.4 (Partitioning of covariance matrices) Let \(\boldsymbol{\Sigma}\) be a \(d \times d\) positive definite covariance matrix. We define the precision matrix \(\boldsymbol{Q} = \boldsymbol{\Sigma}^{-1}\). Suppose the matrices are partitioned into blocks, \[\begin{align*} \boldsymbol{\Sigma}= \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \text{ and } \boldsymbol{\Sigma}^{-1}= \boldsymbol{Q} = \begin{pmatrix} \boldsymbol{Q}_{11} &\boldsymbol{Q}_{12} \\ \boldsymbol{Q}_{21} & \boldsymbol{Q}_{22} \end{pmatrix} \end{align*}\] with \(\dim(\boldsymbol{\Sigma}_{11})=k\times k\) and \(\dim(\boldsymbol{\Sigma}_{22})=(d-k) \times (d-k)\). One can show the following relationships:
- \(\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}=-\boldsymbol{Q}_{11}^{-1}\boldsymbol{Q}_{12}\)
- \(\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}=\boldsymbol{Q}_{11}^{-1}\)
- \(\det(\boldsymbol{\Sigma})=\det(\boldsymbol{\Sigma}_{22})\det(\boldsymbol{\Sigma}_{1|2})\) where \(\boldsymbol{\Sigma}_{1|2}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\).
Proposition 1.5 (Conditional distribution of Gaussian vectors) Let \(\boldsymbol{Y} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) and consider the partition \[\begin{align*} \boldsymbol{Y} = \begin{pmatrix} \boldsymbol{Y}_1 \\ \boldsymbol{Y}_2\end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{pmatrix}, \end{align*}\] where \(\boldsymbol{Y}_1\) is a \(k \times 1\) and \(\boldsymbol{Y}_2\) is a \((d-k) \times 1\) vector for some \(1\leq k < d.\) Then, we have the conditional distribution \[\begin{align*} \boldsymbol{Y}_1 \mid \boldsymbol{Y}_2 =\boldsymbol{y}_2 &\sim \mathsf{Gauss}_k(\boldsymbol{\mu}_1+\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}(\boldsymbol{y}_2-\boldsymbol{\mu}_2), \boldsymbol{\Sigma}_{1|2}) \\& \sim \mathsf{Gauss}_k(\boldsymbol{\mu}_1-\boldsymbol{Q}_{11}^{-1}\boldsymbol{Q}_{12}(\boldsymbol{y}_2-\boldsymbol{\mu}_2), \boldsymbol{Q}^{-1}_{11}) \end{align*}\] and \(\boldsymbol{\Sigma}_{1|2}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\) is the Schur complement of \(\boldsymbol{\Sigma}_{22}\).
Proof. It is easier to obtain this result by expressing the density of the Gaussian distribution in terms of the precision matrix \(\boldsymbol{Q}= \left(\begin{smallmatrix}\boldsymbol{Q}_{11} & \boldsymbol{Q}_{12}\\ \boldsymbol{Q}_{21} & \boldsymbol{Q}_{22}\end{smallmatrix}\right)\) rather than in terms of the covariance matrix \(\boldsymbol{\Sigma}\).
Consider the partition \(\boldsymbol{Y}=(\boldsymbol{Y}_1, \boldsymbol{Y}_2)\). The log conditional density \(\log f(\boldsymbol{y}_1 \mid \boldsymbol{y}_2)\), viewed as a function of \(\boldsymbol{y}_1\), is \[\begin{align*} &-\frac{1}{2}\left(\boldsymbol{y}_1-\boldsymbol{\mu}_1\right)^\top \boldsymbol{Q}_{11}\left(\boldsymbol{y}_1-\boldsymbol{\mu}_1\right) - \left(\boldsymbol{y}_1-\boldsymbol{\mu}_1\right)^\top \boldsymbol{Q}_{12}\left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right)\\ &\qquad = -\frac{1}{2}\boldsymbol{y}_1^\top\boldsymbol{Q}_{11}\boldsymbol{y}_1+\boldsymbol{y}_1^\top \left\{\boldsymbol{Q}_{11}\boldsymbol{\mu}_1-\boldsymbol{Q}_{12}\left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right)\right\} + \text{constant}, \end{align*}\] upon expanding the quadratic form and discarding terms that do not depend on \(\boldsymbol{y}_1\). Completing the square in \(\boldsymbol{y}_1\) shows that this is proportional to the log density of a Gaussian distribution (and hence must be Gaussian) with precision matrix \(\boldsymbol{Q}_{11}\), while the mean vector and covariance matrix are \[\begin{align*} \boldsymbol{\mu}_{1|2} &= \boldsymbol{\mu}_1-\boldsymbol{Q}_{11}^{-1}\boldsymbol{Q}_{12} \left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right) \\&=\boldsymbol{\mu}_1+ \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right) \\\boldsymbol{\Sigma}_{1|2} &= \boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}. \end{align*}\] Note that \(\boldsymbol{\Sigma}_{1|2}=\boldsymbol{Q}_{11}^{-1}\) corresponds to the Schur complement of \(\boldsymbol{\Sigma}_{22}\).
Remark that the above argument is sufficient to identify the conditional distribution (why?): the quadratic form appearing in the exponential term of the density of a Gaussian vector with mean \(\boldsymbol{\nu}\) and precision \(\boldsymbol{\varPsi}\), \[\begin{align*} (\boldsymbol{x}-\boldsymbol{\nu})^\top\boldsymbol{\varPsi}(\boldsymbol{x}-\boldsymbol{\nu})= \boldsymbol{x}^\top\boldsymbol{\varPsi}\boldsymbol{x} - \boldsymbol{x}^\top\boldsymbol{\varPsi}\boldsymbol{\nu} - \boldsymbol{\nu}^\top\boldsymbol{\varPsi}\boldsymbol{x} + \boldsymbol{\nu}^\top\boldsymbol{\varPsi}\boldsymbol{\nu}, \end{align*}\] uniquely determines the parameters of the Gaussian distribution. The quadratic term in \(\boldsymbol{x}\) forms a sandwich around the precision matrix, while the linear term identifies the location vector. Since any (conditional) density function integrates to one, there is a unique normalizing constant and the latter need not be computed.
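The conditional mean and covariance formulas of Proposition 1.5 translate directly into code; the helper below (the name cond_gauss, the indices and the test matrix are illustrative choices) computes them for an arbitrary partition.

```r
# Conditional mean and covariance of Y[ind1] given the other components (Prop. 1.5)
cond_gauss <- function(y2, mu, Sigma, ind1) {
  ind2 <- setdiff(seq_along(mu), ind1)
  S12_S22inv <- Sigma[ind1, ind2, drop = FALSE] %*%
    solve(Sigma[ind2, ind2, drop = FALSE])
  list(mean = c(mu[ind1] + S12_S22inv %*% (y2 - mu[ind2])),
       cov = Sigma[ind1, ind1, drop = FALSE] -
             S12_S22inv %*% Sigma[ind2, ind1, drop = FALSE])
}
Sigma <- matrix(c(2, 1, 0.5, 1, 3, 1, 0.5, 1, 1.5), nrow = 3)
cond_gauss(y2 = c(0, 2), mu = c(1, 0, -1), Sigma = Sigma, ind1 = 1)
```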
1.2 Likelihood
Definition 1.16 (Likelihood) The likelihood \(L(\boldsymbol{\theta})\) is a function of the parameter vector \(\boldsymbol{\theta}\) that gives the probability (or density) of observing a sample under a postulated distribution, treating the observations as fixed, \[\begin{align*} L(\boldsymbol{\theta}; \boldsymbol{y}) = f(\boldsymbol{y}; \boldsymbol{\theta}), \end{align*}\] where \(f(\boldsymbol{y}; \boldsymbol{\theta})\) denotes the joint density or mass function of the \(n\)-vector containing the observations.
If the latter are independent, the joint density factorizes as the product of the density of individual observations, and the likelihood becomes \[\begin{align*} L(\boldsymbol{\theta}; \boldsymbol{y})=\prod_{i=1}^n f_i(y_i; \boldsymbol{\theta}) = f_1(y_1; \boldsymbol{\theta}) \times \cdots \times f_n(y_n; \boldsymbol{\theta}). \end{align*}\] The corresponding log likelihood function for independent and identically distributed observations is \[\begin{align*} \ell(\boldsymbol{\theta}; \boldsymbol{y}) = \sum_{i=1}^n \ln f(y_i; \boldsymbol{\theta}). \end{align*}\]
Definition 1.17 (Score and information matrix) Let \(\ell(\boldsymbol{\theta}),\) \(\boldsymbol{\theta} \in \boldsymbol{\Theta} \subseteq \mathbb{R}^p,\) be the log likelihood function. The gradient of the log likelihood \(U(\boldsymbol{\theta}) = \partial \ell(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}\) is termed the score function.
The observed information matrix is the Hessian of the negative log likelihood, \[\begin{align*} j(\boldsymbol{\theta}; \boldsymbol{y})=-\frac{\partial^2 \ell(\boldsymbol{\theta}; \boldsymbol{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top}, \end{align*}\] typically evaluated at the maximum likelihood estimate \(\widehat{\boldsymbol{\theta}},\) giving \(j(\widehat{\boldsymbol{\theta}}).\) Under regularity conditions, the expected information, also called the Fisher information matrix, is \[\begin{align*} i(\boldsymbol{\theta}) = \mathsf{E}\left\{U(\boldsymbol{\theta}; \boldsymbol{Y}) U(\boldsymbol{\theta}; \boldsymbol{Y})^\top\right\} = \mathsf{E}\left\{j(\boldsymbol{\theta}; \boldsymbol{Y})\right\}. \end{align*}\] Both the Fisher (or expected) and the observed information matrices are symmetric; they encode the curvature of the log likelihood and provide information about the variability of \(\widehat{\boldsymbol{\theta}}.\)
The information of an independent and identically distributed sample of size \(n\) is \(n\) times that of a single observation, so information accumulates at a linear rate.
Example 1.9 (Likelihood for right-censoring) Consider a survival analysis problem for independent time-to-event data subject to (noninformative) random right-censoring. We assume failure times \(Y_i (i=1, \ldots, n)\) are drawn from a common distribution \(F(\cdot; \boldsymbol{\theta})\) supported on \((0, \infty)\), complemented with a censoring indicator \(C_i \in \{0,1\}\), where \(C_i=0\) indicates right-censoring and \(C_i=1\) an observed failure time. If individual \(i\) has not experienced the event at the end of the collection period, then its likelihood contribution is \(\Pr(Y_i > y_i) = 1-F(y_i; \boldsymbol{\theta})\), where \(y_i\) is the largest time at which individual \(i\) was observed.
We write the log likelihood in terms of the right-censoring binary indicator as \[\begin{align*} \ell(\boldsymbol{\theta}) = \sum_{i: c_i=0} \log \{1- F(y_i; \boldsymbol{\theta})\} + \sum_{i: c_i=1} \log f(y_i; \boldsymbol{\theta}) \end{align*}\]
Suppose for simplicity that \(Y_i \sim \mathsf{expo}(\lambda)\) and let \(m=c_1 + \cdots + c_n\) denote the number of observed failure times. Then, the log likelihood and the observed information are \[\begin{align*} \ell(\lambda) &= m \log \lambda - \lambda \sum_{i=1}^n y_i, \\ j(\lambda) &= m/\lambda^2, \end{align*}\] and the right-censored observations for the exponential model do not contribute to the information.
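As an illustration, setting the score to zero gives the maximum likelihood estimator \(\widehat{\lambda} = m/\sum_{i=1}^n y_i\), with standard error \(\widehat{\lambda}/\sqrt{m}\) from the observed information. The sketch below uses simulated data with an assumed exponential censoring mechanism.

```r
# Right-censored exponential data (Example 1.9)
set.seed(1234)
n <- 500
lambda <- 0.2
failure <- rexp(n, rate = lambda)
censoring <- rexp(n, rate = 0.1)           # assumed censoring distribution
y <- pmin(failure, censoring)              # observed times
delta <- as.integer(failure <= censoring)  # 1 = observed failure, 0 = censored
m <- sum(delta)
lambda_hat <- m / sum(y)
c(estimate = lambda_hat, std.error = lambda_hat / sqrt(m))
```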
Example 1.10 (Information for the Gaussian distribution) Consider \(Y \sim \mathsf{Gauss}(\mu, \tau^{-1})\), parametrized in terms of the precision \(\tau\). The log likelihood for a sample of size \(n\) is, up to proportionality, \[\begin{align*} \ell(\mu, \tau) \propto \frac{n}{2}\log(\tau) - \frac{\tau}{2}\sum_{i=1}^n(Y_i^2-2\mu Y_i+\mu^2). \end{align*}\]
The observed and Fisher information matrices are \[\begin{align*} j(\mu, \tau) &= \begin{pmatrix} n\tau & -\sum_{i=1}^n (Y_i-\mu)\\ -\sum_{i=1}^n (Y_i-\mu) & \frac{n}{2\tau^2} \end{pmatrix}, \\ i(\mu, \tau) &= n\begin{pmatrix} \tau & 0\\ 0 & \frac{1}{2\tau^2} \end{pmatrix}. \end{align*}\] Since \(\mathsf{E}(Y_i) = \mu\), the off-diagonal entries of the Fisher information matrix are zero.
Example 1.11 (Likelihood, score and information of the Weibull distribution) The log likelihood for a simple random sample whose realizations are \(y_1, \ldots, y_n\) of size \(n\) from a \(\mathsf{Weibull}(\lambda, \alpha)\) model is \[\begin{align*} \ell(\lambda, \alpha) = n \ln(\alpha) - n\alpha\ln(\lambda) + (\alpha-1) \sum_{i=1}^n \ln y_i - \lambda^{-\alpha}\sum_{i=1}^n y_i^\alpha. \end{align*}\]
The score, which is the gradient of the log likelihood, is easily obtained by differentiation \[\begin{align*} U(\lambda, \alpha) &= \begin{pmatrix}\frac{\partial \ell(\lambda, \alpha)}{\partial \lambda} \\ \frac{\partial \ell(\lambda, \alpha)}{\partial \alpha} \end{pmatrix} \\&= \begin{pmatrix} -\frac{n\alpha}{\lambda} +\alpha\lambda^{-\alpha-1}\sum_{i=1}^n y_i^\alpha \\ \frac{n}{\alpha} + \sum_{i=1}^n \ln (y_i/\lambda) - \sum_{i=1}^n \left(\frac{y_i}{\lambda}\right)^{\alpha} \times\ln\left(\frac{y_i}{\lambda}\right) \end{pmatrix} \end{align*}\] and the observed information is the \(2 \times 2\) matrix-valued function \[\begin{align*} j(\lambda, \alpha) &= - \begin{pmatrix} \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \lambda^2} & \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \lambda \partial \alpha} \\ \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \alpha \partial \lambda} & \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \alpha^2} \end{pmatrix} = \begin{pmatrix} j_{\lambda, \lambda} & j_{\lambda, \alpha} \\ j_{\lambda, \alpha} & j_{\alpha, \alpha} \end{pmatrix}, \end{align*}\] whose entries are \[\begin{align*} j_{\lambda, \lambda} &= \lambda^{-2}\left\{-n\alpha + \alpha(\alpha+1)\sum_{i=1}^n (y_i/\lambda)^\alpha\right\} \\ j_{\lambda, \alpha} &= \lambda^{-1}\sum_{i=1}^n [1-(y_i/\lambda)^\alpha\{1+\alpha\ln(y_i/\lambda)\}] \\ j_{\alpha,\alpha} &= n\alpha^{-2} + \sum_{i=1}^n (y_i/\lambda)^\alpha \{\ln(y_i/\lambda)\}^2. \end{align*}\] To compute the expected information matrix, we need the expectations \(\mathsf{E}\{(Y/\lambda)^\alpha\}\), \(\mathsf{E}[(Y/\lambda)^\alpha\ln\{(Y/\lambda)^\alpha\}]\) and \(\mathsf{E}[(Y/\lambda)^\alpha\ln^2\{(Y/\lambda)^\alpha\}]\). By definition, \[\begin{align*} \mathsf{E}\left\{(Y/\lambda)^\alpha\right\} & = \int_0^\infty (x/\lambda)^\alpha \frac{\alpha}{\lambda^\alpha} x^{\alpha-1}\exp\left\{-(x/\lambda)^\alpha\right\} \mathrm{d} x \\ &= \int_0^\infty s\exp(-s) \mathrm{d} s =1, \end{align*}\] making the change of variable \(S = (Y/\lambda)^\alpha\sim \mathsf{expo}(1)\). The two other integrals are tabulated in Gradshteyn and Ryzhik (2014), and are equal to \(1-\gamma\) and \(\gamma^2-2\gamma + \pi^2/6\), respectively, where \(\gamma \approx 0.577\) is the Euler–Mascheroni constant. The expected information matrix of the Weibull distribution has entries \[\begin{align*} i_{\lambda, \lambda} & = n \lambda^{-2}\alpha\left\{ (\alpha+1)-1\right\} \\ i_{\lambda, \alpha} & = -n\lambda^{-1} (1-\gamma) \\ i_{\alpha, \alpha} & = n\alpha^{-2}(1 + \gamma^2-2\gamma+\pi^2/6) \end{align*}\]
We can check this result numerically, by comparing the observed information matrix with the Hessian of the negative log likelihood obtained by numerical differentiation, and by comparing the Monte Carlo average of the observed information matrix, over repeated samples, with the expected information matrix.
```r
exp_info_weib <- function(scale, shape){
  i11 <- shape*((shape + 1) - 1)/(scale^2)
  i12 <- -(1+digamma(1))/scale
  i22 <- (1+digamma(1)^2+2*digamma(1)+pi^2/6)/(shape^2)
  matrix(c(i11, i12, i12, i22), nrow = 2, ncol = 2)
}
obs_info_weib <- function(y, scale, shape){
  ys <- y/scale # scale family
  o11 <- shape*((shape + 1)*mean(ys^shape)-1)/scale^2
  o12 <- (1-mean(ys^shape*(1+shape*log(ys))))/scale
  o22 <- 1/(shape*shape) + mean(ys^shape*(log(ys))^2)
  matrix(c(o11, o12, o12, o22), nrow = 2, ncol = 2)
}
nll_weib <- function(pars, y){
  -sum(dweibull(x = y, scale = pars[1], shape = pars[2], log = TRUE))
}
# Fix parameters
scale <- rexp(n = 1, rate = 0.5)
shape <- rexp(n = 1)
nobs <- 1000L
dat <- rweibull(n = nobs, scale = scale, shape = shape)
# Compare Hessian with numerical differentiation
o_info <- obs_info_weib(dat, scale = scale, shape = shape)
all.equal(
  numDeriv::hessian(nll_weib, x = c(scale, shape), y = dat) / nobs,
  o_info)
# Compute approximation to Fisher information
exp_info_sim <- replicate(n = 1000, expr = {
  obs_info_weib(y = rweibull(n = nobs, shape = shape, scale = scale),
                scale = scale, shape = shape)})
all.equal(apply(exp_info_sim, 1:2, mean),
          exp_info_weib(scale, shape))
```
The joint density function only factorizes for independent data, but an alternative sequential decomposition can be helpful. For example, we can write the joint density \(f(y_1, \ldots, y_n)\) using the factorization \[\begin{align*} f(\boldsymbol{y}) = f(y_1) \times f(y_2 \mid y_1) \times \cdots \times f(y_n \mid y_1, \ldots, y_{n-1}) \end{align*}\] in terms of conditional densities. Such a decomposition is particularly useful in the context of time series, where data are ordered from time \(1\) until time \(n\) and models typically relate observation \(y_t\) to its past.
Example 1.12 (First-order autoregressive process) Consider a Gaussian \(\mathsf{AR}(1)\) model of the form \[Y_t = \mu + \phi(Y_{t-1} - \mu) + \varepsilon_t,\] where \(\phi\) is the lag-one correlation, \(\mu\) the global mean and \(\varepsilon_t\) is an iid innovation with mean zero and variance \(\sigma^2\). If \(|\phi| < 1\), the process is stationary, and the variance does not increase with \(t\).
The Markov property states that the current realization depends on the past, \(Y_t \mid Y_1, \ldots, Y_{t-1},\) only through the most recent value \(Y_{t-1}.\) The log likelihood thus becomes \[\begin{align*} \ell(\boldsymbol{\theta}) = \ln f(y_1) + \sum_{t=2}^n \ln f(y_t \mid y_{t-1}). \end{align*}\]
If innovations are Gaussian, we have \[Y_t \mid Y_{t-1}=y_{t-1} \sim \mathsf{Gauss}\{\mu(1-\phi)+ \phi y_{t-1}, \sigma^2\}, \qquad t>1.\]
The stationary \(\mathsf{AR}(1)\) process \(Y_t\), marginally, has mean \(\mu\) and unconditional variance \(\sigma^2/(1-\phi^2)\). The \(\mathsf{AR}(1)\) process is first-order Markov since the conditional distribution \(p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p})\) equals \(p(Y_t \mid Y_{t-1})\). This means the log likelihood is \[\begin{align*} \ell(\mu, \phi,\sigma^2)& = -\frac{n}{2}\log(2\pi) - n\log \sigma + \frac{1}{2}\log(1-\phi^2) \\&\quad -\frac{(1-\phi^2)(y_1- \mu)^2}{2\sigma^2} - \sum_{t=2}^n \frac{\{y_t - \mu(1-\phi)- \phi y_{t-1}\}^2}{2\sigma^2}. \end{align*}\]
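The sketch below evaluates this log likelihood and checks it against the equivalent sequential factorization using dnorm(); the simulated series and the parameter values are arbitrary choices.

```r
# Exact Gaussian AR(1) log likelihood of Example 1.12
ar1_loglik <- function(y, mu, phi, sigma) {
  n <- length(y)
  -n/2 * log(2 * pi) - n * log(sigma) + 0.5 * log(1 - phi^2) -
    (1 - phi^2) * (y[1] - mu)^2 / (2 * sigma^2) -
    sum((y[-1] - mu * (1 - phi) - phi * y[-n])^2) / (2 * sigma^2)
}
set.seed(2023)
mu <- 1; phi <- 0.5; sigma <- 2
y <- mu + as.numeric(arima.sim(model = list(ar = phi), n = 200, sd = sigma))
# Same value from the factorization f(y1) * prod_t f(y_t | y_{t-1})
ll_seq <- dnorm(y[1], mean = mu, sd = sigma / sqrt(1 - phi^2), log = TRUE) +
  sum(dnorm(y[-1], mean = mu * (1 - phi) + phi * y[-length(y)],
            sd = sigma, log = TRUE))
all.equal(ar1_loglik(y, mu, phi, sigma), ll_seq)
```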
1.3 Monte Carlo methods
Consider a target distribution with finite expected value. The law of large numbers guarantees that, if we can draw observations from this target distribution, then the sample average converges to its expected value as the number of draws increases.
We can thus compute the probability of any event or the expected value of any (integrable) function by computing sample averages; the cost to pay for this generality is randomness.
Specifically, suppose we are interested in the average \(\mathsf{E}\{g(X)\}\) of \(X \sim F\) for some function \(g\).
Example 1.13 Consider \(X \sim \mathsf{Gamma}(\alpha, \beta)\), a gamma distribution with shape \(\alpha\) and rate \(\beta\). We can compute the probability that \(X < 1\) easily by Monte Carlo since \(\Pr(X <1) = \mathsf{E}\{\mathrm{I}(X<1)\}\) and this means we only need to compute the proportion of draws less than one. We can likewise compute the mean \(g(x) = x\) or variance.
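A sketch of these calculations, using shape \(\alpha = 1/2\) and rate \(\beta = 2\) (values consistent with the variances quoted below):

```r
# Monte Carlo estimates for X ~ gamma(1/2, 2)
alpha <- 0.5; beta <- 2
B <- 1e5
x <- rgamma(B, shape = alpha, rate = beta)
mean(x < 1)                              # Monte Carlo estimate of Pr(X < 1)
pgamma(1, shape = alpha, rate = beta)    # exact value
c(mean(x), alpha / beta)                 # Monte Carlo and exact mean
```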
Suppose we have drawn a Monte Carlo sample of size \(B\). If the function \(g(\cdot)\) is square integrable, with variance \(\sigma^2_g\), then a central limit theorem applies. In large samples and for independent observations, our Monte Carlo average \(\widehat{\mu}_g = B^{-1}\sum_{b=1}^B g(X_b)\) has variance \(\sigma^2_g/B\). We can approximate the unknown variance \(\sigma^2_g\) by its empirical counterpart. Note that, while the variance decreases at rate \(1/B\), the choice of \(g\) impacts the speed of convergence: we can compute \[\sigma^2_g =\Pr(X \leq 1)\{1-\Pr(X \leq 1)\}=0.0434\] (left) and \(\sigma^2_g=\alpha/\beta^2=1/8\) (middle plot).
Figure 1.1 shows the empirical trace plot of the Monte Carlo average (note the \(\sqrt{B}\) \(x\)-axis scale!) as a function of the Monte Carlo sample size \(B\) along with 95% Wald-based confidence intervals (gray shaded region), \(\widehat{\mu}_g \pm 1.96 \times \sigma_g/\sqrt{B}\). We can see that the ‘likely region’ for the average shrinks with \(B\).
What happens if our function is not integrable? The right-hand plot of Figure 1.1 shows empirical averages of \(g(x) = x^{-1}\), which is not integrable if \(\alpha < 1\). We can compute the empirical average, but the result will not converge to any meaningful quantity regardless of the sample size. The large jumps in the trace plot are evidence of this.
We have already used Monte Carlo methods to compute posterior quantities of interest in conjugate models. Outside of models with conjugate priors, the lack of a closed-form expression for the posterior precludes analytical inference. Indeed, calculating the posterior probability of an event, or posterior moments, requires integration of the normalized posterior density and thus knowledge of the marginal likelihood. It is seldom possible to sample independent and identically distributed (iid) draws from the target, especially if the model is high dimensional: rejection sampling and the ratio-of-uniforms method are examples of Monte Carlo methods which can be used to generate iid draws.
Proposition 1.6 (Rejection sampling) Rejection sampling (also termed accept-reject algorithm) samples from a random vector with density \(p(\cdot)\) by drawing candidates from a proposal with density \(q(\cdot)\) with nested support, \(\mathrm{supp}(p) \subseteq \mathrm{supp}(q)\). The density \(q(\cdot)\) must be such that \(p(\boldsymbol{\theta}) \leq C q(\boldsymbol{\theta})\) for \(C \geq 1\) for all values of \(\boldsymbol{\theta}\) in the support of \(p(\cdot)\). A proof can be found in Devroye (1986, Theorem 3.1)
- Generate \(\boldsymbol{\theta}^{\star}\) from the proposal with density \(q\) and \(U \sim \mathsf{U}(0,1)\)
- Compute the ratio \(R \gets p(\boldsymbol{\theta}^{\star})/ q(\boldsymbol{\theta}^{\star})\).
- If \(R \geq CU\), return \(\boldsymbol{\theta}^{\star}\); otherwise, go back to step 1.
Rejection sampling requires the proposal \(q\) to have a support at least as large as that of \(p\) and to closely resemble the target density. The proposal should be chosen so that the upper bound \(C\) is as sharp as possible and close to 1; the dominating density \(q\) must have heavier tails than the density of interest. The expected number of simulations needed to accept one proposal is \(C.\) Finally, for the method to be useful, we need to be able to simulate easily and cheaply from the proposal. The optimal value of \(C\) is \(C = \sup_{\boldsymbol{\theta}} p(\boldsymbol{\theta}) / q(\boldsymbol{\theta})\); when this maximum is not known analytically, it may be obtained by numerical optimization of the difference of the log densities.
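As a toy illustration of these recommendations (an assumed example, not from the source), the sketch below samples from a \(\mathsf{beta}(2,4)\) target using a standard uniform proposal, with the bound \(C\) obtained by numerical optimization of the log density ratio.

```r
# Rejection sampling for a beta(2, 4) target with a uniform proposal on [0, 1];
# since q(x) = 1, the bound C is simply the maximum of the target density
C <- exp(optimize(function(x) dbeta(x, shape1 = 2, shape2 = 4, log = TRUE),
                  interval = c(0, 1), maximum = TRUE)$objective)
niter <- 1e4
cand <- runif(niter)
U <- runif(niter)
samp <- cand[dbeta(cand, 2, 4) >= C * U]  # accept if R >= C * U
length(samp) / niter                      # acceptance rate, roughly 1/C
```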
Example 1.14 (Truncated Gaussian distribution) Consider the problem of sampling from a Gaussian distribution \(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) truncated in the interval \([a, b],\) which has density \[\begin{align*} f(x; \mu, \sigma, a, b) = \frac{1}{\sigma}\frac{\phi\left(\frac{x-\mu}{\sigma}\right)}{\Phi\{(b-\mu)/\sigma\}-\Phi\{(a-\mu)/\sigma\}}, \qquad a \leq x \leq b, \end{align*}\] where \(\phi(\cdot), \Phi(\cdot)\) are respectively the density and distribution function of the standard Gaussian distribution.
Since the Gaussian is a location-scale family, we can reduce the problem to sampling \(X\) from a standard Gaussian truncated on \(\alpha = (a-\mu)/\sigma\) and \(\beta = (b-\mu)/\sigma\) and back transform the result as \(Y = \mu + \sigma X\).
A crude accept-reject sampling algorithm would draw candidates from the untruncated distribution with density \(g(x) = \sigma^{-1}\phi\{(x-\mu)/\sigma\}\); the acceptance rate is then \(C^{-1}=\Phi(\beta) - \Phi(\alpha)\). We thus simply simulate points from the Gaussian and accept any that fall within the bounds.
```r
# Standard Gaussian truncated on [0,1]
candidate <- rnorm(1e5)
trunc_samp <- candidate[candidate >= 0 & candidate <= 1]
# Acceptance rate
length(trunc_samp) / 1e5
```

[1] 0.34242

```r
# Theoretical acceptance rate
pnorm(1) - pnorm(0)
```

[1] 0.3413447
We can of course do better: if we consider a random variable with distribution function \(F,\) but truncated over the interval \([a,b],\) then the resulting distribution function is \[\frac{F(x) - F(a)}{F(b)-F(a)}, \qquad a \leq x \leq b,\] and we can invert this expression to get the quantile function of the truncated variable in terms of the distribution function \(F\) and the quantile function \(F^{-1}\) of the original untruncated variable.
For the Gaussian, this gives \[\begin{align*} X = \Phi^{-1}\left[\Phi(\alpha) + \{\Phi(\beta)-\Phi(\alpha)\}U\right] \end{align*}\] for \(U \sim \mathsf{U}(0,1)\). Although the distribution and quantile functions of the Gaussian, pnorm and qnorm in R, are very accurate, this method will fail for rare event simulation because it will return \(\Phi(x) = 0\) for \(x \leq -39\) and \(\Phi(x)=1\) for \(x \geq 8.3\), implying that \(a \leq 8.3\) for this approach to work (Botev and L’Écuyer 2017).
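For moderate truncation bounds, the inversion approach is a one-liner.

```r
# Inversion method for a standard Gaussian truncated on [alpha, beta] = [0, 1]
alpha <- 0; beta <- 1
n <- 1e5
trunc_samp <- qnorm(pnorm(alpha) + (pnorm(beta) - pnorm(alpha)) * runif(n))
range(trunc_samp)   # all draws lie within the truncation bounds
```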
Consider the problem of simulating events in the right tail for a standard Gaussian where \(a > 0\); Marsaglia’s method (Devroye 1986, 381), can be used for that purpose. Write the density of the Gaussian as \(f(x) = \exp(-x^2/2)/c_1\), where \(c_1 = \int_{a}^{\infty}\exp(-z^2/2)\mathrm{d} z\), and note that \[c_1f(x) \leq \frac{x}{a}\exp\left(-\frac{x^2}{2}\right)= a^{-1}\exp\left(-\frac{a^2}{2}\right)g(x), \qquad x \geq a;\] where \(g(x)\) is the density of a Rayleigh variable shifted by \(a\), which has distribution function \(G(x) = 1-\exp\{(a^2-x^2)/2\}\) for \(x \geq a\). We can simulate such a random variate \(X\) through the inversion method. The constant \(C= \exp(-a^2/2)(c_1a)^{-1}\) approaches 1 quickly as \(a \to \infty\).
The accept-reject thus proceeds with
- Generate a shifted Rayleigh above \(a\), \(X \gets \{a^2 - 2\log(U)\}^{1/2}\) for \(U \sim \mathsf{U}(0,1)\)
- Accept \(X\) if \(XV \leq a\), where \(V \sim \mathsf{U}(0,1)\).
Should we wish to obtain samples on \([a,b]\), we could instead propose from a Rayleigh truncated above at \(b\) (Botev and L’Écuyer 2017).
```r
a <- 8.3
niter <- 1000L
X <- sqrt(a^2 + 2*rexp(niter))
samp <- X[runif(niter)*X <= a]
```
For a given candidate density \(g\) which has a heavier tail than the target, we can resort to numerical methods to compute the mode of the ratio \(f/g\) and obtain the bound \(C\); see Albert (2009), Section 5.8 for an insightful example.