1 Introduction
This section reviews basic concepts in probability theory that will be used throughout the course. The overview begins with basic statistical concepts: random variables, their distribution, density and moments, and likelihood derivations.
1.1 Random vectors
We begin with a characterization of random vectors and their marginal, conditional and joint distributions. A good reference for this material is Chapter 3 of McNeil, Frey, and Embrechts (2005).
Definition 1.1 (Density and distribution function) Let \(\boldsymbol{X}\) denote a \(d\)-dimensional random vector taking values in \(\mathbb{R}^d\). The distribution function of \(\boldsymbol{X}\) is \[\begin{align*} F_{\boldsymbol{X}}(\boldsymbol{x}) = \Pr(\boldsymbol{X} \leq \boldsymbol{x}) = \Pr(X_1 \leq x_1, \ldots, X_d \leq x_d). \end{align*}\]
If the distribution of \(\boldsymbol{X}\) is absolutely continuous, we may write \[\begin{align*} F_{\boldsymbol{X}}(\boldsymbol{x}) = \int_{-\infty}^{x_d} \cdots \int_{-\infty}^{x_1} f_{\boldsymbol{X}}(z_1, \ldots, z_d) \mathrm{d} z_1 \cdots \mathrm{d} z_d, \end{align*}\] where \(f_{\boldsymbol{X}}(\boldsymbol{x})\) is the joint density function. The density function can be obtained as the derivative of the distribution function with respect to all of its arguments.
We use the same notation for the mass function in the discrete case, where \(f_{\boldsymbol{X}}(\boldsymbol{x}) = \Pr(X_1 = x_1, \ldots, X_d = x_d)\) and the integral is understood to mean a summation over all values lower or equal to \(\boldsymbol{x}\) in the support. In the discrete case, \(0 \leq f_{\boldsymbol{X}}(\boldsymbol{x}) \leq 1\) is a probability and the total probability over all points in the support sums to one, meaning \(\sum_{\boldsymbol{x} \in \mathsf{supp}(\boldsymbol{X})} f_{\boldsymbol{X}}(\boldsymbol{x}) = 1.\)
Definition 1.2 (Location and scale distribution) A random variable \(Y\) is said to belong to a location scale family with location parameter \(b\) and scale \(a>0\) if it is equal in distribution to a location and scale transformation of a standard variable \(X\) with location zero and unit scale, denoted \(Y \stackrel{\mathrm{d}}{=} aX + b\) and meaning, \[\Pr(Y \leq y) = \Pr(aX + b \leq y).\] If the density exists, then \(f_Y(y) = a^{-1}f_X\{(y-b)/a\}\).
We can extend this definition to the multivariate setting for location vector \(\boldsymbol{b} \in \mathbb{R}^d\) and positive definite scale matrix \(\mathbf{A}\), such that \[\Pr(\boldsymbol{Y} \leq \boldsymbol{y}) = \Pr(\boldsymbol{b} + \mathbf{A}\boldsymbol{X} \leq \boldsymbol{y}).\]
Definition 1.3 (Exponential family) A univariate distribution is an exponential family if its density or mass function can be written for all \(y \in \mathbb{R}\) as \[\begin{align*} f(y; \boldsymbol{\theta}) = \exp\left\{ \sum_{k=1}^K Q_k(\boldsymbol{\theta}) t_k(y) + D(\boldsymbol{\theta}) + h(y)\right\}, \end{align*}\] where functions \(Q_1(\cdot), \ldots, Q_K(\cdot)\) and \(D(\cdot)\) depend only on \(\boldsymbol{\theta}\) and not on the data, and conversely \(t_1(\cdot), \ldots, t_K(\cdot)\) and \(h(\cdot)\) do not depend on the vector of parameters \(\boldsymbol{\theta}\).
The support of \(f\) must not depend on \(\boldsymbol{\theta}\). The transformed parameters \(Q_k(\boldsymbol{\theta})\) \((k=1, \ldots, K)\) are termed canonical parameters.
1.1.1 Common distributions
Definition 1.4 (Gamma, chi-square and exponential distributions) A random variable follows a gamma distribution with shape \(\alpha>0\) and rate \(\beta>0\), denoted \(Y \sim \mathsf{gamma}(\alpha, \beta)\), if its density is \[\begin{align*} f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x), \qquad x \in (0, \infty), \end{align*}\] where \(\Gamma(\alpha)\coloneqq\int_0^\infty t^{\alpha-1}\exp(-t)\mathrm{d} t\) is the gamma function.
If \(\alpha=1\), the density simplifies to \(\beta \exp(-\beta x)\) and we recover the exponential distribution, denoted \(\mathsf{expo}(\beta)\). The case \(\mathsf{gamma}(\nu/2, 1/2)\) corresponds to the chi-square distribution \(\chi^2_\nu\).
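These special cases are easy to verify numerically; the following is a minimal sketch using base R densities and arbitrary parameter values.

```r
# Check that gamma(1, beta) matches the exponential distribution and that
# gamma(nu/2, 1/2) matches the chi-square distribution with nu degrees of freedom
x <- seq(0.1, 5, by = 0.1)
all.equal(dgamma(x, shape = 1, rate = 2), dexp(x, rate = 2))
all.equal(dgamma(x, shape = 3/2, rate = 1/2), dchisq(x, df = 3))
```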
Definition 1.5 (Beta and uniform distribution) The beta distribution \(\mathsf{beta}(\alpha_1, \alpha_2)\) is a distribution supported on the unit interval \([0,1]\) with shape parameters \(\alpha_1>0\) and \(\alpha_2>0.\) Its density is \[\begin{align*} f(x) = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}x^{\alpha_1-1}(1-x)^{\alpha_2-1}, \qquad x \in [0,1]. \end{align*}\] The case \(\mathsf{beta}(1,1)\), also denoted \(\mathsf{unif}(0,1)\), corresponds to a standard uniform distribution.
The beta distribution is commonly used to model proportions, and can be generalized to the multivariate setting as follows.
Definition 1.6 (Dirichlet distribution) Let \(\boldsymbol{\alpha} \in (0, \infty)^d\) denote shape parameters and consider a random vector of size \(d\) with positive components on the simplex \[\mathbb{S}_{d-1} = \{ \boldsymbol{x} \in \mathbb{R}^d: 0 \leq x_j \leq 1, j=1, \ldots, d, \; x_1 + \cdots + x_d=1\}.\] The density of a Dirichlet random vector, denoted \(\boldsymbol{Y} \sim \mathsf{Dirichlet}(\boldsymbol{\alpha})\), is \[\begin{align*} f(\boldsymbol{x}) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_d)}{\prod_{j=1}^{d}\Gamma(\alpha_j)}\prod_{j=1}^{d} x_j^{\alpha_j-1}, \qquad \boldsymbol{x} \in \mathbb{S}_{d-1}. \end{align*}\]
Due to the linear dependence, the \(d\)th component \(x_d = 1- x_1 - \cdots - x_{d-1}\) is fully determined.
Definition 1.7 (Binomial distribution) The density of the binomial distribution, denoted \(Y \sim \mathsf{binom}(n, p)\), is \[\begin{align*} f(x) = \mathsf{Pr}(Y=x) = \binom{n}{x}p^x (1-p)^{n-x}, \quad x=0, 1, \ldots, n. \end{align*}\]
If \(n=1\), we recover the Bernoulli distribution, with mass function \(f(x) = p^{x}(1-p)^{1-x}\) for \(x \in \{0,1\}\). The binomial distribution is closed under convolution, meaning that the number of successes \(Y\) out of \(n\) independent Bernoulli trials with common success probability \(p\) is binomial.
Definition 1.8 (Multinomial distribution) If there are more than two outcomes, say \(d\), we can generalize this mass function. Suppose that \(\boldsymbol{Y}=(Y_1, \ldots, Y_d)\) denotes the number of realizations of each of the \(d\) outcomes based on \(n\) trials, so that \(0 \leq Y_j \leq n\) \((j=1, \ldots, d)\) and \(Y_1 + \cdots + Y_d=n\). The joint mass function of the multinomial vector \(\boldsymbol{Y} \sim \mathsf{multinom}(n, \boldsymbol{p})\) with probability vector \(\boldsymbol{p} \in \mathbb{S}_{d-1}\) is \[\begin{align*} f(\boldsymbol{x}) = \frac{n!}{\prod_{j=1}^d x_j!} \prod_{j=1}^d p_j^{x_j}, \qquad \boldsymbol{x}/n \in \mathbb{S}_{d-1}, \end{align*}\] where \(x! = \Gamma(x+1)\) denotes the factorial function.
Definition 1.9 (Poisson distribution) If the probability of success \(p\) of a Bernoulli event is small in the sense that \(np \to \lambda\) when the number of trials \(n\) increases, then the number of successes follows approximately a Poisson distribution with mass function \[\begin{align*} f(x)=\mathsf{Pr}(Y=x) = \frac{\exp(-\lambda)\lambda^x}{\Gamma(x+1)}, \quad x=0, 1, 2, \ldots \end{align*}\] where \(\Gamma(\cdot)\) denotes the gamma function. The parameter \(\lambda\) of the Poisson distribution is both the expectation and the variance of the distribution, meaning \(\mathsf{E}(Y)=\mathsf{Va}(Y)=\lambda.\) We denote the distribution as \(Y \sim \mathsf{Poisson}(\lambda)\).
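The limiting argument can be checked numerically; the sketch below uses assumed values \(\lambda=2\) and \(n=10^4\).

```r
# Poisson approximation to binomial(n, p) when n is large and p = lambda/n is small
lambda <- 2
n <- 1e4
x <- 0:10
max(abs(dbinom(x, size = n, prob = lambda/n) - dpois(x, lambda = lambda)))
```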
The most frequently encountered location-scale distribution is termed normal, but we use the terminology Gaussian in these notes.
Definition 1.10 (Gaussian distribution) Consider a \(d\) dimensional vector \(\boldsymbol{Y} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{Q}^{-1})\) with density \[\begin{align*} f(\boldsymbol{x}) = (2\pi)^{-d/2} |\boldsymbol{Q}|^{1/2} \exp \left\{ - \frac{1}{2} (\boldsymbol{x}-\boldsymbol{\mu})^\top \boldsymbol{Q}(\boldsymbol{x}-\boldsymbol{\mu})\right\}, \qquad \boldsymbol{x} \in \mathbb{R}^d \end{align*}\]
The mean vector \(\boldsymbol{\mu}\) is the vector of expectation of individual observations, whereas \(\boldsymbol{Q}^{-1}\equiv \boldsymbol{\Sigma}\) is the \(d \times d\) covariance matrix of \(\boldsymbol{Y}\) and \(\boldsymbol{Q}\), the canonical parameter, is called the precision matrix.
In the univariate case, the density of \(\mathsf{Gauss}(\mu, \sigma^2)\) reduces to \[\begin{align*} f(x) = (2\pi\sigma^2)^{-1/2} \exp \left\{ - \frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad x \in \mathbb{R}. \end{align*}\]
Proposition 1.1 (Simulation of Gaussian vectors) The Gaussian distribution is an elliptical distribution and a location-scale family: if \(\boldsymbol{L} = \mathrm{chol}(\boldsymbol{Q})\), meaning \(\boldsymbol{Q}=\boldsymbol{LL}^\top\) for some lower triangular matrix \(\boldsymbol{L}\), then \[\boldsymbol{L}^\top(\boldsymbol{Y}-\boldsymbol{\mu}) \sim \mathsf{Gauss}_d(\boldsymbol{0}_d, \mathbf{I}_d).\] Conversely, we can use the Cholesky root to sample multivariate Gaussian vectors by first drawing \(d\) independent standard Gaussians \(\boldsymbol{Z}=(Z_1, \ldots, Z_d)^\top\), then computing \[\boldsymbol{Y} \gets \boldsymbol{\mu} + (\boldsymbol{L}^{\top})^{-1}\boldsymbol{Z},\] which solves the triangular system \(\boldsymbol{L}^\top(\boldsymbol{Y}-\boldsymbol{\mu}) = \boldsymbol{Z}\).
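The following sketch implements this recipe; note that R's chol() returns the upper triangular factor \(\boldsymbol{L}^\top\), so the triangular solve uses backsolve(). The function name rmvgauss_prec and the test matrix are illustrative choices.

```r
# Sketch: simulate Gauss_d(mu, Q^{-1}) from the Cholesky factor of the precision Q
rmvgauss_prec <- function(n, mu, Q) {
  d <- length(mu)
  U <- chol(Q)                      # upper triangular, Q = t(U) %*% U
  Z <- matrix(rnorm(n * d), nrow = d, ncol = n)
  t(mu + backsolve(U, Z))           # solves U %*% (Y - mu) = Z, column by column
}
# Empirical covariance should approximate solve(Q)
Q <- matrix(c(2, -1, -1, 2), nrow = 2)
samp <- rmvgauss_prec(n = 1e4, mu = c(1, -1), Q = Q)
cov(samp)
solve(Q)
```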
Definition 1.11 (Student-\(t\) distribution) The name “Student” comes from the pseudonym used by William Gosset in Gosset (1908), who derived the distribution of the \(t\)-statistic. The density of the standard Student-\(t\) univariate distribution with \(\nu\) degrees of freedom is \[\begin{align*} f(y; \nu) = \frac{\Gamma \left( \frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \sqrt{\nu\pi}}\left(1+\frac{y^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}. \end{align*}\] The density of the random vector \(\boldsymbol{Y} \sim \mathsf{Student}_d(\boldsymbol{\mu}, \boldsymbol{Q}^{-1}, \nu)\), with location vector \(\boldsymbol{\mu}\), scale matrix \(\boldsymbol{Q}^{-1}\) and \(\nu\) degrees of freedom, is \[\begin{align*} f(\boldsymbol{x}) = \frac{\Gamma \left( \frac{\nu+d}{2}\right)|\boldsymbol{Q}|^{1/2}}{\Gamma\left(\frac{\nu}{2}\right) (\nu\pi)^{d/2}}\left(1+\frac{(\boldsymbol{x}-\boldsymbol{\mu})^\top\boldsymbol{Q}(\boldsymbol{x}-\boldsymbol{\mu})}{\nu} \right)^{-\frac{\nu+d}{2}}, \qquad \boldsymbol{x} \in \mathbb{R}^d. \end{align*}\] The Student distribution is a location-scale family and an elliptical distribution. The distribution has polynomial tails, is symmetric around \(\boldsymbol{\mu}\) and is unimodal. As \(\nu \to \infty\), the Student distribution converges to a Gaussian distribution. It has heavier tails than the Gaussian and only moments of order strictly smaller than \(\nu\) exist. The case \(\nu=1\) is termed the Cauchy distribution.
Definition 1.12 (Weibull distribution) The distribution function of a Weibull random variable with scale \(\lambda>0\) and shape \(\alpha>0\) is \[\begin{align*} F(x; \lambda, \alpha) &= 1 - \exp\left\{-(x/\lambda)^\alpha\right\}, \qquad x \geq 0, \lambda>0, \alpha>0, \end{align*}\] while the corresponding density is \[\begin{align*} f(x; \lambda, \alpha) &= \frac{\alpha}{\lambda^\alpha} x^{\alpha-1}\exp\left\{-(x/\lambda)^\alpha\right\}, \qquad x \geq 0, \lambda>0, \alpha>0. \end{align*}\]
1.1.2 Marginal and conditional distributions
Definition 1.13 (Marginal distribution) The marginal distribution of a subvector \(\boldsymbol{X}_{1:k}=(X_1, \ldots, X_k)^\top\), without loss of generality consisting of the \(k\) first components of \(\boldsymbol{X}\) \((1 \leq k < d)\), is \[\begin{align*} F_{1:k}(\boldsymbol{x}_{1:k}) = \Pr(\boldsymbol{X}_{1:k} \leq \boldsymbol{x}_{1:k}) = F_{\boldsymbol{X}}(x_1, \ldots, x_k, \infty, \ldots, \infty), \end{align*}\] and thus the marginal distribution of component \(j\), \(F_j(x_j)\), is obtained by evaluating all components but the \(j\)th at \(\infty\).
We likewise obtain the marginal density \[\begin{align*} f_{1:k}(\boldsymbol{x}_{1:k}) = \frac{\partial^k F_{1:k}(\boldsymbol{x}_{1:k})}{\partial x_1 \cdots \partial x_{k}}, \end{align*}\] or through integration from the joint density as \[\begin{align*} f_{1:k}(\boldsymbol{x}_{1:k}) = \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty f_{\boldsymbol{X}}(x_1, \ldots, x_k, z_{k+1}, \ldots, z_{d}) \mathrm{d} z_{k+1} \cdots \mathrm{d}z_d. \end{align*}\]
Definition 1.14 (Conditional distribution) Let \((\boldsymbol{X}^\top, \boldsymbol{Y}^\top)^\top\) be a \(d\)-dimensional random vector with joint density or mass function \(f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})\) and marginal density \(f_{\boldsymbol{X}}(\boldsymbol{x})\). The conditional density (or mass) function of \(\boldsymbol{Y}\) given \(\boldsymbol{X}=\boldsymbol{x}\) is \[\begin{align*} f_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{y}; \boldsymbol{x}) = \frac{f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y})}{f_{\boldsymbol{X}}(\boldsymbol{x})} \end{align*}\] for any value of \(\boldsymbol{x}\) in the support of \(\boldsymbol{X}\), i.e., the set of values with non-zero density or mass, meaning \(f_{\boldsymbol{X}}(\boldsymbol{x})>0\); it is undefined otherwise.
Theorem 1.1 (Bayes’ theorem) Denote by \(p(X) \equiv \Pr(X)\) the marginal density of \(X\), by \(p(X \mid Y)\) the conditional density of \(X\) given \(Y\), and by \(p(X, Y)\) the joint density. Bayes’ theorem states that \[\begin{align*} p(X = x \mid Y=y) = \frac{p(Y = y \mid X = x)p(X=x)}{p(Y=y)}. \end{align*}\]
In the case of discrete random variable \(X\) with support \(\mathcal{X},\) the denominator can be evaluated using the law of total probability as \[\Pr(Y=y) = \sum_{x \in \mathcal{X}}\Pr(Y =y \mid X=x)\Pr(X=x).\]
Example 1.1 (Covid rapid tests) Back in January 2021, the Quebec government was debating whether or not to distribute rapid antigen tests, with strong reluctance from authorities given the paucity of available resources and the tests’ poor sensitivity.
A Swiss study analysed the efficiency of rapid antigen tests, comparing them to repeated polymerase chain reaction (PCR) tests taken as a benchmark (Jegerlehner et al. 2021). The results are presented in Table 1.1.
Table 1.1: Results of rapid antigen tests against the repeated PCR benchmark (Jegerlehner et al. 2021).

| | PCR \(+\) | PCR \(-\) |
|---|---|---|
| rapid \(+\) | 92 | 2 |
| rapid \(-\) | 49 | 1319 |
| total | 141 | 1321 |
The estimated seropositivity at the end of January 2021, according to projections of the Institute for Health Metrics and Evaluation (IHME), was 8.18M out of 38M inhabitants (Mathieu et al. 2020), a prevalence of 21.4%. Assuming the latter holds uniformly over the country, what is the probability of having Covid if I get a negative result to a rapid test?
Let \(R^{-}\) (\(R^{+}\)) denote a negative (positive) rapid test result and \(C^{+}\) (\(C^{-}\)) Covid positivity (negativity). Bayes’ formula gives \[\begin{align*} \Pr(C^{+} \mid R^{-}) & = \frac{\Pr(R^{-} \mid C^{+})\Pr(C^{+})}{\Pr(R^{-} \mid C^{+})\Pr(C^{+}) + \Pr(R^{-} \mid C^{-})\Pr(C^{-})} \\&= \frac{49/141 \cdot 0.214}{49/141 \cdot 0.214 + 1319/1321 \cdot 0.786}, \end{align*}\] so there is a small, but non-negligible probability of 8.66% that the rapid test result is misleading. Jegerlehner et al. (2021) indeed found that the sensitivity was 65.3% among symptomatic individuals, but dropped to 44% for asymptomatic cases. This may have fueled government experts’ skepticism.
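The computation can be reproduced in a couple of lines of R code.

```r
# Bayes' rule calculation of Example 1.1
prev <- 0.214                   # assumed prevalence, Pr(C+)
p_neg_pos <- 49/141             # Pr(R- | C+), from Table 1.1
p_neg_neg <- 1319/1321          # Pr(R- | C-), from Table 1.1
p_neg_pos * prev / (p_neg_pos * prev + p_neg_neg * (1 - prev))
```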
Bayes’ rule is central to updating beliefs: given initial beliefs (priors) and new information in the form of data, we update our beliefs accordingly.
Example 1.2 (Conditional and marginal for contingency table) Consider a bivariate distribution for \((Y_1, Y_2)\) supported on \(\{1, 2, 3\} \times \{1, 2\}\) whose joint probability mass function is given in Table 1.2.

Table 1.2: Joint probability mass function of \((Y_1, Y_2)\).

| | \(Y_1=1\) | \(Y_1=2\) | \(Y_1=3\) |
|---|---|---|---|
| \(Y_2=1\) | 0.20 | 0.30 | 0.10 |
| \(Y_2=2\) | 0.15 | 0.20 | 0.05 |
The marginal distribution of \(Y_1\) is obtained by looking at the total probability for each column, as \[\Pr(Y_1=i) = \Pr(Y_1=i, Y_2=1)+ \Pr(Y_1=i, Y_2=2).\] This gives \(\Pr(Y_1=1)=0.35\), \(\Pr(Y_1=2)=0.5\) and \(\Pr(Y_1=3) = 0.15\). Similarly, we find that \(\Pr(Y_2=1)=0.6\) and \(\Pr(Y_2=2)=0.4\) for the other random variable.
The conditional distribution of \(Y_2\) given \(Y_1=2\) is \[\Pr(Y_2 = i \mid Y_1=2) = \frac{\Pr(Y_1=2, Y_2=i)}{\Pr(Y_1=2)},\] so \(\Pr(Y_2 = 1 \mid Y_1=2) = 0.3/0.5 = 0.6\) and \(\Pr(Y_2=2 \mid Y_1=2) = 0.4\). We can condition on more complicated events, for example \[\begin{align*} \Pr(Y_2 = i \mid Y_1 \ge 2) = \frac{\Pr(Y_1=2, Y_2=i) + \Pr(Y_1=3, Y_2=i)}{\Pr(Y_1=2) + \Pr(Y_1=3)}. \end{align*}\]
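These calculations are easily replicated by storing the joint mass function of Table 1.2 as a matrix (a small sketch; the object name joint is arbitrary).

```r
# Joint mass function of (Y1, Y2), with Y2 in rows and Y1 in columns
joint <- matrix(c(0.20, 0.15, 0.30, 0.20, 0.10, 0.05), nrow = 2,
                dimnames = list(Y2 = 1:2, Y1 = 1:3))
colSums(joint)                # marginal distribution of Y1
rowSums(joint)                # marginal distribution of Y2
joint[, 2] / sum(joint[, 2])  # conditional distribution of Y2 given Y1 = 2
```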
Example 1.3 (Margins and conditional distributions of multinomial vectors) Consider \(\boldsymbol{Y} = (Y_1, Y_2, n-Y_1-Y_2)\) a trinomial vector giving the number of observations in group \(j \in \{1,2,3\}\) with \(n\) trials and probabilities of each component respectively \((p_1, p_2, 1-p_1-p_2)\). The marginal distribution of \(Y_2\) is obtained by summing over all possible values of \(Y_1\), which ranges from \(0\) to \(n-y_2\), so \[\begin{align*} f(y_2) &= \frac{n!p_2^{y_2}}{y_2!}\sum_{y_1=0}^{n-y_2} \frac{p_1^{y_1}(1-p_1 -p_2)^{n-y_1-y_2}}{y_1!(n-y_1-y_2)!}. \end{align*}\]
A useful trick is to complete the expression on the right so that it sums (in the discrete case) or integrates (in the continuous case) to \(1.\) Writing \(p_1^\star=p_1/(1-p_2)\), so that \(1-p_1-p_2 = (1-p_2)(1-p_1^{\star})\), and multiplying and dividing by \((n-y_2)!\), we get
\[\begin{align*} f(y_2) &= \frac{n!p_2^{y_2}(1-p_2)^{n-y_2}}{y_2!(n-y_2)!}\sum_{y_1=0}^{n-y_2} \binom{n-y_2}{y_1} p_1^{\star y_1}(1-p^{\star}_1)^{n-y_2-y_1} \\&= \binom{n}{y_2}p_2^{y_2}(1-p_2)^{n-y_2}, \end{align*}\] so \(Y_2\) is binomial with \(n\) trials and probability of success \(p_2\). We can generalize this argument to multinomials of arbitrary dimensions.
The conditional density of \(Y_2 \mid Y_1=y_1\) is, up to proportionality, \[\begin{align*} f_{Y_2 \mid Y_1}(y_2; y_1) &\propto \frac{p_2^{y_2}(1-p_1 -p_2)^{n-y_1-y_2}}{y_2!(n-y_1-y_2)!} \end{align*}\] If we write \(p_2^\star=p_2/(1-p_1)\), we find that \(Y_2 \mid Y_1 \sim \mathsf{binom}(n-y_1, p_2^\star)\). Indeed, we can see that \[\begin{align*} f_{\boldsymbol{Y}}(\boldsymbol{y}) &= f_{Y_2 \mid Y_1}(y_2; y_1) f_{Y_1}(y_1) \\&= \binom{n-y_1}{y_2} \left( \frac{p_2}{1-p_1}\right)^{y_2}\left(\frac{1-p_1-p_2}{1-p_1}\right)^{n-y_1-y_2}\!\!\cdot\binom{n}{y_1} p_1^{y_1}(1-p_1)^{n-y_1}. \end{align*}\]
Example 1.4 Consider the bivariate density function of the pair \((X, Y)\), where for \(\lambda>0\), \[\begin{align*} f(x,y) = \frac{\lambda y^{1/2}}{(2\pi)^{1/2}}\exp\left\{ - y(x^2/2 + \lambda)\right\}, \qquad x \in \mathbb{R}, y>0. \end{align*}\] We see that the conditional distribution of \(X \mid Y=y\) is \(\mathsf{Gauss}(0, y^{-1})\). The marginal of \(Y\) is \[\begin{align*} f(y) &= \int_{-\infty}^\infty \frac{\lambda y^{1/2}}{(2\pi)^{1/2}}\exp\left\{ - y(x^2/2 + \lambda)\right\} \mathrm{d} x \\&= \lambda \exp(-\lambda y), \end{align*}\] so marginally \(Y\) follows an exponential distribution with rate \(\lambda\). The marginal of \(X\) can be obtained by noting that the joint density, as a function of \(y\), is proportional to the kernel of a gamma distribution with shape \(3/2\) and rate \(x^2/2+\lambda\), with \(Y \mid X=x \sim \mathsf{gamma}(3/2, x^2/2+\lambda)\). If we pull out the normalizing constant, we find \[\begin{align*} f(x) &= \int_{0}^{\infty} \frac{\lambda y^{1/2}}{(2\pi)^{1/2}}\exp\left\{ - y(x^2/2 + \lambda)\right\} \mathrm{d} y\\&= \frac{\lambda \Gamma(3/2)}{(2\pi)^{1/2}(x^2/2+\lambda)^{3/2}} \int_0^\infty f_{Y \mid X}(y \mid x) \mathrm{d} y \\ &= \frac{\lambda}{(x^2+2\lambda)^{3/2}} \end{align*}\] since \(\Gamma(a+1) = a\Gamma(a)\) for \(a>0\) and \(\Gamma(1/2) = \sqrt{\pi}\). We conclude that marginally \(X \sim \mathsf{Student}(0, \lambda, 2)\), a Student distribution with scale \(\lambda\) and two degrees of freedom.
Example 1.5 (Poisson counts conditional on their sum) Suppose that \(Y_j\sim \mathsf{Poisson}(\lambda_j)\) are independent \((j=1, \ldots, J),\) so that the sum \(S = Y_1 + \cdots + Y_J \sim \mathsf{Poisson}(\lambda_1 + \cdots + \lambda_J)\). One may prove this fact using moment generating functions, since the Poisson is infinitely divisible.
Let \(f(y; \lambda_j)\) denote the Poisson mass function with rate \(\lambda_j\). Since the random variables \(Y_1, \ldots, Y_J\) are independent, their joint mass function factorizes, and the conditional mass function given the sum is \[\begin{align*} &f(y_1, \ldots, y_J \mid S=n) = \frac{\prod_{j=1}^J f(y_j; \lambda_j)}{f(n; \lambda_1 + \cdots + \lambda_J)} \\&\quad=\prod_{j=1}^J\frac{\exp(-\lambda_j)\lambda_j^{y_j}}{y_j!}\cdot\frac{n!}{\exp\{-(\lambda_1 + \cdots + \lambda_J)\} (\lambda_1 + \cdots + \lambda_J)^{n}} \\&\quad= \frac{n!}{ (\lambda_1 + \cdots + \lambda_J)^{n}}\prod_{j=1}^J\frac{\lambda_j^{y_j}}{y_j!} \\&\quad=\frac{n!}{y_1! \cdots y_J!}\prod_{j=1}^J \left(\frac{\lambda_j}{\lambda_1 + \cdots + \lambda_J}\right)^{y_j} \end{align*}\] for \(y_j \geq 0\) for all \(j=1, \ldots, J\) such that \(y_1 + \cdots + y_J = n\), and the joint mass is zero otherwise. If we define \(p_j = \lambda_j/(\lambda_1 + \cdots + \lambda_J)\), we recognize a multinomial distribution with \(n\) trials and probability vector \(\boldsymbol{p}\), where \(p_1 + \cdots + p_J=1\).
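A quick Monte Carlo check of this result, with arbitrary rates \(\boldsymbol{\lambda} = (1, 3, 6)\), compares the empirical conditional distribution of \(Y_1\) given the sum with the implied binomial margin.

```r
# Poisson counts conditional on their sum are multinomial, so the margin of Y1
# given S = n is binomial(n, lambda_1 / sum(lambda))
lambda <- c(1, 3, 6)
B <- 1e5
Y <- matrix(rpois(3 * B, lambda = lambda), nrow = 3)
S <- colSums(Y)
prop.table(table(factor(Y[1, S == 10], levels = 0:10)))
dbinom(0:10, size = 10, prob = lambda[1] / sum(lambda))
```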
Example 1.6 (Bivariate geometric distribution of Marshall and Olkin) Consider a couple \((U_1, U_2)\) of Bernoulli random variables whose mass function is \(\Pr(U_1=i, U_2=j)=p_{ij}\) for \((i,j) \in \{0,1\}^2\). The marginal distributions are, by the law of total probability, \[\begin{align*} \Pr(U_1=i) &= \Pr(U_1=i, U_2=0) + \Pr(U_1=i, U_2=1) = p_{i0} + p_{i1} = p_{i\bullet}\\ \Pr(U_2=j) &= \Pr(U_1=0, U_2=j) + \Pr(U_1=1, U_2=j) = p_{0j} + p_{1j} = p_{\bullet j}. \end{align*}\] We consider a joint geometric distribution (Marshall and Olkin (1985), Section 6) and the pair \((Y_1, Y_2)\) giving the number of zeros for \((U_1, U_2)\) before the variable equals one for the first time. The bivariate mass function is (Nadarajah 2008) \[\begin{align*} \Pr(Y_1 = k, Y_2 = l) &= \begin{cases} p_{00}^kp_{01}p_{0 \bullet}^{l-k-1}p_{1 \bullet} & 0 \leq k < l; \\ p_{00}^kp_{11} & k=l; \\ p_{00}^lp_{10}p_{\bullet 0}^{k-l-1}p_{\bullet 1} & 0 \leq l < k. \end{cases} \end{align*}\] We can compute the joint survival function \(\Pr(Y_1 \ge k, Y_2 \ge l)\) by using the tail sum of a geometric series, \(\sum_{i=n}^\infty p^i = p^n/(1-p)\) for \(0<p<1\). Thus, for the case \(0\leq k < l\), we have \[\begin{align*} \Pr(Y_1 \ge k, Y_2 \ge l) &= \sum_{i=k}^\infty \sum_{j=l}^\infty \Pr(Y_1 = i, Y_2 = j)\\ &= \sum_{i=k}^\infty p_{00}^ip_{01}p_{0 \bullet}^{-i-1}p_{1 \bullet} \sum_{j=l}^\infty p_{0 \bullet}^{j} \\&=\sum_{i=k}^\infty p_{00}^ip_{01}p_{0 \bullet}^{-i-1}p_{1 \bullet} \frac{p_{0 \bullet}^{l}}{1-p_{0\bullet}} \\&= p_{0 \bullet}^{l-1}p_{01} \sum_{i=k}^\infty \left(\frac{p_{00}}{p_{0 \bullet}}\right)^{i} \\& = p_{00}^k p_{0 \bullet}^{l-k} \end{align*}\] since \(p_{0\bullet} + p_{1\bullet} = 1\). We can proceed similarly with other subcases to find \[\begin{align*} \Pr(Y_1 \ge k, Y_2 \ge l) = \begin{cases} p_{00}^k p_{0 \bullet}^{l-k} & 0\leq k < l\\ p_{00}^k & 0 \leq k=l \\ p_{00}^l p_{\bullet 0}^{k-l} & 0\leq l < k\\ \end{cases} \end{align*}\] and we can obtain the marginal survival function by considering \(\Pr(Y_1 \ge 0, Y_2 \ge l)\), etc., which yields \(\Pr(Y_2 \ge l) = p_{0 \bullet}^{l}\), whence \[\begin{align*} \Pr(Y_2 = l) &= \Pr(Y_2 \ge l) -\Pr(Y_2 \ge l+1) \\&= p_{0 \bullet}^{l} (1-p_{0 \bullet}) \\&=p_{0 \bullet}^{l}p_{1\bullet} \end{align*}\] and so both margins are geometric.
Definition 1.15 (Independence) We say that \(\boldsymbol{Y}\) and \(\boldsymbol{X}\) are independent if their joint distribution function factorizes as \[\begin{align*} F_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y}) = F_{\boldsymbol{X}}(\boldsymbol{x})F_{\boldsymbol{Y}}(\boldsymbol{y}) \end{align*}\] for any value of \(\boldsymbol{x}\), \(\boldsymbol{y}\). It follows from the definition of the joint density that, should the latter exist, it also factorizes as \[\begin{align*} f_{\boldsymbol{X}, \boldsymbol{Y}}(\boldsymbol{x}, \boldsymbol{y}) = f_{\boldsymbol{X}}(\boldsymbol{x})f_{\boldsymbol{Y}}(\boldsymbol{y}). \end{align*}\]
If two subvectors \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) are independent, then the conditional density \(f_{\boldsymbol{Y} \mid \boldsymbol{X}}(\boldsymbol{y}; \boldsymbol{x})\) equals the marginal \(f_{\boldsymbol{Y}}(\boldsymbol{y})\).
Proposition 1.2 (Gaussian vectors, independence and conditional independence properties)
A unique property of the multivariate normal distribution is the link between independence and the covariance matrix: components \(Y_i\) and \(Y_j\) are independent if and only if the \((i,j)\) off-diagonal entry of the covariance matrix \(\boldsymbol{Q}^{-1}\) is zero.
If \(q_{ij}=0\), then \(Y_i\) and \(Y_j\) are conditionally independent given the other components.
Proposition 1.3 (Law of iterated expectation and variance) Let \(\boldsymbol{Z}\) and \(\boldsymbol{Y}\) be random vectors. The expected value of \(\boldsymbol{Y}\) is \[\begin{align*} \mathsf{E}_{\boldsymbol{Y}}(\boldsymbol{Y}) = \mathsf{E}_{\boldsymbol{Z}}\left\{\mathsf{E}_{\boldsymbol{Y} \mid \boldsymbol{Z}}(\boldsymbol{Y})\right\}. \end{align*}\]
The tower property gives a law of iterated variance \[\begin{align*} \mathsf{Va}_{\boldsymbol{Y}}(\boldsymbol{Y}) = \mathsf{E}_{\boldsymbol{Z}}\left\{\mathsf{Va}_{\boldsymbol{Y} \mid \boldsymbol{Z}}(\boldsymbol{Y})\right\} + \mathsf{Va}_{\boldsymbol{Z}}\left\{\mathsf{E}_{\boldsymbol{Y} \mid \boldsymbol{Z}}(\boldsymbol{Y})\right\}. \end{align*}\]
We can deduce that, in a hierarchical model, the variance of the unconditional distribution is larger than the expected variance of the conditional distribution, since both terms on the right-hand side are nonnegative.
Example 1.7 Let \(Y \mid X\sim \mathsf{Gauss}(X, \sigma^2)\) and \(X \sim \mathsf{Gauss}(0, \tau^2)\). The unconditional mean and variance of \(Y\) are \[\begin{align*} \mathsf{E}(Y) = \mathsf{E}_{X}\{\mathsf{E}_{Y\mid X}(Y)\}= \mathsf{E}_{X}(X) = 0 \end{align*}\] and \[\begin{align*} \mathsf{Va}(Y) &= \mathsf{E}_{X}\{\mathsf{Va}_{Y\mid X}(Y)\} + \mathsf{Va}_{X}\{\mathsf{E}_{Y\mid X}(Y)\} \\&= \mathsf{E}_{X}(\sigma^2) + \mathsf{Va}_{X}(X) \\&= \sigma^2 + \tau^2 \end{align*}\]
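A simulation confirms the calculation (a sketch with arbitrary values \(\sigma=2\) and \(\tau=3\)).

```r
# Monte Carlo check of Example 1.7: Va(Y) = sigma^2 + tau^2
sigma <- 2; tau <- 3
B <- 1e5
X <- rnorm(B, mean = 0, sd = tau)
Y <- rnorm(B, mean = X, sd = sigma)
c(empirical = var(Y), theoretical = sigma^2 + tau^2)
```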
Example 1.8 (Negative binomial as a Poisson mixture)
One limitation of the Poisson model is that the restriction on its moments, namely the equality of the mean and the variance, is often unrealistic. The most frequent problem encountered is that of overdispersion, meaning that the variability in the counts is larger than that implied by a Poisson distribution.
One common framework for handling overdispersion is to take \(Y \mid \Lambda = \lambda \sim \mathsf{Poisson}(\lambda)\), where the mean of the Poisson distribution is itself a positive random variable with mean \(\mu\). If \(\Lambda\) follows a conjugate gamma distribution with shape \(k\mu\) and rate \(k>0\), \(\Lambda \sim \mathsf{gamma}(k\mu, k)\), then the posterior is \(\Lambda \mid Y=y \sim \mathsf{gamma}(k\mu + y, k+1)\).
Since the joint density of \(Y\) and \(\Lambda\) can be written \[ p(y, \lambda) = p(y \mid \lambda)p(\lambda) = p(\lambda \mid y) p(y) \] we can isolate the marginal density \[\begin{align*} p(y) &= \frac{p(y \mid \lambda)p(\lambda)}{p(\lambda \mid y)} \\&= \frac{\frac{\lambda^y\exp(-\lambda)}{\Gamma(y+1)} \frac{k^{k\mu}\lambda^{k\mu-1}\exp(-k\lambda)}{\Gamma(k\mu)}}{ \frac{(k+1)^{k\mu+y}\lambda^{k\mu+y-1}\exp\{-(k+1)\lambda\}}{\Gamma(k\mu+y)}}\\ &= \frac{\Gamma(k\mu+y)}{\Gamma(k\mu)\Gamma(y+1)}k^{k\mu} (k+1)^{-k\mu-y}\\&= \frac{\Gamma(k\mu+y)}{\Gamma(k\mu)\Gamma(y+1)}\left(1-\frac{1}{k+1}\right)^{k\mu} \left(\frac{1}{k+1}\right)^y \end{align*}\] and this is the density of a negative binomial distribution with probability of success \(1/(k+1)\). We can thus view the negative binomial as a Poisson mean mixture.
By the laws of iterated expectation and iterated variance, \[\begin{align*} \mathsf{E}(Y) &= \mathsf{E}_{\Lambda}\{\mathsf{E}(Y \mid \Lambda)\} \\& = \mathsf{E}(\Lambda) = \mu\\ \mathsf{Va}(Y) &= \mathsf{E}_{\Lambda}\{\mathsf{Va}(Y \mid \Lambda)\} + \mathsf{Va}_{\Lambda}\{\mathsf{E}(Y \mid \Lambda)\} \\&= \mathsf{E}(\Lambda) + \mathsf{Va}(\Lambda) \\&= \mu + \mu/k. \end{align*}\] The marginal distribution of \(Y\), unconditionally, has a variance which exceeds its mean, as \[\begin{align*} \mathsf{E}(Y) = \mu, \qquad \mathsf{Va}(Y) = \mu (1+1/k). \end{align*}\] In a negative binomial regression model, the term \(k\) is a dispersion parameter, which is fixed for all observations, whereas \(\mu = \exp(\mathbf{x}^\top\boldsymbol{\beta})\) is a function of covariates \(\mathbf{x}\). As \(k \to \infty\), the distribution of \(\Lambda\) degenerates to a constant at \(\mu\) and we recover the Poisson model.
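The mixture representation is convenient for simulation; the sketch below uses arbitrary values \(\mu=4\) and \(k=2\) and checks the unconditional mean and variance empirically.

```r
# Simulate the Poisson-gamma hierarchy of Example 1.8 and check the moments
mu <- 4; k <- 2
B <- 1e5
Lambda <- rgamma(B, shape = k * mu, rate = k)
Y <- rpois(B, lambda = Lambda)
c(mean = mean(Y), variance = var(Y))   # approximately mu and mu * (1 + 1/k)
```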
Proposition 1.4 (Partitioning of covariance matrices) Let \(\boldsymbol{\Sigma}\) be a \(d \times d\) positive definite covariance matrix. We define the precision matrix \(\boldsymbol{Q} = \boldsymbol{\Sigma}^{-1}\). Suppose the matrices are partitioned into blocks, \[\begin{align*} \boldsymbol{\Sigma}= \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \text{ and } \boldsymbol{\Sigma}^{-1}= \boldsymbol{Q} = \begin{pmatrix} \boldsymbol{Q}_{11} &\boldsymbol{Q}_{12} \\ \boldsymbol{Q}_{21} & \boldsymbol{Q}_{22} \end{pmatrix} \end{align*}\] with \(\dim(\boldsymbol{\Sigma}_{11})=k\times k\) and \(\dim(\boldsymbol{\Sigma}_{22})=(d-k) \times (d-k)\). One can show the following relationships:
- \(\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}=-\boldsymbol{Q}_{11}^{-1}\boldsymbol{Q}_{12}\)
- \(\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}=\boldsymbol{Q}_{11}^{-1}\)
- \(\det(\boldsymbol{\Sigma})=\det(\boldsymbol{\Sigma}_{22})\det(\boldsymbol{\Sigma}_{1|2})\) where \(\boldsymbol{\Sigma}_{1|2}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\).
Proposition 1.5 (Conditional distribution of Gaussian vectors) Let \(\boldsymbol{Y} \sim \mathsf{Gauss}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) and consider the partition \[\begin{align*} \boldsymbol{Y} = \begin{pmatrix} \boldsymbol{Y}_1 \\ \boldsymbol{Y}_2\end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}\end{pmatrix}, \end{align*}\] where \(\boldsymbol{Y}_1\) is a \(k \times 1\) and \(\boldsymbol{Y}_2\) is a \((d-k) \times 1\) vector for some \(1\leq k < d.\) Then, we have the conditional distribution \[\begin{align*} \boldsymbol{Y}_1 \mid \boldsymbol{Y}_2 =\boldsymbol{y}_2 &\sim \mathsf{Gauss}_k(\boldsymbol{\mu}_1+\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}(\boldsymbol{y}_2-\boldsymbol{\mu}_2), \boldsymbol{\Sigma}_{1|2}) \\& \sim \mathsf{Gauss}_k(\boldsymbol{\mu}_1-\boldsymbol{Q}_{11}^{-1}\boldsymbol{Q}_{12}(\boldsymbol{y}_2-\boldsymbol{\mu}_2), \boldsymbol{Q}^{-1}_{11}) \end{align*}\] and \(\boldsymbol{\Sigma}_{1|2}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\) is the Schur complement of \(\boldsymbol{\Sigma}_{22}\).
Proof. It is easier to obtain this result by expressing the density of the Gaussian distribution in terms of the precision matrix \(\boldsymbol{Q}= \left(\begin{smallmatrix}\boldsymbol{Q}_{11} & \boldsymbol{Q}_{12}\\ \boldsymbol{Q}_{21} & \boldsymbol{Q}_{22}\end{smallmatrix}\right)\) rather than in terms of the covariance matrix \(\boldsymbol{\Sigma}\).
Consider the partition \(\boldsymbol{Y}=(\boldsymbol{Y}_1, \boldsymbol{Y}_2)\). The log conditional density \(\log f(\boldsymbol{y}_1 \mid \boldsymbol{y}_2)\), viewed as a function of \(\boldsymbol{y}_1\), is \[\begin{align*} &-\frac{1}{2}\left(\boldsymbol{y}_1-\boldsymbol{\mu}_1\right)^\top \boldsymbol{Q}_{11}\left(\boldsymbol{y}_1-\boldsymbol{\mu}_1\right) - \left(\boldsymbol{y}_1-\boldsymbol{\mu}_1\right)^\top \boldsymbol{Q}_{12}\left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right)\\ &\qquad = -\frac{1}{2}\boldsymbol{y}_1^\top\boldsymbol{Q}_{11}\boldsymbol{y}_1+\boldsymbol{y}_1^\top \left\{\boldsymbol{Q}_{11}\boldsymbol{\mu}_1-\boldsymbol{Q}_{12}\left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right)\right\} + \text{constant}, \end{align*}\] upon expanding the quadratic form and discarding terms that do not depend on \(\boldsymbol{y}_1\). Completing the square in \(\boldsymbol{y}_1\) shows that this is proportional to the log density of a Gaussian distribution (and hence must be Gaussian) with precision matrix \(\boldsymbol{Q}_{11}\), while the mean vector and covariance matrix are \[\begin{align*} \boldsymbol{\mu}_{1|2} &= \boldsymbol{\mu}_1-\boldsymbol{Q}_{11}^{-1}\boldsymbol{Q}_{12} \left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right) \\&=\boldsymbol{\mu}_1+ \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\left(\boldsymbol{y}_2-\boldsymbol{\mu}_2\right) \\\boldsymbol{\Sigma}_{1|2} &= \boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}. \end{align*}\] Note that \(\boldsymbol{\Sigma}_{1|2}=\boldsymbol{Q}_{11}^{-1}\) corresponds to the Schur complement of \(\boldsymbol{\Sigma}_{22}\).
Remark that the above argument is sufficient to identify the conditional distribution (why?): the quadratic form appearing in the exponential term of the density of a Gaussian vector with mean \(\boldsymbol{\nu}\) and precision \(\boldsymbol{\varPsi}\), \[\begin{align*} (\boldsymbol{x}-\boldsymbol{\nu})^\top\boldsymbol{\varPsi}(\boldsymbol{x}-\boldsymbol{\nu})= \boldsymbol{x}^\top\boldsymbol{\varPsi}\boldsymbol{x} - \boldsymbol{x}^\top\boldsymbol{\varPsi}\boldsymbol{\nu} - \boldsymbol{\nu}^\top\boldsymbol{\varPsi}\boldsymbol{x} + \boldsymbol{\nu}^\top\boldsymbol{\varPsi}\boldsymbol{\nu}, \end{align*}\] uniquely determines the parameters of the Gaussian distribution. The quadratic term in \(\boldsymbol{x}\) forms a sandwich around the precision matrix, while the linear term identifies the location vector. Since any (conditional) density function integrates to one, there is a unique normalizing constant and the latter need not be computed.
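The conditional mean and covariance formulas of Proposition 1.5 translate directly into code; the helper below (the name cond_gauss, the indices and the test matrix are illustrative choices) computes them for an arbitrary partition.

```r
# Conditional mean and covariance of Y[ind1] given the other components (Prop. 1.5)
cond_gauss <- function(y2, mu, Sigma, ind1) {
  ind2 <- setdiff(seq_along(mu), ind1)
  S12_S22inv <- Sigma[ind1, ind2, drop = FALSE] %*%
    solve(Sigma[ind2, ind2, drop = FALSE])
  list(mean = c(mu[ind1] + S12_S22inv %*% (y2 - mu[ind2])),
       cov = Sigma[ind1, ind1, drop = FALSE] -
             S12_S22inv %*% Sigma[ind2, ind1, drop = FALSE])
}
Sigma <- matrix(c(2, 1, 0.5, 1, 3, 1, 0.5, 1, 1.5), nrow = 3)
cond_gauss(y2 = c(0, 2), mu = c(1, 0, -1), Sigma = Sigma, ind1 = 1)
```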
1.2 Likelihood
Definition 1.16 (Likelihood) The likelihood \(L(\boldsymbol{\theta})\) is a function of the parameter vector \(\boldsymbol{\theta}\) that gives the probability (or density) of observing a sample under a postulated distribution, treating the observations as fixed, \[\begin{align*} L(\boldsymbol{\theta}; \boldsymbol{y}) = f(\boldsymbol{y}; \boldsymbol{\theta}), \end{align*}\] where \(f(\boldsymbol{y}; \boldsymbol{\theta})\) denotes the joint density or mass function of the \(n\)-vector containing the observations.
If the latter are independent, the joint density factorizes as the product of the density of individual observations, and the likelihood becomes \[\begin{align*} L(\boldsymbol{\theta}; \boldsymbol{y})=\prod_{i=1}^n f_i(y_i; \boldsymbol{\theta}) = f_1(y_1; \boldsymbol{\theta}) \times \cdots \times f_n(y_n; \boldsymbol{\theta}). \end{align*}\] The corresponding log likelihood function for independent and identically distributed observations is \[\begin{align*} \ell(\boldsymbol{\theta}; \boldsymbol{y}) = \sum_{i=1}^n \ln f(y_i; \boldsymbol{\theta}). \end{align*}\]
Definition 1.17 (Score and information matrix) Let \(\ell(\boldsymbol{\theta}),\) \(\boldsymbol{\theta} \in \boldsymbol{\Theta} \subseteq \mathbb{R}^p,\) be the log likelihood function. The gradient of the log likelihood \(U(\boldsymbol{\theta}) = \partial \ell(\boldsymbol{\theta}) / \partial \boldsymbol{\theta}\) is termed the score function.
The observed information matrix is the Hessian of the negative log likelihood, \[\begin{align*} j(\boldsymbol{\theta}; \boldsymbol{y})=-\frac{\partial^2 \ell(\boldsymbol{\theta}; \boldsymbol{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^\top}, \end{align*}\] typically evaluated at the maximum likelihood estimate \(\widehat{\boldsymbol{\theta}},\) giving \(j(\widehat{\boldsymbol{\theta}}).\) Under regularity conditions, the expected information, also called the Fisher information matrix, is \[\begin{align*} i(\boldsymbol{\theta}) = \mathsf{E}\left\{U(\boldsymbol{\theta}; \boldsymbol{Y}) U(\boldsymbol{\theta}; \boldsymbol{Y})^\top\right\} = \mathsf{E}\left\{j(\boldsymbol{\theta}; \boldsymbol{Y})\right\}. \end{align*}\] Both the Fisher (or expected) and the observed information matrices are symmetric; they encode the curvature of the log likelihood and provide information about the variability of \(\widehat{\boldsymbol{\theta}}.\)
The information of an independent and identically distributed sample of size \(n\) is \(n\) times that of a single observation, so information accumulates at a linear rate.
Example 1.9 (Likelihood for right-censoring) Consider a survival analysis problem for independent time-to-event data subject to (noninformative) random right-censoring. We assume failure times \(Y_i (i=1, \ldots, n)\) are drawn from a common distribution \(F(\cdot; \boldsymbol{\theta})\) supported on \((0, \infty)\), complemented with a censoring indicator \(C_i \in \{0,1\}\), where \(C_i=0\) indicates right-censoring and \(C_i=1\) an observed failure time. If individual \(i\) has not experienced the event at the end of the collection period, then its likelihood contribution is \(\Pr(Y_i > y_i) = 1-F(y_i; \boldsymbol{\theta})\), where \(y_i\) is the largest time at which individual \(i\) was observed.
We write the log likelihood in terms of the right-censoring binary indicator as \[\begin{align*} \ell(\boldsymbol{\theta}) = \sum_{i: c_i=0} \log \{1- F(y_i; \boldsymbol{\theta})\} + \sum_{i: c_i=1} \log f(y_i; \boldsymbol{\theta}) \end{align*}\]
Suppose for simplicity that \(Y_i \sim \mathsf{expo}(\lambda)\) and let \(m=c_1 + \cdots + c_n\) denote the number of observed failure times. Then, the log likelihood and the observed information are \[\begin{align*} \ell(\lambda) &= m \log \lambda - \lambda \sum_{i=1}^n y_i, \\ j(\lambda) &= m/\lambda^2, \end{align*}\] and the right-censored observations for the exponential model do not contribute to the information.
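As an illustration, setting the score to zero gives the maximum likelihood estimator \(\widehat{\lambda} = m/\sum_{i=1}^n y_i\), with standard error \(\widehat{\lambda}/\sqrt{m}\) from the observed information. The sketch below uses simulated data with an assumed exponential censoring mechanism.

```r
# Right-censored exponential data (Example 1.9)
set.seed(1234)
n <- 500
lambda <- 0.2
failure <- rexp(n, rate = lambda)
censoring <- rexp(n, rate = 0.1)           # assumed censoring distribution
y <- pmin(failure, censoring)              # observed times
delta <- as.integer(failure <= censoring)  # 1 = observed failure, 0 = censored
m <- sum(delta)
lambda_hat <- m / sum(y)
c(estimate = lambda_hat, std.error = lambda_hat / sqrt(m))
```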
Example 1.10 (Information for the Gaussian distribution) Consider \(Y \sim \mathsf{Gauss}(\mu, \tau^{-1})\), parametrized in terms of the precision \(\tau\). The log likelihood for a sample of size \(n\) is, up to proportionality, \[\begin{align*} \ell(\mu, \tau) \propto \frac{n}{2}\log(\tau) - \frac{\tau}{2}\sum_{i=1}^n(Y_i^2-2\mu Y_i+\mu^2). \end{align*}\]
The observed and Fisher information matrices are \[\begin{align*} j(\mu, \tau) &= \begin{pmatrix} n\tau & -\sum_{i=1}^n (Y_i-\mu)\\ -\sum_{i=1}^n (Y_i-\mu) & \frac{n}{2\tau^2} \end{pmatrix}, \\ i(\mu, \tau) &= n\begin{pmatrix} \tau & 0\\ 0 & \frac{1}{2\tau^2} \end{pmatrix}. \end{align*}\] Since \(\mathsf{E}(Y_i) = \mu\), the off-diagonal entries of the Fisher information matrix are zero.
Example 1.11 (Likelihood, score and information of the Weibull distribution) The log likelihood for a simple random sample whose realizations are \(y_1, \ldots, y_n\) of size \(n\) from a \(\mathsf{Weibull}(\lambda, \alpha)\) model is \[\begin{align*} \ell(\lambda, \alpha) = n \ln(\alpha) - n\alpha\ln(\lambda) + (\alpha-1) \sum_{i=1}^n \ln y_i - \lambda^{-\alpha}\sum_{i=1}^n y_i^\alpha. \end{align*}\]
The score, which is the gradient of the log likelihood, is easily obtained by differentiation \[\begin{align*} U(\lambda, \alpha) &= \begin{pmatrix}\frac{\partial \ell(\lambda, \alpha)}{\partial \lambda} \\ \frac{\partial \ell(\lambda, \alpha)}{\partial \alpha} \end{pmatrix} \\&= \begin{pmatrix} -\frac{n\alpha}{\lambda} +\alpha\lambda^{-\alpha-1}\sum_{i=1}^n y_i^\alpha \\ \frac{n}{\alpha} + \sum_{i=1}^n \ln (y_i/\lambda) - \sum_{i=1}^n \left(\frac{y_i}{\lambda}\right)^{\alpha} \times\ln\left(\frac{y_i}{\lambda}\right) \end{pmatrix} \end{align*}\] and the observed information is the \(2 \times 2\) matrix-valued function \[\begin{align*} j(\lambda, \alpha) &= - \begin{pmatrix} \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \lambda^2} & \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \lambda \partial \alpha} \\ \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \alpha \partial \lambda} & \frac{\partial^2 \ell(\lambda, \alpha)}{\partial \alpha^2} \end{pmatrix} = \begin{pmatrix} j_{\lambda, \lambda} & j_{\lambda, \alpha} \\ j_{\lambda, \alpha} & j_{\alpha, \alpha} \end{pmatrix}, \end{align*}\] whose entries are \[\begin{align*} j_{\lambda, \lambda} &= \lambda^{-2}\left\{-n\alpha + \alpha(\alpha+1)\sum_{i=1}^n (y_i/\lambda)^\alpha\right\} \\ j_{\lambda, \alpha} &= \lambda^{-1}\sum_{i=1}^n [1-(y_i/\lambda)^\alpha\{1+\alpha\ln(y_i/\lambda)\}] \\ j_{\alpha,\alpha} &= n\alpha^{-2} + \sum_{i=1}^n (y_i/\lambda)^\alpha \{\ln(y_i/\lambda)\}^2. \end{align*}\] To compute the expected information matrix, we need the expectations \(\mathsf{E}\{(Y/\lambda)^\alpha\}\), \(\mathsf{E}[(Y/\lambda)^\alpha\ln\{(Y/\lambda)^\alpha\}]\) and \(\mathsf{E}[(Y/\lambda)^\alpha\ln^2\{(Y/\lambda)^\alpha\}]\). By definition, \[\begin{align*} \mathsf{E}\left\{(Y/\lambda)^\alpha\right\} & = \int_0^\infty (x/\lambda)^\alpha \frac{\alpha}{\lambda^\alpha} x^{\alpha-1}\exp\left\{-(x/\lambda)^\alpha\right\} \mathrm{d} x \\ &= \int_0^\infty s\exp(-s) \mathrm{d} s =1, \end{align*}\] making the change of variable \(S = (Y/\lambda)^\alpha\sim \mathsf{expo}(1)\). The two other integrals are tabulated in Gradshteyn and Ryzhik (2014), and are equal to \(1-\gamma\) and \(\gamma^2-2\gamma + \pi^2/6\), respectively, where \(\gamma \approx 0.577\) is the Euler–Mascheroni constant. The expected information matrix of the Weibull distribution has entries \[\begin{align*} i_{\lambda, \lambda} & = n \lambda^{-2}\alpha\left\{ (\alpha+1)-1\right\} \\ i_{\lambda, \alpha} & = -n\lambda^{-1} (1-\gamma) \\ i_{\alpha, \alpha} & = n\alpha^{-2}(1 + \gamma^2-2\gamma+\pi^2/6) \end{align*}\]
We can check this result numerically, by comparing the observed information matrix with the Hessian of the negative log likelihood obtained by numerical differentiation, and by comparing the Monte Carlo average of the observed information matrix, over repeated samples, with the expected information matrix.
```r
exp_info_weib <- function(scale, shape){
  i11 <- shape*((shape + 1) - 1)/(scale^2)
  i12 <- -(1+digamma(1))/scale
  i22 <- (1+digamma(1)^2+2*digamma(1)+pi^2/6)/(shape^2)
  matrix(c(i11, i12, i12, i22), nrow = 2, ncol = 2)
}
obs_info_weib <- function(y, scale, shape){
  ys <- y/scale # scale family
  o11 <- shape*((shape + 1)*mean(ys^shape)-1)/scale^2
  o12 <- (1-mean(ys^shape*(1+shape*log(ys))))/scale
  o22 <- 1/(shape*shape) + mean(ys^shape*(log(ys))^2)
  matrix(c(o11, o12, o12, o22), nrow = 2, ncol = 2)
}
nll_weib <- function(pars, y){
  -sum(dweibull(x = y, scale = pars[1], shape = pars[2], log = TRUE))
}
# Fix parameters
scale <- rexp(n = 1, rate = 0.5)
shape <- rexp(n = 1)
nobs <- 1000L
dat <- rweibull(n = nobs, scale = scale, shape = shape)
# Compare Hessian with numerical differentiation
o_info <- obs_info_weib(dat, scale = scale, shape = shape)
all.equal(
  numDeriv::hessian(nll_weib, x = c(scale, shape), y = dat) / nobs,
  o_info)
# Compute approximation to Fisher information
exp_info_sim <- replicate(n = 1000, expr = {
  obs_info_weib(y = rweibull(n = nobs, shape = shape, scale = scale),
                scale = scale, shape = shape)})
all.equal(apply(exp_info_sim, 1:2, mean),
          exp_info_weib(scale, shape))
```
The joint density function only factorizes for independent data, but an alternative sequential decomposition can be helpful. For example, we can write the joint density \(f(y_1, \ldots, y_n)\) using the factorization \[\begin{align*} f(\boldsymbol{y}) = f(y_1) \times f(y_2 \mid y_1) \times \cdots \times f(y_n \mid y_1, \ldots, y_{n-1}) \end{align*}\] in terms of conditional densities. Such a decomposition is particularly useful in the context of time series, where data are ordered from time \(1\) until time \(n\) and models typically relate observation \(y_t\) to its past.
Example 1.12 (First-order autoregressive process) Consider a Gaussian \(\mathsf{AR}(1)\) model of the form \[Y_t = \mu + \phi(Y_{t-1} - \mu) + \varepsilon_t,\] where \(\phi\) is the lag-one correlation, \(\mu\) the global mean and \(\varepsilon_t\) is an iid innovation with mean zero and variance \(\sigma^2\). If \(|\phi| < 1\), the process is stationary, and the variance does not increase with \(t\).
The Markov property states that the current realization depends on the past, \(Y_t \mid Y_1, \ldots, Y_{t-1},\) only through the most recent value \(Y_{t-1}.\) The log likelihood thus becomes \[\begin{align*} \ell(\boldsymbol{\theta}) = \ln f(y_1) + \sum_{t=2}^n \ln f(y_t \mid y_{t-1}). \end{align*}\]
If innovations are Gaussian, we have \[Y_t \mid Y_{t-1}=y_{t-1} \sim \mathsf{Gauss}\{\mu(1-\phi)+ \phi y_{t-1}, \sigma^2\}, \qquad t>1.\]
The stationary \(\mathsf{AR}(1)\) process \(Y_t\), marginally, has mean \(\mu\) and unconditional variance \(\sigma^2/(1-\phi^2)\). The \(\mathsf{AR}(1)\) process is first-order Markov since the conditional distribution \(p(Y_t \mid Y_{t-1}, \ldots, Y_{t-p})\) equals \(p(Y_t \mid Y_{t-1})\). This means the log likelihood is \[\begin{align*} \ell(\mu, \phi,\sigma^2)& = -\frac{n}{2}\log(2\pi) - n\log \sigma + \frac{1}{2}\log(1-\phi^2) \\&\quad -\frac{(1-\phi^2)(y_1- \mu)^2}{2\sigma^2} - \sum_{t=2}^n \frac{\{y_t - \mu(1-\phi)- \phi y_{t-1}\}^2}{2\sigma^2}. \end{align*}\]
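The sketch below evaluates this log likelihood and checks it against the equivalent sequential factorization using dnorm(); the simulated series and the parameter values are arbitrary choices.

```r
# Exact Gaussian AR(1) log likelihood of Example 1.12
ar1_loglik <- function(y, mu, phi, sigma) {
  n <- length(y)
  -n/2 * log(2 * pi) - n * log(sigma) + 0.5 * log(1 - phi^2) -
    (1 - phi^2) * (y[1] - mu)^2 / (2 * sigma^2) -
    sum((y[-1] - mu * (1 - phi) - phi * y[-n])^2) / (2 * sigma^2)
}
set.seed(2023)
mu <- 1; phi <- 0.5; sigma <- 2
y <- mu + as.numeric(arima.sim(model = list(ar = phi), n = 200, sd = sigma))
# Same value from the factorization f(y1) * prod_t f(y_t | y_{t-1})
ll_seq <- dnorm(y[1], mean = mu, sd = sigma / sqrt(1 - phi^2), log = TRUE) +
  sum(dnorm(y[-1], mean = mu * (1 - phi) + phi * y[-length(y)],
            sd = sigma, log = TRUE))
all.equal(ar1_loglik(y, mu, phi, sigma), ll_seq)
```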
1.3 Monte Carlo methods
Consider a target distribution with finite expected value. The law of large numbers guarantees that, if we can draw observations from this target distribution, then the sample average converges to its expected value as the number of draws increases.
We can thus compute the probability of any event or the expected value of any (integrable) function by computing sample averages; the cost to pay for this generality is randomness.
Specifically, suppose we are interested in the average \(\mathsf{E}\{g(X)\}\) of \(X \sim F\) for some function \(g\).
Example 1.13 Consider \(X \sim \mathsf{Gamma}(\alpha, \beta)\), a gamma distribution with shape \(\alpha\) and rate \(\beta\). We can compute the probability that \(X < 1\) easily by Monte Carlo since \(\Pr(X <1) = \mathsf{E}\{\mathrm{I}(X<1)\}\) and this means we only need to compute the proportion of draws less than one. We can likewise compute the mean \(g(x) = x\) or variance.
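A sketch of these calculations, using shape \(\alpha = 1/2\) and rate \(\beta = 2\) (values consistent with the variances quoted below):

```r
# Monte Carlo estimates for X ~ gamma(1/2, 2)
alpha <- 0.5; beta <- 2
B <- 1e5
x <- rgamma(B, shape = alpha, rate = beta)
mean(x < 1)                              # Monte Carlo estimate of Pr(X < 1)
pgamma(1, shape = alpha, rate = beta)    # exact value
c(mean(x), alpha / beta)                 # Monte Carlo and exact mean
```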
Suppose we have drawn a Monte Carlo sample of size \(B\). If the function \(g(\cdot)\) is square integrable, with variance \(\sigma^2_g\), then a central limit theorem applies. In large samples and for independent observations, our Monte Carlo average \(\widehat{\mu}_g = B^{-1}\sum_{b=1}^B g(X_b)\) has variance \(\sigma^2_g/B\). We can approximate the unknown variance \(\sigma^2_g\) by its empirical counterpart. Note that, while the variance decreases at rate \(1/B\), the choice of \(g\) impacts the speed of convergence: we can compute \[\sigma^2_g =\Pr(X \leq 1)\{1-\Pr(X \leq 1)\}=0.0434\] (left) and \(\sigma^2_g=\alpha/\beta^2=1/8\) (middle plot).
Figure 1.1 shows the empirical trace plot of the Monte Carlo average (note the \(\sqrt{B}\) \(x\)-axis scale!) as a function of the Monte Carlo sample size \(B\) along with 95% Wald-based confidence intervals (gray shaded region), \(\widehat{\mu}_g \pm 1.96 \times \sigma_g/\sqrt{B}\). We can see that the ‘likely region’ for the average shrinks with \(B\).
What happens if our function is not integrable? The right-hand plot of Figure 1.1 shows empirical averages of \(g(x) = x^{-1}\), which is not integrable if \(\alpha < 1\). We can compute the empirical average, but the result will not converge to any meaningful quantity regardless of the sample size. The large jumps in the trace plot are evidence of this.
We have already used Monte Carlo methods to compute posterior quantities of interest in conjugate models. Outside of models with conjugate priors, the lack of a closed-form expression for the posterior precludes analytical inference. Indeed, calculating the posterior probability of an event, or posterior moments, requires integration of the normalized posterior density and thus knowledge of the marginal likelihood. It is seldom possible to sample independent and identically distributed (iid) draws from the target, especially if the model is high dimensional: rejection sampling and the ratio-of-uniforms method are examples of Monte Carlo methods which can be used to generate iid draws.
Proposition 1.6 (Rejection sampling) Rejection sampling (also termed accept-reject algorithm) samples from a random vector with density \(p(\cdot)\) by drawing candidates from a proposal with density \(q(\cdot)\) with nested support, \(\mathrm{supp}(p) \subseteq \mathrm{supp}(q)\). The density \(q(\cdot)\) must be such that \(p(\boldsymbol{\theta}) \leq C q(\boldsymbol{\theta})\) for \(C \geq 1\) for all values of \(\boldsymbol{\theta}\) in the support of \(p(\cdot)\). A proof can be found in Devroye (1986, Theorem 3.1)
- Generate \(\boldsymbol{\theta}^{\star}\) from the proposal with density \(q\) and \(U \sim \mathsf{U}(0,1)\)
- Compute the ratio \(R \gets p(\boldsymbol{\theta}^{\star})/ q(\boldsymbol{\theta}^{\star})\).
- If \(R \geq CU\), return \(\boldsymbol{\theta}^{\star}\); otherwise, go back to step 1.
Rejection sampling requires the proposal \(q\) to have a support at least as large as that of \(p\) and to closely resemble the target density. The proposal should be chosen so that the upper bound \(C\) is as sharp as possible and close to 1; the dominating density \(q\) must have heavier tails than the density of interest. The expected number of simulations needed to accept one proposal is \(C.\) Finally, for the method to be useful, we need to be able to simulate easily and cheaply from the proposal. The optimal value of \(C\) is \(C = \sup_{\boldsymbol{\theta}} p(\boldsymbol{\theta}) / q(\boldsymbol{\theta})\); when this maximum is not known analytically, it may be obtained by numerical optimization of the difference of the log densities.
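As a toy illustration of these recommendations (an assumed example, not from the source), the sketch below samples from a \(\mathsf{beta}(2,4)\) target using a standard uniform proposal, with the bound \(C\) obtained by numerical optimization of the log density ratio.

```r
# Rejection sampling for a beta(2, 4) target with a uniform proposal on [0, 1];
# since q(x) = 1, the bound C is simply the maximum of the target density
C <- exp(optimize(function(x) dbeta(x, shape1 = 2, shape2 = 4, log = TRUE),
                  interval = c(0, 1), maximum = TRUE)$objective)
niter <- 1e4
cand <- runif(niter)
U <- runif(niter)
samp <- cand[dbeta(cand, 2, 4) >= C * U]  # accept if R >= C * U
length(samp) / niter                      # acceptance rate, roughly 1/C
```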
Example 1.14 (Truncated Gaussian distribution) Consider the problem of sampling from a Gaussian distribution \(Y \sim \mathsf{Gauss}(\mu, \sigma^2)\) truncated in the interval \([a, b],\) which has density \[\begin{align*} f(x; \mu, \sigma, a, b) = \frac{1}{\sigma}\frac{\phi\left(\frac{x-\mu}{\sigma}\right)}{\Phi\{(b-\mu)/\sigma\}-\Phi\{(a-\mu)/\sigma\}}, \qquad a \leq x \leq b, \end{align*}\] where \(\phi(\cdot), \Phi(\cdot)\) are respectively the density and distribution function of the standard Gaussian distribution.
Since the Gaussian is a location-scale family, we can reduce the problem to sampling \(X\) from a standard Gaussian truncated on \(\alpha = (a-\mu)/\sigma\) and \(\beta = (b-\mu)/\sigma\) and back transform the result as \(Y = \mu + \sigma X\).
A crude accept-reject sampling algorithm would draw candidates from the untruncated distribution with density \(g(x) = \sigma^{-1}\phi\{(x-\mu)/\sigma\}\); the acceptance rate is then \(C^{-1}=\Phi(\beta) - \Phi(\alpha)\). We thus simply simulate points from the Gaussian and accept any that fall within the bounds.
```r
# Standard Gaussian truncated on [0,1]
candidate <- rnorm(1e5)
trunc_samp <- candidate[candidate >= 0 & candidate <= 1]
# Acceptance rate
length(trunc_samp) / 1e5
```

[1] 0.34242

```r
# Theoretical acceptance rate
pnorm(1) - pnorm(0)
```

[1] 0.3413447
We can of course do better: if we consider a random variable with distribution function \(F,\) but truncated over the interval \([a,b],\) then the resulting distribution function is \[\frac{F(x) - F(a)}{F(b)-F(a)}, \qquad a \leq x \leq b,\] and we can invert this expression to get the quantile function of the truncated variable in terms of the distribution function \(F\) and the quantile function \(F^{-1}\) of the original untruncated variable.
For the Gaussian, this gives \[\begin{align*} X = \Phi^{-1}\left[\Phi(\alpha) + \{\Phi(\beta)-\Phi(\alpha)\}U\right] \end{align*}\] for \(U \sim \mathsf{U}(0,1)\). Although the distribution and quantile functions of the Gaussian, pnorm and qnorm in R, are very accurate, this method will fail for rare event simulation because it will return \(\Phi(x) = 0\) for \(x \leq -39\) and \(\Phi(x)=1\) for \(x \geq 8.3\), implying that \(a \leq 8.3\) for this approach to work (Botev and L’Écuyer 2017).
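For moderate truncation bounds, the inversion approach is a one-liner.

```r
# Inversion method for a standard Gaussian truncated on [alpha, beta] = [0, 1]
alpha <- 0; beta <- 1
n <- 1e5
trunc_samp <- qnorm(pnorm(alpha) + (pnorm(beta) - pnorm(alpha)) * runif(n))
range(trunc_samp)   # all draws lie within the truncation bounds
```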
Consider the problem of simulating events in the right tail for a standard Gaussian where \(a > 0\); Marsaglia’s method (Devroye 1986, 381), can be used for that purpose. Write the density of the Gaussian as \(f(x) = \exp(-x^2/2)/c_1\), where \(c_1 = \int_{a}^{\infty}\exp(-z^2/2)\mathrm{d} z\), and note that \[c_1f(x) \leq \frac{x}{a}\exp\left(-\frac{x^2}{2}\right)= a^{-1}\exp\left(-\frac{a^2}{2}\right)g(x), \qquad x \geq a;\] where \(g(x)\) is the density of a Rayleigh variable shifted by \(a\), which has distribution function \(G(x) = 1-\exp\{(a^2-x^2)/2\}\) for \(x \geq a\). We can simulate such a random variate \(X\) through the inversion method. The constant \(C= \exp(-a^2/2)(c_1a)^{-1}\) approaches 1 quickly as \(a \to \infty\).
The accept-reject thus proceeds with
- Generate a shifted Rayleigh above \(a\), \(X \gets \{a^2 - 2\log(U)\}^{1/2}\) for \(U \sim \mathsf{U}(0,1)\)
- Accept \(X\) if \(XV \leq a\), where \(V \sim \mathsf{U}(0,1)\).
Should we wish to obtain samples on \([a,b]\), we could instead propose from a Rayleigh truncated above at \(b\) (Botev and L’Écuyer 2017).
```r
a <- 8.3
niter <- 1000L
X <- sqrt(a^2 + 2*rexp(niter))
samp <- X[runif(niter)*X <= a]
```
For a given candidate density \(g\) which has a heavier tail than the target, we can resort to numerical methods to compute the mode of the ratio \(f/g\) and obtain the bound \(C\); see Albert (2009), Section 5.8 for an insightful example.