6 Hypothesis testing

This is a refresher on the notions related to hypothesis testing.

Trial analogy: suppose you are a member of the jury in a trial where the defendant stands accused of murder. If you declare him guilty, the sentence will likely be the death penalty. The null hypothesis is innocence: the accused is innocent until proven guilty. You will deliver a verdict of guilt only if the evidence against the accused is overwhelming (you do not want to send an innocent person to death row through a wrongful conviction).

In this setting, the verdict at the end of the trial reflects this “innocent until proven guilty” mindset: you can usually only conclude that there is not enough evidence to convict the accused of murder, not that the person is innocent. This is why we “fail to reject the null hypothesis”: we gave the accused the benefit of the doubt in the first place and examined the evidence in that light.

Test statistics are realizations of random variables, so the size of a test is the probability of falsely rejecting the null hypothesis, \[\alpha = \Pr(\text{reject }\mathrm{H}_0 \mid \mathrm{H}_0 \text{ is true}).\] This is fixed beforehand: if we take \(\alpha = 0.05\), then in 95% of cases we will correctly release an innocent person, and in 5% of cases we will convict him unduly (due to circumstantial factors, for example).
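As a quick sanity check of what the size means, we can simulate data under the null and verify that a test of nominal level \(\alpha = 0.05\) rejects about 5% of the time. A minimal sketch in Python (sample size, seed and number of replications are chosen for illustration, not taken from the source):

```python
# Sketch: estimate the size of the two-sided t-test by Monte Carlo.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, alpha, nrep = 30, 0.05, 10_000
cutoff = stats.t.ppf(1 - alpha / 2, df=n - 1)   # critical value t_{1-alpha/2}
y = rng.standard_normal((nrep, n))              # H0 is true: the mean is 0
tstat = y.mean(axis=1) / (y.std(axis=1, ddof=1) / np.sqrt(n))
print(np.mean(np.abs(tstat) > cutoff))          # empirical rejection rate, close to 0.05
```

The empirical rejection rate should be close to the nominal \(\alpha\), up to Monte Carlo error.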

We illustrate various concepts with the simple model \[Y_i = \beta_0 + \varepsilon_i, \qquad\varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2) \qquad (i=1, \ldots, n)\]

The Wald test statistic for the null hypothesis \(\mathrm{H}_0: \beta_0=0\) against the alternative \(\mathrm{H}_a: \beta_0 \neq 0\) is \(t = \hat{\beta}_0/\mathrm{se}(\hat{\beta}_0) \sim \mathcal{T}(n-p)\), where \(p\) is the number of mean parameters (here \(p=1\)). We can compare the Student distribution with the empirical distribution of the \(t\)-statistic obtained by simulating a large number of test statistics from the model; these should match.
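This comparison can be sketched by simulation (the values of \(n\), \(\sigma\) and the number of replications are illustrative, not from the source):

```python
# Sketch: simulate the null distribution of t = beta0_hat / se(beta0_hat)
# in the intercept-only model and compare it with the Student T(n-1) law.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n, sigma, nrep = 20, 2.0, 10_000
tstats = np.empty(nrep)
for b in range(nrep):
    y = sigma * rng.standard_normal(n)       # beta0 = 0 under H0
    beta0_hat = y.mean()                     # least squares estimate of beta0
    se = y.std(ddof=1) / np.sqrt(n)          # se(beta0_hat)
    tstats[b] = beta0_hat / se

# Empirical quantiles should match the theoretical T(n-1) quantiles.
qs = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
print(np.round(np.quantile(tstats, qs), 3))
print(np.round(stats.t.ppf(qs, df=n - 1), 3))
```

The two rows of quantiles should agree up to Monte Carlo error.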

If the value \(|t|\) is very large, then there is evidence that \(\beta_0 \neq 0\). In this case, the probability of observing something larger than \(|t|\) under \(T \sim \mathcal{T}(n-p)\) is \(P = 1-\Pr(-|t| < T < |t|) = 2 \Pr(T > |t|)\), by virtue of the symmetry of the Student distribution. This probability \(P\) is called the \(P\)-value: the probability of observing something as extreme under the null distribution.

The power of a test is \[\mathrm{power} = \Pr(\text{reject } \mathrm{H}_0 \mid \mathrm{H}_a \text{ is true}).\] Consider the alternative \(\mathrm{H}_a: \beta_0 = \mu \neq 0\). For the \(t\)-test, the power is a function of \(\mu, \sigma^2\) and \(n\). Intuitively, the further \(\mu\) is from zero, the larger the chance of correctly detecting that \(\mu \neq 0\). Similarly, the more precise our mean estimate is (when \(\sigma^2\) is small), the higher the power. Lastly, evidence accumulates with the sample size, here through the degrees of freedom parameter.

Even if we don’t know the distribution of the test statistic under the alternative, we can simulate the power curve as a function of \(\mu, \sigma\) and \(n\).
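A possible Monte Carlo sketch of the power curve as a function of \(\mu\), for fixed \(\sigma\) and \(n\) (all numeric values are illustrative assumptions):

```python
# Sketch: empirical power of the two-sided t-test as a function of mu,
# estimated by simulating from the alternative and counting rejections.
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n, sigma, alpha, nrep = 20, 2.0, 0.05, 2_000
cutoff = stats.t.ppf(1 - alpha / 2, df=n - 1)
mus = np.linspace(0, 2, 5)
power = []
for mu in mus:
    y = mu + sigma * rng.standard_normal((nrep, n))   # data under H_a: beta0 = mu
    beta0_hat = y.mean(axis=1)
    se = y.std(axis=1, ddof=1) / np.sqrt(n)
    power.append(np.mean(np.abs(beta0_hat / se) > cutoff))
print(dict(zip(mus.round(2), np.round(power, 3))))
```

The estimated power starts near \(\alpha\) at \(\mu = 0\) and increases toward one as \(\mu\) moves away from zero.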

Under \(\mathrm{H}_0\), our test statistic \(T=\hat{\beta}_0/\mathrm{se}(\hat{\beta}_0)\) follows a \(\mathcal{T}(n-1)\) distribution and the cutoff value is \(\mathfrak{t}_{1-\alpha/2}\), so that under \(\mathrm{H}_0\), \(\Pr(|T| > \mathfrak{t}_{1-\alpha/2}) = \alpha\).

We can compute the power exactly as a function of \(\mu\) in this example: it is \[\begin{align*} \beta(\mu) &= 1-\Pr\left(-\mathfrak{t}_{1-\alpha/2} \leq T \leq \mathfrak{t}_{1-\alpha/2}; {\mathrm{H}_a}\right) \\&=1-\Pr\left(-\mathfrak{t}_{1-\alpha/2} \leq \frac{\hat{\beta}_0-\mu+\mu}{\mathrm{se}(\hat{\beta}_0)} \leq \mathfrak{t}_{1-\alpha/2}; {\mathrm{H}_a}\right) \\&=1-\Pr\left(-\mathfrak{t}_{1-\alpha/2}-\frac{\mu}{\mathrm{se}(\hat{\beta}_0)} \leq \frac{\hat{\beta}_0-\mu}{\mathrm{se}(\hat{\beta}_0)} \leq \mathfrak{t}_{1-\alpha/2}-\frac{\mu}{\mathrm{se}(\hat{\beta}_0)};{\mathrm{H}_a}\right). \end{align*}\]

Here \(T^*=(\hat{\beta}_0-\mu)/\mathrm{se}(\hat{\beta}_0) \sim \mathcal{T}(n-1)\) under \(\mathrm{H}_a\). If we superimpose this curve as a function of \(\mu\), we see it matches the empirical power.
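The formula above can be sketched as follows, treating \(\mathrm{se}(\hat{\beta}_0)\) as fixed at \(\sigma/\sqrt{n}\) (a simplifying assumption of this sketch; all numeric values are illustrative):

```python
# Sketch: exact power curve beta(mu) from the derivation above,
# with se(beta0_hat) taken as the known value sigma / sqrt(n).
import numpy as np
from scipy import stats

n, sigma, alpha = 20, 2.0, 0.05
se = sigma / np.sqrt(n)
tc = stats.t.ppf(1 - alpha / 2, df=n - 1)      # cutoff t_{1-alpha/2}

def power(mu):
    # 1 - Pr(-tc - mu/se <= T* <= tc - mu/se), with T* ~ T(n-1)
    shift = mu / se
    return 1 - (stats.t.cdf(tc - shift, df=n - 1)
                - stats.t.cdf(-tc - shift, df=n - 1))

for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, round(power(mu), 3))
```

At \(\mu = 0\) the function returns exactly \(\alpha\), as it must.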

The power curve at \(\mu=0\) is 0.05, since the size of the test is \(\alpha= 0.05\) in this experiment. If we increase the size of the test, then the power increases.

The probability of a Type I error (falsely rejecting the null) is the size of the test, so it increases with \(\alpha\). The lower the \(\alpha\), the higher the probability of a Type II error (not rejecting the null when the alternative is true) and the lower the power.
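To illustrate this trade-off, here is a sketch computing the exact power at a fixed alternative for several test sizes (the numeric values and the fixed-\(\mathrm{se}\) simplification \(\mathrm{se}(\hat{\beta}_0) = \sigma/\sqrt{n}\) are assumptions of the sketch):

```python
# Sketch: power of the two-sided t-test at a fixed alternative mu,
# for three test sizes alpha, showing that power grows with alpha.
import numpy as np
from scipy import stats

n, sigma, mu = 20, 2.0, 1.0
se = sigma / np.sqrt(n)
powers = {}
for alpha in (0.01, 0.05, 0.10):
    tc = stats.t.ppf(1 - alpha / 2, df=n - 1)
    powers[alpha] = 1 - (stats.t.cdf(tc - mu / se, df=n - 1)
                         - stats.t.cdf(-tc - mu / se, df=n - 1))
    print(alpha, round(powers[alpha], 3))
```

The printed powers increase with \(\alpha\): a more permissive test rejects more often under both hypotheses.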