Session 3
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Hypothesis tests for ANOVA
Model assumptions
Global null hypothesis
No difference between treatments
Tacitly assume that all observations have the same standard deviation σ.
Decomposing variability into bits
$$\underbrace{\sum_{i}\sum_{k}(y_{ik}-\hat{\mu})^2}_{\text{total sum of squares}}=\underbrace{\sum_{i}\sum_{k}(y_{ik}-\hat{\mu}_k)^2}_{\text{within sum of squares}}+\underbrace{\sum_{k} n_k(\hat{\mu}_k-\hat{\mu})^2}_{\text{between sum of squares}}.$$
The total sum of squares is the variability under the null model, the within sum of squares is the variability under the alternative model, and the between sum of squares is the added variability explained by the group means.
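As a sanity check, the decomposition can be verified by hand in R. A minimal sketch using the `arithmetic` data from the `hecedsm` package (introduced later in these slides); the variable names `mu_hat` and `mu_k` are my own:

```r
data(arithmetic, package = "hecedsm")
y <- arithmetic$score
group <- arithmetic$group
mu_hat <- mean(y)              # overall mean
mu_k <- tapply(y, group, mean) # per-group means
# Total, within and between sums of squares
ss_total <- sum((y - mu_hat)^2)
ss_within <- sum((y - mu_k[group])^2)
ss_between <- sum(table(group) * (mu_k - mu_hat)^2)
all.equal(ss_total, ss_within + ss_between) # TRUE: the decomposition holds
```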
Omnibus test
With K groups and n observations, the statistic is
$$F=\frac{\text{between-group variability}}{\text{within-group variability}}=\frac{\text{between sum of squares}/(K-1)}{\text{within sum of squares}/(n-K)}$$
If all groups have the same mean, both numerator and denominator are estimators of $\sigma^2$, thus the ratio should be close to one.
The null distribution (benchmark) is a Fisher distribution F(ν1,ν2).
The parameters ν1,ν2 are called degrees of freedom.
For the one-way ANOVA: $\nu_1 = K-1$ and $\nu_2 = n-K$.
The number of constraints comes from the fact that we go from $K$ means under the alternative to a single mean under the null.
Note: for $\nu_2$ large, the $F(\nu_1,\nu_2)$ distribution is indistinguishable from that of a $\chi^2(\nu_1)$ variable divided by $\nu_1$.
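A quick numerical check of this large-$\nu_2$ approximation (a sketch; the cutoff 3 and $\nu_2 = 10^5$ are arbitrary):

```r
q <- 3 # arbitrary cutoff
pf(q, df1 = 4, df2 = 1e5, lower.tail = FALSE) # tail of F(4, 1e5)
pchisq(4 * q, df = 4, lower.tail = FALSE)     # tail of chi-squared(4)/4 above q
# Both return essentially the same probability
```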
From Davison (2008), Example 9.2
In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.
Let $\mu_A, \ldots, \mu_E$ denote the population average (expectation) score on the test for each experimental condition.
The null hypothesis is $H_0: \mu_A = \mu_B = \cdots = \mu_E$ against the alternative $H_a$ that at least one of the population averages differs.
```r
# Fit one-way analysis of variance
test <- aov(data = arithmetic, formula = score ~ group)
anova(test) # print ANOVA table
```
| term | df | sum of squares | mean square | statistic | p-value |
|-----------|----|----------------|-------------|-----------|---------|
| group | 4 | 722.67 | 180.67 | 15.27 | < 1e-04 |
| Residuals | 40 | 473.33 | 11.83 | | |
The p-value gives the probability of observing an outcome as extreme as the one in the sample if the null hypothesis were true.
```r
# Compute p-value
pf(15.27, df1 = 4, df2 = 40, lower.tail = FALSE)
```
Probability that an $F(4,40)$ random variable exceeds 15.27.
All statements about p-values are approximate.
- Additivity and linearity
- Independence
- Equal variance
- Large sample size
Write the $i$th observation of the $k$th experimental group as
$$\underbrace{Y_{ik}}_{\text{observation}} = \underbrace{\mu_k}_{\text{mean of group}} + \underbrace{\varepsilon_{ik}}_{\text{error term}},$$
where $i=1,\ldots,n_k$ indexes observations and $k=1,\ldots,K$ indexes groups.
The additive decomposition reads:
(quantity depending on the treatment used) + (quantity depending only on the particular unit)
Plot group averages $\{\hat{\mu}_k\}$ against residuals $\{e_{ik}\}$, where $e_{ik}=y_{ik}-\hat{\mu}_k$.
By construction, the sample mean of the $e_{ik}$ is always zero.
Less improvement for scores of stronger students.
The plot and context suggest a multiplicative structure; it is tempting to diagnose unequal variance.
Reading diagnostic plots requires practice (and is analogous to reading tea leaves: it leaves a lot to interpretation).
Look for potential patterns on the y-axis only!
Multiplicative data of the form (quantity depending on the treatment used) × (quantity depending only on the particular unit) tend to have higher variability when the response is larger.
A log-transformation of the response makes the model additive.
For responses bounded between $a$ and $b$, reduce warping effects via $\ln\left\{\dfrac{x-a+\delta}{b+\delta-x}\right\}$
Careful with transformations:
If we consider a response on the log scale, the test is for equality of the geometric means!
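A minimal helper implementing this transformation; the function name, the default offset `delta = 0.5`, and the example bounds are illustrative, not part of the course material:

```r
# Map a response bounded in (a, b) to the whole real line
logit_bounded <- function(x, a, b, delta = 0.5) {
  # delta > 0 avoids taking the log of zero at the boundaries
  log((x - a + delta) / (b + delta - x))
}
# e.g., for scores known to lie between 0 and 100 (illustrative bounds):
# transformed <- logit_bounded(arithmetic$score, a = 0, b = 100)
```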
Plot residuals against other explanatory variables.
Difference in average response: while the treatment seems to lead to a decrease in the response variable, a stratification by age group reveals that this occurs only in the less-than-25 age group, with a seemingly reversed effect for the adults. Thus, the marginal model implied by the one-way analysis of variance is misleading.
An interaction occurs when the effect of experimental group depends on another variable.
In principle, randomization ensures we capture the average marginal effect (even if it is misleading or useless).
We could incorporate the interacting variable in the model to capture its effect (at the cost of a more complex model), e.g., using a two-way ANOVA.
Each observation has the same standard deviation σ.
ANOVA is quite sensitive to this assumption!
Plot standardized (`rstandard`) or studentized (`rstudent`) residuals against fitted values.
```r
data(arithmetic, package = "hecedsm")
model <- lm(score ~ group, data = arithmetic)
data <- data.frame(
  fitted = fitted(model),     # fitted values = group averages
  residuals = rstudent(model) # externally studentized residuals
)
library(ggplot2)
ggplot(data = data, mapping = aes(x = fitted, y = residuals)) +
  geom_point()
```
Can use a statistical test of $H_0: \sigma_1 = \cdots = \sigma_K$.
Different tests may yield different conclusions
Bartlett's test is uniformly most powerful for normal data.
Levene's and Brown-Forsythe tests are the most commonly used in practice (in my experience).
```r
data(arithmetic, package = "hecedsm")
model <- aov(score ~ group, data = arithmetic)
car::leveneTest(model) # Brown-Forsythe by default
```
```
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  4  1.2072 0.3228
##       40
```
Fail to reject the null: no evidence of unequal variance
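To illustrate that different tests may disagree, we can compare with Bartlett's test (a sketch; `bartlett.test` is in base R's stats package):

```r
# Bartlett's test of equal variances on the same data
bartlett.test(score ~ group, data = arithmetic)
```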
To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!
Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40(3–4), 318–335.
If the variances are unequal, use Welch's statistic instead (`oneway.test` in R). Otherwise, we reject the null hypothesis more often even if there is no difference in means!
Histogram of the null distribution of p-values obtained through simulation using the classical analysis of variance F-test (left) and Welch's unequal variance alternative (right), based on 10 000 simulations. Each simulated sample consists of 50 observations from a standard normal distribution and 10 observations from a centered normal distribution with variance 9. The uniform distribution would have 5% in each of the 20 bins used for the display.
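A sketch of the simulation behind this figure; the seed and the two-group labels are my own choices, while the sample sizes and variances follow the caption:

```r
set.seed(80667) # arbitrary seed for reproducibility
pvals <- replicate(10000, {
  # 50 standard normals plus 10 centered normals with variance 9 (sd = 3)
  y <- c(rnorm(50), rnorm(10, mean = 0, sd = 3))
  g <- factor(rep(c("A", "B"), times = c(50, 10)))
  c(classical = anova(lm(y ~ g))[["Pr(>F)"]][1], # classical F-test
    welch = oneway.test(y ~ g)$p.value)          # Welch's alternative
})
rowMeans(pvals < 0.05) # empirical size at the 5% level
```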
As a Quebecer, I am not allowed to talk about this topic.
No visual diagnostic or test available.
Rather, infer from context.
Knowing the value of one observation tells us nothing about the value taken by the others.
Observations close by tend to be more alike (correlated).
Where does the F-distribution come from?
Normality of group average
This holds (in great generality) because of the central limit theorem.
In large samples, the sample mean is approximately normally distributed.
Top row shows the data-generating mechanism and a sample; bottom row shows the distribution of the sample mean of n = 30 and n = 50 observations.
Rule of thumb: 20 or 30 per group
Gather a sufficient number of observations.
The closer data are to being normal, the better the large-sample distribution approximation is.
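A quick illustration of the central limit theorem in action (a sketch; the exponential distribution and the sample size are illustrative):

```r
set.seed(123) # arbitrary seed
# Sampling distribution of the mean of n = 30 skewed observations
means <- replicate(10000, mean(rexp(n = 30, rate = 1)))
hist(means, breaks = 30, main = "Sample means, n = 30")
# Approximately bell-shaped despite the skewed parent distribution
```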
Can check normality via a quantile-quantile plot with standardized residuals $r_i$:
In R, use the functions `qqnorm` or `car::qqPlot` to produce the plots.
The ordered residuals should align on a straight line.
Normal quantile-quantile plot (left) and Tukey's mean-difference QQ-plot (right).
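A minimal sketch producing a normal quantile-quantile plot for the arithmetic example (the `id = FALSE` argument, which suppresses point labelling, is optional):

```r
model <- lm(score ~ group, data = arithmetic)
r <- rstandard(model)          # standardized residuals
qqnorm(r); qqline(r)           # base R quantile-quantile plot
car::qqPlot(model, id = FALSE) # version with pointwise confidence bands
```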