class: center middle main-title section-title-1

# One-way ANOVA

.class-info[

**Session 3**

.light[MATH 80667A: Experimental Design and Statistical Methods
<br>
HEC Montréal
]

]

---

name: outline
class: title title-inv-1

# Outline

--

.box-2.medium.sp-after-half[Hypothesis tests for ANOVA]

--

.box-5.medium.sp-after-half[Model assumptions]

---

layout: true
class: title title-2

---

# F-test for one-way ANOVA

.box-inv-2.medium[Global null hypothesis]

No difference between treatments:

- `\(\mathscr{H}_0\)` (null): all of the `\(K\)` treatment groups have the same average `\(\mu\)`
- `\(\mathscr{H}_a\)` (alternative): at least two treatments have different averages

We tacitly assume that all observations have the same standard deviation `\(\sigma\)`.

???

- The null hypothesis can be viewed as a special case within a bigger class of possibilities.
- It always corresponds to some restriction of the alternative class.

---

# Building a statistic

.small[
- `\(y_{ik}\)` is observation `\(i\)` of group `\(k\)`
- `\(\widehat{\mu}_1, \ldots, \widehat{\mu}_K\)` are the sample averages of groups `\(1, \ldots, K\)`
- `\(\widehat{\mu}\)` is the overall sample mean
]

.box-2[Decomposing variability into bits]

`\begin{align*}
\underset{\text{total sum of squares}}{\sum_{i}\sum_{k} (y_{ik} - \widehat{\mu})^2} &= \underset{\text{within sum of squares}}{\sum_i \sum_k (y_{ik} - \widehat{\mu}_k)^2} + \underset{\text{between sum of squares}}{\sum_k n_k (\widehat{\mu}_k - \widehat{\mu})^2}.
\end{align*}`

.pull-left-3[
.box-inv-2[null model]
]
.pull-middle-3[
.box-inv-2[alternative model]
]
.pull-right-3[
.box-inv-2[added variability]
]

---

# _F_-test statistic

.box-2.medium[Omnibus test]

With `\(K\)` groups and `\(n\)` observations, the statistic is

`\begin{align*}
F &= \frac{\text{between-group variability}}{\text{within-group variability}} \\&= \frac{\text{between sum of squares}/(K-1)}{\text{within sum of squares}/(n-K)}
\end{align*}`

---

# Ratio of variances

<div class="figure" style="text-align: center">
<img src="03-slides_files/figure-html/squareddistanova-1.png" alt="Data with equal mean (left) and different mean per group (right)." width="80%" />
<p class="caption">Data with equal mean (left) and different mean per group (right).</p>
</div>

---

# What happens under the null regime?

If all groups have the same mean, both the numerator and the denominator are estimators of `\(\sigma^2\)`, so

- the `\(F\)` ratio should be close to 1 on average when there are no mean differences,
- but the numerator is more variable because it is based on only `\(K\)` group averages,
- so the benchmark is skewed to the right.

---

# Null distribution and degrees of freedom

The null distribution (benchmark) is a Fisher distribution `\(\mathsf{F}(\nu_1, \nu_2)\)`.

The parameters `\(\nu_1, \nu_2\)` are called **degrees of freedom**.

For the one-way ANOVA:

- `\(\nu_1 = K-1\)` is the number of constraints imposed by the null hypothesis (number of groups minus one)
- `\(\nu_2 = n-K\)` is the number of observations minus the number of mean parameters estimated under the alternative

???

The number of constraints comes from the fact that we go from `\(K\)` means under the alternative to a single common mean under the null.

---

# Fisher distribution

<img src="03-slides_files/figure-html/unnamed-chunk-1-1.png" width="70%" style="display: block; margin: auto;" />

.small[Note: the `\(\mathsf{F}(\nu_1, \nu_2)\)` distribution is indistinguishable from `\(\chi^2(\nu_1)/\nu_1\)` for `\(\nu_2\)` large.
]
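---

# The _F_ benchmark in **R**

The null benchmark can be explored directly with **R**'s distribution functions. A minimal sketch (the degrees of freedom below are illustrative, chosen to match the `\(K=5\)`, `\(n=45\)` example on the coming slides):

```r
nu1 <- 4; nu2 <- 40                    # K - 1 and n - K for K = 5 groups, n = 45
qf(0.95, df1 = nu1, df2 = nu2)         # 5% critical value of F(4, 40)
pf(3, df1 = nu1, df2 = nu2, lower.tail = FALSE)  # tail probability beyond 3
# for large nu2, F(nu1, nu2) is close to a chi-square(nu1) divided by nu1
qf(0.95, df1 = nu1, df2 = 1e5)
qchisq(0.95, df = nu1) / nu1
```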
---

# Impact of encouragement on teaching

From Davison (2008), Example 9.2

> In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.

---

# Formulating a hypothesis

Let `\(\mu_A, \ldots, \mu_E\)` denote the population average (expectation) of the test score for each experimental condition.

The null hypothesis is
`$$\mathscr{H}_0: \mu_A = \mu_B = \cdots = \mu_E$$`
against the alternative `\(\mathscr{H}_a\)` that at least one of the population averages is different.

---

# _F_ statistic

```r
# Fit the one-way analysis of variance
test <- aov(data = arithmetic, formula = score ~ group)
anova(test) # print the ANOVA table
```

|term      | df| sum of squares| mean square| statistic|p-value |
|:---------|--:|--------------:|-----------:|---------:|:-------|
|group     |  4|         722.67|      180.67|     15.27|< 1e-04 |
|Residuals | 40|         473.33|       11.83|          |        |

---

# P-value

.pull-left[

The _p_-value gives the probability of observing an outcome as extreme or more extreme **if the null hypothesis were true**.

```r
# Compute the p-value
pf(15.27, df1 = 4, df2 = 40, lower.tail = FALSE)
```

Probability that an `\(\mathsf{F}(4,40)\)` random variable exceeds 15.27.

]
.pull-right[

<img src="03-slides_files/figure-html/unnamed-chunk-3-1.png" width="90%" style="display: block; margin: auto;" />

]

---

layout: false
name: multiple-testing
class: center middle section-title section-title-5 animated fadeIn

# Model assumptions

---

layout: true
class: title title-5

---

# Quality of approximations

- The null and alternative hypotheses of the analysis of variance only specify the mean of each group.
- We need to assume more .footnote[Read the fine print conditions!] to derive the behaviour of the statistic.

.box-inv-5.medium[All statements about _p_-values<br> are **approximate**.]

---

# Model assumptions

.pull-left[
.box-inv-5.medium.sp-after[**Additivity and linearity**]
.box-inv-5.medium.sp-after[**Independence**]
]
.pull-right[
.box-inv-5.medium.sp-after[**Equal variance**]
.box-inv-5.medium.sp-after[**Large sample size**]
]

---

# Alternative representation

Write the `\(i\)`th observation of the `\(k\)`th experimental group as

`\begin{align*}
\underset{\text{observation}\vphantom{gp}}{Y_{ik}\vphantom{\mu_k}} = \underset{\text{mean of group}}{\mu_k} + \underset{\text{error term}\vphantom{gp}}{\varepsilon_{ik}\vphantom{\mu_k}},
\end{align*}`

where, for `\(i=1, \ldots, n_k\)` and `\(k=1, \ldots, K\)`,

- `\(\mathsf{E}(\varepsilon_{ik})=0\)` (mean zero),
- `\(\mathsf{Va}(\varepsilon_{ik})= \sigma^2\)` (equal variance),
- the errors are independent of one another.

---

# \# 1: Additivity

The additive decomposition reads:

`\begin{align*}
\left(\begin{matrix} \text{quantity depending}\\ \text{on the treatment used}\end{matrix}\right) + \left(\begin{matrix} \text{quantity depending only }\\ \text{on the particular unit} \end{matrix}\right)
\end{align*}`

- each unit is unaffected by the treatment of the other units
- the average effect of the treatment is constant

---

# Diagnostic plots for additivity

Plot the group averages `\(\{\widehat{\mu}_k\}\)` against the residuals `\(\{e_{ik}\}\)`, where `\(e_{ik} = y_{ik}-\widehat{\mu}_k\)`.

<img src="03-slides_files/figure-html/assumptions-1.png" width="80%" style="display: block; margin: auto;" />

.small[
By construction, the sample mean of the `\(e_{ik}\)` is **always** zero (within each group and overall).
]
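---

# Residuals by hand

A minimal sketch (my own illustration, reusing the `arithmetic` data from the `hecedsm` package that appears later in the deck) showing that the group means of the residuals vanish by construction:

```r
# residuals e_ik = y_ik - mu_hat_k from the one-way ANOVA fit
data(arithmetic, package = "hecedsm")
fit <- lm(score ~ group, data = arithmetic)
res <- data.frame(group = arithmetic$group,
                  e = residuals(fit))
# group means of the residuals are zero (up to rounding error)
aggregate(e ~ group, data = res, FUN = mean)
```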
---

# Lack of additivity

Less improvement in scores for stronger students.

<img src="03-slides_files/figure-html/multiplicative-1.png" width="80%" style="display: block; margin: auto;" />

.small[The plot and the context suggest a multiplicative structure. It is tempting to (wrongly) diagnose unequal variance.]

???

Reading diagnostic plots requires practice (and is analogous to reading tea leaves: it leaves a lot to interpretation).

---

# Interpretation of residual plots

.box-inv-5.large[
Look for potential patterns<br>on the `\(y\)`-axis **only**!
]

---

# Multiplicative structure

Multiplicative data of the form

`\begin{align*}
\left(\begin{matrix} \text{quantity depending}\\ \text{on the treatment used}\end{matrix}\right) \times \left(\begin{matrix} \text{quantity depending only }\\ \text{on the particular unit} \end{matrix}\right)
\end{align*}`

tend to have higher variability when the response is larger.

---

# Fixes for multiplicative data

A log-transformation of the response makes the model **additive**.

For responses bounded between `\(a\)` and `\(b\)`, reduce warping effects via

`$$\ln\left\{\frac{x-a+\delta}{b+\delta-x}\right\}$$`

Be careful with transformations:

- we lose interpretability,
- the meaning changes (different scale/units).

???

If we consider a response on the log scale, the test is for equality of the geometric means!

---

# Interactions

Plot the residuals against other explanatory variables.

<img src="03-slides_files/figure-html/omittedlinearity-1.png" width="80%" style="display: block; margin: auto;" />

???

Difference in average response: while the treatment seems to lead to a decrease in the response variable, a stratification by age group reveals that this only occurs in the under-25 group, with a seemingly reversed effect for the adults. Thus, the marginal model implied by the one-way analysis of variance is misleading.

---

# A note about interactions

An **interaction** occurs when the effect of the experimental group depends on another variable.

In principle, randomization ensures we capture the average marginal effect (even if it is misleading/useless).

We could incorporate the interacting variable in the model to capture its effect (this makes the model more complex), e.g., using a two-way ANOVA.

---

# \# 2: Equal variance

.box-inv-5.large.sp-after[
Each observation <br>has the *same*<br> standard deviation `\(\sigma\)`.]

.box-inv-5.medium.sp-after-half[
ANOVA is quite sensitive to this assumption!
]

---

# Graphical diagnostics

Plot *standardized* (`rstandard`) or *studentized* (`rstudent`) residuals against the fitted values.

```r
library(ggplot2) # for the plot
data(arithmetic, package = "hecedsm")
model <- lm(score ~ group, data = arithmetic)
data <- data.frame(
  fitted = fitted(model),      # fitted values = group averages
  residuals = rstudent(model)) # externally studentized residuals
ggplot(data = data,
       mapping = aes(x = fitted, y = residuals)) +
  geom_point()
```

---

# Test diagnostics

We can use a statistical test of `\(\mathscr{H}_0: \sigma_1 = \cdots = \sigma_K\)`.

- Bartlett's test (assumes normal data)
- Levene's test: a one-way ANOVA for `\(|y_{ik} - \widehat{\mu}_k|\)`
- .col-5[Brown–Forsythe test]: a one-way ANOVA for `\(|y_{ik} - \text{median}_k|\)` (**more robust**), as sketched on the next slide
- Fligner–Killeen test: based on ranks

.box-inv-5.medium.sp-after-half[Different tests may yield different conclusions]

???

Bartlett's test is uniformly most powerful for normal data.
Levene and Brown–Forsythe are the most commonly used in practice (in my experience).
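---

# Brown–Forsythe by hand

A minimal sketch (my own illustration, not part of the original example code) of the Brown–Forsythe idea, using the `arithmetic` data: run a one-way ANOVA on the absolute deviations from the group medians.

```r
data(arithmetic, package = "hecedsm")
bf <- data.frame(
  group = arithmetic$group,
  adev  = abs(arithmetic$score -
              ave(arithmetic$score, arithmetic$group, FUN = median)))
# the F statistic should match car::leveneTest with center = median (next slide)
anova(lm(adev ~ group, data = bf))
```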
---

# Example in **R**

```r
data(arithmetic, package = "hecedsm")
model <- aov(score ~ group, data = arithmetic)
car::leveneTest(model) # Brown-Forsythe by default
```

```
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  4  1.2072 0.3228
##       40
```

.small[We fail to reject the null hypothesis: there is no evidence of unequal variance.]

---

# Box's take

> To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!

.small[Box, G.E.P. (1953). *Non-Normality and Tests on Variances.* *Biometrika*, 40 (3–4): 318–335.]

---

# Solutions

- In large samples, the power is high, so we would probably always reject `\(\mathscr{H}_0: \sigma_1 = \cdots = \sigma_K\)`.
- If the only heterogeneity is between experimental conditions, use **Welch's ANOVA** (`oneway.test` in **R**).
- The Welch statistic estimates the standard deviation of each group *separately*.
- It could (should?) be the default when you have a large number of observations, or enough to reliably estimate the mean and standard deviation of each group.

---

# What can go wrong? Spurious findings!

<img src="img/03/simuWelchnull.png" width="70%" style="display: block; margin: auto;" />

.small[The classical _F_-test rejects the null hypothesis more often than it should, even though there is no difference in means!]

???

Histogram of the null distribution of p-values obtained through simulation using the classical analysis of variance F-test (left) and Welch's unequal variance alternative (right), based on 10 000 simulations. Each simulated sample consists of 50 observations from a standard normal distribution and 10 observations from a centered normal distribution with a variance of 9. The uniform distribution would have 5% in each of the 20 bins used for the display.

---

# More complex heterogeneity patterns

- Variance-stabilizing transformations (e.g., log for counts).
- An explicit model for the trend over time, etc., may be necessary in more complex designs (repeated measures) where there is a learning effect.

---

# \# 3: Independence

.footnote.small[As a Quebecer, I am not allowed to talk about this topic.]

.box-inv-5.medium.sp-after-half[No visual diagnostic or test is available.]

.box-inv-5.large.sp-after-half[Rather, infer from **context**.]

???

Knowing the value of one observation tells us nothing about the value taken by the others.

---

# Checking independence

- Repeated measures are **not independent**.
- Group structure (e.g., people performing the experiment together and exchanging feedback).
- Time dependence (time series, longitudinal data).
- Dependence on instrumentation, experimenter, time of the day, etc.

Observations that are close together tend to be more alike (correlated).

---

# \# 4: Sample size (normality?)

.box-inv-5.medium.sp-after-half[
Where does the `\(\mathsf{F}\)`-distribution come from?
]

.box-inv-5.medium.sp-after-half[Normality of the **group averages**]

.box-inv-5.medium.sp-after-half[
This holds (in great generality)<br>because of the<br> **central limit theorem**]

---

# Central limit theorem

.small[In large samples, the sample mean is approximately normally distributed.]

<img src="03-slides_files/figure-html/clt-1.png" width="70%" style="display: block; margin: auto;" />

.small[
The top row shows the data-generating mechanism and a sample; the bottom row shows the distribution of the sample mean of `\(n=30\)` and `\(n=50\)` observations.
]
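---

# Seeing the CLT by simulation

A minimal simulation sketch (my own illustration, not the code behind the preceding figure): even for skewed data, the distribution of the sample mean looks normal for moderate `\(n\)`.

```r
# sample means of skewed (exponential) data, n = 30 per sample
set.seed(80667)
means <- replicate(10000, mean(rexp(n = 30, rate = 1)))
hist(means, breaks = 40, main = "Sample means (n = 30)", xlab = "sample mean")
qqnorm(means); qqline(means) # points should align on a straight line
```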
---

# How large should my sample be?

.box-inv-5.large.sp-after-half[Rule of thumb: 20 or 30 observations per group]

.box-inv-5.large.sp-after-half[Gather a sufficient number of observations.]

---

# Assessing approximate normality

The closer the data are to being normal, the better the large-sample approximation.

We can check normality via a quantile-quantile plot of the standardized residuals `\(r_i\)`:

- on the `\(x\)`-axis, the theoretical quantiles `\(F^{-1}\{\mathrm{rank}(r_i)/(n+1)\}\)` of the residuals, where `\(F^{-1}\)` is the normal quantile function;
- on the `\(y\)`-axis, the empirical quantiles `\(r_i\)`.

.footnote.small[
In **R**, use the functions `qqnorm` or `car::qqPlot` to produce the plots (a short code sketch appears on the last slide).
]

---

# More about quantile-quantile plots

The ordered residuals should align on a straight line.

<img src="03-slides_files/figure-html/diagrammeqq2-1.png" width="70%" style="display: block; margin: auto;" />

.small[Normal quantile-quantile plot (left) and Tukey's mean-difference QQ-plot (right).]

---

layout: true
class: title title-1

---

# Recap 1

.medium[
- One-way analysis of variance compares the **averages** of experimental groups.
- Null hypothesis: all groups have the same average.
- It is easier to detect that the null hypothesis is false with
    - large differences between the group averages,
    - small variability,
    - large samples.
]

---

class: title title-1

# Recap 2

.medium[
- The model assumes independent observations, an additive structure, and equal variability in each group.
- All statements are approximate: if the model assumptions are invalid, this can lead to spurious results or lower power.
]
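---

# Putting it together in **R**

For reference, a minimal end-to-end sketch gathering the functions shown throughout the deck (my own summary, using the `arithmetic` data):

```r
data(arithmetic, package = "hecedsm")
model <- aov(score ~ group, data = arithmetic)
anova(model)                                   # global F-test
car::leveneTest(model)                         # Brown-Forsythe check of equal variances
oneway.test(score ~ group, data = arithmetic)  # Welch's ANOVA (unequal variances)
car::qqPlot(rstudent(model))                   # quantile-quantile plot of residuals
```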