One way ANOVA

Session 3

MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal

Hypothesis tests for ANOVA


Model assumptions

F-test for one way ANOVA

Global null hypothesis

No difference between treatments

  • H0 (null): all of the K treatment groups have the same average μ
  • Ha (alternative): at least two treatments have different averages

Tacitly assume that all observations have the same standard deviation σ.

  • The null hypothesis can be viewed as a special case from a bigger class of possibilities
  • it always corresponds to some restrictions from the alternative class

Building a statistic

  • $y_{ik}$ is observation $i$ of group $k$
  • $\widehat{\mu}_1, \ldots, \widehat{\mu}_K$ are the sample averages of groups $1, \ldots, K$
  • $\widehat{\mu}$ is the overall sample mean

Decomposing variability into bits

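$$
\underbrace{\sum_{i}\sum_{k} (y_{ik} - \widehat{\mu})^2}_{\text{total sum of squares}} = \underbrace{\sum_{i}\sum_{k} (y_{ik} - \widehat{\mu}_k)^2}_{\text{within sum of squares}} + \underbrace{\sum_{k} n_k (\widehat{\mu}_k - \widehat{\mu})^2}_{\text{between sum of squares}}.
$$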

null model

alternative model

added variability
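
The decomposition can be checked numerically. The sketch below uses simulated data (three groups with made-up means, sizes and standard deviation, not the course data) and only base R; the object names are illustrative.

```r
# Simulated data: K = 3 groups of unequal sizes (all values are made up)
set.seed(80667)
group <- factor(rep(c("A", "B", "C"), times = c(8, 10, 12)))
y <- rnorm(length(group), mean = c(10, 12, 11)[as.integer(group)], sd = 2)

mu_hat <- mean(y)                   # overall sample mean
mu_hat_k <- tapply(y, group, mean)  # sample average of each group
n_k <- as.vector(table(group))      # number of observations per group

ss_total <- sum((y - mu_hat)^2)
ss_within <- sum((y - mu_hat_k[as.integer(group)])^2)
ss_between <- sum(n_k * (mu_hat_k - mu_hat)^2)

# Both sides of the identity agree up to rounding error
all.equal(ss_total, ss_within + ss_between)
```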

F-test statistic

Omnibus test

With K groups and n observations, the statistic is

$$
F = \frac{\text{between-group variability}}{\text{within-group variability}} = \frac{\text{between sum of squares}/(K-1)}{\text{within sum of squares}/(n-K)}
$$
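
As a minimal sketch, the statistic can be computed by hand and compared with the value reported by `anova`. The data are simulated (four balanced groups with a common mean, all values being arbitrary choices).

```r
set.seed(80667)
K <- 4
n_per_group <- 10
group <- factor(rep(paste0("g", 1:K), each = n_per_group))
n <- length(group)
y <- rnorm(n, mean = 5, sd = 1.5)  # same mean in every group

mu_hat_k <- tapply(y, group, mean)
ss_between <- sum(n_per_group * (mu_hat_k - mean(y))^2)
ss_within <- sum((y - mu_hat_k[as.integer(group)])^2)

# Ratio of the two mean squares
F_by_hand <- (ss_between / (K - 1)) / (ss_within / (n - K))

# Same statistic from the ANOVA table of the fitted linear model
F_anova <- anova(lm(y ~ group))[1, "F value"]
all.equal(F_by_hand, F_anova)
```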

Ratio of variance

Data with equal mean (left) and different mean per group (right).


What happens under the null regime?

If all groups have the same mean, both numerator and denominator are estimators of $\sigma^2$, thus

  • the F ratio should be close to 1 on average if there are no mean differences,
  • but the numerator is more variable because it is based on only K group averages,
    • so the benchmark is skewed to the right.
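
A small simulation illustrates the null regime. This is a sketch under assumed settings (five groups of nine standard normal observations, 2000 replications); the helper `sim_F` is hypothetical. The mean of an F distribution with $\nu_2 > 2$ denominator degrees of freedom is $\nu_2/(\nu_2-2)$, here 40/38.

```r
set.seed(80667)
K <- 5
n_per_group <- 9
group <- factor(rep(seq_len(K), each = n_per_group))

# Hypothetical helper: simulate one data set under the null, return the F statistic
sim_F <- function() {
  y <- rnorm(K * n_per_group)  # every group has mean 0 and standard deviation 1
  anova(lm(y ~ group))[1, "F value"]
}

F_null <- replicate(2000, sim_F())
mean(F_null)    # compare with 40 / 38
median(F_null)  # compare with the mean to gauge the right skew
```
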
Null distribution and degrees of freedom

The null distribution (benchmark) is a Fisher distribution $F(\nu_1, \nu_2)$.

The parameters $\nu_1, \nu_2$ are called degrees of freedom.

For the one-way ANOVA:

  • $\nu_1 = K-1$ is the number of constraints imposed by the null (number of groups minus one)
  • $\nu_2 = n-K$ is the number of observations minus the number of mean parameters estimated under the alternative
The number of constraints comes from the fact that we go from K means under the alternative to 1 mean under the null.

Fisher distribution

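Note: for $\nu_2$ large, $\nu_1 F$ with $F \sim F(\nu_1, \nu_2)$ is indistinguishable from $\chi^2(\nu_1)$; that is, the $F(\nu_1, \nu_2)$ distribution is close to $\chi^2(\nu_1)/\nu_1$.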
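
The large-$\nu_2$ behaviour can be checked by comparing quantiles. In this sketch the probability levels and degrees of freedom are arbitrary; the chi-square quantiles are divided by $\nu_1$ because it is $\nu_1 F$ that is approximately $\chi^2(\nu_1)$.

```r
nu1 <- 4
p <- c(0.5, 0.9, 0.95, 0.99)

# Quantiles of F(nu1, nu2) for moderate and very large nu2
q_f_40 <- qf(p, df1 = nu1, df2 = 40)
q_f_large <- qf(p, df1 = nu1, df2 = 1e6)
# Limiting quantiles: chi-square(nu1) rescaled by nu1
q_limit <- qchisq(p, df = nu1) / nu1

round(rbind(q_f_40, q_f_large, q_limit), 3)
```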

Impact of encouragement on teaching

From Davison (2008), Example 9.2

In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.

Formulating a hypothesis

Let $\mu_A, \ldots, \mu_E$ denote the population average (expectation) score for the test for each experimental condition.

The null hypothesis is $H_0: \mu_A = \mu_B = \cdots = \mu_E$ against the alternative $H_a$ that at least one of the population averages is different.

F statistic

# Load the data and fit a one-way analysis of variance
data(arithmetic, package = "hecedsm")
test <- aov(data = arithmetic,
            formula = score ~ group)
anova(test) # print ANOVA table

term        df   sum of squares   mean square   statistic   p-value
group        4           722.67        180.67       15.27   < 1e-04
Residuals   40           473.33         11.83

The p-value gives the probability of observing an outcome at least as extreme as the one observed if the null hypothesis were true.

# Compute p-value
pf(15.27,
   df1 = 4,
   df2 = 40,
   lower.tail = FALSE)

Probability that an F(4, 40) variable exceeds 15.27.
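
The entries of the ANOVA table above are linked by simple arithmetic, which the following sketch reproduces from the reported sums of squares and degrees of freedom. Because the inputs are rounded to two decimals, the results agree with the table only up to rounding.

```r
# Entries copied from the ANOVA table
ss_group <- 722.67
df_group <- 4
ss_resid <- 473.33
df_resid <- 40

ms_group <- ss_group / df_group  # mean square for group
ms_resid <- ss_resid / df_resid  # residual mean square
F_stat <- ms_group / ms_resid    # F statistic

# Upper tail probability under the F(4, 40) benchmark
p_value <- pf(F_stat, df1 = df_group, df2 = df_resid, lower.tail = FALSE)
round(c(ms_group = ms_group, ms_resid = ms_resid, F_stat = F_stat), 2)
```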

Model assumptions

Quality of approximations

  • The null and alternative hypotheses of the analysis of variance only specify the mean of each group
  • We need to assume more to derive the behaviour of the statistic

Read the fine print conditions!

All statements about p-values are approximate.

Model assumptions

  • Additivity and linearity
  • Independence
  • Equal variance
  • Large sample size

Alternative representation

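Write the ith observation of the kth experimental group as

$$
\underbrace{Y_{ik}}_{\text{observation}} = \underbrace{\mu_k}_{\text{mean of group}} + \underbrace{\varepsilon_{ik}}_{\text{error term}},
$$

where, for $i = 1, \ldots, n_k$ and $k = 1, \ldots, K$,

  • $\mathsf{E}(\varepsilon_{ik}) = 0$ (mean zero) and
  • $\mathsf{Va}(\varepsilon_{ik}) = \sigma^2$ (equal variance)
  • errors are independent from one another.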
# 1: Additivity

Additive decomposition reads:

(quantity depending on the treatment used) + (quantity depending only on the particular unit)

  • each unit is unaffected by the treatment of the other units
  • average effect of the treatment is constant
Diagnostic plots for additivity

Plot group averages $\{\widehat{\mu}_k\}$ against residuals $\{e_{ik}\}$, where $e_{ik} = y_{ik} - \widehat{\mu}_k$.

By construction, the sample mean of the $e_{ik}$ is always zero.
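
A minimal sketch of this diagnostic on simulated data (three groups with made-up means; the object names are illustrative). The fitted values of a one-way model are the group averages, so the residuals average to zero within each group.

```r
set.seed(80667)
group <- factor(rep(c("A", "B", "C"), each = 9))
y <- rnorm(27, mean = c(20, 25, 30)[as.integer(group)], sd = 3)

model <- lm(y ~ group)
mu_hat_k <- fitted(model)  # group average, repeated for each observation
e <- resid(model)          # e_ik = y_ik - mu_hat_k

# Residual averages per group: zero up to rounding error
tapply(e, group, mean)

# Diagnostic plot: group averages against residuals
if (interactive()) {
  plot(mu_hat_k, e, xlab = "group average", ylab = "residual")
}
```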

Lack of additivity

Less improvement for scores of stronger students.

Plot and context suggest a multiplicative structure. Tempting to diagnose unequal variance.


Reading diagnostic plots requires practice (and is analogous to reading tea leaves: leaves a lot to interpretation).

Interpretation of residual plots

Look for potential patterns on the y-axis only!

Multiplicative structure

Multiplicative data of the form (quantity depending on the treatment used) × (quantity depending only on the particular unit) tend to have higher variability when the response is larger.

Fixes for multiplicative data

A log-transformation of response makes the model additive.

For responses bounded between $a$ and $b$, reduce warping effects via $\ln\left\{\frac{x - a + \delta}{b + \delta - x}\right\}$.

Careful with transformations:

  • lose interpretability
  • change of meaning (different scale/units).
If we consider a response on the log-scale, the test is for equality of the geometric mean!
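
The following sketch illustrates both transformations on simulated multiplicative data. The treatment levels (10 and 20), the unit-level noise, the helper name `bounded_logit` and the offset `delta = 0.5` are all arbitrary choices made for illustration.

```r
set.seed(80667)
group <- factor(rep(c("control", "treatment"), each = 50))
unit_effect <- exp(rnorm(100, sd = 0.5))         # unit-specific factor
y <- c(10, 20)[as.integer(group)] * unit_effect  # multiplicative structure

tapply(y, group, sd)       # spread on the original scale
tapply(log(y), group, sd)  # spread on the log scale

# Back-transforming the average of log(y) gives the geometric mean
geo_mean <- exp(tapply(log(y), group, mean))
geo_mean

# Hypothetical helper for a response x bounded between a and b;
# delta > 0 is a small offset chosen by the analyst
bounded_logit <- function(x, a, b, delta = 0.5) {
  log((x - a + delta) / (b + delta - x))
}
bounded_logit(c(0, 50, 100), a = 0, b = 100)
```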


Plot residuals against other explanatories.

Difference in average response: while the treatment seems to lead to a decrease in the response variable, a stratification by age group reveals that this only occurs in the under-25 group, with a seemingly reversed effect for the adults. Thus, the marginal model implied by the one-way analysis of variance is misleading.

A note about interactions

An interaction occurs when the effect of experimental group depends on another variable.

In principle, randomization ensures we capture the average marginal effect (even if misleading/useless).

We could incorporate the interacting variable in the model to capture its effect (this makes the model more complex), e.g., using a two-way ANOVA.
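
As a sketch of what this could look like, the simulated example below has a treatment effect that changes sign with age group. The effect sizes, group labels and sample sizes are invented for illustration and do not come from any data set discussed here.

```r
set.seed(80667)
d <- expand.grid(rep = 1:20,
                 treatment = c("control", "treated"),
                 age = c("under 25", "25 and over"))

# Hypothetical effects: the treatment lowers the mean response by 3
# for the under-25 group and raises it by 2 for the older group
effect <- with(d, ifelse(treatment == "treated",
                         ifelse(age == "under 25", -3, 2), 0))
d$y <- 10 + effect + rnorm(nrow(d), sd = 2)

# Marginal (one-way) model: averages the effect over age groups
anova(lm(y ~ treatment, data = d))

# Two-way model with interaction between treatment and age
two_way <- anova(lm(y ~ treatment * age, data = d))
two_way
```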

# 2: Equal variance

Each observation has the same standard deviation σ.

ANOVA is quite sensitive to this assumption!


Graphical diagnostics

Plot standardized (rstandard) or studentized residuals (rstudent) against fitted values.

library(ggplot2)
data(arithmetic, package = "hecedsm")
model <- lm(score ~ group, data = arithmetic)
data <- data.frame(
  fitted = fitted(model),
  residuals = rstudent(model))
ggplot(data = data,
       mapping = aes(x = fitted,
                     y = residuals)) +
  geom_point() # the source is cut off after "+"; a scatterplot layer is assumed
Test diagnostics

Can use a statistical test for $H_0: \sigma_1 = \cdots = \sigma_K$.

  • Bartlett's test (assumes normal data)
  • Levene's test: a one-way ANOVA for $|y_{ik} - \widehat{\mu}_k|$
  • Brown–Forsythe test: a one-way ANOVA for $|y_{ik} - \mathrm{median}_k|$ (more robust)
  • Fligner-Killeen test: based on ranks

Different tests may yield different conclusions

Bartlett is uniformly most powerful for normal data.

Levene and Brown–Forsythe are the most commonly used in practice (from what I have seen so far).

Example in R

data(arithmetic, package = "hecedsm")
model <- aov(score ~ group, data = arithmetic)
car::leveneTest(model) #Brown-Forsythe by default
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 4 1.2072 0.3228
## 40

Fail to reject the null: no evidence of unequal variance
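
To make the definitions concrete, the sketch below computes the Levene and Brown–Forsythe statistics by hand as one-way ANOVAs on absolute deviations, using base R only. The data are simulated, with one group deliberately given a larger standard deviation, so the conclusion differs from that of the arithmetic example.

```r
set.seed(80667)
group <- factor(rep(c("A", "B", "C"), each = 50))
# Equal means, but group C has three times the standard deviation
y <- rnorm(150, mean = 0, sd = c(1, 1, 3)[as.integer(group)])

# Brown-Forsythe: one-way ANOVA on absolute deviations from the group median
abs_dev_median <- abs(y - ave(y, group, FUN = median))
bf <- anova(lm(abs_dev_median ~ group))

# Levene: same idea with deviations from the group average
abs_dev_mean <- abs(y - ave(y, group, FUN = mean))
levene <- anova(lm(abs_dev_mean ~ group))

bf[1, c("F value", "Pr(>F)")]
levene[1, c("F value", "Pr(>F)")]

# Tests available in base R
bartlett.test(y ~ group)  # assumes normal data
fligner.test(y ~ group)   # based on ranks
```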

Box's take

To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!

Box, G.E.P. (1953). Non-Normality and Tests on Variances. Biometrika 40 (3–4): 318–335.

  • In large samples, power is large, so we will probably always reject $H_0: \sigma_1 = \cdots = \sigma_K$.
  • If the heterogeneity is only per experimental condition, use Welch's ANOVA (oneway.test in R).
  • This statistic estimates the std. deviation of each group separately.
  • Could (should?) be the default when you have a large number of observations, or enough to reliably estimate the mean and std. deviation.
What can go wrong? Spurious findings!

Reject null hypothesis more often even if no difference in mean!


Histogram of the null distribution of p-values obtained through simulation using the classical analysis of variance F-test (left) and Welch's unequal variance alternative (right), based on 10 000 simulations. Each simulated sample consists of 50 observations from a standard normal distribution and 10 observations from a centered normal distribution with variance 9. The uniform distribution would have 5% in each of the 20 bins used for the display.
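
The simulation described in the caption can be sketched as follows. This is not the code behind the figure: it uses 2000 replications rather than 10 000 to keep the run time short, and the helper `sim_pvalues` is hypothetical. Both tests are obtained from `oneway.test` by switching the `var.equal` argument.

```r
set.seed(80667)
B <- 2000  # fewer replications than the 10 000 behind the figure
group <- factor(rep(c("a", "b"), times = c(50, 10)))

# Hypothetical helper: one sample with equal means and unequal variances
sim_pvalues <- function() {
  y <- c(rnorm(50, sd = 1), rnorm(10, sd = 3))
  c(classical = oneway.test(y ~ group, var.equal = TRUE)$p.value,
    welch = oneway.test(y ~ group, var.equal = FALSE)$p.value)
}

pvals <- replicate(B, sim_pvalues())

# Share of p-values below 0.05; a calibrated test gives about 0.05
rejection_rate <- rowMeans(pvals < 0.05)
rejection_rate
```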

More complex heterogeneity patterns

  • Variance-stabilizing transformations (e.g., log for counts)
  • An explicit model for trend over time, etc. may be necessary in more complex designs (repeated measures) where there is a learning effect.
# 3: Independence

As a Quebecer, I am not allowed to talk about this topic.

No visual diagnostic or test available.

Rather, infer from context.


Knowing the value of one observation tells us nothing about the value taken by the others.

Checking independence

  • Repeated measures are not independent
  • Group structure (e.g., people performing experiment together and exchanging feedback)
  • Time dependence (time series, longitudinal data).
  • Dependence on instrumentation, experimenter, time of the day, etc.

Observations close by tend to be more alike (correlated).

# 4: Sample size (normality?)

Where does the F-distribution come from?

Normality of group average

This holds (in great generality) because of the central limit theorem.

Central limit theorem

In large samples, the sample mean is approximately normally distributed.

Top row shows data generating mechanism and a sample, bottom row shows the distribution of the sample mean of n=30 and n=50 observations.
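
A minimal sketch of the same idea, assuming a skewed data generating mechanism (exponential with mean 1, an arbitrary choice that need not match the figure). The sample means should be centered at 1 with standard deviation close to $1/\sqrt{n}$.

```r
set.seed(80667)

# Hypothetical helper: B sample means of n exponential observations
sample_means <- function(n, B = 5000) {
  replicate(B, mean(rexp(n, rate = 1)))
}

means_30 <- sample_means(30)
means_50 <- sample_means(50)

# Empirical mean and standard deviation versus the theoretical 1 / sqrt(n)
c(mean = mean(means_30), sd = sd(means_30), theory_sd = 1 / sqrt(30))
c(mean = mean(means_50), sd = sd(means_50), theory_sd = 1 / sqrt(50))

if (interactive()) {
  hist(means_50, breaks = 40, main = "Sample means, n = 50")
}
```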


How large should my sample be?

Rule of thumb: 20 or 30 per group

Gather sufficient number of observations.

Assessing approximate normality

The closer data are to being normal, the better the large-sample distribution approximation is.

Can check normality via quantile-quantile plot with standardized residuals ri:

  • on the x-axis, the theoretical quantiles $\widehat{F}^{-1}\{\mathrm{rank}(r_i)/(n+1)\}$ of the residuals, where $F^{-1}$ is the normal quantile function.
  • on the y-axis, the empirical quantiles ri

In R, use functions qqnorm or car::qqPlot to produce the plots.
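
The plotting positions can also be computed by hand, as in the sketch below on simulated normal data (group means and sample sizes are made up). Note that `qqnorm` uses slightly different plotting positions, so its x-axis values will not match these exactly.

```r
set.seed(80667)
group <- factor(rep(c("A", "B", "C"), each = 15))
y <- rnorm(45, mean = c(10, 12, 14)[as.integer(group)], sd = 2)

r <- rstandard(lm(y ~ group))  # standardized residuals
n <- length(r)

# Theoretical normal quantiles at positions rank(r_i) / (n + 1)
theoretical <- qnorm(rank(r) / (n + 1))

if (interactive()) {
  plot(theoretical, r,
       xlab = "theoretical quantiles", ylab = "empirical quantiles")
  abline(a = 0, b = 1)  # points should fall near this line
}

# Numerical summary of the alignment
cor(theoretical, r)
```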

More about quantile-quantile plots

The ordered residuals should align on a straight line.

Normal quantile-quantile plot (left) and Tukey's mean difference QQ-plot (right).

Recap 1

  • One-way analysis of variance compares average of experimental groups
  • Null hypothesis: all groups have the same average
  • Easier to detect when the null hypothesis is false if:
    • large differences between group averages
    • small variability
    • large samples
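
These three factors can be illustrated with `power.anova.test` from base R, which computes power for a balanced one-way ANOVA under normality and equal variance. The numbers below (five groups, nine observations per group, and the variance values) are arbitrary and only serve to show the direction of each change.

```r
# Baseline scenario
base <- power.anova.test(groups = 5, n = 9,
                         between.var = 1, within.var = 4)$power
# Larger differences between group averages
bigger_diff <- power.anova.test(groups = 5, n = 9,
                                between.var = 2, within.var = 4)$power
# Smaller variability within groups
less_noise <- power.anova.test(groups = 5, n = 9,
                               between.var = 1, within.var = 2)$power
# Larger samples
more_data <- power.anova.test(groups = 5, n = 18,
                              between.var = 1, within.var = 4)$power

round(c(base = base, bigger_diff = bigger_diff,
        less_noise = less_noise, more_data = more_data), 2)
```
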
Recap 2

  • Model assumes independent observations, additive structure and equal variability in each group.
  • All statements are approximate, but invalid model assumptions can lead to spurious results or lower power.
