Session 3
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Hypothesis tests for ANOVA
Model assumptions
Global null hypothesis
No difference between treatments
Tacitly assume that all observations have the same standard deviation σ.
Decomposing variability into bits
$$\underbrace{\sum_{i}\sum_{k}(y_{ik}-\hat{\mu})^2}_{\text{total sum of squares}}=\underbrace{\sum_{i}\sum_{k}(y_{ik}-\hat{\mu}_k)^2}_{\text{within sum of squares}}+\underbrace{\sum_{k} n_k(\hat{\mu}_k-\hat{\mu})^2}_{\text{between sum of squares}}.$$
The total sum of squares is the variability under the null model, the within sum of squares is the variability under the alternative model, and the between sum of squares is the added variability explained by the group means.
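As a sanity check, the decomposition can be verified by hand in R. A minimal sketch using the `arithmetic` data from the `hecedsm` package (introduced later in these slides); the variable names `mu_hat` and `mu_k` are my own:

```r
data(arithmetic, package = "hecedsm")
y <- arithmetic$score
group <- arithmetic$group
mu_hat <- mean(y)              # overall mean
mu_k <- tapply(y, group, mean) # per-group means
# Total, within and between sums of squares
ss_total <- sum((y - mu_hat)^2)
ss_within <- sum((y - mu_k[group])^2)
ss_between <- sum(table(group) * (mu_k - mu_hat)^2)
all.equal(ss_total, ss_within + ss_between) # TRUE: the decomposition holds
```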
Omnibus test
With K groups and n observations, the statistic is
$$F=\frac{\text{between-group variability}}{\text{within-group variability}}=\frac{\text{between sum of squares}/(K-1)}{\text{within sum of squares}/(n-K)}$$
If all groups have the same mean, both numerator and denominator are estimators of $\sigma^2$, thus the ratio should be close to one.
The null distribution (benchmark) is a Fisher distribution F(ν1,ν2).
The parameters ν1,ν2 are called degrees of freedom.
For the one-way ANOVA: $\nu_1 = K-1$ and $\nu_2 = n-K$.
The number of constraints comes from the fact that we go from $K$ means under the alternative to a single mean under the null.
Note: for $\nu_2$ large, the $F(\nu_1,\nu_2)$ distribution is indistinguishable from that of a $\chi^2(\nu_1)$ variable divided by $\nu_1$.
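A quick numerical check of this large-$\nu_2$ approximation (a sketch; the cutoff 3 and $\nu_2 = 10^5$ are arbitrary):

```r
q <- 3 # arbitrary cutoff
pf(q, df1 = 4, df2 = 1e5, lower.tail = FALSE) # tail of F(4, 1e5)
pchisq(4 * q, df = 4, lower.tail = FALSE)     # tail of chi-squared(4)/4 above q
# Both return essentially the same probability
```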
From Davison (2008), Example 9.2
In an investigation on the teaching of arithmetic, 45 pupils were divided at random into five groups of nine. Groups A and B were taught in separate classes by the usual method. Groups C, D, and E were taught together for a number of days. On each day C were praised publicly for their work, D were publicly reproved and E were ignored. At the end of the period all pupils took a standard test.
Let $\mu_A, \ldots, \mu_E$ denote the population average (expectation) score on the test for each experimental condition.
The null hypothesis is $H_0: \mu_A = \mu_B = \cdots = \mu_E$ against the alternative $H_a$ that at least one of the population averages differs.
```r
# Fit one-way analysis of variance
test <- aov(data = arithmetic, formula = score ~ group)
anova(test) # print ANOVA table
```
| term | df | sum of squares | mean square | statistic | p-value |
|-----------|----|----------------|-------------|-----------|---------|
| group | 4 | 722.67 | 180.67 | 15.27 | < 1e-04 |
| Residuals | 40 | 473.33 | 11.83 | | |
The p-value gives the probability of observing an outcome as extreme as the one in the sample if the null hypothesis were true.
```r
# Compute p-value
pf(15.27, df1 = 4, df2 = 40, lower.tail = FALSE)
```
Probability that an $F(4,40)$ random variable exceeds 15.27.
All statements about p-values are approximate.
- Additivity and linearity
- Independence
- Equal variance
- Large sample size
Write the $i$th observation of the $k$th experimental group as
$$\underbrace{Y_{ik}}_{\text{observation}} = \underbrace{\mu_k}_{\text{mean of group}} + \underbrace{\varepsilon_{ik}}_{\text{error term}},$$
where $i=1,\ldots,n_k$ indexes observations and $k=1,\ldots,K$ indexes groups.
The additive decomposition reads:
(quantity depending on the treatment used) + (quantity depending only on the particular unit)
Plot group averages $\{\hat{\mu}_k\}$ against residuals $\{e_{ik}\}$, where $e_{ik}=y_{ik}-\hat{\mu}_k$.
By construction, the sample mean of the $e_{ik}$ is always zero.
Less improvement for scores of stronger students.
The plot and context suggest a multiplicative structure; it is tempting to diagnose unequal variance.
Reading diagnostic plots requires practice (and is analogous to reading tea leaves: it leaves a lot to interpretation).
Look for potential patterns on the y-axis only!
Multiplicative data of the form (quantity depending on the treatment used) × (quantity depending only on the particular unit) tend to have higher variability when the response is larger.
A log-transformation of the response makes the model additive.
For responses bounded between $a$ and $b$, reduce warping effects via $\ln\left\{\dfrac{x-a+\delta}{b+\delta-x}\right\}$
Careful with transformations:
If we consider a response on the log scale, the test is for equality of the geometric means!
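A minimal helper implementing this transformation; the function name, the default offset `delta = 0.5`, and the example bounds are illustrative, not part of the course material:

```r
# Map a response bounded in (a, b) to the whole real line
logit_bounded <- function(x, a, b, delta = 0.5) {
  # delta > 0 avoids taking the log of zero at the boundaries
  log((x - a + delta) / (b + delta - x))
}
# e.g., for scores known to lie between 0 and 100 (illustrative bounds):
# transformed <- logit_bounded(arithmetic$score, a = 0, b = 100)
```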
Plot residuals against other explanatory variables.
Difference in average response: while the treatment seems to lead to a decrease in the response variable, a stratification by age group reveals that this occurs only in the less-than-25 age group, with a seemingly reversed effect for the adults. Thus, the marginal model implied by the one-way analysis of variance is misleading.
An interaction occurs when the effect of experimental group depends on another variable.
In principle, randomization ensures we capture the average marginal effect (even if it is misleading or useless).
We could incorporate the interacting variable in the model to capture its effect (at the cost of a more complex model), e.g., using a two-way ANOVA.
Each observation has the same standard deviation σ.
ANOVA is quite sensitive to this assumption!
Plot standardized (`rstandard`) or studentized (`rstudent`) residuals against fitted values.
```r
data(arithmetic, package = "hecedsm")
model <- lm(score ~ group, data = arithmetic)
data <- data.frame(
  fitted = fitted(model),     # fitted values = group averages
  residuals = rstudent(model) # externally studentized residuals
)
library(ggplot2)
ggplot(data = data, mapping = aes(x = fitted, y = residuals)) +
  geom_point()
```
Can use a statistical test of $H_0: \sigma_1 = \cdots = \sigma_K$.
Different tests may yield different conclusions
Bartlett's test is uniformly most powerful for normal data.
Levene's and Brown-Forsythe tests are the most commonly used in practice (in my experience).
```r
data(arithmetic, package = "hecedsm")
model <- aov(score ~ group, data = arithmetic)
car::leveneTest(model) # Brown-Forsythe by default
```
```
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  4  1.2072 0.3228
##       40
```
Fail to reject the null: no evidence of unequal variance
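To illustrate that different tests may disagree, we can compare with Bartlett's test (a sketch; `bartlett.test` is in base R's stats package):

```r
# Bartlett's test of equal variances on the same data
bartlett.test(score ~ group, data = arithmetic)
```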
To make the preliminary test on variances is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!
Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40(3–4), 318–335.
If the variances are unequal, use Welch's statistic instead (`oneway.test` in R). Otherwise, we reject the null hypothesis more often even if there is no difference in means!
Histogram of the null distribution of p-values obtained through simulation using the classical analysis of variance F-test (left) and Welch's unequal variance alternative (right), based on 10 000 simulations. Each simulated sample consists of 50 observations from a standard normal distribution and 10 observations from a centered normal distribution with variance 9. The uniform distribution would have 5% in each of the 20 bins used for the display.
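A sketch of the simulation behind this figure; the seed and the two-group labels are my own choices, while the sample sizes and variances follow the caption:

```r
set.seed(80667) # arbitrary seed for reproducibility
pvals <- replicate(10000, {
  # 50 standard normals plus 10 centered normals with variance 9 (sd = 3)
  y <- c(rnorm(50), rnorm(10, mean = 0, sd = 3))
  g <- factor(rep(c("A", "B"), times = c(50, 10)))
  c(classical = anova(lm(y ~ g))[["Pr(>F)"]][1], # classical F-test
    welch = oneway.test(y ~ g)$p.value)          # Welch's alternative
})
rowMeans(pvals < 0.05) # empirical size at the 5% level
```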
As a Quebecer, I am not allowed to talk about this topic.
No visual diagnostic or test available.
Rather, infer from context.
Knowing the value of one observation tells us nothing about the value taken by the others.
Observations close by tend to be more alike (correlated).
Where does the F-distribution come from?
Normality of group average
This holds (in great generality) because of the central limit theorem.
In large samples, the sample mean is approximately normally distributed.
Top row shows the data-generating mechanism and a sample; bottom row shows the distribution of the sample mean of n = 30 and n = 50 observations.
Rule of thumb: 20 or 30 per group
Gather a sufficient number of observations.
The closer data are to being normal, the better the large-sample distribution approximation is.
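A quick illustration of the central limit theorem in action (a sketch; the exponential distribution and the sample size are illustrative):

```r
set.seed(123) # arbitrary seed
# Sampling distribution of the mean of n = 30 skewed observations
means <- replicate(10000, mean(rexp(n = 30, rate = 1)))
hist(means, breaks = 30, main = "Sample means, n = 30")
# Approximately bell-shaped despite the skewed parent distribution
```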
Can check normality via a quantile-quantile plot with standardized residuals $r_i$:
In R, use the functions `qqnorm` or `car::qqPlot` to produce the plots.
The ordered residuals should align on a straight line.
Normal quantile-quantile plot (left) and Tukey's mean-difference QQ-plot (right).
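A minimal sketch producing a normal quantile-quantile plot for the arithmetic example (the `id = FALSE` argument, which suppresses point labelling, is optional):

```r
model <- lm(score ~ group, data = arithmetic)
r <- rstandard(model)          # standardized residuals
qqnorm(r); qqline(r)           # base R quantile-quantile plot
car::qqPlot(model, id = FALSE) # version with pointwise confidence bands
```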