Session 9
MATH 80667A: Experimental Design and Statistical Methods
HEC Montréal
Effect sizes
Power
Quote from the OSC psychology replication
The key statistics provided in the paper to test the “depletion” hypothesis is the main effect of a one-way ANOVA with three experimental conditions and confirmatory information processing as the dependent variable; F(2,82)=4.05, p=0.02, η2=0.09. Considering the original effect size and an alpha of 0.05 the sample size needed to achieve 90% power is 132 subjects.
Replication report of Fischer, Greitemeyer, and Frey (2008, JPSP, Study 2) by E.M. Galliani
Q: How many observations should I gather to reliably detect an effect?
Q: How big is this effect?
With a large enough sample size, any difference between treatments, however small, becomes statistically significant.
Statistical significance ≠ practical relevance
But whether this is important depends on the scientific question.
Test statistics and p-values are not good summaries of the magnitude of an effect:
Instead use
standardized differences
percentage of variability explained
Estimators popularized in the handbook
Cohen, Jacob. Statistical Power Analysis for the Behavioral Sciences, 2nd ed., Routledge, 1988.
The plot shows null (thick) and true sampling distributions (dashed) for the same difference in sample mean with small (left) and large (right) samples.
Left to right: parameter $\mu$ (target), estimator $\widehat{\mu}$ (recipe), and estimate $\widehat{\mu} = 10$ (numerical value, proxy).
From Twitter, @simongrund89
Standardized measure of effect (dimensionless, i.e., no units):
Assuming equal variance $\sigma^2$, compare the means of two groups $i$ and $j$:
$$d = \frac{\mu_i - \mu_j}{\sigma}$$
Cohen's classification: small (d=0.2), medium (d=0.5) or large (d=0.8) effect size.
Note: this is not the t-statistic (the denominator is the estimated standard deviation, not the standard error of the mean).
Note that there are multiple versions of Cohen's coefficients; the values above are those used by the pwr package. What counts as a small/medium/large effect size varies depending on the test! See the pwr vignette for the defaults.
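As a quick check of these conventions, a minimal sketch using pwr's built-in lookup of the conventional values:

library(pwr)
cohen.ES(test = "t", size = "medium")     # Cohen's d = 0.5 for a two-sample t-test
cohen.ES(test = "anov", size = "medium")  # Cohen's f = 0.25 for a one-way ANOVA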
For a one-way ANOVA (equal variance $\sigma^2$) with more than two groups, Cohen's $f$ is the square root of
$$f^2 = \frac{1}{\sigma^2} \sum_{j=1}^{k} \frac{n_j}{n} (\mu_j - \mu)^2,$$
a weighted sum of squared differences relative to the overall mean $\mu$.
For $k = 2$ groups, Cohen's $f$ and Cohen's $d$ are related via $f = d/2$.
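For concreteness, a minimal R sketch of the formula with made-up numbers: three balanced groups with means 10, 12, 14 and common standard deviation 4 (all values hypothetical).

mu <- c(10, 12, 14)    # hypothetical group means
sigma <- 4             # hypothetical common standard deviation
w <- rep(1/3, 3)       # balanced design, so the weights n_j/n are equal
mu_bar <- sum(w * mu)  # overall mean
f <- sqrt(sum(w * (mu - mu_bar)^2) / sigma^2)
f                      # 0.408, "large" by Cohen's convention (f = 0.4)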
If there is a single experimental factor $A$, we break down the variability as $\sigma^2_{\text{total}} = \sigma^2_{\text{resid}} + \sigma^2_A$ and define the percentage of variability explained by the effect of $A$ as
$$\eta^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{\sigma^2_A}{\sigma^2_{\text{total}}}.$$
For the balanced one-way between-subject ANOVA, the typical estimator is the coefficient of determination
$$\widehat{\eta}^2 \equiv \widehat{R}^2 = \frac{F\nu_1}{F\nu_1 + \nu_2},$$
where $\nu_1 = K - 1$ and $\nu_2 = n - K$ are the degrees of freedom for the one-way ANOVA with $n$ observations and $K$ groups.
People frequently write $\eta^2$ when they mean the estimator $\widehat{R}^2$.
Another estimator of $\eta^2$, recommended in Keppel & Wickens (2004) for power calculations, is $\widehat{\omega}^2$.
For the one-way between-subject ANOVA, the latter is obtained from the $F$-statistic as
$$\widehat{\omega}^2 = \frac{\nu_1(F - 1)}{\nu_1(F - 1) + n}.$$
Since the $F$ statistic is approximately 1 on average under the null hypothesis, this estimator subtracts that baseline value.
Software usually takes Cohen's $f$ (or $f^2$) as input for the effect size.
Convert from $\eta^2$ (proportion of variance) to $f^2$ (ratio of variances) via the relationship
$$f^2 = \frac{\eta^2}{1 - \eta^2}.$$
Replacing $\eta^2$ by $\widehat{R}^2$ or $\widehat{\omega}^2$ and plugging in the estimated values gives
$$\widehat{f} = \sqrt{\frac{F\nu_1}{\nu_2}}, \qquad \widetilde{f} = \sqrt{\frac{\nu_1(F - 1)}{n}}.$$
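Plugging in the numbers from the replication quote above ($F(2, 82) = 4.05$, hence $K = 3$ groups and $n = 85$ observations), both estimators can be computed directly:

Fstat <- 4.05; nu1 <- 2; nu2 <- 82
n <- nu1 + nu2 + 1                                     # n = 85 observations in total
R2 <- Fstat * nu1 / (Fstat * nu1 + nu2)                # 0.090, the reported eta-squared
omega2 <- nu1 * (Fstat - 1) / (nu1 * (Fstat - 1) + n)  # 0.067
fhat <- sqrt(Fstat * nu1 / nu2)                        # 0.314
ftilde <- sqrt(nu1 * (Fstat - 1) / n)                  # 0.268

The value $\widehat{f} \approx 0.314$ is the one fed to pwr.anova.test further below.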
With a completely randomized design with only experimental factors, use the partial effect size $\eta^2_{\langle \text{effect} \rangle} = \sigma^2_{\text{effect}} / (\sigma^2_{\text{effect}} + \sigma^2_{\text{resid}})$.
In R, use effectsize::omega_squared(model, partial = TRUE).
Consider a completely randomized balanced design with two factors $A$, $B$ and their interaction $AB$. In a balanced design, we can decompose the total variance as
$$\sigma^2_{\text{total}} = \sigma^2_A + \sigma^2_B + \sigma^2_{AB} + \sigma^2_{\text{resid}}.$$
Cohen's partial $f^2$ measures the variability explained by a main effect or an interaction relative to the residual variability, e.g.,
$$f^2_{\langle A \rangle} = \frac{\sigma^2_A}{\sigma^2_{\text{resid}}}, \qquad f^2_{\langle AB \rangle} = \frac{\sigma^2_{AB}}{\sigma^2_{\text{resid}}}.$$
These variance quantities are unknown, so they must be estimated somehow.
Effect sizes are often reported in terms of variability via the ratio
$$\eta^2_{\langle \text{effect} \rangle} = \frac{\sigma^2_{\text{effect}}}{\sigma^2_{\text{effect}} + \sigma^2_{\text{resid}}}.$$
$\widehat{\omega}^2_{\langle \text{effect} \rangle}$ is presumed less biased than $\widehat{\eta}^2_{\langle \text{effect} \rangle}$, as is $\widehat{\epsilon}_{\langle \text{effect} \rangle}$.
The formulas are similar to those of the one-way case for between-subject experiments, with
$$\widehat{\omega}^2_{\langle \text{effect} \rangle} = \frac{\mathrm{df}_{\text{effect}}(F_{\text{effect}} - 1)}{\mathrm{df}_{\text{effect}}(F_{\text{effect}} - 1) + n},$$
where $n$ is the overall sample size.
In R, effectsize::omega_squared reports these estimates with one-sided confidence intervals.
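For instance, a minimal sketch using the built-in ToothGrowth data purely as a stand-in for a two-factor completely randomized design:

data(ToothGrowth)
ToothGrowth$dose <- factor(ToothGrowth$dose)  # treat dose as an experimental factor
mod <- aov(len ~ supp * dose, data = ToothGrowth)
# Partial omega-squared for each effect, with one-sided confidence intervals
effectsize::omega_squared(mod, partial = TRUE, ci = 0.9)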
Reference for confidence intervals: Steiger (2004), Psychological Methods
The confidence intervals are based on the $F$ distribution, obtained by varying the noncentrality parameter and inverting the distribution function (the pivot method). There is a one-to-one correspondence with Cohen's $f$, and a bijection between the latter and omega_sq_partial or eta_sq_partial. This yields asymmetric intervals.
Given an estimate of $\eta^2_{\langle \text{effect} \rangle}$, convert it into an estimate of Cohen's partial $f^2_{\langle \text{effect} \rangle}$, e.g.,
$$\widehat{f}^2_{\langle \text{effect} \rangle} = \frac{\widehat{\omega}^2_{\langle \text{effect} \rangle}}{1 - \widehat{\omega}^2_{\langle \text{effect} \rangle}}.$$
The function effectsize::cohens_f returns $\widetilde{f}^2 = n^{-1} F_{\text{effect}} \, \mathrm{df}_{\text{effect}}$, a transformation of $\widehat{\eta}^2_{\langle \text{effect} \rangle}$.
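The corresponding call, reusing the illustrative model mod fitted in the sketch above:

effectsize::cohens_f(mod, partial = TRUE)  # partial Cohen's f for each effect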
Journals and grant agencies oftentimes require an estimate of the sample size needed for a study.
Same for replication studies: how many participants needed?
How does the F-test behave under an alternative?
What do you think is the effect on power of an increase of the between-group differences?
The peak of the distribution shifts to the right.
Why? On average, the numerator of the $F$-statistic is
$$\mathsf{E}(\text{between-group variability}) = \sigma^2 + \frac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{K - 1}.$$
Under the null hypothesis, $\mu_j = \mu$ for $j = 1, \ldots, K$.
The alternative distribution is the noncentral $\mathsf{F}(\nu_1, \nu_2, \Delta)$ distribution with degrees of freedom $\nu_1$ and $\nu_2$ and noncentrality parameter
$$\Delta = \frac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{\sigma^2}.$$
The null hypothesis corresponds to a single value (equality of means), whereas there are infinitely many alternatives...
Power is the ability to detect when the null is false, for a given alternative (dashed).
Power is the area in white under the dashed curve, beyond the cutoff.
In which of the two figures is power larger?
Think of potential factors that impact power for a factorial design.
We focus on the interplay between
effect size | power | sample size
The significance level is fixed, but we may consider multiplicity corrections within the power function. The noise level is oftentimes intrinsic to the measurement.
In a one-way ANOVA, the alternative distribution of the $F$ test has an additional parameter $\Delta$, which depends on both the sample size and the effect size:
$$\Delta = \frac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{\sigma^2} = nf^2.$$
Under the null hypothesis, $\mu_j = \mu$ for $j = 1, \ldots, K$ and $\Delta = 0$.
The greater $\Delta$, the further the mode (the peak of the distribution) is from unity.
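Power can thus be computed directly from the noncentral $F$ distribution in base R; a minimal sketch for a one-way ANOVA with $K = 3$ groups of 20 observations and $f = 0.25$ (illustrative values):

K <- 3; n_group <- 20; f <- 0.25       # hypothetical design and effect size
n <- K * n_group
nu1 <- K - 1; nu2 <- n - K
Delta <- n * f^2                       # noncentrality parameter
cutoff <- qf(0.95, nu1, nu2)           # critical value under the null
1 - pf(cutoff, nu1, nu2, ncp = Delta)  # power: area beyond the cutoff, roughly 0.36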
$$\Delta = \frac{\sum_{j=1}^K n_j(\mu_j - \mu)^2}{\sigma^2}.$$
When does power increase?
What is the effect of an increase of the group sample sizes $n_j$, of the mean differences $\mu_j - \mu$, or of the noise $\sigma^2$?
The alternative distribution is the noncentral $\mathsf{F}(\nu_1, \nu_2, \Delta)$ distribution with degrees of freedom $\nu_1$ and $\nu_2$ and noncentrality parameter $\Delta$.
For other tests, parameters vary but the story is the same.
The plot shows the null and alternative distributions. The noncentral F is shifted to the right (mode = peak) and right skewed. The power is shaded in blue, the null distribution is shown in dashed lines.
Consider a completely randomized design with two crossed factors $A$ and $B$.
We are interested in the interaction, $\eta^2_{\langle AB \rangle}$, and we want 80% power:
# Estimate Cohen's f from omega.sq.part, the partial omega-squared
fhat <- sqrt(omega.sq.part / (1 - omega.sq.part))
# na and nb are the number of levels of each factor
WebPower::wp.kanova(power = 0.8, f = fhat,
                    ndf = (na - 1) * (nb - 1), ng = na * nb)
library(pwr)
power_curve <- pwr.anova.test(
  f = 0.314,  # from R-squared
  k = 3,
  power = 0.9,
  sig.level = 0.05)
plot(power_curve)
Recall: convert $\eta^2$ to Cohen's $f$ (the effect size reported in pwr) via $f^2 = \eta^2/(1 - \eta^2)$.
Using $\widetilde{f}$ instead (from $\widehat{\omega}^2$) yields $n = 59$ observations per group!
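To reproduce that figure, plug $\widetilde{f} \approx 0.268$ (computed earlier from the quote's $F(2,82) = 4.05$) into the same routine:

library(pwr)
pwr.anova.test(k = 3, f = 0.268, power = 0.9, sig.level = 0.05)
# n comes out just below 59, hence 59 observations per group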
WARNING!
Most effects reported in the literature are severely inflated.
Publication bias & the file drawer problem
Recall the file drawer problem: most studies with small effects lead to nonsignificant results and are not published, so the effects reported in the literature are larger than the true ones.
Better to do a large replication than multiple small studies.
Sometimes, the estimated values of the effect size, etc., are used as plug-ins in the power calculation.
Statistical fallacy
Rejecting a null hypothesis does not mean the alternative is true!
Power is a long-term frequency property: in a given experiment, we either reject or we don't.
This practice is not recommended unless the observed differences among the means seem important in practice but are not statistically significant.