3 Completely randomized designs
This chapter focuses on experiments in which one or more factors of interest are manipulated by the experimenter to study their impact. If the allocation of experimental units to treatment combinations is completely random, the resulting experiment is a completely randomized design.
The one-way analysis of variance describes the simplest experimental setup one can consider: a completely randomized experiment with a single factor, in which we are solely interested in the effect of one treatment variable with multiple levels.
3.1 One-way analysis of variance
The focus is on comparisons of the averages of a single outcome variable between $K$ treatment groups. Let $\mu_1, \ldots, \mu_K$ denote the (unknown) population average of each group; the null hypothesis of interest is that all group averages are equal, $\mu_1 = \cdots = \mu_K$, against the alternative that at least two of them differ.
3.1.1 Parametrizations and contrasts
This section can be skipped on first reading. It focuses on the interpretation of the coefficients obtained from a linear model or analysis of variance model.
The most natural parametrization is in terms of group averages: the (theoretical, unknown) average of treatment group $k$ is $\mu_k$ for $k = 1, \ldots, K$, and each is estimated by the corresponding sample average.
The most common parametrization for the linear model is in terms of differences to a baseline, say group 1: the parameters are the baseline average $\mu_1$ and the differences $\mu_k - \mu_1$ for $k = 2, \ldots, K$.
An equivalent formulation writes the average of subpopulation $k$ as the sum of a global average $\mu$ and a group-specific deviation $\alpha_k = \mu_k - \mu$, subject to the sum-to-zero constraint $\alpha_1 + \cdots + \alpha_K = 0$.
Example 3.1 (Impact of encouragement on teaching) In R, the lm function fits a linear model based on a formula of the form response ~ explanatory. If the explanatory variable is categorical (i.e., a factor), the parameters of this model are the intercept, which is the sample average of the baseline group, and contrasts, i.e., the differences between the average of each of the other groups and that of the baseline.
The sum-to-zero parametrization is obtained with contrasts = list(... = contr.sum), where the ellipsis is replaced by the name of the categorical variable; an easier alternative is aov, which enforces this parametrization by default. With the sum-to-zero parametrization, the intercept is the average of the treatment averages, and the other coefficients are the group-specific deviations $\alpha_k$; the deviation of the omitted group is recovered from the sum-to-zero constraint.
We show the function calls to fit a one-way ANOVA in the different parametrizations, along with the sample average of each group of the arithmetic data (the two control groups, which were taught separately, and the groups that were praised, reproved and ignored in a third class). Note that the omitted category changes depending on the parametrization.
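A minimal sketch of the two fits, assuming the arithmetic data frame contains the score variable and the group factor; the coef and tapply calls are added here to extract and compare the estimates:

# baseline (treatment contrast) parametrization, the R default
mod_contrast <- lm(score ~ group,
                   data = arithmetic)
# sum-to-zero parametrization
mod_sum2zero <- lm(score ~ group,
                   data = arithmetic,
                   contrasts = list(group = contr.sum))
coef(mod_contrast) # intercept = baseline average; rest = differences to it
coef(mod_sum2zero) # intercept = mean of the group averages; rest = deviations
tapply(arithmetic$score, arithmetic$group, mean) # sample average of each group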
We can still assess the hypothesis by comparing the sample means of each group, which are noisy estimates of the population means: their inherent variability will limit our ability to detect differences in averages if the signal-to-noise ratio is small.
3.1.2 Sum of squares decomposition
The following section can be safely skipped on first reading: it attempts to shed some light on how the $F$ statistic of the analysis of variance is constructed.
The usual notation for the sum of squares decomposition is as follows: suppose there are $K$ groups with $n_k$ observations in group $k$, for a total sample size of $n = n_1 + \cdots + n_K$; write $y_{ik}$ for the $i$th observation of group $k$, $\widehat{\mu}_k$ for the sample average of group $k$ and $\widehat{\mu}$ for the overall sample average.
Under the null model, all groups have the same mean, so the natural estimator of the latter is the sample average of the pooled sample, $\widehat{\mu}$.
We can measure how much worse the null model (a common average) does relative to the alternative (a different average for each group) by calculating the between-group sum of squares. This quantity in itself grows with the sample size (the more observations, the larger it is), so we must, as usual, standardize it to obtain a suitable benchmark.
The $F$ statistic compares these two sources of variability through the ratio of mean squares,
$$F = \frac{\text{between-group sum of squares}/(K-1)}{\text{within-group sum of squares}/(n-K)}.$$
If there is no mean difference (the null hypothesis), the numerator is an estimator of the population variance, and so is the denominator, so the ratio should fluctuate around one.
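As a sanity check, the decomposition can be computed by hand; a minimal sketch, assuming the arithmetic data with factor group:

# sum of squares decomposition and F statistic by hand
overall_mean <- mean(arithmetic$score)
group_mean <- ave(arithmetic$score, arithmetic$group) # group average, repeated per observation
ss_between <- sum((group_mean - overall_mean)^2)
ss_within <- sum((arithmetic$score - group_mean)^2)
K <- nlevels(arithmetic$group)
n <- nrow(arithmetic)
(ss_between / (K - 1)) / (ss_within / (n - K)) # matches anova(lm(score ~ group, data = arithmetic))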
Figure 3.1 shows how the difference between these distances carries information about whether the null is wrong. The sum of squares is obtained by computing the squared length of these segments and adding them up. The left panel shows a strong signal-to-noise ratio: on average, the black segments are much longer than the colored ones, indicating that the model that lets each group have its own mean is much better than the null. The picture in the right panel is not as clear: on average, the colored segments are shorter, but the difference in length is much smaller relative to their length.

The benchmark null distribution of the statistic is the $F$ distribution with $K-1$ and $n-K$ degrees of freedom.
As was alluded to in the last chapter, large-sample approximations are not the only option for assessing the null, but they are cheap and easy to obtain. If the group distributions are identical up to a possible location shift, we could instead resort to a permutation-based approach, generating samples from the null distribution by simply shuffling the group labels. We see in Figure 3.2 that the histogram of the statistics computed from the permuted samples is close to the large-sample $F$ benchmark.
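A permutation approach is straightforward to code; a minimal sketch, assuming the arithmetic data and an arbitrary number of permutations:

# permutation null distribution of the one-way ANOVA F statistic
obs <- oneway.test(score ~ group, data = arithmetic,
                   var.equal = TRUE)$statistic
nperm <- 9999L # arbitrary number of replications
perm <- replicate(nperm, {
  shuffled <- arithmetic
  shuffled$group <- sample(shuffled$group) # shuffle the labels
  oneway.test(score ~ group, data = shuffled,
              var.equal = TRUE)$statistic
})
mean(c(perm, obs) >= obs) # permutation p-value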
More interesting, perhaps, is what happens to the values taken by the statistic when not all of the averages are the same. We can see in Figure 3.3 that, when there are differences between group means, the values taken by the statistic for a random sample lie further to the right than the null distribution: the larger those differences, the more the curve shifts to the right and the more often we obtain a value in the rejection region (in red).
If there are only two groups, then one can show that the $F$ statistic is the square of the two-sample $t$ statistic with pooled variance, so the two tests are equivalent.
3.2 Graphical representation
How should one represent data for a one-way analysis in a publication? The purpose of the visualization is to provide intuition that extends beyond the reported descriptive statistics and to check the model assumptions. Most of the time, we will be interested in averages and dispersion, but plotting the raw data can be insightful. It is also important to keep in mind that summary statistics are estimators of population quantities that may be too variable in small samples to be meaningful. Since the mean estimates will likely be reported in the text, the graphics should be used to convey additional information about the data. If the samples are extremely large, then graphics will typically be used to present salient features of the distributions.
In a one-way analysis of variance, the outcome is a continuous numerical variable, whereas the treatment, or explanatory variable, is categorical. Basic graphics include dot plots, histograms and density plots, or rugs for the raw data.
Typically, scatterplots are not a good option because observations get overlaid. There are multiple workarounds, involving transparency, bubble plots for discrete data with ties, adding noise (jitter) to every observation or drawing values using a thin line (rugs) if the data are continuous and take on few distinct values.
Journals are plagued with poor visualizations, a prime example of which is the infamous dynamite plot: a bar plot with a one standard error interval. The problem with this (as with other displays of summary statistics) is that it hides precious information about the spread and the values taken by the data, as many different samples could give rise to the same average while being quite different in nature. The height of the bar is the sample average and the error bar extends one standard error beyond it: this makes little sense, as we end up comparing areas, whereas the mean is a single number. The right panel of Figure 3.4 shows instead a dot plot of the data, i.e., sample values with ties stacked for clarity, along with the sample average and a 95% confidence interval for the latter as a line underneath. In this example, there are not enough observations per group to produce histograms, and a five-number summary of nine observations isn't really necessary, so boxplots are of little use. Weissgerber et al. (2015) discuss alternative solutions and can be referenced when fighting reviewers who insist on bad visualizations.
If we have a lot of data, it sometimes helps to represent selected summary statistics or grouped data. A box-and-whiskers plot (or boxplot) is a commonly used graphic representing the whole data distribution using five numbers:
- The box gives the quartiles, say $q_1$ (lower quartile), $q_2$ (median) and $q_3$ (upper quartile): in the sample, 25% of the observations are smaller than $q_1$, 50% are smaller than $q_2$ and 75% are smaller than $q_3$.
- The whiskers extend up to $1.5$ times the box width ($q_3 - q_1$, the interquartile range) beyond the box, ending at the most extreme observation within that range (so the largest observation smaller than $q_3 + 1.5(q_3 - q_1)$, etc.)
Observations beyond the whiskers are represented by dots or circles, sometimes termed outliers. Beware of this terminology, however: the larger the sample size, the more values will fall outside the whiskers (about 0.7% of observations for normal data). This is a drawback of boxplots, which were conceived at a time when big data didn't exist. If you want to combine boxplots with the raw data, remove the display of the outliers to avoid plotting those points twice.
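A minimal sketch with base R graphics, assuming the arithmetic data; the outline argument suppresses the outlier display before the raw data are overlaid:

# boxplot without the outlier display, overlaid with jittered raw data
boxplot(score ~ group, data = arithmetic, outline = FALSE)
stripchart(score ~ group, data = arithmetic, vertical = TRUE,
           method = "jitter", pch = 20, add = TRUE)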
Weissgerber et al. (2019) contains many examples of how to build effective visualizations, including how to highlight particular aspects using color, jittering or transparency, and how to adequately select the display zone.
3.3 Pairwise tests
If the global test of equality of means for the one-way ANOVA leads to rejection of the null, the conclusion is that at least one of the groups has a different mean. However, the test indicates neither which groups differ from the rest nor how many of them do. There are different options for follow-up analysis: one is custom contrasts, a special instance of which is pairwise comparisons.
We are interested in the difference between the (population) averages of two groups, say $\mu_i - \mu_j$ for groups $i$ and $j$, estimated by the difference of the corresponding sample averages.
Assuming equal variances, the two-sample $t$ statistic standardizes this difference by its standard error, where the common variance is estimated by pooling the residuals from all $K$ groups rather than from the two groups alone.
Figure 3.6 shows the density of the benchmark distribution for pairwise comparisons of means for the arithmetic data. The blue area under the curve defines the set of values for which we fail to reject the null hypothesis, whereas all values of the test statistic falling in the red area lead to rejection at level $\alpha$.
We fail to reject the null hypothesis whenever the statistic falls in the blue area.
Example 3.2 (Calculation of pairwise comparisons) We consider the pairwise difference in average scores between the praised (group C) and the reproved (group D) groups of the arithmetic
study. The sample averages are respectively
If we take as null hypothesis that the two population averages are equal, $\mu_C = \mu_D$, the test statistic is the difference of the sample averages standardized by its estimated standard error.
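In R, all pairwise comparisons based on the pooled variance estimate can be obtained in a single call; a minimal sketch, assuming the arithmetic data and no adjustment for multiple testing:

# all pairwise t-tests with a common pooled variance estimate
pairwise.t.test(x = arithmetic$score, g = arithmetic$group,
                pool.sd = TRUE, p.adjust.method = "none")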
3.4 Model assumptions
So far, we have brushed all of the model assumptions under the carpet. These are necessary requirements for the inference to be valid: any statement related to p-values and the like will hold, at least approximately, only if a set of assumptions is met in the first place. This section is devoted to the discussion of these assumptions, showcasing examples of where things can go wrong.
It is customary to write the one-way analysis of variance model observation by observation as
$$y_{ik} = \mu_k + \varepsilon_{ik}, \qquad (3.2)$$
where $y_{ik}$ is the $i$th observation of group $k$, $\mu_k$ is the group average and the error term $\varepsilon_{ik}$ has mean zero.
3.4.1 Additivity
The basic assumption of most designs is that we can decompose the outcome into two components (Cox 1958),
$$\text{response} = \text{quantity depending only on the unit} + \text{quantity depending on the treatment received}. \qquad (3.3)$$
This additive decomposition further assumes that each unit is unaffected by the treatment assigned to the other units and that the average effect of the treatment is constant. It is then justified to use the difference in sample means to estimate the treatment effect, since the unit-specific contributions average out to zero.
The decomposition of observations in terms of group average and mean-zero noise in Equation 3.2 suggests that we could plot the error terms to check the model assumptions; the errors $\varepsilon_{ik}$ are not observable, so we rely on empirical counterparts. Many graphical diagnostics therefore use residuals, i.e., some variant of the observations minus the group mean, $y_{ik} - \widehat{\mu}_k$.
More generally, the test statistic may make further assumptions: for instance, the $F$ statistic of the one-way analysis of variance also assumes that the variance is the same in each group.
Example 3.3 (Additivity and transformations) Chapter 2 of Cox (1958) discusses the assumption of additivity and provides useful examples showing when it cannot be taken for granted. One of them, Example 2.3, is a scenario in which participants are asked to rate different kindergarten students on their capacity to interact with others in games, on a scale of 0 to 100. A random group of students receives additional orthopedagogical support, while the rest are in the business-as-usual setting (control group). Since there are intrinsic differences at the student level, one could consider a paired experiment and take as outcome the difference in sociability scores between the beginning and the end of the school year.
One can expect the treatment to have more impact on students with low sociability skills who were struggling to make contact: a student who scored 50 initially might see an improvement of 20 points with support, relative to 10 in the business-as-usual scenario, whereas another who is well integrated and scored high initially might have seen an improvement of only 5 more points had they been assigned to the support group. This implies that the treatment effect is not constant over the scale, a violation of the additivity assumption. One way to deal with this is via transformations: Cox (1958) discusses transformations of such bounded scores under which the treatment effect is roughly constant.
Another example arises in experiments where the effect of the treatment is multiplicative, so that the outcome is of the form
$$\text{response} = \text{quantity depending only on the unit} \times \text{treatment effect};$$
taking logarithms then recovers an additive decomposition on the log scale.
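A minimal simulated sketch of this idea (all numbers below are arbitrary choices): a treatment that multiplies the response by a constant becomes an additive shift after a log transformation.

# multiplicative treatment effect, additive on the log scale
set.seed(1234)
n <- 500L # arbitrary sample size
unit <- rexp(n, rate = 1 / 10) # quantity depending only on the unit
group <- factor(rep(c("control", "treatment"), each = n / 2))
effect <- ifelse(group == "treatment", 1.5, 1) # treatment multiplies response by 1.5
response <- unit * effect
coef(lm(log(response) ~ group)) # slope is roughly log(1.5)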
Example 3.4 (Inadequacy of additivity based on context) This example is adapted from Cox (1958), Example 2.2. Children suffering from attention deficit hyperactivity disorder (ADHD) may receive medication to increase their attention span, with symptoms measured on a scale of 0 to 100 and 0 indicating a normal attention span. An experiment can be designed to assess the impact of a standardized dose in a laboratory by comparing the performance of students on a series of tasks relative to a placebo. To make the case concrete, suppose that students with ADHD fall into two categories: low symptoms and strong symptoms. In the low-symptom group, the average score is 8 per cent with the drug and 12 per cent with the placebo, whereas for students with strong symptoms, the average is 40 per cent among the treated and 60 per cent with the placebo. If these two categories are equally represented in the experiment and in the population, we would estimate an average reduction of (12 − 8 + 60 − 40)/2 = 12 per cent in the score (thus a higher attention span among the treated). Yet this quantity is artificial, and a better measure would be that symptoms under treatment are 2/3 of those under control (a ratio), which holds in both categories.
Equation 3.3 also implies that the effect of the treatment is constant for all individuals. This often isn't the case: in an experimental study of the impact of teaching delivery type (online, hybrid, in person), the response to the choice of delivery mode may depend on preferences for different learning types (auditory, visual, kinesthetic, etc.) Thus, recording additional measurements that are susceptible to interact with the treatment may be useful; likewise, treatment allocation must factor in this variability should we wish to make it detectable. The solution would be to set up a more complex model (two-way analysis of variance, general linear model) or to stratify by the explanatory variable (for example, compute the difference within each level).

3.4.2 Heterogeneity
The one-way ANOVA builds on the assumption that the variance in each group is equal, so that, upon centering each observation by its group mean, we can estimate the common variance from the pooled residuals.
For the time being, we consider hypothesis tests of the homogeneity (equal variance) assumption. The most commonly used tests are Bartlett's test[5] and Levene's test (a more robust alternative, less sensitive to outliers). The benchmark null distributions are approximate: a $\chi^2$ distribution with $K-1$ degrees of freedom for Bartlett's test and an $F$ distribution with $K-1$ and $n-K$ degrees of freedom for Levene's test, as the output below shows.
bartlett.test(score ~ group, data = arithmetic)

	Bartlett test of homogeneity of variances

data:  score by group
Bartlett's K-squared = 2.3515, df = 4, p-value = 0.6714

car::leveneTest(score ~ group, data = arithmetic, center = mean)

Levene's Test for Homogeneity of Variance (center = mean)
      Df F value Pr(>F)
group  4   1.569 0.2013
      40

# compare with a one-way ANOVA of the absolute centered residuals
mod <- lm(score ~ group, data = arithmetic)
arithmetic$absresid <- abs(resid(mod)) # |y_{ik} - mean_k|
anova(aov(absresid ~ group, data = arithmetic))

Analysis of Variance Table

Response: absresid
          Df  Sum Sq Mean Sq F value Pr(>F)
group      4  17.354  4.3385   1.569 0.2013
Residuals 40 110.606  2.7652
We can see that, in both cases, the p-values are large, so there is no evidence against the hypothesis of equal variances. When the variances do differ, one can instead use Welch's statistic, which does not pool the variance estimates; it is the default in the t.test and oneway.test functions (argument var.equal = FALSE).
What happens with the example of the arithmetic data when we use this statistic instead of the usual $F$ statistic?
# Welch ANOVA
oneway.test(score ~ group, data = arithmetic,
var.equal = FALSE)
One-way analysis of means (not assuming equal variances)
data: score and group
F = 18.537, num df = 4.000, denom df = 19.807, p-value = 1.776e-06
# Usual F-test statistic
oneway.test(score ~ group, data = arithmetic,
var.equal = TRUE)
One-way analysis of means
data: score and group
F = 15.268, num df = 4, denom df = 40, p-value = 1.163e-07
Notice how the degrees of freedom of the denominator have decreased (from 40 to 19.807). If we use pairwise.t.test with the argument pool.sd = FALSE, this amounts to running Welch's two-sample $t$-test for each pair of groups.
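A minimal sketch of this call, again assuming the arithmetic data:

# pairwise Welch tests, with a separate variance estimate per group
pairwise.t.test(x = arithmetic$score, g = arithmetic$group,
                pool.sd = FALSE, p.adjust.method = "none")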
What are the impacts of unequal variances if we nevertheless use the statistic that assumes they are equal?
Example 3.5 (Violation of the null hypothesis of equal variance)

We consider for simplicity a problem with two groups of unequal sizes whose variances differ.
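A minimal simulated sketch of what can go wrong (the group sizes and standard deviations below are arbitrary choices): when the smaller group has the larger variance, the pooled test rejects a true null far more often than advertised.

# size of the usual (pooled) test under unequal variances
set.seed(42)
nrep <- 1000L
sizes <- c(50L, 10L) # the smaller group has the larger variance
sds <- c(1, 3)
pvals <- replicate(nrep, {
  y <- c(rnorm(sizes[1], sd = sds[1]), rnorm(sizes[2], sd = sds[2]))
  g <- factor(rep(1:2, times = sizes))
  oneway.test(y ~ g, var.equal = TRUE)$p.value
})
mean(pvals < 0.05) # empirical size, well above the nominal 5% level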
There are alternative graphical ways of checking the assumption of equal variance, many of which rely on the standardized residuals.
Oftentimes, unequal variance occurs because the model is not additive. You could use variance-stabilizing transformations (e.g., logarithms for multiplicative effects) to ensure approximately equal variance in each group. Another option is to use a model that is suitable for the type of response you have (including count and binary data). Lastly, it may be necessary to explicitly model the variance in more complex designs (including repeated measures), where there is a learning effect over time and variability decreases as a result. Consult an expert if needed.
3.4.3 Normality
There is a persistent yet incorrect claim in the literature that the data (either response, explanatory or both) must be normal in order to use (so-called parametric) models like the one-way analysis of variance. With normal data and equal variances, the eponymous distributions of the $t$ and $F$ test statistics hold exactly in finite samples; otherwise, they are merely large-sample approximations.
While many authors may advocate rules of thumb (a minimum sample size in each group, say), there is no universal cutoff: the quality of the large-sample approximation depends on how far the data distribution is from normality.

It is important to keep in mind that all statistical statements are typically approximate and their reliability depends on the sample size: too small a sample may hamper the strength of your conclusions. The default graphic for checking whether a sample matches a postulated distribution is the quantile-quantile plot.
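A minimal sketch with base R graphics, assuming the arithmetic data; the standardized residuals of the fitted model are compared to normal quantiles:

# quantile-quantile plot of the standardized residuals
mod <- lm(score ~ group, data = arithmetic)
qqnorm(rstandard(mod))
qqline(rstandard(mod)) # reference line through the quartiles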
3.4.4 Independence
While I am not allowed to talk of independence as a Quebecer[7], this assumption simply means that knowing the value of one observation tells us nothing about the value of any other in the sample. Independence may fail to hold when there is group structure (family dyads, cluster sampling), whereby units share common characteristics, or more simply with repeated measurements on the same units. Random assignment to treatment is thus key to ensuring that the assumption holds, as is making sure at the measurement phase that there is no spillover between units.
Example 3.6 (Independence of measurements) There are many hidden ways in which the measurement process can impact the response. Physical devices that need to be calibrated before use (scales, microscopes) require tuning: if measurements are made by different experimenters or on different days, the calibration may add a systematic shift in means for the whole batch.
What is the impact of dependence between measurements? Heuristically, correlated measurements carry less information than independent ones. In the most extreme case, duplicated measurements carry no additional information at all, but counting them multiple times unduly inflates the statistic and leads to more frequent rejections.

The lack of independence can also have drastic consequences on inference and lead to false conclusions: Figure 3.11 shows an example with correlated samples within each group (or, equivalently, repeated measurements on individuals) with 25 observations per group. Even when all group means are equal, the within-group correlation drives the rejection rate of the test well above its nominal level.
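A minimal simulated sketch of this phenomenon (the number of groups, correlation and sample sizes are arbitrary choices): a shared group-level perturbation induces within-group correlation and inflates the rejection rate even though no treatment effect exists.

# within-group correlation inflates the size of the F test
set.seed(2024)
nrep <- 1000L
n <- 25L # observations per group
K <- 3L # number of groups
rho <- 0.5 # within-group correlation
pvals <- replicate(nrep, {
  shared <- rep(rnorm(K, sd = sqrt(rho)), each = n) # group-level perturbation
  y <- shared + rnorm(n * K, sd = sqrt(1 - rho)) # equicorrelated errors
  g <- factor(rep(seq_len(K), each = n))
  oneway.test(y ~ g, var.equal = TRUE)$p.value
})
mean(pvals < 0.05) # far above the nominal 5% level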
1. We say a sample is balanced if each (sub)group contains the same number of observations.
2. There are only $K-1$ free parameter estimates for the group deviations, since the remaining one is fully determined by the others through the sum-to-zero constraint $\alpha_1 + \cdots + \alpha_K = 0$.
3. Mostly because the central limit theorem kicks in.
4. Note that the Student-$t$ distribution is symmetric, so $\Pr(T \leq -t) = \Pr(T \geq t)$.
5. For the connoisseur, this is a likelihood ratio test under the assumption of normally distributed data, with a Bartlett correction to improve the approximation to the null distribution.
6. Coupled with a slight loss of power if the variances are truly equal; more on this later.
7. All credits for this pun are due to C. Genest.