01. Statistical Inference
2024
The population distribution (describing possible outcomes and their frequencies) encodes everything we could be interested in.
Statistics is concerned with decision making under uncertainty.
A famous quote attributed to George Box claims that
All models are wrong, but some are useful.
Peter McCullagh and John Nelder wrote in the preamble of their book (emphasis mine)
Modelling in science remains, partly at least, an art. Some principles do exist, however, to guide the modeller. The first is that all models are wrong; some, though, are better than others and we can search for the better ones. At the same time we must recognize that eternal truth is not within our grasp.
This quote by David R. Cox adds to the point:
…it does not seem helpful just to say that all models are wrong. The very word model implies simplification and idealization. The idea that complex physical, biological or sociological systems can be exactly described by a few formulae is patently absurd. The construction of idealized representations that capture important stable aspects of such systems is, however, a vital part of general scientific analysis and statistical models, especially substantive ones, do not seem essentially different from other kinds of model.
A stochastic model typically combines
Models are “golems” for obtaining answers to our questions.
We need to know
Without further adjustment, we cannot draw causal statements from observational data.
Are driving tests easier if you live in a rural area? Source: The Guardian, August 23rd, 2019
Model: binomial logistic model. Data gbdriving, R package hecstatmod.
Models: within-subject ANOVA (repeated measures) with pairwise paired t-tests or nonparametric tests (Friedman + Wilcoxon signed-rank test). Dataset BRLS21_T3, package hecedsm.
Brodeur et al. (2021)
A within-subject experiment was conducted in a driving simulator where 31 participants received and answered text messages under four conditions: they received notifications (1) on a mobile phone, (2) on a smartwatch, and (3) on a speaker, and then responded orally to these messages. They also (4) received messages in a “texting” condition where they had to reply through text to the notifications.
Sokolova, Krishna, and Döring (2023)
Eight studies (N = 4103) document the perceived environmental friendliness (PEF) bias whereby consumers judge plastic packaging with additional paper to be more environmentally friendly than identical plastic packaging without the paper.
Model: linear regression/ANOVA with custom contrasts. Dataset SKD23_S2A, package hecedsm.
Upworthy.com, a US media publisher, revolutionized online headline advertising by running systematic A/B tests comparing different headline wordings and placements of text and images to determine what catches attention the most.
The Upworthy Research Archive (Matias et al. 2021) contains results for 22,743 experiments, with an average click-through rate of 1.58% and a standard deviation of 1.23%.
Model: Poisson regression with offset. Data upworthy_sesame, package hecbayes.
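Such a model can be sketched with simulated data; the variable names and click rates below are illustrative assumptions, not those of upworthy_sesame.

```r
set.seed(2024)
n <- 200
impressions <- rpois(n, lambda = 5000)        # exposure: number of views per headline
headline <- factor(sample(c("A", "B"), size = n, replace = TRUE))
rate <- ifelse(headline == "B", 0.02, 0.015)  # true click-through rates (assumed)
clicks <- rpois(n, lambda = impressions * rate)
# The offset log(impressions) turns the Poisson mean into a rate per impression
mod <- glm(clicks ~ headline + offset(log(impressions)), family = poisson)
exp(coef(mod))  # baseline click rate and rate ratio for headline B
```

The exponentiated coefficients recover the baseline rate (near 0.015) and the rate ratio between headlines (near 0.02/0.015).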
Brucks and Levav (2022)
In a laboratory study and a field experiment across five countries (in Europe, the Middle East and South Asia), we show that videoconferencing inhibits the production of creative ideas […]
we demonstrate that videoconferencing hampers idea generation because it focuses communicators on a screen, which prompts a narrower cognitive focus. Our results suggest that virtual interaction comes with a cognitive cost for creative idea generation.
Data BL22_E and BL22_L, package hecedsm.
Moon and VanEpps (2023)
Across seven studies, we provide evidence that quantity requests, wherein people consider multiple choice options of how much to donate (e.g., $5, $10, or $15), increase contributions compared to open-ended requests.
Our findings offer new conceptual insights into how quantity requests increase contributions as well as practical implications for charitable organizations to optimize contributions by leveraging the use of quantity requests.
Model: Tobit type II regression and Poisson regression (independence test), data MV23_S1 from package hecedsm.
Duke and Amir (2023)
Customers must often decide on the quantity to purchase in addition to whether to purchase. The current research introduces and compares the quantity-sequential selling format, in which shoppers resolve the purchase and quantity decisions separately, with the quantity-integrated selling format, where shoppers simultaneously consider whether and how many to buy. Although retailers often use the sequential format, we demonstrate that the integrated format can increase purchase rates.
A field experiment conducted with a large technology firm found that quantity integration yielded considerably higher sales, amounting to an increase of more than $1 million in annual revenue.
Model: logistic regression, dataset DA23_E1.
Mayors requested an inquiry by the Régie de l’énergie, a regulating agency in charge of energy prices. The report found that prices were indeed higher, but pointed out that there were more retailers per capita and lower volumes, so their margins were higher.
Model: linear regression with autoregressive errors, pairwise comparisons. Dataset renergy, package hecstatmod.
We cannot compare summaries without accounting for the uncertainty inherent to our estimates, which arises from random sampling.
The stronger the signal-to-noise ratio, the larger our ability to detect differences when they truly exist.
As we gather more observations (sample size increases), we can better discriminate between scenarios.
A hypothesis test is a binary decision rule (reject/fail to reject).
Below are the different steps to undertake:
Tech3Lab, HEC Montreal’s User Experience (UX) lab, studied the impact of texting on distraction.
The presumption of innocence applies (look at everything as if the null hypothesis is true)
Let \(\mu_{\texttt{c}}\) denote the mean reaction time while talking on the cellphone (c) and \(\mu_{\texttt{t}}\) the mean reaction time while texting (t). Express the hypothesis in terms of the difference of means \[\begin{align*} \mathscr{H}_a: \mu_{\texttt{t}} - \mu_{\texttt{c}}>0. \end{align*}\]
We only ever assess the null hypothesis at a single value.
We compare the difference of the mean reaction time.
##
## Paired t-test
##
## data: t and c
## t = 3, df = 34, p-value = 0.003
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
## 0.131 Inf
## sample estimates:
## mean difference
## 0.313
The null distribution tells us which values of the test statistic we would obtain under the null hypothesis, and their relative frequency.
Apply the distribution function of the null distribution to map the test statistic to the \([0,1]\) interval.
The p-value is the probability that the test statistic is as extreme or more extreme than the value computed from the data, assuming \(\mathscr{H}_0\) is true.
Caution
The American Statistical Association (ASA) published a list of principles guiding (mis)interpretation of p-values, including:
If we repeat the experiment with random samples, we expect \(p\)-values to be uniform if \(\mathscr{H}_0\) is true and the null hypothesis benchmark is properly calibrated.
Under the alternative, \(p\)-values below \(\alpha\) occur with probability greater than \(\alpha\).
If the null hypothesis \(\mathscr{H}_0\) is true, the p-value follows a uniform distribution if our benchmark is properly calibrated.
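A quick Monte Carlo sketch of this calibration: simulate repeated samples from a model where \(\mathscr{H}_0\) holds (here, normal samples with mean zero) and inspect the distribution of the resulting \(p\)-values.

```r
set.seed(2024)
B <- 10000
# one-sample t-test of mu = 0 on B independent samples drawn under the null
pvals <- replicate(B, t.test(rnorm(20), mu = 0)$p.value)
mean(pvals <= 0.05)  # rejection rate, close to the level 0.05
hist(pvals)          # histogram is roughly flat on [0, 1]
```

The empirical rejection rate matches the level, and the histogram of \(p\)-values is approximately uniform, as the theory predicts.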
To make a decision, we compare our p-value \(P\) with the level of the test \(\alpha\):
The value of \(\alpha \in (0, 1)\) is the probability of rejecting \(\mathscr{H}_0\) when \(\mathscr{H}_0\) is in fact true.
We seek to avoid error of type I: we reject \(\mathscr{H}_0\) when \(\mathscr{H}_0\) is true.
| Decision \ true model | \(\mathscr{H}_0\) | \(\mathscr{H}_a\) |
|---|---|---|
| fail to reject \(\mathscr{H}_0\) | \(\checkmark\) | type II error |
| reject \(\mathscr{H}_0\) | type I error | \(\checkmark\) |
Since we fix the level \(\alpha\), we have no control over the type II error.
We want to be able to detect and reject \(\mathscr{H}_0\) when it is false.
The power of a test is the probability of rejecting \(\mathscr{H}_0\) when it is false, i.e., \[\begin{align*} \Pr{\!}_a(\text{reject } \mathscr{H}_0), \end{align*}\] where \(\Pr_a\) is the probability under a given alternative of falling in the rejection region.
Minimally, the power of the test should be \(\alpha\) because we reject the null hypothesis \(\alpha\) fraction of the time even when \(\mathscr{H}_0\) is true.
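For instance, the power of the one-sided, one-sample t-test from the distraction example can be approximated with power.t.test, plugging in the estimates as if they were the true values (an illustrative assumption, not a genuine power calculation done before collecting data).

```r
# n = 35 pairs; mean difference 0.313 s; the standard deviation of the
# differences is 0.108 * sqrt(35) s, backed out of the reported standard error
power.t.test(n = 35, delta = 0.313, sd = 0.108 * sqrt(35),
             sig.level = 0.05, type = "one.sample",
             alternative = "one.sided")$power
```

With these inputs the power is well above the minimum \(\alpha = 0.05\): a true difference of this size would be detected most of the time.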
A confidence interval is an alternative way to present the conclusions of a hypothesis test performed at significance level \(\alpha\), expressed in the same units as the data.
Wald-based \((1-\alpha)\) confidence intervals for a scalar parameter \(\theta\) are of the form \[\begin{align*} [\widehat{\theta} + \mathfrak{q}_{\alpha/2}\times\mathrm{se}(\widehat{\theta}), \widehat{\theta} +\mathfrak{q}_{1-\alpha/2}\times \mathrm{se}(\widehat{\theta})] \end{align*}\] corresponding to a point estimate plus or minus the margin of error.
We distinguish between our target (estimand, e.g., population mean), the recipe or formula (estimator) and the output (estimate).
Since the inputs of the confidence interval (estimator) are random, the output is also random and changes from one sample to the next: even if you repeat a recipe, you won’t always get exactly the same result.
The \(1-\alpha\) confidence interval gives all values of the parameter for which we fail to reject \(\mathscr{H}_0\) at level \(\alpha\).
Unlike the test statistic, confidence intervals are expressed in the units of the data, so they are easier to interpret.
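As a numerical sketch, a two-sided 95% Wald interval uses the standard normal quantiles \(\mathfrak{q}_{\alpha/2}\) and \(\mathfrak{q}_{1-\alpha/2}\); the point estimate and standard error below are those of the distraction example.

```r
theta_hat <- 0.313  # estimated mean difference (seconds)
se <- 0.108         # standard error of the estimate
alpha <- 0.05
# qnorm gives the standard normal quantiles; the lower quantile is negative,
# so the interval is the estimate plus or minus the margin of error
theta_hat + qnorm(c(alpha / 2, 1 - alpha / 2)) * se
```

This yields roughly \([0.10, 0.52]\) seconds; for the one-sided test of the example, a one-sided interval (as computed later with a Student-\(t\) quantile) is used instead.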
The p-value is \(p = \Pr_0(T > t_D)\), where \(T \sim \mathsf{Student}(34)\). Using R, we find \(p=0.0032\), which is smaller than \(\alpha=5\%\).
The lower bound of the confidence interval is \(\overline{D} + \mathsf{se}(\overline{D}) \times \mathfrak{t}_{0.05}\).
The one-sided confidence interval is \([0.131, \infty)\). The postulated null value, \(0\), is outside the interval.
d <- with(distraction, t - c) # time difference text vs conversation
n <- length(d) # sample size
(mean_d <- mean(d)) # mean difference
## [1] 0.313
(se_d <- sd(d)/sqrt(n)) # standard error of sample mean
## [1] 0.108
(stat <- mean_d/se_d) # t-test statistic
## [1] 2.91
dof <- n - 1L # degrees of freedom
crit <- qt(p = 0.05, df = dof) # critical value, "q" for quantile
(pval <- pt(q = stat, df = dof, lower.tail = FALSE)) # Pr(T > stat)
## [1] 0.00319
(conf_low <- mean_d + se_d*crit) # lower bound of Wald confidence interval
## [1] 0.131
The estimated mean difference is \(0.313\) seconds (std. error of \(0.108\) seconds).
We reject \(\mathscr{H}_0\), meaning that the reaction time is significantly higher (at level \(5\)%) when texting than talking on the cellphone while walking (p-value of \(0.003\)).
Learning objectives