7.1 Example: Price of diamonds
This very large dataset contains the price in USD (rounded to the nearest dollar) of \(n=53940\) diamonds. The explanatory variables include three ordinal factors, the quality of the cut, the color and the clarity, each ranked from worst to best. Five other variables contain measurements of the dimensions of the diamond, rounded to the nearest 0.01 mm: the length x, the width y, the depth z, the total depth percentage depth, where depth \(=2\times z/(x + y)\), and table, a measure of the width of the top of the diamond relative to its widest point. The last variable is the weight of the diamond, carat, rounded to the nearest 0.01.
#install.packages("ggplot2")
library(ggplot2); library(car)
data(diamonds, package = "ggplot2")
help(diamonds)
#Subsample because the dataset is very large
set.seed(1234) #Fix RNG seed so as to make output reproducible
di <- diamonds[sample.int(size = 500, replace = FALSE, n = nrow(diamonds)), ]
attach(di)
7.1.1 Exploratory data analysis
We can look at some graphs of the data, including pair plots and some summary statistics. These are useful to spot outliers.
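The summaries shown below can be obtained, for instance, with commands along these lines (a sketch; the exact calls are not essential):
str(di) #structure of the subsample
apply(di, 2, range) #range of each variable (the tibble is coerced to a character matrix)
table(cut); table(clarity); table(color) #counts for the factor levels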
## Classes 'tbl_df', 'tbl' and 'data.frame': 500 obs. of 10 variables:
## $ carat : num 0.61 0.53 0.23 1.33 0.3 0.3 2.01 1.12 1.02 0.74 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 4 3 5 5 5 2 5 5 5 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 4 2 7 2 1 5 5 4 3 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 1 2 6 5 7 4 1 4 6 5 ...
## $ depth : num 63.4 60.8 62.3 61.3 61.6 60.8 63.9 61.8 62.1 62.3 ...
## $ table : num 57.1 58 55 57 56 57 59 55 57 56 ...
## $ price : int 1168 1173 505 6118 838 911 7024 6110 8401 3226 ...
## $ x : num 5.37 5.21 3.9 7.11 4.3 4.34 8.01 6.64 6.43 5.76 ...
## $ y : num 5.43 5.19 3.93 7.08 4.34 4.31 7.92 6.7 6.45 5.79 ...
## $ z : num 3.42 3.16 2.44 4.35 2.66 2.63 5.09 4.12 4 3.6 ...
## carat cut color clarity depth table price x y
## [1,] "0.23" "Fair" "D" "I1" "56.7" "53.0" " 396" "3.90" "3.92"
## [2,] "3.00" "Very Good" "J" "VVS2" "69.0" "70.0" "18741" "9.42" "9.26"
## z
## [1,] "2.38"
## [2,] "5.58"
## cut
## Fair Good Very Good Premium Ideal
## 18 32 116 135 199
## clarity
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 8 89 107 113 80 41 37 25
## color
## D E F G H I J
## 63 85 101 100 81 46 24
The most important variable is likely to be weight or width (which are strongly correlated). An exploratory data analysis reveals the relationship between carat and price to be non-linear. A logarithmic transformation of both price and carat alleviates this and reveals the discretization of the measurements (many diamonds have a reported weight of exactly 1 or 2 carats). The linear correlation between the variables x, y and z is close to unity (due to the regular cut of diamonds). This will potentially lead to collinearity, so these variables may not all be individually significant when included jointly. There is (depending on the subset) a clearly visible outlier in z that should be removed. There is no evidence of interactions between the categorical variables and the rest (not shown). Lastly, depth and table are apparently not linearly correlated with price.
par(mfrow = c(1, 3), bty = "l")
plot(x = carat, y = price, ylab = "price (in USD)", xlab = "carat")
plot(carat, log(price), ylab = "log price (in USD)", xlab = "carat")
plot(carat, price, log="xy", ylab = "log price (in USD)", xlab = "log carat")
A careful exploratory data analysis on the full data reveals that, despite the fact that depth is a transformed variable supposedly constructed from x, y and z, there are some outliers (compare with summary(lm(depth ~ -1 + I(z/(x+y)), data = diamonds))). The model is likely to predict 1 and 2 carat diamonds poorly, since there is a lot of heterogeneity among them.
Sometimes, it helps to regroup the regressors to better identify patterns. This is most useful in situations where there is a lot of noise in the response (not the case here).
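For illustration only, one could bin carat into approximate deciles and plot the within-bin averages of the log price; this is a sketch (the bin boundaries are arbitrary and duplicated quantiles are removed):
#Bin carat into (approximate) deciles and average the log price within bins
carat_bin <- cut(di$carat, breaks = unique(quantile(di$carat, probs = seq(0, 1, by = 0.1))),
                 include.lowest = TRUE)
plot(x = tapply(log(di$carat), carat_bin, mean),
     y = tapply(log(di$price), carat_bin, mean),
     xlab = "mean log carat within bin", ylab = "mean log price within bin")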
7.1.2 Model selection
We will start with a model with all regressors but y and z (which we eliminate on grounds of multicollinearity).
#Small model
redu_mod <- lm(log(price) ~ log(carat))
#Unordered factors (so they are interpretable)
#Ordered factors use orthogonal polynomial contrasts by default
cut <- factor(cut, ordered = FALSE)
color <- factor(color, ordered = FALSE)
clarity <- factor(clarity, ordered = FALSE)
#Full additive model
full_mod <- lm(log(price) ~ log(carat) + cut + color + clarity + depth + table + x)
summ_full <- summary(full_mod)
knitr::kable(coef(summ_full), digits = 3)
 | Estimate | Std. Error | t value | Pr(>|t|)
---|---|---|---|---
(Intercept) | 6.867 | 0.726 | 9.465 | 0.000 |
log(carat) | 1.559 | 0.102 | 15.235 | 0.000 |
cutGood | 0.080 | 0.043 | 1.863 | 0.063 |
cutVery Good | 0.088 | 0.042 | 2.117 | 0.035 |
cutPremium | 0.089 | 0.041 | 2.143 | 0.033 |
cutIdeal | 0.121 | 0.043 | 2.851 | 0.005 |
colorE | -0.081 | 0.023 | -3.499 | 0.001 |
colorF | -0.109 | 0.022 | -4.856 | 0.000 |
colorG | -0.183 | 0.023 | -7.955 | 0.000 |
colorH | -0.285 | 0.024 | -12.013 | 0.000 |
colorI | -0.406 | 0.028 | -14.695 | 0.000 |
colorJ | -0.532 | 0.034 | -15.576 | 0.000 |
claritySI2 | 0.620 | 0.053 | 11.649 | 0.000 |
claritySI1 | 0.774 | 0.055 | 14.203 | 0.000 |
clarityVS2 | 0.953 | 0.054 | 17.580 | 0.000 |
clarityVS1 | 1.003 | 0.055 | 18.123 | 0.000 |
clarityVVS2 | 1.176 | 0.058 | 20.339 | 0.000 |
clarityVVS1 | 1.204 | 0.058 | 20.659 | 0.000 |
clarityIF | 1.285 | 0.060 | 21.280 | 0.000 |
depth | 0.000 | 0.006 | 0.041 | 0.968 |
table | -0.004 | 0.004 | -0.987 | 0.324 |
x | 0.163 | 0.053 | 3.056 | 0.002 |
The model fit is excellent. Unsurprisingly, none of x, y and z is marginally significant when all three are included at once (output omitted), but x is significant if it is the only one of the three included.
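This can be checked by adding the two remaining dimension variables back into the full additive model and inspecting the Wald tests (a sketch; output omitted, as in the text):
#Refit with all three dimension variables to see the effect of collinearity
summary(update(full_mod, . ~ . + y + z))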
Recall that the t value column gives the Wald statistic \(t = \hat{\beta}_i/\mathrm{se}(\hat{\beta}_i)\) for the null hypothesis \(\beta_{i}=0\) against the two-sided alternative \(\beta_{i} \neq 0\). Under \(\mathscr{H}_0\), \(t \sim \mathcal{T}(n-p)\) and the \(P\)-value is \(2\times (1-\)pt\((|t|, n-p))\).
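As a quick check (a sketch using the log(carat) coefficient), the entries of the coefficient table can be reconstructed by hand:
#Reconstruct the Wald statistic and P-value for log(carat) by hand
beta_hat <- coef(full_mod)["log(carat)"]
se_hat <- sqrt(diag(vcov(full_mod)))["log(carat)"]
tstat <- beta_hat / se_hat #matches the "t value" column
2 * (1 - pt(abs(tstat), df = full_mod$df.residual)) #matches the "Pr(>|t|)" column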
It appears that we could get rid of depth and table, which contribute little overall. This is confirmed graphically using added-variable plots, which plot \(\mathbf{M}_{\mathbf{X}_{-j}}\boldsymbol{y}\) against \(\mathbf{M}_{\mathbf{X}_{-j}}\mathbf{x}_j\): the residual effect of \(\mathbf{x}_j\) after taking into account the effect of the other variables \(\mathbf{X}_{-j}\) on what remains of the response. If the variable were important, there would be a strong correlation between these residuals and the slope would be non-zero. The plots below illustrate both a lack of residual structure (depth, slope close to zero) and strong residual structure (log(carat)).
par(mfrow = c(1, 2))
car::avPlot(full_mod, variable = "depth", ellipse = TRUE)
#slope close to zero indicates lack of relationship
car::avPlot(full_mod, variable = "log(carat)", ellipse = TRUE)
Let us look at model simplifications.
We can obtain the \(F\) statistic for the null hypothesis \(\mathscr{H}_0: \beta_{\texttt{depth}} = \beta_{\texttt{table}}=0\) against the alternative that at least one of the two coefficients is non-zero by running the anova command:
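The output below can be produced along these lines (mod_reduced is a hypothetical name for the model without depth and table):
#Compare the model without depth and table to the full additive model
mod_reduced <- update(full_mod, . ~ . - depth - table)
anova(mod_reduced, full_mod)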
## Analysis of Variance Table
##
## Model 1: log(price) ~ log(carat) + cut + color + clarity + x
## Model 2: log(price) ~ log(carat) + cut + color + clarity + depth + table +
## x
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 480 9.1173
## 2 478 9.0950 2 0.022358 0.5875 0.5561
The test statistic is of the form \[F = \frac{(\mathrm{RSS}_0-\mathrm{RSS}_a)/2}{\mathrm{RSS}_a/478},\] where \(\mathrm{RSS}_0\) and \(\mathrm{RSS}_a\) are the residual sums of squares of the reduced and of the full additive model, respectively; under \(\mathscr{H}_0\), \(F \sim \mathcal{F}(2, 478)\). Here \(F=\) 0.588 and we fail to reject the null (\(P\)-value of 0.556): we have no evidence against the adequacy of the simpler model.
Since there is little difference between the reduced model RSS and that of the full additive model, we may employ either in subsequent ANOVA tests. Let us try to drop one of the remaining variables.
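The single-term deletion table below can be obtained by applying drop1 to the reduced model fitted above (a sketch):
#F-tests for deleting each remaining term in turn
drop1(mod_reduced, test = "F")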
## Single term deletions
##
## Model:
## log(price) ~ log(carat) + cut + color + clarity + x
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 9.1173 -1962.2
## log(carat) 1 5.4527 14.5700 -1729.8 287.0683 < 2.2e-16
## cut 4 0.3540 9.4713 -1951.2 4.6589 0.0010680
## color 6 8.5912 17.7085 -1642.3 75.3832 < 2.2e-16
## clarity 7 19.7002 28.8175 -1400.8 148.1651 < 2.2e-16
## x 1 0.2102 9.3276 -1952.8 11.0690 0.0009452
You could write the test statistics as before (for a categorical variable, the difference in the degrees of freedom equals the number of factor levels minus one). All the terms are statistically significant and we reject the null hypothesis in each of these tests at level \(\alpha = 5\%\). This step would mark the end of the backward elimination procedure. Note that you can (and may wish to) use the RSS from the full model full_mod in the denominator of your \(F\)-tests to avoid biasing your results when the simpler model is inadequate.
If your test statistic is small, you cannot conclude anything. This may be because the null hypothesis that the simpler model is adequate is true. It can also be due to a lack of power (you should reject, but there is not enough evidence against the null). If you proceed with the RSS from the null model, your test statistic will be biased downward; recall Section 5.1.3.
We could equally well have started with forward selection. All the variables lead to a decrease in RSS that is significant at level \(\alpha = 5\%\). We pick the most significant one and proceed.
#Residual sum of squares of the full additive model (assumed definition of RSS_full)
RSS_full <- sum(resid(full_mod)^2)
#First forward step: F-tests for adding each candidate regressor to the small model
add_step1 <- add1(redu_mod, scope = formula(full_mod), scale = RSS_full, test = "F")
form <- deparse(formula(redu_mod))
#Keep adding terms while at least one addition is significant at the 5% level
while(length(which(add_step1[, "Pr(>F)"] < 0.05)) > 0){
  #Add the term with the largest F statistic
  new_var <- rownames(add_step1)[which.max(add_step1[, "F value"])]
  form <- paste(form, new_var, sep = " + ")
  add_step1 <- add1(update(redu_mod, formula = form),
                    scope = formula(full_mod), scale = RSS_full, test = "F")
}
The more tests we perform with the ANOVA command, the larger the size distortion due to multiple testing (the type I error is inflated). A Bonferroni correction could alleviate this. Note that forward selection typically uses a biased estimate of the residual sum of squares.
Among the candidates, the variable most strongly associated with log(price) after accounting for log(carat) is clarity; adding it leads to a significant improvement in fit, so we would go for the bigger model since there is strong evidence that it fits better.
Both forward selection and backward elimination yielded the same model, with the three categorical variables and the length. At this stage, we should try and include interactions.
add1(lm(formula = form, data = di),
scope = as.formula(paste(form, "+ log(carat):color + log(carat):clarity +
log(carat):cut + color:cut + color:clarity + cut:clarity", collapse = "")),
test = "F")
## Single term additions
##
## Model:
## log(price) ~ log(carat) + clarity + color + x + table + cut
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 9.0950 -1961.4
## log(carat):color 6 0.34581 8.7492 -1968.8 3.1159 0.005280
## log(carat):clarity 7 0.42680 8.6682 -1971.5 3.3200 0.001843
## log(carat):cut 4 0.58482 8.5102 -1986.7 8.1606 2.271e-06
## color:cut 23 0.32263 8.7724 -1933.5 0.7292 0.816741
## clarity:color 39 1.22652 7.8685 -1955.9 1.7586 0.004057
## clarity:cut 20 0.65830 8.4367 -1959.0 1.7908 0.019290
The interaction between log(carat) and cut is the most significant; clarity:color and clarity:cut (as well as the interactions of log(carat) with color and clarity) are also significant at the 5% level. Keep in mind that adding interactions leads to a large increase in the number of parameters; clarity:color alone would add 39 parameters (cf. the Df column above)!
7.1.3 Information criterion
We have covered (old school) partial \(F\)-tests in the ANOVA section. Other widely (mis)used goodness-of-fit diagnostics are AIC and BIC. These information criteria are goodness-of-fit measures coupled with a penalty for model complexity. They are (under a number of assumptions) estimates of the Kullback–Leibler divergence.
Akaike’s An Information Criterion (AIC) is \(\mathrm{AIC}=-2\ell(\hat{\boldsymbol{\theta}}) + 2p\), while Schwarz’s information criterion is \(\mathrm{BIC}=-2\ell(\hat{\boldsymbol{\theta}}) + p \log(n)\). The latter is more stringent and penalizes complex models more heavily as more data become available.
Some general remarks if you use information criteria:
- AIC and BIC must be computed using maximum likelihood estimators. In a linear model, this means that the estimator of the variance is \(\hat{\sigma}^2=\mathrm{RSS}/n\) and not \(s^2\). Similarly, the ordinary least squares estimator is equivalent to the MLE for \(\hat{\boldsymbol{\beta}}\) if \(\mathrm{Var}({\boldsymbol{\varepsilon}}) = \sigma^2 \mathbf{I}_n\). In R, you can use the BIC and AIC commands on models obtained from lm to get those values (see the short check after this list).
- Information criteria can be used to compare nested and non-nested models.
- The models should include the same data to be comparable.
- If you are comparing different distributions, you need to include all the constants to make AIC values comparable.
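As a brief illustration (a sketch using the models fitted above), the built-in AIC and BIC functions return the criteria for lm fits, and the AIC of a Gaussian linear model can be recomputed by hand with \(\hat{\sigma}^2=\mathrm{RSS}/n\):
#Information criteria for two of the fits above (smaller is better)
AIC(redu_mod, full_mod)
BIC(redu_mod, full_mod)
#Recompute AIC by hand using the MLE of the variance, RSS/n
n_obs <- nobs(full_mod); RSS <- sum(resid(full_mod)^2); p <- length(coef(full_mod))
n_obs * (log(2 * pi) + log(RSS / n_obs) + 1) + 2 * (p + 1) #matches AIC(full_mod)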
The function step allows you to perform forward or backward model selection using one of the information criteria. If you use this procedure, make sure that the model returned makes sense (e.g., no interactions without main effects). You may wish to use BIC rather than AIC, since BIC penalizes complexity more heavily and thus leads to more parsimonious models. It may be a good starting point for your model search.
BIC_search <- formula(step(full_mod, trace = 0, k = log(nrow(di)))) #BIC
#Search within space of main effect-only models
BIC_mod <- lm(BIC_search, data = di)
final_mod <- lm(log(price) ~ log(carat) + cut + color + clarity +
x + table + log(carat):cut, data = di)
#summary(final_mod)
A further simplification would consist in merging factor levels, typically into low and high quality. This may not be a good idea, because it will disproportionately affect the prediction of large, expensive diamonds and will negatively impact the predictive accuracy. To merge factor levels, use e.g.
newcut <- cut
levels(newcut) <- list("Fair-Good" = c("Fair", "Good"), "High" = c("Very Good", "Premium", "Ideal"))
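One could then gauge the impact of merging the levels by testing the implied restriction, since the merged model is nested in the final model (a sketch; merged_mod is a hypothetical name and newcut is taken from the chunk above):
#Refit the final model with the merged cut factor and test the restriction
merged_mod <- lm(log(price) ~ log(carat) + newcut + color + clarity +
                   x + table + log(carat):newcut, data = di)
anova(merged_mod, final_mod) #merged_mod is nested in final_mod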
To export your table to LaTeX, I recommend you use dedicated packages such as texreg, stargazer or xtable. You can easily export a model, using e.g. the command texreg::texreg(final_mod, stars = 0, digits = 2, single.row = TRUE, booktabs = TRUE), to get what you want. These packages offer a great deal of customization (you can, for instance, rename the columns). Please make sure the font size is adequate and that you use a sensible number of digits.
7.1.4 Cross-validation
Let us now compare the predictive performance of the model using cross-validation. The idea underlying cross-validation is simple: split the data, use a fraction (called training set) for model fit and the remaining observations (termed validation set) to check predictions.
The predicted residual error sum of squares (PRESS), denoted \(\mathrm{CV}\) in the course, is the result of leave-one-out cross validation: the \(i\)th observation is predicted using the \(n-1\) other observations, for every \(i=1, \ldots, n\). That is, an observation is never used both for estimation and for prediction, and thus the predicted residual error is a more accurate measure of prediction error. For linear models there is no need to refit the model \(n\) times, since \(\mathrm{PRESS} = \sum_{i=1}^n \{e_i/(1-h_{ii})\}^2\), where \(e_i\) are the ordinary residuals and \(h_{ii}\) the leverages; in R, these leave-one-out (predicted) residuals are available directly.
#Leave-one-out cross validation
PRESS <- function(model){crossprod(rstandard(model, type = "pred"))[1,1]}
round(c("reduced model" = PRESS(redu_mod), #underfit?
"final model " = PRESS(final_mod),
"full model" = PRESS(full_mod)),
digits = 2) #overfitting
## reduced model final model full model
## 38.82 9.57 10.10
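As a sanity check, the same quantity can be recovered, e.g. for the final model, directly from the ordinary residuals and leverages:
#Equivalent computation of PRESS from ordinary residuals and leverages
sum((resid(final_mod) / (1 - hatvalues(final_mod)))^2)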
The cross-validated error estimate shows that we do significantly better with the final model than with the model that only includes log(carat), and that the addition of x does not increase predictive accuracy. The full model has a larger prediction error, an indication that we may be overfitting.
Rather than use only one observation for validation and \(n-1\) for training, we can split more evenly: \(K\)-fold cross validation uses \(n-n/K\) observations for fitting and \(n/K\) for validation, providing a more realistic depiction of prediction. Unfortunately, the number of possible subsets of size \(\lfloor n/K\rfloor\) is very large, so one typically splits the data at random into \(K\) groups of roughly equal size. The following function, which performs \(K\)-fold cross validation, can be used in your project. Since the result is random, it may be necessary to average over many replicates of the \(K\)-fold statistic, provided that the calculation is not too computationally demanding; for large \(n\), the randomness has less impact.
The smaller the prediction error, the better the model.
K <- 10
#Manually perform cross fold validation
KfoldCV <- function(fitted.mod, K, ...){
data <- model.matrix(fitted.mod) #design matrix
y <- fitted.mod$model[,1] #response
n <- nrow(data)
#Shuffle the indices
inds <- sample.int(n = n, size = n, replace = FALSE)
#Split into K groups of ~ equal size (from https://stackoverflow.com/a/16275428)
form_group <- function(x, n){ split(x, cut(seq_along(x), n, labels = FALSE)) }
groups <- form_group(inds, K)
#Obtain prediction from K-folds
preds <- rep(NA, n)
for(j in 1:K){
preds[groups[[j]]] <- data[groups[[j]],] %*%
lm(y[-groups[[j]]] ~ -1 + data[-groups[[j]],])$coef
}
#Compute prediction error
crossprod(preds - y)[1,1]
}
## Because splitting is random, get different answer
round(c("reduced model" = median(replicate(KfoldCV(fitted.mod = redu_mod, K = K), n = 100)),
"final model " = median(replicate(KfoldCV(fitted.mod = final_mod, K = K), n = 100)),
"additive forward" = median(replicate(KfoldCV(fitted.mod = BIC_mod, K = K), n = 100))),
digits = 2)
## reduced model final model additive forward
## 38.84 9.62 10.18
The conclusions for \(10\)-fold cross validation are the same as for leave-one-out cross validation, supporting our model choice. In general, we prefer the former (\(K\)-fold cross validation).
7.1.5 Presentation of results
Having selected a model (say final_mod), you should now present a table with the coefficients and standard errors, together with some goodness-of-fit measures (\(\mathrm{R}^2_c\), \(\mathrm{AIC}/\mathrm{BIC}\), \(\mathrm{CV}\), the \(K\)-fold cross-validation error). Explain your model (interpret the parameters), look at diagnostic plots and answer the questions.
par(mfrow = c(2, 2), mar = c(5, 5, 1.5, 0.5))
bl <- scales::alpha("black", 0.5) #semi-transparent black
n <- nrow(di)
#plot(final_mod)
#Student Q-Q plot
qqPlot(final_mod, simulate = TRUE, reps = 1999, envelope = 0.95,
       ylab = "Externally studentized residuals",
       xlab = "Theoretical Student quantiles",
       pch = 20, col = bl)
## [1] 118 214
#Residuals vs fitted values
residualPlot(final_mod, type = "rstudent", quadratic = FALSE,
pch = 20, ylab = "Externally studentized residuals")
#Cook distance
plot(cooks.distance(final_mod), col = bl, pch = 20, ylab = "Cook's distances")
abline(h = 8/(n-2*length(coef(final_mod))), col = 2)
influencePlot(final_mod)
## StudRes Hat CookD
## 116 -0.3357101 0.32031107 0.002128445
## 118 -2.8462715 0.15974909 0.060701211
## 214 8.5006083 0.05591926 0.148869658
## 221 -0.2968635 0.27957372 0.001370613
## 253 -2.3850490 0.21631304 0.062191214
Many points have high leverage and large Cook's distances. This means that the fitted slope could in principle be largely driven by those points.
We also get a very large studentized residual (observation 214 in the output above, a very small diamond of high quality sold for almost double the value of a comparable one). A more careful analysis would be necessary to see the impact of these points on the coefficients and whether they matter (or not). For example, we could refit the model without this observation and compare the coefficient estimates, as sketched below.
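A minimal sketch, identifying the extreme observation programmatically rather than hard-coding its index:
#Refit without the observation with the largest absolute studentized residual
id_out <- which.max(abs(rstudent(final_mod)))
refit <- update(final_mod, subset = -id_out)
round(cbind("all obs." = coef(final_mod), "without outlier" = coef(refit)), 3)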
Diagnostics of heteroscedasticity are mostly useful when you suspect that different subgroups could have different variances (if you include factor variables) or when the data form a time series, in which case you may wish to look at the correlogram (acf(resid(final_mod))) and the partial correlogram (pacf(resid(final_mod))). The latter are only useful for time-ordered observations, i.e. when the errors at time \(t\) may depend on previous time periods \(\{t-1, \ldots\}\). The impact of (positive) serial dependence of the residuals is that the standard errors from OLS are typically too small and need to be inflated.