7.1 Example: Price of diamonds
This very large dataset contains the price in USD (rounded to the nearest dollar) of \(n=53940\) diamonds. The explanatory variables include three ordinal factors, the quality of the cut, the color and the clarity, each ranked from worst to best. Five other variables contain measurements of the dimensions of the diamond, rounded to the nearest 0.01 mm: the length x, the width y, the depth z, the total depth percentage depth, where depth \(=2\times z/(x + y)\), and table, a measure of the width of the top of the diamond relative to its widest point. The last variable is the weight of the diamond, carat, rounded to the nearest 0.01.
#install.packages("ggplot2")
library(ggplot2); library(car)
data(diamonds, package = "ggplot2")
help(diamonds)
#Subsample because the dataset is very large
set.seed(1234) #Fix RNG seed so as to make output reproducible
di <- diamonds[sample.int(size = 500, replace = FALSE, n = nrow(diamonds)), ]
attach(di)
7.1.1 Exploratory data analysis
We can look at some graphs of the data, including pair plots and some summary statistics. These are useful to spot outliers.
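The summaries shown below can be obtained, for instance, with commands along these lines (a sketch; the exact calls are not essential):
str(di) #structure of the subsample
apply(di, 2, range) #range of each variable (the tibble is coerced to a character matrix)
table(cut); table(clarity); table(color) #counts for the factor levels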
## Classes 'tbl_df', 'tbl' and 'data.frame': 500 obs. of 10 variables:
## $ carat : num 0.61 0.53 0.23 1.33 0.3 0.3 2.01 1.12 1.02 0.74 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 2 4 3 5 5 5 2 5 5 5 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 4 2 7 2 1 5 5 4 3 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 1 2 6 5 7 4 1 4 6 5 ...
## $ depth : num 63.4 60.8 62.3 61.3 61.6 60.8 63.9 61.8 62.1 62.3 ...
## $ table : num 57.1 58 55 57 56 57 59 55 57 56 ...
## $ price : int 1168 1173 505 6118 838 911 7024 6110 8401 3226 ...
## $ x : num 5.37 5.21 3.9 7.11 4.3 4.34 8.01 6.64 6.43 5.76 ...
## $ y : num 5.43 5.19 3.93 7.08 4.34 4.31 7.92 6.7 6.45 5.79 ...
## $ z : num 3.42 3.16 2.44 4.35 2.66 2.63 5.09 4.12 4 3.6 ...
## carat cut color clarity depth table price x y
## [1,] "0.23" "Fair" "D" "I1" "56.7" "53.0" " 396" "3.90" "3.92"
## [2,] "3.00" "Very Good" "J" "VVS2" "69.0" "70.0" "18741" "9.42" "9.26"
## z
## [1,] "2.38"
## [2,] "5.58"
## cut
## Fair Good Very Good Premium Ideal
## 18 32 116 135 199
## clarity
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 8 89 107 113 80 41 37 25
## color
## D E F G H I J
## 63 85 101 100 81 46 24
The most important variable is likely to be weight or width (which are strongly correlated). An exploratory data analysis reveals the relationship between carat and price to be non-linear. A logarithmic transformation of both price and carat alleviates this and reveals the discretization of the measurements (many diamonds have a reported weight of exactly 1 or 2 carats). The linear correlation between the variables x, y and z is close to unity (due to the regular cut of diamonds). This will potentially lead to collinearity, so these variables may not all be individually significant when included jointly. There is (depending on the subset) a clearly visible outlier in z that should be removed. There is no evidence of interactions between the categorical variables and the rest (not shown). Lastly, depth and table are apparently not linearly correlated with price.
par(mfrow = c(1, 3), bty = "l")
plot(x = carat, y = price, ylab = "price (in USD)", xlab = "carat")
plot(carat, log(price), ylab = "log price (in USD)", xlab = "carat")
plot(carat, price, log="xy", ylab = "log price (in USD)", xlab = "log carat")
A careful exploratory data analysis on the full data reveals that, despite the fact that depth is a transformed variable supposedly constructed from x, y and z, there are some outliers (compare with summary(lm(depth ~ -1 + I(z/(x+y)), data = diamonds))). The model is likely to predict 1 and 2 carat diamonds poorly, since there is a lot of heterogeneity among them.
Sometimes, it helps to regroup the regressors to better identify patterns. This is most useful in situations where there is a lot of noise in the response (not the case here).
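For illustration only, one could bin carat into approximate deciles and plot the within-bin averages of the log price; this is a sketch (the bin boundaries are arbitrary and duplicated quantiles are removed):
#Bin carat into (approximate) deciles and average the log price within bins
carat_bin <- cut(di$carat, breaks = unique(quantile(di$carat, probs = seq(0, 1, by = 0.1))),
                 include.lowest = TRUE)
plot(x = tapply(log(di$carat), carat_bin, mean),
     y = tapply(log(di$price), carat_bin, mean),
     xlab = "mean log carat within bin", ylab = "mean log price within bin")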
7.1.2 Model selection
We will start with a model with all regressors but y and z (which we eliminate on grounds of multicollinearity).
#Small model
redu_mod <- lm(log(price) ~ log(carat))
#Unordered factors (so they are interpretable)
#Ordered factors use orthogonal polynomial contrasts by default
cut <- factor(cut, ordered = FALSE)
color <- factor(color, ordered = FALSE)
clarity <- factor(clarity, ordered = FALSE)
#Full additive model
full_mod <- lm(log(price) ~ log(carat) + cut + color + clarity + depth + table + x)
summ_full <- summary(full_mod)
knitr::kable(coef(summ_full), digits = 3)
 | Estimate | Std. Error | t value | Pr(>|t|)
---|---|---|---|---
(Intercept) | 6.867 | 0.726 | 9.465 | 0.000 |
log(carat) | 1.559 | 0.102 | 15.235 | 0.000 |
cutGood | 0.080 | 0.043 | 1.863 | 0.063 |
cutVery Good | 0.088 | 0.042 | 2.117 | 0.035 |
cutPremium | 0.089 | 0.041 | 2.143 | 0.033 |
cutIdeal | 0.121 | 0.043 | 2.851 | 0.005 |
colorE | -0.081 | 0.023 | -3.499 | 0.001 |
colorF | -0.109 | 0.022 | -4.856 | 0.000 |
colorG | -0.183 | 0.023 | -7.955 | 0.000 |
colorH | -0.285 | 0.024 | -12.013 | 0.000 |
colorI | -0.406 | 0.028 | -14.695 | 0.000 |
colorJ | -0.532 | 0.034 | -15.576 | 0.000 |
claritySI2 | 0.620 | 0.053 | 11.649 | 0.000 |
claritySI1 | 0.774 | 0.055 | 14.203 | 0.000 |
clarityVS2 | 0.953 | 0.054 | 17.580 | 0.000 |
clarityVS1 | 1.003 | 0.055 | 18.123 | 0.000 |
clarityVVS2 | 1.176 | 0.058 | 20.339 | 0.000 |
clarityVVS1 | 1.204 | 0.058 | 20.659 | 0.000 |
clarityIF | 1.285 | 0.060 | 21.280 | 0.000 |
depth | 0.000 | 0.006 | 0.041 | 0.968 |
table | -0.004 | 0.004 | -0.987 | 0.324 |
x | 0.163 | 0.053 | 3.056 | 0.002 |
The model fit is excellent. Unsurprisingly, none of x, y and z is marginally significant when all three are included at once (output omitted), but x is significant if it is the only one of the three included.
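This can be checked by adding the two remaining dimension variables back into the full additive model and inspecting the Wald tests (a sketch; output omitted, as in the text):
#Refit with all three dimension variables to see the effect of collinearity
summary(update(full_mod, . ~ . + y + z))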
Recall that the t value column gives the Wald statistic \(t = \hat{\beta}_i/\mathrm{se}(\hat{\beta}_i)\) for the null hypothesis \(\beta_{i}=0\) against the two-sided alternative \(\beta_{i} \neq 0\). Under \(\mathscr{H}_0\), \(t \sim \mathcal{T}(n-p)\) and the \(P\)-value is \(2\times (1-\)pt\((|t|, n-p))\).
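As a quick check (a sketch using the log(carat) coefficient), the entries of the coefficient table can be reconstructed by hand:
#Reconstruct the Wald statistic and P-value for log(carat) by hand
beta_hat <- coef(full_mod)["log(carat)"]
se_hat <- sqrt(diag(vcov(full_mod)))["log(carat)"]
tstat <- beta_hat / se_hat #matches the "t value" column
2 * (1 - pt(abs(tstat), df = full_mod$df.residual)) #matches the "Pr(>|t|)" column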
It appears that we could get rid of depth and table, which contribute little overall. This is confirmed graphically using added-variable plots, which plot \(\mathbf{M}_{\mathbf{X}_{-j}}\boldsymbol{y}\) against \(\mathbf{M}_{\mathbf{X}_{-j}}\mathbf{x}_j\): the residual effect of \(\mathbf{x}_j\) after taking into account the effect of the other variables \(\mathbf{X}_{-j}\) on what remains of the response. If the variable were important, there would be a strong correlation between these residuals and the slope would be non-zero. The plots below illustrate both a lack of residual structure (depth, slope close to zero) and strong residual structure (log(carat)).
par(mfrow = c(1, 2))
car::avPlot(full_mod, variable = "depth", ellipse = TRUE)
#slope close to zero indicates lack of relationship
car::avPlot(full_mod, variable = "log(carat)", ellipse = TRUE)
Let us look at model simplifications.
We can obtain the \(F\) statistic for the null hypothesis \(\mathscr{H}_0: \beta_{\texttt{depth}} = \beta_{\texttt{table}}=0\) against the alternative that at least one of the two coefficients is non-zero by running the anova command:
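The output below can be produced along these lines (mod_reduced is a hypothetical name for the model without depth and table):
#Compare the model without depth and table to the full additive model
mod_reduced <- update(full_mod, . ~ . - depth - table)
anova(mod_reduced, full_mod)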
## Analysis of Variance Table
##
## Model 1: log(price) ~ log(carat) + cut + color + clarity + x
## Model 2: log(price) ~ log(carat) + cut + color + clarity + depth + table +
## x
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 480 9.1173
## 2 478 9.0950 2 0.022358 0.5875 0.5561
The test statistic is of the form \[F = \frac{(\mathrm{RSS}_0-\mathrm{RSS}_a)/2}{\mathrm{RSS}_a/478},\] where \(\mathrm{RSS}_0\) and \(\mathrm{RSS}_a\) are the residual sums of squares of the reduced and of the full additive model, respectively; under \(\mathscr{H}_0\), \(F \sim \mathcal{F}(2, 478)\). Here \(F=\) 0.588 and we fail to reject the null (\(P\)-value of 0.556): we have no evidence against the adequacy of the simpler model.
Since there is little difference between the reduced model RSS and that of the full additive model, we may employ either in subsequent ANOVA tests. Let us try to drop one of the remaining variables.
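The single-term deletion table below can be obtained by applying drop1 to the reduced model fitted above (a sketch):
#F-tests for deleting each remaining term in turn
drop1(mod_reduced, test = "F")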
## Single term deletions
##
## Model:
## log(price) ~ log(carat) + cut + color + clarity + x
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 9.1173 -1962.2
## log(carat) 1 5.4527 14.5700 -1729.8 287.0683 < 2.2e-16
## cut 4 0.3540 9.4713 -1951.2 4.6589 0.0010680
## color 6 8.5912 17.7085 -1642.3 75.3832 < 2.2e-16
## clarity 7 19.7002 28.8175 -1400.8 148.1651 < 2.2e-16
## x 1 0.2102 9.3276 -1952.8 11.0690 0.0009452
You could write the test statistics as before (for a categorical variable, the difference in the degrees of freedom equals the number of factor levels minus one). All the terms are statistically significant and we reject the null hypothesis in each of these tests at level \(\alpha = 5\%\). This step would mark the end of the backward elimination procedure. Note that you can (and may wish to) use the RSS from the full model full_mod in the denominator of your \(F\)-tests to avoid biasing your results when the simpler model is inadequate.
If your test statistic is small, you cannot conclude anything. This may be because the null hypothesis that the simpler model is adequate is true. It can also be due to a lack of power (you should reject, but there is not enough evidence against the null). If you proceed with the RSS from the null model, your test statistic will be biased downward; recall Section 5.1.3.
We could equally well have started with forward selection. All the variables lead to a decrease in RSS that is significant at level \(\alpha = 5\%\). We pick the most significant one and proceed.
#Residual sum of squares of the full additive model (assumed definition of RSS_full)
RSS_full <- sum(resid(full_mod)^2)
#First forward step: F-tests for adding each candidate regressor to the small model
add_step1 <- add1(redu_mod, scope = formula(full_mod), scale = RSS_full, test = "F")
form <- deparse(formula(redu_mod))
#Keep adding terms while at least one addition is significant at the 5% level
while(length(which(add_step1[, "Pr(>F)"] < 0.05)) > 0){
  #Add the term with the largest F statistic
  new_var <- rownames(add_step1)[which.max(add_step1[, "F value"])]
  form <- paste(form, new_var, sep = " + ")
  add_step1 <- add1(update(redu_mod, formula = form),
                    scope = formula(full_mod), scale = RSS_full, test = "F")
}
The more tests we perform with the ANOVA command, the larger the size distortion due to multiple testing (the type I error is inflated). A Bonferroni correction could alleviate this. Note that forward selection typically uses a biased estimate of the residual sum of squares.
Among the candidates, the variable most strongly associated with log(price) after accounting for log(carat) is clarity; adding it leads to a significant improvement in fit, so we would go for the bigger model since there is strong evidence that it fits better.
Both forward selection and backward elimination yielded the same model, with the three categorical variables and the length. At this stage, we should try and include interactions.
add1(lm(formula = form, data = di),
scope = as.formula(paste(form, "+ log(carat):color + log(carat):clarity +
log(carat):cut + color:cut + color:clarity + cut:clarity", collapse = "")),
test = "F")
## Single term additions
##
## Model:
## log(price) ~ log(carat) + clarity + color + x + table + cut
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 9.0950 -1961.4
## log(carat):color 6 0.34581 8.7492 -1968.8 3.1159 0.005280
## log(carat):clarity 7 0.42680 8.6682 -1971.5 3.3200 0.001843
## log(carat):cut 4 0.58482 8.5102 -1986.7 8.1606 2.271e-06
## color:cut 23 0.32263 8.7724 -1933.5 0.7292 0.816741
## clarity:color 39 1.22652 7.8685 -1955.9 1.7586 0.004057
## clarity:cut 20 0.65830 8.4367 -1959.0 1.7908 0.019290
The interaction between log(carat) and cut is the most significant; clarity:color and clarity:cut (as well as the interactions of log(carat) with color and clarity) are also significant at the 5% level. Keep in mind that adding interactions leads to a large increase in the number of parameters; clarity:color alone would add 39 parameters (cf. the Df column above)!
7.1.3 Information criterion
We have covered (old school) partial \(F\)-tests in the ANOVA section. Other widely (mis)used goodness-of-fit diagnostics are AIC and BIC. These information criteria are goodness-of-fit measures coupled with a penalty for model complexity. They are (under a number of assumptions) estimates of the Kullback–Leibler divergence.
Akaike’s An Information Criterion (AIC) is \(\mathrm{AIC}=-2\ell(\hat{\boldsymbol{\theta}}) + 2p\), while Schwarz’s information criterion is \(\mathrm{BIC}=-2\ell(\hat{\boldsymbol{\theta}}) + p \log(n)\). The latter is more stringent and penalizes complex models more heavily as more data become available.
Some general remarks if you use information criteria:
- AIC and BIC must be computed using maximum likelihood estimators. In a linear model, this means that the estimator of the variance is \(\hat{\sigma}^2=\mathrm{RSS}/n\) and not \(s^2\). Similarly, the ordinary least squares estimator is equivalent to the MLE for \(\hat{\boldsymbol{\beta}}\) if \(\mathrm{Var}({\boldsymbol{\varepsilon}}) = \sigma^2 \mathbf{I}_n\). In R, you can use the BIC and AIC commands on models obtained from lm to get those values (see the short check after this list).
- Information criteria can be used to compare nested and non-nested models.
- The models should include the same data to be comparable.
- If you are comparing different distributions, you need to include all the constants to make AIC values comparable.
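As a brief illustration (a sketch using the models fitted above), the built-in AIC and BIC functions return the criteria for lm fits, and the AIC of a Gaussian linear model can be recomputed by hand with \(\hat{\sigma}^2=\mathrm{RSS}/n\):
#Information criteria for two of the fits above (smaller is better)
AIC(redu_mod, full_mod)
BIC(redu_mod, full_mod)
#Recompute AIC by hand using the MLE of the variance, RSS/n
n_obs <- nobs(full_mod); RSS <- sum(resid(full_mod)^2); p <- length(coef(full_mod))
n_obs * (log(2 * pi) + log(RSS / n_obs) + 1) + 2 * (p + 1) #matches AIC(full_mod)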
The function step allows you to perform forward or backward model selection using one of the information criteria. If you use this procedure, make sure that the model returned makes sense (e.g., no interactions without main effects). You may wish to use BIC rather than AIC, since BIC penalizes complexity more heavily and thus leads to more parsimonious models. It may be a good starting point for your model search.
BIC_search <- formula(step(full_mod, trace = 0, k = log(nrow(di)))) #BIC
#Search within space of main effect-only models
BIC_mod <- lm(BIC_search, data = di)
final_mod <- lm(log(price) ~ log(carat) + cut + color + clarity +
x + table + log(carat):cut, data = di)
#summary(final_mod)
A further simplification would consist in merging factor levels, typically into low and high quality. This may not be a good idea, because it will disproportionately affect the prediction of large, expensive diamonds and will negatively impact the predictive accuracy. To merge factor levels, use e.g.
newcut <- cut
levels(newcut) <- list("Fair-Good" = c("Fair", "Good"), "High" = c("Very Good", "Premium", "Ideal"))
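One could then gauge the impact of merging the levels by testing the implied restriction, since the merged model is nested in the final model (a sketch; merged_mod is a hypothetical name and newcut is taken from the chunk above):
#Refit the final model with the merged cut factor and test the restriction
merged_mod <- lm(log(price) ~ log(carat) + newcut + color + clarity +
                   x + table + log(carat):newcut, data = di)
anova(merged_mod, final_mod) #merged_mod is nested in final_mod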
To export your table to LaTeX, I recommend you use dedicated packages such as texreg, stargazer or xtable. You can easily export a model, using e.g. the command texreg::texreg(final_mod, stars = 0, digits = 2, single.row = TRUE, booktabs = TRUE), to get what you want. These packages offer a great deal of customization (you can, for instance, rename the columns). Please make sure the font size is adequate and that you use a sensible number of digits.
7.1.4 Cross-validation
Let us now compare the predictive performance of the model using cross-validation. The idea underlying cross-validation is simple: split the data, use a fraction (called training set) for model fit and the remaining observations (termed validation set) to check predictions.
The predicted residual error sum of squares (PRESS), denoted \(\mathrm{CV}\) in the course, is the result of leave-one-out cross validation: the \(i\)th observation is predicted using the \(n-1\) other observations, for every \(i=1, \ldots, n\). That is, an observation is never used both for estimation and for prediction, and thus the predicted residual error is a more accurate measure of prediction error. For linear models there is no need to refit the model \(n\) times, since \(\mathrm{PRESS} = \sum_{i=1}^n \{e_i/(1-h_{ii})\}^2\), where \(e_i\) are the ordinary residuals and \(h_{ii}\) the leverages; in R, these leave-one-out (predicted) residuals are available directly.
#Leave-one-out cross validation
PRESS <- function(model){crossprod(rstandard(model, type = "pred"))[1,1]}
round(c("reduced model" = PRESS(redu_mod), #underfit?
"final model " = PRESS(final_mod),
"full model" = PRESS(full_mod)),
digits = 2) #overfitting
## reduced model final model full model
## 38.82 9.57 10.10
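As a sanity check, the same quantity can be recovered, e.g. for the final model, directly from the ordinary residuals and leverages:
#Equivalent computation of PRESS from ordinary residuals and leverages
sum((resid(final_mod) / (1 - hatvalues(final_mod)))^2)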
The cross-validated error estimate shows that we do significantly better with the final model than with the model that only includes log(carat), and that the addition of x does not increase predictive accuracy. The full model has a larger prediction error, an indication that we may be overfitting.
Rather than use only one observation for validation and \(n-1\) for training, we can split more evenly: \(K\)-fold cross validation uses \(n-n/K\) observations for fitting and \(n/K\) for validation, providing a more realistic depiction of prediction. Unfortunately, the number of possible subsets of size \(\lfloor n/K\rfloor\) is very large, so one typically splits the data at random into \(K\) groups of roughly equal size. The following function, which performs \(K\)-fold cross validation, can be used in your project. Since the result is random, it may be necessary to average over many replicates of the \(K\)-fold statistic, provided that the calculation is not too computationally demanding; for large \(n\), the randomness has less impact.
The smaller the prediction error, the better the model.
K <- 10
#Manually perform cross fold validation
KfoldCV <- function(fitted.mod, K, ...){
data <- model.matrix(fitted.mod) #design matrix
y <- fitted.mod$model[,1] #response
n <- nrow(data)
#Shuffle the indices
inds <- sample.int(n = n, size = n, replace = FALSE)
#Split into K groups of ~ equal size (from https://stackoverflow.com/a/16275428)
form_group <- function(x, n){ split(x, cut(seq_along(x), n, labels = FALSE)) }
groups <- form_group(inds, K)
#Obtain prediction from K-folds
preds <- rep(NA, n)
for(j in 1:K){
preds[groups[[j]]] <- data[groups[[j]],] %*%
lm(y[-groups[[j]]] ~ -1 + data[-groups[[j]],])$coef
}
#Compute prediction error
crossprod(preds - y)[1,1]
}
## Because splitting is random, get different answer
round(c("reduced model" = median(replicate(KfoldCV(fitted.mod = redu_mod, K = K), n = 100)),
"final model " = median(replicate(KfoldCV(fitted.mod = final_mod, K = K), n = 100)),
"additive forward" = median(replicate(KfoldCV(fitted.mod = BIC_mod, K = K), n = 100))),
digits = 2)
## reduced model final model additive forward
## 38.84 9.62 10.18
The conclusions for \(10\)-fold cross validation are the same as for leave-one-out cross validation, supporting our model choice. In general, we prefer the former (\(K\)-fold cross validation).
7.1.5 Presentation of results
Having selected a model (say final_mod), you should now present a table with the coefficients and standard errors, together with some goodness-of-fit measures (\(\mathrm{R}^2_c\), \(\mathrm{AIC}/\mathrm{BIC}\), \(\mathrm{CV}\), the \(K\)-fold cross-validation error). Explain your model (interpret the parameters), look at diagnostic plots and answer the questions.
par(mfrow = c(2, 2), mar = c(5, 5, 1.5, 0.5))
bl <- scales::alpha("black", 0.5) #semi-transparent black
n <- nrow(di)
#plot(final_mod)
#Student Q-Q plot
qqPlot(final_mod, simulate = TRUE, reps = 1999, envelope = 0.95,
       ylab = "Externally studentized residuals",
       xlab = "Theoretical Student quantiles",
       pch = 20, col = bl)
## [1] 118 214
#Residuals vs fitted values
residualPlot(final_mod, type = "rstudent", quadratic = FALSE,
pch = 20, ylab = "Externally studentized residuals")
#Cook distance
plot(cooks.distance(final_mod), col = bl, pch = 20, ylab = "Cook's distances")
abline(h = 8/(n-2*length(coef(final_mod))), col = 2)
influencePlot(final_mod)
## StudRes Hat CookD
## 116 -0.3357101 0.32031107 0.002128445
## 118 -2.8462715 0.15974909 0.060701211
## 214 8.5006083 0.05591926 0.148869658
## 221 -0.2968635 0.27957372 0.001370613
## 253 -2.3850490 0.21631304 0.062191214
Many points have high leverage and large Cook's distances. This means that the fitted slope could in principle be largely driven by those points.
We also get a very large studentized residual (observation 214 in the output above, a very small diamond of high quality sold for almost double the value of a comparable one). A more careful analysis would be necessary to see the impact of these points on the coefficients and whether they matter (or not). For example, we could refit the model without this observation and compare the coefficient estimates, as sketched below.
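A minimal sketch, identifying the extreme observation programmatically rather than hard-coding its index:
#Refit without the observation with the largest absolute studentized residual
id_out <- which.max(abs(rstudent(final_mod)))
refit <- update(final_mod, subset = -id_out)
round(cbind("all obs." = coef(final_mod), "without outlier" = coef(refit)), 3)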
Diagnostics of heteroscedasticity are mostly useful when you suspect that different subgroups could have different variances (if you include factor variables) or when the data form a time series, in which case you may wish to look at the correlogram (acf(resid(final_mod))) and the partial correlogram (pacf(resid(final_mod))). The latter are only useful for time-ordered observations, i.e. when the errors at time \(t\) may depend on previous time periods \(\{t-1, \ldots\}\). The impact of (positive) serial dependence of the residuals is that the standard errors from OLS are typically too small and need to be inflated.