7 Model selection

Why perform model selection? Practitioners tend to prefer a single model to model aggregation or averaging, since a single model is easier to interpret. One guiding principle for choosing that model is parsimony and interpretability: we want a model that is not overly complex and that does not overfit the data. This matters most when the goal of the analysis is prediction. Whenever possible, we will make comparisons between nested models.

My (personal) general strategy for model selection, given the tools covered in MATH 341, is the following:

  1. Start with a complex model (all additive terms, say). Look at the individual Wald tests for the marginal significance of the coefficients.
  2. Try to simplify the full model using an \(F\) test, potentially dropping multiple terms at once. This preserves power (it avoids the potential bias in the estimate of the RSS in the denominator; cf. Section 5.1.3) and reduces the multiple testing problem that inflates the Type I error (rejecting the null more than \(100\alpha\%\) of the time when you shouldn’t).
  3. Repeat; you can use drop1 to further reduce the model.
  4. Try adding interactions between variables, if any.
  5. Compare the nested models using an information criterion such as AIC or BIC.
  6. Select the best forward selection / backward elimination models and some additional ones (that have a nice interpretation, have low information criterion values, etc.). Compare them in terms of prediction and goodness-of-fit; a sketch of this workflow in R follows this list.
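
The following R sketch illustrates steps 1 to 6 on the built-in swiss data. The dataset and the particular terms kept, dropped, or interacted are purely illustrative assumptions, not part of the notes.

```r
## Illustrative model-selection workflow on the 'swiss' data
data(swiss)

# 1. Full additive model; summary() reports the marginal Wald tests
full <- lm(Fertility ~ ., data = swiss)
summary(full)

# 2. F test for dropping several terms at once, via nested models
reduced <- lm(Fertility ~ Education + Catholic, data = swiss)
anova(reduced, full)          # F test of the reduced model against the full one

# 3. drop1() computes the F test for removing each term in turn
drop1(full, test = "F")

# 4. Try adding an interaction, then test it against the additive fit
inter <- update(reduced, . ~ . + Education:Catholic)
anova(reduced, inter)

# 5. Compare candidate models with information criteria
AIC(full, reduced, inter)
BIC(full, reduced, inter)

# 6. Backward elimination; step() uses AIC by default
#    (k = log(nrow(swiss)) would penalize with BIC instead)
backward <- step(full, direction = "backward", trace = FALSE)
summary(backward)
```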

Once you have a final model, you can interpret the coefficients. Model selection invalidates classical inference, so report coefficients and standard errors as is, without overinterpreting the \(P\)-values from summary.

The best tool to assess the predictive power of your model is cross-validation. If there is no temporal structure in the data, you can use, e.g., five-fold cross-validation to select the model with the best predictive performance.
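
Below is a minimal hand-rolled five-fold cross-validation sketch, again on the swiss data with illustrative candidate models; packages offer more polished alternatives.

```r
## Five-fold cross-validation comparing two candidate models by
## out-of-sample root mean squared error (RMSE)
set.seed(341)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(swiss)))  # random fold labels
models <- list(full    = Fertility ~ .,
               reduced = Fertility ~ Education + Catholic)
cv_rmse <- sapply(models, function(form) {
  pred <- numeric(nrow(swiss))
  for (j in 1:k) {
    fit <- lm(form, data = swiss[folds != j, ])        # train on 4 folds
    pred[folds == j] <- predict(fit, swiss[folds == j, ])  # predict held-out fold
  }
  sqrt(mean((swiss$Fertility - pred)^2))
})
cv_rmse  # smaller values indicate better out-of-sample predictions
```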