3.2 Factors

The regressors that make up the design matrix are often categorical variables, which can be ordered or unordered. In the design matrix, these appear as a matrix of binary indicators. In R, categorical variables are represented by vectors of class factor; in raw data, they are often encoded as strings or (perhaps worse) integers. The function read.table has an argument, stringsAsFactors, that casts strings to factors when set to TRUE (this was the default before R 4.0.0; the default is now FALSE). Categorical variables are also frequently stored as integer vectors in data frames. If the levels are integers (for example, the stage of an illness encoded as 1, 2, 3, ...), there is a risk that the vector is interpreted as a continuous numeric variable.
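The risk with integer codes can be seen in a small sketch. The data below are simulated and purely hypothetical (the names dat, stage and y are assumptions, not part of any data set used in these notes); the point is only that lm fits a single slope when stage is numeric, but one effect per non-baseline level when it is a factor.

```r
# Hypothetical toy data: illness stage coded as integers 1, 2, 3
set.seed(1)
dat <- data.frame(stage = rep(1:3, each = 10))
dat$y <- c(1, 4, 5)[dat$stage] + rnorm(30)

# Integer coding: stage enters the model as one numeric slope
fit_num <- lm(y ~ stage, data = dat)

# Factor coding: one indicator for each non-baseline level
fit_fac <- lm(y ~ factor(stage), data = dat)

length(coef(fit_num))  # 2: intercept and slope
length(coef(fit_fac))  # 3: intercept and two level effects
```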

Useful functions to deal with factors include

  • as.factor: casts a vector to a factor;
  • is.factor: checks whether a vector is a factor;
  • class: reports the class of an object (factor for factor objects);
  • summary: displays counts of the various levels of a factor in place of the usual summary statistics;
  • levels: returns the names of the different levels of a factor; can be used to replace existing category names by more meaningful ones.
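As a brief illustration of these functions, consider the following sketch. The vector stage and the labels "mild", "moderate" and "severe" are hypothetical and chosen only for the example.

```r
# Hypothetical integer codes for a categorical variable
stage <- c(1L, 2L, 3L, 2L, 1L, 3L, 3L)
is.factor(stage)      # FALSE: still an integer vector

stage_f <- as.factor(stage)
is.factor(stage_f)    # TRUE
class(stage_f)        # "factor"
levels(stage_f)       # "1" "2" "3"
summary(stage_f)      # counts per level instead of mean, quartiles, etc.

# Replace the level names by more meaningful ones (order matters)
levels(stage_f) <- c("mild", "moderate", "severe")
summary(stage_f)
```

Note that the assignment to levels relabels the existing levels in their current order, so the new names must be listed in the same order as the old ones.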

The function lm knows how to deal with factor objects and will automatically transform them into a matrix of binary indicators. For identifiability, one level of the factor must be dropped; the convention in R is that the first level, which by default is the first in alphabetical order, is used as the baseline. See ?factor for more details.
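To see which indicators are created and which level is dropped, one can inspect the design matrix with model.matrix. The sketch below uses a hypothetical factor grp and assumes R's default contrasts; relevel is one way to choose a different baseline.

```r
# Hypothetical factor with three levels
grp <- factor(c("b", "a", "c", "a", "b", "c"))
levels(grp)                      # "a" "b" "c": "a" is the baseline

# Design matrix: intercept plus indicators for the non-baseline levels
X <- model.matrix(~ grp)
colnames(X)                      # "(Intercept)" "grpb" "grpc"

# Change the baseline to "c"
grp2 <- relevel(grp, ref = "c")
colnames(model.matrix(~ grp2))   # "(Intercept)" "grp2a" "grp2b"
```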

Suppose your data set contains several binary factors, each with only two levels, that are mutually exclusive. You may wish to merge these if they refer to the same variable, for example ethnicity. A brute-force solution is to build a single factor whose levels record which of the indicators is active. This will not change the fitted model (the two design matrices span the same space), but it can affect the output of anova calls, since all the levels are then dropped at once rather than subgroup by subgroup.
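One possible brute-force merge is sketched below. Everything in it is an assumption made for illustration: the data frame dat, the 0/1 indicator columns white, black and hispanic, the label "other" for rows where no indicator is active, and the response y.

```r
# Hypothetical data: three mutually exclusive 0/1 indicators and a response
dat <- data.frame(
  white    = c(1, 0, 0, 0, 1, 0, 0, 0),
  black    = c(0, 1, 0, 0, 0, 1, 0, 0),
  hispanic = c(0, 0, 1, 0, 0, 0, 1, 0),
  y        = c(2.1, 3.4, 1.9, 2.8, 2.5, 3.0, 2.2, 3.1)
)

# Check that at most one indicator is active in each row
stopifnot(rowSums(dat[, c("white", "black", "hispanic")]) <= 1)

# Brute-force merge: start from a default label, then overwrite
eth <- rep("other", nrow(dat))
eth[dat$white == 1]    <- "white"
eth[dat$black == 1]    <- "black"
eth[dat$hispanic == 1] <- "hispanic"
dat$ethnicity <- factor(eth)

# Same fitted values, different ANOVA tables
fit_sep <- lm(y ~ white + black + hispanic, data = dat)
fit_one <- lm(y ~ ethnicity, data = dat)
all.equal(fitted(fit_sep), fitted(fit_one))  # TRUE
anova(fit_sep)  # one line per indicator
anova(fit_one)  # a single line for ethnicity
```

The first ANOVA table tests the indicators sequentially, whereas the second tests the merged factor as a whole.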

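Ordered factors do not have the same parametrization as unordered ones, so be careful when interpreting them. By default, R uses treatment contrasts (contr.treatment) for unordered factors and orthogonal polynomial contrasts (contr.poly) for ordered factors, so the coefficients of an ordered factor are not differences with respect to a baseline level.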
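The difference in parametrization is visible in the column names of the design matrix. The sketch below assumes the default value of options("contrasts") and uses a hypothetical variable stage with made-up labels.

```r
# Hypothetical categorical variable with a natural ordering
stage <- c("mild", "moderate", "severe", "moderate", "mild", "severe")
lev <- c("mild", "moderate", "severe")

f_unord <- factor(stage, levels = lev)
f_ord   <- factor(stage, levels = lev, ordered = TRUE)

# Unordered: indicators relative to the baseline "mild"
colnames(model.matrix(~ f_unord))
# "(Intercept)" "f_unordmoderate" "f_unordsevere"

# Ordered: linear (.L) and quadratic (.Q) polynomial contrasts
colnames(model.matrix(~ f_ord))
# "(Intercept)" "f_ord.L" "f_ord.Q"

# The contrasts currently in use for unordered and ordered factors
getOption("contrasts")
```

If baseline comparisons are preferred for an ordered variable, one option is simply to store it as an unordered factor with the levels given in the desired order.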