05. Linear models (coefficient of determination)
2024
The Pearson correlation coefficient quantifies the strength of the linear relationship between two random variables \(X\) and \(Y\). \[\begin{align*} \rho= \mathsf{cor}(X, Y) = \frac{{\mathsf{Co}}(X,Y)}{\sqrt{{\mathsf{Va}}(X){\mathsf{Va}}(Y)}}. \end{align*}\]
The sign of \(\rho\) determines the orientation of the slope: positive values correspond to increasing relationships, negative values to decreasing ones.
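As a sanity check, the following minimal Python sketch (NumPy assumed; the simulated data and settings are illustrative only, not from the notes) computes the sample version of \(\rho\) directly from the formula and compares it to the built-in estimator.

```python
import numpy as np

# Simulated data with a positive slope, hence a positive correlation
rng = np.random.default_rng(2024)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.8, size=100)

# Sample analogue of Co(X, Y) / sqrt(Va(X) Va(Y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho_hat = cov_xy / np.sqrt(x.var() * y.var())

# Agrees with NumPy's correlation estimate
assert np.isclose(rho_hat, np.corrcoef(x, y)[0, 1])
```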
Figure 1: Scatterplots of observations with correlations of \(0.1\), \(0.5\), \(-0.75\) and \(0.95\) from \(A\) to \(D\).
Figure 2: Four datasets of dependent data with identical summary statistics and a linear correlation of \(-0.06\).
Suppose that we do not use any explanatory variable (i.e., the intercept-only model). In this case, the fitted value for every observation is the overall mean \(\overline{Y}\), and the sum of squared residuals reduces to the sum of squared centered observations \[\begin{align*} \mathsf{SS}_c=\sum_{i=1}^n (Y_i-\overline{Y})^2. \end{align*}\]
When we include the \(p\) regressors, we obtain instead \[\begin{align*} \mathsf{SS}_e=\sum_{i=1}^n (Y_i-\widehat{Y}_i)^2. \end{align*}\] The \(\mathsf{SS}_e\) is non-increasing when we include more variables.
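To make the decomposition concrete, here is a hedged sketch (NumPy with simulated data; the design, coefficients, and helper `ss_e` are made up for illustration) that computes \(\mathsf{SS}_c\) for the intercept-only model and \(\mathsf{SS}_e\) by least squares, and shows that \(\mathsf{SS}_e\) does not increase as regressors are added.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100
X = rng.normal(size=(n, 2))                      # two regressors
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Intercept-only model: the fitted value is the overall mean
ss_c = np.sum((y - y.mean()) ** 2)

def ss_e(regressors):
    """Sum of squared residuals for a model with an intercept."""
    Z = np.column_stack([np.ones(n), regressors])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ beta) ** 2)

# SS_e is non-increasing as variables are added
print(ss_c, ss_e(X[:, :1]), ss_e(X))             # each value <= the previous one
```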
Consider the sum of squared residuals for the two models: \(\mathsf{SS}_c\) for the intercept-only model and \(\mathsf{SS}_e\) for the model with the \(p\) regressors. Since the intercept-only model is a special case of the larger one, \(\mathsf{SS}_e \leq \mathsf{SS}_c\).
Consequently, \(\mathsf{SS}_c-\mathsf{SS}_e\) is the reduction of the error attributable to including \(\mathbf{X}\) in the model, and the coefficient of determination \[\begin{align*} R^2=\frac{\mathsf{SS}_c-\mathsf{SS}_e}{\mathsf{SS}_c} \end{align*}\] gives the proportion of the variability in \(\boldsymbol{Y}\) explained by \(\mathbf{X}\).
We can show that, for models that include an intercept, the coefficient of determination is the square of Pearson’s linear correlation between the response \(\boldsymbol{y}\) and the fitted values \(\widehat{\boldsymbol{y}}\), \[R^2 = \mathsf{cor}^2(\boldsymbol{y}, \widehat{\boldsymbol{y}}).\]
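The identity can be checked numerically; below is a self-contained sketch under the same simulated-data assumptions as above (recall that the model must include an intercept for the identity to hold).

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Least squares fit with an intercept column
Z = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ beta

ss_c = np.sum((y - y.mean()) ** 2)               # intercept-only model
ss_e = np.sum((y - y_hat) ** 2)                  # model with the regressors
r_squared = (ss_c - ss_e) / ss_c

# R^2 equals the squared correlation between the response and the fits
assert np.isclose(r_squared, np.corrcoef(y, y_hat)[0, 1] ** 2)
```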