# The Linear Regression Model: Ordinary Least Squares {#linreg-estimation-ols-example}

In this example, we demonstrate how regression coefficients are estimated using ordinary least squares.

The ordinary least squares (OLS) estimator of the regression coefficients is given by

\[\begin{equation} \boldsymbol{\hat{\beta}} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \left( \mathbf{X}^{T} \mathbf{y} \right) \end{equation}\]

The jeksterslabRlinreg package has several functions that can be used to solve for \(\boldsymbol{\hat{\beta}}\), namely `.betahatnorm()`, `.betahatqr()`, `.betahatsvd()`, and `betahat()`. Each is demonstrated below.

## Data

See `jeksterslabRdatarepo::wages.matrix` for the data set used in this example.

```r
X <- jeksterslabRdatarepo::wages.matrix[["X"]]
# age is removed
X <- X[, -ncol(X)]
y <- jeksterslabRdatarepo::wages.matrix[["y"]]
head(X)
#>      constant gender race union education experience
#> [1,]        1      1    0     0        12         20
#> [2,]        1      0    0     0         9          9
#> [3,]        1      0    0     0        16         15
#> [4,]        1      0    1     1        14         38
#> [5,]        1      1    1     0        16         19
#> [6,]        1      1    0     0        12          4
head(y)
#>      wages
#> [1,] 11.55
#> [2,]  5.00
#> [3,] 12.00
#> [4,]  7.00
#> [5,] 21.15
#> [6,]  6.92
```

## Normal Equation

Minimizing the residual sum of squares \(\left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)^{T} \left( \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right)\) yields the normal equations \(\mathbf{X}^{T} \mathbf{X} \boldsymbol{\hat{\beta}} = \mathbf{X}^{T} \mathbf{y}\), whose solution is

\[\begin{equation} \boldsymbol{\hat{\beta}} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \left( \mathbf{X}^{T} \mathbf{y} \right). \end{equation}\]

The following implements the equation in R.

```r
solve(t(X) %*% X) %*% t(X) %*% y
#>                 wages
#> constant   -7.1833382
#> gender     -3.0748755
#> race       -1.5653133
#> union       1.0959758
#> education   1.3703010
#> experience  0.1666065
```

The `crossprod()` function calculates cross products more efficiently, and `solve(a, b)` solves the linear system directly instead of explicitly inverting \(\mathbf{X}^{T} \mathbf{X}\). Below is a more efficient implementation of the equation.

```r
solve(crossprod(X), crossprod(X, y))
#>                 wages
#> constant   -7.1833382
#> gender     -3.0748755
#> race       -1.5653133
#> union       1.0959758
#> education   1.3703010
#> experience  0.1666065
```
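As a quick check, the two implementations can be timed against each other with the microbenchmark package, which is also used later in this example. This is only a sketch; timings will vary by machine.

```r
# Timing sketch: crossprod(X) computes t(X) %*% X without explicitly
# forming the transpose, so it should be faster than the naive version.
microbenchmark::microbenchmark(
  naive = solve(t(X) %*% X) %*% t(X) %*% y,
  crossprod = solve(crossprod(X), crossprod(X, y))
)
```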

## `jeksterslabRlinreg::.betahatnorm()`

The jeksterslabRlinreg package has a function that solves for \(\boldsymbol{\hat{\beta}}\) using the normal equation.

```r
result_inv <- jeksterslabRlinreg::.betahatnorm(
  X = X,
  y = y
)
result_inv
#>               betahat
#> constant   -7.1833382
#> gender     -3.0748755
#> race       -1.5653133
#> union       1.0959758
#> education   1.3703010
#> experience  0.1666065
```

## QR Decomposition

Another way to solve for \(\boldsymbol{\hat{\beta}}\) is through QR decomposition.

The data matrix \(\mathbf{X}\) is decomposed into

\[\begin{equation} \mathbf{X} = \mathbf{Q} \mathbf{R}. \end{equation}\]

QR decomposition is implemented in R using the `qr()` function.

```r
Xqr <- qr(X)
R <- qr.R(Xqr)
Q <- qr.Q(Xqr)
```
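As a sanity check, \(\mathbf{Q}\) and \(\mathbf{R}\) should reconstruct \(\mathbf{X}\) up to floating-point error. A minimal sketch (`check.attributes = FALSE` ignores the column names lost in the multiplication):

```r
# Q %*% R should reproduce the original data matrix.
all.equal(Q %*% R, X, check.attributes = FALSE)
```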

Estimates are found by solving for \(\boldsymbol{\hat{\beta}}\) in

\[\begin{equation} \mathbf{R} \boldsymbol{\hat{\beta}} = \mathbf{Q}^{T} \mathbf{y}. \end{equation}\]

Because \(\mathbf{R}\) is upper triangular, this system can be solved efficiently by back substitution using `backsolve()`.

```r
backsolve(R, crossprod(Q, y))
#>            [,1]
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,]  1.0959758
#> [5,]  1.3703010
#> [6,]  0.1666065
```

Another solution is to multiply both sides by the inverse of \(\mathbf{R}\) to isolate \(\boldsymbol{\hat{\beta}}\).

\[\begin{equation} \boldsymbol{\hat{\beta}} = \mathbf{R}^{-1} \mathbf{Q}^{T} \mathbf{y}. \end{equation}\]

```r
solve(R) %*% crossprod(Q, y)
#>                 wages
#> constant   -7.1833382
#> gender     -3.0748755
#> race       -1.5653133
#> union       1.0959758
#> education   1.3703010
#> experience  0.1666065
```

The first solution is more efficient because `backsolve()` exploits the triangular structure of \(\mathbf{R}\) instead of forming its inverse.
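A quick benchmark can illustrate the difference; actual timings will vary by machine.

```r
# Back substitution on the triangular R versus explicit inversion.
microbenchmark::microbenchmark(
  backsolve = backsolve(R, crossprod(Q, y)),
  inverse = solve(R) %*% crossprod(Q, y)
)
```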

## `jeksterslabRlinreg::.betahatqr()`

The jeksterslabRlinreg package has a function that solves for \(\boldsymbol{\hat{\beta}}\) using QR decomposition.

```r
result_qr <- jeksterslabRlinreg::.betahatqr(
  X = X,
  y = y
)
result_qr
#>         betahat
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,]  1.0959758
#> [5,]  1.3703010
#> [6,]  0.1666065
```

## Singular Value Decomposition (SVD)

Another way to solve for \(\boldsymbol{\hat{\beta}}\) is through singular value decomposition.

The data matrix \(\mathbf{X}\) is decomposed into

\[\begin{equation} \mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^{T}. \end{equation}\]

\(\boldsymbol{\hat{\beta}}\) is given by

\[\begin{equation} \boldsymbol{\hat{\beta}} = \mathbf{V} \mathbf{\Sigma}^{+} \mathbf{U}^{T} \mathbf{y} \end{equation}\]

where the superscript \(+\) indicates the pseudoinverse.
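When \(\mathbf{X}\) has full column rank, the singular values \(\sigma_{1}, \ldots, \sigma_{p}\) are all positive and the pseudoinverse reduces to

\[\begin{equation} \mathbf{\Sigma}^{+} = \mathrm{diag} \left( \sigma_{1}^{-1}, \sigma_{2}^{-1}, \ldots, \sigma_{p}^{-1} \right), \end{equation}\]

that is, the reciprocals of the singular values on the diagonal.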

Singular value decomposition is implemented in R using the `svd()` function.

```r
Xsvd <- svd(X)
V <- Xsvd$v
U <- Xsvd$u
Sigma <- Xsvd$d
```
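As with the QR decomposition, the factors should reconstruct \(\mathbf{X}\). A minimal sketch:

```r
# U %*% diag(Sigma) %*% t(V) should reproduce the original data matrix.
all.equal(U %*% diag(Sigma) %*% t(V), X, check.attributes = FALSE)
```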

In R, \(\boldsymbol{\hat{\beta}}\) is then computed as follows. Multiplying element-wise by `1 / Sigma` applies \(\mathbf{\Sigma}^{+}\) without forming the diagonal matrix.

```r
V %*% ((1 / Sigma) * crossprod(U, y))
#>           wages
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,]  1.0959758
#> [5,]  1.3703010
#> [6,]  0.1666065
```
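Equivalently, the pseudoinverse can be formed as an explicit diagonal matrix, although the element-wise version above avoids constructing it.

```r
# Same estimates, with the diagonal pseudoinverse formed explicitly.
V %*% diag(1 / Sigma) %*% crossprod(U, y)
```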

## `jeksterslabRlinreg::.betahatsvd()`

The jeksterslabRlinreg package has a function that solves for \(\boldsymbol{\hat{\beta}}\) using singular value decomposition.

```r
result_svd <- jeksterslabRlinreg::.betahatsvd(
  X = X,
  y = y
)
result_svd
#>         betahat
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,]  1.0959758
#> [5,]  1.3703010
#> [6,]  0.1666065
```

## `jeksterslabRlinreg::betahat()`

`jeksterslabRlinreg::betahat()` calculates the coefficients using the normal equation. When that fails, for example because \(\mathbf{X}^{T} \mathbf{X}\) is singular, it falls back to QR decomposition when `qr = TRUE` or to singular value decomposition when `qr = FALSE`.

```r
result_betahat <- jeksterslabRlinreg::betahat(
  X = X,
  y = y
)
```
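To see the fallback in action, `betahat()` can be called on the singular example data from `jeksterslabRdatarepo::svd.linreg`, which is also used in the tests at the end of this example. It emits a message that QR decomposition is being used.

```r
# The normal equation fails on this singular X, so betahat() falls
# back to QR decomposition and says so in a message.
jeksterslabRlinreg::betahat(
  X = jeksterslabRdatarepo::svd.linreg$Xb,
  y = jeksterslabRdatarepo::svd.linreg$yb
)
#> Using QR decomposition.
```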

## `lm()` Function

The `lm()` function is the default option for estimating the parameters of linear regression models in R. It calculates estimates of the regression coefficients and other statistics.

```r
lmobj <- lm(
  y ~ gender + race + union + education + experience,
  data = jeksterslabRdatarepo::wages
)
result_lm <- coef(lmobj)
result_lm
#> (Intercept)      gender        race       union   education  experience 
#>  -7.1833382  -3.0748755  -1.5653133   1.0959758   1.3703010   0.1666065
summary(lmobj)
#> 
#> Call:
#> lm(formula = y ~ gender + race + union + education + experience, 
#>     data = jeksterslabRdatarepo::wages)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -20.781  -3.760  -1.044   2.418  50.414 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -7.18334    1.01579  -7.072 2.51e-12 ***
#> gender      -3.07488    0.36462  -8.433  < 2e-16 ***
#> race        -1.56531    0.50919  -3.074  0.00216 ** 
#> union        1.09598    0.50608   2.166  0.03052 *  
#> education    1.37030    0.06590  20.792  < 2e-16 ***
#> experience   0.16661    0.01605  10.382  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.508 on 1283 degrees of freedom
#> Multiple R-squared:  0.3233, Adjusted R-squared:  0.3207 
#> F-statistic: 122.6 on 5 and 1283 DF,  p-value: < 2.2e-16
```

## Why use matrix solutions over `lm()`?

`lm()` calculates a rich output that is very useful. However, when only the regression coefficients are needed, for example in bootstrapping or simulation studies, computing them directly with matrix operations is considerably faster, as the benchmark below shows.

```r
microbenchmark::microbenchmark(
  lm = lm(y ~ gender + race + union + education + experience, data = jeksterslabRdatarepo::wages),
  betahatnorm = jeksterslabRlinreg::.betahatnorm(X = X, y = y),
  betahatqr = jeksterslabRlinreg::.betahatqr(X = X, y = y),
  betahatsvd = jeksterslabRlinreg::.betahatsvd(X = X, y = y),
  betahat = jeksterslabRlinreg::betahat(X = X, y = y)
)
#> Unit: microseconds
#>         expr      min        lq      mean    median        uq       max neval
#>           lm 2014.932 2379.1010 3330.6218 2761.0070 3279.7440 14319.002   100
#>  betahatnorm   81.619   94.5695  159.9868  130.8410  182.1045   678.173   100
#>    betahatqr  274.445  387.6460  796.7431  517.0380  661.5865  9283.684   100
#>   betahatsvd  281.918  361.9860  668.9583  431.5150  540.7130 10167.282   100
#>      betahat   97.689  128.4765  223.6165  173.8065  231.1500  2210.643   100
```
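As an illustration, a nonparametric bootstrap of the coefficients only needs \(\boldsymbol{\hat{\beta}}\) in each resample, so the fast matrix routines pay off. The sketch below is illustrative only: the number of resamples `B` is an arbitrary choice, and it applies `.betahatnorm()` directly to resampled rows.

```r
# Minimal bootstrap sketch (illustrative; B is arbitrary).
set.seed(42)
B <- 1000
n <- nrow(X)
boot_betahat <- replicate(B, {
  # resample rows of X and y together, then re-estimate the coefficients
  # (assumes each resample yields a nonsingular cross-product matrix)
  i <- sample.int(n, replace = TRUE)
  as.vector(
    jeksterslabRlinreg::.betahatnorm(
      X = X[i, , drop = FALSE],
      y = y[i, , drop = FALSE]
    )
  )
})
# bootstrap standard errors of the coefficients
apply(boot_betahat, 1, sd)
```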
The following tests confirm that all of the approaches above produce the same estimates and that `betahat()` falls back gracefully when \(\mathbf{X}\) is singular.

```r
context("Test linreg-estimation-ols")
test_that("results_betahat", {
  result_lm <- as.vector(unname(result_lm))
  result_inv <- as.vector(unname(result_inv))
  result_qr <- as.vector(unname(result_qr))
  result_svd <- as.vector(unname(result_svd))
  result_betahat <- as.vector(unname(result_betahat))
  for (i in seq_along(result_lm)) {
    expect_equivalent(
      result_lm[i],
      result_inv[i]
    )
    expect_equivalent(
      result_lm[i],
      result_qr[i]
    )
    expect_equivalent(
      result_lm[i],
      result_svd[i]
    )
    expect_equivalent(
      result_lm[i],
      result_betahat[i]
    )
  }
})
#> Test passed 🌈
test_that("X is singular.", {
  expect_error(
    jeksterslabRlinreg::.betahatnorm(
      X = jeksterslabRdatarepo::svd.linreg$Xb,
      y = jeksterslabRdatarepo::svd.linreg$yb
    )
  )
})
#> Test passed 😸
test_that("Using QR decomposition.", {
  expect_message(
    jeksterslabRlinreg::betahat(
      X = jeksterslabRdatarepo::svd.linreg$Xb,
      y = jeksterslabRdatarepo::svd.linreg$yb
    ),
    "Using QR decomposition."
  )
})
#> Test passed 🎉
test_that("Using singular value decomposition.", {
  expect_message(
    jeksterslabRlinreg::betahat(
      X = jeksterslabRdatarepo::svd.linreg$Xb,
      y = jeksterslabRdatarepo::svd.linreg$yb,
      qr = FALSE
    ),
    "Using singular value decomposition."
  )
})
#> Test passed 🥇
```