vignettes/tests/test-linreg-estimation-ols-example.Rmd
test-linreg-estimation-ols-example.Rmd
# The Linear Regression Model: Ordinary Least Squares {#linreg-estimation-ols-example}
In this example, we demonstrate how regression coefficients are estimated using ordinary least squares.
The ordinary least squares estimator of regression slopes is given by
\[\begin{equation} \boldsymbol{\hat{\beta}} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \left( \mathbf{X}^{T} \mathbf{y} \right) \end{equation}\]
The jeksterslabRlinreg
package has several functions that can be used in solving for \(\boldsymbol{\hat{\beta}}\) namely
jeksterslabRlinreg::.betahatnorm()
- normal equationjeksterslabRlinreg::.betahatqr()
- using QR decompositionjeksterslabRlinreg::.betahatsvd()
- using singular value decomposition andjeksterslabRlinreg::betahat
- calculates coefficients using the normal equation. When that fails, QR decomposition is used when qr = TRUE
or singular value decomposition when qr = FALSE
.See jeksterslabRdatarepo::wages.matrix()
for the data set used in this example.
X <- jeksterslabRdatarepo::wages.matrix[["X"]]
# age is removed
X <- X[, -ncol(X)]
y <- jeksterslabRdatarepo::wages.matrix[["y"]]
head(X)
#> constant gender race union education experience
#> [1,] 1 1 0 0 12 20
#> [2,] 1 0 0 0 9 9
#> [3,] 1 0 0 0 16 15
#> [4,] 1 0 1 1 14 38
#> [5,] 1 1 1 0 16 19
#> [6,] 1 1 0 0 12 4
head(y)
#> wages
#> [1,] 11.55
#> [2,] 5.00
#> [3,] 12.00
#> [4,] 7.00
#> [5,] 21.15
#> [6,] 6.92
\[\begin{equation} \boldsymbol{\hat{\beta}} = \left( \mathbf{X}^{T} \mathbf{X} \right)^{-1} \left( \mathbf{X}^{T} \mathbf{y} \right) \end{equation}\]
The following implements the equation in R
.
solve(t(X) %*% X) %*% t(X) %*% y
#> wages
#> constant -7.1833382
#> gender -3.0748755
#> race -1.5653133
#> union 1.0959758
#> education 1.3703010
#> experience 0.1666065
The crossprod()
function is more efficient in calculating cross products. Below is a more efficient implementation of the equation.
solve(crossprod(X), crossprod(X, y))
#> wages
#> constant -7.1833382
#> gender -3.0748755
#> race -1.5653133
#> union 1.0959758
#> education 1.3703010
#> experience 0.1666065
jeksterslabRlinreg::.betahatnorm()
The jeksterslabRlinreg
package has a function that solves for \(\boldsymbol{\hat{\beta}}\) using the normal equation.
result_inv <- jeksterslabRlinreg::.betahatnorm(
X = X,
y = y
)
result_inv
#> betahat
#> constant -7.1833382
#> gender -3.0748755
#> race -1.5653133
#> union 1.0959758
#> education 1.3703010
#> experience 0.1666065
Another way to solve for \(\boldsymbol{\hat{\beta}}\) is through QR decomposition.
The data matrix \(\mathbf{X}\) is decomposed into
\[\begin{equation} \mathbf{X} = \mathbf{Q} \mathbf{R}. \end{equation}\]
QR decomposition is implemeded in R
using the qr()
function.
Estimates are found by solving \(\boldsymbol{\hat{\beta}}\) in
\[\begin{equation} \mathbf{R} \boldsymbol{\hat{\beta}} = \mathbf{Q}^{T} \mathbf{y}. \end{equation}\]
backsolve(R, crossprod(Q, y))
#> [,1]
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,] 1.0959758
#> [5,] 1.3703010
#> [6,] 0.1666065
Another solution is to multiply both sides by the inverse of \(\mathbf{R}\) to isolate \(\boldsymbol{\hat{\beta}}\).
\[\begin{equation} \boldsymbol{\hat{\beta}} = \mathbf{R}^{-1} \mathbf{Q}^{T} \mathbf{y}. \end{equation}\]
solve(R) %*% crossprod(Q, y)
#> wages
#> constant -7.1833382
#> gender -3.0748755
#> race -1.5653133
#> union 1.0959758
#> education 1.3703010
#> experience 0.1666065
The first solution is more efficient.
jeksterslabRlinreg::.betahatqr()
The jeksterslabRlinreg
package has a function that solves for \(\boldsymbol{\hat{\beta}}\) using QR decomposition.
result_qr <- jeksterslabRlinreg::.betahatqr(
X = X,
y = y
)
result_qr
#> betahat
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,] 1.0959758
#> [5,] 1.3703010
#> [6,] 0.1666065
Another way to solve for \(\boldsymbol{\hat{\beta}}\) is through singular value decomposition.
The data matrix \(\mathbf{X}\) is decomposed into
\[\begin{equation} \mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^{T}. \end{equation}\]
\(\boldsymbol{\hat{\beta}}\) is given by
\[\begin{equation} \boldsymbol{\hat{\beta}} = \mathbf{V} \mathbf{\Sigma}^{+} \mathbf{U}^{T} \mathbf{y} \end{equation}\]
where the superscript \(+\) indicates the pseudoinverse.
Singular value decomposition is implemeded in R
using the svd()
function.
Xsvd <- svd(X)
V <- Xsvd$v
U <- Xsvd$u
Sigma <- Xsvd$d
\(\boldsymbol{\hat{\beta}}\) is given by
V %*% ((1 / Sigma) * crossprod(U, y))
#> wages
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,] 1.0959758
#> [5,] 1.3703010
#> [6,] 0.1666065
jeksterslabRlinreg::.betahatsvd()
The jeksterslabRlinreg
package has a function that solves for \(\boldsymbol{\hat{\beta}}\) using singular decomposition.
result_svd <- jeksterslabRlinreg::.betahatsvd(
X = X,
y = y
)
result_svd
#> betahat
#> [1,] -7.1833382
#> [2,] -3.0748755
#> [3,] -1.5653133
#> [4,] 1.0959758
#> [5,] 1.3703010
#> [6,] 0.1666065
jeksterslabRlinreg::betahat()
jeksterslabRlinreg::betahat()
calculates coefficients using the normal equation. When that fails, QR decomposition is used when qr = TRUE
or singular value decomposition when qr = FALSE
.
result_betahat <- jeksterslabRlinreg::betahat(
X = X,
y = y
)
lm()
FunctionThe lm()
function is the default option for estimating parameters of linear regression models in R
. It calculates estimates of regression coefficients and other statistics.
lmobj <- lm(
y ~ gender + race + union + education + experience,
data = jeksterslabRdatarepo::wages
)
result_lm <- coef(lmobj)
result_lm
#> (Intercept) gender race union education experience
#> -7.1833382 -3.0748755 -1.5653133 1.0959758 1.3703010 0.1666065
summary(lmobj)
#>
#> Call:
#> lm(formula = y ~ gender + race + union + education + experience,
#> data = jeksterslabRdatarepo::wages)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -20.781 -3.760 -1.044 2.418 50.414
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7.18334 1.01579 -7.072 2.51e-12 ***
#> gender -3.07488 0.36462 -8.433 < 2e-16 ***
#> race -1.56531 0.50919 -3.074 0.00216 **
#> union 1.09598 0.50608 2.166 0.03052 *
#> education 1.37030 0.06590 20.792 < 2e-16 ***
#> experience 0.16661 0.01605 10.382 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.508 on 1283 degrees of freedom
#> Multiple R-squared: 0.3233, Adjusted R-squared: 0.3207
#> F-statistic: 122.6 on 5 and 1283 DF, p-value: < 2.2e-16
lm()
?When you only need the regression coefficients you are better off just calculating said coefficients using matrix operations because it is faster. lm()
calculates and provides a rich output that is very useful. However, there are situations when the coefficients are enough. For example, bootstrapping and simulations. Having an efficient option to calculate regression coefficients is handy.
microbenchmark::microbenchmark(
lm = lm(y ~ gender + race + union + education + experience, data = jeksterslabRdatarepo::wages),
betahatnorm = jeksterslabRlinreg::.betahatnorm(X = X, y = y),
betahatqr = jeksterslabRlinreg::.betahatqr(X = X, y = y),
betahatsvd = jeksterslabRlinreg::.betahatsvd(X = X, y = y),
betahat = jeksterslabRlinreg::betahat(X = X, y = y)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> lm 2014.932 2379.1010 3330.6218 2761.0070 3279.7440 14319.002 100
#> betahatnorm 81.619 94.5695 159.9868 130.8410 182.1045 678.173 100
#> betahatqr 274.445 387.6460 796.7431 517.0380 661.5865 9283.684 100
#> betahatsvd 281.918 361.9860 668.9583 431.5150 540.7130 10167.282 100
#> betahat 97.689 128.4765 223.6165 173.8065 231.1500 2210.643 100
context("Test linreg-estimation-ols")
test_that("results_betahat", {
result_lm <- as.vector(unname(result_lm))
result_inv <- as.vector(unname(result_inv))
result_qr <- as.vector(unname(result_qr))
result_svd <- as.vector(unname(result_svd))
result_betahat <- as.vector(unname(result_betahat))
for (i in 1:length(result_lm)) {
expect_equivalent(
result_lm[i],
result_inv[i]
)
expect_equivalent(
result_lm[i],
result_qr[i]
)
expect_equivalent(
result_lm[i],
result_svd[i]
)
expect_equivalent(
result_lm[i],
result_betahat[i]
)
}
})
#> Test passed 🌈
test_that("X is singular.", {
expect_error(
jeksterslabRlinreg::.betahatnorm(
X = jeksterslabRdatarepo::svd.linreg$Xb,
y = jeksterslabRdatarepo::svd.linreg$yb,
)
)
})
#> Test passed 😸
test_that("Using QR decomposition.", {
expect_message(
jeksterslabRlinreg::betahat(
X = jeksterslabRdatarepo::svd.linreg$Xb,
y = jeksterslabRdatarepo::svd.linreg$yb,
),
"Using QR decomposition."
)
})
#> Test passed 🎉
test_that("Using singular value decomposition.", {
expect_message(
jeksterslabRlinreg::betahat(
X = jeksterslabRdatarepo::svd.linreg$Xb,
y = jeksterslabRdatarepo::svd.linreg$yb,
qr = FALSE
),
"Using singular value decomposition."
)
})
#> Test passed 🥇