Manual for the package: linearreg

library(ProxReg)


Introduction

This is the introduction to the package “linearreg”, which is used to build linear regression models such as OLS (ordinary least squares) regression, Ridge regression, and Lasso regression, the latter implemented through the ISTA algorithm. The package also contains functions that perform k-fold cross-validation, a helpful procedure for choosing lambda values.

Functions

ols

Ordinary least squares (OLS) regression is one of the most common and simple techniques for estimating the parameters of a linear regression model. It finds the line that best fits the observations by minimizing the sum of squared errors between the predicted and observed values.
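
In symbols (our notation, not taken from the package), the OLS estimate minimizes the residual sum of squares:

\[
\hat{\beta} \;=\; \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 ,
\]

which, in matrix form, has the closed-form solution \(\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y\).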

However, it has some disadvantages. Because it is sensitive to outliers, it may not be appropriate for datasets containing extreme values. Furthermore, overfitting and multicollinearity are two additional factors that undermine the model’s credibility. For these reasons, OLS regression is not recommended for high-dimensional data.

By default, the function prints a table that summarizes the ordinary least squares regression. Here is an example:

Given a dataset named df with dependent variable score and independent variables hours and entertain_hours, we get the following table:

df <- data.frame("hours" = c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11, 12, 12, 14),
                 "score" = c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 84, 82, 91, 93, 89),
                 "entertain_hours" = c(6, 5, 3, 2, 2, 2, 1, 1, 0.5, 1, 0.3, 0.3, 0.2, 0.2, 0.1))
ols(df, "score", c("hours", "entertain_hours"), alpha = 0.025, verbose = FALSE)
#>                   score
#> intercept       74.1312
#> hours            1.2277
#> entertain_hours -1.8668
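
As a quick sanity check (using base R rather than the package), the same coefficients should be reproducible with lm():

# Cross-check with base R; the estimates should match the ols() output above
coef(lm(score ~ hours + entertain_hours, data = df))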

The upper part of the full summary table gives general information about the regression model: the R-squared and adjusted R-squared, the degrees of freedom, the F-statistic with its p-value, and the residual standard error.

Below the separator line, we have information about the coefficients: each estimate is reported with its standard error, t value, p-value, and the confidence interval determined by alpha.
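
In the example above the full summary was suppressed with verbose = FALSE; setting verbose = TRUE should print it (output not shown here):

# Full summary table: general model information plus per-coefficient statistics
ols(df, "score", c("hours", "entertain_hours"), alpha = 0.025, verbose = TRUE)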

k_fold_cross

In some cases, we may face situations in which we have to compare linear models and choose the best-fitting one. The most popular method for estimating prediction error is cross-validation.

K-fold cross-validation is a method that divides the dataset into k folds: in each round, k-1 folds form the training set and the remaining fold is used as the test set.

The function \(\color{blue}{k\_fold\_cross}\) creates k different versions of the dataset, each consisting of a training set (built from k-1 folds) and a test set (the remaining fold).

df <- data.frame("hours" = c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11, 12, 12, 14),
                 "score" = c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 84, 82, 91, 93, 89),
                 "entertain_hours" = c(6, 5, 3, 2, 2, 2, 1, 1, 0.5, 1, 0.3, 0.3, 0.2, 0.2, 0.1))

k_fold_cross(df, k = 3)
#> # A tibble: 3 × 2
#>   train            test            
#>   <list>           <list>          
#> 1 <named list [2]> <named list [2]>
#> 2 <named list [2]> <named list [2]>
#> 3 <named list [2]> <named list [2]>
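
The train and test columns are list-columns, so the individual splits can be extracted by row; the exact contents of each named list depend on the package’s internal representation:

# Store the splits and inspect the first one
folds <- k_fold_cross(df, k = 3)
folds$train[[1]]   # training portion of the first split
folds$test[[1]]    # held-out portion of the first split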

ols_KCV

This function performs k-fold cross-validation for ordinary least squares regression. In each of the k rounds, a linear regression is fitted on the training set and then evaluated on the test set to obtain a prediction error; finally, the mean squared error (MSE) is estimated from the k prediction errors.
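
In symbols (our notation, not from the package), if \(\mathrm{MSE}_i\) denotes the mean squared prediction error on the \(i\)-th held-out fold, the cross-validation estimate is the average over the k folds:

\[
\mathrm{MSE}_{\mathrm{CV}} \;=\; \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i ,
\qquad
\mathrm{MSE}_i \;=\; \frac{1}{n_i}\sum_{j \in \text{fold } i}\bigl(y_j - \hat{y}_j\bigr)^2 .
\]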

For instance, suppose k equals 5 and the dataset is the same df that we have been using throughout the document. Applying the function, we get the mean squared error (MSE) and its square root, the root mean squared error (RMSE).

ols_KCV(df, k = 5, "hours", c("score", "entertain_hours"))
#> 
#>  Model:           OLS              R-squared:   0.81 
#>  Df:              9               R.S-ajusted:   0.77 
#>  F-statistic:    19.64             prob(F-statistic): 0.0005213611 
#>  Resudual standard error: 1.71 
#>  ===================================================================== 
#>                 coefficients standard error t value  p(>|t|)  [ 0.025 0.975 ]
#> intercept           -11.9176        11.2453 -1.0598 0.316848 -37.3562 13.5210
#> score                 0.2595         0.1284  2.0210 0.074002  -0.0310  0.5500
#> entertain_hours      -0.9889         0.6831 -1.4477 0.181622  -2.5342  0.5564
#> 
#>  Model:           OLS              R-squared:   0.86 
#>  Df:              9               R.S-ajusted:   0.83 
#>  F-statistic:    28.56             prob(F-statistic): 0.0001266466 
#>  Resudual standard error: 1.72 
#>  ===================================================================== 
#>                 coefficients standard error t value  p(>|t|)  [ 0.025 0.975 ]
#> intercept           -15.2387        11.4248 -1.3338 0.215037 -41.0834 10.6060
#> score                 0.2971         0.1299  2.2871 0.048002   0.0032  0.5910
#> entertain_hours      -0.6107         0.6145 -0.9938 0.346286  -2.0008  0.7794
#> 
#>  Model:           OLS              R-squared:   0.87 
#>  Df:              9               R.S-ajusted:   0.84 
#>  F-statistic:    30.15             prob(F-statistic): 0.000102516 
#>  Resudual standard error: 1.55 
#>  ===================================================================== 
#>                 coefficients standard error t value  p(>|t|)  [ 0.025 0.975 ]
#> intercept            -9.1740        11.7285 -0.7822 0.454183 -35.7057 17.3577
#> score                 0.2243         0.1328  1.6890 0.125486  -0.0761  0.5247
#> entertain_hours      -0.8717         0.6194 -1.4073 0.192923  -2.2729  0.5295
#> 
#>  Model:           OLS              R-squared:   0.87 
#>  Df:              9               R.S-ajusted:   0.84 
#>  F-statistic:    29.28             prob(F-statistic): 0.0001149444 
#>  Resudual standard error: 1.65 
#>  ===================================================================== 
#>                 coefficients standard error t value  p(>|t|)  [ 0.025 0.975 ]
#> intercept           -19.0557        11.9193 -1.5987 0.144352 -46.0190  7.9076
#> score                 0.3384         0.1349  2.5085 0.033393   0.0332  0.6436
#> entertain_hours      -0.3790         0.6408 -0.5914 0.568807  -1.8286  1.0706
#> 
#>  Model:           OLS              R-squared:   0.83 
#>  Df:              9               R.S-ajusted:   0.79 
#>  F-statistic:    22.26             prob(F-statistic): 0.0003279207 
#>  Resudual standard error: 1.85 
#>  ===================================================================== 
#>                 coefficients standard error t value  p(>|t|)  [ 0.025 0.975 ]
#> intercept           -11.3934        13.6936 -0.8320 0.426933 -42.3705 19.5837
#> score                 0.2535         0.1581  1.6034 0.143310  -0.1041  0.6111
#> entertain_hours      -0.8356         0.6863 -1.2175 0.254372  -2.3881  0.7169
#> [1] 1.138722

L1 and L2 regularization

When a dataset contains many variables, it can suffer from multicollinearity, i.e. correlation among the independent variables included in the linear regression model, which increases the model’s complexity and can cause overfitting. L1 and L2 regularization are methods that create a less complex model and prevent overfitting by adding a penalty term to the model’s loss function. Both techniques add bias to the estimators, but the resulting estimates tend to have a smaller mean squared error and to be more stable than OLS estimates.

L1 regularization is known as Lasso regression, and L2 as Ridge regression. The main difference between them is the penalty term.

Ridge regression adds the squared magnitude of the coefficients as the penalty term to the loss function, where \(\lambda\) is a real number chosen by the user.
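
In standard notation (the package’s internal scaling may differ), the ridge loss is the residual sum of squares plus \(\lambda\) times the sum of squared coefficients:

\[
L_{\mathrm{ridge}}(\beta) \;=\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\beta_j^2 .
\]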

Meanwhile, in Lasso regression the penalty term is instead the absolute magnitude of the coefficients.
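
In the same notation, the lasso loss replaces the squared penalty with the sum of absolute values:

\[
L_{\mathrm{lasso}}(\beta) \;=\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert .
\]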

As a consequence, L2 regularization tends to shrink coefficients toward zero while keeping all features in the model, although with smaller weights relative to one another. L1 regularization, on the other hand, drives some coefficients to exactly zero; these zero coefficients are considered irrelevant and do not contribute to the model’s prediction.
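
This sparsity is what the ISTA algorithm mentioned in the introduction exploits: each iteration takes a gradient step on the least-squares part of the loss (scaled by one half) and then applies the soft-thresholding (proximal) operator, which maps small coefficients exactly to zero. In standard notation (not specific to this package), with step size \(\eta\):

\[
S_{\lambda}(z) \;=\; \operatorname{sign}(z)\,\max\bigl(\lvert z\rvert - \lambda,\, 0\bigr),
\qquad
\beta^{(t+1)} \;=\; S_{\eta\lambda}\!\Bigl(\beta^{(t)} - \eta\,X^{\top}\bigl(X\beta^{(t)} - y\bigr)\Bigr) .
\]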

ridge

Suppose we have a data frame named test. We generate the coefficients of a ridge regression with the function \(\color{blue}{ridge}\), given some values of lambda.

library("glmnet")
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8
data("QuickStartExample")
test <- as.data.frame(cbind(QuickStartExample$y, QuickStartExample$x))
ridge(data = test, y = "V1", x = colnames(test)[2:21], lambda = c(0.01, 0.1))
#>                   0.01          0.1
#> intercept  0.109116513  0.109551940
#> V2         1.380959646  1.379949878
#> V3         0.025036976  0.025228001
#> V4         0.767404571  0.766632886
#> V5         0.066720714  0.066302816
#> V6        -0.905882112 -0.905021421
#> V7         0.618372569  0.618237227
#> V8         0.124494060  0.124512059
#> V9         0.401019100  0.400720930
#> V10       -0.036571552 -0.036710140
#> V11        0.136478104  0.136012870
#> V12        0.251566242  0.251290431
#> V13       -0.069907751 -0.069857280
#> V14       -0.049381966 -0.049257632
#> V15       -1.163915499 -1.162992110
#> V16       -0.147286747 -0.146860043
#> V17       -0.051541308 -0.051267927
#> V18       -0.055874048 -0.055604583
#> V19        0.057081673  0.057089484
#> V20       -0.006411382 -0.006304197
#> V21       -1.148370996 -1.146910033
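
Since glmnet is already loaded for the example data, a rough cross-check is possible. Note that glmnet standardizes the predictors and scales its loss differently by default, so its ridge coefficients at the same lambda are not expected to match ridge() exactly:

# Ridge fit with glmnet (alpha = 0); differences from ridge() are expected
# because of glmnet's default standardization and loss scaling.
fit <- glmnet(QuickStartExample$x, QuickStartExample$y, alpha = 0, lambda = 0.1)
coef(fit)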