This is the introduction to the package “linearreg”, which is used to build linear regression models such as OLS (Ordinary Least Squares) regression, Ridge regression and Lasso regression, the latter implemented through the ISTA algorithm. Moreover, the package also contains functions that perform K-fold cross-validation, a helpful procedure for choosing lambda values.
Ordinary Least Squares (OLS) regression is one of the most common and simple techniques for estimating the parameters of a linear regression model. It finds the line that best fits the observations by minimizing the sum of squared errors between the predicted and observed values.
However, it has some disadvantages: since it is sensitive to outliers, it may not be appropriate for datasets containing extreme values. Furthermore, overfitting and multicollinearity are two other factors that undermine the model’s credibility. For these reasons, OLS regression is not recommended for high-dimensional data.
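To make the idea concrete, here is a minimal base-R sketch (not the package’s internal code) of the closed-form OLS solution obtained from the normal equations; the vectors y and X below are illustrative toy values only.
# Minimal sketch of OLS via the normal equations: beta_hat = (X'X)^(-1) X'y
y <- c(64, 66, 76, 73, 74)                            # toy response values
X <- cbind(intercept = 1, hours = c(1, 2, 4, 5, 5))   # design matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))      # solve X'X b = X'y
beta_hat                                              # coefficients minimizing the sum of squared errors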
By default, the function \(\color{blue}{ols}\) generates a table that provides information about an ordinary least squares regression; with verbose = FALSE only the estimated coefficients are printed. Here is an example below:
Given a dataset named df with dependent variable score and independent variables hours and entertain_hours, we get the following coefficient estimates:
df = data.frame("hours"=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11, 12, 12, 14),
"score"=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 84, 82, 91, 93, 89),
"entertain_hours"=c(6,5,3,2,2,2,1,1,0.5,1,0.3,0.3,0.2,0.2,0.1))
ols(df,"score",c("hours","entertain_hours"),alpha = 0.025,verbose = FALSE)
#> score
#> intercept 74.1312
#> hours 1.2277
#> entertain_hours -1.8668
When the full table is printed, the upper part gives general information about the regression model, where:
model: the current linear regression model
Df: degrees of freedom
F-statistic: a statistical measure that indicates whether at least one independent variable explains the variation in the dependent variable; if so, the regression model is significant. The larger the F-statistic, the lower the probability of a Type-I error.
R-squared: the coefficient of determination, i.e. the ratio of the sum of squares due to regression (SSreg) to the total sum of squares (SYY):
\(R^2 = \frac{SSreg}{SYY}\)
This number summarizes the strength of the relation between the x and the y in the data. For instance, \(R^2=0.84\) means that about 84% of the variability in the observed values is explained by the model (see the sketch after this list).
R.S-ajusted: a modified version of \(R^2\) that takes into account the number of independent variables in the model, since the value of \(R^2\) generally increases when independent variables are added, even if they contribute little to the explanation of the dependent variable. This can be misleading and make the model appear to fit better than it really does. The adjusted \(R^2\) penalizes the addition of less relevant independent variables to the model.
prob(F-statistic): the p-value of the F test; when it is lower than an established threshold (generally 0.05), we reject the null hypothesis that all the regression coefficients are equal to 0; in other words, we may consider that there is a relationship between X and Y.
Residual standard error: a measure of the dispersion of the observed values around the predicted values.
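As a concrete illustration, the following base-R sketch (using lm for reference, not the package’s internal code; the object names are illustrative) reproduces these summary quantities for the df example above with the formulas given in this list.
fit   <- lm(score ~ hours + entertain_hours, data = df)   # reference fit with base R
n     <- nrow(df); p <- 2                      # number of observations and predictors
SYY   <- sum((df$score - mean(df$score))^2)    # total sum of squares
RSS   <- sum(residuals(fit)^2)                 # residual sum of squares
SSreg <- SYY - RSS                             # sum of squares due to regression
R2     <- SSreg / SYY                          # R-squared
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)          # adjusted R-squared
F_stat <- (SSreg / p) / (RSS / (n - p - 1))             # F-statistic with (p, n - p - 1) df
p_F    <- pf(F_stat, p, n - p - 1, lower.tail = FALSE)  # prob(F-statistic)
RSE    <- sqrt(RSS / (n - p - 1))              # residual standard error
c(R2 = R2, R2_adj = R2_adj, F = F_stat, p = p_F, RSE = RSE)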
Below these lines, the table gives information about the coefficients, where:
coefficients: the estimated value of each coefficient
standard error: the standard error of each coefficient
t value: a statistic computed from the t-test with null hypothesis \(\beta_i = 0\), i.e. that the i-th coefficient does not contribute to the explanation of the Y values.
p(>|t|): the p-value obtained from the t-test of each coefficient; when it is lower than a pre-fixed threshold (0.025 in the general case), we reject the null hypothesis and accept the alternative one.
confidence interval: the confidence interval of each coefficient given a value of alpha, which is 0.025 by default (a short base-R sketch of these calculations follows this list).
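Here is a minimal base-R sketch (again using lm for illustration, not the package’s internals; the object names are illustrative) of how the per-coefficient quantities can be reproduced for the df example, with the default alpha = 0.025.
fit    <- lm(score ~ hours + entertain_hours, data = df)
est    <- coef(fit)                            # estimated coefficients
se     <- sqrt(diag(vcov(fit)))                # standard error of each coefficient
df_res <- df.residual(fit)                     # residual degrees of freedom
t_val  <- est / se                             # t value for H0: beta_i = 0
p_val  <- 2 * pt(abs(t_val), df_res, lower.tail = FALSE)   # p(>|t|)
alpha  <- 0.025
ci     <- cbind(est - qt(1 - alpha, df_res) * se,          # lower bound (0.025)
                est + qt(1 - alpha, df_res) * se)          # upper bound (0.975)
cbind(estimate = est, se = se, t = t_val, p = p_val, ci)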
In some cases we face situations where we have to compare linear models and choose the best-fitting one. The most popular method for estimating prediction error is cross-validation.
K-fold cross-validation is a method that divides the dataset into k folds; each fold serves in turn as the test set while the remaining k-1 folds form the training set.
The following function \(\color{blue}{k\_fold\_cross}\) creates k different versions of the dataset, each consisting of a training set (the remaining k-1 subsets) and one test set.
df = data.frame("hours"=c(1, 2, 4, 5, 5, 6, 6, 7, 8, 10, 11, 11, 12, 12, 14),
"score"=c(64, 66, 76, 73, 74, 81, 83, 82, 80, 88, 84, 82, 91, 93, 89),
"entertain_hours"=c(6,5,3,2,2,2,1,1,0.5,1,0.3,0.3,0.2,0.2,0.1))
k_fold_cross(df,k=3)
#> # A tibble: 3 × 2
#> train test
#> <list> <list>
#> 1 <named list [2]> <named list [2]>
#> 2 <named list [2]> <named list [2]>
#> 3 <named list [2]> <named list [2]>
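For intuition, here is a minimal base-R sketch (not the package’s internal code; the names k, folds and splits are illustrative) of the idea behind such a split: each row is assigned to one of the k folds, and every fold serves once as the test set while the remaining rows form the training set.
k      <- 3
folds  <- sample(rep(1:k, length.out = nrow(df)))   # random fold label for each row
splits <- lapply(1:k, function(i) {
  list(train = df[folds != i, ], test = df[folds == i, ])
})
str(splits[[1]], max.level = 2)   # first version: one training set and one test set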
The function \(\color{blue}{ols\_KCV}\) performs k-fold cross-validation for ordinary least squares regression. In each of the k iterations, a linear regression model is fitted on the training set and then evaluated on the test set to compute the prediction error; finally, the mean squared error (MSE) is estimated from the k prediction errors.
For instance, suppose k equals 5 and the dataset is the same df that we have been using throughout this document. Applying the function, we get the mean squared error (MSE) and its square root, the root mean squared error (RMSE).
ols_KCV(df,k=5,"hours",c("score","entertain_hours"))
#>
#> Model: OLS R-squared: 0.81
#> Df: 9 R.S-ajusted: 0.77
#> F-statistic: 19.64 prob(F-statistic): 0.0005213611
#> Resudual standard error: 1.71
#> =====================================================================
#> coefficients standard error t value p(>|t|) [ 0.025 0.975 ]
#> intercept -11.9176 11.2453 -1.0598 0.316848 -37.3562 13.5210
#> score 0.2595 0.1284 2.0210 0.074002 -0.0310 0.5500
#> entertain_hours -0.9889 0.6831 -1.4477 0.181622 -2.5342 0.5564
#>
#> Model: OLS R-squared: 0.86
#> Df: 9 R.S-ajusted: 0.83
#> F-statistic: 28.56 prob(F-statistic): 0.0001266466
#> Resudual standard error: 1.72
#> =====================================================================
#> coefficients standard error t value p(>|t|) [ 0.025 0.975 ]
#> intercept -15.2387 11.4248 -1.3338 0.215037 -41.0834 10.6060
#> score 0.2971 0.1299 2.2871 0.048002 0.0032 0.5910
#> entertain_hours -0.6107 0.6145 -0.9938 0.346286 -2.0008 0.7794
#>
#> Model: OLS R-squared: 0.87
#> Df: 9 R.S-ajusted: 0.84
#> F-statistic: 30.15 prob(F-statistic): 0.000102516
#> Resudual standard error: 1.55
#> =====================================================================
#> coefficients standard error t value p(>|t|) [ 0.025 0.975 ]
#> intercept -9.1740 11.7285 -0.7822 0.454183 -35.7057 17.3577
#> score 0.2243 0.1328 1.6890 0.125486 -0.0761 0.5247
#> entertain_hours -0.8717 0.6194 -1.4073 0.192923 -2.2729 0.5295
#>
#> Model: OLS R-squared: 0.87
#> Df: 9 R.S-ajusted: 0.84
#> F-statistic: 29.28 prob(F-statistic): 0.0001149444
#> Resudual standard error: 1.65
#> =====================================================================
#> coefficients standard error t value p(>|t|) [ 0.025 0.975 ]
#> intercept -19.0557 11.9193 -1.5987 0.144352 -46.0190 7.9076
#> score 0.3384 0.1349 2.5085 0.033393 0.0332 0.6436
#> entertain_hours -0.3790 0.6408 -0.5914 0.568807 -1.8286 1.0706
#>
#> Model: OLS R-squared: 0.83
#> Df: 9 R.S-ajusted: 0.79
#> F-statistic: 22.26 prob(F-statistic): 0.0003279207
#> Resudual standard error: 1.85
#> =====================================================================
#> coefficients standard error t value p(>|t|) [ 0.025 0.975 ]
#> intercept -11.3934 13.6936 -0.8320 0.426933 -42.3705 19.5837
#> score 0.2535 0.1581 1.6034 0.143310 -0.1041 0.6111
#> entertain_hours -0.8356 0.6863 -1.2175 0.254372 -2.3881 0.7169
#> [1] 1.138722
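The following base-R sketch (using lm, not the package’s internals; the object names are illustrative) shows the loop just described for the same call, i.e. with hours as the dependent variable:
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(df)))   # assign each row to a fold
errors <- sapply(1:k, function(i) {
  fit  <- lm(hours ~ score + entertain_hours, data = df[folds != i, ])  # fit on the training set
  pred <- predict(fit, newdata = df[folds == i, ])                      # predict on the test set
  mean((df$hours[folds == i] - pred)^2)                                 # prediction error of fold i
})
mse  <- mean(errors)   # mean squared error over the k folds
sqrt(mse)              # its square root (RMSE)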
When a dataset has many variables, multicollinearity (i.e. correlation among the independent variables included in our linear regression model) can increase the model’s complexity and lead to overfitting. L1 and L2 regularization are methods that create a less complex model and prevent overfitting by adding a penalty term to the loss function of the model. Both techniques add bias to the estimators, but the resulting estimators tend to have a smaller mean squared error and to be more stable than the OLS estimators.
L1 regularization is known as Lasso regression, and L2 as Ridge regression. The main difference between them is the penalty term.
In Ridge regression, the squared magnitude of the coefficients is added as the penalty term to the loss function, \(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\), where \(\lambda\) is a real number chosen by ourselves.
Meanwhile, in Lasso regression the penalty term is instead the absolute magnitude of the coefficients, giving \(\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\).
As a consequence, L2 regularization tends to shrink the coefficients toward zero and keeps all the features in the model, although with smaller weights relative to one another. In contrast, L1 regularization drives some coefficients to exactly zero; these zero coefficients are considered irrelevant and do not contribute to the model’s prediction.
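To make the contrast concrete, here is a minimal sketch (not the package’s internal code; the simulated data, the object names and the choice of lambda are illustrative) that computes ridge coefficients with the closed-form solution and lasso coefficients with a plain ISTA loop.
set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))          # standardized predictors
beta_true <- c(3, 0, 0, -2, 0)
y <- drop(X %*% beta_true + rnorm(n))
y <- y - mean(y)                                # center y so the intercept can be ignored

lambda <- 1

# Ridge: closed-form solution (X'X + lambda I)^(-1) X'y
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# Lasso via ISTA: gradient step on (1/2) * sum of squared errors, then soft-thresholding
soft <- function(z, a) sign(z) * pmax(abs(z) - a, 0)
step <- 1 / max(eigen(crossprod(X))$values)     # step size bounded by the Lipschitz constant
beta_lasso <- rep(0, p)
for (i in 1:500) {
  grad <- -crossprod(X, y - X %*% beta_lasso)   # gradient of (1/2) * sum of squared errors
  beta_lasso <- soft(beta_lasso - step * grad, step * lambda)
}

cbind(ridge = drop(beta_ridge), lasso = drop(beta_lasso))
# ridge shrinks every coefficient; lasso sets some of them exactly to zero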
Suppose we have a dataframe named test; we generate the coefficients of a ridge regression with the function \(\color{blue}{ridge}\), given some values of lambda.
library("glmnet")
#> Loading required package: Matrix
#> Loaded glmnet 4.1-8
data("QuickStartExample")
test<-as.data.frame(cbind(QuickStartExample$y,QuickStartExample$x))
ridge(data=test,y="V1",x=colnames(test)[2:21],lambda=c(0.01,0.1))
#> 0.01 0.1
#> intercept 0.109116513 0.109551940
#> V2 1.380959646 1.379949878
#> V3 0.025036976 0.025228001
#> V4 0.767404571 0.766632886
#> V5 0.066720714 0.066302816
#> V6 -0.905882112 -0.905021421
#> V7 0.618372569 0.618237227
#> V8 0.124494060 0.124512059
#> V9 0.401019100 0.400720930
#> V10 -0.036571552 -0.036710140
#> V11 0.136478104 0.136012870
#> V12 0.251566242 0.251290431
#> V13 -0.069907751 -0.069857280
#> V14 -0.049381966 -0.049257632
#> V15 -1.163915499 -1.162992110
#> V16 -0.147286747 -0.146860043
#> V17 -0.051541308 -0.051267927
#> V18 -0.055874048 -0.055604583
#> V19 0.057081673 0.057089484
#> V20 -0.006411382 -0.006304197
#> V21 -1.148370996 -1.146910033
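Finally, since K-fold cross-validation is a common way to choose lambda (as mentioned in the introduction), here is a short sketch using glmnet’s cv.glmnet on the same data (alpha = 0 corresponds to ridge). This uses glmnet rather than the package’s own cross-validation helpers, and the resulting value is only a candidate to pass to ridge() as lambda.
set.seed(42)
cv_fit <- cv.glmnet(QuickStartExample$x, QuickStartExample$y, alpha = 0, nfolds = 5)
cv_fit$lambda.min   # lambda with the smallest cross-validated error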