knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(CLLPackage)
The purpose of this package is to provide functions to estimate the regression coefficients of the linear model $$Y = X\beta+\varepsilon,$$ where $X$ is the design matrix, $Y$ is the response vector, $\beta$ is the vector of coefficients and $\varepsilon$ is normal random noise. The estimation is performed by minimizing the loss function $$L(\beta)=||X\beta-Y||_2^2.$$
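For later reference, both minimization methods described below rely on the gradient of this quadratic loss. A standard computation (reported here only for convenience, it is not part of the package documentation) gives $$\nabla L(\beta) = 2X^T(X\beta - Y), \qquad \nabla^2 L(\beta) = 2X^T X.$$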
Two different iterative minimization methods are available in the package, the $\textit{gradient descent}$ and the $\textit{steepest descent}$ methods. Moreover, a $k$-fold cross-validation procedure is provided, using the MSE (Mean Squared Error) as performance index; both a sequential and a parallel version are implemented for this task.
The gradient descent algorithm proceeds as follows:
1. The initial point $\hat{\beta}(0)$ is chosen.
2. For $t=0,1,\ldots$, the new iterate $\hat{\beta}(t+1)$ is obtained by the rule: $$\hat{\beta}(t+1) = \hat{\beta}(t) -\gamma \nabla L(\hat{\beta}(t)), $$ where $\gamma>0$ is the $\textit{stepsize}$ (chosen by the user).
3. The procedure stops when the stopping conditions are satisfied. In the package, two criteria are implemented: the maximum number of iterations $\textit{maxit}$ and a condition on non-significant improvements of the estimate, i.e. $\max_i|\hat{\beta}_i(t+1) - \hat{\beta}_i(t)|< \textit{tol}$, where $\textit{tol}$ is a pre-specified tolerance (a plain R sketch of this iteration is given below).
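To make the steps above concrete, a minimal plain-R sketch of the iteration follows. This is only an illustration of the update rule and of the stopping criteria, not the actual implementation of $\texttt{linear_gd_optim()}$; the gradient $2X^T(X\beta - Y)$ derived above is used.

# Illustrative sketch of the gradient descent iteration (not the package code)
gd_sketch <- function(beta0, X, y, stepsize, maxit = 1000, tol = 1e-5) {
  beta <- beta0
  for (iter in seq_len(maxit)) {
    grad <- 2 * crossprod(X, X %*% beta - y)       # gradient of ||X beta - y||^2
    beta_new <- beta - stepsize * as.vector(grad)  # fixed-stepsize update
    if (max(abs(beta_new - beta)) < tol) {         # no significant improvement: stop
      return(beta_new)
    }
    beta <- beta_new
  }
  beta                                             # maxit reached
}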
The following is an example of the use of the function $\texttt{linear_gd_optim()}$. Note that the performance of the algorithm strongly depends on the choice of the parameters $\textit{tol}$, $\textit{maxit}$ and $\gamma$. In this example, the parameters are chosen properly; some considerations on this issue are discussed later. Comparing the estimates obtained with the $\textit{gradient descent}$ method with the ones computed by the $\texttt{lm()}$ R function, it is clear that the estimates are quite similar.
set.seed(8675309)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5*x1 + 0.2*x2 + rnorm(n)
beta0 <- rnorm(3, 2, 1)
X <- cbind(rep(1, n), x1, x2)

# beta estimated with the gradient descent method
(beta_estimated_gd <- linear_gd_optim(beta0, X, y, maxit = 1000, stepsize = 1e-5, tol = 1e-5))

# beta estimated with OLS
(beta_min <- lm(y ~ x1 + x2)$coeff)
The steepest descent algorithm proceeds as follows:
1. The initial point $\hat{\beta}(0)$ is chosen.
2. For $t=0,1,\ldots$, the new iterate $\hat{\beta}(t+1)$ is obtained by the rule: $$\hat{\beta}(t+1) = \hat{\beta}(t) -\text{step}(t) \cdot \nabla L(\hat{\beta}(t)), $$ where, denoting by $H(\hat{\beta}(t)) = 2X^T X$ the Hessian of $L$, $$\text{step}(t) = \frac{||\nabla L(\hat{\beta}(t))||^2}{\nabla L(\hat{\beta}(t))^T\, H(\hat{\beta}(t)) \,\nabla L(\hat{\beta}(t))}. $$
3. The procedure stops when the stopping conditions are satisfied. As in the gradient descent algorithm, two criteria are implemented: the maximum number of iterations $\textit{maxit}$ and a condition on non-significant improvements of the estimate, i.e. $\max_i|\hat{\beta}_i(t+1) - \hat{\beta}_i(t)|< \textit{tol}$, where $\textit{tol}$ is a pre-specified tolerance.
Unlike the gradient descent method, where the $\textit{stepsize}$ is chosen by the user and kept fixed during the whole procedure, here the stepsize $\text{step}(t)$ is computed automatically and varies along the iterations.
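As an illustration of this exact line-search rule (again a sketch, not the code of $\texttt{linear_sd_optim()}$), the stepsize of a single iteration can be computed as follows; for the quadratic loss the Hessian $2X^T X$ does not depend on $\beta$, so it can be computed once.

# Illustrative computation of step(t) at the current iterate (not the package code)
H <- 2 * crossprod(X)                       # Hessian of the quadratic loss, constant in beta
sd_step <- function(beta, X, y, H) {
  g <- 2 * crossprod(X, X %*% beta - y)     # gradient at the current iterate
  drop(crossprod(g) / (t(g) %*% H %*% g))   # ||g||^2 / (g' H g)
}

The update $\hat{\beta}(t+1) = \hat{\beta}(t) - \text{step}(t)\,\nabla L(\hat{\beta}(t))$ then proceeds exactly as in the gradient descent sketch above, with this adaptive stepsize in place of the fixed $\gamma$.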
The following is an example of the use of the function $\texttt{linear_sd_optim()}$. Comparing the estimates obtained with the $\textit{steepest descent}$ method with the ones computed by the $\texttt{lm()}$ R function, it is clear that the estimates are quite similar.
# beta estimated with the steepest descent method
(beta_estimated_sd <- linear_sd_optim(beta0, X, y, maxit = 1000, tol = 1e-4))

# beta estimated with OLS
(beta_min <- lm(y ~ x1 + x2)$coeff)
As already pointed out, the difference between the two methods lies in the choice of the $\textit{stepsize}$, which modulates the step length along the direction identified by the gradient: in the $\textit{gradient descent}$ algorithm it is chosen by the user and remains fixed for the whole iterative process, while in the $\textit{steepest descent}$ method it is computed automatically by the algorithm based on the current estimate of $\beta$, so it should better exploit the geometry of the problem.
Two main issues can be encountered when performing the $\textit{gradient descent}$ method, both related to the fixed $\textit{stepsize}$: if it is too large, the iterates may oscillate or diverge instead of approaching the minimum; if it is too small, the algorithm may need a very large number of iterations (possibly more than $\textit{maxit}$) to get close to the minimum. The two examples below illustrate these situations.
beta0 <- c(0, 0, 0)

# first issue: stepsize much larger than in the working example above
beta_estimated_gd <- linear_gd_optim(beta0, X, y, maxit = 10000, stepsize = 1e-2, tol = 1e-2)
beta0 <- c(0, 0, 0)
X <- cbind(rep(1, n), x1, x2)

# second issue: stepsize much smaller than in the working example above
(beta_estimated_gd <- linear_gd_optim(beta0, X, y, maxit = 10000, stepsize = 1e-6, tol = 1e-2))

# beta estimated with OLS
(beta_min <- lm(y ~ x1 + x2)$coeff)
The $\textit{steepest descent}$ method addresses both issues by computing the $\textit{stepsize}$ at each iteration, taking into account the current estimate of $\beta$.
The package includes a sequential ($\texttt{kfold_cv_seq()}$) and a parallel ($\texttt{kfold_cv_parallel()}$) implementation of the $k$-fold cross-validation procedure, which returns an estimate of the Mean Squared Error (MSE) of the predictions based on the linear model, where $\beta$ is estimated via the steepest or gradient descent method, according to the choice specified by the user (the default is steepest descent).
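The logic of the sequential version can be sketched as follows. This is an illustrative outline, not the actual body of $\texttt{kfold_cv_seq()}$: the fold assignment shown here is a plain random split, and $\beta$ is assumed to be estimated on each training set with $\texttt{linear_sd_optim()}$ (the default method), returning the coefficient vector as in the examples above.

# Illustrative outline of sequential k-fold CV (not the package code)
kfold_mse_sketch <- function(X, y, k, beta0 = rep(0, ncol(X))) {
  folds <- sample(rep(1:k, length.out = nrow(X)))        # random fold assignment
  fold_mse <- sapply(1:k, function(i) {
    train <- folds != i                                  # training / test split
    beta_hat <- linear_sd_optim(beta0, X[train, , drop = FALSE], y[train],
                                maxit = 1000, tol = 1e-4)
    pred <- X[!train, , drop = FALSE] %*% beta_hat       # predictions on the test fold
    mean((y[!train] - pred)^2)                           # MSE on the test fold
  })
  mean(fold_mse)                                         # average over the k folds
}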
In the following example, the MSE is computed via $\texttt{kfold_cv_seq()}$, $\texttt{kfold_cv_parallel()}$ and $\texttt{kfold_cv_lm_seq()}$, a function that returns the MSE computed by performing $k$-fold cross-validation with the R function $\texttt{lm()}$. It can be observed that the three values are similar. Moreover, for a sufficiently large number of folds $k$, the execution time of $\texttt{kfold_cv_seq()}$ is longer than that of $\texttt{kfold_cv_parallel()}$, as expected. In fact, the parallel version starts to gain in terms of execution time only when the fixed costs related to cluster creation and to splitting the jobs among the workers are compensated.
library(tictoc)

# Sequential CV MSE
tic()
(MSE_seq <- kfold_cv_seq(X, y, k = 1000))
toc()

# Parallel CV MSE
tic()
(MSE_parallel <- kfold_cv_parallel(X, y, k = 1000))
toc()

# lm MSE
(MSE <- kfold_cv_lm_seq(X = X, y = y, k = 1000))
To implement the parallel version, the $\texttt{doSNOW}$ library has been used and a map-reduce paradigm has been adopted: the $\textit{map}$ function includes the creation of the training and test sets, the estimation of $\beta$ on the training set, the prediction of the responses on the test set and, finally, the computation of the MSE. The $\textit{reduce}$ function computes the average of the $k$ values of the MSE obtained during the $\textit{map}$ step.
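For completeness, the map-reduce pattern just described can be sketched with $\texttt{doSNOW}$ and $\texttt{foreach}$ as follows. This is only an outline of the pattern, not the actual body of $\texttt{kfold_cv_parallel()}$: the number of workers, the fold assignment and the optimizer settings are arbitrary choices made for this example.

library(doSNOW)

cl <- makeCluster(2)                              # start a small cluster of workers
registerDoSNOW(cl)

k <- 10
folds <- sample(rep(1:k, length.out = nrow(X)))   # random fold assignment
beta0 <- rep(0, ncol(X))

# "map": compute one test-fold MSE per fold, in parallel on the workers
fold_mse <- foreach(i = 1:k, .combine = c, .packages = "CLLPackage") %dopar% {
  train <- folds != i
  beta_hat <- linear_sd_optim(beta0, X[train, , drop = FALSE], y[train],
                              maxit = 1000, tol = 1e-4)
  pred <- X[!train, , drop = FALSE] %*% beta_hat
  mean((y[!train] - pred)^2)
}

# "reduce": average the k fold-specific MSE values
mean(fold_mse)

stopCluster(cl)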