knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(CLLPackage)

The purpose of this package is to provide functions to estimate the regression coefficients of the linear model $$Y = X\beta+\varepsilon$$ where $X$ is the design matrix, $Y$ the response vector, $\beta$ the vector of coefficients and $\varepsilon$ a normally distributed random noise. The estimation is performed by minimizing the loss function $$L(\beta)=||X\beta-Y||_2^2$$

Two different iterative minimization methods are available in the package: the $\textit{gradient descent}$ and the $\textit{steepest descent}$ methods. Moreover, a $k$-fold cross-validation procedure is provided, using the MSE (Mean Squared Error) as performance index. Sequential and parallel versions are implemented for this task.

Gradient descent method

The gradient descent algorithm proceeds as follows:

  1. The initial point $\hat{\beta}(0)$ is chosen.

  2. For $t=0,1,\ldots$, the new iterate $\hat{\beta}(t+1)$ is obtained by the rule: $$\hat{\beta}(t+1) = \hat{\beta}(t) -\gamma \nabla L(\hat{\beta}(t)) $$ where $\gamma>0$ is the $\textit{stepsize}$ (chosen by the user).

  3. Stop the procedure when the stopping conditions are satisfied. In the package, two criteria are implemented: the maximum number of iterations $\textit{maxit}$ and a condition on non-significant improvements of the estimate, i.e. $\max_i|\hat{\beta}_i(t+1) - \hat{\beta}_i(t)|< \textit{tol}$, where $\textit{tol}$ is a pre-specified tolerance. A minimal sketch of this loop is given below.
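
In particular, the following minimal base-R sketch illustrates steps 2 and 3. It assumes the standard least-squares gradient $\nabla L(\beta) = 2X^T(X\beta-Y)$ and is only an illustration: the actual internals of $\texttt{linear_gd_optim()}$ may differ.

# Illustrative sketch of the gradient descent loop (not the package implementation)
gd_sketch <- function(beta0, X, y, stepsize, maxit, tol) {
  beta <- beta0
  for (t in seq_len(maxit)) {
    grad <- drop(2 * crossprod(X, X %*% beta - y))  # gradient of ||X beta - y||^2
    beta_new <- beta - stepsize * grad              # gradient descent update
    if (max(abs(beta_new - beta)) < tol) {          # stop on non-significant improvement
      return(beta_new)
    }
    beta <- beta_new
  }
  beta
}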

The following is an example of the use of the function $\texttt{linear_gd_optim()}$. Note that the performance of the algorithm strongly depends on the choice of the parameters $\textit{tol}$, $\textit{maxit}$ and $\gamma$. In this example, the parameters are chosen appropriately; some considerations on this issue are discussed later. Comparing the estimates obtained with the $\textit{gradient descent}$ method and the ones computed by the R function $\texttt{lm()}$, it is clear that the estimates are quite similar.

set.seed(8675309)
n <- 10000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5*x1 + 0.2*x2 + rnorm(n)

beta0 <- rnorm(3, 2, 1)
X <- cbind(rep(1, n), x1, x2)

# beta estimated with the gradient descent method
(beta_estimated_gd <- linear_gd_optim(beta0, X, y, maxit = 1000, stepsize = 1e-5, tol = 1e-5))

# beta estimated with OLS
(beta_min <- lm(y ~ x1 + x2)$coeff)

Steepest descent method

The steepest descent algorithm proceeds as follows:

  1. The initial point $\hat{\beta}(0)$ is chosen.

  2. For $t=0,1,\ldots$, the new iterate $\hat{\beta}(t+1)$ is obtained by the rule: $$\hat{\beta}(t+1) = \hat{\beta}(t) -\text{step}(t) \cdot \nabla L(\hat{\beta}(t)) $$ where, denoting with $H(\hat{\beta}(t)) = 4X^T X$ the Hessian of $L$, $$\text{step}(t) = \frac{||\nabla L(\hat{\beta}(t))||^2}{\nabla L(\hat{\beta}(t))^T\, H(\hat{\beta}(t)) \,\nabla L(\hat{\beta}(t))} $$

  3. Stop the procedure when the stopping conditions are satisfied. Similarly to the gradient descent algorithm, two criteria are implemented: the maximum number of iterations $\textit{maxit}$ and a condition on non-significant improvements of the estimate, i.e. $\max_i|\hat{\beta}_i(t+1) - \hat{\beta}_i(t)|< \textit{tol}$, where $\textit{tol}$ is a pre-specified tolerance.

Unlike the gradient descent method, where the $\textit{stepsize}$ is chosen by the user and kept fixed during the whole procedure, here the stepsize $\text{step}(t)$ is computed automatically and varies along the iterations.
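
For illustration, a single steepest descent update can be sketched in base R as follows, using the Hessian matrix $4X^T X$ given in the formula above and, as before, the standard least-squares gradient. This is only a sketch: the actual internals of $\texttt{linear_sd_optim()}$ may differ.

# Illustrative sketch of one steepest descent update (not the package implementation)
sd_update_sketch <- function(beta, X, y) {
  grad <- drop(2 * crossprod(X, X %*% beta - y))        # gradient of the loss at beta
  H <- 4 * crossprod(X)                                 # Hessian matrix as in the formula above
  step_t <- sum(grad^2) / drop(t(grad) %*% H %*% grad)  # step(t)
  beta - step_t * grad                                  # new iterate
}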

The following is an example of the use of the function $\texttt{linear_sd_optim()}$. Comparing the estimates obtained with the $\textit{steepest descent}$ method and the ones computed by the R function $\texttt{lm()}$, it is clear that the estimates are quite similar.

# beta estimated with the steepest descent method
(beta_estimated_sd <- linear_sd_optim(beta0, X, y, maxit = 1000, tol = 1e-4))

# beta estimated with OLS
(beta_min <- lm(y ~ x1 + x2)$coeff)

Comparison between the gradient and steepest descent methods

As already pointed out, the difference between the two methods lies in the choice of the $\textit{stepsize}$, which modulates the step length along the direction identified by the gradient: in the $\textit{gradient descent}$ algorithm it is chosen by the user and remains fixed for the whole iterative process, while in the $\textit{steepest descent}$ method it is automatically computed by the algorithm from the current estimate of $\beta$, and should therefore better exploit the geometry of the problem.

Two main issues can be encountered when running the $\textit{gradient descent}$ method:

  1. If the pre-specified $\textit{stepsize}$ $\gamma$ is too large for the problem, the algorithm might never converge to the minimum, since it keeps bouncing around the optimal solution and may even diverge from it. In the following example, the error does not decrease along the iterations and eventually reaches $\texttt{Inf}$.
beta0 <- c(0, 0, 0)
# beta estimated with the gradient descent method
beta_estimated_gd <- linear_gd_optim(beta0, X, y, maxit = 10000, stepsize = 1e-2, tol = 1e-2)
  2. For good performance of the algorithm, it is important to choose values of $\textit{stepsize}$ and $\textit{tol}$ of comparable orders of magnitude. If the $\textit{stepsize}$ is too small with respect to the tolerance, the algorithm might stop too early, having no chance to reach the solution of the minimization problem. In the following example, comparing the estimates provided by $\texttt{linear_gd_optim()}$ with the ones provided by $\texttt{lm()}$, it is clear that the $\textit{gradient descent}$ method has not reached the minimum point.
beta0 <- c(0, 0, 0)
X <- cbind(rep(1, n), x1, x2)

# beta estimated with the gradient descent method
(beta_estimated_gd <- linear_gd_optim(beta0, X, y, maxit = 10000, stepsize = 1e-6, tol = 1e-2))

# beta estimated with OLS
(beta_min <- lm(y ~ x1 + x2)$coeff)

The $\textit{steepest descent}$ method avoids both issues by computing the $\textit{stepsize}$ at each iteration, taking into account the current estimate of $\beta$.

Cross-validation

The package includes a sequential ($\texttt{kfold_cv_seq()}$) and a parallel ($\texttt{kfold_cv_parallel()}$) implementation of the $k$-fold cross-validation procedure, which return an estimate of the Mean Squared Error (MSE) of the predictions of the linear model, where $\beta$ is estimated via the steepest or gradient descent method, according to the choice specified by the user (the default is steepest descent).
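
Conceptually, the sequential version performs something along the following lines. The sketch below is only illustrative, assumes that the optimizer returns the estimated coefficient vector, and is not the actual implementation of $\texttt{kfold_cv_seq()}$.

# Conceptual sketch of sequential k-fold cross-validation (not the package implementation)
kfold_cv_sketch <- function(X, y, k, beta0 = rep(0, ncol(X))) {
  folds <- cut(seq_along(y), breaks = k, labels = FALSE)   # assign each observation to a fold
  mse <- numeric(k)
  for (i in seq_len(k)) {
    test <- which(folds == i)
    # estimate beta on the training folds with the steepest descent optimizer
    beta_hat <- linear_sd_optim(beta0, X[-test, , drop = FALSE], y[-test],
                                maxit = 1000, tol = 1e-4)
    pred <- X[test, , drop = FALSE] %*% beta_hat           # predictions on the test fold
    mse[i] <- mean((y[test] - pred)^2)                     # MSE on the test fold
  }
  mean(mse)  # average MSE over the k folds
}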

In the following example, the MSE is computed via $\texttt{kfold_cv_seq()}$, $\texttt{kfold_cv_parallel()}$ and $\texttt{kfold_cv_lm_seq()}$, a function that returns the MSE computed by performing $k$-fold cross-validation with the R function $\texttt{lm()}$. It can be observed that the three values are similar. Moreover, for a sufficiently large number of folds $k$, the execution time of $\texttt{kfold_cv_seq()}$ is longer than that of $\texttt{kfold_cv_parallel()}$, as expected: the parallel version starts to gain in terms of execution time when the fixed costs of creating the cluster and distributing the jobs to the workers are compensated.

library(tictoc)
# Sequential CV MSE
tic()
(MSE_seq <- kfold_cv_seq(X, y, k = 1000))
toc()

# Parallel CV MSE
tic()
(MSE_parallel <- kfold_cv_parallel(X, y, k = 1000))
toc()

# lm MSE
(MSE <- kfold_cv_lm_seq(X = X, y = y, k = 1000))

To implement the parallel version, the $\texttt{doSNOW}$ library has been used and a map-reduce paradigm has been adopted: the $\textit{map}$ step includes the creation of the training and test sets, the estimation of $\beta$ on the training set, the prediction of the responses on the test set and, finally, the computation of the MSE; the $\textit{reduce}$ step computes the average of the $k$ MSE values obtained during the map step.
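
The general pattern could look like the following simplified sketch, which mirrors the sequential sketch shown earlier: the map step computes a per-fold MSE on a worker and the reduce step averages the $k$ values. It is only an illustration of the $\texttt{doSNOW}$ and $\texttt{foreach}$ map-reduce structure, not the actual code of $\texttt{kfold_cv_parallel()}$.

# Simplified sketch of the map-reduce pattern (not the package implementation)
library(doSNOW)
library(foreach)

cl <- makeCluster(2)   # small cluster of workers
registerDoSNOW(cl)

k <- 10
folds <- cut(seq_along(y), breaks = k, labels = FALSE)

# map: one MSE per fold, computed in parallel on the workers
fold_mse <- foreach(i = seq_len(k), .combine = c, .packages = "CLLPackage") %dopar% {
  test <- which(folds == i)
  beta_hat <- linear_sd_optim(rep(0, ncol(X)), X[-test, , drop = FALSE], y[-test],
                              maxit = 1000, tol = 1e-4)
  pred <- X[test, , drop = FALSE] %*% beta_hat
  mean((y[test] - pred)^2)
}

# reduce: average the k fold-specific MSE values
mean(fold_mse)

stopCluster(cl)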


