knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Quick Start

We will use the simulated datasets provided in the package, which can be loaded via:

library(BDcocolasso)
data("simulated_data_missing")
data("simulated_data_additive")
data("simulated_data_missing_block")
data("simulated_data_additive_block")

These datasets correspond to fully corrupted datasets in the additive error setting and the missing data setting, and to partially corrupted datasets in the same two settings. In the missing data setting, the datasets contain NA values; in the additive error setting, they contain no NA values, but the covariates are measured with additive error. Note that in the additive error setting it is essential to know the standard deviation of the additive error matrix in order to run the CoCoLasso algorithm.
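As a quick illustrative check (not part of the original workflow), we can inspect the dimensions of one dataset and the fraction of NA entries among its covariates:

dim(simulated_data_missing)
# Proportion of NA values among the covariates (the first column is the response)
mean(is.na(simulated_data_missing[, -1]))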

CoCoLasso algorithm

We first perform a classic CoCoLasso regression. Note that the datasets must be converted to matrices beforehand:

y_missing <- simulated_data_missing[, 1]
Z_missing <- as.matrix(simulated_data_missing[, -1])  # covariates (all columns but the response)
n_missing <- nrow(Z_missing)
p_missing <- ncol(Z_missing)
y_missing <- as.matrix(y_missing)

We can then fit the CoCoLasso model using our data. It is important to specify the type of noise (additive or missing) and whether to use CoCoLasso (block=FALSE) or BD-CoCoLasso (block=TRUE). Here, noise equals "missing" because the dataset contains NA values. Two types of penalty are available: the classic lasso penalty (penalty="lasso") and the SCAD penalty (penalty="SCAD"). The latter should lead to less bias in the solution when the signals are strong.

set.seed(1234567)
fit_missing.lasso <- coco(Z = Z_missing, y = y_missing, n = n_missing, p = p_missing, step = 100, K = 4, mu = 10, tau = NULL, etol = 1e-4, noise = "missing", block = FALSE, penalty = "lasso")
set.seed(1234567)
fit_missing.SCAD <- coco(Z = Z_missing, y = y_missing, n = n_missing, p = p_missing, step = 100, K = 4, mu = 10, tau = NULL, etol = 1e-4, noise = "missing", block = FALSE, penalty = "SCAD")

We can print the fitted object to display the evolution of the mean squared error as a function of lambda:

print(fit_missing.lasso)

It is also possible to display the coefficients obtained for all values of lambda, or for some specific values of lambda.

coef(fit_missing.lasso, s=fit_missing.lasso$lambda.opt)
coef(fit_missing.SCAD, s=fit_missing.SCAD$lambda.opt)
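To focus on the selected variables only, we can filter out the zero entries. This is a small sketch assuming coef() returns the coefficients as a vector or single-column matrix of length p:

# Indices of the non-zero (selected) coefficients at the optimal lambda (illustrative)
coefs.opt <- coef(fit_missing.lasso, s = fit_missing.lasso$lambda.opt)
which(as.vector(coefs.opt) != 0)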

It is also possible to obtain predictions for new covariates. Let's simulate values following the same pattern used to generate Z_missing and look at the resulting prediction. By default, prediction uses the coefficients corresponding to lambda.sd:

Sigma <- cov_autoregressive(p_missing)  # autoregressive covariance matrix
X <- MASS::mvrnorm(1, mu = rep(0, p_missing), Sigma = Sigma)
beta <- c(3, 2, 0, 0, 1.5, rep(0, p_missing - 5))
y <- X %*% beta + rnorm(1, 0, 2)

y

Let's compare the predictions obtained with the lasso penalty and with the SCAD penalty:

y_predict.sd.lasso <- predict(fit_missing.lasso, newx = X, type="response")
y_predict.sd.lasso
y_predict.sd.SCAD <- predict(fit_missing.SCAD, newx = X, type="response")
y_predict.sd.SCAD

We can see that in this case using the SCAD penalty leads to less bias.
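As an illustrative numeric check of this claim (assuming y still holds the simulated response from above), we can compare the absolute prediction errors:

# Absolute prediction error under each penalty (illustrative comparison)
abs(y - y_predict.sd.lasso)
abs(y - y_predict.sd.SCAD)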

It is also possible to specify the lambda value with which we wish to perform the prediction:

y_predict.opt <- predict(fit_missing.lasso, newx = X, type="response", lambda.pred = fit_missing.lasso$lambda.opt)
y_predict.opt

It is then possible to visualize the solution path of the coefficients:

BDcocolasso::plotCoef(fit_missing.lasso)
BDcocolasso::plotCoef(fit_missing.SCAD)

It is also possible to visualize the mean squared error for all values of lambda:

BDcocolasso::plotError(fit_missing.lasso)
BDcocolasso::plotError(fit_missing.SCAD)

The red points show the mean squared error (computed without a constant term, which is why negative values appear), and the black bars show the standard deviation of each error value. On each plot, the left dashed line marks the optimal lambda, while the right dashed line marks the lambda given by the one-standard-error rule.
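To see the numeric values behind the two dashed lines, we can print the corresponding lambda values (lambda.opt is used elsewhere in this vignette; the lambda.sd field name is an assumption):

fit_missing.lasso$lambda.opt  # optimal lambda (left dashed line)
fit_missing.lasso$lambda.sd   # one-standard-error lambda (right dashed line, assumed field name)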

We can do the same in the additive error setting, again fitting both the lasso and the SCAD penalty:

y_additive <- simulated_data_additive[, 1]
Z_additive <- as.matrix(simulated_data_additive[, -1])  # covariates measured with additive error
n_additive <- nrow(Z_additive)
p_additive <- ncol(Z_additive)
y_additive <- as.matrix(y_additive)

Let's fit a CoCoLasso model:

fit_additive.lasso <- coco(Z = Z_additive, y = y_additive, n = n_additive, p = p_additive, center.Z = FALSE, scale.Z = FALSE, step = 100, K = 4, mu = 10, tau = 0.3, etol = 1e-4, noise = "additive", block = FALSE, penalty = "lasso")
fit_additive.SCAD <- coco(Z = Z_additive, y = y_additive, n = n_additive, p = p_additive, center.Z = FALSE, scale.Z = FALSE, step = 100, K = 4, mu = 10, tau = 0.3, etol = 1e-4, noise = "additive", block = FALSE, penalty = "SCAD")

Here, we do not center Z, since centering could introduce estimation bias in the additive error setting. It is essential to know (from the literature, or by estimation) the tau parameter, which corresponds to the standard deviation of the error matrix; without it, the algorithm cannot run. In our simulated example, tau equals 0.3.
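As a reminder of what tau represents, here is a minimal sketch of the additive error model assumed in this setting: the observed matrix Z is the true design X plus i.i.d. Gaussian noise with standard deviation tau (the names X_true, A and Z_obs are purely illustrative):

tau <- 0.3
X_true <- matrix(rnorm(10 * 4), nrow = 10, ncol = 4)       # true covariates
A <- matrix(rnorm(10 * 4, sd = tau), nrow = 10, ncol = 4)  # additive measurement error
Z_obs <- X_true + A                                        # what we actually observe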

We can plot the coefficients and the mean squared error:

BDcocolasso::plotCoef(fit_additive.lasso)
BDcocolasso::plotError(fit_additive.lasso)

BDcocolasso::plotCoef(fit_additive.SCAD)
BDcocolasso::plotError(fit_additive.SCAD)

We cannot compare the results obtained in the additive error setting with those in the missing data setting, since Z was not scaled in the additive error setting. If the corrupted matrix were scaled in the additive error setting, the impact of that scaling on the error parameter tau used in the model would need to be taken into account.
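As a rough sketch of why this matters (an illustration, not package functionality): if column j of Z were divided by its standard deviation s_j, the additive error in that column would be divided by s_j as well, so the single scalar tau would have to be replaced by per-column values:

s <- apply(Z_additive, 2, sd)  # column standard deviations of the corrupted matrix
tau_scaled <- 0.3 / s          # per-column error sd after a hypothetical scaling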

Block-Descent CoCoLasso algorithm

We can fit the BDCoCoLasso model for both datasets using block=TRUE:

p1 <- 180  # number of uncorrupted covariates
p2 <- 20   # number of corrupted covariates
y_missing <- simulated_data_missing_block[, 1]
Z_missing <- as.matrix(simulated_data_missing_block[, -1])
n_missing <- nrow(Z_missing)
p_missing <- ncol(Z_missing)
y_missing <- as.matrix(y_missing)

fit_missing <- coco(Z = Z_missing, y = y_missing, n = n_missing, p = p_missing, p1 = p1, p2 = p2, step = 100, K = 4, mu = 10, tau = NULL, noise = "missing", block = TRUE, penalty = "lasso")

y_additive <- simulated_data_additive_block[, 1]
Z_additive <- as.matrix(simulated_data_additive_block[, -1])
n_additive <- nrow(Z_additive)
p_additive <- ncol(Z_additive)
y_additive <- as.matrix(y_additive)

fit_additive <- coco(Z = Z_additive, y = y_additive, n = n_additive, p = p_additive, p1 = p1, p2 = p2, center.Z = FALSE, scale.Z = FALSE, step = 100, K = 4, mu = 10, tau = 0.3, noise = "additive", block = TRUE, penalty = "lasso")

Note that the algorithm requires the first p1 columns of Z to be the uncorrupted covariates and the last p2 columns to be the corrupted ones. It is also important to keep in mind that this algorithm has a relatively high computational cost: in the example above, the code may run for a couple of minutes before returning a result, and when the number of features reaches one thousand, the memory demand becomes high.
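If the corrupted columns were not already last, they could be moved there before calling coco. This sketch identifies corrupted columns by the presence of NAs, which is appropriate for the missing data setting only (names are illustrative):

corrupted_cols <- which(colSums(is.na(Z_missing)) > 0)           # columns containing NAs
clean_cols <- setdiff(seq_len(ncol(Z_missing)), corrupted_cols)  # fully observed columns
Z_reordered <- Z_missing[, c(clean_cols, corrupted_cols)]        # uncorrupted first, corrupted last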

We can plot the coefficients and the error for the missing data scenario. Here we should expect 6 non-zero coefficients, with pairs of coefficients having similar values, since the data were simulated with beta = c(3,2,0,0,1.5,0,...,0,1.5,0,0,2,3).

BDcocolasso::plotCoef(fit_missing)
BDcocolasso::plotError(fit_missing)
BDcocolasso::plotCoef(fit_additive)
BDcocolasso::plotError(fit_additive)
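For reference, the true coefficient vector described above can be reconstructed and compared with the fitted coefficients (beta_true is an illustrative name):

beta_true <- c(3, 2, 0, 0, 1.5, rep(0, p_missing - 10), 1.5, 0, 0, 2, 3)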

We can also use the SCAD penalty for the Block-Descent CoCoLasso:

fit_missing.SCAD <- coco(Z = Z_missing, y = y_missing, n = n_missing, p = p_missing, p1 = p1, p2 = p2, step = 100, K = 4, mu = 10, tau = NULL, noise = "missing", block = TRUE, penalty = "SCAD")

fit_additive.SCAD <- coco(Z = Z_additive, y = y_additive, n = n_additive, p = p_additive, p1 = p1, p2 = p2, center.Z = FALSE, scale.Z = FALSE, step = 100, K = 4, mu = 10, tau = 0.3, noise = "additive", block = TRUE, penalty = "SCAD")

Once again, we get more erratic solution paths with slightly larger coefficients.

BDcocolasso::plotCoef(fit_missing.SCAD)
BDcocolasso::plotError(fit_missing.SCAD)

BDcocolasso::plotCoef(fit_additive.SCAD)
BDcocolasso::plotError(fit_additive.SCAD)

Note that here, since only 3 coefficients are activated, convergence should be very quick. In this case we should select lambda.opt rather than lambda.sd to obtain good estimates of the coefficient values; this would not be the case in a model with more activated coefficients of varying amplitude.
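Following that recommendation, the coefficients at lambda.opt can be extracted as earlier in this vignette (assuming coef() works the same way for block fits):

coef(fit_missing.SCAD, s = fit_missing.SCAD$lambda.opt)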

One important remark: too high a missing rate may cause the algorithms to fail. If the error "eigen(W, symmetric=TRUE) : infinite or NaN values" appears, it may indicate that your matrix has too high a missing rate. A good empirical threshold for the missing rate is 0.7.
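A quick pre-flight check along these lines (an illustrative snippet) computes the per-column missing rates and flags the columns that exceed the 0.7 threshold:

miss_rate <- colMeans(is.na(Z_missing))  # per-column missing rate
which(miss_rate > 0.7)                   # columns exceeding the empirical threshold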


