# biglasso: Extending Lasso Model Fitting to Big Data

y <- Heart$y
X.bm <- as.big.matrix(X)
fit <- biglasso(X.bm, y, family = "binomial")
plot(fit)


### 1.1.4 Cox Regression

library(survival)
X <- heart[,4:7]
y <- Surv(heart$stop - heart$start, heart$event)
X.bm <- as.big.matrix(X)
fit <- biglasso(X.bm, y, family = "cox")
plot(fit)


### 1.1.5 Multiple-Response Linear Regression

set.seed(10101)
n=300; p=300; m=5; s=10; b=1
x = matrix(rnorm(n * p), n, p)
beta = matrix(seq(from=-b,to=b,length.out=s*m),s,m)
y = x[,1:s] %*% beta + matrix(rnorm(n*m,0,1),n,m)
x.bm = as.big.matrix(x)
fit = biglasso(x.bm, y, family = "mgaussian")
plot(fit)


## 1.2 Big Data

When the raw data file is very large, it is better to convert it into a file-backed big.matrix, which caches the data on disk instead of holding it in memory. The function setupX does this: it reads the raw data file and creates a backing file (.bin) and a descriptor file (.desc) for the data matrix:

## The data has 1000 observations, 5000 features, and 10 non-zero coefficients.
## This is not actually very big, but vignettes in R are supposed to render
## quickly. Much larger data can be handled in the same way.
if (!file.exists("BigX.bin")) {
  X <- matrix(rnorm(1000 * 5000), 1000, 5000)
  beta <- c(-5:5)  # 11 coefficients, one of which is 0
  y <- as.numeric(X[, 1:11] %*% beta)
  write.csv(X, "BigX.csv", row.names = FALSE)
  write.csv(y, "y.csv", row.names = FALSE)
  ## Pretend that the data in "BigX.csv" is too large to fit into memory
  X.bm <- setupX("BigX.csv", header = TRUE)
}


Note that the conversion above needs to be run only once. After that, the data can be retrieved in any new R session by attaching its descriptor file (.desc):

rm(list = c("X", "X.bm", "y")) # Pretend starting a new session
X.bm <- attach.big.matrix("BigX.desc")
y <- read.csv("y.csv")[, 1] # Reload the response saved earlier; rm() removed it
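
To make the round trip concrete, here is a minimal sketch on a small throwaway matrix, assuming the bigmemory package is installed. The file names small.bin and small.desc and the temporary directory are illustrative and not part of this vignette; as.big.matrix() with a backing file stands in for setupX() so that no CSV file is needed.

```r
library(bigmemory)

set.seed(1)
x <- matrix(rnorm(20 * 5), 20, 5)

# Illustrative scratch directory for the backing and descriptor files
dir <- tempfile("bm_demo_")
dir.create(dir)

# Create a file-backed big.matrix: data go to small.bin, metadata to small.desc
x.bm <- as.big.matrix(x,
                      backingfile = "small.bin",
                      descriptorfile = "small.desc",
                      backingpath = dir)
rm(x.bm)  # drop the handle, as if the session had ended

# Re-attach from the descriptor file alone; the data are not re-read from x
x2 <- attach.big.matrix(file.path(dir, "small.desc"))

dim(x2)            # dimensions recorded in the descriptor
is.filebacked(x2)  # TRUE for a file-backed matrix
x2[1:3, 1:3]       # ordinary matrix-style indexing pulls values from disk
```

Only the indexed values are brought into memory, which is why the same pattern can scale to matrices larger than RAM.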


Attaching the descriptor file is very appealing for big data analysis because the raw data never have to be read again in a new R session, which would be very time-consuming. The code below again fits a lasso-penalized linear model and runs 10-fold cross-validation:

system.time({fit <- biglasso(X.bm, y)})

plot(fit)

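# 10-fold cross-validation in parallel; fall back to 2 cores if 4 fail.
# The result of tryCatch() is assigned so that cvfit also exists after a
# fallback (an assignment inside the error handler would stay local to it).
system.time({
  cvfit <- tryCatch(
    cv.biglasso(X.bm, y, seed = 1234, ncores = 4, nfolds = 10),
    error = function(cond) {
      cv.biglasso(X.bm, y, seed = 1234, ncores = 2, nfolds = 10)
    }
  )
})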
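
The fallback above hard-codes 4 and then 2 cores. An alternative sketch, using only the parallel package that ships with R, derives the core count from the machine. The cap of 4 and the choice to leave one core free are illustrative assumptions, not recommendations from the package.

```r
# detectCores() can return NA on some platforms, so guard against it
avail <- parallel::detectCores()
if (is.na(avail)) avail <- 1L

# Leave one core free, use at most 4, and never go below 1
ncores <- max(1L, min(4L, avail - 1L))
ncores

# The value could then be passed on, for example:
# cvfit <- cv.biglasso(X.bm, y, seed = 1234, ncores = ncores, nfolds = 10)
```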

par(mfrow = c(2, 2), mar = c(3.5, 3.5, 3, 1), mgp = c(2.5, 0.5, 0))
plot(cvfit, type = "all")
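
Once cross-validation has run, the selected penalty and the nonzero coefficients can be read off the fitted object. The sketch below is self-contained on small simulated data (the sizes and true coefficients are made up for illustration) and assumes the lambda, lambda.min, min, cve, and fit components described in the cv.biglasso help page.

```r
library(bigmemory)
library(biglasso)

set.seed(1234)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))   # illustrative: 3 true signals
y <- as.numeric(X %*% beta + rnorm(n))
X.bm <- as.big.matrix(X)

# Single core and 5 folds keep this quick
cvfit <- cv.biglasso(X.bm, y, seed = 1234, nfolds = 5, ncores = 1)

# Penalty value with the smallest cross-validation error
cvfit$lambda.min

# Coefficients (intercept first) at that penalty, taken from the full-data fit;
# coef(cvfit) is the package's accessor for the same values
b <- cvfit$fit$beta[, cvfit$min]
b[b != 0]
```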


# 2 Useful References

• biglasso R manual: https://cran.r-project.org/package=biglasso/biglasso.pdf
• biglasso on GitHub: https://github.com/YaohuiZeng/biglasso
• biglasso website: https://yaohuizeng.github.io/biglasso/index.html
• big.matrix manipulation: https://cran.r-project.org/package=bigmemory/index.html

biglasso documentation built on Oct. 6, 2022, 1:07 a.m.