knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette explores the possibilities offered by the bigmemory package family to work efficiently with Epidemium data. For the moment, it is based on a sample of real data: the results presented here should not be relied upon.
We will assume that the data have been imported using the OpenCancer package (see the dedicated vignette) and that a csv file has been stored in a subdirectory inst of the current working directory. To see an example, use /vignettes/inst in the package directory. This example dataframe illustrates the benefit of working with C++ pointers rather than with dataframes imported into R memory (hence into RAM).
library(OpenCancer)
# Use ./inst when run from the vignettes directory, ./vignettes/inst otherwise
datadir <- if (stringr::str_detect(getwd(), "/vignettes")) paste0(getwd(), "/inst") else paste0(getwd(), "/vignettes/inst")
# Download the example csv into that directory
url <- "https://github.com/EpidemiumOpenCancer/OpenCancer/raw/master/vignettes/inst/exampledf.csv"
download.file(url, destfile = paste0(datadir, "/exampledf.csv"))
X <- bigmemory::read.big.matrix(paste0(datadir,"/exampledf.csv"), header = TRUE)
The matrix is not explicitly imported into R: X is a C++ pointer, a trick made possible by the bigmemory package. As with any big.matrix object, the content of X can still be accessed by importing it into RAM. Working with pointers is a huge advantage in terms of memory:
# mem_used() takes no argument; object_size() reports the size of the R-side object
pryr::object_size(X)
The memory gain comes at a cost in terms of flexibility, since working with pointers requires C++ functions. However, a series of packages (mostly biglasso and biganalytics) allow statistical functions to be applied to pointers.
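To make the trade-off concrete, here is a minimal sketch on simulated data. The matrix m, the object X_toy and the column names v1 to v10 are illustrative assumptions, not part of the Epidemium sample. It compares the size of an ordinary matrix with the size of its big.matrix handle, copies a few cells back into RAM, and computes column means directly on the pointer.

```r
library(bigmemory)
library(biganalytics)

set.seed(1)
# Simulated data standing in for the real sample (hypothetical column names)
m <- matrix(rnorm(1e5 * 10), ncol = 10,
            dimnames = list(NULL, paste0("v", 1:10)))
X_toy <- as.big.matrix(m)   # the data now live outside the R heap

# The R-side object is only a small handle to the C++ structure
object.size(m)       # the full matrix
object.size(X_toy)   # the handle only

# Square brackets copy the requested cells back into an ordinary R matrix
head_rows <- X_toy[1:5, c("v1", "v2")]

# biganalytics functions operate on the pointer directly
means <- colmean(X_toy)
```

Only the cells requested through square brackets are copied into RAM, so it is usually preferable to extract small slices rather than the whole matrix.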
The big.simplelasso function we created is designed to perform feature selection on an OpenCancer dataframe that has been imported as a pointer. Assume that the explained variable is called 'incidence' (the default) and that we want to perform a 5-fold cross-validation:
pooledLASSO <- big.simplelasso(X, yvar = 'incidence',
  labelvar = c("cancer", "age", "Country_Transco", "year", "area.x", "area.y"),
  crossvalidation = TRUE, nfolds = 5, returnplot = FALSE)
summary(pooledLASSO$model)
The labelvar argument excludes these variables from the set of features included in the LASSO.
plot(pooledLASSO$model)
In this case, out of r length(pooledLASSO$model$fit$beta@i) variables, the LASSO selects r sum(pooledLASSO$coeff != 0) variables.
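For readers who want to see what a cross-validated LASSO on a pointer looks like outside OpenCancer, the sketch below calls biglasso::cv.biglasso directly on simulated data. It rests on an assumption: the internals of big.simplelasso are not shown here, and the fit$beta slot used above only suggests that it relies on biglasso. The data, the names x, y and X_toy, and the seed are all illustrative.

```r
library(bigmemory)
library(biglasso)
library(Matrix)

set.seed(123)
n <- 200
p <- 20
# Simulated features: only the first three truly drive the outcome
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- drop(x[, 1:3] %*% c(3, -2, 1.5)) + rnorm(n)

X_toy <- as.big.matrix(x)   # biglasso expects a double-typed big.matrix

# 5-fold cross-validation over the lambda path, on a single core
cvfit <- cv.biglasso(X_toy, y, nfolds = 5, seed = 123, ncores = 1)

# Coefficients at the lambda minimising the cross-validation error
beta <- as.matrix(coef(cvfit))[, 1]
beta <- beta[-1]                     # drop the intercept
selected <- unname(which(beta != 0)) # positions of the retained features
```

The retained features are those with a non-zero coefficient at the selected lambda, which is the quantity counted in the sentence above.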
Now, let's say we want to perform feature selection for each age class separately. While a standard dataframe would allow us to use group_by + do or nest + mutate, we must find another method for pointers. The bigsplit function is useful for this purpose. As an example, we keep only a few age groups (elements 5 to 8 of the split):
library(foreach)
groupingvar <- c('age')
# Row indices of X for each age group
indices <- bigtabulate::bigsplit(X, groupingvar, splitcol = NA_real_)
indices <- indices[5:8]
# ESTIMATE ONE MODEL PER GROUP
# %do% runs the groups sequentially; %dopar% with a registered backend would run them in parallel
model <- foreach(i = indices, .combine = 'list', .multicombine = TRUE,
  .maxcombine = nrow(X), .errorhandling = 'pass',
  .packages = c("bigmemory", "biglasso", "biganalytics", 'OpenCancer')) %do% {
  list(
    results = big.simplelasso(bigmemory::deepcopy(X, rows = i), yvar = 'incidence',
      labelvar = c("cancer", 'sex', "Country_Transco", "year", "area.x", "area.y"),
      crossvalidation = TRUE, nfolds = 5, returnplot = FALSE),
    indices = i
  )
}
Results are stored as a list, in the same order as the groups in indices.
summary(model[[1]]$results$model)
summary(model[[2]]$results$model)
summary(model[[3]]$results$model)
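The splitting step can be checked on a toy example. In the sketch below, the matrix toy and its two columns are made up for illustration. It shows that bigsplit with splitcol = NA_real_ returns one vector of row indices per level of the grouping column, and that deepcopy uses such a vector to build a new big.matrix restricted to one group.

```r
library(bigmemory)
library(bigtabulate)

set.seed(1)
# Toy data: three age groups of four rows each (hypothetical values)
toy <- as.big.matrix(cbind(age = rep(1:3, each = 4),
                           incidence = rnorm(12)))

# One element per level of 'age', each holding the matching row indices
idx <- bigsplit(toy, "age", splitcol = NA_real_)
length(idx)
idx[[2]]

# deepcopy() creates a separate big.matrix containing only those rows
age2 <- deepcopy(toy, rows = idx[[2]])
dim(age2)
```

Because deepcopy allocates a new big.matrix for each group, memory use grows with the size of the groups processed at the same time.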
big.model.FElasso performs feature selection on a big.matrix and returns a linear regression fitted on the selected features.
# POOLED OLS
pooledOLS <- big.model.FElasso(X, yvar = "incidence", returnplot = FALSE, relabel = TRUE)
DTsummary.biglm(pooledOLS)$coefftab
DTsummary.biglm(pooledOLS)$modeltab
It is also possible to perform regressions by group using the groupingvar argument. In that case, the result is a list with one element per group, each storing its fitted model in results:
model <- big.model.FElasso(X, yvar = "incidence", groupingvar = c('sex', 'age'),
  returnplot = FALSE, relabel = TRUE)
DTsummary.biglm(model[[38]]$results)$coefftab
DTsummary.biglm(model[[38]]$results)$modeltab