knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette explores how the bigmemory family of packages can be used to work efficiently with Epidemium data. For the moment, it is based on a sample of real data: the results presented here should not be relied upon.

We assume that the data have been imported using the OpenCancer package (see the dedicated vignette) and that a csv file has been stored in the inst subdirectory of the current working directory. An example file is available in /vignettes/inst, in the package directory. This example dataframe illustrates the benefit of working with C++ pointers rather than with dataframes imported into R memory (hence into RAM).

library(OpenCancer)
datadir <- if (stringr::str_detect(getwd(),"/vignettes")) paste0(getwd(),"/inst") else paste0(getwd(),"/vignettes/inst")

Variable selection by LASSO

url <- "https://github.com/EpidemiumOpenCancer/OpenCancer/raw/master/vignettes/inst/exampledf.csv"
download.file(url,destfile = paste0(datadir,"/exampledf.csv"))
X <- bigmemory::read.big.matrix(paste0(datadir,"/exampledf.csv"), header = TRUE)

The matrix is not explicitly imported into R: X is a C++ pointer, a trick made possible by the bigmemory package. As with any big.matrix object, the content of X can still be accessed by importing it into RAM. Working with pointers is a huge advantage in terms of memory:

# size of the R object X (a pointer), not of the underlying data
pryr::object_size(X)
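To make the pointer idea concrete, here is a minimal sketch on simulated values. The data and the names m, bm and slice are illustrative assumptions and are not part of the OpenCancer data. It compares the size of an ordinary R matrix with the size of the big.matrix object that points to the same values, then brings a small block back into RAM.

```r
library(bigmemory)

set.seed(1)
m  <- matrix(rnorm(1e5), ncol = 10)  # ordinary matrix, held entirely in R memory
bm <- as.big.matrix(m)               # same values, stored outside the R heap

# On the R side, bm is only a small S4 object wrapping an external pointer
size_matrix  <- object.size(m)
size_pointer <- object.size(bm)
print(size_matrix)
print(size_pointer)

# Bracket indexing copies the requested block into RAM as a regular matrix
slice <- bm[1:5, 1:3]
class(slice)
```

Only the rows and columns requested with brackets are copied into R memory, which is why subsetting before importing keeps the footprint small.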

This memory gain comes at a cost in terms of flexibility, since working with pointers requires C++ functions. However, several packages (mostly biglasso and biganalytics) make it possible to apply statistical functions to pointers.

The big.simplelasso function we created is designed to perform feature selection on an OpenCancer dataframe that has been imported as a pointer. Assume that the explained variable is called 'incidence' (the default) and that we want to perform a 5-fold cross-validation:

pooledLASSO <- big.simplelasso(X, yvar = 'incidence',
  labelvar = c("cancer", "age", "Country_Transco", "year", "area.x", "area.y"),
  crossvalidation = TRUE, nfolds = 5, returnplot = FALSE)
summary(pooledLASSO$model)

The labelvar argument excludes the listed variables from the set of features included in the LASSO.

plot(pooledLASSO$model)

In this case, out of `r length(pooledLASSO$model$fit$beta@i)` variables, the LASSO selects `r sum(pooledLASSO$coeff != 0)`.
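For readers who want to see what a cross-validated LASSO on a big.matrix looks like, the following sketch calls biglasso::cv.biglasso directly on simulated data. It is an illustration under assumptions: the data are made up, the names Xsim, cvfit and beta_min are hypothetical, and it does not claim to reproduce the internals of big.simplelasso.

```r
library(bigmemory)
library(biglasso)
library(Matrix)

set.seed(123)
n <- 200
p <- 20
feat <- matrix(rnorm(n * p), n, p)
# Simulated response: only the first two features carry signal
y <- 2 * feat[, 1] - 3 * feat[, 2] + rnorm(n)

Xsim  <- as.big.matrix(feat)
cvfit <- cv.biglasso(Xsim, y, nfolds = 5, seed = 1, ncores = 1)

# Coefficients at the lambda with the lowest cross-validation error;
# the first entry is the intercept
beta_min   <- as.numeric(cvfit$fit$beta[, cvfit$min])
n_selected <- sum(beta_min[-1] != 0)
n_selected
```

Counting the non-zero coefficients at the selected lambda gives the number of features retained by the LASSO.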

Now, suppose we want to perform feature selection for each age class separately. While a standard dataframe would allow us to use group_by + do or nest + mutate, we must find another method for pointers. The bigsplit function is useful here. As an example, we keep only a few groups (elements 5 to 8 of the split):

groupingvar <- c('age')
indices <- bigtabulate::bigsplit(X,groupingvar, splitcol=NA_real_)
indices <- indices[5:8]
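The splitting step above can be sketched on a toy big.matrix. The object toy and its two columns are assumptions made for illustration only. With splitcol = NA_real_, bigsplit returns one vector of row indices per group, and deepcopy can then materialise the rows of a single group as a new big.matrix.

```r
library(bigmemory)
library(bigtabulate)

# Toy data: three age groups of four rows each
toy <- as.big.matrix(cbind(age = rep(c(1, 2, 3), each = 4),
                           incidence = as.numeric(1:12)))

# One element per distinct value of 'age', each holding row indices
idx <- bigsplit(toy, "age", splitcol = NA_real_)
length(idx)

# Copy the rows of the second group into a separate big.matrix
rows_g2 <- idx[[2]]
sub <- deepcopy(toy, rows = rows_g2)
dim(sub)
```

Because only row indices are returned, the split itself stays cheap; data are copied only when a group is passed to deepcopy.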

# ESTIMATE ONE MODEL PER GROUP
# Note: %do% evaluates the groups sequentially; parallel execution would
# need %dopar% together with a registered backend
library(foreach)
model <- foreach(i = indices, .combine = 'list',
                 .multicombine = TRUE,
                 .maxcombine = nrow(X),
                 .errorhandling = 'pass',
                 .packages = c("bigmemory", "biglasso", "biganalytics",
                               "OpenCancer")) %do% {
  # fit the LASSO on a copy of X restricted to the rows of group i
  list(results = big.simplelasso(bigmemory::deepcopy(X, rows = i),
                                 yvar = 'incidence',
                                 labelvar = c("cancer", 'sex',
                                              "Country_Transco", "year",
                                              "area.x", "area.y"),
                                 crossvalidation = TRUE, nfolds = 5,
                                 returnplot = FALSE),
       indices = i)
}

Results are stored as a list, in the same order as the groups in indices.

summary(model[[1]]$results$model)
summary(model[[2]]$results$model)
summary(model[[3]]$results$model)
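The loop above uses %do%, which evaluates the groups one after another. A possible way to run per-group computations in parallel is sketched below on toy data, assuming the doParallel package is available. The objects toy, idx and desc are illustrative, and a simple group mean stands in for the model fit. The key point is that workers receive a descriptor of the shared big.matrix and attach to it, rather than receiving a copy of the data.

```r
library(bigmemory)
library(foreach)
library(doParallel)

# Toy shared big.matrix: three age groups of four rows each
toy <- as.big.matrix(cbind(age = rep(c(1, 2, 3), each = 4),
                           incidence = as.numeric(1:12)),
                     shared = TRUE)
idx  <- split(seq_len(nrow(toy)), toy[, "age"])  # row indices per group
desc <- describe(toy)                            # lightweight descriptor

cl <- makeCluster(2)
registerDoParallel(cl)

group_means <- foreach(rows = idx, .combine = c,
                       .packages = "bigmemory") %dopar% {
  xw <- attach.big.matrix(desc)  # re-attach to the shared data in the worker
  mean(xw[rows, 2])              # placeholder for a per-group model fit
}

stopCluster(cl)
group_means
```

Passing the descriptor instead of the object matters because an external pointer created in the main session is not valid inside a worker process.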

Feature selection and linear regression on selected features

big.model.FElasso performs feature selection on a big.matrix and returns a linear regression fitted on the selected features.

# POOLED OLS

pooledOLS <- big.model.FElasso(X, yvar = "incidence", returnplot = FALSE,
                               relabel = TRUE)

DTsummary.biglm(pooledOLS)$coefftab
DTsummary.biglm(pooledOLS)$modeltab
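The two-step logic (LASSO selection, then a linear regression on the retained features) can be sketched with biglasso and biganalytics on simulated data. This is an illustration under assumptions, not the implementation of big.model.FElasso: the data, the column names x1 to x10 and the objects full, Xfeat and selected are all made up for the example.

```r
library(bigmemory)
library(biglasso)
library(biganalytics)
library(Matrix)

set.seed(42)
n <- 300
p <- 10
feat <- matrix(rnorm(n * p), n, p,
               dimnames = list(NULL, paste0("x", 1:p)))
y <- 1 + 2 * feat[, "x1"] - 3 * feat[, "x2"] + rnorm(n)

full  <- as.big.matrix(cbind(y = y, feat))  # response and features together
Xfeat <- as.big.matrix(feat)                # features only, for the LASSO

# Step 1: cross-validated LASSO, keep features with non-zero coefficients
cvfit    <- cv.biglasso(Xfeat, y, nfolds = 5, seed = 1, ncores = 1)
beta_min <- as.numeric(cvfit$fit$beta[, cvfit$min])
selected <- colnames(feat)[beta_min[-1] != 0]

# Step 2: linear regression on the selected features, fitted by chunks
ols <- biglm.big.matrix(reformulate(selected, response = "y"), data = full)
summary(ols)
```

Refitting by ordinary least squares removes the shrinkage applied by the LASSO penalty, at the price of standard errors that ignore the selection step.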

It is also possible to run the regressions by group with the groupingvar argument. In that case, the result is a list with one element per group:

model <- big.model.FElasso(X, yvar = "incidence",
                           groupingvar = c('sex', 'age'),
                           returnplot = FALSE,
                           relabel = TRUE)

DTsummary.biglm(model[[38]]$results)$coefftab
DTsummary.biglm(model[[38]]$results)$modeltab


EpidemiumOpenCancer/OpenCancer documentation built on May 12, 2019, 7:46 a.m.