In linogaliana/OpenCancer: Build statistical models to understand cancer causes

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette explores how the bigmemory package family can be used to work efficiently with Epidemium data.

We assume the data have been imported using the OpenCancer package (see the dedicated vignette) and that a CSV file is stored in an inst subdirectory of the current working directory. For an example, see /vignettes/inst in the package directory. This example data frame illustrates the benefit of working with C++ pointers rather than with data frames loaded into R memory (hence into RAM).

library(OpenCancer)
datadir <- paste0(getwd(),"/inst")

Variable selection by LASSO

url <- "https://github.com/EpidemiumOpenCancer/OpenCancer/raw/master/vignettes/inst/exampledf.csv"
download.file(url,destfile = paste0(datadir,"/exampledf.csv"))

X <- bigmemory::read.big.matrix(paste0(datadir,"/exampledf.csv"), header = TRUE)

The matrix is not loaded into R's memory. X is a pointer to a C++ data structure, which the bigmemory package makes possible. As with any big.matrix object, the content of X can still be accessed by bringing it into RAM, for instance by subsetting it with X[i, j]. Working with pointers is a large advantage in terms of memory:

# pryr::mem_used() takes no argument; object_size() measures the R object X itself
pryr::object_size(X)
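To make the memory argument concrete, here is a minimal sketch on a simulated matrix. The object names `dense` and `bm` are illustrative and are not part of the OpenCancer data. It assumes the bigmemory package is installed, and it compares the size of an ordinary R matrix with the size of the R object that points to a big.matrix holding the same values.

```r
library(bigmemory)

# Simulated data standing in for the real table (illustrative only)
dense <- matrix(rnorm(1e5 * 10), nrow = 1e5, ncol = 10)

# Copy the values into a big.matrix; `bm` only holds a pointer to them
bm <- as.big.matrix(dense)

# Size of the R-side objects: the pointer is far smaller than the matrix
size_dense <- object.size(dense)
size_bm <- object.size(bm)
print(size_dense)
print(size_bm)

# Subsetting with [ brings the requested cells into RAM as a regular matrix
in_ram <- bm[1:5, 1:3]
```

Only the cells requested through `[` are copied into RAM, so it is usually worth extracting small slices rather than the whole matrix.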

The memory gain comes at a cost in terms of flexibility, since working with pointers requires C++ functions. However, several packages (mostly biglasso and biganalytics) make it possible to apply statistical functions to pointers.

The big.simplelasso function we created is designed to perform feature selection on an OpenCancer data frame imported as a pointer. Assume the response variable is called 'incidence' (the default) and we want 5-fold cross-validation:

pooledLASSO <- big.simplelasso(X, yvar = 'incidence', labelvar = c("cancer", "age",
  "Country_Transco", "year", "area.x", "area.y"), crossvalidation = TRUE,
  nfolds = 5, returnplot = FALSE)
summary(pooledLASSO$model)

The labelvar argument excludes these variables from the set of features included in the LASSO.

plot(pooledLASSO$model)

In this case, out of `r length(pooledLASSO$model$fit$beta@i)` variables, the LASSO selects `r sum(pooledLASSO$coeff != 0)`.

Now suppose we want to perform feature selection separately for each age class. With a standard data frame we could use group_by + do or nest + mutate, but pointers require another method. The bigsplit function is useful here. As an example, we keep only a few of the groups (elements 5 to 8 of the split):

groupingvar <- c('age')
indices <- bigtabulate::bigsplit(X,groupingvar, splitcol=NA_real_)
indices <- indices[5:8]
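To illustrate what such a split contains, here is a base R analogue on an ordinary data frame. It is only a sketch: the `toy` data are invented, and we assume, as we understand the bigsplit documentation, that with splitcol = NA_real_ the function returns the row indices belonging to each group.

```r
# Invented toy data standing in for the OpenCancer table (illustrative only)
toy <- data.frame(
  age = c(1, 2, 1, 3, 2, 1),
  incidence = c(0.5, 1.2, 0.7, 2.0, 1.1, 0.4)
)

# One vector of row indices per age class
idx <- split(seq_len(nrow(toy)), toy$age)

# Rows belonging to the first age class
idx[["1"]]

# The indices can then be used to extract one group at a time
group1 <- toy[idx[["1"]], ]
```

Each element of the list is a plain vector of row numbers, which is why it can be passed to the rows argument of deepcopy to extract one group from the big.matrix.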

# ESTIMATE ONE MODEL PER GROUP
# (%do% runs the loop sequentially; %dopar% with a registered backend would parallelize it)
library(foreach)
model <- foreach(i = indices, .combine='list', .errorhandling = 'pass',
                       .packages = c("bigmemory","biglasso","biganalytics",
                                     'OpenCancer')) %do% {
                         return(
                           big.simplelasso(bigmemory::deepcopy(X, rows = i),
                                           yvar = 'incidence',
                                           labelvar = c("cancer", 'sex',
                                                        "Country_Transco", "year", "area.x", "area.y"),
                                           crossvalidation = TRUE, nfolds = 5, returnplot = FALSE)
                         )
                       }

The results are stored as a nested list. To access its components easily, we rearrange it into a flat list:

x <- list()
x[[1]] <- model[[2]]
for (i in 2:length(indices)){
    eval(parse(text = paste0("x[[",i,"]] <- ",
                             "model",paste(rep("[[1]]",i-1), collapse = ""),"[[2]]")))
}
model <- x
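The nesting most likely arises because a custom .combine function such as list is applied to two results at a time, so each new result wraps the previous ones one level deeper. The base R sketch below reproduces that shape with placeholder strings instead of fitted models and flattens it with a small helper. The helper name `unnest_pairwise` is hypothetical and not part of OpenCancer or foreach. Unlike the code above, it keeps the groups in their original order.

```r
# Placeholder results standing in for fitted models (illustrative only)
results <- list("fit_a", "fit_b", "fit_c", "fit_d")

# Pairwise combination: list(list(list(a, b), c), d)
nested <- Reduce(function(acc, r) list(acc, r), results)

# Walk down the left spine, collecting the right-hand element at each level
unnest_pairwise <- function(nested, n) {
  out <- vector("list", n)
  node <- nested
  for (k in rev(seq_len(n)[-1])) {
    out[[k]] <- node[[2]]
    node <- node[[1]]
  }
  out[[1]] <- node
  out
}

flat <- unnest_pairwise(nested, length(results))
```

An alternative worth considering is to drop the .combine argument altogether, since foreach returns a flat list by default.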

The results for three of the groups are:

summary(model[[1]]$model)
summary(model[[2]]$model)
summary(model[[3]]$model)

Feature selection and linear regression on selected features

big.model.FElasso performs feature selection on a big.matrix and returns a linear regression fitted on the selected features.

# POOLED OLS

pooledOLS <- big.model.FElasso(X, yvar = "incidence", returnplot = FALSE,
                               relabel = TRUE)

DTsummary.biglm(pooledOLS)$coefftab
DTsummary.biglm(pooledOLS)$modeltab
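The second stage, a regression restricted to the selected features, can be illustrated with base R on a small simulated data frame. This is only a sketch of the idea and not the internals of big.model.FElasso: the data, the variable names and the `selected` vector (standing in for the output of a selection step) are all invented.

```r
set.seed(1)

# Simulated data: incidence depends on x1 and x3 but not on x2
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
toy$incidence <- 2 * toy$x1 - toy$x3 + rnorm(50, sd = 0.1)

# Hypothetical result of a feature selection step
selected <- c("x1", "x3")

# Build incidence ~ x1 + x3 from the selected names and fit by OLS
fml <- reformulate(selected, response = "incidence")
fit <- lm(fml, data = toy)
coef(fit)
```

Building the formula from a character vector keeps the regression in sync with whatever the selection step returns.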

Regressions can also be run by group using the groupingvar argument. In that case, the result is a list of models that can be indexed by group:

model <- big.model.FElasso(X, yvar = "incidence",
                           groupingvar = c('sex','age'),
                           returnplot = FALSE,
                           relabel = TRUE)

DTsummary.biglm(model[[2]])$coefftab
DTsummary.biglm(model[[2]])$modeltab