Guide to Compressive Big Data Analytics [CBDA]

CBDA R Package Installation

N.B.: Please download ONLY the Windows binary or source files from our GitHub repository and install them (either in R or RStudio) for testing.

# Installation from the Windows binary (recommended for Windows systems)
install.packages("/filepath/CBDA_1.0.0.zip", repos = NULL, type = "win.binary") 
# Installation from the source (recommended for Macs and Linux systems)
install.packages("/filepath/CBDA_1.0.0.tar.gz", repos = NULL, type = "source")

The CRAN release is pending. Once the package is published on CRAN, it can be installed with the following command:

install.packages("CBDA") 

The packages necessary to run the CBDA algorithm can be installed/loaded by launching the CBDA_initialization() function (see the example in the R chunk below). If the parameter install is set to TRUE (by default it is set to FALSE), the CBDA_initialization() function installs (if needed) and attaches all the packages necessary to run the CBDA package v1.0.0. This function can be run before any production run or test. The output shows a table (see the Figure below) with a TRUE or FALSE for each package, so the necessary steps can be taken for any package marked FALSE.
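A minimal sketch of this initialization step (assuming only the install argument described above; any other arguments are left at their defaults):

library(CBDA)
# Install (if needed) and attach all the packages required by CBDA v1.0.0
CBDA_initialization(install = TRUE)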

N.B.: to eliminate a warning on Windows due to the "doMC" package not being available there (it is intended for Mac/Linux systems), install "doMC" with the following command: install.packages("doMC", repos="http://R-Forge.R-project.org")

Figure: table of packages installed/attached by CBDA_initialization() (ipaktable)

Memory and storage limits to run CBDA

See Memory limits under different OS for various limitations on memory when running R. As far as CBDA is concerned, a CBDA object can be up to 200-300 MB, and the default limits are sufficient to run the CBDA algorithm. To query or increase the memory allocation, the memory.limit() function reports or raises the limit in force on the total allocation.

memory.limit(50000)  # to allocate 50Gb of memory

The memory.limit() function only works on Windows. On Macs, try ulimit -t 600 -v 4000000, which sets the time limit to 600 seconds and the memory limit to 4 GB.

The space needed to save all the workspaces may be as large as 1-5 GB, depending on the number of subsamples. We are working on a new CBDA implementation that reduces the storage requirements.

Background

This document illustrates the first release of the CBDA protocol as described in the manuscript under review entitled "Controlled Feature Selection and Compressive Big Data Analytics: Applications to Big Biomedical and Health Studies" by Simeone Marino, Jiachen Xu, Yi Zhao, Nina Zhou, Yiwang Zhou, Ivo D. Dinov. University of Michigan, Ann Arbor.

The CBDA protocol has been developed in the R environment. Since a large number of smaller training sets are needed for the convergence of the protocol, we created a workflow that runs on the LONI pipeline environment, a free platform for high performance computing that allows the simultaneous submission of hundreds of independent instances/jobs of the CBDA protocol. The methods, software and protocols developed here are openly shared on our GitHub repository. All software, workflows, and datasets are publicly accessible. The CBDA protocol steps are illustrated in Figure 1.

An introduction to version control and the LONI pipeline is available here: Version Control and LONI.

A simple and clear introduction to the main algorithm used in the CBDA protocol (i.e., the SuperLearner) is available here: SuperLearner_Intro.

N.B.: This version 1.0.0 is fully tested ONLY on continuous features (Xtemp) and a binary outcome (Ytemp).

Main function of the CBDA package -- CBDA()

The CBDA package comprises several functions. The main function is CBDA(), which has all the input specifications to run a set of M subsamples from the Big Data [Xtemp, Ytemp]. We assume that the Big Data is already clean and harmonized.

After the necessary data wrangling (i.e., imputation, normalization and rebalancing), an ensemble predictor (i.e., SuperLearner) is applied to each subsample for training/learning.
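As an illustration only (the objects X_sub and Y_sub are hypothetical and this is not the exact internal CBDA call), a SuperLearner fit on a single subsample might look like:

library(SuperLearner)
# Y_sub: binary outcome of the subsample; X_sub: data.frame of its features
sl_fit <- SuperLearner(Y = Y_sub, X = X_sub, family = binomial(),
                       SL.library = c("SL.glm", "SL.glmnet", "SL.randomForest"))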

The list of algorithms (or wrappers) available from the SuperLearner can be displayed by typing listWrappers(). By default, the CBDA package operates with the following algorithms (set with their default values, as described in the SuperLearner package):

The ensemble predictive model is then validated on a fraction alpha of the Big Data. Each subsample generates a predictive model that is ranked, based on performance metrics (e.g., Mean Square Error (MSE) and Accuracy), during the first validation step.
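A hedged sketch of how such metrics could be computed on the held-out validation fraction (the names pred and Y_val are illustrative, not the package internals):

# pred: predicted probabilities on the validation fraction; Y_val: observed binary outcomes
MSE      <- mean((Y_val - pred)^2)
Accuracy <- mean(Y_val == as.numeric(pred > 0.5))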

CBDA input specifications

The array of input specifications comprises the following labels (in the square brackets is the default value):

Sampling ranges for cases (CSR - Case Sampling Range) and features (FSR - Feature Sampling Range) are then defined as follows:
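For intuition only (not the package internals; the range values below are hypothetical), one subsample of cases and features drawn from such ranges could look like:

# Draw one subsample using illustrative CSR = 50-70% and FSR = 5-15% ranges
n_cases    <- round(nrow(Xtemp) * runif(1, 0.50, 0.70))
n_features <- round(ncol(Xtemp) * runif(1, 0.05, 0.15))
case_idx    <- sample(nrow(Xtemp), n_cases)
feature_idx <- sample(ncol(Xtemp), n_features)
X_sub <- Xtemp[case_idx, feature_idx]
Y_sub <- Ytemp[case_idx]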

Secondary CBDA functions

After all the M subsamples have been generated and each predictive model computed, the CBDA() function calls 4 more functions to perform, respectively:

At the end of a successful CBDA() call, 2 workspaces are generated:

Throughout the execution of the CBDA protocol, a workspace with the main CBDA specifications is created and loaded whenever necessary (e.g., "CBDA_label_info.RData").
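For example (a sketch only; in the actual file name, "label" is replaced by the label argument passed to CBDA()), such a workspace can be reloaded manually with:

# Load the CBDA specifications workspace from the workspace directory
load(file.path(workspace_directory, "CBDA_label_info.RData"))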

CBDA object

A CBDA object is created (e.g., CBDA_object <- CBDA()) with the following data:
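Once created, the contents of the returned object can be inspected with standard R tools, for example:

# Top-level structure and element names of the CBDA object
str(CBDA_object, max.level = 1)
names(CBDA_object)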

Example

# Load the CBDA package; SuperLearner provides the SL.glmnet wrapper used below
library(CBDA)
library(SuperLearner)

# Set the specs for the synthetic dataset to be tested
n = 300          # number of observations
p = 900          # number of variables

# Generate a nxp matrix of IID variables (e.g., ~N(0,1))
X1 = matrix(rnorm(n*p), nrow=n, ncol=p)

# Setting the nonzero variables - signal variables
nonzero=c(1,100,200,300,400,500,600,700,800,900)

# Set the signal amplitude (for noise level = 1)
amplitude = 10

# Allocate the nonzero coefficients in the correct places
beta = amplitude * (1:p %in% nonzero)

# Generate a linear model with additive white noise (e.g., ~N(0,1))
ztemp <- function() X1 %*% beta + rnorm(n)
z = ztemp()

# Pass it through an inv-logit function to
# generate the Bernoulli response variable Ytemp
pr = 1/(1+exp(-z))
Ytemp = rbinom(n,1,pr)
X2 <- cbind(Ytemp,X1)

dataset_file ="Binomial_dataset_3.txt"

# Save the synthetic dataset
write.table(X2,dataset_file,sep=",")

# Load the Synthetic dataset
Data = read.csv(dataset_file,header = TRUE)
Ytemp <- Data[,1] # set the outcome
original_names_Data <- names(Data)
cols_to_eliminate=1
Xtemp <- Data[-cols_to_eliminate] # set the matrix X of features/covariates
original_names_Xtemp <- names(Xtemp)

workspace_directory <- getwd()

SL.glmnet.0.75 <- function(..., alpha = 0.75,family="binomial"){
  SL.glmnet(..., alpha = alpha, family = family)}

simeone <- c("SL.glm","SL.glmnet",
             "SL.svm","SL.randomForest","SL.bartMachine","SL.glmnet.0.75")

# Call the Main CBDA function
# Multicore functionality NOT enabled
CBDA_object <- CBDA(Ytemp , Xtemp , M = 16 , Nrow_min = 50, Nrow_max = 70,
                    top = 15, max_covs = 10 , min_covs = 3)

# Multicore functionality enabled
CBDA_object <- CBDA(Ytemp , Xtemp , M = 24 , Nrow_min = 60, Nrow_max = 80,
                    N_cores = 4 , top = 20, max_covs = 15 ,  min_covs = 3,
                    algorithm_list = simeone , label = "CBDA_package_test_multicore")

As soon as it is launched, the progress of the CBDA algorithm is shown on the screen (see below).

After each step of the CBDA algorithm, completion messages and some output are displayed on the screen. For example, after the "CONSOLIDATION" phase is done, the screen shows:

Or after the Validation and Stopping Criteria steps:

If multicore functionality is enabled, no screen messages are displayed due to the nature of the job allocations.

Datasets

The CBDA protocol has been applied to 3 different datasets in the manuscript: 2 synthetic ones (Null and Binomial) and a real one (ADNI dataset). The synthetic datasets all have the same number of cases (i.e., 300), but they differ in the number of features. The binomial datasets all have only 10 "true" predictive features, while the null datasets have none.

The Null datasets are organized as follows:

The Binomial datasets are organized as follows:

LONI pipeline implementation

As mentioned earlier, a large number of smaller training sets are needed for the convergence of the protocol. Thus the CBDA R package has dedicated functions for a workflow implementation that runs on the LONI pipeline environment, a free platform for high performance computing that allows the simultaneous submission of hundreds of independent instances/jobs of the CBDA protocol. The main and secondary functions are duplicated and modified to run on the LONI pipeline. The main difference is that they have a job_id parameter that is used to identify the specific job on the server. The functions are CBDA.pipeline(), CBDA_Consolidation.pipeline(), CBDA_Validation.pipeline() and CBDA_Stopping_Criteria.pipeline(). The complete pipeline workflow implementation can be downloaded in this LONI.zip file.
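As a hedged sketch only (assuming CBDA.pipeline() accepts the same core arguments as CBDA() plus the job_id parameter described above; the actual signature may differ), a single LONI job might run:

# Hypothetical call for one job; job_id identifies this instance on the server
CBDA.pipeline(job_id = 1, Ytemp, Xtemp, label = "CBDA_LONI_test")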

CBDA complete set of results for the synthetic datasets (Null and Binomial)

Due to the large number of experiments and the many different specifications, the complete set of results for the 3 Binomial datasets is shown at the following links:

The complete set of results for the histograms in Figure 3 is shown at the following links:

The complete set of results shown in the next figure (i.e., Figure 2 of the manuscript under review) is available here.

The figure below shows a set of heatmaps and histograms summarizing all the experiments in the manuscript under review.

In the heatmaps, the x axis represents the 16 combinations between the choice of the number of subsamples M (i.e., 1,000, 3,000, 6,000 and 9,000) and the choice of top-ranked predictions (i.e., 100, 200, 500 and 1,000), as described in the last 2 columns of Table 2 in the Methods section of the manuscript under review. The combinations (M, top-ranked) are ordered as follows:

Combinations 1-4 = (1,000, 100), (1,000, 200), (1,000, 500), (1,000, 1,000)
Combinations 5-8 = (3,000, 100), (3,000, 200), (3,000, 500), (3,000, 1,000)
Combinations 9-12 = (6,000, 100), (6,000, 200), (6,000, 500), (6,000, 1,000)
Combinations 13-16 = (9,000, 100), (9,000, 200), (9,000, 500), (9,000, 1,000)

The y axis represents the CBDA experiment specs: Experiments 1-6 have no missing values (i.e., missValperc = 0%) and Experiments 7-12 have 20% missing values (i.e., missValperc = 20%). Both sets of experiments have the FSR and CSR ranges combined in ascending order:

Exp 1 and Exp 7 = (FSR, CSR) = (1-5%, 30-60%)
Exp 2 and Exp 8 = (FSR, CSR) = (5-15%, 30-60%)
Exp 3 and Exp 9 = (FSR, CSR) = (15-30%, 30-60%)
Exp 4 and Exp 10 = (FSR, CSR) = (1-5%, 60-80%)
Exp 5 and Exp 11 = (FSR, CSR) = (5-15%, 60-80%)
Exp 6 and Exp 12 = (FSR, CSR) = (15-30%, 60-80%)

See Table 2 for details. Panels A, C and E show the CBDA results using the Accuracy performance metric; Panels B, D and F show the CBDA results using the Mean Square Error (MSE) performance metric (see Methods for details on the performance metrics). Panels A and B, C and D, and E and F show the results for the 3 Binomial datasets tested, respectively.

The heatmaps show, in light blue, CBDA experiments with high true positive rates, i.e., a high frequency of correct feature identification (more than 7 true features identified out of 10), and, in dark blue, CBDA experiments with low true positive rates, i.e., a low frequency of correct feature identification (fewer than 7 true features identified out of 10). Our ultimate goal is to suggest optimal input specifications for the CBDA protocol (i.e., decrease the computational time by reducing the number of samples and increase the rate of discovery of "true" features).

The histograms illustrate a comparison between the Null and Binomial dataset experiments. The goal of this set of histograms is to validate the CBDA protocol when no signal is present in the data. For the best accuracy in the comparison, we used all 9,000 CBDA samples, ranking the top 1,000 predictions (this is equivalent to combination 16 in the heatmaps figure). Each histogram combines the results of all 12 experiments (using the MSE metric).

The function CBDA_spectrum_plots() generates the histograms in the figure. Specifically, it generates 4 plots/histograms: count and density histograms for the top features as selected by each of the two performance metrics (i.e., MSE and Accuracy).
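As a hedged usage sketch only (assuming the function accepts the object returned by CBDA(); the actual arguments may differ):

# Hypothetical call; consult the package documentation for the exact arguments
CBDA_spectrum_plots(CBDA_object)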


