knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The biomarker candidates selected by INDEED lead to more accurate survival time prediction compared with those selected by differential expression (DE) analysis and differential network (DN) analysis.

Introduction

Differential expression (DE) analysis is commonly used to identify biomarker candidates that have significant changes in their expression levels between distinct biological groups. One drawback of DE analysis is that it only considers the changes on single biomolecular level. In differential network (DN) analysis, network is typically built based on the correlation and biomarker candidates are selected by investigating the network topology. However, correlation tends to generate over-complicated networks and the selection of biomarker candidates purely based on network topology ignores the changes on single biomolecule level. Thus, we have proposed a novel method INDEED, which considers both the changes on single biomolecular and network levels by integrating DE and DN analysis. INDEED has been published in Methods journal (PMID: 27592383). This is the R package that implements the algorithm.

This R package will generate a csv file containing information such as p-values, node degree and activity score for each biomolecule. A higher activity score indicates that the corresponding biomolecule has more neighbors connceted in the differential network and their p-values are more statistically significant. It will also generate a csv file for the differential network created by INDEED.

The INDEED package doesn't get loaded automatically, so remember to load it first:

library(INDEED)

glasso function will also be loaded as INDEED depends on it to obtain the sparse differential network.

Using the function select_sig()

To use the function sig_select(), users will need to have these data sets in hand:

The demo data is presented in the Demo Data section below. It will be automatically loaded with the package.

Now, users can test select_sig() to see how it works within sample data sets. One good thing about this package is that it gives users great flexibility. For example, although correlation tends to generate over-complicated networks, select_sig() still allows users to choose between using correlation or partial correlation to generate networks. Setting partial = TRUE will generate a sparse differential network using partial correlation. By setting partial = FALSE, users will get network based on correlation. If users forgot to specify the partial parameter, there will be an error message: Error in if (partial == TRUE) { : argument is of length zero}. In addtion, once partial = TRUE is specified, users will be able to interact with console to select their desired regularization parameter rho in order to obtain the sparse differential network. If answered [n] (Default), another question will pop up asking the users whether they want to select the rho based on the minimum rule [m] or one standard error rule [o] (Default). Also, network generated based on correlation can be either Pearson or Spearman correlation. The default is Pearson correlation. method = "spearman" will use Spearman instead. For both correlation and partial correlation, users get to determine the permutation times based on their preferences.

Remember, a file containing p-values is optional as p-values could be calculated using logistic regression if p_val is set to NULL or not specified.

The following example demonstrates how to use select_sig function:

select_sig(x = Met_GU, class_label = Met_Group_GU, id = Met_name_GU, partial = TRUE)

In this case, the sparse differential network is based on partial correlation and p-value for each biomolecule is calculated for users. Also, rho is picked based on one standard error rule and number of permutations is set to 1000.

## Draw error curve
devtools::load_all()
set.seed(100)
choose_rho <- function(data, n_fold, rho) {
  # randomly shuffle the data
  Data <- data[sample(nrow(data)), ]
  # create n_fold equally size folds
  folds <- cut(seq(1, nrow(Data)), breaks = n_fold, labels = FALSE)
  # tune parameters
  d <- ncol(Data)
  loglik_cv <- c()
  loglik_rho <- c()
  pb <- txtProgressBar(min = 0, max = length(rho), style = 3) # create progress bar
  for(i in 1 : length(rho)){
    Sys.sleep(0.1)
    # perform n_fold cross validation
    loglik <- c()
    for(j in 1 : n_fold){
      # segement your data by fold using the which() function
      testIndexes <- which(folds == j, arr.ind = TRUE)
      testData <- Data[testIndexes, ]
      trainData <- Data[-testIndexes, ]
      # use test and train data partitions however you desire...
      cov <- var(trainData) # compute the covariance matrix
      pre<- glasso(cov, rho = rho[i])
      loglik <- c(loglik, loglik_ave(testData, pre$wi))
    }
    loglik_cv <- c(loglik_cv, sum(loglik) / n_fold)
    loglik_rho <- c(loglik_rho, sd(loglik) / sqrt(n_fold))
    setTxtProgressBar(pb, i) # update progress bar
  }
  close(pb)
  plot(rho, loglik_cv, xlab = expression(lambda), ylab = "Error")
  lines(rho, loglik_cv)
  error <- list("log.cv" = loglik_cv, "log.rho" = loglik_rho)
  return(error)
}
data_bind <- rbind(Met_GU, Met_Group_GU)
raw_group_1 <- data_bind[,data_bind[nrow(data_bind),] == 0][1:(nrow(data_bind) - 1),]  # Group 1: p*n1
raw_group_2 <- data_bind[,data_bind[nrow(data_bind),] == 1][1:(nrow(data_bind) - 1),]  # Group 2: p*n2
# Z-transform the data for group-specific normalization
data_group_1 <- scale(t(raw_group_1)) # Group 1: n1*p
data_group_2 <- scale(t(raw_group_2)) # Group 2: n2*p
n_fold <- 5 # number of folds
rho = exp(seq(log(0.6), log(0.01), length.out = 20))
# draw error curve
error <- choose_rho(data_group_1, n_fold, rho)
title(main = "Group 1: Error curve using corss validation")
# chosse optimal rho
rho[error$log.cv == min(error$log.cv)] # rho based on minimum rule
abline(v = rho[error$log.cv == min(error$log.cv)], col = "red", lty = 3)
# one standard error rule
abline(h = min(error$log.cv) + error$log.rho[error$log.cv == min(error$log.cv)], col = "blue")
[1] "The list of rhos for group 1:"
 [1] 0.01000000 0.01240472 0.01538770 0.01908801 0.02367814 0.02937207 0.03643522 0.04519687 0.05606544 0.06954760 0.08627184 0.10701779
[13] 0.13275256 0.16467581 0.20427570 0.25339825 0.31433340 0.38992173 0.48368692 0.60000000
Choose your own regularization parameter rho for group 1 (Default: n)? [y/n]: n
rho based on minimum rule/ rho based on one standard error rule (Default: o) [m/o]: o
[1] 0.2533983
## Draw error curve
set.seed(100)
choose_rho <- function(data, n_fold, rho) {
  # randomly shuffle the data
  Data <- data[sample(nrow(data)), ]
  # create n_fold equally size folds
  folds <- cut(seq(1, nrow(Data)), breaks = n_fold, labels = FALSE)
  # tune parameters
  d <- ncol(Data)
  loglik_cv <- c()
  loglik_rho <- c()
  pb <- txtProgressBar(min = 0, max = length(rho), style = 3) # create progress bar
  for(i in 1 : length(rho)){
    Sys.sleep(0.1)
    # perform n_fold cross validation
    loglik <- c()
    for(j in 1 : n_fold){
      # segement your data by fold using the which() function
      testIndexes <- which(folds == j, arr.ind = TRUE)
      testData <- Data[testIndexes, ]
      trainData <- Data[-testIndexes, ]
      # use test and train data partitions however you desire...
      cov <- var(trainData) # compute the covariance matrix
      pre<- glasso(cov, rho = rho[i])
      loglik <- c(loglik, loglik_ave(testData, pre$wi))
    }
    loglik_cv <- c(loglik_cv, sum(loglik) / n_fold)
    loglik_rho <- c(loglik_rho, sd(loglik) / sqrt(n_fold))
    setTxtProgressBar(pb, i) # update progress bar
  }
  close(pb)
  plot(rho, loglik_cv, xlab = expression(lambda), ylab = "Error")
  lines(rho, loglik_cv)
  error <- list("log.cv" = loglik_cv, "log.rho" = loglik_rho)
  return(error)
}

# draw error curve
error <- choose_rho(data_group_2, n_fold, rho)
title(main = "Group 2: Error curve using corss validation")
# chosse optimal rho
rho[error$log.cv == min(error$log.cv)] # rho based on minimum rule
abline(v = rho[error$log.cv == min(error$log.cv)], col = "red", lty = 3)
# one standard error rule
abline(h = min(error$log.cv) + error$log.rho[error$log.cv == min(error$log.cv)], col = "blue")
[1] "The list of rhos for group 2:"
 [1] 0.01000000 0.01240472 0.01538770 0.01908801 0.02367814 0.02937207 0.03643522 0.04519687 0.05606544 0.06954760 0.08627184 0.10701779
[13] 0.13275256 0.16467581 0.20427570 0.25339825 0.31433340 0.38992173 0.48368692 0.60000000
Choose your own regularization parameter rho for group 2 (Default: n)? [y/n]: n
rho based on minimum rule/ rho based on one standard error rule (Default: o) [m/o]: o
[1] 0.2533983
Enter your desired number of permutations to build differential network using partial correlation [Default: 1000]: 1000

The table below is part of the output:

x <- read.csv("../inst/INDEED_result.csv", header = FALSE)
names(x) <- NULL
knitr::kable(head(x, 11))

There are more examples in the More Examples section below.

Demo Data

Met_GU <- get("Met_GU")[1:10, 1:10]   # Displying the fisrt 10*10 data in Met_GU
knitr::kable(Met_GU, caption = "**Met_GU**")
p_val <- read.table("../inst/pvalue_M_GU.csv", sep = ",", header = TRUE)
knitr::kable(head(p_val, 10), caption = "**P-values**")     # display only the first 10 metabolites associated with their p-values
Met_Group_GU <- get("Met_Group_GU")[1:10]
knitr::kable(Met_Group_GU)   # Displying the fisrt 10 data in Met_Group_GU

An error message: Error in Data[testIndexes, ] : subscript out of bounds will occur if this class_lable is missing.

Met_name_GU <- get("Met_name_GU")[1:10]
knitr::kable(Met_name_GU)     # Displying the fisrt 10 data in Met_name_GU

Also, if the users want the p-values to be calculated for them without providing this data, an error message will occur:
Error in colnames(X_df)[1:ncol(X_df) - 1] <- Met_name : replacement has length zero.

More Examples

Example 1

select_sig(x = Met_GU, class_label = Met_Group_GU, id = Met_name_GU, partial = FALSE)

By calling the above function, the differential network is obtained based on Pearson correlation coefficient. And the table below is part of the csv file output.

x <- read.csv("../inst/INDEED_result_pearson.csv", header = FALSE)
names(x) <- NULL
knitr::kable(head(x, 11))

Example 2

select_sig(x = Met_GU, class_label = Met_Group_GU, Met_name = Met_name_GU, partial = FALSE, method = "spearman", p_val = "data/pvalue_M_GU.csv")

By calling the above function, the differential network is obtained based on Spearman correlation coefficient. And the table below is part of the csv file output.

x <- read.csv("../inst/INDEED_result_spearman.csv", header = FALSE)
names(x) <- NULL
knitr::kable(head(x, 11))


cx30/INDEED documentation built on May 5, 2019, 2:41 a.m.