MarkRank Tutorial
In Corbi: Collection of Rudimentary Bioinformatics Tools

Introduction

MarkRank is a network-based gene ranking method for identifying the cooperative biomarkers for heterogenous diseases. MarkRank uses the gene cooperation network to explicitly model the gene cooperative effects. MarkRank suggests that explicit modeling of gene cooperative effects can greatly improve the performance of biomarker identification for complex diseases, especially for diseases with high heterogeneity. This tutorial could help the user to execute the markrank function compiled in the Corbi package.

We first import Corbi and other required packages:

rm(list=ls(all=TRUE))
library(Corbi)
library(Matrix)
options(scipen=0)

MarkRank example

The inputs of markrank function include an expression dataset with labelled (e.g. disease/normal) samples and an adjacent matrix of biological network (e.g. PPI network). Here we use simulated dataset to illustrate the usage of markrank function.

Simulate dataset

First, we load in a small network using another function read_net compiled in Corbi.

net <- read_net("network.txt")

This network contains 100 genes. Then we set the number of preset differential expression genes.

size <- 10

We randomly extract a connected subnetwork with the preset size from the loaded network. Here we use the function search_net to implement.

source("search_net.R")
subnet  <- search_net(net, node_size = size, ori_name = TRUE)
deg_list <- as.character(unique(as.vector(subnet)))

The preset differentially expression genes are:

deg_list

Now we simulate the expression matrix. The sample number is set as

sample_num <- 50

The number of disease samples and normal samples are equal.

disease_num <- 25

The code of simulating the expression dataset is as follows. We up-regulated the expression values of preset differentially expression gene set. The detailed description of this process can be found in the Supplementary Materials in our manuscript.

library(matrixcalc)
library(MASS)
l <- net$size
p <- length(deg_list)
exp_dataset <- matrix(0, sample_num, l, dimnames = list(paste("sample", 1:sample_num, sep=""), net$node))
vars  <- 1
sigma <- matrix(0)
while(!is.positive.definite(sigma)){
  vars  <- vars + 1
  sigma <- as.matrix(as(net$matrix,'dgCMatrix'))
  sigma[which(sigma == 1)] <- rnorm(length(which(sigma == 1)), 4, 1)
  sigma[which(sigma == 0)] <- rnorm(length(which(sigma == 0)), 2, 1)
  diag(sigma) <- rnorm(l, vars, 1)
  sigma <- (sigma + t(sigma))/2
}
sample_mean <- rnorm(l, 5, 1)
exp_dataset <- mvrnorm(sample_num, sample_mean, sigma)                              
exp_dataset[1:disease_num, deg_list] <- exp_dataset[1:disease_num, deg_list] * rnorm(disease_num*p, 2, 0.1)

The final simulated gene expression dataset contains 50 samples and 100 genes. The number of preset marker genes is 10.

dim(exp_dataset)

The sample label is

label <- c(rep(0, disease_num), rep(1, sample_num-disease_num))

The adjacent matrix of the network is

adj_matrix <- as.matrix(net$matrix)
adj_matrix <- adj_matrix[colnames(exp_dataset), colnames(exp_dataset)]
adj_matrix <- adj_matrix + t(adj_matrix)

Run markrank

With the above simulated datasets as inputs, we now execute the markrank function to test whether MarkRank could prioritize the preset genes. We use the default parameter combination as alpha=0.8 and lambda=0.2 to run the markrank.

time1 <- system.time(
  result1 <- markrank(exp_dataset, label, adj_matrix, alpha=0.8, lambda=0.2, trace=TRUE)
  )

The output result of markrank contains the following variables:

names(result1)

The scores of top 10 markrank genes are:

s1 <- sort(result1$score, decreasing=TRUE)
s1[1:10]

The scores of pre-set differential expression genes are:

result1$score[deg_list]

The false discovery genes are:

setdiff(names(s1[1:10]), deg_list)

The iteration steps in the random walk iteration is

result1$steps

The user could find the input parameters by using the following code:

result1$initial_pars

Reuse gene cooperation network

The computation of gene cooperation network is time-consuming. To reduce the redundant computation, we can reuse the gene cooperation network computed in previous step. The computed gene cooperation network is stored in

NET2 <- result1$NET2

Using the parameter Given_NET2, we could tune other parameters without the repeated computation of gene cooperation network. For example, we use the alpha=0.8 and lambda=0.5 to recompute the result:

time2 <- system.time(
  result2 <- markrank(exp_dataset, label, adj_matrix, alpha=0.8, lambda=0.5, trace=FALSE, Given_NET2=NET2)
  )

The running time of two results is:

time1
time2

The running time of result2 is far less than result1, because the result2 just contains the step of random walk algorithm. Now the new scores of top 10 markrank genes are:

s2 <- sort(result2$score, decreasing=TRUE)
s2[1:10]

The scores of pre-set differential expression genes are:

result2$score[deg_list]

The false discovery genes are:

setdiff(names(s2[1:10]), deg_list)

Fast construction of gene cooperation network

By using the input parameter d, markrank could reduce the computation time for constructing the gene cooperation network. Only the gene pairs, whose shortest distances in the biological network are less than d, participate in computation. For example, we could run

time3 <- system.time(
  result3 <- markrank(exp_dataset, label, adj_matrix, trace=F, d=2)
  )

The running time of two results is:

time1
time3

In this situation, the distance information of each gene pair can be found in output variable dis. For example, the distance matrix of gene 1 to 10 is: