In MalathiSIDona/pathDESeq: Pathway-Based Differential Expression Analysis for RNA-Seq Data

Introduction

The package pathDESeq provides a method for performing differential expression analysis of RNA-Seq data. The statistical methodology is based on a Poisson-Gamma-Beta Markov Random Field model. Prior knowledge of biological pathways and gene interaction network information is used to improve the sensitivity and specificity while reducing the false discovery rate [@DO16]. Parameter estimation is carried out using maximum likelihood and method of moment via Iterative Conditional Mode (ICM) algorithm.

Input data and preparations

pathDESeq requires normalized RNA-seq gene expression data (e.g., in FPKM or RPKM format) as its input. The gene expression data for G number of genes correspond to m replicates in the control group and n replicates in the treatment group, should be in the form of a rectangular G X (m+n+1) matrix. The first column must contains the gene names. The next m columns must contain gene expression values for the control group and the last n columns must contain gene expression values for the treatment group. Each row represents the gene expression data for a particular gene, while each column represents the gene expression data for a replicate.

Step 1: Load data gene expression data

In this example, we start by loading a subset of GSE50760 dataset from NCBI database. This dataset comes from a study aimed to identify prognostic indicators, including individual responses to chemotherapy, in colorectal cancer (CRC) patients [@KI14]. The full RNA-Seq data consist of 54 samples (normal colon, primary CRC, and liver metastases) from 18 colorectal cancer patients. All patients were treated at the Asian Medical Centre (Seoul, Korea) between May 2011 and February 2012 (AMC cohort). As a subset from the original data, this data set contains 18 samples (9 normals and 9 primary CRC) and 2000 genes.

require(pathDESeq)
data(CRC.data)

The format of the data set should be like this,

knitr::kable(CRC.data[1:3,1:5])

Step 2: Select genes that can be found in Reactome pathways

We will restrict our analysis to genes which can be found in Reactome pathways. Therefore it is required to filter the data set according to the available pathway information. The object "Reactome.genes" contains 7356 unique gene names which are in Reactome pathway database [@CR14]. However in general, any database which contains biological pathway information can be used to filter the genes.

# call the object "Reactome.genes" 
data(Reactome.genes)

#filtering data for selected genes which can be found in Reactome pathways.
dataset <- data.frame(subset(CRC.data,CRC.data$genes%in%Reactome.genes))

Step 3: Remove duplicate genes

It is required to remove rows with duplicate gene names prior to the analysis.

#remove duplicate genes 
dataset<-dataset[!duplicated(dataset$genes), ]

Step 4: Remove rows with missing values

Rows (genes) with missing expression values need to be removed.

dataset<-dataset[rowSums(is.na(dataset))==0,]

Step 5: Remove rows with all 0's

It is required to remove genes (rows) with all 0's

# remove genes with all 0's 
dataset <- dataset[rowSums(dataset[,-1])>0,]
dim(dataset)

Perform PGBMRF analysis using ICM algorithm

Step 6: Load gene interaction information

data(BioGRID.table)
head(BioGRID.table)

The object "BioGRID.table" contains 237045 gene-gene interactions retrieved from BioGRID database version 3.4.129 [@CH15]. The object contains two variables "Gene.1" and "Gene.2". Gene.1 repesents the official gene symbol for Interacting Partner A and Gene.2 represents the official gene symbol for Interacting Partner B. Each row is a gene-gene interaction pair.

In this analysis, we only use BioGRID database to form the neighbourhood structure. However, in general, any databases that contain gene network/interaction information can be used to form the neighbourhood structure.

Step 7: perform PGBMRF analysis

The pgbmrfICM function is a wrapper function which consists functions to perform the PGBMRF analysis. This function will perform two independent sample t-test to obtain initial DE states for given genes and will create the neighbourhood matrix based on available gene interaction information. After that this will estimate the parameters for PGBMRF model and will estimate the DE states for given genes using Iterative Conditional Mode algorithm [@BE86] with three iterative steps.

(a) Estimating the Poisson-Gamma-Beta parameters using Method of Moment (MoM).

(b) Estimating the Markov Random Field model parameters using MLE's.

This function will write following text files in your working directory.

selected dataset.txt : the gene expression dataset used for the PGBMRF analysis
neib_matrix.txt : the neighbourhood matrix
PGBMRF identified UR genes.txt : UR genes identified by PGBMRF model
PGBMRF identified DR genes.txt : DR genes identified by PGBMRF model
PGBMRF states.txt : the final estimated DE states
PGBMRF results.txt : a summary table for the PGBMRF analysis.

Let's use pgbmtfICM function to perform the analysis with following arguments.

data : A data frame which consists of the gene expression data. The first column must contains gene names.
interactions : A data frame contains gene interaction information.
m : Number of replicates for the control group.
n : Number of replicates for the treatment group.

#User can change initial model parameters, the level of significance, the appropriate number of 
#- Gaussian quadrature points(k) and the maximum number of ICM iterations for the best performance.
#Note that with small datasets we expect the beta estimates to be a bit unstable but the ICM algorithm 
#- convergence is determined by the convergence of estimated DE states.

#pgbmrfICM(data=dataset,interactions=BioGRID.table,m=9,n=9)