Introduction

The GeneExpressionSignature package utilizes gene expression profiles to measure the similarity between different biological states. It provides two algorithms for similarity measurement: the GSEA algorithm which is mentioned in @iorio2010discovery and the PGSEA algorithm in PGSEA R package. A further description of the measurement methods based on gene expression signature can be found in @lamb2006connectivity, @hu2009human and @iorio2010discovery.

This manual is a brief introduction to structure, functions and usage of GeneExpressionSignature package. It shows how the biological similarity is determined through a series of calculation steps and how that information can be used for further cluster analysis.

The current version of GeneExpressionSignature can be used only with data coming from the same platform, examples are on the HG-U133A platform.

Getting Started

A complete analysis procedure accepts a set of gene expression profiles representing different biological states as input, and generates a similarity matrix as output. It can be divided into three steps: 1)data ranking, 2)rank merging, and 3)similarity measuring.

First, we load the package by entering the following command in your R session:

library(GeneExpressionSignature)

Data Ranking

Gene expression profiles should be properly preprocessed before analysis as prerequisite, including background correction, normalization and summarization. Instead of the exact values, ranks of gene expression levels are used in the following procedure. A ranked list of genes was obtained first by sorting the microarray probe-set identifiers according to the different expression values (count or ratio). It should be noticed that there is no standard methods for data preprocessing, and there is a function getRLs which takes the method in C-MAP for data preprocessing just for reference. We can obtain ranked lists matrix by calling getRLs.

Your experimental data could be used for analysing, or users can download gene-expression profiles from the GEO database with R package GEOquery. Users can see the doc in the package GEOquery for more details.

As an example, we download data from GEO database with package GEOquery. Then combined the treatment expression values to form a treatment matrix as well as the control expression values.

# If you have network access
#GSM118720 <- getGEO('GSM118720')
# GSM118721 <- getGEO('GSM118721')
if (require(GEOquery)){
  #treatment gene-expression profiles
  GSM118720 <- getGEO(filename=system.file("extdata/GSM118720.soft",package=
  "GeneExpressionSignature"))
  #control gene-expression profiles
  GSM118721 <- getGEO(filename=system.file("extdata/GSM118721.soft",package=
  "GeneExpressionSignature"))
  #data ranking according to the different expression values 
  control <- as.matrix(as.numeric(Table(GSM118721)[,2]))
  treatment <- as.matrix(as.numeric(Table(GSM118720)[,2]))
  ranked_list <-getRLs(control,treatment)}

Rank Merging

By rank merging, multiple ranked lists are merging into a single ranked list, referred as prototype ranked list (PRL), representing certain kind of biological state. This procedure is mainly performed before similarity measuring, and applied to specific situations that occur when multiple ranked list are assigned to one single biological state with different cell types or experimental condition.

However, two different cases should be considered: 1) all ranked list with the same biological state are treated eaually important; 2) each individual ranked lists has its own ranked weights. This package provides two commonly employed algorithms: one utilizes the Kruskal algorithm proposed by @iorio2010discovery for the former case and another takes the average ranking technique a simple but ranther useful method. Function RankMering is provided for aggregating the ranked lists into one or many PRLs according their phenotypic data. All the things that we need to do is construct a ExpressionSet object as input, with ranked lists as assay data and corresponding biological states as phenotypic data.

For convenience, ranking data stored as ExpressionSet class in eset object as input data, with ranked lists (obtained by calling getRLs} as assay data and corresponding biological states as phenotypic data. As an example, we start from loading cultured the exampleSet data, a subset of C-MAP @lamb2006connectivity as sample data, which is a large reference catalogue of gene expression data from cultured human cells perturbed with many chemicals and genetic reagents. The sub dataset is composed of 50 paired gene expression profiles involving 22283 genes. This profiles are obtained from cells treated 15 compounds respectively, the values of which already converted to rank orders.

data(exampleSet)
show(exampleSet)
exprs(exampleSet)[c(1:10),c(1:3)]
levels(as(phenoData(exampleSet),"data.frame")[,1])

Rank merging process will generate a mergingSet of 15 PRLs from 50 paired expression profiles with each PRL corresponding one of 15 compounds respectively.

MergingSet <- RankMerging(exampleSet,"Spearman",weighted=TRUE)
show(MergingSet)

Similarity Measuring

One single combined PRL for a state was obtained after rank merging procedure. These PRLs are used to measure the similarity of the gene signature across different biological states by scoring functions ScoreGSEA and ScorePGSEA. Not all the genes are involved in similarity measuring, but only a subset of genes called gene signature whose combined expression pattern is uniquely characteristic of the biological state. Generally the genes used as gene signatures in the similarity scoring procedure are predefined by priori knowledge. @iorio2010discovery proposed an "optimal signature" approach by taking the most up-regulated genes and the most down-regulated genes as gene signature.The size of gene signatures need to be considered, which is taken as another parameter besides the PRLs in similarity measuring. In most cases, the default size of gene signature is 250 for genome-wide expression profile.

Suppose N is the number of PRLs (also same as the number of biological states), an N x N distance matrix is generated by similarity measurement. For mergingSet, we will get a 15 x 15 matrix corresponding to the similarity distances between these compounds.

ds <- ScoreGSEA(MergingSet,250,"avg")
ds[1:5,1:5]

Signature Distance

As we mentioned above, four algorithms implemented as functions getRLs, RankMerging, ScoreGSEA and ScorePGSEA, one is for data preprocessing, one called Iorio algorithm is for rank merging, the other two algorithms called GSEA and PGSEA are for similarity measuring. Moreover, function SignatureDistance is provided to serve as a single entry and easy access point to rank merging and similarity measuring, which runs through the including rank merging and scoring, and is recommended to use in most cases. Data ranking is not integration into this funciton for no standard methods for data preprocessing and gene-expression data types is uncertain. Furthermore, there is no effective method to integrate data from different platforms. Function getRLs which takes the method in C-MAP for data preprocessing just for reference.

ds1 <- SignatureDistance(exampleSet,SignatureLength=250,MergingDistance="Spearman",
                  ScoringMethod="GSEA",ScoringDistance="avg",weighted=TRUE)
ds1[1:5,1:5]

Implementation Details

Adaptively Weighted Rank Merging

The Iorio's rank merging algorithm utilizes Kruskal algorithm @cormen1990introduction to merge the ranked lists which corresponding to a same biological state. The distance of these ranked lists must be calculated first, a measure of the distance between two ranked lists is computed using Spearman algorithm or Kendall tau algorithm. It should be noticed that is rank merging with Kendall tau distance is time consuming, so we recommend selecting the Spearman distance. Next, merge the two or more ranked lists with the same biological state using Borda merging algorithm.

According to the Kruskal algorithm method @cormen1990introduction, this rank merging algorithm searches for the two ranked lists with the smallest Spearman's Footrule distance first, and then merges them using the Borda Merging method, obtaining a new ranked list. Finally, the new list replaces the two unmerged lists. This process won't terminate until only one list remains.

For convenience, users can directly obtain a PRL for each state by the function Iorio.RankMerging, which uses Sprearman, BordaMerging, and Kruskal algorithms to aggregate the ranked lists obtained with the same biological state. For instance, we will merge the sample data which with 50 samples into 15 samples.

MergingSet <- RankMerging(exampleSet,"Spearman",weighted=TRUE)
show(MergingSet)

Eqully Weighted Rank Merging

A simple but rather useful method for this problem is the average ranking technique. The technique is a two step process when we are under the assumption that importance is equally weighted for each ranked list. First step is to calculate average rank for each ranked list and then the second step is to construct their final rankings.

Similarity Measuring

Once ranked lists with same biological states are merged to one single PRL, Gene Set Enrichment Analysis (GSEA) and Parametric Gene Set Enrichment Analysis (PGSEA) are adopted to measure the similarity among these PRLs.

GSEA algorithm @subramanian2005gene is a nonparametric, rank-based method for similarity measuring to determine whether a priori defined set of genes shows statistically significant, concordant differences between two biological states. PGSEA algorithm is a modified gene set enrichment analysis method based on a parametric statistical analysis model, and we use the functions in R package PGSEA for similarity measuring. Both of these two functions gives the corresponding p value, function ScoreGSEA calcutes the empirical p values from Monte Carlo Procedures @north2002note.

ds <- ScoreGSEA(MergingSet,250,"avg")
ds[1:5,1:5]
ds <- ScorePGSEA(MergingSet,250,"avg")
ds[1:5,1:5]

Futher Analysis

To illustrate how to use GeneExpressionSignature in analysis of gene expression signatures, affinity propagation clustering can be used to group these biological states by the similarity of gene signature. Affinity propagation cluster algorithm iteratively searches for optimal clustering by maximizing an objective function called net similarity. Here, we use function in R apcluster package to classify the 15 biological states into 3 groups. In this step, R package apcluster should also be installed on your computer.

if (require(apcluster)){
  library(apcluster)
  clusterResult <- apcluster(1-ds)
  show(clusterResult)
}

Cytoscape is used to visualize the result of clustering. In the network, nodes denotes different compounds (cell states treated with different compounds), and the edge means the similarity distance between these two compounds is lower than a threshold, which is 0.68 here. Different colors denote different groups, as the classification of compounds. We note that the largest group is numbered 9 nodes, and the other two consist of 3 nodes for each group.

cluster

Session Information

The version number of R and packages loaded for generating the vignette were:

sessionInfo()

References



zhilongjia/GeneExpressionSignature documentation built on May 4, 2019, 11:21 p.m.