The dcGSA Package: Distance-correlation based Gene set Analysis for longitudinal

Overview

Gene set analysis methods, widely used tools in biological studies, have been shown to be more powerful and have better biological interpretation than single gene based analysis. Recently, longitudinal gene expression profiles of patients are collected in many clinical studies to monitor disease progression and understand disease etiology. The identification of gene sets that have coordinated changes with relevant clinical outcomes over time from these data could provide significant insights into the molecular basis of disease progression and hence may lead to better treatments. We developed a gene set analysis method for longitudinal gene expression data, called dcGSA (Distance-correlation based Gene set Analysis, see [@sun2015dcGSA] for details about this method). And, the dcGSA package can be used to perform this analysis.

This document provides a tutorial for using the dcGSA package. The tutorial includes information on (1) how to load gene sets of interest into R, (2) how to prepare your data sets for dcGSA, and (3) how to apply dcGSA to assess associations between phenotypes and gene sets. As with any R package, detailed information on functions, along with their arguments and values, can be obtained in the help files. The analyses performed in this tutorial are based on a sample gene set file and a sample test data, which come along with the dcGSA package. The data can be loaded from the package as shown below.

Load gene sets of interest into R

Gene set, a group of genes that are interacting with each other to perform a biological function, is core part of all gene set analysis methods. To define a biological meaningful gene set is a difficult task. Fortunately, a plethora of gene sets have been created thanks to all the efforts from the acedemic area. The MsigDB database [@liberzon2011molecular] at broad institute stores all kinds of gene sets, among which the KEGG, Biocarta, and REACTOME gene sets are commonly used, and these gene sets can be downloaded in different format (e.g. XML, GMT). (see more about the gene sets and different format at http://www.broadinstitute.org/gsea/msigdb/collections.jsp)

The first step to perform gene set analysis is to load gene sets of interest into R. Assume the gene sets of interests in GMT format have been downloaded from MsigDB (or gene sets of your own have been created in GMT format. Make sure that the Gene ID format in the gene set is the same as in the gene expression data). The readGMT function in dcGSA package can be used to read GMT file into R. The code to use readGMT function is shown below. What the following code does is to read 'sample.gmt.txt' into the R (this file comes with the package and there are 10 KEGG gene sets in the file). The first gene set in the file is called 'KEGG_GLYCOLYSIS_GLUCONEOGENESIS' and there are 62 genes in this set as printed below. To read your gene sets into R, 'fpath' has to be changed to '/path_to_your_file/filename'. The R object 'GS' will be later used as input to dcGSA to perform gene set analysis.

library(dcGSA)
fpath <- system.file("extdata", "sample.gmt.txt", package="dcGSA")
GS <- readGMT(file=fpath)
GS[1]

Prepare your datasets

Besides gene sets, another essential part that is needed by dcGSA is the gene expression data together with phenotyes of interest. As an exmample of the data format, a test dataset 'dcGSAtest' is available after the dcGSA package is installed and can be loaded into R as follows.

library(dcGSA)
data(dcGSAtest)

The dcGSAtest is a list comprised of ID, data (phenotypes of interest), gene (longitudinal gene expresion profiles).

names(dcGSAtest)

The 'ID' is used to indicate the multiple visits for each subject. For example, in the 'dcGSAtest' dataset, there are 5 patients with ID's 'Patient001',...,'Patient005' and there are multiple visits for each subject (e.g. the 'Patient001' has three visits). dcGSA requires that each subject has more than two visits.

dcGSAtest$ID

The 'pheno' is for the phenotypes of interest with each column being one phenotype. The dcGSA function works with continuous phenotypes, such as BMI and blood pressure (Binary phenotype has not been tested). And, dcGSA can test the association between a gene set and multiple related phenotypes simultaneously. For example, if we are interested gene sets that are associated with both BMI and blood pressure, then we can put BMI and blood pressure in the 'pheno'. In the 'dcGSAtest' dataset, there are two phenotypes (i.e. 'pheno1' and 'pheno2') and each row of 'pheno' corresponds to the phenotypes of one specific vist for the subject indicated by 'ID' (e.g. the fourth row of 'pheno' indicate the two phenotypes measured for the 'Patient002' at the first visit.).

dcGSAtest$pheno

The 'gene' is for the gene expression with each column being one gene. Similar to 'pheno', each row of 'gene' corresponds to the gene expression profile of one specific vist for the subject indicated by 'ID'. In the 'dcGSAtest' dataset, there are 16 rows (totally 16 visits by all patients) and 265 genes. Again, make sure that the gene ID in 'gene' is the same as in the gene sets loaded. For example, gene symbols are used for both the 'dcGSAtest' dataset and 'sample.gmt.txt'.

dim(dcGSAtest$gene)
dcGSAtest$gene[1:6,1:6]

Apply dcGSA function to perform gene set analysis

Now, we will use the loaded gene sets (R object 'GS') and the test data ('dcGSAtest') to perform gene set analysis using dcGSA as follows. The P values from dcGSA is based on permutation and the parameter 'nperm' in dcGSA function can be used to control the number of permutations (100 permutations are used for the analysis below). Also, parallel computing is also possible for dcGSA (see the help page for more details.).

set.seed(0824)
library(dcGSA)
fpath <- system.file("extdata", "sample.gmt.txt", package="dcGSA")
GS <- readGMT(file=fpath)
data(dcGSAtest)
dcGSA(data=dcGSAtest, geneset=GS, nperm=100, c=0)

The 'OverlapSize' in the result indicate the number of overlapping genes between the gene expression data ('gene') and gene sets. If we only want to focus on analyzing the gene sets that have at least 20 overlapping genes with the gene expression data, we can set parameters $c=20$ in dcGSA function as illusrated below.

dcGSA(data=dcGSAtest, geneset=GS, nperm=100, c=20)

References



Try the dcGSA package in your browser

Any scripts or data that you put into this service are public.

dcGSA documentation built on Nov. 8, 2020, 7:53 p.m.