BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({
    library(longevityTools) 
    library(ggplot2) }) 

Introduction

This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its website here. The GitHub repository of the corresponding R package is available here and the most recent version of this vignette can be found here.

The project component covered by this vignette analyzes drug- and age-related genome-wide expression data from public microarray and RNA-Seq experiments. One of the main objective is the identification drug candidates modulating the expression of longevity genes and pathways. For this, we compare age-related expression signatures with those from drug treamtments. The age-related query signatures are from recent publications such as Peters et al. [-@Peters2015-fc] and Sood et al. [-@Sood2015-pb], while the drug-related reference signatures are from the Connectivity Map (CMAP) and LINCS projects [@Lamb2006-uv].

[Back to Table of Contents]()

Getting Started

Installation

The R software for running longevityTools can be downloaded from CRAN. The longevityTools package can be installed from the R console using the following biocLite install command.

source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script 
biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub
[Back to Table of Contents]()

Loading package and documentation

library("longevityTools") # Loads the package
library(help="longevityTools") # Lists package info
vignette(topic="longevityTools_eDRUG", package="longevityTools") # Opens vignette
[Back to Table of Contents]()

Setup of working environment

Import custom functions

Preliminary functions that are still under developement, or not fully tested and documented can be imported with the source command from the inst/extdata directory of the package.

fctpath <- system.file("extdata", "longevityTools_eDRUG_Fct.R", package="longevityTools")
source(fctpath)
[Back to Table of Contents]()

Create expected directory structure

The following creates the directory structure expected by this workflow. Input data will be stored in the data directory and results will be written to the results directory. All paths are given relative to the present working directory of the user's R session.

dir.create("data"); dir.create("data/CEL"); dir.create("results") 
[Back to Table of Contents]()

Data downloads

Download data from Connectivity Map project site

The drug-related expression data are downloaded from the CMAP web site here. The getCmap function downloads the CMAP rank matrix along with the compound annotations, and getCmapCEL downloads the corresponding 7,056 CEL files. The functions will write the downloaded files to the data and data/CEL directories within the present working directory of the user's R session. Since some of the raw data sets are large, the functions will only rerun the download if the argument rerun is assigned TRUE. If the raw data are not needed then users can skip this time consuming download step and work with the preprocessed data obtained in the next section.

getCmap(rerun=FALSE) # Downloads cmap rank matrix and compound annotation files
getCmapCEL(rerun=FALSE) # Download cmap CEL files. Note, this will take some time
[Back to Table of Contents]()

Overview of CMAP data

The experimental design of the CMAP project is defined in the file cmap_instances_02.xls. Note, this file required some cleaning in LibreOffice (Excel would work for this too). After this it was saved as tab delimited txt file named cmap_instances_02.txt. The following count statisitics are extracted from this file.

The panel of cell lines used by CMAP includes MCF7, ssMCF7, HL60, PC3 and SKMEL5. Each cell type was subjected to the following number of total treatments and number of distinct drugs, respectively. The total number of drugs used by CMAP is 1,309.

library(ggplot2); library(reshape2) 
cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE) 
df <- data.frame(table(cmap[, "cell2"]), as.numeric(table(cmap[!duplicated(paste0(cmap$cmap_name, cmap$cell2)),"cell2"])))
colnames(df) <- c("Cell_type", "Treatments", "Drugs")
df <- melt(df, id.vars=c("Cell_type"), variable.name = "Samples", value.name="Counts")
ggplot(df, aes(Cell_type, Counts, fill=Samples)) + 
       geom_bar(position="dodge", stat="identity") + 
       ggtitle("Number of treatments by cell types")

The number Affymetrix chip used in the experiments is plotted here for each of the three chip types used by CMAP:

df <- data.frame(table(cmap$array3)); colnames(df) <- c("Chip_type", "Counts") 
ggplot(df, aes(Chip_type, Counts)) + 
       geom_bar(position="dodge", stat="identity", fill="cornflowerblue") + 
       ggtitle("Number of chip types")
[Back to Table of Contents]()

Pre-processing of CEL files

Determine chip type from CEL files

The CMAP data set is based on three different Affymetrix chip types (HG-U133A, HT_HG-U133A and U133AAofAv2). The following extracts the chip type information from the CEL files and stores the result in an rds file with the path ./data/chiptype.rds. Users who skipped the download of the CEL files can download this file here.

celfiles <- list.files("./data/CEL", pattern=".CEL$")
chiptype <- sapply(celfiles, function(x) affxparser::readCelHeader(paste0("data/CEL/", x))$chiptype)
if(FALSE) saveRDS(chiptype, "./data/chiptype.rds") # if(FALSE) protects this line from accidental execution!
[Back to Table of Contents]()

Normalization of CEL files

The follwoing processes the CEL files from each chip type separately using the MAS5 normalization algorithm. The results will be written to 3 subdirectores under data that are named after the chip type names. To save time, the processing is parallelized with BiocParallel to run on 100 CPU cores of a computer cluster with a scheduler (e.g. Torque). The number of CEL files from each chip type are: 807 CEL files from HG-U133A, 6029 CEL files from HT_HG-U133A, and 220 CEL files from U133AAofAv2. Note, these numbers are slightly different than those reported in the cmap_instances_02.txt file. The MAS5 normalized data sets can be downloaded here: HG-U133A, HT_HG-U133A, U133AAofAv2.

library(BiocParallel); library(BatchJobs); library(affy)
chiptype <- readRDS("./data/chiptype.rds")
chiptype_list <- split(names(chiptype), as.character(chiptype))
normalizeCel(chiptype_list, rerun=FALSE) 
[Back to Table of Contents]()

Combine results from same chip type in single data frame

chiptype_dir <- unique(readRDS("./data/chiptype.rds"))
combineResults(chiptype_dir, rerun=FALSE)
[Back to Table of Contents]()

Clean-up of intermediate files

This deletes intermediate files. Before executing these lines, please make sure that this is what you want.

for(i in seq_along(chiptype_dir)) unlink(list.files(paste0("data/", chiptype_dir[i]), pattern="cellbatch", full.names=TRUE), recursive=TRUE)
unlink("data/CEL/*.CEL") # Deletes downloaded CEL files
[Back to Table of Contents]()

Obtain annotation information

The following generates annotation information for the Affymetirx probe set identifiers. Note, the three different Affymetrix chip types used by CMAP share most probe set ids (>95%), meaning it is possible to combine the data after normalization and use the same annotation package for all of them. The annotation libraries for the chip types HG-U133A and HT_HG-U133A are hgu133a.db and hthgu133a.db, respectively. However, there is no annotation library (e.g. CDF) available for U133AAofAv2. The annotation file can be downloaded from here: myAnnot.xls.

library(hgu133a.db)
myAnnot <- data.frame(ACCNUM=sapply(contents(hgu133aACCNUM), paste, collapse=", "), 
                             SYMBOL=sapply(contents(hgu133aSYMBOL), paste, collapse=", "), 
                             UNIGENE=sapply(contents(hgu133aUNIGENE), paste, collapse=", "), 
                             ENTREZID=sapply(contents(hgu133aENTREZID), paste, collapse=", "), 
                             ENSEMBL=sapply(contents(hgu133aENSEMBL), paste, collapse=", "), 
                             DESC=sapply(contents(hgu133aGENENAME), paste, collapse=", "))
write.table(myAnnot, file="./results/myAnnot.xls", quote=FALSE, sep="\t", col.names = NA) 
saveRDS(myAnnot, "./results/myAnnot.rds")
[Back to Table of Contents]()

DEG analysis with limma

Generate list of CEL names defining treatment vs. control comparisons

The sampleList function extracts the sample comparisons (contrasts) from the CMAP annotation table and stores them as a list.

cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE) 
# comp_list <- sampleList(cmap, myby="CMP")
comp_list <- sampleList(cmap, myby="CMP_CELL")
[Back to Table of Contents]()

Load normalized expression data

The following loads the MAS5 normalized expression data into a single data.frame. To accelerate the import, the data is read from rds files.

chiptype_dir <- unique(readRDS("./data/chiptype.rds"))
df1 <- readRDS(paste0("data/", chiptype_dir[1], "/", "all_mas5exprs.rds"))
df2 <- readRDS(paste0("data/", chiptype_dir[2], "/", "all_mas5exprs.rds"))
df3 <- readRDS(paste0("data/", chiptype_dir[3], "/", "all_mas5exprs.rds"))
affyid <- rownames(df1)[rownames(df1) %in% rownames(df2)]; affyid <- affyid[affyid %in% rownames(df3)]
mas5df <- cbind(df1[affyid,], df2[affyid,], df3[affyid,])

Transform probe set to gene level data

The next step generates gene level expression values. If genes are represented by several probe sets then their mean intensities are used. This is necessary because the U133 chip contains many genes with duplicated probe sets. Probe sets not matching any gene are removed.

myAnnot <- readRDS("./results/myAnnot.rds") 
myAnnot <- myAnnot[as.character(myAnnot[,"ENTREZID"]) != "NA",]
mas5df <- mas5df[rownames(myAnnot),]
idlist <- tapply(row.names(myAnnot), as.character(myAnnot$ENTREZID), c)
mas5df <- t(sapply(names(idlist), function(x) colMeans(mas5df[idlist[[x]], ])))
[Back to Table of Contents]()

DEG analysis with limma

The analysis of differentially expressed genes (DEGs) is performed with the limma package. Genes meeting the chosen cutoff criteria are reported as DEGs (below set to FDR of 10% and a minimum fold change of 2). The DEG matrix is written to a file named degMA.xls.

degList <- runLimma(df=log2(mas5df), comp_list, fdr=0.10, foldchange=1, verbose=TRUE, affyid=NULL)
write.table(degList$DEG, file="./results/degMA.xls", quote=FALSE, sep="\t", col.names = NA) 
saveRDS(degList$DEG, "./results/degMA.rds") # saves binary matrix 
saveRDS(degList, "./results/degList.rds") # saves entire degList
[Back to Table of Contents]()

Number of DEGs across drug treatments

The following plots the number of drug treatments (y-axis) for increasing bin sizes (x-axis) of DEGs.

degMAgene <- readRDS("./results/degMA.rds")
y <- as.numeric(colSums(degMAgene))
interval <- table(cut(y, right=FALSE, dig.lab=5,  breaks=c(0, 5, 10, 50, 100, 200, 500, 1000, 10000)))
df <- data.frame(interval); colnames(df) <- c("Bins", "Counts")
ggplot(df, aes(Bins, Counts)) + 
       geom_bar(position="dodge", stat="identity", fill="cornflowerblue") + 
       ggtitle("DEG numbers by bins")
[Back to Table of Contents]()

Identify DEG overlaps with Peters et al. [-@Peters2015-fc]

Peters et al. [-@Peters2015-fc] reported 1,497 age-related gene expression signatures. The intersectStats function computes their intersects with each of the 3,318 drug-responsive DEG sets from CMAP. The result includes the Jaccard index as a simple similarity metric for gene sets as well as the raw and adjusted p-values based on the hypergeometric distribution expressing how likely it is to obtain the observed intersect sizes just by chance. The results for the 20 top scoring drugs are given below and the full data set is written to a file named degOL_PMID26490707.xls.

PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#")
myAnnot <- readRDS("./results/myAnnot.rds") 
geneid <- as.character(PMID26490707$"NEW.Entrez.ID")
degMAgene <- readRDS("./results/degMA.rds") # Faster than read.delim()
degMAsub <- degMAgene[rownames(degMAgene) %in% geneid,]
degOL_PMID26490707 <- intersectStats(degMAgene, degMAsub)
write.table(degOL_PMID26490707, file="./results/degOL_PMID26490707.xls", quote=FALSE, sep="\t", col.names = NA) 
sum(degOL_PMID26490707[,1] > 0) # Drugs with any overlap
degOL_PMID26490707[1:20,]
[Back to Table of Contents]()

Identify DEG overlaps with Sood et al. [-@Sood2015-pb]

Sood et al. [-@Sood2015-pb] reported 150 age-related gene expression signatures. The intersectStats function computes their intersects with each of the 3,318 drug-responsive DEG sets from CMAP. The result includes the Jaccard index as a simple similarity metric for gene sets as well as the raw and adjusted p-values based on the hypergeometric distribution expressing how likely it is to observe the observed intersect sizes just by chance. The results for the 20 top scoring drugs are given below and the full data set is written to a file named degOL_PMID26343147.xls.

PMID26343147 <- read.delim("./data/PMID26343147_S1T1.xls", check.names=FALSE, comment="#")
myAnnot <- readRDS("./results/myAnnot.rds") 
geneid <- as.character(myAnnot[rownames(myAnnot) %in% as.character(PMID26343147[,1]), "ENTREZID"])
geneid <- geneid[geneid!="NA"]
degMA <- readRDS("./results/degMA.rds") # Faster then read.delim()
degMA <- degMA[ , !is.na(colSums(degMA))] # Remove columns where DEG analysis was not possible
degMAsub <- degMA[geneid,]
degOL_PMID26343147 <- intersectStats(degMAgene, degMAsub)
write.table(degOL_PMID26343147, file="./results/degOL_PMID26343147.xls", quote=FALSE, sep="\t", col.names = NA) 
sum(degOL_PMID26343147[,1] > 0) # Drugs with any overlap
degOL_PMID26343147[1:20,] # Top 20 scoring drugs
[Back to Table of Contents]()

Drugs affecting known longevity genes

The following identifies CMAP drugs affecting the expression of the IGF1 or IGF1R genes. The final result is written to a file named deg_IGF1.xls.

genesymbols <- c("IGF1", "IGF1R")
geneids <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"ENTREZID"]))
names(geneids) <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"SYMBOL"]))
degList <- readRDS("./results/degList.rds") 
df <- data.frame(row.names=colnames(degList$DEG), check.names=FALSE)
index <- which(colSums(degList$DEG[geneids,])>= 1) 
for(i in seq_along(geneids)) {
    tmp <- data.frame(DEG=degList$DEG[geneids[i],index], logFC=degList$logFC[geneids[i],index], FDR=degList$FDR[geneids[i],index])
    colnames(tmp) <- paste0(names(geneids)[i], "_", colnames(tmp))
    df <- cbind(df, tmp[rownames(df),] )    
}
df <- df[names(index),]
write.table(df, file="./results/deg_IGF1.xls", quote=FALSE, sep="\t", col.names = NA) 

Now the final data.frame can be sorted by increasing mean FDR values.

igfDF <- read.delim("./results/deg_IGF1.xls", row.names=1)
igfDF[order(rowMeans(igfDF[,c(3,6)])),][1:20,]

Back to Table of Contents

Plot structures of compounds

library(ChemmineR)
mypath <- system.file("extdata", "longevitydrugs.sdf", package="longevityTools")
mypath <- "../inst/extdata/longevitydrugs.sdf"
sdfset <- read.SDFset(mypath)
data(sdfsample)
sdfsample
plot(sdfsample[1:4], print=FALSE)
[Back to Table of Contents]()

Connectivity maps enrichment analysis

The connectivity maps approach is a rank-based enrichment method utilizing the KS test [@Lamb2006-uv]. It measures the similarities of expression signatures based on the enrichment of up- and down-regulated genes at the top and bottom of sorted (ranked) gene lists.

Query drug signatures

The following uses the 1,497 age-related gene expression signatures from Peters et al. [-@Peters2015-fc] as a query against the CMAP signatures. The results are sorted by the ES Distance and the top scoring 20 drugs are given below. The full result table is
written to a file named drugcmap2.xls.

library(DrugVsDisease)
PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE)
data(drugRL)
PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(drugRL),]
PMID26490707sub <- PMID26490707sub[order(PMID26490707sub$Zscore, decreasing=TRUE),]
PMID26490707sub <- rbind(head(PMID26490707sub, 100), tail(PMID26490707sub, 100)) # Subsets to top 200 DEGs 
testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])), 
                     pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])))
drugcmap <- classifyprofile(data=testprofiles$ranklist, case="disease", signif.fdr=0.5, no.signif=20)
drugcmap2 <- classifyprofile(data=testprofiles$ranklist, case="disease", 
                            pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic", 
                            dynamic.fdr=5, signif.fdr=5, adj="BH", no.signif=1000)
write.table(drugcmap2, file="./results/drugcmap2.xls", quote=FALSE, sep="\t", col.names = NA) 
drugcmap2[[1]][1:20,]
[Back to Table of Contents]()

Query disease signatures

The same query is performed against a reference set of disease expression signatures. The results are sorted by the ES Distance and the top scoring 20 drugs are given below. The full result table is written to a file named diseasecmap2.xls.

PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE)
data(diseaseRL)
PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(diseaseRL),]
testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])), 
                     pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])))
diseasecmap <- classifyprofile(data=testprofiles$ranklist, case="drug", signif.fdr=0.5, no.signif=20)
diseasecmap2 <- classifyprofile(data=testprofiles$ranklist, case="drug", 
                            pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic", 
                            dynamic.fdr=5, adj="BH", no.signif=100)
write.table(diseasecmap2, file="./results/diseasecmap2.xls", quote=FALSE, sep="\t", col.names = NA) 
diseasecmap2[[1]][1:20,]
[Back to Table of Contents]()

Age-drug network analysis

In progress...

[Back to Table of Contents]()

Age-disease network analysis

In progress...

[Back to Table of Contents]()

Funding

This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).

[Back to Table of Contents]()

Version information

sessionInfo()
[Back to Table of Contents]()

References



tgirke/longevityTools documentation built on May 31, 2019, 9:07 a.m.