BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({ library(longevityTools) library(ggplot2) })
This vignette is part of the NIA funded Longevity Genomics project. For more information on this project please visit its website here. The GitHub repository of the corresponding R package is available here and the most recent version of this vignette can be found here.
The project component covered by this vignette analyzes drug- and age-related genome-wide expression data from public microarray and RNA-Seq experiments. One of the main objective is the identification drug candidates modulating the expression of longevity genes and pathways. For this, we compare age-related expression signatures with those from drug treamtments. The age-related query signatures are from recent publications such as Peters et al. [-@Peters2015-fc] and Sood et al. [-@Sood2015-pb], while the drug-related reference signatures are from the Connectivity Map (CMAP) and LINCS projects [@Lamb2006-uv].
The R software for running longevityTools
can be downloaded from CRAN. The longevityTools
package can be installed from the R console using the following biocLite
install command.
source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script biocLite("tgirke/longevityTools", build_vignettes=FALSE) # Installs package from GitHub
library("longevityTools") # Loads the package library(help="longevityTools") # Lists package info vignette(topic="longevityTools_eDRUG", package="longevityTools") # Opens vignette
Preliminary functions that are still under developement, or not fully tested and documented
can be imported with the source
command from the inst/extdata
directory of the package.
fctpath <- system.file("extdata", "longevityTools_eDRUG_Fct.R", package="longevityTools") source(fctpath)
The following creates the directory structure expected by this workflow. Input data
will be stored in the data
directory and results will be written to the results
directory.
All paths are given relative to the present working directory of the user's R session.
dir.create("data"); dir.create("data/CEL"); dir.create("results")
The drug-related expression data are downloaded from the CMAP web site
here. The getCmap
function downloads
the CMAP rank matrix along with the compound annotations, and getCmapCEL
downloads the corresponding 7,056 CEL files. The functions will write the
downloaded files to the data
and data/CEL
directories within the present
working directory of the user's R session. Since some of the raw data sets
are large, the functions will only rerun the download if the argument rerun
is assigned TRUE
. If the raw data are not needed then users can skip this
time consuming download step and work with the preprocessed data
obtained in the next section.
getCmap(rerun=FALSE) # Downloads cmap rank matrix and compound annotation files getCmapCEL(rerun=FALSE) # Download cmap CEL files. Note, this will take some time
The experimental design of the CMAP project is defined in the file
cmap_instances_02.xls
. Note, this file required some cleaning in LibreOffice
(Excel would work for this too). After this it was saved as tab delimited txt
file named
cmap_instances_02.txt.
The following count statisitics are extracted from this file.
The panel of cell lines used by CMAP includes MCF7, ssMCF7, HL60, PC3 and SKMEL5. Each cell type was subjected to the following number of total treatments and number of distinct drugs, respectively. The total number of drugs used by CMAP is 1,309.
library(ggplot2); library(reshape2) cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE) df <- data.frame(table(cmap[, "cell2"]), as.numeric(table(cmap[!duplicated(paste0(cmap$cmap_name, cmap$cell2)),"cell2"]))) colnames(df) <- c("Cell_type", "Treatments", "Drugs") df <- melt(df, id.vars=c("Cell_type"), variable.name = "Samples", value.name="Counts") ggplot(df, aes(Cell_type, Counts, fill=Samples)) + geom_bar(position="dodge", stat="identity") + ggtitle("Number of treatments by cell types")
The number Affymetrix chip used in the experiments is plotted here for each of the three chip types used by CMAP:
df <- data.frame(table(cmap$array3)); colnames(df) <- c("Chip_type", "Counts") ggplot(df, aes(Chip_type, Counts)) + geom_bar(position="dodge", stat="identity", fill="cornflowerblue") + ggtitle("Number of chip types")
The CMAP data set is based on three different Affymetrix chip types (HG-U133A,
HT_HG-U133A and U133AAofAv2). The following extracts the chip type information
from the CEL files and stores the result in an rds
file with the path
./data/chiptype.rds
. Users who skipped the download of the CEL files can
download this file here.
celfiles <- list.files("./data/CEL", pattern=".CEL$") chiptype <- sapply(celfiles, function(x) affxparser::readCelHeader(paste0("data/CEL/", x))$chiptype) if(FALSE) saveRDS(chiptype, "./data/chiptype.rds") # if(FALSE) protects this line from accidental execution!
The follwoing processes the CEL files from each chip type separately using the
MAS5 normalization algorithm. The results will be written to 3 subdirectores
under data
that are named after the chip type names. To save time, the
processing is parallelized with BiocParallel
to run on 100 CPU cores of a
computer cluster with a scheduler (e.g. Torque). The number of CEL files from
each chip type are: 807 CEL files from HG-U133A, 6029 CEL files from
HT_HG-U133A, and 220 CEL files from U133AAofAv2. Note, these numbers are slightly
different than those reported in the cmap_instances_02.txt
file. The MAS5 normalized data
sets can be downloaded here:
HG-U133A,
HT_HG-U133A,
U133AAofAv2.
library(BiocParallel); library(BatchJobs); library(affy) chiptype <- readRDS("./data/chiptype.rds") chiptype_list <- split(names(chiptype), as.character(chiptype)) normalizeCel(chiptype_list, rerun=FALSE)
chiptype_dir <- unique(readRDS("./data/chiptype.rds")) combineResults(chiptype_dir, rerun=FALSE)
This deletes intermediate files. Before executing these lines, please make sure that this is what you want.
for(i in seq_along(chiptype_dir)) unlink(list.files(paste0("data/", chiptype_dir[i]), pattern="cellbatch", full.names=TRUE), recursive=TRUE) unlink("data/CEL/*.CEL") # Deletes downloaded CEL files
The following generates annotation information for the Affymetirx probe set
identifiers. Note, the three different Affymetrix chip types used by CMAP
share most probe set ids (>95%), meaning it is possible to combine the data
after normalization and use the same annotation package for all of them. The
annotation libraries for the chip types HG-U133A and HT_HG-U133A are
hgu133a.db
and hthgu133a.db
, respectively. However, there is no annotation
library (e.g. CDF) available for U133AAofAv2. The annotation file can be downloaded
from here: myAnnot.xls
.
library(hgu133a.db) myAnnot <- data.frame(ACCNUM=sapply(contents(hgu133aACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(hgu133aSYMBOL), paste, collapse=", "), UNIGENE=sapply(contents(hgu133aUNIGENE), paste, collapse=", "), ENTREZID=sapply(contents(hgu133aENTREZID), paste, collapse=", "), ENSEMBL=sapply(contents(hgu133aENSEMBL), paste, collapse=", "), DESC=sapply(contents(hgu133aGENENAME), paste, collapse=", ")) write.table(myAnnot, file="./results/myAnnot.xls", quote=FALSE, sep="\t", col.names = NA) saveRDS(myAnnot, "./results/myAnnot.rds")
limma
The sampleList
function extracts the sample comparisons (contrasts) from the
CMAP annotation table and stores them as a list.
cmap <- read.delim("./data/cmap_instances_02.txt", check.names=FALSE) # comp_list <- sampleList(cmap, myby="CMP") comp_list <- sampleList(cmap, myby="CMP_CELL")
The following loads the MAS5 normalized expression data into a single data.frame
.
To accelerate the import, the data is read from rds
files.
chiptype_dir <- unique(readRDS("./data/chiptype.rds")) df1 <- readRDS(paste0("data/", chiptype_dir[1], "/", "all_mas5exprs.rds")) df2 <- readRDS(paste0("data/", chiptype_dir[2], "/", "all_mas5exprs.rds")) df3 <- readRDS(paste0("data/", chiptype_dir[3], "/", "all_mas5exprs.rds")) affyid <- rownames(df1)[rownames(df1) %in% rownames(df2)]; affyid <- affyid[affyid %in% rownames(df3)] mas5df <- cbind(df1[affyid,], df2[affyid,], df3[affyid,])
The next step generates gene level expression values. If genes are represented by several probe sets then their mean intensities are used. This is necessary because the U133 chip contains many genes with duplicated probe sets. Probe sets not matching any gene are removed.
myAnnot <- readRDS("./results/myAnnot.rds") myAnnot <- myAnnot[as.character(myAnnot[,"ENTREZID"]) != "NA",] mas5df <- mas5df[rownames(myAnnot),] idlist <- tapply(row.names(myAnnot), as.character(myAnnot$ENTREZID), c) mas5df <- t(sapply(names(idlist), function(x) colMeans(mas5df[idlist[[x]], ])))
limma
The analysis of differentially expressed genes (DEGs) is performed with the limma
package.
Genes meeting the chosen cutoff criteria are reported as DEGs (below set to FDR of 10% and
a minimum fold change of 2). The DEG matrix is written to a file named
degMA.xls
.
degList <- runLimma(df=log2(mas5df), comp_list, fdr=0.10, foldchange=1, verbose=TRUE, affyid=NULL) write.table(degList$DEG, file="./results/degMA.xls", quote=FALSE, sep="\t", col.names = NA) saveRDS(degList$DEG, "./results/degMA.rds") # saves binary matrix saveRDS(degList, "./results/degList.rds") # saves entire degList
The following plots the number of drug treatments (y-axis) for increasing bin sizes (x-axis) of DEGs.
degMAgene <- readRDS("./results/degMA.rds") y <- as.numeric(colSums(degMAgene)) interval <- table(cut(y, right=FALSE, dig.lab=5, breaks=c(0, 5, 10, 50, 100, 200, 500, 1000, 10000))) df <- data.frame(interval); colnames(df) <- c("Bins", "Counts") ggplot(df, aes(Bins, Counts)) + geom_bar(position="dodge", stat="identity", fill="cornflowerblue") + ggtitle("DEG numbers by bins")
Peters et al. [-@Peters2015-fc] reported 1,497 age-related gene expression
signatures. The intersectStats
function computes their intersects with each
of the 3,318 drug-responsive DEG sets from CMAP. The result includes the
Jaccard index as a simple similarity metric for gene sets as well as the raw
and adjusted p-values based on the hypergeometric distribution expressing how
likely it is to obtain the observed intersect sizes just by chance. The
results for the 20 top scoring drugs are given below and the full data set is
written to a file named
degOL_PMID26490707.xls
.
PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#") myAnnot <- readRDS("./results/myAnnot.rds") geneid <- as.character(PMID26490707$"NEW.Entrez.ID") degMAgene <- readRDS("./results/degMA.rds") # Faster than read.delim() degMAsub <- degMAgene[rownames(degMAgene) %in% geneid,] degOL_PMID26490707 <- intersectStats(degMAgene, degMAsub) write.table(degOL_PMID26490707, file="./results/degOL_PMID26490707.xls", quote=FALSE, sep="\t", col.names = NA) sum(degOL_PMID26490707[,1] > 0) # Drugs with any overlap degOL_PMID26490707[1:20,]
Sood et al. [-@Sood2015-pb] reported 150 age-related gene expression signatures.
The intersectStats
function computes their intersects with each of the 3,318
drug-responsive DEG sets from CMAP. The result includes the Jaccard index as a simple
similarity metric for gene sets as well as the raw and adjusted p-values based on the
hypergeometric distribution expressing how likely it is to observe the observed intersect
sizes just by chance. The results for the 20 top scoring drugs are given below and the full
data set is written to a file named degOL_PMID26343147.xls
.
PMID26343147 <- read.delim("./data/PMID26343147_S1T1.xls", check.names=FALSE, comment="#") myAnnot <- readRDS("./results/myAnnot.rds") geneid <- as.character(myAnnot[rownames(myAnnot) %in% as.character(PMID26343147[,1]), "ENTREZID"]) geneid <- geneid[geneid!="NA"] degMA <- readRDS("./results/degMA.rds") # Faster then read.delim() degMA <- degMA[ , !is.na(colSums(degMA))] # Remove columns where DEG analysis was not possible degMAsub <- degMA[geneid,] degOL_PMID26343147 <- intersectStats(degMAgene, degMAsub) write.table(degOL_PMID26343147, file="./results/degOL_PMID26343147.xls", quote=FALSE, sep="\t", col.names = NA) sum(degOL_PMID26343147[,1] > 0) # Drugs with any overlap degOL_PMID26343147[1:20,] # Top 20 scoring drugs
The following identifies CMAP drugs affecting the expression of the IGF1 or IGF1R genes.
The final result is written to a file named deg_IGF1.xls
.
genesymbols <- c("IGF1", "IGF1R") geneids <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"ENTREZID"])) names(geneids) <- unique(as.character(myAnnot[myAnnot$SYMBOL %in% genesymbols,"SYMBOL"])) degList <- readRDS("./results/degList.rds") df <- data.frame(row.names=colnames(degList$DEG), check.names=FALSE) index <- which(colSums(degList$DEG[geneids,])>= 1) for(i in seq_along(geneids)) { tmp <- data.frame(DEG=degList$DEG[geneids[i],index], logFC=degList$logFC[geneids[i],index], FDR=degList$FDR[geneids[i],index]) colnames(tmp) <- paste0(names(geneids)[i], "_", colnames(tmp)) df <- cbind(df, tmp[rownames(df),] ) } df <- df[names(index),] write.table(df, file="./results/deg_IGF1.xls", quote=FALSE, sep="\t", col.names = NA)
Now the final data.frame
can be sorted by increasing mean FDR values.
igfDF <- read.delim("./results/deg_IGF1.xls", row.names=1) igfDF[order(rowMeans(igfDF[,c(3,6)])),][1:20,]
library(ChemmineR) mypath <- system.file("extdata", "longevitydrugs.sdf", package="longevityTools") mypath <- "../inst/extdata/longevitydrugs.sdf" sdfset <- read.SDFset(mypath) data(sdfsample) sdfsample plot(sdfsample[1:4], print=FALSE)
The connectivity maps approach is a rank-based enrichment method utilizing the KS test [@Lamb2006-uv]. It measures the similarities of expression signatures based on the enrichment of up- and down-regulated genes at the top and bottom of sorted (ranked) gene lists.
The following uses the 1,497 age-related gene expression signatures from Peters et al.
[-@Peters2015-fc] as a query against the CMAP signatures. The results are sorted by the
ES Distance and the top scoring 20 drugs are given below. The full result table is
written to a file named drugcmap2.xls
.
library(DrugVsDisease) PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE) data(drugRL) PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(drugRL),] PMID26490707sub <- PMID26490707sub[order(PMID26490707sub$Zscore, decreasing=TRUE),] PMID26490707sub <- rbind(head(PMID26490707sub, 100), tail(PMID26490707sub, 100)) # Subsets to top 200 DEGs testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])), pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"]))) drugcmap <- classifyprofile(data=testprofiles$ranklist, case="disease", signif.fdr=0.5, no.signif=20) drugcmap2 <- classifyprofile(data=testprofiles$ranklist, case="disease", pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic", dynamic.fdr=5, signif.fdr=5, adj="BH", no.signif=1000) write.table(drugcmap2, file="./results/drugcmap2.xls", quote=FALSE, sep="\t", col.names = NA) drugcmap2[[1]][1:20,]
The same query is performed against a reference set of disease expression signatures.
The results are sorted by the ES Distance and the top scoring 20 drugs are given below.
The full result table is written to a file named diseasecmap2.xls
.
PMID26490707 <- read.delim("./data/PMID26490707_S1.xls", comment="#", check.names=FALSE) data(diseaseRL) PMID26490707sub <- PMID26490707[PMID26490707[,"NEW-Gene-ID"] %in% rownames(diseaseRL),] testprofiles <- list(ranklist=matrix(PMID26490707sub$Zscore, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"])), pvalues=matrix(PMID26490707sub$P, dimnames=list(PMID26490707sub[,"NEW-Gene-ID"]))) diseasecmap <- classifyprofile(data=testprofiles$ranklist, case="drug", signif.fdr=0.5, no.signif=20) diseasecmap2 <- classifyprofile(data=testprofiles$ranklist, case="drug", pvalues=testprofiles$pvalues, cytoout=FALSE, type="dynamic", dynamic.fdr=5, adj="BH", no.signif=100) write.table(diseasecmap2, file="./results/diseasecmap2.xls", quote=FALSE, sep="\t", col.names = NA) diseasecmap2[[1]][1:20,]
In progress...
In progress...
This project is funded by NIH grant U24AG051129 awarded by the National Intitute on Aging (NIA).
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.