knitr::opts_chunk$set(echo = TRUE)
Welcome to an introduction to the ksRepo
package. ksRepo
enables investigators to mix and match various types of case/control gene lists with any gene::drug interaction database to predict repositioning oportunities. This document will introduce you to the kinds of data you'll need to use ksRepo, and walk you through some techniques for obtaining it.
More information about ksRepo
is available in open-access form at BMC Bioinformatics.
In order to use ksRepo
, you'll need to first install and load it into your R environment. You can either do so by downloading/forking the GitHub Repository and installing it manually (for experienced users) or by using the Devtools
package:
# Install devtools if necessary install.packages('devtools') # Load the package require(devtools) # Download and install ksRepo from GitHub directly install_github('adam-sam-brown/ksRepo') library(ksRepo)
We will also install some additional packages to be used over the course of this introduction:
# Get Bioconductor to enable package installation install.packages('BiocInstaller') require('BiocInstaller') # Install packages biocLite(c('rmeta', 'GEOquery', 'Biobase') # Load into memory require(GEOquery) require(rmeta) require(Biobase)
ksRepo
requires as input a list of genes ordered by statistical significance. Statistical significance can be defined by, for example, differential expression between normal and disease samples, as in this introduction. For this analysis, we're going to use publically available data from the Gene Expression Omnibus. The analysis we'll be performing is based off of GEO's built-in GEO2R
tool.
The dataset we'll be using is GSE39645
, a series of microarray experiments on normal human nerve tissue, and tumor tissue from vestibular schwannoma (VS) patients. VS is a rare tumor syndrome characterized by schwann cell tumors in the inner ear, which lead to dizziness and sensorineural deficits.
To begin, we'll download the dataset directly from GEO using the GEOquery
package:
# Load series and platform data from GEO gset <- getGEO("GSE39645", GSEMatrix =TRUE, AnnotGPL=TRUE) gset <- gset[[1]] fvarLabels(gset) <- make.names(fvarLabels(gset))
Once we have the data, we can perform simple differential gene expression analysis on the probe-level using limma
:
# Set normal/tumor tissue identifiers gsms <- "0000000001111111111111111111111111111111" sml <- c() for (i in 1:nchar(gsms)) { sml[i] <- substr(gsms,i,i) } sml <- paste("G", sml, sep="") # set group names fl <- as.factor(sml) gset$description <- fl design <- model.matrix(~ description + 0, gset) colnames(design) <- levels(fl) # Perform simple limma analysis fit <- lmFit(gset, design) cont.matrix <- makeContrasts(G1-G0, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2, 0.01) # Get relevant information for further analysis tT <- topTable(fit2, adjust="fdr", sort.by="B", number=17587) tT <- subset(tT, select=c("ID","t","Gene.symbol")) tT$Gene.symbol <- as.character(tT$Gene.symbol)
Now we have a list of differentially expressed probes, but not genes, so we need to combine them together. To do this, we'll perform a fixed effects meta analysis to consolidate probes to genes. First, we'll define some useful functions
# Back out Cohen's d-statistics for meta-analysis dfromt <- function(t, n1, n2) { df <- n1 + n2 - 2 d <- (2*t) / sqrt(df) return(d) } sefromt <- function(t, n1, n2) { d <- dfromt(t, n1, n2) var <- ( (n1 + n2)/(n1*n2) + d^2/(2*(n1+n2-2)) ) * ((n1 + n2)/(n1 + n2 - 2)) se <- sqrt(var) return(se) } # Perform gene-level fixed-effects meta analysis meta <- function(gFr, nSamp) { # Parse Gene Names genel <- strsplit(gFr$Gene.symbol, split='///') names(genel) <- gFr$ID genel <- lapply(genel, unique) # Annotate with IDs gene <- setNames(unlist(genel, use.names=F),rep(names(genel), lengths(genel))) gene <- data.frame(gene=gene, probe=names(gene), stringsAsFactors = F) # Output Init omni <- data.frame(gene = unique(unlist(genel)), stringsAsFactors = F) omni$d <- rep(NA, nrow(omni)) omni$se <- rep(NA, nrow(omni)) omni$p <- rep(NA, nrow(omni)) # Output loop for (i in 1:nrow(omni)) { probes <- subset(gene, gene==omni$gene[i])$probe slice <- subset(gFr, ID %in% probes) metasum <- meta.summaries(slice$d,slice$se) # Fixed-effects meta-analysis of probe values omni$d[i] <-metasum$summary omni$se[i] <- metasum$se.summary omni$p[i] <- 2*pnorm(-abs(omni$d[i]*sqrt(nSamp))) # Two-sided p-value from d-statistic } omni <- omni[order(omni$p),] }
With these functions defined, we can easily convert our probe list into a gene list:
# Remove unmapped probes tT <- subset(tT, Gene.symbol != '') # Effect sizes and standard errors tT$d <- dfromt(tT$t, 9, 31) # 9 normal and 31 tumor samples tT$se <- sefromt(tT$t, 9, 31) # Convert probe- to gene-level differential expression dge <- meta(tT, 40) genelist <- dge$gene
We're now all set to perform our ksRepo
analysis.
With our sorted gene list, we can move on to ksRepo
analysis. We'll be using the built in Comparative Toxicogenomics Database (CTD) dataset, which contains a set of expert-curated gene::drug interactions. ksRepo
automatically selects a subset of the CTD that overlaps with our gene list, making running ksRepo as easy as a single line of code:
# Fetch the CTD data data(CTD) # Run in a single line (on most computers should take < 1 hour) results <- ksRepo(genelist, CTD) # Beyond the default options, you can change the number of resamples performed # As well as the type of resamples (resampling compound lists or your input)
Once you've run a ksRepo analysis, it's time to look at the results. ksRepo
outputs a data.frame
with columns for:
1. compound name, which exactly corresponds to an entry in drugbank
2. the number of genes interacting with a compound in the CTD (or your own database)
3. enrichment score (ks, for Kolmogorov-Smirnov enrichment)
4. a bootstrapped p-value (# of times bootstrapped ks exceeds actual ks)
5. an FDR q-value
A good rule of thumb for ksRepo
analysis is to look first at your FDR significant results and subset from there:
load('data.RDAta')
# Significant subset(results, boot.fdr < 0.05) # Significant and highly enriched subset(results, boot.fdr < 0.05 & ks > 0.2) # Significant, highly enriched, with known interactions with multiple target genes subset(results, boot.fdr < 0.05 & ks > 0.2 & n.genes > 1)
In this introduction, we're worked through an example of repositioning using ksRepo
. We first downloaded freely accessible gene expression data and pre-processed it for ksRepo
analysis. We then performed ksRepo
analysis in a single line of code and examined the results.
ksRepo
provides an easy and accessible way to perform drug repositioning with any data, and without a lot of the hassle of other methods.
If you're interested in learning more about ksRepo
or licensing our software, please email Adam Brown.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.