We illustrate the use of the tcor
package to compute a thresholded
gene expression correlation matrix of gene expression data from the
Cancer Genome Atlas (TCGA).
TCGA (http://cancergenome.nih.gov/) is a joint effort of the National Cancer Institute and the National Human Genome Research Institute. TCGA provides curated data and analyses, including large-scale genome sequencing, for many cancer tumor types.
The example proceeds in two parts:
tcor
The data for this example are obtained from the Broad Institute's GDAC Firehose http://firebrowse.org/?cohort=BRCA&download_dialog=true. The GDAC provides a convenient way to download versioned and standardized TCGA data organized as sample by measurement tables in tab delimited text form.
Some of the following steps use Unix-like pipeline processing with shell
utilities and R's pipe
function. Windows users may need to install the Rtools
suite (https://cran.r-project.org/bin/windows/Rtools/).
We select breast invasive carcinoma gene expression data, one of the larger
available datasets. The GDAC dashboard is available at
http://gdac.broadinstitute.org/. Data may be browsed and manually downloaded
directly from the dashboard, or downloaded and uncompressed using the
download.file
and untar
lines in the script below.
url = "http://gdac.broadinstitute.org/runs/stddata__2015_11_01/data/BRCA/20151101/gdac.broadinstitute.org_BRCA.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2015110100.0.0.tar.gz" destfile = "gdac.broadinstitute.org_BRCA.Merge_rnaseq__illuminahiseq_rnaseq__unc_edu__Level_3__gene_expression__data.Level_3.2015110100.0.0.tar.gz" download.file(url, destfile) # (about 300 MB) untar(destfile) # (about 600 MB) # data file name: fn = dir(gsub("\\.tar\\.gz", "", destfile), pattern="*.data.txt", full.names=TRUE)
The data file is a tab-delimited text file with two header lines followed by 20,532 data lines. Each data line specifies a gene followed by sample measurements for that gene in the columns. The measurements include raw counts, normalized, and reads per kilobase of transcript per million mapped reads (RPKM) values. We'll use the RPKM-normalized values in this example.
The header lines look like:
# Hybridiz... ID1 ID1 ID1 ID2 ID2 ID2 ... # gene raw_counts median_length_normalized RPKM raw_counts median_length_normalized raw_counts ...
where, IDn
indicates the nth TCGA barcode ID.
We could simply read these data into R using read.table
, for example with:
# Simple but not particularly efficient way to read the data... # (Don't run this, continue reading instead...) id = unlist(read.table(fn, sep="\t", stringsAsFactors=FALSE, header=FALSE, nrows=1)) brca = read.table(fn, sep="\t", stringsAsFactors=FALSE, header=FALSE, skip=2) # ... now filter out just the gene and RPKM columns...
But since we're only interested in the RPKM-normalized values that approach reads too much from the file. It can be more efficient to skip the columns we're not interested in and read in just the gene and RPKM columns.
Instead we can use some the simple but effective shell tool cut
and the idea
of pipelines to process the data file to remove all but the gene and RPKM
columns on the fly as we read it into R. There are two distinct advantages to
this approach: we read in only what we're interested in (cutting processing
time and memory consumption), and we employ pipeline-style parallel processing
to further speed things up, running the column-skipping cut
process in
parallel with the data parsing R read.table
function.
The pipelined processing as described in the last paragraph can use at most two CPU cores to process the data (in practice on average somewhat less due to I/O and other overhead). Lots of even cheap PCs today have more than two cores, and often quite fast storage systems (for example, solid state disk drives). We can wring even more performance out of such systems by combining the pipeline parallelism with explicit parallel processing using R's myriad available parallel processing functions.
The example code below uses Steve Weston's elegant foreach
framework for
explicit parallel processing to read the data file in chunks. Chunks are
processed concurrently using the pipelined parallelism described above: on
a four-CPU computer this yields four R and four cat
worker processes
plus the controlling R process.
Any foreach
parallel back end can be used for this task. We use the
Unix-specific doMC
backend below but Windows users can equivalently use the
doParallel
backend. The work can even be distributed across more than one
computer with doSNOW
, doMPI
, or doRedis
backends (without changing the
code). The example below runs in under 30 seconds on my inexpensive quad-core
Athlon home PC.
There are 2,635 columns (878 unique samples) of tab-separated data. If we're interested only in RPKM values, then we want columns 1, 4, 7, 10, ..., 2635. One efficient way to get just the columns of interest uses an external shell pipeline.
h = 2 # total header lines in the data file N = 20534 # total number of lines in the data file # The argument to cut -d is a TAB symbol inside of single quotes. # You can generate that by typing CTRL+V followed by TAB. For some # reason it often does not copy right (coming over as spaces instead), # so beware here... command = sprintf("cat %s | cut -d ' ' -f %s", fn, paste(c(1,seq(from=4, to=2635, by=3)), collapse=",")) # Read the first header line of sample IDs: f = pipe(sprintf("%s | head -n 1", command), open="r") id = unlist(read.table(f, sep="\t", stringsAsFactors=FALSE, header=FALSE, nrows=1)) id[1] = "gene" close(f) # Read the rest of the file in parallel. library(doMC) cores = 4 registerDoMC(cores) block = floor((N - h)/cores) brca = foreach(j=1:cores, .combine=rbind) %dopar% { skip = block * (j - 1) + h + 1 nrows = ifelse(j == cores, -1, block) f = pipe(sprintf("%s | tail -n +%.0f", command, skip), open="r") on.exit(close(f)) read.table(f, sep="\t", stringsAsFactors=FALSE, header=FALSE, nrows=nrows) } # Finally, label the variables we've just read in using the TCGA sample IDs. names(brca) = id
Once finished, we have a data frame named brca
with 20,532 rows and 879
columns. The first column contains gene names, the rest contain sample
RPKM values. The data frame column names include the TCGA barcode sample
IDs.
See https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode for help understanding the TCGA barcode, a sequence of dash separated identifiers. In particular, the fourth identifier (sample/vial) indicates if the sample comes from normal tissue, solid tumor, or elsewhere, as described in https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTable=sample%20type. We can identify columns associated with tumor, metastatic, and normal tissue samples by:
tumor = grep("^....-..-....-01.-...-....-..", names(brca)) normal = grep("^....-..-....-11.-...-....-..", names(brca)) metastatic = grep("^....-..-....-06.-...-....-..", names(brca))
The next step of our example computes thresholded gene correlation matrices and works with data in matrix form, not data frames. The final step in this section assembles two matrices corresponding to tumor and normal samples:
brca_tumor = t(as.matrix(brca[, tumor])) brca_normal = t(as.matrix(brca[, normal])) colnames(brca_tumor) = brca$gene # gene names for reference colnames(brca_normal) = brca$gene # gene names for reference print(dim(brca_tumor)) print(dim(brca_normal))
[1] 775 20532 [1] 100 20532
The tcor package (https://github.com/bwlewis/tcor, and companion preprint paper http://arxiv.org/abs/1512.07246) provides an implementation of the a new algorithm for fast and efficient thresholded correlation. You can install the development version of the R package directly from GitHub with
devtools::install_github("bwlewis/tcor") library(tcor)
Because we're interested in correlation among the columns (gene expression), we need to filter out constant-valued columns (including, for example, columns of all zeros):
brca_tumor_filtered = brca_tumor[, apply(brca_tumor, 2, sd) > 0] brca_normal_filtered = brca_normal[, apply(brca_normal, 2, sd) > 0]
Let's find all pairs of gene expression vectors among the filtered tumor data with correlation values at least 0.99:
tumor_cor = tcor(brca_tumor_filtered, t=0.99) str(tumor_cor)
List of 6 $ indices : num [1:529, 1:3] 16749 8316 4320 4319 4320 ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:3] "i" "j" "val" $ n : num 195369 $ longest_run: num 5467 $ t : num 0.99 $ svd_time : num 5.48 $ total_time : num 34.3
The tcor
function found 529 such correlated gene expression vectors (out of a
total 20522^2 = 421,152,484 possible gene pairs) in about 35 seconds on my
quad-core home PC. We can translate the listed matrix column indices into more
easily readable gene names with, for example:
tumor = data.frame(i=colnames(brca_tumor_filtered)[tumor_cor$indices[,1]], j=colnames(brca_tumor_filtered)[tumor_cor$indices[,2]], val=tumor_cor$indices[,3]) head(tumor)
i j val 1 SNORD115-2|100033437 KRTAP20-3|337985 1.0000000 2 INS-IGF2|723961 IGF2|3481 0.9999759 3 CSN1S2A|286828 CSN2|1447 0.9999411 4 GC|2638 CSN1S2A|286828 0.9998493 5 GC|2638 CSN2|1447 0.9997900 6 NRAP|4892 CSRP3|8048 0.9997411
We can verify the result by explicitly computing a full correlation matrix, but this takes a lot longer and uses much more memory (270 seconds and 6 GB on my PC):
# Uncomment the following lines if you want to run the full correlation # for comparison... # C = cor(brca_tumor_filtered) # sum(C[upper.tri(C)] >= 0.99) # (You will get 529, same as computed above with tcor.)
We can similarly identify the 20,331 pairs of correlated gene expression vectors for the normal samples (in about 17 seconds on my PC):
normal_cor = tcor(brca_normal_filtered, t=0.99) str(normal_cor) normal = data.frame(i=colnames(brca_normal_filtered)[normal_cor$indices[,1]], j=colnames(brca_normal_filtered)[normal_cor$indices[,2]], val=normal_cor$indices[,3]) head(normal)
i j val 1 OR1S1|219959 LELP1|149018 1 2 LELP1|149018 OR9K2|441639 1 3 OR5M3|219482 SNORD115-10|100033447 1 4 SNORD115-10|100033447 OR4C46|119749 1 5 OR5T2|219464 OR8D4|338662 1 6 OR8D4|338662 OR2M5|127059 1
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.