GSE73555: Genes regulated by TAZ transcription co-factor in a lung fibroblast cell line; an RNAseq dataset (Ion Torrent). Here's a link to the paper. I've used this dataset in the past and know that the signal is relatively clean, so shall include it within metaDEGth. Also, it's a perfect 'spike-in' study, so should be a nice critical assessment excercise.
It is a pretty interesting experiment; the study authors used siRNA to knock down the TAZ gene. TAZ operates a transcription co-activation and the protein contains a PDZ binding linear motif (probably somewhere within its disordered regions http://d2p2.pro/view/sequence/up/Q9GZV5), which is likely PPI partner. This GEO dataset describes a knock-out experiment performed in a lung fibroblast cell line; two different siRNA's for TAZ and a negative control (NTC) siRNA.
With the plummeting costs of transcriptomic profiling by RNAseq, these kind of datasets are appearing in increasing abundance on GEO. Turning these data into useful insight occurs in two stages: - first, we must download and align short DNA reads against a model of the target genome (typically human) so as to estimate transcript abundance (literally how many copies of each gene transcript/splice-isoform are there) - second, we need to read in the transcript abundance per sample, aggregate to the gene-level, fit a regularised linear model of gene expression and then perform differential expression analysis based on the study design
From the GSE73555 project page, the study corresponds to the sequence read archive project of SRP064305. We shall use the SRAtoolkit to download the fastq data.
I've elected to work in the directory /mnt/c/GSE73555, downloading the reads into a sub-directory of FASTQ
library(data.table) SRP064305details <- fread("http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP064305") fwrite(SRP064305details, "/mnt/c/GSE73555/SRP064305details.tsv", sep = "\t", quote = FALSE)
fasterq-dump from the SRAtoolkit can take SSR identifiers and download the fastq files directly:
time cut -f 1 ../SRP064305details.tsv | grep 'SRR' | parallel -j 5 fasterq-dump {} -e 1 -t /tmp/scratch pigz -p 10 ./*.fastq # Save some space!!
This took around 90 minutes. Pretty slow for 18Gb in my opinion!
In my workflows, I tend to use the pseudo-alignment tool Kallisto, which uses counts of sub-strings in common between reads and reference gene sequences. It is dramatically faster than using a full sequence aligner such as TopHat or BWA and in every study that I have ever processed, just as accurate. A full tutorial exists on the github page, but the commands that I used are detailed below. This was performed in a subdirectory 'alignedReads'
Create an index of the current Homo sapiens cDNA library (downloaded from ENSMB ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/cdna/)
kallisto index -i Homo_sapiens.GRCh38.cDNA.idx /tmp/Homo_sapiens.GRCh38.cdna.all.fa.gz
Kallisto requires a fragment length and standard deviation estimate, but we do not have access to the bioanalyser result! As a result, we'll use a value in rought the 150-200 range and use (roughly) a poisson standard deviation.
time cut -f 1 ../SRP064305details.tsv | grep 'SRR' | parallel -j 1 kallisto quant --single -l 180 -s 15 -t 10 -i ../Homo_sapiens.GRCh38.cDNA.idx -o ./{} ../FASTQ/{}.fastq.gz
Takes under 10 minutes.
We shall use the tximport
package to read in the transcript estimate data and then the limma-voom
pipeline to process it. Such a procedure is advised in a comprehensive benchmarking study from 2016.
First we will need a transcript-to-gene mapping. This can actually be extracted directly from the cDNA file (a catalogue of all of the coding regions in the genome), downloadable directly from enseml. If you just performed the step above, then the relavent file is in /tmp. On a Linux machine, one can simply use the first fread
command and do a bit of text manipulation on the fly using command line tools/
library(data.table) #Suck out the ENST to ENSG (and gene Symbol) mapping from the GTF file, downloaded to /tmp/ ENST2ENSG <- fread('zcat /tmp/Homo_sapiens.GRCh38.cdna.all.fa.gz | grep "^>" | perl -pe "s/(>|chromosome:|gene:|GRCh38:|gene_biotype:|transcript_biotype:|gene_symbol:)//g" | perl -ne "s/\\s+/\\t/g; CORE::say $_" | cut -f1-7 ' , header = FALSE, fill = TRUE) colnames(ENST2ENSG) <- c("ENST","havanaStatus","chromosome","ENSG","geneStatus","geneStatus","geneSymbol")
We are dealing with SRP064305, so we shall load the short-read metadata files directly from SRA and then download and extract the sample annotation from GEO before merging them together into a single table.
library(GEOquery) library(purrr) library(stringr) library(magrittr) parsedSOFT <- getGEO(GEO = "GSE73555", GSEMatrix = FALSE, AnnotGPL = FALSE, getGPL = FALSE) # This is a data cleanup step. "Hell is other people's data" GSMdetails <- map_dfr(parsedSOFT@gsms, ~ {sampleAttr <- .x@header$characteristics_ch1 DT <- structure(sampleAttr %>% str_extract("[^:]+$") %>% str_remove_all("\\s+") %>% str_remove_all("#"), names = sampleAttr %>% str_extract("^[^:]+")) %>% as.list %>% as.data.table return(DT)} ) setnames(GSMdetails, str_remove(colnames(GSMdetails),"\\s+")) suppressWarnings(GSMdetails[,GSM := names(parsedSOFT@gsms)]) #A frivolous warning about nothing important might crop up, so with supress it GSMdetails[SRP064305details, Run := Run, on = .(GSM == SampleName)] GSMdetails
The end result is a table that maps together a SRR sample (the reads downloaded from SRA) with the experimental condition; which siRNA was used in this case.
The cornerstone of a linear model regression is the design matrix - aka a Model Matrix. It relates samples to experimental conditions in the experimental study; in this case, TAZ siRNA 1, TAZ siRNA 2 and control siRNAs.
Luckily, base R has a helpful method ?model.matrix
to assist in the construction of such matrices.
designMat <- model.matrix(~ 0 + transfection, data = GSMdetails) colnames(designMat) %<>% str_remove("transfection") # model.matrix keeps all of the condition information in the column headings. This is useful when there are many covariates, but in this case short titles are preffered. row.names(designMat) <- GSMdetails[,GSM] designMat
The design matrix is very simple in this case - a 1 when a sample is in a condition, 0 otherwise. One might anticipate a more complex design matrix when we are studying samples from patients with other features (aka covariates) that we wish to account for (gender, age etc.) in which case the design matrix wouldn't be just ones and zeros and there could be more than one entry per row. Such examples are beyond the scope of this tutorial.
library(limma) library(edgeR) library(tximport) projDir <- "/mnt/c/GSE73555/" # Each kallisto alignment run produces a director (labelled by the SRR ID), with an hdf5 file # We shall add a column with file locations to our experimental details table SRP064305details[,path := str_c(projDir,"alignedReads/",Run,"/abundance.tsv")] # Notice that we pass in a transcript to gene mapping (tx2gene) and that we normalise TPM. See tximport documentation for more details kallistoTranscriptEstimates <- tximport(files = SRP064305details[, path], type = "kallisto", tx2gene = ENST2ENSG[,.(ENST,ENSG)], countsFromAbundance = "lengthScaledTPM", dropInfReps = TRUE) # A function for computing the geometic mean of a series of observation # See https://en.wikipedia.org/wiki/Geometric_mean gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geneAbundances <- data.table(ENSG = row.names(kallistoTranscriptEstimates$abundance), gm_abundance = apply(kallistoTranscriptEstimates$abundance, 1, gm_mean) ) colnames(kallistoTranscriptEstimates$counts) <- SRP064305details[, SampleName] # Low quantified transcripts/genes can introduce strange artifacts. Filter these out with a very, very low cutoff. We only want to remove the lowest expressed genes transcriptDGElist <- kallistoTranscriptEstimates$counts %>% .[ geneAbundances[gm_abundance > 1.01, ENSG], ] %>% DGEList transcriptDGElist %<>% calcNormFactors
An important difference between micro-array and next-generation sequencing data is that the gene expression measurements have a property called 'over-dispersion', where the standard deviation of a dataset is greater than one might expect given typical statistical models. In short, there is a non-trivial mean-variance trend (i.e. the variance/standard deviation is not easily predicted from the mean alone, so we must introduce additional parameters to the model). This non-constant behaviour of variance in the data is a property called 'heteroskedascity' (in contrast with 'homoskedascity', where the variance relationship is constant).
Approaches in the field have gravitated around two methods; 1.) a very detailed generalised linear modeling framework with an additional parameter for the over-dispersion, 2.) fitting an empirical mean-variance trend and subtracting it from the data before running more standard linear modeling routines. The first of these options is the approach employed by packages such as edgeR and DESeq whereas the second lies behind limma-voom (which is written by the same author as edgeR) and has proven to be considerably faster and more reliable, robust and accurate in estimating differential expression. I use limma-voom and I believe that the field is moving towards a consensus on this topic.
The next step in our pipeline is to estimate the mean-variance trend (consult the voom help ?voom
for more details), and then to fit a linear model based on the design matrix that we built above.
voomTrendModelFromCounts <- voom(transcriptDGElist, designMat, plot = TRUE) GSE73555_fit <- lmFit(voomTrendModelFromCounts, designMat)
The limma function lmFit
produce a linear model for each gene in each condition; i.e. it estimates expression for each gene across the conditions specified in the design matrix.
A contrast matrix is a means of specifying what design matrix conditions we would like to compare against one-another. In GSE37555, the contrasts are simple: 1.) control siRNA vs TAZ siRNA 1 and 2.) control siRNA vs TAZ siRNA 2. Each contrast is specified column-wise in a labelled fashion: a -1 and 1 entry detailing the comparator and comparative - this is critical in determining the direction of fold change (the control condition should typically be assigned as -1). You can name the columns whatever you like, but do not use any special characters or spaces.
contrastMat <- matrix(0, ncol = 2, nrow = ncol(designMat), dimnames = list(colnames(designMat), c("knockdownTAZ1","knockdownTAZ2"))) contrastMat["siNTC",] <- -1 contrastMat["siTAZ1", "knockdownTAZ1"] <- 1 contrastMat["siTAZ2", "knockdownTAZ2"] <- 1 contrastMat
In this workbook, we shall only process knockdownTAZ1 (feel free to process knockdownTAZ2 - the procedure will be identical).
The linear model fit using lmFit
contains estimates for the expression of each gene in each condition. The contrast matrix specifies the way in which we would like to compare these conditions. These two objects can be combined using contrasts.fit and then cleverly adjusted for having a relatively small number of samples (6: 2 in each condition) but tens of thousands of genes to estimate bulk properties of gene expression across those samples; important statistical quantities are moderated towards a global average or trend, greatly increasing the power of downstream tests. This routine is known as 'shrinkage' and is well described in the literature.
limma has two techniques for estimating differential expression using shrinkage. The first is eBayes
, which can be used before topTable
to estimate differential expression log2 fold-change, (adjusted) P-values against a null-hypothesis of no change in expression (logFC = 0) and the log-odds (B). The second is treat
before topTreat
which, akin to eBayes
, also computes log2 fold change but instead computes a P-value for a test of log fold change being outside an interval of ± delta, where delta is chosen to wean out small, uninteresting changes (I use log2(1.2) in the block below). Both of these pieces of information are useful, so we shall merge them together.
GSE73555_allContrast <- contrasts.fit(GSE73555_fit, contrasts = contrastMat) # Produce a table of differential expression using eBayes and relabel the gene label column # siTAZ1 siTAZ1diffex_DT <- eBayes(GSE73555_allContrast) %>% topTable(coef = "knockdownTAZ1", adjust.method = "fdr", number = Inf) %>% data.table(keep.rownames = TRUE) %>% .[, .(rn, siTAZ1_logFC = logFC, siTAZ1_avExpr = AveExpr, siTAZ1_qValue = adj.P.Val, siTAZ1_pValue = P.Value)] setnames(siTAZ1diffex_DT, "rn", "ENSG") # siTAZ2 siTAZ2diffex_DT <- eBayes(GSE73555_allContrast) %>% topTable(coef = "knockdownTAZ2", adjust.method = "fdr", number = Inf) %>% data.table(keep.rownames = TRUE) %>% .[, .(rn, siTAZ2_logFC = logFC, siTAZ2_avExpr = AveExpr, siTAZ2_qValue = adj.P.Val, siTAZ2_pValue = P.Value)] setnames(siTAZ2diffex_DT, "rn", "ENSG") siTAZdiffex_DT <- merge(siTAZ1diffex_DT, siTAZ2diffex_DT, by = "ENSG") siTAZdiffex_DT[, qplot(siTAZ1_logFC, siTAZ2_logFC)] siTAZdiffex_DT[ENST2ENSG[geneStatus == "protein_coding", .(ENSG,geneSymbol)], geneSymbol := geneSymbol, on = "ENSG"] #Add gene symbols siTAZdiffex_DT[, ENSG := str_remove(ENSG, "\\.\\d+$")] # No need to keep hold of the ENSG versions siTAZdiffex_DT
A so-called volcano plot is a way of succinctly summarising effect size and significance of genes in a differential expression study. I use them as a means of quality control of an experiment - you can easily see if a fit has gone wrong, but it's hard to conclude much more from them.
library(ggplot2) siTAZdiffex_DT %>% ggplot(aes(x = siTAZ1_logFC, y = -log10(siTAZ1_pValue), colour = (siTAZ2_qValue < 0.05), shape = (siTAZ2_qValue < 0.05) ) ) + geom_point(size = 2) + theme_bw() + scale_color_discrete(name = "siTAZ2 qValue < 0.05") + scale_shape(guide = FALSE) + scale_x_continuous(breaks = seq(-10,10,1)) + scale_y_continuous(breaks = seq(0,20,0.5)) + labs(x = "log2(fold change) for siTAZ1", y = "-log10(P-value)", title = str_c("Differential expression for TAZ siRNA knockdown study"), subtitle = "Data drawn from GSE73555: TAZ knockdown in lung fibroblast cell line")
One of the worst things about bioinformatics is the plethora of gene IDs, symbols and the like. When dealing with a gene locus (which is what RNAseq reads are mapped to), we use ensemble IDs (ENSGXXXX).
uniprotIDmappings <- fread("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz", header = FALSE) #We only really care about swissprot mappings swissprot_accessions <- fread("curl ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz | zcat | grep '^>' | perl -ne 'm/sp\\|(\\w+)\\|/; CORE::say $1'", header = FALSE) swissprotIDmappingssubset <- uniprotIDmappings[V1 %in% swissprot_accessions$V1][V19 != "", c(1,2,3,19)] colnames(swissprotIDmappingssubset) <- c("accession","name","geneID","ENSGgrouped") swissprotIDmappingssubset %<>% .[,.(ENSG = unlist(tstrsplit(ENSGgrouped, "; "))), by = .(accession, name, geneID, ENSGgrouped)] siTAZdiffex_DT %<>% merge( swissprotIDmappingssubset[,.(accession,name,geneID,ENSG)], by = "ENSG", all.x = TRUE) siTAZdiffex_HFL1_diffexDT <- siTAZdiffex_DT[,.(ENSG, geneSymbol, accession, name, geneID, siTAZ1_logFC, siTAZ1_pValue, siTAZ2_logFC, siTAZ2_pValue)] # Only save important fields setorder(siTAZdiffex_HFL1_diffexDT, siTAZ1_pValue ) #Write the results out to a data file for inclusion in the package save(siTAZdiffex_HFL1_diffexDT, file = "./data/siTAZdiffex_HFL1_diffexDT.RData", compress = "xz") siTAZdiffex_HFL1_diffexDT[ p.adjust(siTAZ1_pValue,"fdr") < 1E-3][order(-siTAZ1_logFC)]
What's at the top of the list? Stanniocalcin, which is a glycoprotein that controls calcium and phosphate - need I remind you that this study was within fibrosis.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.