About GEO study GSE73555

GSE73555: Genes regulated by TAZ transcription co-factor in a lung fibroblast cell line; an RNAseq dataset (Ion Torrent). Here's a link to the paper. I've used this dataset in the past and know that the signal is relatively clean, so shall include it within metaDEGth. Also, it's a perfect 'spike-in' study, so should be a nice critical assessment excercise.

It is a pretty interesting experiment; the study authors used siRNA to knock down the TAZ gene. TAZ operates a transcription co-activatoon and the protein contains a PDZ binding linear motif (probably somewhere within its disordered regions http://d2p2.pro/view/sequence/up/Q9GZV5), which is likely PPI partner. This GEO dataset describes a knock-out experiment performed in a lung fibroblast cell line; two different siRNA's for TAZ and a negative control (NTC) siRNA.

With the plummeting costs of transcriptomic profiling by RNAseq, these kind of datasets are appearing in increasing abundance on GEO. Turning these data into useful insight occurs in two stages: - first, we must download and align short DNA reads against a model of the target genome (typically human) so as to estimate transcript abundance (literally how many copies of each gene transcript/splice-isoform are there) - second, we need to read in the transcript abundance per sample, aggregate to the gene-level, fit a regularised linear model of gene expression and then perform differential expression analysis based on the study design

Downloading reads and aligning against the human genome

From the GSE73555 project page, the study corresponds to the sequence read archive project of SRP064305. We shall use the SRAtoolkit to download the fastq data.

I've elected to work in the directory /mnt/c/GSE73555, downloading the reads into a sub-directory of FASTQ


SRP064305details <-  fread("http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP064305")

fwrite(SRP064305details, "/mnt/c/GSE73555/SRP064305details.tsv", sep = "\t", quote = FALSE)

fasterq-dump from the SRAtoolkit can take SSR identifiers and download the fastq files directly:

time cut -f 1 ../SRP064305details.tsv | grep 'SRR' | parallel -j 5 fasterq-dump {} -e 1 -t /tmp/scratch

pigz -p 10 ./*.fastq # Save some space!!

This took around 90 minutes. Pretty slow for 18Gb in my opinion!

In my workflows, I tend to use the pseudo-alignment tool Kallisto, which uses counts of sub-strings in common between reads and reference gene sequences. It is dramatically faster than using a full sequence aligner such as TopHat or BWA and in every study that I have ever processed, just as accurate. A full tutorial exists on the github page, but the commands that I used are detailed below. This was performed in a subdirectory 'alignedReads'

Create an index of the current Homo sapiens cDNA library (downloaded from ENSMB ftp://ftp.ensembl.org/pub/release-98/fasta/homo_sapiens/cdna/)

kallisto index -i Homo_sapiens.GRCh38.cDNA.idx /tmp/Homo_sapiens.GRCh38.cdna.all.fa.gz

Kallisto requires a fragment length and standard deviation estimate, but we do not have access to the bioanalyser result! As a result, we'll use a value in rought the 150-200 range and use (roughly) a poisson standard deviation.

time cut -f 1 ../SRP064305details.tsv | grep 'SRR' | parallel -j 1 kallisto quant --single  -l 180 -s 15 -t 10 -i ../Homo_sapiens.GRCh38.cDNA.idx -o ./{} ../FASTQ/{}.fastq.gz

Takes under 10 minutes.

Data import

We shall use the tximport package to read in the transcript estimate data and then the limma-voom pipeline to process it. Such a procedure is advised in a comprehensive benchmarking study from 2016.

First we will need a transcript-to-gene mapping. This can actually be extracted directly from the cDNA file (a catalogue of all of the coding regions in the genome), downloadable directly from enseml. If you just performed the step above, then the relavent file is in /tmp. On a Linux machine, one can simply use the first fread command and do a bit of text manipulation on the fly using command line tools/


#Suck out the ENST to ENSG (and gene Symbol) mapping from the GTF file, downloaded to /tmp/
ENST2ENSG <- fread('zcat /tmp/Homo_sapiens.GRCh38.cdna.all.fa.gz | grep "^>" | perl -pe "s/(>|chromosome:|gene:|GRCh38:|gene_biotype:|transcript_biotype:|gene_symbol:)//g" | perl -ne "s/\\s+/\\t/g; CORE::say $_" | cut -f1-7 ' , header = FALSE, fill = TRUE)

colnames(ENST2ENSG) <- c("ENST","havanaStatus","chromosome","ENSG","geneStatus","geneStatus","geneSymbol")

We are dealing with SRP064305, so we shall load the short-read metadata files directly from SRA and then download and extract the sample annotation from GEO before merging them together into a single table.


parsedSOFT <- getGEO(GEO = "GSE73555", GSEMatrix = FALSE, AnnotGPL = FALSE, getGPL = FALSE)

# This is a data cleanup step. "Hell is other people's data"
GSMdetails <- map_dfr(parsedSOFT@gsms,
        ~ {sampleAttr <- .x@header$characteristics_ch1
           DT <- structure(sampleAttr %>% str_extract("[^:]+$") %>% 
                             str_remove_all("\\s+") %>%
                    names = sampleAttr %>% str_extract("^[^:]+")) %>% 
            as.list %>%
            return(DT)}  )

setnames(GSMdetails, str_remove(colnames(GSMdetails),"\\s+"))

suppressWarnings(GSMdetails[,GSM := names(parsedSOFT@gsms)]) #A frivolous warning about nothing important might crop up, so with supress it

           Run := Run,
           on = .(GSM == SampleName)]


The end result is a table that maps together a SRR sample (the reads downloaded from SRA) with the experimental condition; which siRNA was used in this case.

Setting up the design matrix

The cornerstone of a linear model regression is the design matrix - aka a Model Matrix. It relates samples to experimental conditions in the experimental study; in this case, TAZ siRNA 1, TAZ siRNA 2 and control siRNAs.

Luckily, base R has a helpful method ?model.matrix to assist in the construction of such matrices.

designMat <- model.matrix(~ 0 + transfection, data = GSMdetails)
colnames(designMat) %<>% str_remove("transfection") # model.matrix keeps all of the condition information in the column headings. This is useful when there are many covariates, but in this case short titles are preffered.
row.names(designMat) <- GSMdetails[,GSM]


The design matrix is very simple in this case - a 1 when a sample is in a condition, 0 otherwise. One might anticipate a more complex design matrix when we are studying samples from patients with other features (aka covariates) that we wish to account for (gender, age etc.) in which case the design matrix wouldn't be just ones and zeros and there could be more than one entry per row. Such examples are beyond the scope of this tutorial.

Importing transcript data


projDir <- "/mnt/c/GSE73555/"

# Each kallisto alignment run produces a director (labelled by the SRR ID), with an hdf5 file
# We shall add a column with file locations to our experimental details table

SRP064305details[,path := str_c(projDir,"alignedReads/",Run,"/abundance.tsv")]

# Notice that we pass in a transcript to gene mapping (tx2gene) and that we normalise TPM. See tximport documentation for more details
kallistoTranscriptEstimates <- tximport(files = SRP064305details[, path],
                                        type = "kallisto",
                                        tx2gene = ENST2ENSG[,.(ENST,ENSG)],
                                        countsFromAbundance = "lengthScaledTPM",
                                        dropInfReps = TRUE)

# A function for computing the geometic mean of a series of observation
# See https://en.wikipedia.org/wiki/Geometric_mean
gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) }

geneAbundances <- data.table(ENSG = row.names(kallistoTranscriptEstimates$abundance),
                          gm_abundance = apply(kallistoTranscriptEstimates$abundance, 1, gm_mean) )

colnames(kallistoTranscriptEstimates$counts) <-  SRP064305details[, SampleName]

# Low quantified transcripts/genes can introduce strange artifacts. Filter these out with a very, very low cutoff. We only want to remove the lowest expressed genes
transcriptDGElist <- kallistoTranscriptEstimates$counts %>% 
  .[ geneAbundances[gm_abundance > 1.01, ENSG], ] %>%
transcriptDGElist %<>% calcNormFactors

Heteroskedascity - AKA 'how I learned to stop worrying and love the variance stabilising transform'

An important difference between micro-array and next-generation sequencing data is that the gene expression measurements have a property called 'over-dispersion', where the standard deviation of a dataset is greater than one might expect given typical statistical models. In short, there is a non-trivial mean-variance trend (i.e. the variance/standard deviation is not easily predicted from the mean alone, so we must introduce additional parameters to the model). This non-constant behaviour of variance in the data is a property called 'heteroskedascity' (in contrast with 'homoskedascity', where the variance relationship is constant).

Approaches in the field have gravitated around two methods; 1.) a very detailed generalised linear modeling framework with an additional parameter for the over-dispersion, 2.) fitting an empirical mean-variance trend and subtracting it from the data before running more standard linear modeling routines. The first of these options is the approach employed by packages such as edgeR and DESeq whereas the second lies behind limma-voom (which is written by the same author as edgeR) and has proven to be considerably faster and more reliable, robust and accurate in estimating differential expression. I use limma-voom and I believe that the field is moving towards a consensus on this topic.

The next step in our pipeline is to estimate the mean-variance trend (consult the voom help ?voom for more details), and then to fit a linear model based on the design matrix that we built above.

voomTrendModelFromCounts <- voom(transcriptDGElist, designMat, plot = TRUE)

GSE73555_fit <- lmFit(voomTrendModelFromCounts, designMat)

The limma function lmFit produce a linear model for each gene in each condition; i.e. it estimates expression for each gene across the conditions specified in the design matrix.

Contrast matricies and comparing estimated coefficients

A contrast matrix is a means of specifying what design matrix conditions we would like to compare against one-another. In GSE37555, the contrasts are simple: 1.) control siRNA vs TAZ siRNA 1 and 2.) control siRNA vs TAZ siRNA 2. Each contrast is specified column-wise in a labelled fashion: a -1 and 1 entry detailing the comparator and comparative - this is critical in determining the direction of fold change (the control condition should typically be assigned as -1). You can name the columns whatever you like, but do not use any special characters or spaces.

contrastMat <- matrix(0, 
                      ncol = 2,
                      nrow = ncol(designMat), 
                      dimnames = list(colnames(designMat), c("knockdownTAZ1","knockdownTAZ2")))

contrastMat["siNTC",] <- -1
contrastMat["siTAZ1", "knockdownTAZ1"] <- 1
contrastMat["siTAZ2", "knockdownTAZ2"] <- 1


In this workbook, we shall only process knockdownTAZ1 (feel free to process knockdownTAZ2 - the procedure will be identical).

The linear model fit using lmFit contains estimates for the expression of each gene in each condition. The contrast matrix specifies the way in which we would like to compare these conditions. These two objects can be combined using contrasts.fit and then cleverly adjusted for having a relatively small number of samples (6: 2 in each condition) but tens of thousands of genes to estimate bulk properties of gene expression across those samples; important statistical quantities are moderated towards a global average or trend, greatly increasing the power of downstream tests. This routine is known as 'shrinkage' and is well described in the literature.

limma has two techniques for estimating differential expression using shrinkage. The first is eBayes, which can be used before topTable to estimate differential expression log2 fold-change, (adjusted) P-values against a null-hypothesis of no change in expression (logFC = 0) and the log-odds (B). The second is treat before topTreat which, akin to eBayes, also computes log2 fold change but instead computes a P-value for a test of log fold change being outside an interval of ± delta, where delta is chosen to wean out small, uninteresting changes (I use log2(1.2) in the block below). Both of these pieces of information are useful, so we shall merge them together.

GSE73555_allContrast <- contrasts.fit(GSE73555_fit, contrasts = contrastMat)

# Produce a table of differential expression using eBayes and relabel the gene label column

# siTAZ1
siTAZ1diffex_DT <- eBayes(GSE73555_allContrast) %>% 
  topTable(coef = "knockdownTAZ1", adjust.method = "fdr", number = Inf) %>%
  data.table(keep.rownames = TRUE) %>% 
  .[, .(rn, siTAZ1_logFC = logFC, siTAZ1_avExpr = AveExpr, siTAZ1_qValue = adj.P.Val, siTAZ1_pValue = P.Value)]

setnames(siTAZ1diffex_DT, "rn", "ENSG")

# siTAZ2
siTAZ2diffex_DT <- eBayes(GSE73555_allContrast) %>% 
  topTable(coef = "knockdownTAZ2", adjust.method = "fdr", number = Inf) %>%
  data.table(keep.rownames = TRUE) %>% 
  .[, .(rn, siTAZ2_logFC = logFC, siTAZ2_avExpr = AveExpr, siTAZ2_qValue = adj.P.Val, siTAZ2_pValue = P.Value)]

setnames(siTAZ2diffex_DT, "rn", "ENSG")

siTAZdiffex_DT <- merge(siTAZ1diffex_DT, siTAZ2diffex_DT, by = "ENSG")

siTAZdiffex_DT[, qplot(siTAZ1_logFC, siTAZ2_logFC)]

siTAZdiffex_DT[ENST2ENSG[geneStatus == "protein_coding", .(ENSG,geneSymbol)], geneSymbol := geneSymbol, on = "ENSG"] #Add gene symbols

siTAZdiffex_DT[, ENSG := str_remove(ENSG, "\\.\\d+$")] # No need to keep hold of the ENSG versions


Volcano plot

A so-called volcano plot is a way of succinctly summarising effect size and significance of genes in a differential expression study. I use them as a means of quality control of an experiment - you can easily see if a fit has gone wrong, but it's hard to conclude much more from them.


siTAZdiffex_DT %>%
    ggplot(aes(x = siTAZ1_logFC, y = -log10(siTAZ1_pValue),
               colour = (siTAZ2_qValue < 0.05),
               shape = (siTAZ2_qValue < 0.05) ) ) +
    geom_point(size = 2) +
    theme_bw() +
    scale_color_discrete(name = "siTAZ2 qValue < 0.05") +
    scale_shape(guide = FALSE) +
    scale_x_continuous(breaks = seq(-10,10,1)) +
    scale_y_continuous(breaks = seq(0,20,0.5)) +
    labs(x = "log2(fold change) for siTAZ1", y = "-log10(P-value)",
         title = str_c("Differential expression for TAZ siRNA knockdown study"),
         subtitle = "Data drawn from GSE73555: TAZ knockdown in lung fibroblast cell line")

(optional) Mapping ENSG ids to other gene identifiers

One of the worst things about bioinformatics is the plethora of gene IDs, symbols and the like. When dealing with a gene locus (which is what RNAseq reads are mapped to), we use ensemble IDs (ENSGXXXX).

uniprotIDmappings <- fread("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/by_organism/HUMAN_9606_idmapping_selected.tab.gz", header = FALSE)

#We only really care about swissprot mappings
swissprot_accessions <- fread("curl ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz | zcat | grep '^>' | perl -ne 'm/sp\\|(\\w+)\\|/; CORE::say $1'", header = FALSE)

swissprotIDmappingssubset <-  uniprotIDmappings[V1 %in% swissprot_accessions$V1][V19 != "", c(1,2,3,19)]
colnames(swissprotIDmappingssubset) <- c("accession","name","geneID","ENSGgrouped")
swissprotIDmappingssubset %<>% .[,.(ENSG = unlist(tstrsplit(ENSGgrouped, "; "))), by = .(accession, name, geneID, ENSGgrouped)]

siTAZdiffex_DT %<>% merge(
  by = "ENSG", all.x = TRUE)

siTAZdiffex_HFL1_diffexDT <- siTAZdiffex_DT[,.(ENSG, geneSymbol, accession, name, geneID, siTAZ1_logFC, siTAZ1_pValue, siTAZ2_logFC, siTAZ2_pValue)] # Only save important fields

setorder(siTAZdiffex_HFL1_diffexDT, siTAZ1_pValue )

#Write the results out to a data file for inclusion in the package
save(siTAZdiffex_HFL1_diffexDT, file = "./data/siTAZdiffex_HFL1_diffexDT.RData", compress = "xz")

siTAZdiffex_HFL1_diffexDT[ p.adjust(siTAZ1_pValue,"fdr") < 1E-3][order(-siTAZ1_logFC)]

What's at the top of the list? Stanniocalcin, which is a glycoprotein that controls calcium and phosphate - need I remind you that this study was within fibrosis.

Pathway enrichment

siTAZdt <- siTAZdiffex_HFL1_diffexDT

siTAZ1_BetaUniformModel <- fitBetaUniformMixtureDistribution(siTAZdt$siTAZ1_pValue)
siTAZ2_BetaUniformModel <- fitBetaUniformMixtureDistribution(siTAZdt$siTAZ2_pValue)



siTAZdt[, betaUnifScore1_FDR0.05 := betaUniformScore(siTAZ1_pValue, siTAZ2_BetaUniformModel, FDR = 0.05)]
siTAZdt[, betaUnifScore2_FDR0.05 := betaUniformScore(siTAZ2_pValue, siTAZ2_BetaUniformModel, FDR = 0.05)]


hallmark_pathways <- gmtPathways("/mnt/c/broad_genesets/h.all.v7.0.entrez.gmt")
hallmark_pathways_DT <- map2_dfr(hallmark_pathways, names(hallmark_pathways), ~ data.table(name = .y, geneID = .x))

cannonical_pathways <- gmtPathways("/mnt/c/broad_genesets/c2.cp.v7.0.entrez.gmt")
cannonical_pathways_DT <- map2_dfr(cannonical_pathways, names(cannonical_pathways), ~ data.table(name = .y, geneID = .x))

chemical_genetic_peturbations <- gmtPathways("/mnt/c/broad_genesets/c2.cgp.v7.0.entrez.gmt")
CGP_geneSets_DT <- map2_dfr(chemical_genetic_peturbations, names(chemical_genetic_peturbations), ~ data.table(name = .y, geneID = .x))


cannonicalSets_siTAZ1_DT <- siTAZdt[ unique(cannonical_pathways_DT[, .(pathway = name, geneID = as.character(geneID))]), , on = "geneID", allow.cartesian=T][!is.na(siTAZ1_pValue),.(pValueSet = list(siTAZ1_pValue), geneSet = list(geneSymbol), scoreSum = sum(betaUnifScore1_FDR0.05)), by = pathway]

cannonicalSets_siTAZ2_DT <- siTAZdt[ unique(cannonical_pathways_DT[, .(pathway = name, geneID = as.character(geneID))]), , on = "geneID", allow.cartesian=T][!is.na(siTAZ2_pValue),.(pValueSet = list(siTAZ2_pValue), geneSet = list(geneSymbol), scoreSum = sum(betaUnifScore2_FDR0.05)), by = pathway]

merge(cannonicalSets_siTAZ1_DT[,.(pathway, scoreSum)], cannonicalSets_siTAZ2_DT[,.(pathway, scoreSum)], by = "pathway") %>%
  ggplot(aes(x=scoreSum.x, y=scoreSum.y)) +

cannonicalSets_siTAZ1_DT[, betaUniformMixtureP := betaUniformPvalueSumTest(pValueSet[[1]], siTAZ1_BetaUniformModel), by = pathway]
cannonicalSets_siTAZ1_DT[, fishersP := fishersPvalueSumTest(pValueSet[[1]]), by = pathway]

cannonicalSets_siTAZ1_DT[p.adjust(betaUniformMixtureP) < 0.01, .(pathway)]

cannonicalSets_siTAZ2_DT[, betaUniformMixtureP := betaUniformPvalueSumTest(pValueSet[[1]], siTAZ2_BetaUniformModel), by = pathway]
cannonicalSets_siTAZ2_DT[, fishersP := fishersPvalueSumTest(pValueSet[[1]]), by = pathway]

REACT_TAZ1 <- cannonicalSets_siTAZ1_DT[pathway %like% "REACT"]
REACT_TAZ1[, betaQ := p.adjust(betaUniformMixtureP, "fdr")]
REACT_TAZ2 <- cannonicalSets_siTAZ2_DT[pathway %like% "REACT"]
REACT_TAZ2[, betaQ := p.adjust(betaUniformMixtureP, "fdr")]

merge(REACT_TAZ1, REACT_TAZ2, by = "pathway") %>% 
  ggplot(aes(x = -log(betaUniformMixtureP.x), y = -log(betaUniformMixtureP.y))) +

merge(REACT_TAZ1, REACT_TAZ2, by = "pathway")[p.adjust(betaUniformMixtureP.x, "fdr") < 0.01 & p.adjust(betaUniformMixtureP.y , "fdr") < 0.01]

Studying the Reactome pathways, it looks like we nicely capture a whole load of integrin, collagen, TAZ and the like pathways.

adamsardar/metaDEGth documentation built on Jan. 28, 2020, 12:17 a.m.