In jmonlong/sQTLseekeR: Splicing QTL discovery tool

Splicing events classification

Introduction

sQTLseeker finds association between a SNP and the relative expression a gene's transcripts, however we don't know exactly what splicing mechanism is involved. One way to have an idea of the mechanism or the change that the variant could cause, is to compare the structure of the transcripts whose relative expression change. For example, if a transcript with a particular exon becomes less used while the same transcript but lacking this exon becomes more used, the main mechanism could be exon skipping.

In practice, we compare the structure of the two transcripts whose relative usage changes the most. These two transcripts are specified in columns tr.first and tr.second of the output of sqtl.seeker/sqtls functions.

To compare the structure of pairs of transcripts, we can either use AStalavista or the build-in function in the package (classify.events). The following code shows how to use this function, which was highly inspired from AStalavista.

Building the transcript structure

The data.frame with the transcript structure must have these columns:

transId: the transcript ID.
strand: the DNA strand.
cdsStart: the start positions of the CDS regions, separated by ,.
cdsEnd: same with the end positions.
urtStarts and utrEnds: same with UTR regions.

Depending on the annotation you are using, you might need to build this data.frame manually. In a later example, we'll see how to construct it from a Gencode annotation.

For now, let's create a simple one to see how the function works.

tr.str = data.frame(transId=c("t1","t2","t3"),
                    strand="+",
                    cdsStarts=c("10,40,100","10,20,100","10,40,100"),
                    cdsEnds=c("15,55,130","15,30,130","15,55,130"),
                    utrStarts=c("5,130","5,130","5,130"),
                    utrEnds=c("10,135","10,135","10,150"))
tr.str

Comparing pairs of transcripts

We will compare the structure of transcripts pairs t1 vs t2, and t1 vs t3. We load the package and run:

library(sQTLseekeR)
tr.df = data.frame(tr.first=c("t1","t1"), tr.second=c("t2","t3"))
classify.events(tr.df, tr.str)

The result is a list with the event code and name for each pair, as well as a data.frame with the global count of each event in the data.

It found Mutually exclusive exons between transcript t1 and t2. Transcripts t1 and t3 have the same CDS but differ in their last UTR, hence the event found being Tandem 3' UTR.

Classifying transcripts from Genecode annotation.

In practice the input is a larger data.frame with information about each sQTL, including the relevant columns tr.first and tr.second. These two columns contains the IDs of the two transcripts whose relative expression change the most.

In the next example, we start by constructing the transcript structure for some genes from Gencode annotation V12, available here.

First we download the file and read it.

download.file("ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_12/gencode.v12.annotation.gtf.gz","../Data/gencode.v12.annotation.gtf.gz")
gtf = read.table("../Data/gencode.v12.annotation.gtf.gz", as.is=TRUE, sep="\t")
colnames(gtf) = c("chr","source","type","start","end","score","strand","score2","info")

Then for each transcript with CDS or UTRs, we concatenate the start/end positions.

library(dplyr)
concat <- function(x) paste(x, collapse=",")
trans.str = gtf %>% filter(type %in% c("CDS","UTR")) %>%
    mutate(transId=gsub(".*transcript_id ([^;]+);.*","\\1",info),
           geneId=gsub(".*gene_id ([^;]+);.*","\\1",info)) %>%
        group_by(transId, geneId, strand) %>%
            summarize(cdsStarts=concat(start[type=="CDS"]),
                      cdsEnds=concat(end[type=="CDS"]),
                      utrStarts=concat(start[type=="UTR"]),
                      utrEnds=concat(end[type=="UTR"]))

For transcripts without CDS, we use the exon annotation instead. This is optional and not necessary if you analyze only protein-coding genes.

exon.str = gtf %>% filter(type=="exon") %>%
    mutate(transId=gsub(".*transcript_id ([^;]+);.*","\\1",info),
           geneId=gsub(".*gene_id ([^;]+);.*","\\1",info)) %>%
        filter(!(transId %in% trans.str$transId)) %>%
            group_by(transId, geneId, strand) %>%
                summarize(cdsStarts=concat(start),cdsEnds=concat(end),
                          utrStarts=NA, utrEnds=NA)
trans.str = rbind(trans.str, exon.str)

Now we read the sQTLs from one population, CEU. After erasing the event classification, we recompute them using classify.events.

sqtl.df = read.table("../Data/sQTLs-FDR10-CEU.tsv.gz", as.is=TRUE, header=TRUE, sep="\t", quote="")
sqtl.df$classCode = sqtl.df$classEvent = NULL
ev.sqtls = classify.events(sqtl.df, trans.str)

The annotated results is visible in

head(ev.sqtls$res)

There is also a summary data.frame. Here count (prop) represents the number (proportion) of time an event is observed relative to the other events. However prop.sqtl represents the proportion of sQTLs (or input pairs) that contain each event. These numbers are different because one sQTL (or transcript pair) can involve several events.

This summary data.frame is useful to plot the global distribution of the events.

head(ev.sqtls$stats)
library(ggplot2)
ggplot(ev.sqtls$stats, aes(x=event, y=prop.sqtl)) + geom_bar(stat="identity") + coord_flip() + ylab("proportion of sQTLs")

Null distribution

It's good to know the general distribution of the splicing events affected by sQTLs but we might wonder what type of events happen in general. To investigate this null distribution we can pick two random transcripts in each the same genes that are affected by sQTLs and see what type of events differentiate them.

## Function that picks two random transcripts for an input gene
geneToTrans = tapply(trans.str$transId, trans.str$geneId, identity)
pickRandomTrans <- function(gene) head(c(sample(geneToTrans[[gene]]),NA, NA),2)
## Build a data.frame with the random transcript pairs
cont.pairs = sapply(sqtl.df$geneId, pickRandomTrans)
cont.pairs = as.data.frame(t(cont.pairs))
colnames(cont.pairs) = c("tr.first","tr.second")

Once these control transcript pairs are ready, we run the classifier.

ev.cont = classify.events(cont.pairs, trans.str)

Now, we can merge this results with the sQTLs' ones and compare the distributions.

stats.df = rbind(data.frame(set="sQTLs", ev.sqtls$stats),
    data.frame(set="Control", ev.cont$stats))
ggplot(stats.df, aes(x=event, y=prop.sqtl, fill=set)) + geom_bar(stat="identity", position="dodge") + coord_flip() + ylab("proportion of sQTLs")