library("knitr")
opts_chunk$set(message = FALSE, error = TRUE, 
               warning = FALSE, cache=TRUE, comment = "")

Getting started

This document gives an overview of the R package rasp, which provides statistical procedures to study differential splicing among conditions in the context of annotated genomes. That is, isoform information is summarized using the available annotation (at exon or transcript level) using any of the existing tools. This information can provided in RPKM as for instance FluxCapacitor does, or using read counts on exons. The statistical methods used in rasp are described in @GonCalSam12. The method is based on comparing the variability of the relative expression among conditions using the methodology described in @Anderson06 that can be seen as an ANOVA for the case of having multivariate data (multiple isoforms or exons per gene). The current vesion of the package also allows the user to compare splicing patterns for more than two conditions and adjust for other covariates using non-parametric multivariate linear models whose p-values are computed using an asymptotic result described in @garrido2020multivariate.

The package can other dependencies can be installed by

devtools::install_github("dgarrimar/mlm")
devtools::install_github("isglobal-brge/rasp")

Then, we should start the R session by loading the library as usual:

library(rasp)

We will start loading into the workspace the data corresponding to the initial table of transcript expression estimates provided by @GonCalSam12. The RNA-seq data correspond to lymphoblastoid cell lines derived from 69 unrelated Nigerian individuals produced by @PicMarPai10. More than half billion 35-bp-long reads were sequenced. Reads were mapped to the human genome version hg19 using GEM mapper @GEM. After that, mapped reads were used to obtain transcipt expression estimates using GENCODE version 3c produced in the framework of the ENCODE project as the reference genome annotation. Finally, Flux Capacitor was used to produce estimates of transcript abundances measured as RPKMs. Data was filtered to have only information about protein-coding genes expressed in all individuals, having at least two annotated splice forms and isoforms with an expresion level of at least 1 RPKM in at least one individual. These filters lead to a total of 1641 genes and 4668 associated transcripts @GonCalSam12. The object containing this information can be loaded by typing:

data(YRI)
dim(YRI)

From an excerpt of YRI, we can see the first two columns containing the gene and transcript Ensembl IDs, respectively, and then the expression values for all samples in the rest of the columns.

YRI[1:6, 1:6]

The object also contains information about gender of each sample that will be used as a grouping factor to illustrate comparison among conditions:

genderYRI <- attr(YRI, "gender")
head(genderYRI)
table(genderYRI)

We are aware that this is not the proper way of encapsulating omic data since, for instance, Bioconductor has better data infrastructures such as RangedSummarizedExperiment that are specifically designed to deal with this type of data. Nontheless, we allow the function pass the data in that format to facilitate the task for those users who are not familiar with Bioconductor or that want to analyze a subset of data obtained from public repositories. [Section Data in RangedSummarizedExperiment format] illustrate an example having this type of data.

Data Visualization

For each gene $n$ alternative splice forms (exons or transcripts) are computed from count data Therfore, for a given gene, and for each individual $j$, the relative expression of each exon, $i=1, \ldots, n$ is computed as: % \begin{equation} f_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n}x_{ij}} \end{equation} % where $x_{ij}$ denotes the transcript expression estimates (either as RPKMs or counts). Therefore, we consider as a outcome the vector of splicing proportions for the $j$ individual, $f_{j}=(f_{1j}, \ldots,f_{nj})$. This multivariate nature of the data is the reason why @GonCalSam12 proposed to use the methodology described in @Anderson06 to compare groups.

This type of data is also known as compositional data and can be represented in the simplex using the function plotTernary() as illustrate Figure \ref{fig:plotTernary} that can be obtained by typing

plotTernary(YRI, "ENSG00000160741", transcripts=1:3)

We can also compare the splicing ratios for a group of individuals using the group argument. For instance, next figure illustrat the splicing of males and females for the gene JRK (ENSG00000234616) and we can observe a very different pattern having females higher splicing ratios of exon ENST00000413868.

plotTernary(YRI, "ENSG00000234616", group = genderYRI, transcripts=1:3)

Notice that the argument transcript indicates the three transcripts to be represented in the Ternary plot. An obvious limitation of the Ternary Plot is that it can oly plot data for three different transcripts. In order to plot the splicing ratios for any number of transcripts one can use the function plotAllIso(). The result is a plot showing the splicing ratios for all selected transcripts and individuals separated by gener.

plotAllIso("ENSG00000005448", data = YRI, group = genderYRI,
                      inds = 1:10)

Data analysis

The splicing ratio for a single gene can be analyzed using the rasp() function. For instance, let us investigate whether the splicing ratios of the gene ENSG00000160741 containing 3 different transcripts are different between males and females (Figure \ref{fig:plotTernary}).

We first need to select the counts corresponding to the gene of interest

counts.gene <- YRI[YRI$gene_id=="ENSG00000160741", ]

And then the p-value is computed by

pval <- rasp(x=counts.gene, group=genderYRI, 
            geneidCol = 1, expressionCols = 3:ncol(counts.gene))
pval

Notice that three arguments must be specified. First, which vector is having the groupin variable (group). Then, which column is having the gene ids (gendidCol) and then the columns having the counts or expression (expressionCols) in the x agument.

The result shows a highly sifnificant p-value indicanting that there are statistically significant differences in the splicing patterns among males and females. The ternary plot in Figure \ref{fig:plotTernary} also shows some differences across sexes. One can observe that females tend to have more RPKMs in the transcript ENST0000046.

More than one gene can also be analyzed using rasp() function. It is performed by using multiple processors when available. Let us analyze, for instance, the first 5 genes can be studied by

ngenes <- 5
sel.genes <-  names(table(YRI$gene_id))[1:ngenes]
YRI.subset <- subset(YRI, gene_id%in%sel.genes)
res <- rasp(x=YRI.subset, group=genderYRI, mc.cores = 3, 
            geneidCol = 1, expressionCols = 3:ncol(YRI))

Notice again that having data in that simple format requires several arguments to be passed through the function as previously illustrated. The argument mc.cores allows the user to pre-define the number of cores used. If missing, it considers parallel::detectCores() - 1. We can see the p-values of each gene by printing the results

res

Analysis with RangedSummarizedExperiment objects

As previously mentioned the RangedSummarizedExperiment is designed to properely manage exon/transcript count data. In order to illustrate how to perform differential splicing analyses using rasp we have added into the package an object of this class which contains exon counts measured in adipose tissue from GTEx samples. The object has been donwloaded from recount and, for illustrating purposes, we have selected the information corresponding to the chromosome 16 and have added a covariable to the metadata called inv16p11.2 which contains the inversion genotypes obtained as described in @ruiz2019scoreinvhap.

data(adipose)
adipose.chr16

The total number of genes are

length(unique(rownames(adipose.chr16)))

One of the advantadges of using RangedSummarizedObjects is that we can easily subset a genomic region and then perfrom downstream analyses. Let us assume that we are interested in studing whether the inversion genotypes are associated with the alternative splicing in the inversion region (chr16:28,42Mb-28,79Mb). The region of interest can be easily subset by

library(GenomicRanges)
library(IRanges)
roi <- GRanges("chr16:25e6-30e6")
adipose <- subsetByOverlaps(adipose.chr16, roi)
adipose

Now the number of genes contained in the object are:

length(unique(rownames(adipose)))

The analysis is the performed by

pvals <- rasp( formula = ~ inv16p11.2, x = adipose)

These are the pvalues

head(pvals)

Notice that here the grouping variable is indicated in the formula. This allows to create adjusted analysis by other covariates that should be available in the metadata (e.g. colData())

The analyses can return p.values as NA. It is cause due to the filtering process that is applied to both individuals and exons (see filterInd and filterExon arguments). Analysis cannot be performed for those genes having no samples or two or more exons.

The top genes can be obtained by

topGenes(pvals)

We observe that the gene ENSG00000197165 (SULT1A2) is hihgly differentially spliced. This figure shows the splicing ratios of each exon given the inversion genotypes

plotGene(adipose, gene="ENSG00000197165.10", condition="inv16p11.2", 
         title="Splicing of SULT1A2 gene and inversion at 16p11.2")

We can observe, for instance, as the splicing ratios of exons 6 and 10 are different among inversion genotypes. In particular, N allele increases the splicing ratio of exon 6 and decreases for exon 10. In the next figure we can observe this pattern as well as how the exon 1 does not present changes among inversion genotypes. Interestingly this inversion and SULT1A2 gene expression have been associated with obesity @gonzalez2019polymorphic, and we see that there is a differntial splicing ratio of inversion genotypes in adipose tissue which is, obiously, a key tissue for obesity.

plotSplicingRatios("ENSG00000197165.10", data = adipose, 
              condition = "inv16p11.2", ind=1:200, exons=c(1,3,10))

References

Session info

sessionInfo()


isglobal-brge/rasp documentation built on July 12, 2020, 1:17 a.m.