YARN: Robust Multi-Tissue RNA-Seq Preprocessing and Normalization

YARN - Yet Another RNa-seq package

The goal of yarn is to expedite large RNA-seq analyses using a combination of previously developed tools. Yarn is meant to make it easier for the user to perform accurate comparison of conditions by leveraging many Bioconductor tools and various statistical and normalization techniques while accounting for the large heterogeneity and sparsity found in very large RNA-seq experiments.

Installation

You can install yarn from github with:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("yarn")

Quick Introduction

If you're here to grab the GTEx version 6.0 data then look no further than this gist that uses yarn to download all the data and preprocess it for you.

Preprocessing

Below are a few of the functions we can use to preprocess a large RNA-seq experiment. We follow a particular procedure where we:

  1. Filter poor quality samples
  2. Merge samples of similar conditions for increased power
  3. Filter genes while preserving tissue or group specificity
  4. Normalize while accounting for global differences in tissue distribution

We will make use of the skin dataset for examples. The skin dataset is a small sample of the full GTEx data that can be downloaded using the downloadGTEx function. The skin dataset looks like this:

library(yarn)
data(skin)
skin

This is a basic workflow. Details will be fleshed out:

  1. First always remember to have the library loaded.
library(yarn)
  1. Download the GTEx gene count data as an ExpressionSet object or load the sample skin dataset.

For computational reasons we load the sample skin data instead of having the user download the

library(yarn)
data(skin)
  1. Check mis-annotation of gender or other phenotypes using group-specific genes
checkMisAnnotation(skin,"GENDER",controlGenes="Y",legendPosition="topleft")
  1. Decide what sub-groups should be merged
checkTissuesToMerge(skin,"SMTS","SMTSD")
  1. Filter lowly expressed genes
skin_filtered = filterLowGenes(skin,"SMTSD")
dim(skin)
dim(skin_filtered)

Or group specific genes

tmp = filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name")
# Keep only the sex names
tmp = filterGenes(skin,labels=c("X","Y","MT"),featureName = "chromosome_name",keepOnly=TRUE)
  1. Normalize in a tissue or group-aware manner
plotDensity(skin_filtered,"SMTSD",main=expression('log'[2]*' raw expression'))
skin_filtered = normalizeTissueAware(skin_filtered,"SMTSD")
plotDensity(skin_filtered,"SMTSD",normalized=TRUE,main="Normalized")

Helper functions

Other than checkMisAnnotation and checkTissuesToMerge we provide a few plotting function. We include, plotCMDS, plotDensity, plotHeatmap.

plotCMDS - PCoA / Classical Multi-Dimensional Scaling of the most variable genes.

data(skin)
res = plotCMDS(skin,pch=21,bg=factor(pData(skin)$SMTSD))

plotDensity - Density plots colored by phenotype of choosing. Allows for inspection of global trend differences.

filtData = filterLowGenes(skin,"SMTSD")
plotDensity(filtData,groups="SMTSD",legendPos="topleft")

plotHeatmap - Heatmap of the most variable genes.

library(RColorBrewer)
tissues = pData(skin)$SMTSD
heatmapColColors=brewer.pal(12,"Set3")[as.integer(factor(tissues))]
heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50)
plotHeatmap(skin,normalized=FALSE,log=TRUE,trace="none",n=10,
 col = heatmapCols,ColSideColors = heatmapColColors,cexRow = 0.25,cexCol = 0.25)

Information

sessionInfo()


Try the yarn package in your browser

Any scripts or data that you put into this service are public.

yarn documentation built on Nov. 8, 2020, 7:50 p.m.