In BPijuanSala/proSeq: Tools to preprocess RNA-seq data

Introduction

This package provides a toolkit for pre-processing single-cell RNA-seq data. It allows you to run cell quality control, normalisation and you will also be able to find highly variable genes within your dataset.

Let's load the library then!

library(proSeq)

Classes, methods and functions

RNAseq class

You have to interpret this as a container for RNAseq data. This class has slots for storing your raw and normalised counts matrices, feature counts, metadata, information on spikes, size factors, output of QC, highly variable genes, dimensionality reduction plots (i.e. pca, tsne, diffusion maps) as well as clustering results.

I will first separate the features from the actual gene counts.

countsfeat = countsMatrix[grep("^__",rownames(countsMatrix)),]
rownames(countsfeat) = c("no.feature","ambiguous","lowQuality","unaligned","align.not.unique")
countsRaw = countsMatrix[grep("^__",rownames(countsMatrix),invert = TRUE),]

Next, I note down which genes are spike-ins (in my case, ERCCs)

spikes = substr(rownames(countsRaw),1,4)=="ERCC"
names(spikes)=rownames(countsRaw)

Now, we create the RNAseq object and add the information we have

meta = meta[rownames(meta)%in%colnames(countsRaw),]
countsRaw <- countsRaw[,colnames(countsRaw)%in%rownames(meta)]
meta <- meta[colnames(countsRaw),]
countsfeat1 <- countsfeat[,colnames(countsRaw)]

data <- new("RNAseq", countsRaw=as.matrix(countsRaw), metadata=as.matrix(meta),SpikeIn=spikes,CountsFeat=as.matrix(countsfeat1))

Cell quality control

One of the first things we need to do is to determine whether a cell is of good quality or not. For this, we can use "runCellQC".

We can set the parameters: * minMapreads - Minimum number of reads mapping to nuclear genes. For Smart-seq2, a threshold of 200,000 reads is ideal. However, sometimes if we sequence shallow, this is not possible. To see the cells discarded with this, check the resulting nNuclearGenes plot; this plot is log-scaled.

maxMapmit - Maximum percentage mapped mitochondrial reads. We want this to be low. Default: 0.1. You can check the threshold in the fMapped2Mit plot.
maxMapSpike - Maximum percentage of mapped spikeins. You would discard cells with a high percentage of spikeins. If no gene is spike-in, this condition will not be taken into account. The threshold will be plotted in the fMapped2ERCCs plot.
minGenesExpr - Minimum number of genes expressed in one cell at least. Default: 4000. Check nDetectedGenes plot.
checkCountsFeat - Logical (TRUE/FALSE). Specifies whether the QC should take into account the features of the mapping. Default = TRUE (recommended).

For our example, we will use: maxMapmit=0.12,minGenesExpr = 4000,minMapreads = 100000

The returning data is a TRUE/FALSE vector, where TRUE means that the cell has failed QC.

cellQC_data <- runCellQC(data,checkCountsFeat = TRUE,plotting="no",maxMapmit=0.12,minGenesExpr = 4000,minMapreads = 100000)

cellsFailQC(data)<-c(cellQC_data)

Normalisation

For normalisation, we have two different approaches.

A. Size-factor normalisation

To apply normalisation, we will use the 'normalise' method. you can also see that it outputs a plot. The plot is a sanity check, as you expect to see size factors correlated with library sizes.

NormOutput1 <- normalise(data)

B. Quantile normalisation

This function normalises based on ranks. To apply it, we need the raw counts matrix.

NormOutput2 <- quantileNorm(countsRaw(data))

We can now add our normalised counts to our object. For this example, I will add the size-factor normalised counts.

countsNorm(data) <- NormOutput1$countsNorm

CellsSizeFac(data) <- list(sizeFactorsGenes = NormOutput1$sizeFactorsGenes, 
                             sizeFactorsSpikeIn = NormOutput1$sizeFactorsSpikeIn)

Identification of highly variable genes

Brennecke et al., 2013

To find highly variable genes, we can use the method described in Brennecke et al., 2013. For this, we can use the method findHVG or the function findHVGMatrix

In both cases we need to specify what genes to use for it. For this, we discard genes based on their CV2 and mean expression. With this in mind, then you can specify the parameters for minimum mean (minQuantMeans), maximum mean (maxQuantMeans), minimum CV2 (minQuantCv2), maximum CV2 (maxQuantCv2)

findHVG method This can be applied directly to the object.

hvg <- findHVG(object=data,plotting="no",UseSpike = TRUE, minQuantMeans = 0.1,maxQuantMeans = 0.9,maxQuantCv2 = 0.9,minQuantCv2 = 0.1)
length(hvg)
genesHVG(data) = hvg

findHVGMatrix function This can be applied to a matrix without size factors. Here, the method will automatically calculate size factors, as in the "normalise" function. If there are no spike-ins (as I exemplify in this case below), the method will take the gene counts to fit the model.

You may see that the fit is not as good as with spike-ins. In this case, I have to filter out more values with high CV2, as they bias the fit.

hvg <- findHVGMatrix(as.matrix(countsNorm(data)),UseSpike = F,plotting="no", minQuantMeans = 0,maxQuantMeans = 1,maxQuantCv2 = 0.4,minQuantCv2 = 0)
length(hvg)
genesHVG(data) = hvg

Distance-to-median (Kolodziejczyk et al. Cell Stem Cell, 2015)

An alternative approach to calculate highly variable genes is to compute a distance-to-median. You can apply this using the findHVG.DM function on a matrix.