This package provides a toolkit for pre-processing single-cell RNA-seq data. It allows you to run cell quality control, normalisation and you will also be able to find highly variable genes within your dataset.
Let's load the library then!
library(proSeq)
You have to interpret this as a container for RNAseq data. This class has slots for storing your raw and normalised counts matrices, feature counts, metadata, information on spikes, size factors, output of QC, highly variable genes, dimensionality reduction plots (i.e. pca, tsne, diffusion maps) as well as clustering results.
I will first separate the features from the actual gene counts.
countsfeat = countsMatrix[grep("^__",rownames(countsMatrix)),] rownames(countsfeat) = c("no.feature","ambiguous","lowQuality","unaligned","align.not.unique") countsRaw = countsMatrix[grep("^__",rownames(countsMatrix),invert = TRUE),]
Next, I note down which genes are spike-ins (in my case, ERCCs)
spikes = substr(rownames(countsRaw),1,4)=="ERCC" names(spikes)=rownames(countsRaw)
Now, we create the RNAseq object and add the information we have
meta = meta[rownames(meta)%in%colnames(countsRaw),] countsRaw <- countsRaw[,colnames(countsRaw)%in%rownames(meta)] meta <- meta[colnames(countsRaw),] countsfeat1 <- countsfeat[,colnames(countsRaw)] data <- new("RNAseq", countsRaw=as.matrix(countsRaw), metadata=as.matrix(meta),SpikeIn=spikes,CountsFeat=as.matrix(countsfeat1))
One of the first things we need to do is to determine whether a cell is of good quality or not. For this, we can use "runCellQC".
We can set the parameters: * minMapreads - Minimum number of reads mapping to nuclear genes. For Smart-seq2, a threshold of 200,000 reads is ideal. However, sometimes if we sequence shallow, this is not possible. To see the cells discarded with this, check the resulting nNuclearGenes plot; this plot is log-scaled.
maxMapmit - Maximum percentage mapped mitochondrial reads. We want this to be low. Default: 0.1. You can check the threshold in the fMapped2Mit plot.
maxMapSpike - Maximum percentage of mapped spikeins. You would discard cells with a high percentage of spikeins. If no gene is spike-in, this condition will not be taken into account. The threshold will be plotted in the fMapped2ERCCs plot.
minGenesExpr - Minimum number of genes expressed in one cell at least. Default: 4000. Check nDetectedGenes plot.
checkCountsFeat - Logical (TRUE/FALSE). Specifies whether the QC should take into account the features of the mapping. Default = TRUE (recommended).
For our example, we will use: maxMapmit=0.12,minGenesExpr = 4000,minMapreads = 100000
The returning data is a TRUE/FALSE vector, where TRUE means that the cell has failed QC.
cellQC_data <- runCellQC(data,checkCountsFeat = TRUE,plotting="no",maxMapmit=0.12,minGenesExpr = 4000,minMapreads = 100000) cellsFailQC(data)<-c(cellQC_data)
For normalisation, we have two different approaches.
To apply normalisation, we will use the 'normalise' method. you can also see that it outputs a plot. The plot is a sanity check, as you expect to see size factors correlated with library sizes.
NormOutput1 <- normalise(data)
This function normalises based on ranks. To apply it, we need the raw counts matrix.
NormOutput2 <- quantileNorm(countsRaw(data))
We can now add our normalised counts to our object. For this example, I will add the size-factor normalised counts.
countsNorm(data) <- NormOutput1$countsNorm CellsSizeFac(data) <- list(sizeFactorsGenes = NormOutput1$sizeFactorsGenes, sizeFactorsSpikeIn = NormOutput1$sizeFactorsSpikeIn)
To find highly variable genes, we can use the method described in Brennecke et al., 2013. For this, we can use the method findHVG or the function findHVGMatrix
In both cases we need to specify what genes to use for it. For this, we discard genes based on their CV2 and mean expression. With this in mind, then you can specify the parameters for minimum mean (minQuantMeans), maximum mean (maxQuantMeans), minimum CV2 (minQuantCv2), maximum CV2 (maxQuantCv2)
hvg <- findHVG(object=data,plotting="no",UseSpike = TRUE, minQuantMeans = 0.1,maxQuantMeans = 0.9,maxQuantCv2 = 0.9,minQuantCv2 = 0.1) length(hvg) genesHVG(data) = hvg
You may see that the fit is not as good as with spike-ins. In this case, I have to filter out more values with high CV2, as they bias the fit.
hvg <- findHVGMatrix(as.matrix(countsNorm(data)),UseSpike = F,plotting="no", minQuantMeans = 0,maxQuantMeans = 1,maxQuantCv2 = 0.4,minQuantCv2 = 0) length(hvg) genesHVG(data) = hvg
An alternative approach to calculate highly variable genes is to compute a distance-to-median. You can apply this using the findHVG.DM function on a matrix.
hvg <- findHVG.DM(countsNorm(data))
I hope this was useful. If you have any issues, please report them to bps.queries@gmail.com
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.