library(RNAseqFunctions) library(tidyverse)
The vignette is designed to show a typical pipeline for filtering counts data. Example data mirroring output from HTSeq is included and called "testingCounts" and this will be used for demonstration purposes.
testingCounts[, 1:5]
The column called HGN, indicating the gene names is included in HTSeq counts
file but we typically want gene names as rownames. The function
moveGenesToRownames
accomplishes this and returns the modified data.frame.
counts <- moveGenesToRownames(testingCounts) counts[, 1:5]
HTSeq adds the suffix ".htseq" to column names when it reports counts. This function removes that suffix from the column names of the supplied counts data.frame.
counts <- removeHTSEQsuffix(counts) counts[, 1:5]
ERCC reads, when present in the experiment, are included in HTSeq output as
genes although, we typically want these to be in a seperate data structure. The
detectERCCreads
function searches the rownames for gene names (the rownames
of the input data.frame) matching the specified regular expression and returns
a logical indicating where they are found. The default regular expression for
matching is "^ERCC\-[0-9]*$". The returned logical vector can be subsequently
used to subset the ERCC reads from the counts matrix and vice versa. Typically,
we expect 92 ERCC reads to be identified and, if they are not, a warning is
issued.
ercc <- detectERCCreads(counts) countsERCC <- counts[ercc, ] counts <- counts[!ercc, ] counts[, 1:5] countsERCC[, 1:5]
HTSeq typically reports the following information in the last rows of the counts
data: "no_feature", "ambiguous", "too_low_aQual", "not_aligned",
"__alignment_not_unique". Although this is useful information it is not usually
desired to retain this in downstream processing. The detectNonGenes
function
finds these rows and returns a logical vector allowing them to be removed.
counts <- counts[!detectNonGenes(counts), ] counts[, 1:5]
In gene expression counts data if can often be the case that some genes are
not detected. This can simply be due to the fact that the gene is not
expressed in the tissue or, in addition, that the sequencing depth was not
sufficient to detect the gene. In addition, some genes may be detected but in
so few samples or at such a low level that it makes the quantified value
highly unreliable. In these cases, it is desireable to remove the gene before
downstream analysis which is facilitated by the detectLowQualityGenes
function. A message is issued indicating the number of low quality genes
detected and the percentage of low quality genes for the dataset.
counts <- counts[detectLowQualityGenes(counts, 18), ] counts[, 1:5]
It is often the case that some samples from sequencing experiments are of
low quality, in some cases due to issues during the sample preperation stage.
Due to the fact that these samples represent a high level of technical noise,
it is often desirable to remove these before downstream analysis which is
facilitated by the detectLowQualityCells
function. The function achieves this
using two methods.First, the mincount argument allows detection of samples whose
sum across all genes is > mincount. Second, we utilize a house keeping gene and
assume its log2 expression to be normally distributed when considering all
samples. We then detect samples where the probability of the expression for the
house keeping gene in that sample is greater than the quantile.cut argument.
The function returns a logical vector indicating cells that "pass" both of the
tests that can subsequently be used to subset them from the counts matrix. A
message is issued indicating the number of low quality cells detected and the
percentage of low quality cells in the dataset.
lqc <- detectLowQualityCells( counts, geneName = "ACTB", mincount = 30, quantileCut = 0.01 ) counts <- counts[, lqc] countsERCC <- countsERCC[, lqc] counts[, 1:5] countsERCC[, 1:5]
Due to the fact that, when origionally loaded, the counts data includes a
character column (HGN) they are typically stored in a data.frame. Despite this,
downstream processing is usually easier when the data are in a matrix
data structure. The convertCountsToMatrix
coreces the counts data to a matrix.
counts <- convertCountsToMatrix(counts) countsERCC <- convertCountsToMatrix(countsERCC) class(counts) class(countsERCC)
In summary the whole workflow can be reduced to the pipeline shown below. Note that order of operations is important.
counts <- testingCounts %>% moveGenesToRownames() %>% removeHTSEQsuffix() ercc <- detectERCCreads(counts, warn = FALSE) filtered.counts <- counts[!ercc, ] %>% subset(!detectNonGenes(.)) %>% subset(detectLowQualityGenes(., 18)) %>% select(which( detectLowQualityCells( ., geneName = "ACTB", mincount = 30, quantileCut = 0.01 ) )) %>% convertCountsToMatrix(.) countsERCC <- convertCountsToMatrix(counts[ercc, colnames(filtered.counts)])
In addition to the functions already mentioned, there is a subset of functions devoted to helping with metadata. Since the metadata tends to be quite specific depending on the project, it is difficult to write generalized functions to help with metadata. Despite this, some things are assumed to be shared between all projects. The functions below leverage a sample naming convention to annotate various metadata. In the future, these functions may be updated to be more useful in projects using a different sample naming convention. A example of a metadata file is included as "testingMeta" and includes sample names with the typical format. The naming convention for samples is as follows:
Sample names should follow: (s|m)\\.platename\\.platePosition. platePosition is in the form row and column without a space where row is a LETTER (A-H) and column is a number (1-12).
testingMeta
Uses the standard sample naming nomenclature to add plate row to metadata. Returns the metadata tibble with an additional column indicating plate name.
meta <- annotatePlate(testingMeta) meta
Uses the standard sample naming nomenclature to add plate row to metadata. Returns the metadata tibble with an additional column indicating row as a numeric value.
meta <- annotateRow(meta) meta
Uses the standard sample naming nomenclature to add plate column to metadata. Returns the metadata tibble with an additional column indicating column as a numeric value.
meta <- annotateColumn(meta) meta
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.