In LuyiTian/scPipe: Pipeline for single cell multi-omic data pre-processing

getVal <- function(x, default = NULL) {
  ifelse(is.null(x), default, x)
}

library(scales)
library(readr)
library(ggplot2)
library(plotly)
library(DT)
library(scater)
library(scran)
library(scPipe)
library(Rtsne)

Parameters for each preprocessing step

Parameters for `sc_trim_barcode`

File paths

input fastq1: r params$fq1
input fastq2: r params$fq2
output fastq: r params$fqout

Read structure

assume read1 contains the transcript

barcode in read1: r params$bc1_info
barcode in read2: r params$bc2_info
UMI in read2: r params$umi_info

Read filter

remove reads that have N in its barcode or UMI: r params$rm_n
remove reads with low quality: r params$rm_low r if (params$rm_low){paste("\t* minimum read quality:", params$min_q, "\n", "\t* maximum number of base below minimum read quality:", params$num_bq, "\n")}

Parameters for alignment

input fastq: r params$fqout
output bam file: r params$bam_align
genome index: r params$g_index

Parameters for `sc_exon_mapping`

input bam file: r params$bam_align
output bam file: r params$bam_map
transctiptome annotations: r params$anno_gff
do strand specific mapping: r params$stnd
fix chromosome names: r params$fix_chr

Parameters for `sc_demultiplex`

input bam file: r params$bam_map
output folder: r params$outdir
barcode annotation file: r params$bc_anno
maximum mismatch allowed in barcode: r params$max_mis

Parameters for `sc_gene_counting`

output folder: r params$outdir
barcode annotation file: r params$bc_anno
UMI correction: r params$UMI_cor
gene filtering: r params$gene_fl

Data summary

The organism is "r getVal(params$organism, "unknown")", and gene id type is "r getVal(params$gene_id_type, "unknown")".

Overall barcode statistics

if (is.null(params$organism) || is.na(params$organism)) {
  sce = create_sce_by_dir(params$outdir)
} else {
  sce = create_sce_by_dir(params$outdir, organism=params$organism, gene_id_type=params$gene_id_type)
}
overall_stat = demultiplex_info(sce)
datatable(overall_stat, width=800)

Plot barcode match statistics in pie chart:

plot_demultiplex(sce)

Read alignment statistics

ggplotly(plot_mapping(sce, dataname=params$samplename, percentage = FALSE))

ggplotly(plot_mapping(sce, dataname=params$samplename, percentage = TRUE))

Summary and distributions of QC metrics

if (any(colSums(counts(sce)) == 0)) {
  zero_cells = sum(colSums(counts(sce)) == 0)
  sce = sce[, colSums(counts(sce)) > 0]
} else {
  zero_cells = 0
}

r if (zero_cells > 0){paste(zero_cells, "cells have zero read counts, remove them.")}

Datatable of all QC metrics:

sce = calculate_QC_metrics(sce)
if(!all(colSums(as.data.frame(QC_metrics(sce)))>0)){
  QC_metrics(sce) = QC_metrics(sce)[, colSums(as.data.frame(QC_metrics(sce)))>0]
}
datatable(as.data.frame(QC_metrics(sce)), width=800, options=list(scrollX= TRUE))

Summary of all QC metrics:

datatable(do.call(cbind, lapply(QC_metrics(sce), summary)), width=800, options=list(scrollX= TRUE))

Number of reads mapped to exon before UMI deduplication VS number of genes detected:

ggplotly(ggplot(as.data.frame(QC_metrics(sce)), aes(x=mapped_to_exon, y=number_of_genes))+geom_point(alpha=0.8))

Quality control

Detect outlier cells

A robustified Mahalanobis Distance is calculated for each cell then outliers are detected based on the distance. However, due to the complex nature of single cell transcriptomes and protocol used, such a method can only be used to assist the quality control process. Visual inspection of the quality control metrics is still required. By default we use comp = 2 and the algorithm will try to separate the quality control metrics into two gaussian clusters.

The number of outliers:

sce_qc = detect_outlier(sce, type="low", comp = 2)
table(QC_metrics(sce_qc)$outliers)

Pairwise plot for QC metrics, colored by outliers:

plot_QC_pairs(sce_qc)

plot highest expression genes

Remove low quality cells and plot highest expression genes.

sce_qc = remove_outliers(sce_qc)
sce_qc = convert_geneid(sce_qc, returns="external_gene_name")

sce_qc <- calculate_QC_metrics(sce_qc)
plotHighestExprs(sce_qc, n=20)

remove low abundant genes

Plot the average count for each genes:

ave.counts <- rowMeans(counts(sce_qc))
hist(log10(ave.counts), breaks=100, main="", col="grey80",
     xlab=expression(Log[10]~"average count"))

As a loose filter we keep genes that are expressed in at least two cells and for cells that express that gene, the average count larger than two. However this is not the gold standard and the filter may variy depending on the data.

keep1 = (apply(counts(sce_qc), 1, function(x) mean(x[x>0])) > 1.1)  # average count larger than 1.1
keep2 = (rowSums(counts(sce_qc)>0) > 5)  # expressed in at least 5 cells

sce_qc = sce_qc[(keep1 & keep2), ]
dim(sce_qc)

We got r nrow(sce_qc) genes left after removing low abundant genes.

Data normalization

Normalization by `scran` and `scater`

Compute the normalization size factor

ncells = ncol(sce_qc)
if (ncells > 200) {
  sce_qc <- computeSumFactors(sce_qc)
} else {
  sce_qc <- computeSumFactors(sce_qc, sizes=as.integer(c(ncells/7, ncells/6, ncells/5, ncells/4, ncells/3)))
}
summary(sizeFactors(sce_qc))

r if (min(sizeFactors(sce_qc)) <= 0){paste("We have negative size factors in the data. They indicate low quality cells and we have removed them. To avoid negative size factors, the best solution is to increase the stringency of the filtering.")}

if (min(sizeFactors(sce_qc)) <= 0) {
  sce_qc = sce_qc[, sizeFactors(sce_qc)>0]
}

PCA plot using gene expressions as input, colored by the number of genes.

cpm(sce_qc) = calculateCPM(sce_qc, size.factors=NULL)
sce_qc <- runPCA(sce_qc, exprs_values = "cpm")
plotPCA(sce_qc, colour_by="number_of_genes")

Normalize the data using size factor and get high variable genes

The highly variable genes are chosen based on trendVar from scran with FDR > 0.05 and biological variation larger than 0.5. If the number of highly variable genes is smaller than 100 we will select the top 100 genes by biological variation. If the number is larger than 500 we will only keep top 500 genes by biological variation.

sce_qc <- logNormCounts(sce_qc)

var.out <- modelGeneVar(sce_qc)

if (length(which(var.out$FDR <= 0.05 & var.out$bio >= 0.5)) < 500){
  hvg.out <- var.out[order(var.out$bio, decreasing=TRUE)[1:500], ]
}else if(length(which(var.out$FDR <= 0.05 & var.out$bio >= 0.5)) > 1000){
  hvg.out <- var.out[order(var.out$bio, decreasing=TRUE)[1:1000], ]
}else{
  hvg.out <- var.out[which(var.out$FDR <= 0.05 & var.out$bio >= 0.5), ]
  hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE), ]
}

plot(var.out$mean, var.out$total, pch=16, cex=0.6, xlab="Mean log-expression",
     ylab="Variance of log-expression")
o <- order(var.out$mean)
lines(var.out$mean[o], var.out$tech[o], col="dodgerblue", lwd=2)
points(var.out$mean[rownames(var.out) %in% rownames(hvg.out)], var.out$total[rownames(var.out) %in% rownames(hvg.out)], col="red", pch=16)

Heatmap of top100 high variable genes

gene_exp = exprs(sce_qc)

gene_exp = gene_exp[rownames(hvg.out)[1:100], ]

hc.rows <- hclust(dist(gene_exp))
hc.cols <- hclust(dist(t(gene_exp)))

gene_exp = gene_exp[hc.rows$order, hc.cols$order]

m = list(
  l = 100,
  r = 40,
  b = 10,
  t = 10,
  pad = 0
) 

plot_ly(
    x = colnames(gene_exp), y = rownames(gene_exp),
    z = gene_exp, type = "heatmap")%>% 
layout(autosize = F, margin = m)

Dimensionality reduction using high variable genes

Dimensionality reduction by PCA

sce_qc <- runPCA(sce_qc, exprs_values = "logcounts")
plotPCA(sce_qc, colour_by="number_of_genes")

Dimensionality reduction by t-SNE

set.seed(100)
if (any(duplicated(t(logcounts(sce_qc)[rownames(hvg.out), ])))) {
  sce_qc = sce_qc[, !duplicated(t(logcounts(sce_qc)[rownames(hvg.out), ]))]
}
sce_qc <- runTSNE(sce_qc, exprs_values="logcounts", perplexity=10,feature_set=rownames(hvg.out))
out5 <- plotTSNE(sce_qc, colour_by="number_of_genes") + ggtitle("Perplexity = 10")
sce_qc <- runTSNE(sce_qc, exprs_values="logcounts", perplexity=20,feature_set=rownames(hvg.out))
out10 <- plotTSNE(sce_qc, colour_by="number_of_genes") + ggtitle("Perplexity = 20")
sce_qc <- runTSNE(sce_qc, exprs_values="logcounts", perplexity=30,feature_set=rownames(hvg.out))
out20 <- plotTSNE(sce_qc, colour_by="number_of_genes") + ggtitle("Perplexity = 30")
gridExtra::grid.arrange(out5, out10, out20, ncol = 3)

LuyiTian/scPipe documentation built on Dec. 11, 2023, 8:21 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

LuyiTian/scPipe
Pipeline for single cell multi-omic data pre-processing

In LuyiTian/scPipe: Pipeline for single cell multi-omic data pre-processing

Parameters for each preprocessing step

Parameters for `sc_trim_barcode`

File paths

Read structure

Read filter

Parameters for alignment

Parameters for `sc_exon_mapping`

Parameters for `sc_demultiplex`

Parameters for `sc_gene_counting`

Data summary

Overall barcode statistics

Read alignment statistics

Summary and distributions of QC metrics

Quality control

Detect outlier cells

plot highest expression genes

remove low abundant genes

Data normalization

Normalization by `scran` and `scater`

Normalize the data using size factor and get high variable genes

Heatmap of top100 high variable genes

Dimensionality reduction using high variable genes

Dimensionality reduction by PCA

Dimensionality reduction by t-SNE

R Package Documentation

Browse R Packages

We want your feedback!

LuyiTian/scPipe Pipeline for single cell multi-omic data pre-processing

In LuyiTian/scPipe: Pipeline for single cell multi-omic data pre-processing

Parameters for each preprocessing step

Parameters for sc_trim_barcode

File paths

Read structure

Read filter

Parameters for alignment

Parameters for sc_exon_mapping

Parameters for sc_demultiplex

Parameters for sc_gene_counting

Data summary

Overall barcode statistics

Read alignment statistics

Summary and distributions of QC metrics

Quality control

Detect outlier cells

plot highest expression genes

remove low abundant genes

Data normalization

Normalization by scran and scater

Normalize the data using size factor and get high variable genes

Heatmap of top100 high variable genes

Dimensionality reduction using high variable genes

Dimensionality reduction by PCA

Dimensionality reduction by t-SNE

R Package Documentation

Browse R Packages

We want your feedback!

LuyiTian/scPipe
Pipeline for single cell multi-omic data pre-processing

Parameters for `sc_trim_barcode`

Parameters for `sc_exon_mapping`

Parameters for `sc_demultiplex`

Parameters for `sc_gene_counting`

Normalization by `scran` and `scater`