After installing veni, we load the library. All functions in this vignette are implemented in the veni package.

library(veni)

Next, we load a dataset hosted on the 10X website. The expression matrix is a 1:1 mixture of human and mouse cells.

filename = "../../data/1k_hgmm_v3_filtered_feature_bc_matrix.h5"
# load 1:1 mixture of mouse and human cells
E = Read10Xh5( filename = filename)

After loading the dataset, we wish to identify empty droplets that only contain ambient RNA. We reasoned that such cells will be relatively deeply sequenced (many counts) with few unique genes (due to a limited pool of RNA). veni detects background cutoffs automatically with bivariate statistical analysis and visualizes the results.

# get initial cutoffs
C = GetCutoffs(E,min_counts = 10^3.5,max_counts = 10^4.25);
DepthPlot(E, Cutoffs = C, print.pdf = F)

Since we wish to have multiple cells in the background, we set more liberal thresholds.

C[[1]]$min_umi = 10^3.6
C[[1]]$min_genes = 1400
DepthPlot(E, Cutoffs = C, print.pdf = F)

Next, we inspect the genes detected and background distribution of cells. We want to see a consistent signal in the background cells.

# inspect background matrix
B = GetBackgroundMatrix(E, Cutoffs = C)
Q = InspectBackgroundMatrix(B)

We see eleven droplets, and we can clearly identify one of the droplets is an outlier. We remove it with a basic filtering strategy based on the average pairwise correlation.

B = FilterBackgroundMatrix(B)
Q = InspectBackgroundMatrix(B)

Now we investigate the gene composition of the background cells.

InspectBackgroundGenes(B, num.genes = 20)

We see that the background top 20 percent expressed genes are generally mitochondrial / ribosomal RNA from human samples. As a result, we expect that decontaminate will improve the purity of mouse cells significantly, and only modestly affect human cells. Now we get background distribution

B = GetBackgroundFromMatrix(B)

Now we run regressions. These regressions are implemented in C++ on the backend, and are parallelized. Although here the regressions are performed on 1,000 cells, this package will work with more than one million cells.

Z = Decontaminate(E, B) # remove ambient RNA

Compare results before and after

Z = Z[[1]]
logik = grepl("hg19", rownames(Z))
df = data.frame(state     = c(rep("before",ncol(E))    , rep("after", ncol(Z))),
                kmu_human = c(Matrix::colSums(E[logik,]), Matrix::colSums(Z[logik,])),
                kmu_mouse = c(Matrix::colSums(E[!logik,]), Matrix::colSums(Z[!logik,])),
                kmu      =  c(Matrix::colSums(E), Matrix::colSums(Z)))
df$ratio = log10(df$kmu_human/df$kmu_mouse)

ggplot2::ggplot(df, ggplot2::aes_string(x = "kmu", y = "ratio", color = "state")) + ggplot2::geom_point(alpha = 0.2) + ggplot2::xlab("Library size") +  ggplot2::ylab("Human to Mouse ratio") + ggplot2::scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),labels = scales::trans_format("log10", scales::math_format(10^.x)))

Decontaminate again

# get initial cutoffs
C = GetCutoffs(Z,min_counts = 10^2,max_counts = 10^3.5);
DepthPlot(Z, Cutoffs = C, print.pdf = F)

Hard set these thresholds.

C[[1]]$min_umi = 10^2.75
C[[1]]$min_genes = 250
DepthPlot(Z, Cutoffs = C, print.pdf = F)

Next, we inspect the genes detected and background distribution of cells. We want to see a consistent signal.

# inspect background matrix
B = GetBackgroundMatrix(Z, Cutoffs = C)
Q = InspectBackgroundMatrix(B)

Now we investigate the gene composition of the background cells.

InspectBackgroundGenes(B, num.genes = 20)

We see that the background top 20 percent expressed genes are generally mitochondrial / ribosomal RNA, but now we see a mixture of human and mouse samples.

B = GetBackgroundFromMatrix(B)

Run decontaminate again

Z_new = Decontaminate(Z, B) # remove ambient RNA

Compare results before and after

Z_new = Z_new[[1]]
logik = grepl("hg19", rownames(Z_new))
df = data.frame(state     = c(rep("before",ncol(E))    , rep("after", ncol(Z_new))),
                kmu_human = c(Matrix::colSums(E[logik,]), Matrix::colSums(Z_new[logik,])),
                kmu_mouse = c(Matrix::colSums(E[!logik,]), Matrix::colSums(Z_new[!logik,])),
                kmu      =  c(Matrix::colSums(E), Matrix::colSums(Z_new)))
df$ratio = log10(df$kmu_human/df$kmu_mouse)

ggplot2::ggplot(df, ggplot2::aes_string(x = "kmu", y = "ratio", color = "state")) + ggplot2::geom_point(alpha = 0.2) + ggplot2::xlab("Library size") +  ggplot2::ylab("Human to Mouse ratio") + ggplot2::scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),labels = scales::trans_format("log10", scales::math_format(10^.x)))

Save results

SaveCountsToH5(Z, data.dir = "../../decontaminated_data/")


sanofi-pi/veni documentation built on Oct. 12, 2020, 10:17 p.m.