Converting BUS format into sparse matrix

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The Barcode, UMI, Set (BUS) format is a new way to represent pseudoalignments of reads from RNA-seq. Files in this format can be generated efficiently by the command line tool kallisto bus. With kallisto bus and this package, we can go from FASTQ files to the sparse matrix used for downstream analysis, for example with Seurat, within half an hour, whereas Cell Ranger would take hours.

In this vignette, we convert a 10x 1:1 mouse and human cell mixture dataset from the BUS format to a sparse matrix. To see how the BUS format can be generated from FASTQ files, as well as more in-depth vignettes, see the website of this package.

Note that this vignette is deprecated and is kept for historical reasons, as it was implemented when kallisto | bustools was experimental. The functionality of make_sparse_matrix has been implemented more efficiently in the command line tool bustools. Please use the updated version of bustools and, if you wish, the wrapper kb instead.

Download the dataset

# The dataset package
library(TENxBUSData)
library(BUSpaRse)
library(Matrix)
library(zeallot)
library(ggplot2)
TENxBUSData(".", dataset = "hgmm100")

Convert to sparse matrix

First, we map transcripts, as in the kallisto index, to the corresponding genes.

tr2g <- transcript2gene(species = c("Homo sapiens", "Mus musculus"), type = "vertebrate",
                        kallisto_out_path = "./out_hgmm100", ensembl_version = 99,
                        write_tr2g = FALSE)
head(tr2g)
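
To make the purpose of this mapping concrete, here is a minimal base R sketch of collapsing transcript-level counts to gene-level counts. The toy table, its column names (`transcript`, `gene`), the IDs, and the counts are all invented for illustration and are not the actual output of transcript2gene. It is also a simplification: BUS records refer to sets of compatible transcripts, which this sketch does not model.

```r
# Toy transcript-to-gene table (hypothetical IDs, not real Ensembl entries)
toy_tr2g <- data.frame(
  transcript = c("tx1", "tx2", "tx3", "tx4"),
  gene       = c("geneA", "geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Toy per-transcript counts for a single cell (made-up numbers)
tx_counts <- c(tx1 = 3, tx2 = 2, tx3 = 5, tx4 = 0)

# Look up the gene of each transcript, then sum the counts within each gene
gene_of_tx  <- toy_tr2g$gene[match(names(tx_counts), toy_tr2g$transcript)]
gene_counts <- tapply(tx_counts, gene_of_tx, sum)
gene_counts
```

Transcripts tx1 and tx2 both belong to geneA, so their counts are added together, which is the kind of lookup the mapping makes possible.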

Here we make both the gene count matrix and the transcript compatibility count (TCC) matrix.

c(gene_count, tcc) %<-% make_sparse_matrix("./out_hgmm100/output.sorted.txt",
                               tr2g = tr2g, est_ncells = 1e5,
                               est_ngenes = nrow(tr2g))
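
A sparse matrix stores only the nonzero entries, which suits single-cell data where most entries are zero. The toy example below, built with `Matrix::sparseMatrix`, sketches a layout with features in rows and barcodes in columns. The row and column names and the values are invented for illustration and do not come from this dataset.

```r
library(Matrix)

# Four nonzero entries in a 3 x 4 matrix: (row, column, value) triplets
toy <- sparseMatrix(
  i = c(1, 3, 2, 1),          # row (feature) indices
  j = c(1, 1, 2, 4),          # column (barcode) indices
  x = c(4, 1, 7, 2),          # counts
  dims = c(3, 4),
  dimnames = list(paste0("gene", 1:3), paste0("barcode", 1:4))
)

dim(toy)                              # features by barcodes
totals_toy <- Matrix::colSums(toy)    # total count per barcode
totals_toy
```

Column sums give the total count per barcode, which is the quantity later used to separate cells from empty droplets.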

Remove empty droplets

Here we have a sparse matrix with genes in rows and cells in columns.

dim(gene_count)

This dataset should only have about 100 cells, but here we get over 100,000 barcodes. In fact, most of the barcodes correspond to empty droplets; they can be removed by filtering out barcodes with too few UMIs.

tot_counts <- Matrix::colSums(gene_count)
summary(tot_counts)
df1 <- get_knee_df(gene_count)
infl1 <- get_inflection(df1)
knee_plot(df1, infl1)
gene_count <- gene_count[, tot_counts > infl1]
dim(gene_count)

Then this sparse matrix can be used in Seurat for downstream analysis.
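
The idea behind thresholding on total UMI counts can be sketched on simulated data. The example below is a simplified stand-in, not the algorithm used by get_inflection: it ranks barcodes by total count and places the threshold at the largest drop on the log scale between consecutive barcodes. The simulated counts (100 "cells" and 5000 "empty droplets") and all variable names are assumptions made for illustration.

```r
set.seed(1)

# Simulated total UMI counts: a few real cells and many empty droplets
cells   <- rpois(100,  lambda = 5000)
empties <- rpois(5000, lambda = 5)

# Rank barcodes from highest to lowest total count, dropping zero-count barcodes
totals <- c(cells, empties)
totals <- sort(totals[totals > 0], decreasing = TRUE)

# Drop in log10(total) between consecutive ranked barcodes
log_gap <- -diff(log10(totals))

# The largest drop separates cells from empty droplets in this toy setting
i         <- which.max(log_gap)
threshold <- totals[i + 1]

# Keep barcodes whose total count exceeds the threshold
keep <- totals > threshold
sum(keep)
```

On real data the transition is rarely this sharp, so it is worth inspecting the knee plot before settling on a threshold.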

Likewise, we can remove empty droplets from the TCC matrix.

dim(tcc)

As with the gene count matrix, most of these barcodes correspond to empty droplets, so we filter out barcodes with too few UMIs in the same way.

tot_counts <- Matrix::colSums(tcc)
summary(tot_counts)
df2 <- get_knee_df(tcc)
infl2 <- get_inflection(df2)
knee_plot(df2, infl2)
tcc <- tcc[, tot_counts > infl2]
dim(tcc)
sessionInfo()


BUSpaRse documentation built on March 3, 2021, 2:01 a.m.