knitr::opts_chunk$set(eval = FALSE)
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

Workflow mouse Smart-seq2 scRNA-seq dataset

In this part of the tutorial, we will use Smart-seq2 data generated on early mouse embryos[1] to illustrate the specifics of full-coverage scRNA-seq data.

Getting started

We start the workflow by loading Repsc and the mouse mm10 BSgenome object into our R environment:

devtools::load_all('/net/mraid14/export/tgdata/users/davidbr/src/Repsc/')

# adjust to your genome of interest (e.g. BSgenome.Hsapiens.UCSC.hg38)
library(BSgenome.Mmusculus.UCSC.mm10)

Import genome annotation

As before, we first import our gene and TE annotation files as GRanges objects followed by Repsc-specific curation and formatting using the curateGenes and curateTEs functions.

# path to Gencode gtf file (provided)
gene_path <- system.file(package = 'Repsc', 
                        'extdata', 
                        'gencode.vM21.annotation.gtf.gz')
# path to RepeatMasker hg38 repeat annotation (provided)
rmsk_path <- system.file(package = 'Repsc', 
                         'extdata', 
                         'mm10.fa.out.gz')

# import
genes <- rtracklayer::import(gene_path)
tes   <- importRMSK(rmsk_path)         # by default only imports DNA, SINE, LINE, and LTR class repeats on the main reference chromosomes                           

# curation
genes_cur <- curateGenes(genes)        # only retains exons, resolves naming ambiguity
tes_cur   <- curateTEs(tes,            # tries to resolve nested TE intervals and adds annotation
                       genes_cur)      # to filter exonic TEs

Import reads and compute count matrix in parallel

Given the size of the example Smart-seq2 dataset and memory limitations of most systems, we will compute single-cell gene/TE counts for subsets of the BAM files and combine the processed results afterwards. For enhanced speed performance, we will employ the parallel backend of the future package.

# path to bam files containing mapped reads                         
bam_paths <- dir('/net/mraid14/export/tgdata/users/evghenic/data/cheng2019/aligned/', 
                 pattern = 'duprm.bam', 
                 full.names = TRUE)
bam_paths

# vector of cell barcodes 
barcodes <- substring(basename(bam_paths), 1, 10)

# commands for parallel execution
cmds   <- list()
chunks <- chunk(bam_paths, chunk_size = 10)

for (i in 1:length(chunks))
{
  bam_chunks     <- glue("c({glue_collapse(single_quote(bam_paths[chunks[[i]]]), sep = ',')})")
  barcode_chunks <- glue("c({glue_collapse(single_quote(barcodes[chunks[[i]]]), sep = ',')})")

  cmds[[i]] <- glue("Repsc:::compCounts(reads = Repsc:::importBAM({bam_chunks}, paired = FALSE, mate = NA, anchor = 'fiveprime', barcode = {barcode_chunks}),
                     tes_cur,
                     genes_cur,
                     bin_size = 20,
                     protocol = 'full')")
}

# distribute commands

# create new empty env and fill with relevant
empty           <- new.env(parent = emptyenv())
empty$tes_cur   <- tes_cur
empty$genes_cur <- genes_cur

# distribute, compute and combine
counts <- 
  gcluster.run3(command_list = cmds,  
                packages = c("data.table", "plyranges", "base", "stats"), 
                envir = empty, 
                collapse_results = TRUE,
                io_saturation = TRUE)

Normalize counts

The full transcript coverage of Smart-seq2 comes with a cost: Unlike most scRNA-seq protocols it lacks UMIs to quantify true molecule counts per cell. As an approximation, Repsc samples N molecules per cell with empirical weights learned from the data to simulate integer molecule counts.

counts_norm <- normCounts(counts, 
                          N = 20000) # number of molecules to sample per cell

References

[1]

Session information

sessionInfo()


tanaylab/repsc documentation built on Jan. 9, 2021, 9:39 a.m.