knitr::opts_chunk$set(eval = FALSE) knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
In this part of the tutorial, we will use Smart-seq2 data generated on early mouse embryos[1] to illustrate the specifics of full-coverage scRNA-seq data.
We start the workflow by loading Repsc and the mouse mm10 BSgenome object into our R environment:
devtools::load_all('/net/mraid14/export/tgdata/users/davidbr/src/Repsc/') # adjust to your genome of interest (e.g. BSgenome.Hsapiens.UCSC.hg38) library(BSgenome.Mmusculus.UCSC.mm10)
As before, we first import our gene and TE annotation files as GRanges objects followed by Repsc-specific curation and formatting using the curateGenes
and curateTEs
functions.
# path to Gencode gtf file (provided) gene_path <- system.file(package = 'Repsc', 'extdata', 'gencode.vM21.annotation.gtf.gz') # path to RepeatMasker hg38 repeat annotation (provided) rmsk_path <- system.file(package = 'Repsc', 'extdata', 'mm10.fa.out.gz') # import genes <- rtracklayer::import(gene_path) tes <- importRMSK(rmsk_path) # by default only imports DNA, SINE, LINE, and LTR class repeats on the main reference chromosomes # curation genes_cur <- curateGenes(genes) # only retains exons, resolves naming ambiguity tes_cur <- curateTEs(tes, # tries to resolve nested TE intervals and adds annotation genes_cur) # to filter exonic TEs
Given the size of the example Smart-seq2 dataset and memory limitations of most systems, we will compute single-cell gene/TE counts for subsets of the BAM files and combine the processed results afterwards. For enhanced speed performance, we will employ the parallel backend of the future package.
# path to bam files containing mapped reads bam_paths <- dir('/net/mraid14/export/tgdata/users/evghenic/data/cheng2019/aligned/', pattern = 'duprm.bam', full.names = TRUE) bam_paths # vector of cell barcodes barcodes <- substring(basename(bam_paths), 1, 10) # commands for parallel execution cmds <- list() chunks <- chunk(bam_paths, chunk_size = 10) for (i in 1:length(chunks)) { bam_chunks <- glue("c({glue_collapse(single_quote(bam_paths[chunks[[i]]]), sep = ',')})") barcode_chunks <- glue("c({glue_collapse(single_quote(barcodes[chunks[[i]]]), sep = ',')})") cmds[[i]] <- glue("Repsc:::compCounts(reads = Repsc:::importBAM({bam_chunks}, paired = FALSE, mate = NA, anchor = 'fiveprime', barcode = {barcode_chunks}), tes_cur, genes_cur, bin_size = 20, protocol = 'full')") } # distribute commands # create new empty env and fill with relevant empty <- new.env(parent = emptyenv()) empty$tes_cur <- tes_cur empty$genes_cur <- genes_cur # distribute, compute and combine counts <- gcluster.run3(command_list = cmds, packages = c("data.table", "plyranges", "base", "stats"), envir = empty, collapse_results = TRUE, io_saturation = TRUE)
The full transcript coverage of Smart-seq2 comes with a cost: Unlike most scRNA-seq protocols it lacks UMIs to quantify true molecule counts per cell. As an approximation, Repsc samples N
molecules per cell with empirical weights learned from the data to simulate integer molecule counts.
counts_norm <- normCounts(counts, N = 20000) # number of molecules to sample per cell
[1]
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.