knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
Repsc can compute family-wise multiple sequence alignments of all repeat sequences in the genome to improve read mapping onto a consensus model. To do so, it requires the genome assembly of your organism of choice stored as a BSgenome object. You can retrieve the full list of supported genomes by typing BSgenome::available.genomes(), or create a custom BSgenome object following the instructions.
Since we are working with expression data from human cancer cell lines and mouse embryos, we will first install the UCSC hg38 and mm10 assemblies.
BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
BiocManager::install("BSgenome.Mmusculus.UCSC.mm10")
Important note: Do not use repeat-masked BSgenome objects (their names carry a 'masked' suffix, e.g. BSgenome.Hsapiens.UCSC.hg38.masked)!
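A small guard can catch a masked package name before the genome is loaded. The following is a minimal sketch in base R; the helper name assert_unmasked_genome is hypothetical and not part of Repsc.

```r
# Hypothetical helper (not part of Repsc): stop early if a BSgenome
# package name carries the '.masked' suffix.
assert_unmasked_genome <- function(pkg) {
  stopifnot(is.character(pkg), length(pkg) == 1L)
  if (grepl("\\.masked$", pkg)) {
    stop("Repeat-masked genome supplied: ", pkg,
         ". Use the unmasked package instead.", call. = FALSE)
  }
  invisible(pkg)
}

genome_pkg <- "BSgenome.Hsapiens.UCSC.hg38"
assert_unmasked_genome(genome_pkg)  # passes silently
```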
To compute family-wise multiple sequence alignments (used to improve mapping of read/UMI signal along consensus TE models), Repsc utilizes the MAFFT multiple alignment program. To use this feature, make sure mafft is in your command PATH.
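Whether mafft is visible from R can be checked with base R's Sys.which(). This is only a sketch; the helper name tool_on_path is hypothetical and not part of Repsc.

```r
# Sys.which() returns "" for executables that are not found on the PATH.
# Hypothetical helper (not part of Repsc).
tool_on_path <- function(tool) {
  unname(nzchar(Sys.which(tool)))
}

if (!tool_on_path("mafft")) {
  warning("mafft was not found on the PATH.")
}
```

The same check applies to any other command-line dependency.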
A convenient solution is to download transposon coordinates from the RepeatMasker homepage or the Dfam database for your genome and assembly of choice. You can import the RepeatMasker fa.out.gz or Dfam dfam.hits.gz files using the Repsc importRMSK and importDFAM functions, respectively. Another option is to provide a custom annotation, as long as it contains the basic information about chromosome, start, end, strand, repname (family identifier), and id_unique (unique locus identifier). In this tutorial, we will show you how to import such information using the provided example datasets.
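For a custom annotation, it can help to verify the required fields before handing the table to Repsc. The sketch below uses a plain data.frame with made-up coordinates. Which object class Repsc actually expects is not stated here, so the data.frame layout and the validator check_te_annotation are assumptions for illustration only.

```r
# Columns named in the text as the minimum a custom annotation must carry.
required_cols <- c("chromosome", "start", "end", "strand", "repname", "id_unique")

# Toy annotation with made-up coordinates, for illustration only.
custom_tes <- data.frame(
  chromosome = c("chr1", "chr1", "chr2"),
  start      = c(1000L, 5000L, 200L),
  end        = c(1300L, 5900L, 650L),
  strand     = c("+", "-", "+"),
  repname    = c("AluY", "L1HS", "AluY"),
  id_unique  = c("AluY_1", "L1HS_1", "AluY_2"),
  stringsAsFactors = FALSE
)

# Hypothetical validator (not part of Repsc).
check_te_annotation <- function(x) {
  missing_cols <- setdiff(required_cols, colnames(x))
  if (length(missing_cols) > 0L) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "), call. = FALSE)
  }
  stopifnot(
    all(x$start <= x$end),                # intervals must not be inverted
    all(x$strand %in% c("+", "-", "*")),  # standard strand symbols only
    !anyDuplicated(x$id_unique)           # locus identifiers must be unique
  )
  invisible(TRUE)
}

check_te_annotation(custom_tes)
```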
In addition to TE expression counts, Repsc also quantifies genic expression levels using common gene interval and annotation formats (e.g. GTF). In this tutorial, we will use the GENCODE comprehensive gene annotation on the human reference chromosomes only. Other annotation resources are currently untested and should be used with caution.
Repsc requires read alignment coordinates stored in BAM format as input. Such files are routinely generated by most common scRNA-seq workflows, including the 10x Genomics Cell Ranger pipeline. BAM inputs should be deduplicated (see below). Chunking BAM inputs (e.g. by chromosome) can accelerate the import into your R environment using the importBAM function. Important: Repsc assumes that the cell barcode is either stored in the CB tag or that BAM input files are separated per cell. Other formats (e.g. cell barcode in the read name) are currently not supported.
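To make the CB tag convention concrete, the sketch below pulls the cell barcode out of SAM-formatted text records, where optional fields follow the TAG:TYPE:VALUE pattern from column 12 onward. The helper extract_cb and the two records are made up for illustration; this is not how Repsc reads its input.

```r
# Extract the value of the CB:Z: optional field from SAM-formatted records.
# Returns NA for records without a CB tag. Hypothetical helper, not part of Repsc.
extract_cb <- function(sam_lines) {
  vapply(strsplit(sam_lines, "\t", fixed = TRUE), function(fields) {
    # Optional TAG:TYPE:VALUE fields start at column 12 of a SAM record.
    opt <- if (length(fields) > 11L) fields[12:length(fields)] else character(0)
    hit <- opt[startsWith(opt, "CB:Z:")]
    if (length(hit) == 0L) NA_character_ else sub("^CB:Z:", "", hit[1])
  }, character(1))
}

# Two made-up records: one with a cell barcode, one without.
sam <- c(
  paste("read1", "0", "chr1", "100", "255", "50M", "*", "0", "0",
        strrep("A", 50), strrep("I", 50), "NH:i:1",
        "CB:Z:AAACCTGAGAAACCAT-1", sep = "\t"),
  paste("read2", "0", "chr1", "200", "255", "50M", "*", "0", "0",
        strrep("C", 50), strrep("I", 50), "NH:i:1", sep = "\t")
)

extract_cb(sam)
```

Records that return NA here would need their barcode supplied another way, for example by splitting the BAM input per cell.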
To deduplicate reads based on UMI sequences, we can use the Reputils::deduplicateBAM function, which groups reads into genomic bins and removes duplicate reads using UMI-tools. Make sure UMI-tools is installed and in your command PATH.
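The idea behind UMI-based deduplication can be illustrated on a toy table: reads from the same cell that share a mapping position and UMI are treated as copies of one molecule. The sketch below is a deliberate simplification in base R that only collapses exact matches. It is not the deduplicateBAM implementation, it ignores sequencing errors in UMIs (which UMI-tools can account for), and all names and values in it are made up.

```r
# Toy alignments with made-up values: cell barcode, chromosome, start, UMI.
reads <- data.frame(
  cell  = c("c1", "c1", "c1", "c2", "c2"),
  chr   = c("chr1", "chr1", "chr1", "chr1", "chr1"),
  start = c(100L, 100L, 100L, 100L, 100L),
  umi   = c("ACGT", "ACGT", "TTGA", "ACGT", "ACGT"),
  stringsAsFactors = FALSE
)

# Keep the first read for every (cell, chr, start, umi) combination.
# Hypothetical helper, exact matching only.
dedup_exact <- function(x) {
  key <- x[, c("cell", "chr", "start", "umi")]
  x[!duplicated(key), , drop = FALSE]
}

deduped <- dedup_exact(reads)
nrow(deduped)  # 3
```

Note that the same UMI in two different cells is kept twice, since the cell barcode is part of the key.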