library(knitr) library(scChIPseqsim) knitr::opts_chunk$set( collapse = TRUE, echo = T, eval = T, cache = T, comment = "#>", out.width = "100%" )
This is an R-based simulator for single-cell reads of epigenomic data (scChIP-seq, scATAC-seq, etc.) based on pseudo-bulk profile. Specifically, the function scChIPseqsim
takes a set of genomic coordinates (peaks) from pseudo-bulk data and will outcome single-cell counts from any number of cells. The distribution of reads counts across cells may or may not vary regarding the sequencing depth and the amount of reads associated with noise.
You can install the released version of scChIPseqsim from this repository with:
devtools::install_github("plbaldoni/scChIPseqsim")
In summary, the necessary arguments of scChIPseqsim
are:
label
: a character with the data labelpath
: a character with the path where the simulated data will be savedgenome
: a character specifying the target genomegrPeaks
: a GenomicRanges object with the coordinates of peaks from pseudo-bulk databinSize
: integer with the size of the bin windowbaseSeqDepth
: integer with the baseline sequencing depth for all simulated cellsnCells
: integer with the number of cells to simulaterelSeqDepthPerCell
: vector of size nCells
with positive scalars representing the cell-specific relative sequencing depth with respect to baseSeqDepth
. For instance, if relSeqDepthPerCell=rep(1,nCells)
then all cells are simulated with an equal expected sequencing depth.noisePerCell
: vector of size nCells
with positive scalars representing the cell-specific noise relative to the cell-specific expected sequencing depth. For instance, if noisePerCell=rep(0.05,nCells)
then 5% of the reads from the cell-specific sequencing depth will be randomly assigned to any genomic bin (regardless if the bin overlaps grPeaks
).chromosome
: vector of chromosomes to consider from the specified genome
. Default is c(paste0('chr',1:22),'chrX')
basebinSize
: integer giving the base bin size to calculate the gold-standard set of peaks. It must be smaller than binSize and usually small enough to create a refined grid of bins across the genome
. Default 250outputRanges
: a logical indicating whether (TRUE
) or not (FALSE
) to save the binned coordinates of the genome. Default is TRUE
By default, scChIPseqsim will output an RData file (named after the label
argument) that will contain the single-cell's read counts (matCounts) as well as the gold-standard sets of peaks from the refined grid (matPeaks) and the coarse binned grid (matPeaksBinned). These matrices are ouput in sparseMatrix format.
If outputRanges=TRUE
, scChIPseqsim will also output an RData file (named after the genome
argument) the refined and coarse binned grid of the reference genome
.
In this example, I will simulate scChIPseq data for H3K27me3. Specifically, I will use the peaks from 'bulk' Helas3 and Huvec cell lines of the ENCODE project as my 'pseudo-bulk' sample. In reality, we could use any set of peaks from a given reference genome. The ultimate goal is to create artificial counts from these pseudo-bulk samples, merge them, and evaluate single-cell computational tools.
# Loading libraries library(scChIPseqsim) library(data.table) library(GenomicRanges) library(ggplot2) library(magrittr) library(dplyr) # General parameters # Number of cells per cluster to simulate (assuming balanced n for simplicity) ncell <- 50 # Target bin size binSize <- 5000
# Simulating cluster 1 pkHelas3 <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHelas3H3k27me3StdPk.broadPeak.gz' grHelas3 <- makeGRangesFromDataFrame(fread(pkHelas3), seqnames.field = 'V1', start.field = 'V2', end.field = 'V3') scChIPseqsim(label = 'Helas3', path = '~/Downloads/', genome = 'hg19', grPeaks = grHelas3, binSize = binSize, baseSeqDepth = 30000, nCells = ncell, relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell), noisePerCell = seq(0,0.10,length.out = ncell)) # Simulating cluster 2 pkHuvec <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHuvecH3k27me3StdPk.broadPeak.gz' grHuvec <- makeGRangesFromDataFrame(fread(pkHuvec), seqnames.field = 'V1', start.field = 'V2', end.field = 'V3') scChIPseqsim(label = 'Huvec', path = '~/Downloads/', genome = 'hg19', grPeaks = grHuvec, binSize = binSize, baseSeqDepth = 30000, nCells = ncell, relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell), noisePerCell = seq(0,0.10,length.out = ncell), outputRanges = F)
# Loading ranges load('~/Downloads/scChIPseqsim_ranges_hg19_250bp_5000bp.RData') # Loading counts load('~/Downloads/scChIPseqsim_Helas3.RData') matCountsHelas3 <- matCounts matPeaksHelas3 <- matPeaks matPeaksBinnedHelas3 <- matPeaksBinned load('~/Downloads/scChIPseqsim_Huvec.RData') matCountsHuvec <- matCounts matPeaksHuvec <- matPeaks matPeaksBinnedHuvec <- matPeaksBinned # Removing unecessary files rm(matCounts,matPeaks,matPeaksBinned)
# Subsetting matrices and merging chr <- 'chr19' coordinates = c(37356013,53891350) ncell <- 100 matCountsSubset <- cbind(matCountsHelas3[which(seqnames(grGenomeBinned)==chr),], matCountsHuvec[which(seqnames(grGenomeBinned)==chr),]) grGenomeBinnedSubset <- grGenomeBinned[which(seqnames(grGenomeBinned)==chr)] ranges <- which.min(abs(start(grGenomeBinnedSubset)-coordinates[1])): which.min(abs(end(grGenomeBinnedSubset)-coordinates[2]))
# Plotting fig <- as.matrix(matCountsSubset[ranges,]) %>% as.data.table() %>% setnames(.,paste0('V',1:ncell)) %>% cbind(.,as.data.table(grGenomeBinnedSubset[ranges]))%>% melt(data = ., id.vars = c('seqnames','start','end'),measure.vars = paste0('V',1:ncell), variable.name = 'Cell',value.name = 'Count') %>% as_tibble() %>% mutate(CountFactor=cut(Count, breaks=c(-Inf, 0.5, 1.5, 2.5,Inf), labels=c("0","1","2",'>2'))) %>% mutate(Cell=as.numeric(gsub("V","",Cell))) %>% mutate(Cluster=factor(ifelse(Cell%in%1:ncol(matCountsHelas3),'Helas3','Huvec'), levels = c('Huvec','Helas3'))) %>% # Viz ggplot(aes(y = Cell,x = start, fill= CountFactor)) + facet_grid(rows = vars(Cluster),scales = 'free') + geom_tile() + theme_bw()+ theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text = element_text(size=12),axis.title = element_text(size = 12), legend.position = 'top',legend.direction = 'horizontal')+ scale_x_continuous(labels = scales::scientific)+ guides(fill=guide_legend(title="Simulated scChIP-seq count"))+ ylab('Cells')+xlab(paste0('Genomic Window (',chr,')'))+ labs(title = paste0('Simulated data for H3K27me3 (',binSize,'bp windows)'))+ scale_fill_manual(values=c('white','gray','black','red')) fig
We can evaluate the above simulated data for single-cell experiments by looking at the respective genomic position from the UCSC genome browser:
knitr::include_graphics(path.expand("./README_files/figure-gfm/hgt_genome_4786a_fd69f0.png"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.