This is an R-based simulator for single-cell reads of epigenomic data
(scChIP-seq, scATAC-seq, etc.) based on pseudo-bulk profile.
Specifically, the function scChIPseqsim
takes a set of genomic
coordinates (peaks) from pseudo-bulk data and will outcome single-cell
counts from any number of cells. The distribution of reads counts across
cells may or may not vary regarding the sequencing depth and the amount
of reads associated with noise.
You can install the released version of scChIPseqsim from this repository with:
devtools::install_github("plbaldoni/scChIPseqsim")
In summary, the necessary arguments of scChIPseqsim
are:
label
: a character with the data labelpath
: a character with the path where the simulated data will be
savedgenome
: a character specifying the target genomegrPeaks
: a GenomicRanges object with the coordinates of peaks from
pseudo-bulk databinSize
: integer with the size of the bin windowbaseSeqDepth
: integer with the baseline sequencing depth for all
simulated cellsnCells
: integer with the number of cells to simulaterelSeqDepthPerCell
: vector of size nCells
with positive scalars
representing the cell-specific relative sequencing depth with
respect to baseSeqDepth
. For instance, if
relSeqDepthPerCell=rep(1,nCells)
then all cells are simulated with
an equal expected sequencing depth.noisePerCell
: vector of size nCells
with positive scalars
representing the cell-specific noise relative to the cell-specific
expected sequencing depth. For instance, if
noisePerCell=rep(0.05,nCells)
then 5% of the reads from the
cell-specific sequencing depth will be randomly assigned to any
genomic bin (regardless if the bin overlaps grPeaks
).chromosome
: vector of chromosomes to consider from the specified
genome
. Default is c(paste0('chr',1:22),'chrX')
basebinSize
: integer giving the base bin size to calculate the
gold-standard set of peaks. It must be smaller than binSize and
usually small enough to create a refined grid of bins across the
genome
. Default 250outputRanges
: a logical indicating whether (TRUE
) or not
(FALSE
) to save the binned coordinates of the genome. Default is
TRUE
By default, scChIPseqsim will output an RData file (named after the
label
argument) that will contain the single-cell’s read counts
(matCounts) as well as the gold-standard sets of peaks from the refined
grid (matPeaks) and the coarse binned grid (matPeaksBinned). These
matrices are ouput in sparseMatrix format.
If outputRanges=TRUE
, scChIPseqsim will also output an RData file
(named after the genome
argument) the refined and coarse binned grid
of the reference genome
.
In this example, I will simulate scChIPseq data for H3K27me3. Specifically, I will use the peaks from ‘bulk’ Helas3 and Huvec cell lines of the ENCODE project as my ‘pseudo-bulk’ sample. In reality, we could use any set of peaks from a given reference genome. The ultimate goal is to create artificial counts from these pseudo-bulk samples, merge them, and evaluate single-cell computational tools.
# Loading libraries
library(scChIPseqsim)
library(data.table)
library(GenomicRanges)
library(ggplot2)
library(magrittr)
library(dplyr)
# General parameters
# Number of cells per cluster to simulate (assuming balanced n for simplicity)
ncell <- 50
# Target bin size
binSize <- 5000
# Simulating cluster 1
pkHelas3 <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHelas3H3k27me3StdPk.broadPeak.gz'
grHelas3 <- makeGRangesFromDataFrame(fread(pkHelas3),
seqnames.field = 'V1',
start.field = 'V2',
end.field = 'V3')
scChIPseqsim(label = 'Helas3',
path = '~/Downloads/',
genome = 'hg19',
grPeaks = grHelas3,
binSize = binSize,
baseSeqDepth = 30000,
nCells = ncell,
relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell),
noisePerCell = seq(0,0.10,length.out = ncell))
# Simulating cluster 2
pkHuvec <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHuvecH3k27me3StdPk.broadPeak.gz'
grHuvec <- makeGRangesFromDataFrame(fread(pkHuvec),
seqnames.field = 'V1',
start.field = 'V2',
end.field = 'V3')
scChIPseqsim(label = 'Huvec',
path = '~/Downloads/',
genome = 'hg19',
grPeaks = grHuvec,
binSize = binSize,
baseSeqDepth = 30000,
nCells = ncell,
relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell),
noisePerCell = seq(0,0.10,length.out = ncell),
outputRanges = F)
# Loading ranges
load('~/Downloads/scChIPseqsim_ranges_hg19_250bp_5000bp.RData')
# Loading counts
load('~/Downloads/scChIPseqsim_Helas3.RData')
matCountsHelas3 <- matCounts
matPeaksHelas3 <- matPeaks
matPeaksBinnedHelas3 <- matPeaksBinned
load('~/Downloads/scChIPseqsim_Huvec.RData')
matCountsHuvec <- matCounts
matPeaksHuvec <- matPeaks
matPeaksBinnedHuvec <- matPeaksBinned
# Removing unecessary files
rm(matCounts,matPeaks,matPeaksBinned)
# Subsetting matrices and merging
chr <- 'chr19'
coordinates = c(37356013,53891350)
ncell <- 100
matCountsSubset <- cbind(matCountsHelas3[which(seqnames(grGenomeBinned)==chr),],
matCountsHuvec[which(seqnames(grGenomeBinned)==chr),])
grGenomeBinnedSubset <- grGenomeBinned[which(seqnames(grGenomeBinned)==chr)]
ranges <- which.min(abs(start(grGenomeBinnedSubset)-coordinates[1])):
which.min(abs(end(grGenomeBinnedSubset)-coordinates[2]))
# Plotting
fig <- as.matrix(matCountsSubset[ranges,]) %>%
as.data.table() %>%
setnames(.,paste0('V',1:ncell)) %>%
cbind(.,as.data.table(grGenomeBinnedSubset[ranges]))%>%
melt(data = ., id.vars = c('seqnames','start','end'),measure.vars = paste0('V',1:ncell),
variable.name = 'Cell',value.name = 'Count') %>%
as_tibble() %>%
mutate(CountFactor=cut(Count, breaks=c(-Inf, 0.5, 1.5, 2.5,Inf),
labels=c("0","1","2",'>2'))) %>%
mutate(Cell=as.numeric(gsub("V","",Cell))) %>%
mutate(Cluster=factor(ifelse(Cell%in%1:ncol(matCountsHelas3),'Helas3','Huvec'),
levels = c('Huvec','Helas3'))) %>%
# Viz
ggplot(aes(y = Cell,x = start, fill= CountFactor)) +
facet_grid(rows = vars(Cluster),scales = 'free') +
geom_tile() +
theme_bw()+
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_text(size=12),axis.title = element_text(size = 12),
legend.position = 'top',legend.direction = 'horizontal')+
scale_x_continuous(labels = scales::scientific)+
guides(fill=guide_legend(title="Simulated scChIP-seq count"))+
ylab('Cells')+xlab(paste0('Genomic Window (',chr,')'))+
labs(title = paste0('Simulated data for H3K27me3 (',binSize,'bp windows)'))+
scale_fill_manual(values=c('white','gray','black','red'))
fig
We can evaluate the above simulated data for single-cell experiments by looking at the respective genomic position from the UCSC genome browser:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.