README.md

scChIPseqsim

This is an R-based simulator for single-cell reads of epigenomic data (scChIP-seq, scATAC-seq, etc.) based on pseudo-bulk profile. Specifically, the function scChIPseqsim takes a set of genomic coordinates (peaks) from pseudo-bulk data and will outcome single-cell counts from any number of cells. The distribution of reads counts across cells may or may not vary regarding the sequencing depth and the amount of reads associated with noise.

Installation

You can install the released version of scChIPseqsim from this repository with:

devtools::install_github("plbaldoni/scChIPseqsim")

Input

In summary, the necessary arguments of scChIPseqsim are:

Output

By default, scChIPseqsim will output an RData file (named after the label argument) that will contain the single-cell’s read counts (matCounts) as well as the gold-standard sets of peaks from the refined grid (matPeaks) and the coarse binned grid (matPeaksBinned). These matrices are ouput in sparseMatrix format.

If outputRanges=TRUE, scChIPseqsim will also output an RData file (named after the genome argument) the refined and coarse binned grid of the reference genome.

Example

In this example, I will simulate scChIPseq data for H3K27me3. Specifically, I will use the peaks from ‘bulk’ Helas3 and Huvec cell lines of the ENCODE project as my ‘pseudo-bulk’ sample. In reality, we could use any set of peaks from a given reference genome. The ultimate goal is to create artificial counts from these pseudo-bulk samples, merge them, and evaluate single-cell computational tools.

Step 1: loading libraries and general parameters

# Loading libraries

library(scChIPseqsim)
library(data.table)
library(GenomicRanges)
library(ggplot2)
library(magrittr)
library(dplyr)

# General parameters

# Number of cells per cluster to simulate (assuming balanced n for simplicity)
ncell <- 50 
# Target bin size
binSize <- 5000

Step 2: simulating cluster-specific counts from pseudo-bulk profile

# Simulating cluster 1

pkHelas3 <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHelas3H3k27me3StdPk.broadPeak.gz'
grHelas3 <- makeGRangesFromDataFrame(fread(pkHelas3),
                                     seqnames.field = 'V1',
                                     start.field = 'V2',
                                     end.field = 'V3')

scChIPseqsim(label = 'Helas3',
             path = '~/Downloads/',
             genome = 'hg19',
             grPeaks = grHelas3,
             binSize = binSize,
             baseSeqDepth = 30000,
             nCells = ncell,
             relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell),
             noisePerCell = seq(0,0.10,length.out = ncell))

# Simulating cluster 2

pkHuvec <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHuvecH3k27me3StdPk.broadPeak.gz'
grHuvec <- makeGRangesFromDataFrame(fread(pkHuvec),
                                    seqnames.field = 'V1',
                                    start.field = 'V2',
                                    end.field = 'V3')

scChIPseqsim(label = 'Huvec',
             path = '~/Downloads/',
             genome = 'hg19',
             grPeaks = grHuvec,
             binSize = binSize,
             baseSeqDepth = 30000,
             nCells = ncell,
             relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell),
             noisePerCell = seq(0,0.10,length.out = ncell),
             outputRanges = F)

Step 3: loading the data into memory

# Loading ranges

load('~/Downloads/scChIPseqsim_ranges_hg19_250bp_5000bp.RData')

# Loading counts

load('~/Downloads/scChIPseqsim_Helas3.RData')

matCountsHelas3 <- matCounts
matPeaksHelas3 <- matPeaks
matPeaksBinnedHelas3 <- matPeaksBinned

load('~/Downloads/scChIPseqsim_Huvec.RData')

matCountsHuvec <- matCounts
matPeaksHuvec <- matPeaks
matPeaksBinnedHuvec <- matPeaksBinned

# Removing unecessary files

rm(matCounts,matPeaks,matPeaksBinned)

Step 4: data visualization

# Subsetting matrices and merging

chr <- 'chr19'
coordinates = c(37356013,53891350)
ncell <- 100

matCountsSubset <- cbind(matCountsHelas3[which(seqnames(grGenomeBinned)==chr),],
                       matCountsHuvec[which(seqnames(grGenomeBinned)==chr),])

grGenomeBinnedSubset <- grGenomeBinned[which(seqnames(grGenomeBinned)==chr)]

ranges <- which.min(abs(start(grGenomeBinnedSubset)-coordinates[1])):
    which.min(abs(end(grGenomeBinnedSubset)-coordinates[2]))
# Plotting

fig <- as.matrix(matCountsSubset[ranges,]) %>%
    as.data.table() %>% 
    setnames(.,paste0('V',1:ncell)) %>%
    cbind(.,as.data.table(grGenomeBinnedSubset[ranges]))%>%
    melt(data = ., id.vars = c('seqnames','start','end'),measure.vars = paste0('V',1:ncell),
         variable.name = 'Cell',value.name = 'Count') %>%
    as_tibble() %>%
    mutate(CountFactor=cut(Count, breaks=c(-Inf, 0.5, 1.5, 2.5,Inf),
                           labels=c("0","1","2",'>2'))) %>%
    mutate(Cell=as.numeric(gsub("V","",Cell))) %>%
    mutate(Cluster=factor(ifelse(Cell%in%1:ncol(matCountsHelas3),'Helas3','Huvec'),
                          levels = c('Huvec','Helas3'))) %>%
    # Viz
    ggplot(aes(y = Cell,x =  start, fill= CountFactor)) + 
    facet_grid(rows = vars(Cluster),scales = 'free') +
    geom_tile() +
    theme_bw()+
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.text = element_text(size=12),axis.title = element_text(size = 12),
          legend.position = 'top',legend.direction = 'horizontal')+
    scale_x_continuous(labels = scales::scientific)+
    guides(fill=guide_legend(title="Simulated scChIP-seq count"))+
    ylab('Cells')+xlab(paste0('Genomic Window (',chr,')'))+
    labs(title = paste0('Simulated data for H3K27me3 (',binSize,'bp windows)'))+
    scale_fill_manual(values=c('white','gray','black','red'))

fig 

We can evaluate the above simulated data for single-cell experiments by looking at the respective genomic position from the UCSC genome browser:



plbaldoni/scChIPseqsim documentation built on June 11, 2020, 7:41 p.m.