In plbaldoni/scChIPseqsim: Single-cell ChIP-seq Data Simulator Based on Pseudo-bulk Profiles

library(knitr)
library(scChIPseqsim)
knitr::opts_chunk$set(
    collapse = TRUE,
    echo = T,
    eval = T,
    cache = T,
    comment = "#>",
    out.width = "100%"
)

scChIPseqsim

This is an R-based simulator for single-cell reads of epigenomic data (scChIP-seq, scATAC-seq, etc.) based on pseudo-bulk profile. Specifically, the function scChIPseqsim takes a set of genomic coordinates (peaks) from pseudo-bulk data and will outcome single-cell counts from any number of cells. The distribution of reads counts across cells may or may not vary regarding the sequencing depth and the amount of reads associated with noise.

Installation

You can install the released version of scChIPseqsim from this repository with:

devtools::install_github("plbaldoni/scChIPseqsim")

Input

In summary, the necessary arguments of scChIPseqsim are:

label: a character with the data label
path: a character with the path where the simulated data will be saved
genome: a character specifying the target genome
grPeaks: a GenomicRanges object with the coordinates of peaks from pseudo-bulk data
binSize: integer with the size of the bin window
baseSeqDepth: integer with the baseline sequencing depth for all simulated cells
nCells: integer with the number of cells to simulate
relSeqDepthPerCell: vector of size nCells with positive scalars representing the cell-specific relative sequencing depth with respect to baseSeqDepth. For instance, if relSeqDepthPerCell=rep(1,nCells) then all cells are simulated with an equal expected sequencing depth.
noisePerCell: vector of size nCells with positive scalars representing the cell-specific noise relative to the cell-specific expected sequencing depth. For instance, if noisePerCell=rep(0.05,nCells) then 5% of the reads from the cell-specific sequencing depth will be randomly assigned to any genomic bin (regardless if the bin overlaps grPeaks).
chromosome: vector of chromosomes to consider from the specified genome. Default is c(paste0('chr',1:22),'chrX')
basebinSize: integer giving the base bin size to calculate the gold-standard set of peaks. It must be smaller than binSize and usually small enough to create a refined grid of bins across the genome. Default 250
outputRanges: a logical indicating whether (TRUE) or not (FALSE) to save the binned coordinates of the genome. Default is TRUE

Output

By default, scChIPseqsim will output an RData file (named after the label argument) that will contain the single-cell's read counts (matCounts) as well as the gold-standard sets of peaks from the refined grid (matPeaks) and the coarse binned grid (matPeaksBinned). These matrices are ouput in sparseMatrix format.

If outputRanges=TRUE, scChIPseqsim will also output an RData file (named after the genome argument) the refined and coarse binned grid of the reference genome.

Example

In this example, I will simulate scChIPseq data for H3K27me3. Specifically, I will use the peaks from 'bulk' Helas3 and Huvec cell lines of the ENCODE project as my 'pseudo-bulk' sample. In reality, we could use any set of peaks from a given reference genome. The ultimate goal is to create artificial counts from these pseudo-bulk samples, merge them, and evaluate single-cell computational tools.

Step 1: loading libraries and general parameters

# Loading libraries

library(scChIPseqsim)
library(data.table)
library(GenomicRanges)
library(ggplot2)
library(magrittr)
library(dplyr)

# General parameters

# Number of cells per cluster to simulate (assuming balanced n for simplicity)
ncell <- 50 
# Target bin size
binSize <- 5000

Step 2: simulating cluster-specific counts from pseudo-bulk profile

# Simulating cluster 1

pkHelas3 <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHelas3H3k27me3StdPk.broadPeak.gz'
grHelas3 <- makeGRangesFromDataFrame(fread(pkHelas3),
                                     seqnames.field = 'V1',
                                     start.field = 'V2',
                                     end.field = 'V3')

scChIPseqsim(label = 'Helas3',
             path = '~/Downloads/',
             genome = 'hg19',
             grPeaks = grHelas3,
             binSize = binSize,
             baseSeqDepth = 30000,
             nCells = ncell,
             relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell),
             noisePerCell = seq(0,0.10,length.out = ncell))

# Simulating cluster 2

pkHuvec <- 'http://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone/wgEncodeBroadHistoneHuvecH3k27me3StdPk.broadPeak.gz'
grHuvec <- makeGRangesFromDataFrame(fread(pkHuvec),
                                    seqnames.field = 'V1',
                                    start.field = 'V2',
                                    end.field = 'V3')

scChIPseqsim(label = 'Huvec',
             path = '~/Downloads/',
             genome = 'hg19',
             grPeaks = grHuvec,
             binSize = binSize,
             baseSeqDepth = 30000,
             nCells = ncell,
             relSeqDepthPerCell = seq(0.8,1.20,length.out = ncell),
             noisePerCell = seq(0,0.10,length.out = ncell),
             outputRanges = F)

Step 3: loading the data into memory

# Loading ranges

load('~/Downloads/scChIPseqsim_ranges_hg19_250bp_5000bp.RData')

# Loading counts

load('~/Downloads/scChIPseqsim_Helas3.RData')

matCountsHelas3 <- matCounts
matPeaksHelas3 <- matPeaks
matPeaksBinnedHelas3 <- matPeaksBinned

load('~/Downloads/scChIPseqsim_Huvec.RData')

matCountsHuvec <- matCounts
matPeaksHuvec <- matPeaks
matPeaksBinnedHuvec <- matPeaksBinned

# Removing unecessary files

rm(matCounts,matPeaks,matPeaksBinned)

Step 4: data visualization

# Subsetting matrices and merging

chr <- 'chr19'
coordinates = c(37356013,53891350)
ncell <- 100

matCountsSubset <- cbind(matCountsHelas3[which(seqnames(grGenomeBinned)==chr),],
                       matCountsHuvec[which(seqnames(grGenomeBinned)==chr),])

grGenomeBinnedSubset <- grGenomeBinned[which(seqnames(grGenomeBinned)==chr)]

ranges <- which.min(abs(start(grGenomeBinnedSubset)-coordinates[1])):
    which.min(abs(end(grGenomeBinnedSubset)-coordinates[2]))

# Plotting

fig <- as.matrix(matCountsSubset[ranges,]) %>%
    as.data.table() %>% 
    setnames(.,paste0('V',1:ncell)) %>%
    cbind(.,as.data.table(grGenomeBinnedSubset[ranges]))%>%
    melt(data = ., id.vars = c('seqnames','start','end'),measure.vars = paste0('V',1:ncell),
         variable.name = 'Cell',value.name = 'Count') %>%
    as_tibble() %>%
    mutate(CountFactor=cut(Count, breaks=c(-Inf, 0.5, 1.5, 2.5,Inf),
                           labels=c("0","1","2",'>2'))) %>%
    mutate(Cell=as.numeric(gsub("V","",Cell))) %>%
    mutate(Cluster=factor(ifelse(Cell%in%1:ncol(matCountsHelas3),'Helas3','Huvec'),
                          levels = c('Huvec','Helas3'))) %>%
    # Viz
    ggplot(aes(y = Cell,x =  start, fill= CountFactor)) + 
    facet_grid(rows = vars(Cluster),scales = 'free') +
    geom_tile() +
    theme_bw()+
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.text = element_text(size=12),axis.title = element_text(size = 12),
          legend.position = 'top',legend.direction = 'horizontal')+
    scale_x_continuous(labels = scales::scientific)+
    guides(fill=guide_legend(title="Simulated scChIP-seq count"))+
    ylab('Cells')+xlab(paste0('Genomic Window (',chr,')'))+
    labs(title = paste0('Simulated data for H3K27me3 (',binSize,'bp windows)'))+
    scale_fill_manual(values=c('white','gray','black','red'))

fig

We can evaluate the above simulated data for single-cell experiments by looking at the respective genomic position from the UCSC genome browser:

knitr::include_graphics(path.expand("./README_files/figure-gfm/hgt_genome_4786a_fd69f0.png"))

plbaldoni/scChIPseqsim documentation built on June 11, 2020, 7:41 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

plbaldoni/scChIPseqsim
Single-cell ChIP-seq Data Simulator Based on Pseudo-bulk Profiles

In plbaldoni/scChIPseqsim: Single-cell ChIP-seq Data Simulator Based on Pseudo-bulk Profiles

scChIPseqsim

Installation

Input

Output

Example

Step 1: loading libraries and general parameters

Step 2: simulating cluster-specific counts from pseudo-bulk profile

Step 3: loading the data into memory

Step 4: data visualization

R Package Documentation

Browse R Packages

We want your feedback!

plbaldoni/scChIPseqsim Single-cell ChIP-seq Data Simulator Based on Pseudo-bulk Profiles

In plbaldoni/scChIPseqsim: Single-cell ChIP-seq Data Simulator Based on Pseudo-bulk Profiles

scChIPseqsim

Installation

Input

Output

Example

Step 1: loading libraries and general parameters

Step 2: simulating cluster-specific counts from pseudo-bulk profile

Step 3: loading the data into memory

Step 4: data visualization

R Package Documentation

Browse R Packages

We want your feedback!

plbaldoni/scChIPseqsim
Single-cell ChIP-seq Data Simulator Based on Pseudo-bulk Profiles