knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

1. Introduction

Single cell HiC techniques enable us to study the between-cell variability in long-distance interactions and genomic features. However, single cell HiC data usually suffers from excessive zeroes due to a lack of sequencing depth. Among those zeros in scHiC contact matrix, some are structural zero (SZ) because the corresponding pairs do not interact with each other at all, while others are dropout (DO) as a result of low sequencing depth. While those dropouts happen at random, those structural zeros do not.

scHiCBayes (Xie and Lin, 2021) distinguishs DO from SZ and make imputations at the DO positions, based on a Bsyesian hierarchical model that take information from neighborhood, similar cells, and bulk data.

In this package, we provide the following main functions:

We will illustrate the usage of these functions in the following sections.

2. MCMCImpute

MCMCImpute imputes single cell HiC data under a Bayesian framework.

2.2 Input data format

The following show all the parameters in MCMCImpute function:

MCMCImpute(niter = 30000,burnin = 15000,single,bulk = bulk,startval = c(100, 100, 10, 8, 10, 0.1, 900, 0.2, 0, replicate(dim(single)[2], 8)),n,epsilon1 = 0.5,epsilon2 = 5,mc.cores = 1,cutoff = 0.5)

niter is the number of iterations for MCMC. The default value is 30,000.

burnin is the number of burn-in interation. The default value is 15,000.

single is a single cell matrix with each column being the vector of upper triangular matrix of a single cell. For a single cell matrix of size $n \times n$, the length of the vector should be $n\times(n-1)/2$. We only need the upper triangular matrix because the HiC matrix are symmetrical. You can use function $mattovec(\cdot)$ to transform your matrix into its upper triangular vector.

Here is an example for the format of single cell matrix:

options(digits = 2)
library(scHiCBayes)
data(simudat)
head(simudat)

In this example, there are 10 single cells, and each column is a vector of the upper triangular of each single cell. Since this simudat is in dimension $61\times 61$ so that each single cell has a vector of length $61\times 60/2=1830$.

bulk is a vector of bulk data. Bulk data is a combination of single cells, and it provides information for prior settings. If bulk data is not availble, set it to be NULL, and MCMCImpute will sum up the single cells to construct a bulk data.

startval is the starting value of MCMC chain. The default value is $c(100, 100, 10, 8, 10, 0.1, 900, 0.2, 0, replicate(dim(single)[2], 8))$.

n is the number of bins in the single cell.

epsilon1 is the range size of $\delta$ that is uniformly distributed. The default value is 0.5.

epsilon2 is the range size of $B$ that is uniformly distributed. The default value is 5.

mc.cores is the number of cores to be used in mclapply function that can parallelly impute the matrix. The default value is 1.

cutoff is the threshold of $\pi_{ij}$ that is used to define structural zeros. The default value is 0.5. That is, if the probability of being a SZ is greater than 0.5, that pair of bins is treated as SZ.

2.2 Numerical summary of MCMCImpute results

MCMCImpute provides a list of posterior mean of SZ probability, imputed data without defining SZ, and imputed data after defining SZ.

Here is a example for the use of MCMCImpute:

data("simudat")
#simudat_res=MCMCImpute(niter=10000, burnin=5000, single=simudat, bulk=NULL, n=61)

The output of MCMCImput is a list of posterior mean of probability, the imputed data without defining SZ, and imputed data with SZ, using the threshold. The posterior mean of probability is a matrix of size $n\times n$, where n is the number of bins. It tells us the probability of being a SZ between the corresponding pairs. For example, the probability of bin1 and bin2 do not interact is 0.25. IMP1 and IMP2 have the same dimension as the observed data.

data("simudat_res")
simudat_res$pii[1:10,1:10]
head(simudat_res$IMP1)
head(simudat_res$IMP2)

2.3 Visualization of the results

hm draws heatmap of HiC data so that we can visually compares the imputation results. For example, the following is the heatmap of observed and imputed single cell of the simudat, where we can see an improvement of sequence depth.

par(mar = c(0.4,0.4,0.4,0.4))
par(mfrow=c(1,2))
hm(simudat[,1], 61)
hm(simudat_res$IMP2[,1], 61)

3. Funcitons for generating scHiC data

3.1 Generate single cell

generate_single is a function designed to simulate scHiC data based on 3D structure of chromosome. It requires 3D coordinates as shown in below. The data str1 is generated from another package called SIMBA.

data("str1")
head(str1)

And the idea of simulation is based on the function $log(\lambda_{ij})=\alpha_0+\alpha_1log(d_{ij})+\beta_llog(x_{g,i}x_{g,j})+\beta_mlog(x_{x_{m,i},x_{m,j}})$, where $\alpha_0, \alpha_1$ are set to control the sequence depth, and $x_{l,i}\sim Unif(0.2,0.3)$, $x_{g,i}\sim Unif(0.4,0.5)$, $x_{m,i}\sim Unif(0.9,1)$ are used to account for covariates.

The following function generates 10 single cells based on str1. The output contains the underline truecount, the position of SZ, and the generated single cells. Truecounts can be used to measure imputation accuracy.

set.seed(1234)
#Generate 100 random type1 single cells
data <- generate_single(data=str1, alpha_0=5.6,alpha_1=-1, beta_l=0.9,beta_g=0.9,beta_m=0.9, alpha=0.2, n_single=10) 

3.2 Accuracy summary

summa summarizes imputation accuracy using the 11 measurements used in the paper.

data("simudat_true")
options(digits = 2)
summa(simudat, simudat_true, simudat_res$IMP2)

summa2 calculates PTDO when PTSZ is fixed to be 0.95. summa3 calculates PTSZ when PTDO is fixed to be 0.80. Their output also contains the standard deviation and the corresponding threshold.

summa2(simudat, simudat_true, simudat_res)
summa3(simudat, simudat_true, simudat_res)

scHiCBayes_ROC draws ROC (Receiver operating characteristic) curve to visually demonstrate ability to tell SZ from DR. The following is an example on the simudat, where we can see the ROC curve goes up to 1 pretty fast.

scHiCBayes_ROC(simudat, simudat_true, simudat_res)


Queen0044/scHiCBayes documentation built on Dec. 18, 2021, 8:43 a.m.