PoisMS package contains implementation of Principal Curve-based methods for chromatin reconstruction. The input data is a contact map $C\in\mathbf{Z}^{n\times n}$, a symmetric integer-valued matrix of contact counts between $n$ binned genomic loci for a given chromosome. Here, we treat bins as equi-sized and equi-spaced and accordingly index as $1,...,n$, this being the typical scenario; however, generalizations to alternate indexing (e.g. genomic coordinates) is straightforward. Then the goal is to use the contact matrix $C$ to obtain spatial coordinates $x_1, \ldots, x_n\in \mathbb{R}^3$ of genomic loci $1, \ldots, n,$ respectively.
\bigskip
In our experiments we use Hi-C data for IMR90 cells for chromosome 20 and resolution 100kb obtained from the Gene Expression Omnibus (check the package for different Hi-C data available).
library(PoisMS) data(IMR90_100kb_chr20) C = data.matrix(IMR90_100kb_chr20) n = ncol(C)
Usually, contact matrices are diagonally dominant so, for the sake of visual convenience, we plot the heatmap for $\log(C)$.
library(fields) C_log = log(C) C_log[C == 0] = NA heatmap = function(C, shift = 1, scale = 1, title = NULL){ #change the color scale a = min(scale * C, na.rm = TRUE) b = max(scale * C, na.rm = TRUE) br = scale * (exp(seq(log(shift), log(b - a + shift), (log(b - a + shift) - log(shift))/100)) + a - shift) #plot heatmap par(mar = c(0, 2, 1, 2)) image.plot(C, xaxt = 'n', yaxt = 'n', nlevel = 100, breaks = sort(br), main = title) } heatmap(C_log)
\newpage
Note that there are some unobserved loci near the centromere; we save the genomic coordinates of $599$ observed loci and drop unobserved ones from the contact matrix.
index = which(colSums(C) != 0) n_obs = length(index) C = C[index, index] C_log = log(C) C_log[C == 0] = NA par(mar = c(0, 2, 1, 2)) heatmap(C_log)
\bigskip
Principal Curve-based approaches model chromatin by a smooth curve, assuming $$x_1,\ldots,x_n\in \gamma\text{, where }\gamma\text{ is a smooth one-dimensional curve in $\mathbb{R}^3$}.$$ We model each coordinate of $\gamma$ by a cubic spline with $k$ degrees-of-freedom. Thus the previous constraint can be written in matrix form as $X=H\Theta.$ Here:
$X\in\mathbf{R}^{n\times 3}$ is the matrix of genomic loci coordinates,
$H\in\mathbf{R}^{n\times k}$ is the matrix of spline basis evaluations at genomic coordinates, i.e. $H_{i\ell} = h_\ell(i)$ where $h_1(t),\ldots,h_k(t)$ are cubic spline basis functions,
$\Theta$ is the matrix of unknown parameters.
In the presence of unobserved genomic loci one can use the following function to construct matrix $H$.
library(splines) load_H = function(df, index){ n_knots = df - 2 knots = unique(seq(from = 1, to = max(index), length = n_knots)) knots = knots[-c(1,n_knots)] H = bs(index, knots = knots, intercept = TRUE) return(H) }
Example of spline basis with $25$ degrees-of-freedom evaluated at $599$ points corresponding to the observed loci.
H = load_H(25, index) par(mar = c(2, 5, 0, 2)) matplot(H, type = 'l', lwd = 2)
\bigskip
Principal Curve Metric Scaling (PCMS) solves (classical) MDS problem equipped with the smooth curve constraint: $$\text{minimize } \|Z-S(X)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ subject to } X = H\Theta.$$ Here $S(X) = X X^T$ is the inner products matrix and $Z$ is some symmetric matrix, e.g. the contact matrix after some proper transformation applied. An example of such a transformation could be the combination of power law, that converts contacts to distances, combined with double centering, that turns distances to inner products.
D = 1/(C+1) Z = -D^2/2 Z = scale(Z, scale = FALSE, center = TRUE) Z = t(scale(t(Z), scale = FALSE, center = TRUE)) par(mar = c(0, 2, 1, 2)) image.plot(Z, xaxt='n', yaxt = 'n', nlevel = 100)
The PCMS approach assumes $H$ to be an orthogonal matrix. Orthogonalize $H$ from previous part by means of QR decomposition and run PCMS(Z, H)
function to find the PCMS reconstruction.
H = qr.Q(qr(H)) solution_PCMS = PCMS(Z, H) cat('The optimal loss value is', solution_PCMS$loss)
\bigskip
Plot the contact matrix approximation $\hat Z = S(X) = XX^T$.
Z_hat = solution_PCMS$X %*% t(solution_PCMS$X) par(mar = c(0, 2, 1, 2)) image.plot(Z_hat, xaxt='n', yaxt = 'n', nlevel = 100)
Plot projections of 3D reconstruction using visualize(..., type = 'projection')
.
visualize(solution_PCMS$X, index, type = 'projection')
One can also use \texttt{visualize(..., type = '3D')} to create an interactive 3D plot of the reconstruction.
visualize(solution_PCMS$X, index, type = '3D')
\bigskip
Weighted Principal Curve Metric Scaling (WPCMS) is an upgrade of the PCMS approach that solves the following optimization problem: $$\text{minimize } \|W(C-D^2(X)+\beta)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ and } \beta\in\mathbb{R} \text{ subject to } X = H\Theta.$$ Here $$ refers to the Hadamard product, $W\in[0,1]^{n\times n}$ is some symmetric matrix of weights, $\beta$ is an intercept and $D(X)$ corresponds to the matrix of pairwise distances.
To make the result to be compatible with Poisson Metric Scaling described below we apply the WPCMS technique to the log-transformed contact matrix $\log(C)$. Let's create the binary matrix of weights $W$ with elements $W_{ij} = 0$ if $C_{ij} = 0$ and $W_{ij} = 1$ otherwise. These weights will allow us to ignore $-\infty$ in matrix $\log(C)$. Note that if $W$ is a binary matrix WPCMS treats the elements with zero weights as missing.
W = matrix(1, n_obs, n_obs) W[C == 0] = 0 C_log[C == 0] = 0 par(mar = c(0, 2, 1, 2)) image.plot(W, xaxt='n', yaxt = 'n')
One can run WPCMS(Z, H, W, ...)
to find the WPCMS solution for the log-transformed contact matrix $-\log(C)$.
solution_WPCMS = WPCMS(-C_log, H, W = W, eps = 1e-8, maxiter = 1000)
Summary for the WPCMS method.
cat(' It takes', solution_WPCMS$iter, 'iterations to converge.\n', 'The optimal intercept value is', solution_WPCMS$beta, 'and the achieved loss value is', solution_WPCMS$loss)
The convergence plots for both loss and $\beta$ are presented below.
solution_WPCMS$plot$objective solution_WPCMS$plot$intercept
Next we vizualize the WPCMS solution.
D = as.matrix(dist(solution_WPCMS$X, diag = TRUE, upper = TRUE)) logL = -D^2 + solution_WPCMS$beta heatmap(logL, 0.1, -1)
visualize(solution_WPCMS$X, index, type = 'projection')
visualize(solution_WPCMS$X, index, type = '3D')
\bigskip
Poisson Metric Scaling (PoisMS) models contact counts by a Poisson distribution: $$C_{ij}\sim Pois(\lambda_{ij}),~~\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta.$$ Here $\beta\in\mathbb{R}$ is the unknown intercept. The optimization problem is, therefore, to find minimum of the negative log-likelihood under the smooth curve constraint: $$\text{minimize } \sum_{1 \leq i,j \leq n} \left[e^{- \| x_i - x_j\|^2 + \beta} - C_{ij}\left(- \| x_i - x_j\|^2 + \beta\right) \right] \text{ w.r.t. } X \text{ subject to } X = H\Theta.$$ Since the log-transformation is implicitly introduced in the model through $\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta,$ we apply the PoisMS technique to the original data $C$.
Run PoisMS(C, H)
function to calculate the PoisMS solution.
solution_PoisMS = PoisMS(C, H, eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)
Summary for the PoisMS method.
cat(' It takes', solution_PoisMS$epoch, 'iterations to converge.\n', 'The optimal intercept value is', solution_PoisMS$beta, 'and the achieved loss value is', solution_PoisMS$loss)
The convergence plots for both loss and $\beta$ are presented below.
solution_PoisMS$plot$objective solution_PoisMS$plot$intercept
Plot the corresponding contact matrix approximation.
D = as.matrix(dist(solution_PoisMS$X, diag = TRUE, upper = TRUE)) logL = -D^2 + solution_PoisMS$beta heatmap(logL, 0.1, -1)
Plot the projection of the resulting 3D reconstruction.
visualize(solution_PoisMS$X, index, type = 'projection')
Create the 3D interactive plot.
visualize(solution_PoisMS$X, index, type = '3D')
\bigskip
The main hyperparameter for the PoisMS approach is the spline degrees-of-freedom $\operatorname{df}$ controlling the reconstruction smoothness. Calculate the PoisMS solutions for different degrees-of-freedom values.
dfs = c(10, 15, 25, 50, 100, 200) Xs = c() betas = c() for(df in dfs){ H = load_H(df, index) H = qr.Q(qr(H)) poisms = PoisMS(C, H, eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000) Xs = rbind(Xs, data.frame(poisms$X, 'DF' = df)) betas = rbind(betas, data.frame('beta' = poisms$beta, 'DF' = df)) }
Plot the heatmaps.
for(df in dfs){ X = as.matrix(subset(Xs, DF == df)[,1:3]) beta = subset(betas, DF == df)$beta D = as.matrix(dist(X, diag = TRUE, upper = TRUE)) logL = -D^2 + beta heatmap(logL, 0.1, -1, paste('df =', df)) }
Plot the projections.
for(df in dfs){ X = as.matrix(subset(Xs, DF == df)[,1:3]) visualize(X, index, type = 'projection', title = paste('df =', df)) }
Low $\operatorname{df}$ leads to the reconstructions capturing only general structure, high $\operatorname{df}$ reconstructions lose smoothness property.
Let's have a closer look at the 3D reconstructions obtained via PoisMS for $\operatorname{df} = 200$.
visualize(as.matrix(subset(Xs, DF == 200)[,1:3]), index, type = '3D')
\bigskip
Vary degrees-of-freedom value and calculate the loss value for each PCMS reconstruction.
dfs = seq(5, 200, 5) loss = c() for(df in dfs){ H = load_H(df, index) H = qr.Q(qr(H)) solution_PCMS = PCMS(Z, H) loss = append(loss, solution_PCMS$loss) }
The loss decreases with the growth of $\operatorname{df}$.
library(ggplot2) data = data.frame(dfs, loss) plt = ggplot(data, aes(dfs, loss))+ geom_point(color = 'blue', size = 2)+ geom_line(color = 'blue')+ scale_x_continuous(breaks = seq(0, 200, 10))+ ylab('PCMS loss')+ xlab('df') plt
Use segmented regression to determine the kink location (thus, optimal value of $\operatorname{df}$).
library(segmented) LR <- lm(loss ~ dfs, data = data) SR <- segmented(LR, seg.Z = ~ dfs, npsi = 1, tol = 1e-6, it.max = 1000) kink = SR$psi[2] prediction = predict(object = SR, newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs)))) newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs)), 'loss' = prediction) plt+geom_line(newdata, mapping = aes(dfs, loss), colour = 'black', size = 0.8)+ geom_vline(xintercept = kink, color = 'red', size = 0.4, linetype = 'dashed')+ ggtitle(paste('Optimal df value = ', round(kink)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.