In ElenaTuzhilina/PoisMS: Principal Curve based chromatin reconstruction

Introduction

PoisMS package contains implementation of Principal Curve-based methods for chromatin reconstruction. The input data is a contact map $C\in\mathbf{Z}^{n\times n}$, a symmetric integer-valued matrix of contact counts between $n$ binned genomic loci for a given chromosome. Here, we treat bins as equi-sized and equi-spaced and accordingly index as $1,...,n$, this being the typical scenario; however, generalizations to alternate indexing (e.g. genomic coordinates) is straightforward. Then the goal is to use the contact matrix $C$ to obtain spatial coordinates $x_1, \ldots, x_n\in \mathbb{R}^3$ of genomic loci $1, \ldots, n,$ respectively.

\bigskip

Contact matrix

In our experiments we use Hi-C data for IMR90 cells for chromosome 20 and resolution 100kb obtained from the Gene Expression Omnibus (check the package for different Hi-C data available).

library(PoisMS)
data(IMR90_100kb_chr20)
C = data.matrix(IMR90_100kb_chr20)
n = ncol(C)

Usually, contact matrices are diagonally dominant so, for the sake of visual convenience, we plot the heatmap for $\log(C)$.

library(fields)
C_log = log(C)
C_log[C == 0] = NA

heatmap = function(C, shift = 1, scale = 1, title = NULL){
  #change the color scale
  a = min(scale * C, na.rm = TRUE)
  b = max(scale * C, na.rm = TRUE)
  br = scale * (exp(seq(log(shift), log(b - a + shift), (log(b - a + shift) - log(shift))/100)) + a - shift)
  #plot heatmap
  par(mar = c(0, 2, 1, 2))
  image.plot(C, xaxt = 'n', yaxt = 'n', nlevel = 100, breaks = sort(br), main = title)
}

heatmap(C_log)

\newpage

Note that there are some unobserved loci near the centromere; we save the genomic coordinates of $599$ observed loci and drop unobserved ones from the contact matrix.

index = which(colSums(C) != 0)
n_obs = length(index)
C = C[index, index]
C_log = log(C)
C_log[C == 0] = NA
par(mar = c(0, 2, 1, 2))
heatmap(C_log)

\bigskip

Principal Curve

Principal Curve-based approaches model chromatin by a smooth curve, assuming $$x_1,\ldots,x_n\in \gamma\text{, where }\gamma\text{ is a smooth one-dimensional curve in $\mathbb{R}^3$}.$$ We model each coordinate of $\gamma$ by a cubic spline with $k$ degrees-of-freedom. Thus the previous constraint can be written in matrix form as $X=H\Theta.$ Here:

$X\in\mathbf{R}^{n\times 3}$ is the matrix of genomic loci coordinates,
$H\in\mathbf{R}^{n\times k}$ is the matrix of spline basis evaluations at genomic coordinates, i.e. $H_{i\ell} = h_\ell(i)$ where $h_1(t),\ldots,h_k(t)$ are cubic spline basis functions,
$\Theta$ is the matrix of unknown parameters.

In the presence of unobserved genomic loci one can use the following function to construct matrix $H$.

library(splines)
load_H = function(df, index){
  n_knots = df - 2
  knots = unique(seq(from = 1, to = max(index), length = n_knots))
  knots = knots[-c(1,n_knots)]
  H = bs(index, knots = knots, intercept = TRUE)
  return(H)
}

Example of spline basis with $25$ degrees-of-freedom evaluated at $599$ points corresponding to the observed loci.

H = load_H(25, index)
par(mar = c(2, 5, 0, 2))
matplot(H, type = 'l', lwd = 2)

\bigskip

PCMS

Principal Curve Metric Scaling (PCMS) solves (classical) MDS problem equipped with the smooth curve constraint: $$\text{minimize } \|Z-S(X)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ subject to } X = H\Theta.$$ Here $S(X) = X X^T$ is the inner products matrix and $Z$ is some symmetric matrix, e.g. the contact matrix after some proper transformation applied. An example of such a transformation could be the combination of power law, that converts contacts to distances, combined with double centering, that turns distances to inner products.

D = 1/(C+1)
Z = -D^2/2
Z = scale(Z, scale = FALSE, center = TRUE)
Z = t(scale(t(Z), scale = FALSE, center = TRUE))
par(mar = c(0, 2, 1, 2))
image.plot(Z, xaxt='n', yaxt = 'n', nlevel = 100)

The PCMS approach assumes $H$ to be an orthogonal matrix. Orthogonalize $H$ from previous part by means of QR decomposition and run PCMS(Z, H) function to find the PCMS reconstruction.

H = qr.Q(qr(H))
solution_PCMS = PCMS(Z, H)
cat('The optimal loss value is', solution_PCMS$loss)

\bigskip

Visualization

Plot the contact matrix approximation $\hat Z = S(X) = XX^T$.

Z_hat =  solution_PCMS$X %*% t(solution_PCMS$X)
par(mar = c(0, 2, 1, 2))
image.plot(Z_hat, xaxt='n', yaxt = 'n', nlevel = 100)

Plot projections of 3D reconstruction using visualize(..., type = 'projection').

visualize(solution_PCMS$X, index, type = 'projection')

One can also use \texttt{visualize(..., type = '3D')} to create an interactive 3D plot of the reconstruction.

visualize(solution_PCMS$X, index, type = '3D')

\bigskip

WPCMS

Weighted Principal Curve Metric Scaling (WPCMS) is an upgrade of the PCMS approach that solves the following optimization problem: $$\text{minimize } \|W(C-D^2(X)+\beta)\|^2_F \text{ w.r.t. } \Theta\in\mathbb{R}^{k\times 3} \text{ and } \beta\in\mathbb{R} \text{ subject to } X = H\Theta.$$ Here $$ refers to the Hadamard product, $W\in[0,1]^{n\times n}$ is some symmetric matrix of weights, $\beta$ is an intercept and $D(X)$ corresponds to the matrix of pairwise distances.

To make the result to be compatible with Poisson Metric Scaling described below we apply the WPCMS technique to the log-transformed contact matrix $\log(C)$. Let's create the binary matrix of weights $W$ with elements $W_{ij} = 0$ if $C_{ij} = 0$ and $W_{ij} = 1$ otherwise. These weights will allow us to ignore $-\infty$ in matrix $\log(C)$. Note that if $W$ is a binary matrix WPCMS treats the elements with zero weights as missing.

W = matrix(1, n_obs, n_obs)
W[C == 0] = 0
C_log[C == 0] = 0
par(mar = c(0, 2, 1, 2))
image.plot(W, xaxt='n', yaxt = 'n')

One can run WPCMS(Z, H, W, ...) to find the WPCMS solution for the log-transformed contact matrix $-\log(C)$.

solution_WPCMS = WPCMS(-C_log, H, W = W, eps = 1e-8, maxiter = 1000)

Summary for the WPCMS method.

cat(' It takes', solution_WPCMS$iter, 'iterations to converge.\n', 'The optimal intercept value is', solution_WPCMS$beta, 'and the achieved loss value is', solution_WPCMS$loss)

The convergence plots for both loss and $\beta$ are presented below.

solution_WPCMS$plot$objective
solution_WPCMS$plot$intercept

Next we vizualize the WPCMS solution.

D = as.matrix(dist(solution_WPCMS$X, diag = TRUE, upper = TRUE))
logL = -D^2 + solution_WPCMS$beta
heatmap(logL, 0.1, -1)

visualize(solution_WPCMS$X, index, type = 'projection')

visualize(solution_WPCMS$X, index, type = '3D')

\bigskip

PoisMS

Poisson Metric Scaling (PoisMS) models contact counts by a Poisson distribution: $$C_{ij}\sim Pois(\lambda_{ij}),~~\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta.$$ Here $\beta\in\mathbb{R}$ is the unknown intercept. The optimization problem is, therefore, to find minimum of the negative log-likelihood under the smooth curve constraint: $$\text{minimize } \sum_{1 \leq i,j \leq n} \left[e^{- \| x_i - x_j\|^2 + \beta} - C_{ij}\left(- \| x_i - x_j\|^2 + \beta\right) \right] \text{ w.r.t. } X \text{ subject to } X = H\Theta.$$ Since the log-transformation is implicitly introduced in the model through $\log(\lambda_{ij}) = - \| x_i - x_j\|^2 + \beta,$ we apply the PoisMS technique to the original data $C$.

Run PoisMS(C, H) function to calculate the PoisMS solution.

solution_PoisMS = PoisMS(C, H,  eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)

Summary for the PoisMS method.

cat(' It takes', solution_PoisMS$epoch, 'iterations to converge.\n', 'The optimal intercept value is', solution_PoisMS$beta, 'and the achieved loss value is', solution_PoisMS$loss)

The convergence plots for both loss and $\beta$ are presented below.

solution_PoisMS$plot$objective
solution_PoisMS$plot$intercept

Plot the corresponding contact matrix approximation.

D = as.matrix(dist(solution_PoisMS$X, diag = TRUE, upper = TRUE))
logL = -D^2 + solution_PoisMS$beta
heatmap(logL, 0.1, -1)

Plot the projection of the resulting 3D reconstruction.

visualize(solution_PoisMS$X, index, type = 'projection')

Create the 3D interactive plot.

visualize(solution_PoisMS$X, index, type = '3D')

\bigskip

Vary degrees-of-freedom

The main hyperparameter for the PoisMS approach is the spline degrees-of-freedom $\operatorname{df}$ controlling the reconstruction smoothness. Calculate the PoisMS solutions for different degrees-of-freedom values.

dfs = c(10, 15, 25, 50, 100, 200)
Xs = c()
betas = c()
for(df in dfs){
    H = load_H(df, index)
    H = qr.Q(qr(H))
    poisms = PoisMS(C, H,  eps_wpcms = 1e-7, maxiter = 100, eps_poisms = 1e-7, maxepoch = 1000)
    Xs = rbind(Xs, data.frame(poisms$X, 'DF' = df))
    betas = rbind(betas, data.frame('beta' = poisms$beta, 'DF' = df))
}

Plot the heatmaps.

for(df in dfs){
  X = as.matrix(subset(Xs, DF == df)[,1:3])
  beta = subset(betas, DF == df)$beta
  D = as.matrix(dist(X, diag = TRUE, upper = TRUE))
  logL = -D^2 + beta
  heatmap(logL, 0.1, -1, paste('df =', df))
}

Plot the projections.

for(df in dfs){
  X = as.matrix(subset(Xs, DF == df)[,1:3])
  visualize(X, index, type = 'projection', title = paste('df =', df))
}

Low $\operatorname{df}$ leads to the reconstructions capturing only general structure, high $\operatorname{df}$ reconstructions lose smoothness property.

Let's have a closer look at the 3D reconstructions obtained via PoisMS for $\operatorname{df} = 200$.

visualize(as.matrix(subset(Xs, DF == 200)[,1:3]), index, type = '3D')

\bigskip

Pick optimal degrees-of-freedom value (PCMS)

Vary degrees-of-freedom value and calculate the loss value for each PCMS reconstruction.

dfs = seq(5, 200, 5)
loss = c()
for(df in dfs){
  H = load_H(df, index)
  H = qr.Q(qr(H))
  solution_PCMS = PCMS(Z, H)
  loss = append(loss, solution_PCMS$loss)
}

The loss decreases with the growth of $\operatorname{df}$.

library(ggplot2)
data = data.frame(dfs, loss)
plt = ggplot(data, aes(dfs, loss))+
  geom_point(color = 'blue', size = 2)+
  geom_line(color = 'blue')+
  scale_x_continuous(breaks =  seq(0, 200, 10))+
  ylab('PCMS loss')+
  xlab('df')
plt

Use segmented regression to determine the kink location (thus, optimal value of $\operatorname{df}$).

library(segmented)
LR <- lm(loss ~ dfs, data = data)
SR <- segmented(LR, seg.Z = ~ dfs, npsi = 1, tol = 1e-6, it.max = 1000)
kink = SR$psi[2]
prediction = predict(object = SR, newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs))))
newdata = data.frame('dfs' = c(min(dfs), kink, max(dfs)), 'loss' = prediction)
plt+geom_line(newdata, mapping = aes(dfs, loss), colour = 'black', size = 0.8)+
  geom_vline(xintercept = kink, color = 'red', size = 0.4, linetype = 'dashed')+
  ggtitle(paste('Optimal df value = ', round(kink)))