Introduction

The c3co package implements a constrained dictionary learning problem to recover cancer subclones from several DNA copy number profiles described in @pierre-jean:c3co. The c3co model may be seen as an extension of the Fused Lasso Latent Feature Model (FLLat) method of @nowak2011fused, and of the Enhanced FLLat (E-FLLat) method of @masecchia2013dictionary, with the following original features:

Notes:

Model

The figure below illustrates the model used in the c3co package. Two heterogeneous tumor samples (green and yellow circles) are composed of a collection of normal cells (gray discs) and two cancer subclones (red triangles and blue squares). One of the cancer subclones is present in both tumor samples.

The corresponding (noiseless) copy number profiles are displayed in the figure below. They are given by a linear combination of the latent profiles. This Figure is adapted from @nowak2011fused.

Using the package

library("c3co")
library("ggplot2")
library("mclust")
library("reshape")
set.seed(147)

Creating a synthetic data set

We start by defining the characteristics of the subclone profiles: profile length, number of subclones, breakpoint positions, and copy number states:

len <- 500*10    ## Number of loci
K <- 3L          ## Number of subclones
n <- 15L         ## Number of samples

bkps <- list(c(100,250)*10, c(150,400)*10, c(150,400)*10)
regions <- list(c("(1,2)", "(0,2)", "(1,2)"),
                c("(1,1)", "(0,1)", "(1,1)"),
                c("(0,2)", "(0,1)", "(1,1)"))

Then, we load an annotated data set from the acnr package. The buildSubclones() function can then be used to generate the (pure) subclone profiles with the above characteristics, by resampling from the annotated data set.

dataAnnotTP <- acnr::loadCnRegionData(dataSet = "GSE13372_HCC1143", tumorFraction = 1)
dataAnnotN <- acnr::loadCnRegionData(dataSet = "GSE13372_HCC1143", tumorFraction = 0)
datSubClone <- buildSubclones(len = len, nbClones = K, bkps = bkps,
       regions = regions, dataAnnotTP = dataAnnotTP, dataAnnotN = dataAnnotN)

Note that the same can be done using another data set from the acnr package, see

acnr::listDataSets()

Caveat: we actually require both tumor and normal data to be available, so only the "GSE13372_HCC1143" (Affymetrix) and "GSE11976_CRL2324" (Illumina) data sets may be used.

cols <- c("#AAAAAA33", "#00000099", "#AAAAAA33")
cex <- 0.3
pch <- 19 
clim <- c(0, 4)
blim <- c(-0.1, 1.1)
plot(datSubClone[[1]]$ct, col = cols[factor(datSubClone[[1]]$genotype)],
     cex = cex, pch = pch, ylab = "TCN", ylim = clim,
     main = "PSCN profile of one simulated subclone")
plot(datSubClone[[1]]$baft, col = cols[factor(datSubClone[[1]]$genotype)],
     cex = cex, pch = pch, ylab = "BAF", ylim = blim)

Once the subclones are created, we can generate a weight matrix $W$ in order to build mixtures.

W <- rSparseWeightMatrix(nb.samp = n, nb.arch = K, sparse.coeff = 0.7)
datList <- mixSubclones(datSubClone, W = W)
str(datList[[1]])

Note that datList is a list of data frames with the following required columns : c1, c2, tcn, dh, and genotype.

Inference of the c3co model parameters

Then the c3co method can be applied to the mixture data set. Let us choose the same grid for $\lambda_1$ and $\lambda_2$ and a grid from 2 to 6 for the number of subclones.

l1 <- seq(from = 1e-8, to = 1e-5, length.out = 10L)
nb.arch <- 2:6
parameters.grid <- list(lambda1 = l1, nb.arch = nb.arch) 
res <- c3co(datList, parameters.grid, verbose = FALSE, warn = FALSE)

l2 <- seq(from = 1e-6, to = 1e-3, length.out = 5L)
parameters.grid <- list(lambda = l2, nb.arch = nb.arch) 
resC <- c3co(datList, parameters.grid, stat = "TCN", verbose = FALSE)

For each candidate number of subclones $p$ in r nb.arch, c3co() only retains the combination of the penalization coefficients $(\lambda_1, \lambda_2)$ which minimizes the Bayesian Information Criterion (BIC) of the model. The next step is to choose the best $p$ (number of subclones). Following @nowak2011fused, we compare the models for different values of $p$ through their percentage of variance explained (PVE), and select the last $p$ before the final plateau of the PVE. In this example, it seems that the best is $\hat{p}=4$ (which is the true number of subclones).

pvePlot(res, ylim = c(0, 1))
pvePlot(resC, ylim = c(0, 1))

We can compare the true and the estimated weight matrices. We can easily recover a classification close to the truth with the inferred weight matrix.

best <- 2
res.clustTRUE <- hclust(dist(cbind(W, 1-Matrix::rowSums(W))), method = "ward.D")
col = grDevices::colorRampPalette(RColorBrewer::brewer.pal(9, 'GnBu'))(100)
Wplot(res, idxBest = best, cexCol = 0.4)
Wplot(resC, idxBest = best, cexCol = 0.4)
heatmap.3(cbind(W, 1-Matrix::rowSums(W)), dendrogram = "row", main = "TRUE",
          Rowv = as.dendrogram(res.clustTRUE), col = col, scale = "none")

If we look at the subclones in the dimension of parental copy numbers, we can recover the simulated alterations.

df <- createZdf(res, chromosomes = 1, idxBest = best)
Zplot(df)

Further analysis

Performance of clustering

In this section, we show the ability of the method to recover the simulated clustering of patients.

WC1C2 <- res@fit[[best]]@W
WTCN <- resC@fit[[best]]@W
Wtrue <- cbind(W, 1-Matrix::rowSums(W))
adjustedRandIndex(cutree(hclust(dist(WC1C2), method="ward.D2"), k = 4),
                  cutree(hclust(dist(Wtrue), method="ward.D2"), k = 4))
adjustedRandIndex(cutree(hclust(dist(WTCN), method="ward.D2"), k = 4),
                  cutree(hclust(dist(Wtrue), method="ward.D2"), k = 4))

Performance of alteration detection

In this section, we show the ability of the method to recover the simulated altered regions.

getAlteredSegments <- function(tol, zz1, zz2, c1Mean, c2Mean){
  loss <- ((c1Mean - zz1) >= tol) | ((c2Mean - zz2) >= tol)
  gain <- ((zz2 - c2Mean) >= tol) | ((zz1 - c1Mean) >= tol)
  ## coding gains as 3
  gain <- apply(gain, MARGIN = 2L, FUN = function(gg) {
    idx <- which(gg)
    gg[idx] <- 3
    gg
  })
  ## coding loh segments as 4 (gain + loss)
  loss + gain
}

dataBest <- res@fit[[best]]
meanMin <- mean(df$CopyNumber[which(df$arch == "d" & df$stat == "Minor")])
getAlt <- getAlteredSegments(tol = 0.2, dataBest@Zt$Z1, dataBest@Zt$Z2, 1, 1.1)
colnames(getAlt) <- sprintf("Subclone %s", letters[1:ncol(dataBest@W)])
df.Alt <- reshape::melt(getAlt)
df.Alt$x.min <- rep(res@bkp[[1]][-length(res@bkp[[1]])], times = best+1)
df.Alt$x.max <- rep(res@bkp[[1]][-1], times = best+1)
ggplot(df.Alt, aes(xmin = x.min, xmax = x.max, ymin = 1, ymax = 2, fill = factor(value))) + geom_rect() + facet_grid(X2 ~ .) + theme(axis.text.y = element_blank(), axis.ticks = element_blank()) + scale_fill_discrete(labels = c("Normal", "Loss", "Gain", "LoH"), name = "States")

Session information

sessionInfo()

References



pneuvial/c3co documentation built on May 25, 2019, 10:21 a.m.