knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)

What is DTD?

The gene expression profile of a tissue averages the expression profiles of all cells in this tissue. Digital tissue deconvolution (DTD) addresses the following inverse problem: given the expression profile $y$ of a tissue, what is the cellular composition $c$ of that tissue? If $X$ is a reference matrix whose $q$ columns hold reference profiles of $q$ different cell types, the cellular composition $c$ can be estimated by \begin{equation} \mathrm{arg\,min}_c ~ \| y - Xc \|_2^2 ~~~~~~~~~~ (1) \end{equation} @Goertler2018 generalized this formula by introducing a gene weight vector $g$ \begin{equation} \mathrm{arg\,min}_c ~ \| \mathrm{diag}(g) \, (y - Xc) \|_2^2 ~~~~~~~~~~ (2) \end{equation} Every entry $g_i$ of $g$ calibrates how much gene $i$ contributes to the deconvolution process. The parameter vector $g$ can be estimated by a loss-function learning approach. This idea and the mathematical background of the loss-function learning DTD approach are described in @Goertler2018.

We use the following notation. The reference matrix is denoted by $X$; every column $X_{.,k}$ is a cellular reference profile. The expression matrix of bulk profiles is denoted by $Y$; every column $Y_{.,k}$ is the expression profile of a mixture of cells (a tissue). We estimate the matrix $C$, where $C_{i,k}$ is the contribution of cell type $i$ to tissue $k$.
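
As an illustration (plain R, not part of the DTD package; the function name estimate_c_weighted is ours), equation (2) has the closed-form solution $c = (X^\top G X)^{-1} X^\top G y$ with $G = \mathrm{diag}(g)^2$:

# Illustration only: closed-form solution of equation (2)
# for a single bulk profile y, reference matrix X and gene weights g.
estimate_c_weighted <- function(X, y, g) {
  G <- diag(g^2)  # G = diag(g)^2
  solve(t(X) %*% G %*% X, t(X) %*% G %*% y)
}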

Let \begin{equation} L = - \sum_{j} cor(C_{j,.}, \widehat{C_{j,.}} (g)) \end{equation}

The package DTD learns $g$ by minimizing $L$ on a training set of artificial bulk profiles $Y$ simulated by mixing known reference profiles with known fractions $C$. Here $C_{j,.}$ are the true compositions whereas $\widehat{C_{j,.}}(g)$ are the estimated compositions according to equation (2).
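
As a further illustration (again plain R, not a package function; loss_L is our own name), the loss can be computed from the true and estimated composition matrices, with one cell type per row:

loss_L <- function(C, C.hat) {
  # sum of negative correlations, one term per cell type (row)
  -sum(sapply(seq_len(nrow(C)), function(j) cor(C[j, ], C.hat[j, ])))
}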

The package DTD provides functions to set up the learning process. To run DTD you need reference profiles of different cell types, obtained either from single cell data or from sorted cell fractions. After some data preparation, a full learning process can be invoked by the following code.

# To train the g vector we first set up simulated data, a reference matrix X,
# and training/test mixtures:
library(DTD)
number.types <- 25
n.features <- 100
n.per.type <- 100
n.per.mixtures <- 100
maxit <- 25
n.samples <- n.features
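
# simulate single-cell-like expression profiles (number.types types, n.features features)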
random.data <- generate_random_data(
  n.types = number.types,
  n.samples.per.type = n.per.type,
  n.features = n.features
)
normalized.data <- normalize_to_count(random.data)
indicator.vector <- gsub("^Cell[0-9]*\\.", "", colnames(normalized.data))
names(indicator.vector) <- colnames(normalized.data)
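
# build the reference matrix X from the cell types in include.in.X,
# using perc.of.all.cells of their profiles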
perc.of.all.cells <- 0.1
include.in.X <- paste0("Type", 1:5)
sample.X <- sample_random_X(
  included.in.X = include.in.X,
  pheno = indicator.vector,
  expr.data = normalized.data,
  percentage.of.all.cells = perc.of.all.cells
)
X.matrix <- sample.X$X.matrix
samples.to.remove <- sample.X$samples.to.remove
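
# remove the profiles used for X and split the rest 50:50 into training and test sets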
remaining.mat <- normalized.data[, -which(colnames(normalized.data) %in% samples.to.remove)]
train.samples <- sample(
  x = colnames(remaining.mat),
  size = ceiling(ncol(remaining.mat) / 2),
  replace = FALSE
)
test.samples <- colnames(remaining.mat)[which(!colnames(remaining.mat) %in% train.samples)]

train.mat <- remaining.mat[, train.samples]
test.mat <- remaining.mat[, test.samples]
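
# simulate artificial bulk mixtures (and their true compositions) for training and testing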
indicator.train <- indicator.vector[names(indicator.vector) %in% colnames(train.mat)]
training.data <- mix_samples(
  expr.data = train.mat,
  pheno = indicator.train,
  included.in.X = include.in.X,
  n.samples = n.samples,
  n.per.mixture = n.per.mixtures,
  verbose = FALSE
)
indicator.test <- indicator.vector[names(indicator.vector) %in% colnames(test.mat)]
test.data <- mix_samples(
  expr.data = test.mat,
  pheno = indicator.test,
  included.in.X = include.in.X,
  n.samples = n.samples,
  n.per.mixture = n.per.mixtures,
  verbose = FALSE
)

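# start from uniform gene weights and train the deconvolution model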
start.tweak <- rep(1, nrow(X.matrix))
names(start.tweak) <- rownames(X.matrix)

model <- train_deconvolution_model(
  tweak = start.tweak,
  X.matrix = X.matrix,
  train.data.list = training.data,
  test.data.list = test.data,
  estimate.c.type = "direct",
  maxit = maxit,
  verbose = FALSE
  )

The crucial function of the DTD package is train_deconvolution_model, which learns the gene weight vector $g$ by minimizing the loss $L$ on a training set. After some preliminary work it can be called via:

model <- train_deconvolution_model(
  tweak = start.tweak,
  X.matrix = X.matrix,
  train.data.list = training.data,
  test.data.list = test.data,
  estimate.c.type = "direct",
  verbose = FALSE, 
  maxit = maxit
  )
print(model$pics$convergence)

Here, we trained the $g$-vector for maxit = r maxit iterations. Both training and test errors were reduced by optimizing the gene weights $g$.

How to use DTD?

In the following examples, we demonstrate what the data has to look like in order to be used for running DTD. After that, we give examples of how to process the data, how to generate a training and a test set, how to train the $g$ vector, and finally, how to visualize the results. First of all, load the library:

library(DTD)

In practice, the DTD package needs expression measurements of either single cells of known cell type or of sorted cell fractions to train a $g$-vector. In this vignette we will demonstrate its use on simulated data. To this end we generate a matrix of random data reflecting single cell measurements. The DTD package contains the generate_random_data function to simulate such data. It has 3 arguments: n.types (the number of cell types), n.samples.per.type (the number of profiles simulated per type), and n.features (the number of features, e.g. genes).

Here we generate data for r number.types cell types, with r n.features features. For each cell type we generate r n.per.type single cell profiles.

cat("```\n")
cat(
  " number.types <- ", number.types, "\n",
  "n.features <- ", n.features, "\n",
  "n.per.type <- ", n.per.type, "\n"
)
cat("```\n")
random.data <- generate_random_data(
  n.types = number.types,
  n.samples.per.type = n.per.type,
  n.features = n.features
)

The object random.data is a numeric matrix with r nrow(random.data) rows (features), and r ncol(random.data) columns (cells).

print(random.data[1:5, 1:5])
# for further exemplary visualization
example.index <- c(1, n.per.type + 1, 2 * n.per.type + 1, 3 * n.per.type + 1)

The colnames of random.data indicate the type of the cell:

print(colnames(random.data)[example.index])

Data processing

Notice that the loss-function learning DTD approach works on the original, non-logarithmic scale. Therefore, any log transformation on real data needs to be undone. For the simulated data there was no such transformation. Next, we scale every sample of the data set to a fixed total count of r sum(normalized.data[,1]) using the normalize_to_count function:

# In random.data the number of counts differs over all samples:
apply(random.data, 2, sum)[example.index]
normalized.data <- normalize_to_count(random.data)
# In normalized.data all samples share the same number of counts:
apply(normalized.data, 2, sum)[example.index]
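
If your real data were log transformed, reverse the transformation before normalizing. A minimal sketch, assuming a log2(x + 1) transform (log.data is a hypothetical matrix of log-transformed expression values):

# linear.data <- 2^log.data - 1
# normalized.data <- normalize_to_count(linear.data)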

After that, we construct an indicator vector that maps single cell profiles (columns of normalized.data) to cell types:

indicator.vector <- gsub("^Cell[0-9]*\\.", "", colnames(normalized.data))
names(indicator.vector) <- colnames(normalized.data)
print(indicator.vector[example.index])

indicator.vector needs to be a named vector. Each entry assigns the cell type (value) to the corresponding sample (name) in normalized.data. It is used in the training process of the algorithm.

Reference matrix X

Next, we build the reference matrix $X$. $X$ holds data from some, but not all, input profiles. The profiles used to build $X$ can either be selected manually or chosen randomly. Within the DTD package we provide the sample_random_X function, which generates a reference matrix by randomly selecting profiles of the same type and averaging them into a reference profile for this type. The parameter included.in.X specifies which cell types are included in $X$, while percentage.of.all.cells specifies which fraction of cells from each type is used for building $X$.

Here we build $X$ using only the first r length(include.in.X) cell types and r round(perc.of.all.cells, digits = 2) of the corresponding single cell profiles.

cat("```\n")
cat(" include.in.X <- ", paste0("c(\"", paste(include.in.X, collapse = "\", \""), "\")"), "\n")
cat(" perc.of.all.cells <- ", perc.of.all.cells, "\n")
cat("```\n")
sample.X <- sample_random_X(
  included.in.X = include.in.X,
  pheno = indicator.vector,
  expr.data = normalized.data,
  percentage.of.all.cells = perc.of.all.cells
)
X.matrix <- sample.X$X.matrix
samples.to.remove <- sample.X$samples.to.remove

Single cell profiles that were used to build $X$ must not be used in the following training and testing procedures. The sample_random_X function keeps track of all samples it used and collects them in the entry samples.to.remove. Using this vector, we remove these profiles from the data and split the remaining profiles into a training and a test set, each holding 50% of the remaining profiles.

# removing samples that have been used in X.matrix
remaining.mat <- normalized.data[, -which(colnames(normalized.data) %in% samples.to.remove)]

# sampling training samples (note that the train/test separation is 50:50)
train.samples <- sample(
  x = colnames(remaining.mat),
  size = ceiling(ncol(remaining.mat) / 2),
  replace = FALSE
)
# selecting test samples:
test.samples <- colnames(remaining.mat)[which(!colnames(remaining.mat) %in% train.samples)]

# extract data matrices for training and testing:
train.mat <- remaining.mat[, train.samples]
test.mat <- remaining.mat[, test.samples]

Generating artificial tissue profiles

In addition to the reference matrix $X$, DTD requires training data consisting of two matrices: a matrix $Y$ of artificial bulk expression profiles and a matrix $C$ of the corresponding cellular compositions.

A column of $Y$ is a simulated bulk tissue profile, and the corresponding column of $C$ annotates the cellular composition of this tissue. From the single cell profiles in the training set (those not used to build $X$) we randomly sample profiles and average them to generate artificial tissue profiles. The DTD package provides two methods to generate training and test sets. One method randomly samples cells and averages them as described above; this is the recommended method for large single cell data sets. However, if there are only few reference samples per cell type (on average fewer than 10 samples per type), we recommend mixing the profiles with random quantities and modifying each mixture with a multiplicative jitter. For both methods we provide functions within the DTD package.

mix_samples randomly selects n.per.mixture profiles from expr.data and averages them into bulk profiles. Additionally, it reports which profiles have been used and which cell types they were drawn from, resulting in a tissue matrix $Y$ and a corresponding composition matrix $C$. The mix_samples function needs the following arguments:

cat("```\n")
cat(
  " n.samples <- ", n.samples, "\n",
  "n.per.mixtures <- ", n.per.mixtures, "\n"
)
cat("```\n")
indicator.train <- indicator.vector[names(indicator.vector) %in% colnames(train.mat)]
training.data <- mix_samples(
  expr.data = train.mat,
  pheno = indicator.train,
  included.in.X = include.in.X,
  n.samples = n.samples,
  n.per.mixture = n.per.mixtures,
  verbose = FALSE
)
str(training.data)

Note that all cell types are included in the artificial tissue profiles, but only the fractions of the cell types within $X$ are reported in the composition matrix $C$. Therefore, the columns of $C$ do not necessarily sum up to 1.
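
This can be checked directly, assuming the composition matrix is returned in an element named quantities (see the str output above; adjust the name if it differs in your package version):

# colSums(training.data$quantities)  # typically below 1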

Using the same function we generate a test set, but this time we select single cell profiles from the test set. Notice that the single cell profiles used for training and testing are disjoint and so are the profiles used for building $X$.

indicator.test <- indicator.vector[names(indicator.vector) %in% colnames(test.mat)]
test.data <- mix_samples(
  expr.data = test.mat,
  pheno = indicator.test,
  included.in.X = include.in.X,
  n.samples = n.samples,
  n.per.mixture = n.per.mixtures,
  verbose = FALSE
)

The mix_samples_with_jitter function additionally needs a prob.each parameter. In the mix_samples function the average quantity of each cell type reflects its occurrence in expr.data. In the mix_samples_with_jitter function, the expected average quantity of each cell type is set via prob.each. It must be a numeric vector of the same length as included.in.X.

training.data.jitter <- mix_samples_with_jitter(
  expr.data = train.mat,
  prob.each = rep(1, length(include.in.X)), # default value:
  # every type occurs with the same frequency on average
  n.samples = n.samples,
  pheno = indicator.train,
  verbose = FALSE,
  add.jitter = TRUE,
  included.in.X = include.in.X
)

Learning the gene weight vector g

In the optimization procedure we search for a vector $g$ that minimizes $L$, where \begin{equation} L = - \sum_{j} cor(C_{j,.}, \widehat{C_{j,.}} (g)) \end{equation} The optimization is carried out iteratively by gradient descent, using an implementation of FISTA (fast iterative shrinkage-thresholding algorithm; Beck and Teboulle 2009). The training procedure can be called via train_deconvolution_model, with the arguments shown in the call below:

start.tweak <- rep(1, nrow(X.matrix))
names(start.tweak) <- rownames(X.matrix)
model <- train_deconvolution_model(
  tweak = start.tweak,
  X.matrix = X.matrix,
  train.data.list = training.data,
  test.data.list = test.data, 
  estimate.c.type = "direct",
  maxit = maxit,
  cv.verbose = TRUE,
  verbose = FALSE
)

The output of the train_deconvolution_model function is a list. The entry best.model contains a model trained on the full training data set. Its entry Tweak holds the tweak.vec after the last iteration, i.e. the $g$ vector that minimizes the loss function, and a loss vector stores the loss function evaluated in every iteration. Depending on the save.all.tweaks argument, there is a third entry named History, a matrix holding the $g$ vector of every iteration. Additionally, the model object contains the entries reference.X and pics: reference.X holds the reference matrix X used for training, and pics holds several ggplot figures to assess the quality of the model.
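
For example, these entries can be inspected directly (entry names as described above; check str(model) for the exact structure of your package version):

# str(model$best.model$Tweak)   # the learned g vector
# str(model$reference.X)        # the reference matrix used for training
# names(model$pics)             # the available diagnostic figures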

Visualizations

Each of the following visualizations is generated automatically after training a model with the train_deconvolution_model function. They are stored as a list of ggplot objects in the pics entry of the model.

Visualization of the learning curve

The function ggplot_convergence plots the loss $L$ against the iterations of the gradient descent algorithm for both training and test data. It can be used to monitor the learning progress and potential overfitting.

# print via model$pics list:
print(model$pics$convergence)

# or via function call:
# print(
#   ggplot_convergence(
#     DTD.model = model,
#     X.matrix = X.matrix,
#     test.data = test.data,
#     estimate.c.type = "direct",
#     title =  "DTD Vignette"
#   )
# )

Performance cell type per cell type

The function ggplot_true_vs_esti plots true versus estimated cell fractions of the test data for all cell types. The first entry holds a plot including all cell types; in the subsequent entries, each cell type is plotted in a separate picture.

# print via model$pics list:
print(model$pics$true_vs_esti[[1]])

# or via function call:
# print(
#   ggplot_true_vs_esti(
#     DTD.model = model,
#     X.matrix = X.matrix,
#     test.data = test.data,
#     estimate.c.type = "direct"
#   )
# )

Histogram of g-vector

ggplot_ghistogram plots the optimal gene weight vector $g$.

# print via model$pics list:
print(model$pics$histogram)
# or via function call:
# print(
#   ggplot_ghistogram(
#     DTD.model = model
#   )
# )

g-Path

Additional useful information can be gained from the History of the tweak.vec; it can be visualized using the ggplot_gpath function. In the plot below, each line tracks the change of one $g_i$ over all iterations. In this showcase we included r nrow(X.matrix) features in the model, so the length of our $g$ vector is r nrow(X.matrix), and accordingly there are r nrow(X.matrix) lines in the plot. Notice that most of the change in $g$ takes place in the early steps of the algorithm; therefore we included an ITER.TRANSFORM.FUN (defaults to log10(g + 1)) to transform the iteration axis.

# print via model$pics list:
print(model$pics$path)
# or via function call:

# singlePic <- ggplot_gpath(
#   DTD.model = model,
#   number.pics = 1,
#   title =  "All genes",
#   G.TRANSFORM.FUN = log2,
#   y.lab = "log2(g)"
# )
# print(singlePic$gPath)

In the ggplot_gpath function, each $g_i$ is visualized with one line. To visualize only a subset of $g$, selected features can be passed via the subset argument:

singlePic.subset <- ggplot_gpath(
  DTD.model = model,
  number.pics = 1,
  title =  "Gene Subset",
  G.TRANSFORM.FUN = log2,
  y.lab = "log2(g)",
  subset = c("gene22", "gene37")
)
print(singlePic.subset$gPath)

Alternatively, you can specify into how many pictures the plot should be split. For example, if you split the plot into 2 pictures, the function groups all $g_i$ below the median of all $g$ into the first picture and all $g_i$ above the median into the second picture:

multPic <- ggplot_gpath(
  DTD.model = model,
  number.pics = 2,
  title =  "All Genes, split into 2",
  G.TRANSFORM.FUN = log2,
  y.lab = "log2(g)"
)
print(multPic$gPath)

By default, the ggplot_gpath function does not plot a legend. Setting the boolean argument plot.legend to TRUE stores a TableGrob object holding the legend in the legend entry of the returned list.
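
A hedged sketch of this usage, mirroring the calls above (output not shown here):

# with.legend <- ggplot_gpath(
#   DTD.model = model,
#   number.pics = 1,
#   plot.legend = TRUE
# )
# print(with.legend$legend)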

Reference matrix as heatmap

The effect of the $g$-vector on the deconvolution can be visualized by plotting the reference matrix X, scaled by g, as a heatmap. The function ggplot_heatmap takes as input a DTD.model (e.g. the output of train_deconvolution_model). Rows and columns are clustered hierarchically.
The heatmap becomes more informative if only a subset of features is included. We implemented two subsetting methods: subsetting by rownames, and subsetting by explained correlation (which requires a test set).

Note that subsetting by rownames does not require a test set, whereas subsetting by explained correlation does. When calling train_deconvolution_model, a heatmap including all features is plotted and included in the pics list.

# print via model$pics list:
print(model$pics$Xheatmap)
# or via function call:
# print(
#   ggplot_heatmap(
#     DTD.model = model,
#     test.data = test.data,
#     estimate.c.type = "direct",
#     feature.subset = 0.1
#   )
# )

Session Info

sessionInfo()

Literature


