knitr::opts_chunk$set(fig.width=8, fig.height=8) 

Prerequisites

This manual explains how to run the FateID algorithm on sample data. FateID is a method for the quantification of cell fate bias in single cell transcriptome datasets comprising different cell types that emerge from a common progenitor [@FateID]. The progenitor populations are expected to be part of the dataset and the FateID algorithm was designed to learn a pre-existing bias of each progenitor cell to one or multiple alternative terminal fates. The strategy of the algorithm is to apply an iterative random forest classification [@RandomForests] in order to quantify fate bias in increasingly naive progenitors using cells that have been classified in previous iterations as training set.

FateID can be directly installed from CRAN:

install.packages("FateID")

After installation the FateID package has to be loaded:

library(FateID)

In the following sections we describe the application of FateID on sample data. We do not explain all input and return arguments of the FateID functions in detail, since extensive description is available in the man pages for all of the functions.

Input data

The algorithm requires an expression data frame as input with cells as columns and genes as features. Column names are expected to correspond to cell IDs and row names are expected to correspond to gene IDs.

Example data are available from the FateID package. The dataset has to be imported in order to reproduce the analysis in this reference manual:

data(intestine)

This dataset contains transcript counts of mouse intestinal epithelial cells positive for an Lgr5-lineage reporter after 5 days of lineage tracing [@StemID], i. e. these cells are 5 days old progeny of Lgr5-positive intestinal stem cells:

x <- intestine$x
head(x[,1:5])

Moreover, FateID requires a partitioning of cells, which can be generated by any clustering method. For example the RaceID3 algorithm can be used to identify cell clusters, and a partitioning generated by this method is provided as part of the package. The partitioning has to be provided as a vector with integer values and component names corresponding to column names of the expression data frame:

y <- intestine$y
head(y)

Cluster analysis can inform on the presence of mature cell types in the dataset, where cell types of distinct lineages correspond to different clusters (i. e. different numbers of the partition). In the example data, cluster number 6 comprises enterocytes, marked by high expression of the Alpi gene, while cluster 9 represents mature Paneth cells (high expression of Defa24) and cluster 13 mature goblet cells (high expression of Clca3). Other rare cell types are only present in very low numbers and therefore were excluded from the analysis.

As further input for FateID, the endpoints of the differentiation trajectories, i. e. the most mature stages of all distinct cell lineages in the data set have to be defined by a vector of integer numbers representing the corresponding clusters in the partition y.

tar <- c(6,9,13)

If a partitioning into cell types and states from a prior clustering analysis is not available, FateID can derive a partitioning based on marker gene information. For this strategy, a list of marker gene IDs is needed. Each component of this list contains one or more marker genes of a distinct lineage:

FMarker <- list(c("Defa20__chr8","Defa24__chr8"), "Clca3__chr3", "Alpi__chr1")
xf <- getPart(x,FMarker,fthr=NULL,n=5)
head(xf$part)
head(xf$tar)
tar <- xf$tar
y <- xf$part

The getPart function extracts the top n cells expressing the markers of one of the lineages most highly and defines an expression threshold by the average expression across these cells. For this inference the expression level is aggregated across all markers of this lineage. The target cluster of a lineage is given by the ensemble of cells with aggregated marker gene expression higher than this threshold. Alternatively, a vector with threshold expression values for all lineages can also be provided as input argument fthr. The target cluster numbers will reflect the order of the component of the FMarker list starting at 2. Cluster number 1 comprises all remaining cells that do not exhibit marker gene expression beyond the mean expresion of the top n cells expressing a marker or the levels in fthr, respectively.

Once target clusters have been defined, FateID has the option to reclassify all remaining cells using the cells within the target clusters as input.

rc <- reclassify(x, y, tar, clthr=.75, nbfactor=5, use.dist=FALSE, seed=12345, nbtree=NULL, q=0.9)
y  <- rc$part

The function returns a partition with the novel assignments after reclassification and can replace the previous partition based on the original target clusters. The purpose of this step is to identify all cells with a pronounced bias towards one of the fates. This step is optional but recommended to obtain larger training sets for random forests. This and other functions can be executed on expression data prior to or after feature selection. In the sample data, the data frame x contains only genes with variability exceeding a background level of combined technical and biological noise as inferred by RaceID3. Alternatively, the full data frame containing all genes can be used as input.

v <- intestine$v
rc <- reclassify(v, y, tar, clthr=.75, nbfactor=5, use.dist=FALSE, seed=12345, nbtree=NULL, q=0.9)
y  <- rc$part

The reclassify function also performs a feature selection based on importance sampling, i. e. all features with an importance larger than the q-quantile of the importance distribution for a given class are retained. The reduced expression table is returned by the function and can replace the original input expression data frame:

x  <- rc$xf

Utilizing this function is recommended if the input data have not been subject to any other feature selection method.

Feature selection can also be performed utilizing a differential gene expression analysis.

x <- getFeat(v,y,tar,fpv=0.01)

This function compares gene expression within cells of a target cluster to the ensemble of all remaining cells and identifies genes that are significantly up-regulated in a target cluster with a p-value lower than fpv. The function returns a reduced expression data frame, which can be used for the subsequent analysis. In general, the reclassify function is more recommended for feature selection, since it reflects the information used for the random forest classification.

Computing the fate bias

The core function of FateID computes the fate bias for each cell in the dataset excluding cells within the target cluster. These cells are assigned to the lineage representing the respective target cluster with a probability of one and this probability does not change during inference of the fate bias of all other cells:

tar <- c(6,9,13)
x <- intestine$x
y <- intestine$y
fb  <- fateBias(x, y, tar, z=NULL, minnr=5, minnrh=10, adapt=TRUE, confidence=0.75, nbfactor=5, use.dist=FALSE, seed=12345, nbtree=NULL)

Apart from the (feature selected) expression data frame x, the partition y, and the vector of target clusters tar, the fateBias function takes further arguments as input. The optional argument z is a cell-to-cell distance matrix utilized to identify non-classified cells in the immediate neighborhood of all cells that have been classified as one of the target clusters in the previous iteration. By default this distance matrix will be computed as z=1-cor(x), but if other distance measures are preferred, a distance matrix can be provided by this argument.

The FateID algorithm computed by the fateBias function performs an iterative calculation. It starts with a set of cells representing each target cluster. For each target cluster, the minnr neighboring cells with the shortest median distance to all cells in the target cluster are extracted. The ensemble of the neighboring cells of all target clusters represents the test set of the next iteration. The minnr parameter thus controls the step size of the algorithm. In each iteration, minnr cells times the number of target clusters are classified and can contribute to the training set in the next iteration.

The training set of this iteration comprises all cells assigned to one of the target clusters and the response vector is given by the partition of these cells. The classification of the test set is done based on the random forest votes: If a cell receives significantly more random forest votes for one target cluster versus all other clusters (based on sampling statistics with a p-value threshold of 0.05 it is assigned to this target cluster and contributes to the training set for the next iteration. All cells without a significant fate bias towards any one of the target clusters are not incorporated into the training set for the next iteration. However, the fraction of votes, which can be interpreted as a fate probability, is recorded and stored for all cells. Another important parameter controls which cells contribute to the training set for a given iteration. At most minnrh cells from each target cluster contribute to the training set. These cells are selected as the minnrh cells with the shortest distance to any cell within the current training set. This parameter controls the gene expression horizon on the differentiation trajectory taken into account for the classification of the test set. If minnrh is set to Inf then all previously classified cells with a significant fate bias for one of the target clusters contribute to the training set. However, if gene expression changes follow complex dynamics along a differentiation trajectory, it can become detrimental to include very distant cells expressing maturation markers that are not expressed during the earliest stages of differentiation. It is generally advised to confine the expression horizon to smaller values in order to increase the specificity of the algorithm. However, the training set should still be large enough to warrant a confident classification. As a rule of thumb, the minnrh parameter should be set to 20 or larger values, depending on the size and coverage of the dataset. If the input data set is large and a large number of cells are available in the dataset for all lineages covering the entire differentiation trajectory, this parameter can be increased. We recommend testing the robustness of the results to changes in minnrh. The step size minnr should be selected based on similar consideration. However, it is generally recommended to keep this number lower in order to avoid misclassification due to insufficient resolution. For the majority of datasets with several hundreds of cells we use minnr=5 and minnrh=20.

FateID also permits a dynamic test set size, where the minnr parameter is adapted separately for each target cluster based on the classification success in the previous iteration. If adapt=TRUE, the number of successfully classified cells is determined in each iteration, i.e. the number of cells with a minimum fraction of votes given by the confidence parameter for the target cluster, which gave rise to the inclusion of the cell in the test set. Weights are then derived by dividing this number by the maximum across all target clusters after adding a pseudocount of 1. The test set size minnr is rescaled for each cluster by the respective weight in the next iteration.

This leads to decreasing test set sizes, and hence, slower classification, if the previous classification success of the local neighbourhood has been low. Such an adaptive scheme is important if the number cells on different branches exhibits large differences. If classification on a short branch reaches the na??ve compartment, where classification succes can be expected to decrease, the progression on this branch in future iterations slows down, while classification of more mature stages on a highly populated branch still progresses at full speed.

As an alternative approach the FateID algorithm offers classification based on distances to all other cells. When use.dist is set to TRUE, then the distance matrix z (or 1-cor(x)) is interpreted as feature matrix. The remaining arguments are control parameters of the random forests algorithm and usually do not have to be adjusted. As outlined in the man packages of the fateBias function it returns a list of five components. The votes component is a data frame of random forest votes of all cells for each of the target clusters. The column names are given by a concatenation of a t with the number of the target cluster. The row names are given by cell IDs. The probs component has the same structure, but the votes for each cell are normalized to one in order to represent fate probabilities. The tr component is a list of vectors. Each vector contains all cell IDs of cells with a significant fate bias towards the corresponding target cluster. Significant fate bias means significantly more votes for a given cluster than for any other cluster based on sampling statistics with a p-value below 0.05. The fourth component is vector with all cell IDs ordered by random forest iteration in which they have been classified. The last component is a list of random forest objects produced by the function randomForest from the randomForest package object for all iterations.

Visualization of fate bias

Various dimensional reduction methods are commonly used for single cell transcriptome analysis in order to visually inspect the cell population structure. The FateID package computes a number of dimensional reduction representations to enable visualization of the fate bias and pseudo-temporal ordering by principal curve computation.

dr  <- compdr(x, z=NULL, m=c("tsne","cmd","umap"), k=2, tsne.perplexity=30, seed=12345)

The first two input parameters are the same as the ones to the fateBias function. The parameter k indicates the dimensions for which the dimensional reduction representations are computed. Typically, one wants to inspect data visually in two or three dimensions. However, it is possible to also compute dimensional reductions to more than three dimensions and inspecting the data after projecting onto a subset of dimensions. The remaining parameters are main control parameter for the various algorithms used for dimensional reduction. The function performs computation of a t-SNE map [@TSNE] using the Rtsne package, and classical multi-dimensional scaling using cmdscale from the stats package. Umaps are computed with the umap package. To speed up the computation, only a subset of dimensionality reduction representations can be selected as input parameter m.

All results can be plotted by the plotFateMap function.

The dimensional reduction representation with a highlighting of the partition can be plotted for any of the dimensional reduction algorithms in any of the dimensions computed by compdr. For example, a umap can be plotted in two dimensions:

plotFateMap(y,dr,k=2,m="umap")

Plotting in three dimensions opens an interactive RGL device to allow rotation of the plot and zooming in and out:

plotFateMap(y,dr,k=3,m="umap")

The fate bias can now be highlighted in the dimensional reduction representation by providing the name of the target cluster additional argument (concatenation of a t and the cluster number, e.g. t6) and the fateBias output:

plotFateMap(y,dr,k=2,m="umap",fb=fb,g="t6")

To approximate the differentiation trajectories, principal curves can be computed in the dimensionally reduced space. For the computation of a principal curve of a given target cluster, only those cells contribute that have a fraction of random forest votes for this target cluster larger than trthr. If this argument is not given only cells contribute with a significant bias for this target cluster in comparison to all others with a p-value < 0.05 derived from sampling statistics. The principal curves are plotted into the maps and returned by the function if the argument prc is TRUE.

pr <- plotFateMap(y,dr,k=2,m="umap",trthr=.33,fb=fb,prc=TRUE)

If the most naive cluster representing a common upstream progenitor of all cell types is known, the number of this cluster can also be given as input argument start in order to initialize the principal curve computation with principal curves starting in the start cluster and ending in one of the target clusters. To enforce that principal curves traverse through the cell population representing the respective target cluster, the weight of cells within the initial target clusters is increased 10-fold for the principal curve computation.

If principal curves are computed, the return object of plotFateMap is a list of two objects. The first argument pr is itself a list of principal curve objects for each target cluster as computed by the principal.curve function from the princurve package. The second argument trc is a list of vectors with cell IDs for the principal curve of each target cluster containing the IDs of all cells with a significant fate bias or a fate bias greater than trthr, respectively. The cell IDs are ordered by their position along the principal curve as obtained from the point of their shortest distance to the curve.

The expression of a gene, or the aggregated expression across a group of genes, can also be highlighted in the dimensional reduction representation by providing a vector with the respective gene IDs as additional parameter g, an optional title n for the plot, and an expression data data frame x (we are using the full one instead of the reduced one):

v <- intestine$v
pr <-plotFateMap(y, dr, k=2, m="umap", g=c("Defa20__chr8", "Defa24__chr8"), n="Defa", x=v)

Finally, if g="E" and a list fb returned by the fateBias function is provided as input, the function will plot the entropy of the fate bias based on the probabilities for the different target clusters. An entropy level of the fate bias will indicate the level of multipotency with larger fate bias entropies corresponding to more multipotent cell states. If executed with g="E", the function will return a vector with the fate bias entropy of each cell:

E <- plotFateMap(y,dr,k=2,m="umap",g="E",fb=fb)
head(E)

The principal curve can also be computed directly without calling the plotFateMap function:

pr  <- prcurve(y,fb,dr,k=2,m="umap",trthr=0.33,start=3)

This function has the same input arguments as plotFateMap and is invoked by the plotFateMap function.

Inspecting pseudo-temporal gene expression changes

FateID also provides functions for the visualization and analysis of pseudo-temporal gene expression changes. For this purpose, cells with a fate bias towards a target cluster can be extracted. The principal curve analysis returns all cells along a differentiation trajectory in pseudo-temporal order. For example, cells with a fate bias towards cluster 6 in pseudo-temporal order can be extracted by the following command:

n <- pr$trc[["t6"]]

Alternatively, other published methods for pseudo-temporal ordering, e.g., Monocle2 [@Census] or diffusion pseudo-time [@DPT], can be used after selecting cells by cell fate probability.

In the second step, the input expression data are defined:

v <- intestine$v
fs  <- filterset(v,n=n,minexpr=2,minnumber=1)

Here, we are using the full expression data frame v without feature selection. The second argument is the vector of cell ids corresponding to valid row names of the expression table in pseudo-temporal order.

The filterset function can be used to eliminate lowly expressed genes on the trajectory from the subsequent analysis and has two additional arguments to discard genes, which are not expressed at a level of minexpr or higher in at least minnumber of cells. The function returns a filtered expression data frame with genes as rows and cells as columns in the same order as in the input vector n. In the third step, a self-organizing map (SOM) of the pseudo-temporal expression profiles is computed:

s1d <- getsom(fs,nb=50,alpha=.5)

This map provides a grouping of similar expression profiles into modules. The first input argument is again an expression data frame. In this case, we use the filtered expression table generated by the filterset function to retain only genes that are expressed on the trajectory under consideration. Pseudo-temporal expression profiles along the differentiation trajectory of interest are computed after smoothening by local regression with smoothing parameter alpha.

This function returns a list of the following three components: a som object returned by the function som of the package som, a data frame x with smoothened and normalized expression profiles, and a data frame zs of z-score transformed pseudo-temporal expression profiles.

The SOM is then processed by another function to group the nodes of the SOM into larger modules and to produce additional z-score transformed and binned expression data frames for display:

ps  <- procsom(s1d,corthr=.85,minsom=3)

The first input argument is given by the SOM computed by the function getsom. The function has two additional input parameters to control the grouping of the SOM nodes into larger modules. The parameter corthr defines a correlation threshold. If the correlation of the z-scores of the average normalized pseudo-temporal expression profiles of neighboring nodes in the SOM exceeds this threshold genes of the neighboring nodes are merged into a larger module. Only modules with at least minsom genes are kept. The function returns a list of various data frames with normalized, z-score-transformed, or binned expression along with the assignment of genes to modules of the SOM (see help page for details).

The output of the processed SOM can be plotted using the plotheatmap function. First, in order to highlight the clustering partition y, the same color scheme as used in plotFateMap can be generated (alternatively, both functions can be executed with an arbitrary user-defined scheme):

fcol <- sample(rainbow(max(y)))

Now, the different output data frames of the procsom function can be plotted.

Average z-score for all modules derived from the SOM:

plotheatmap(ps$nodes.z, xpart=y[n], xcol=fcol, ypart=unique(ps$nodes), xgrid=FALSE, ygrid=TRUE, xlab=FALSE)

The z-score profile of each gene ordered in SOM modules:

plotheatmap(ps$all.z, xpart=y[n], xcol=fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=FALSE)

Normalized expression profile of each gene ordered by SOM modules:

plotheatmap(ps$all.e, xpart=y[n], xcol=fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=FALSE)

Binarized expression profile of each gene (z-score < -1, -1 < z-score < 1, z-score > 1):

plotheatmap(ps$all.b, xpart=y[n], xcol=fcol, ypart=ps$nodes, xgrid=FALSE, ygrid=TRUE, xlab=FALSE)

In order to inspect genes within individual modules of the SOM, these genes can be extracted given the number of the module. The module numbers are contained in the return argument nodes of the procsom function and can be extracted, e. g. for module number 1:

g <- names(ps$nodes)[ps$nodes == 1]

The average pseudo-temporal expression profile of this group can be plotted by the function plotexpression:

plotexpression(fs, y, g, n, col=fcol, name="Node 1", cluster=FALSE, alpha=.5, types=NULL)

In the same way it is possible to plot expression profiles of individual genes, e.g.:

plotexpression(fs, y, "Clca4__chr3", n, col=fcol, cluster=FALSE, alpha=.5, types=NULL)

It is also possible to highlight the data points as specific symbols, for example reflecting batches, by using the types argument:

plotexpression(fs, y, g, n, col=fcol, name="Node 1", cluster=FALSE, types=sub("\\_\\d+","",n))

See help pages for detailed explanation of the input parameters.

Finally, individual loess-smoothed profiles for a set of genes can be plotted in the same graph (after normalizing the sum of expression across all cells to one in order to display genes on the same scale):

group <- head(g,6)
plotexpressionProfile(fs, y, group, n, name="Node 1", cluster=FALSE)

See help pages for detailed explanation of the input parameters.

Differential gene expression analysis

Together with FateID a function for differential gene expression analysis is provided. This function implements an approach similar to DESeq [@DESeq] by which negative binomial distributions are fitted to transcript counts in two populations, representing, for instance, two different cell types. By default, the function should be executed on normalized data and estimates the dispersion parameter from the polynomial fit of the log-transformed transcript count variance as a function of the log-transformed mean expression. This is done on the input data, either after pooling cells from the two populations or on each of the populations separately. Alternatively, one can also provide a function for the variance-mean dependence as an input parameter, e. g. the dependence computed in the RaceID2 [@StemID] outlier identification. As another option, the function can also execute a standard DESeq2 [@DESeq2] differential expression analysis.

To run the analysis, an expression data frame is needed as input together with two vectors of cell IDs corresponding to subsets of column names of this data frame.

For example, this function could be used to identify genes differentially expressed between cells in the original clusters 3,4,5 (the progenitor compartment) that are either biased towards enterocytes (cluster 6) or goblet cells (cluster 13) with a fate-probability of >0.5:

thr <- .5
a   <- "t13"
b   <- "t6"
cl  <- c(3,4,5)
A <- rownames(fb$probs)[fb$probs[,a] > thr]
A <- A[y[A] %in% cl]
B <- rownames(fb$probs)[fb$probs[,b] > thr]
B <- B[y[B] %in% cl]
de <- diffexpnb(v,A=A,B=B,DESeq=FALSE,norm=FALSE,vfit=NULL,locreg=FALSE)

Here, we use the full expression data frame without feature selection as input. Normalization is not performed, since v is already normalized by size. With the chosen input parameters, diffexpnb will infer the dispersion parameter of the negative binomial transcript level distributions for the two populations based on a variance-mean dependence derived as in RaceID2 [@StemID]. Alternatively, a local regression can be utilized, if locreg equals TRUE. Another option is to run a DESeq2 [@DESeq2] analysis, if DESeq is TRUE. The function returns a list of three components. The results of the differential gene expression analysis are stored in the res component, which is a data frame displaying mean expression of the two sets, fold change and log2 fold-change between the two sets, the p-value for differential expression (pval) and the Benjamini-Hochberg corrected false discovery rate (padj).

The results can be plotted by the following function:

plotdiffgenesnb(de,mthr=-4,lthr=0,Aname=a,Bname=b,padj=FALSE)

Inspecting fate bias of gene expression

To enable further inspection of fate-specific marker gene expression, FateID provides a function to plot the expression levels of two genes against each other in a scatter plot. These genes could, for instance, represent markers of two different lineages, and the function reveals whether the expression domains of these genes fall into distinct clusters or whether they correlate with the fate bias towards a particular lineage. To visualize a correlation of gene expression domain and cluster origin of the cells, the gene2gene function can be called with a gene expression data frame and a clustering partition as input:

gene2gene(intestine$v,intestine$y,"Muc2__chr7","Apoa1__chr9")

If the fate bias towards a particular target cluster, e. g. cluster 6, should be highlighted, the output fb of the fateBias function and the column name of fb$probs corresponding to the target cluster, i. e. t6, are required as additional input parameters:

gene2gene(intestine$v, intestine$y, "Muc2__chr7", "Apoa1__chr9", fb=fb, tn="t6", plotnum=FALSE)

See help pages for additional input arguments.

To identify genes involved in early priming and fate decision at all stages of differentiation, it is informative to extract all the genes with a high importance value (see details of the randomForest function in the randomForest package) for the classification into a given target cluster in at least one random forest iteration:

k <- impGenes(fb,"t6")

This function requires the output object of the fateBias function and the number of a target clusters, concatenated with a t as input. Additional optional arguments specify cutoffs for the importance values (see help pages). It returns a list of two objects, one data frame with mean importance values for all genes that passed the thresholds in at least of the iterations as rows and the iterations as columns, and a corresponding data frame with the standard deviation of the importance values.

The function also plots a heatmap of the importance values with hierarchical clustering of the genes.

For bug reports and any questions related to FateID please email directly to link.

References


references:



dgrun/FateID documentation built on June 20, 2022, 12:57 p.m.