SCRIP proposed two frameworks based on Gamma-Poisson and Beta-Poisson distribution for simulating scRNA-seq data. Both Gamma-Poisson and Beta-Poisson distribution model the over dispersion of scRNA-seq data. Specifically, Beta-Poisson model was used to model bursting effect. The dispersion was accurately simulated by fitting the mean-BCV dependency using generalized additive model (GAM). Other key characteristics of scRNA-seq data including library size, zero inflation and outliers were also modeled by SCRIP. With its flexible modeling, SCRIP enables various application for different experimental designs and goals including DE analysis, clustering analysis, trajectory-based analysis and bursting analysis

install.packages("TruncatedDistributions", repos="http://R-Forge.R-project.org") BiocManager::install("splatter") library(devtools) install_github("FeiQin92/SCRIP")
Assuming you already have a count matrix for scRNA-seq data, and you want to simulation data based on it. Only a few steps are needed to creat a simulation data using SCRIP.
A dataset from Xin data is used for example.
library(SCRIP) library(Biobase) library(splatter) EMTAB.eset <- EMTABesethealthy expre_data = exprs(EMTAB.eset) pheno_data = pData(EMTAB.eset) dim(expre_data) expre_data[1:6,1:6] head(pheno_data) # To simplify computation here, only acinar celltype in sample 5 was utilized here. sub_expre_data=expre_data[,which((pheno_data$sampleID %in% c(5)) & (pheno_data[,4] == "acinar"))] params <- splatEstimate(sub_expre_data) sim_trend <- SCRIPsimu(data=sub_expre_data, params=params, mode="GP-trendedBCV") sim_trend
\newpage
SCRIP utlized the estimation strategy from splatter, but also provided more parameters (Fold change, dropout rates, library size, BCV degree of freedom) to serve different experimental designs (i.e. Simulation for differential expression analysis, clustering analysis and trajectory analysis). Detailed description about other parameters will be shown in other sections of this document.
params <- splatEstimate(sub_expre_data) params
The default mode in SCRIP for simulation is "GP-trendedBCV". You can also choose other modes ("GP-commonBCV", "BP-commonBCV", "BP-trendedBCV", "BP-burstBCV" and "BP") in the SCRIPsimu() function. Specifically, for the "BP" mode, you have to provide the "bursting" information (burst frequency and burst size), which was estimated from directly from real data by performing a profile-likelihood method built up on Beta-Poisson method proposed by Larsson et al.(not shown here). For single cell type simulation, you have to set the "method" as "single", which was default in SCRIPsimu() function.
GP-commonBCV is the model used by splatter. GP-commonBCV applied the Gamma-Poisson mixture model with mean-BCV dependency fitted by a common BCV across genes.
########################### GP-commonBCV model/Splatter ########################## ################################################################################## sim_GPcommon <- SCRIPsimu(data=sub_expre_data, params=params, mode="GP-commonBCV") sim_GPcommon
GP-trendedBCV is the major model of SCIRP. which used the Gamma-Poisson mixture model with mean-BCV dependency fitted by GAM.
############################### GP-trendedBCV model ############################## ################################################################################## sim_GPtrend <- SCRIPsimu(data=sub_expre_data, params=params, mode="GP-trendedBCV")
BP-commonBCV is the model used for simulating bursting effect with Beta-Poisson mixture distribution. The mean-BCV dependency was fitted by a common BCV across genes.
############################### BP-commonBCV model ############################## ################################################################################## sim_BPcommon <- SCRIPsimu(data=sub_expre_data, params=params, mode="BP-commonBCV")
BP-trendedBCV is the model used for simulating bursting effect with Beta-Poisson mixture distribution. The mean-BCV dependency was fitted by a GAM.
############################### BP-trendedBCV model ############################## ################################################################################## sim_BPtrend <- SCRIPsimu(data=sub_expre_data, params=params, mode="BP-trendedBCV")
BP-burstBCV is also the model used for simulating bursting effect with Beta-Poisson mixture distribution. The mean-BCV dependency was fitted by a GAM. In "BP-trendedBCV" from 4.2.4, we assumed that bursting effect was nested in BCV which has the dependency with mean. However for "BP-burstBCV" here, we assumed that mean-BCV dependency and bursting effect were independent. Under this assumption, we simulated underlying mean count from beta distribution and included BCV information in the Beta-Negative Binomial framework using the GAM function described above.
############################### BP-trendedBCV model ############################## ################################################################################## sim_BPtrend_burst <- SCRIPsimu(data=sub_expre_data, params=params, mode="BP-burstBCV")
\newpage
Group simulation is useful for studying different experimental conditions, especially for differential expression (DE) analysis. To serve different applications in scRNA-seq analysis, SCRIP provides flexible simulation. It can simulate scRNA-seq data with different parameters from multiple cell groups (i.e. cell types), which is useful for evaluating the detection of global characteristics such as clustering. It also allows simulation of group difference in a single cell group, which is useful for evaluating typical DE analysis methods.
DEGs were simulated using multiplicative differential expression factors from a log-normal distribution with parameters including number of genes (nGenes), the path-specific proportion of DE genes (de.prob), the proportion of down-regulated DE genes (de.downProb), DE location factor (de.facLoc) and DE scale factor (de.facScale).
counts <- as.matrix(Tung) counts <- counts[1:10000,1:200] params <- splatEstimate(counts)
sim.SCRIP2 <- SCRIPsimu(data=counts, params=params, method="groups", batchCells=300, group.prob = c(0.25, 0.25, 0.25, 0.25), de.prob = c(0.2, 0.2, 0.2, 0.2), de.downProb = c(0.5, 0.5, 0.5, 0.5), de.facLoc = c(0.2, 0.3, 0.4, 0.5), de.facScale=c(0.2, 0.2, 0.2, 0.2)) library(scater) sim2 <- logNormCounts(sim.SCRIP2) sim2 <- runPCA(sim2) plotPCA(sim2, colour_by = "Group") + ggtitle("Big DE factors")
Batch effect factors are also generated from a log-normal distribution with parameters including batchCells, batch.facLoc and batch.facScale.
batchCells: number of cells for each batch \
batch.facLoc: Batch location factor in log-normal distribution for batch factor \
batch.facScale: Batch scale factor in log-normal distribution for batch factor \
sim.SCRIP3 <- SCRIPsimu(data=counts, params=params, method="groups", batchCells=c(150, 150), batch.facLoc = c(0.1, 0.1), batch.facScale = c(0.1, 0.1), group.prob = c(0.25, 0.25, 0.25, 0.25), de.prob = c(0.2, 0.2, 0.2, 0.2), de.downProb = c(0.5, 0.5, 0.5, 0.5), de.facLoc = c(0.2, 0.3, 0.4, 0.5), de.facScale=c(0.2, 0.2, 0.2, 0.2)) sim3 <- logNormCounts(sim.SCRIP3) sim3 <- runPCA(sim3) plotPCA(sim3, colour_by = "Group", shape_by="Batch") + ggtitle("Big DE factors")
To simulate multi-cell type scRNA-seq data, we estimated hyperparameters from real data and simulate data for each cell type separately. SCRIP preserved cluster discriminative genes, which is not capable in available methods. We identified variably expressed genes (VEGs) from each cell type in real data. For these VEGs, true cell mean was set to their expression means in real data to simulate cell-type discriminative genes. Whereafter, aforementioned simulation steps (simulating true cell mean; simulating true counts) will be followed to complete the simulation.
### Xin #### EMTAB.eset <- EMTABesethealthy library(Biobase) library(edgeR) library(stats) expre_data = exprs(EMTAB.eset) pheno_data=pData(EMTAB.eset) expre_data <- as.matrix(expre_data[1:2000,]) expre_data[1:6,1:6] head(pheno_data) table(pheno_data$cellType) CTlist <- c("acinar","alpha","beta") params <- splatEstimate(expre_data)
library(Seurat) final.list <- simu_cluster(expre_data = expre_data, pheno_data = pheno_data, CT=CTlist, mode="GP-trendedBCV", nfeatures=100) final <- final.list$final CT.infor <- final.list$CT.infor library(ggplot2) Group.sim <- function(data){ library(dplyr) library(Seurat) library(cowplot) pbmc <- CreateSeuratObject(counts = data, project = "MuSiC", min.cells = 3, min.features = 200) pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000) pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 100) all.genes <- rownames(pbmc) pbmc <- ScaleData(pbmc, features = all.genes) pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc)) pbmc <- RunUMAP(pbmc, dims = 1:10) pbmc <- RunTSNE(pbmc) return(pbmc) } pbmc <- Group.sim(data=final) pbmc@meta.data$Celltype <- CT.infor umap <- DimPlot(pbmc, reduction = "umap",group.by = "Celltype",pt.size=1)+ labs(title="GP-trendedBCV clustering simulation")+ theme(plot.title = element_text(face="bold",size=12, hjust=0.5), legend.text=element_text(size=12), axis.text=element_text(size=10,face="bold"), axis.title.x = element_text(size=10,face="bold"), axis.title.y = element_text(size=10,face="bold"), panel.background = element_rect(fill='white',colour="black")) umap
To simulate single-cell type scRNA-seq data, SCRIP enabled to simulate with different DE rates, fold change, dropout rate, library size, BCV.df and BCV.shrink. You change these parameters in simu_DE() function, which will finally export the simulation results for group1 and group2, as well as the true DEGs.
## Here I will use Tung dataset (given in package) to give examples about how to change ## parameter for evaluating DE analysis methods. counts <- as.matrix(Tung) library(Biobase) library(gtools) library(lcmix) ## To simplify computation, only 1000 genes and 100 cells from Tung dataset will be use ## for estimation and simulation expre_data_sub <- counts[1:1000,1:100] params1 <- splatEstimate(expre_data_sub)
1) Fold change: FC enables us to control the difference between two groups. Larger fold change value means larger pre-defined difference between two groups; it is expected that we can find better performance for each DE analysis tool.
################### Change the parameter of fold change (FC) #################### ################################################################################# resFC <- simu_DE(expre_data = expre_data_sub, params = params1, nDE=50, FC=1.4) ## simu_DE() function will export the simulation results for group1 and group2, ## as well as the true DEGs. resFC
2) Dropout rate: dropout rate controls the zero proportion in scRNA-seq data, since zero inflation is one of the main characteristics in scRNA-seq data.
########### ##### Change the parameter of dropout rates (Dropout_rate) ########## ################################################################################# resDrop <- simu_DE(expre_data = expre_data_sub, params = params1, nDE=50, FC=1.4, Dropout_rate=0.2)
3) Library size: library size can be used to control the expression level for each gene, as well as the zero proportions in the scRNA-seq data.
################ Change the parameter of library size (libsize) ################# ################################################################################# resLib <- simu_DE(expre_data = expre_data_sub, params = params1, nDE=50, FC=1.4, libsize=5000)
4) BCV.df: BCV.df enables us to change the variation of BCV values. Smaller variation is given to BCV values when BCV.df gets larger.
########## Change the parameter of BCV degree of freedom (pre.bcv.df) ########### ################################################################################# resDf <- simu_DE(expre_data = expre_data_sub, params = params1, nDE=50, FC=1.4, pre.bcv.df=5)
5) BCV.shrink: This parameter is introduced to amplify (> 1) or shrink (< 1) BCV values, which mean it can control the BCV levels.
######### Change the parameter of BCV shrinkage factor (bcv.shrink) ############# ################################################################################# resShrink <- simu_DE(expre_data = expre_data_sub, params = params1, nDE=50, FC=1.4, bcv.shrink = 1.5)
Here I take fold change as an example to introduce how to use the simulation data to evaluate DE analysis methods (i.e. edgeR, DESeq2, limma voom and MAST):
sim=resFC[[1]] simDE=resFC[[2]] genenameDE=resFC[[3]] exps <- counts(sim) expsDE <- counts(simDE) counts <- cbind(exps,expsDE) colnames(counts) <- paste0("cell",1:ncol(counts)) rownames(counts) <- paste0("gene",1:nrow(counts)) truth <- rownames(counts) truth[which(rownames(counts) %in% genenameDE)] <- "DE" truth[-which(rownames(counts) %in% genenameDE)] <- "common" truth <- as.factor(truth)
As an application of SCRIP, we simulated scRNA-seq data using SCRIP and compared four DE analysis tools including edgeR, DESeq2, Limma-voom and MAST. To better account for zero inflation of scRNA-seq data in DE analysis, a weighting strategy of ZINB-WaVE was also integrated with edgeR, DESeq2 and Limma-voom.
ZINB WAVE: ZINB-WaVE is used to computes gene- and cell-specific weights, based on zero-inflated negative binomial model, to “unlock” bulk RNA-seq tools for single-cell applications.
############################################################################ ################################# ZINB WAVE ################################ library(zinbwave) library(DESeq2) group <- factor(c(rep(1,ncol(exps)),rep(2,ncol(expsDE)))) coldata <- data.frame(condition=group) rownames(coldata) <- colnames(counts) fluidigm <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition) zinb <- zinbFit(fluidigm, K=2, epsilon=1000) fluidigm_zinb <- zinbwave(fluidigm, fitted_model = zinb, K = 2, epsilon=1000, observationalWeights = TRUE) ZINB.Wave.weight <- assay(fluidigm_zinb, "weights")
ZINB.Wave.weight[1:6,1:6]
edgeR was originally designed for bulk RNA sequencing data and it used a overdispersed Poisson model and an Empirical Bayes modelto accout for both biological and technical variability.
################################################################################ ###################################### edgeR ################################### library(edgeR) dgList <- DGEList(assay(fluidigm)) countsPerMillion <- cpm(dgList) countCheck <- countsPerMillion > 1 keep <- which(rowSums(countCheck) >= 2) dgList <- dgList[keep,] dgList <- calcNormFactors(dgList, method="TMM") design <- model.matrix(~condition, data = colData(fluidigm)) dgList <- estimateDisp(dgList, design) fit <- glmFit(dgList, design) lrt <- glmLRT(fit, coef=2) edgeR.res <- as.data.frame(topTags(lrt,n=nrow(lrt$coefficients))) edgeR.res <- edgeR.res[rownames(counts),]
head(edgeR.res)
To apply edgeR for scRAN-seq data, we computed gene- and cell-specific weights using ZINB Wave approach. Then such weights were considered in the DE analysis.
################################################################################# ############################### edgeR ZINB WAVE ################################# library(edgeR) dge <- DGEList(assay(fluidigm_zinb)) dge <- calcNormFactors(dge) design <- model.matrix(~condition, data = colData(fluidigm)) dge$weights <- weights dge <- estimateDisp(dge, design) fit <- glmFit(dge, design) lrt <- glmWeightedF(fit, coef = 2) edgeR.weighted.res <- as.data.frame(topTags(lrt,n=nrow(lrt$coefficients))) edgeR.weighted.res <- edgeR.weighted.res[rownames(counts),]
head(edgeR.weighted.res)
DESeq2, originally designed for bulk RNA-seq data, used a shrinkage estimation approach for dispersions and fold change to improve the stability and interpretability of estimates in DE analysis
############################################################################## ################################## DESeq2 #################################### library(DESeq2) dds <- DESeqDataSet(fluidigm, design = ~ condition) dds <- DESeq(dds, sfType="poscounts", useT=TRUE, minmu=1e-6) DESeq2.res <- results(dds) DESeq2.res <- as.data.frame(DESeq2.res) DESeq2.res <- DESeq2.res[rownames(counts),]
head(DESeq2.res)
############################################################################### ############################## DESeq2 ZINB WAVE ############################### library(DESeq2) dds <- DESeqDataSet(fluidigm_zinb, design = ~ condition) dds <- DESeq(dds, sfType="poscounts", useT=TRUE, minmu=1e-6) DESeq2.weighted.res <- results(dds) DESeq2.weighted.res <- as.data.frame(DESeq2.weighted.res) DESeq2.weighted.res <- DESeq2.weighted.res[rownames(counts),]
head(DESeq2.weighted.res)
Voom model estimated the mean-variance relationship for the log-counts, which then generated a precision weight for each observation. Then these weights were considered into the limma empirical Bayes analysis pipeline.
################################################# ############################## #################################### limma voom ################################# library(edgeR) d0 <- DGEList(counts) d0 <- calcNormFactors(d0) mm <- model.matrix(~0 + group) y <- voom(d0, mm, plot = F) fit <- lmFit(y, mm) contr <- makeContrasts(group1 - group2, levels = colnames(coef(fit))) tmp <- contrasts.fit(fit, contr) tmp <- eBayes(tmp) top.table <- topTable(tmp, sort.by = "none", n = Inf) limma.res <- top.table limma.res <- limma.res[rownames(counts),]
head(limma.res)
################################################################################# ############################### limma voom ZINB WAVE ############################ library(edgeR) d0 <- DGEList(counts) d0 <- calcNormFactors(d0) mm <- model.matrix(~0 + group) y <- voom(d0, mm, plot = F) y$weights <- weights*y$weights fit <- lmFit(y, mm) contr <- makeContrasts(group1 - group2, levels = colnames(coef(fit))) tmp <- contrasts.fit(fit, contr) tmp <- eBayes(tmp) top.table <- topTable(tmp, sort.by = "none", n = Inf) limma.weighted.res <- top.table limma.weighted.res <- limma.weighted.res[rownames(counts),]
head(limma.weighted.res)
MAST was designed specifically for scRNA-seq data. It developed a two-part, generalized linear model to parameterizes the characteristics in scRNA-seq data.
################################################################################# ##################################### MAST ###################################### group <- factor(c(rep("Group0",ncol(exps)),rep("Group1",ncol(expsDE)))) group <- relevel(group,"Group0") coldata <- data.frame(condition=group) rownames(coldata) <- colnames(counts) fluidigm <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition) library(MAST) library(data.table) sca <- FromMatrix(counts,data.frame(condition=group), data.frame(Geneid=rownames(counts)),check_sanity = FALSE) zlmCond <- zlm(~condition, sca) summaryCond <- summary(zlmCond, doLRT='conditionGroup1') print(summaryCond, n=4) summaryDt <- summaryCond$datatable fcHurdle <- merge(summaryDt[contrast=='conditionGroup1' & component=='H',.(primerid, `Pr(>Chisq)`)], #hurdle P values summaryDt[contrast=='conditionGroup1' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients MAST.res=fcHurdle MAST.res <- MAST.res[rownames(counts),]
head(MAST.res)
For each method, true positive rates and false positive rates of DE detections can be calculated based on the knowledge of true and false DEGs. Receiver operating characteristic (ROC) curve and the areas under the ROC curve (AUC) can be used as measures to assess the performance of these DE analysis methods.
\newpage
RNA transcriptomes data can be used to trace cell divisions and migrations over time, which is also referred to as cell lineage analysis. The developmental order of cells can be represented by a quantitative measure, pseudotime. To simulate pseudotime effect, we used the strategy of Splat based on differential expression process in cell developmental paths.Similarly to group simulation, in such differential expression process, we first defined the start and end expression level for each path. DEGs were simulated using multiplicative differential expression factors from a log-normal distribution with parameters including number of genes (nGenes), the path-specific proportion of DE genes (de.prob), the proportion of down-regulated DE genes (de.downProb), DE location factor (de.facLoc) and DE scale factor (de.facScale). Specifically, a few other parameters are provided to control the pattern of the lineages.
1) path.nSteps \
-- number of steps between the start point and end point for each path
2) path.skew \
-- Controls how likely cells are from the start or end point of the path. Cells are more likely to come from end point when path.skew gets close to 0, while cells are more likely to come from start point when path.skew gets close to 1.
3) path.from \
-- Control the relationship between these paths.
counts <- as.matrix(Tung) counts <- counts[1:10000,1:200] params <- splatEstimate(counts)
You can either use the parameters estimated from real data to simulate data or pre-define those parameters by yourself.
### Use parameters estimated from real data to simulate data and you only need ### to define the group porportions (group.prob) and path directions (path.from). sim.SCRIP1 <- SCRIPsimu(data=counts, params=params, method="paths", group.prob = c(0.25, 0.25, 0.25, 0.25), path.from = c(0, 1, 2, 3)) library(scater) sim1 <- logNormCounts(sim.SCRIP1) sim1 <- runPCA(sim1) plotPCA(sim1, colour_by = "Group") + ggtitle("Big DE factors")
### Predefine parameters to simulate data sim.SCRIP2 <- SCRIPsimu(data=counts, params=params, method="paths", batchCells=300, group.prob = c(0.25, 0.25, 0.25, 0.25), de.prob = c(0.2, 0.2, 0.2, 0.2), de.downProb = c(0.5, 0.5, 0.5, 0.5), de.facLoc = c(0.2, 0.2, 0.2, 0.2), de.facScale=c(0.2, 0.2, 0.2, 0.2), path.nSteps=c(100, 100, 100, 100), path.from = c(0, 1, 2, 3), path.skew=c(0.5,0.5,0.5,0.5)) library(scater) sim2 <- logNormCounts(sim.SCRIP2) sim2 <- runPCA(sim2) plotPCA(sim2, colour_by = "Group") + ggtitle("Big DE factors")
\newpage
To mitigate the side effect of non-significant genes during analysis, it is essential to find genes that are differentially expressed within lineage or between lineages, referred as trajectory DE analysis. SCRIP enables evaluation of trajectory DE analysis methods with simulated DEGs for within lineage or between lineages. Between-lineage and within-lineage DEGs were simulated separately, with multiplicative differential expression factors generated from a log-normal distribution described above.
For within-lineage simulation, 20% genes were also randomly selected as within-lineage differentially expressed with other within-lineage DE parameters de.downProb, de.facLoc and de.facScale.
library(SCRIP) library(splatter) counts <- as.matrix(Tung) counts <- counts[1:10000,1:200] params <- splatEstimate((counts)) sim.SCRIP.within <- SCRIPsimu(data=counts, params=params, method="paths", de.prob = 0.2, de.downProb=0.5, de.facLoc=1.2, de.facScale=0.1)
sim <- sim.SCRIP.within All.genes <- rownames(counts(sim)) Truepaths <- rowData(sim) DE.within.path1 <- Truepaths$DEFacPath1 True.within <- rep("common",10000) True.within[which(DE.within.path1 != 1)] <- "DE" table(True.within)
Monocle3 test whether any of the genes change over time in their expression by fitting a generalized linear model.
library(monocle3) expression_matrix <- counts(sim.SCRIP.within) cell_metadata <- data.frame(libsize = apply(expression_matrix,2,sum)) rownames(cell_metadata) <- colnames(expression_matrix) gene_annotation <- data.frame(gene_short_name=rownames(expression_matrix)) rownames(gene_annotation) <- rownames(expression_matrix) cds.within <- new_cell_data_set(expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_annotation) cds.within <- preprocess_cds(cds.within, num_dim = 100, method = "PCA") cds.within <- reduce_dimension(cds.within,preprocess_method = "PCA", reduction_method = "UMAP") cds.within <- cluster_cells(cds.within,reduction_method="UMAP") cds.within <- learn_graph(cds.within) # We find all the cells that are close to the starting point closest_vertex <- cds.within@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex closest_vertex <- as.matrix(closest_vertex[colnames(cds.within), ]) closest_vertex <- as.numeric(names(which.max(table(closest_vertex)))) mst <- principal_graph(cds.within)$UMAP root_pr_nodes <- igraph::V(mst)$name[closest_vertex] # We compute the trajectory cds.within <- order_cells(cds.within, root_pr_nodes = root_pr_nodes) # plot_cells(cds.within) gene_fits <- fit_models(cds.within, model_formula_str = "~pseudotime") fit_coefs <- coefficient_table(gene_fits) fit_coefs <- fit_coefs[which(fit_coefs$term != "(Intercept)"),] head(fit_coefs)
tradeSeq implements the associationTest() fucntion to test whether the average gene expression is significantly changing along pseudotime.
library(magrittr) # Get the closest vertice for every cell y_to_cells <- principal_graph_aux(cds.within)$UMAP$pr_graph_cell_proj_closest_vertex %>% as.data.frame() y_to_cells$cells <- rownames(y_to_cells) y_to_cells$Y <- y_to_cells$V1 # Get the root vertices # It is the same node as above root <- cds.within@principal_graph_aux$UMAP$root_pr_nodes # Get the other endpoints endpoints <- names(which(igraph::degree(mst) == 1)) endpoints <- endpoints[!endpoints %in% root] # For each endpoint cellWeights <- lapply(endpoints, function(endpoint) { # We find the path between the endpoint and the root path <- igraph::shortest_paths(mst, root, endpoint)$vpath[[1]] path <- as.character(path) # We find the cells that map along that path df <- y_to_cells[y_to_cells$Y %in% path, ] df <- data.frame(weights = as.numeric(colnames(cds.within) %in% df$cells)) colnames(df) <- endpoint return(df) }) %>% do.call(what = 'cbind', args = .) %>% as.matrix() rownames(cellWeights) <- colnames(cds.within) pseudotime <- matrix(pseudotime(cds.within), ncol = ncol(cellWeights), nrow = ncol(cds.within), byrow = FALSE) library(tradeSeq) sce.within <- fitGAM(counts = expression_matrix , pseudotime = pseudotime, cellWeights = cellWeights)
library(tradeSeq) ## Association of gene expression with pseudotime assoRes <- associationTest(sce.within) head(assoRes)
Between-lineage DE factors: Between-lineage DE factors are generated from a log-normal distribution with parameters including number of genes (nGenes), the path-specific proportion of DE genes (de.prob), the proportion of down-regulated DE genes (de.downProb), location factor (de.facLoc) and scale factor (de.facScale). \
library(SCRIP) library(splatter) counts <- as.matrix(Tung) counts <- counts[1:10000,1:400] params <- splatEstimate(counts) sim.SCRIP.between <- SCRIPsimu(data=counts, params=params, method="paths", group.prob = c(0.5,0.5), de.prob = c(0.0,0.2), de.downProb=c(0.5,0.5), de.facLoc=c(1.2,1.2), de.facScale=c(0.1,0.1), path.from = c(0, 1),path.nSteps = c(100, 100))
Truepaths <- rowData(sim.SCRIP.between) DE.between.path <- Truepaths$DEFacPath2 True.between <- rep("common",10000) True.between[which(DE.between.path != 1)] <- "DE" table(True.between)
The graph_test() in monocle3 uses a statistic from spatial autocorrelation analysis called Moran's I to identify the DEGs.
library(monocle3) expression_matrix <- counts(sim.SCRIP.between) cell_metadata <- data.frame(libsize = apply(expression_matrix,2,sum)) rownames(cell_metadata) <- colnames(expression_matrix) gene_annotation <- data.frame(gene_short_name=rownames(expression_matrix)) rownames(gene_annotation) <- rownames(expression_matrix) cds <- new_cell_data_set(expression_matrix, cell_metadata = cell_metadata, gene_metadata = gene_annotation) cds <- preprocess_cds(cds, num_dim = 100, method = "PCA") cds <- reduce_dimension(cds,preprocess_method = "PCA", reduction_method = "UMAP") cds <- cluster_cells(cds,reduction_method="UMAP") cds <- learn_graph(cds,use_partition = FALSE) # We find all the cells that are close to the starting point closest_vertex <- cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex closest_vertex <- as.matrix(closest_vertex[colnames(cds), ]) closest_vertex <- as.numeric(names(which.max(table(closest_vertex)))) mst <- principal_graph(cds)$UMAP root_pr_nodes <- igraph::V(mst)$name[closest_vertex] # We compute the trajectory cds <- order_cells(cds, root_pr_nodes = root_pr_nodes) plot_cells(cds, color_cells_by = "pseudotime") pr_graph_test_res <- graph_test(cds, neighbor_graph="knn", cores=8)
head(pr_graph_test_res)
The function patternTest() implements a statistical model to identify DEGs which have significantly different expression patterns.
library(magrittr) # Get the closest vertice for every cell y_to_cells <- principal_graph_aux(cds)$UMAP$pr_graph_cell_proj_closest_vertex %>% as.data.frame() y_to_cells$cells <- rownames(y_to_cells) y_to_cells$Y <- y_to_cells$V1 # Get the root vertices root <- cds@principal_graph_aux$UMAP$root_pr_nodes # Get the other endpoints endpoints <- names(which(igraph::degree(mst) == 1)) endpoints <- endpoints[!endpoints %in% root] # For each endpoint cellWeights <- lapply(endpoints, function(endpoint) { # We find the path between the endpoint and the root path <- igraph::shortest_paths(mst, root, endpoint)$vpath[[1]] path <- as.character(path) # We find the cells that map along that path df <- y_to_cells[y_to_cells$Y %in% path, ] df <- data.frame(weights = as.numeric(colnames(cds) %in% df$cells)) colnames(df) <- endpoint return(df) }) %>% do.call(what = 'cbind', args = .) %>% as.matrix() rownames(cellWeights) <- colnames(cds) pseudotime <- matrix(pseudotime(cds), ncol = ncol(cellWeights), nrow = ncol(cds), byrow = FALSE) library(tradeSeq) sce <- fitGAM(counts = expression_matrix, pseudotime = pseudotime, cellWeights = cellWeights)
library(tradeSeq) patternRes <- patternTest(sce) head(patternRes)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.