In alexyermanos/Echidna: Single Cell Immune Repertoire and Transcriptome Simulation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

All example simulations are mouse cell simulations. Load the package and get started!

library(Echidna)

Simulate naive and expanded repertoires

In Echidna, there are pre-installed R objects used as parameters and users are allowed to change it. For example, class_switch_prob_mus controls the probability for mouse class switching. It is a 9*9 matrix.The row and column names are initially "IGHM","IGHD","IGHG1","IGHG2A","IGHG2B","IGHG2C","IGHG3","IGHE","IGHA". The probability for a cell to switch from "IGHM" to "IGHD" is the value at class_switch_prob_mus[1,2].

Naive B cell repertoires contain clones with minor clonal expansion and are mostly IgM+ cells. To simulate naive repertoires, first set the isotype and phenotype switching probability matrix to 0:

class_switch_prob_mus[class_switch_prob_mus>0]<-0
trans_switch_prob_b[trans_switch_prob_b>0]<-0

Then use simulator function: Here we turn off clonal.selection, to suppress the growth of big clones and give an even clonal frequency distribution:

naive<-simulate_repertoire(initial.size.of.repertoire = 500,
                              duration.of.evolution = 30,
                              vdj.branch.prob = 0.5,
                              cell.division.prob = c(0.01,0.3),
                              max.cell.number = 5000,
                              max.clonotype.number = 5000,
                              complete.duration=T,
                              clonal.selection =F,
                              death.rate = 0,
                              transcriptome.on = T,
                              SHM.nuc.prob = 0.00001
)

Use clonofreq.isotype.plot to see the barplot of clonal frequency, colored with isotypes.

clonofreq.isotype.plot(naive[[1]], 50, y.limit = 50)

Use clonofreq.trans.plot to see the barplot of clonal frequency, colored with phenotypes.

clonofreq.trans.plot(all.contig.annotations = naive[[1]], top.n = 50, y.limit = 50, trans.names = colnames(trans_switch_prob_b),history =naive[[12]] )

Expanded repertoires have more expanded clones, more variations in cell differentiation, and class-switching. To simulate expanded repertoire, First set the class switching probability matrix to a desired value - e.g., we can skew the final repertoire to have mostly IgG+ cells and some fraction of other isotypes; Then set the phenotype switching matrix to make the final phenotype distribution has majority of plasma cells and some germinal center and memory B cells.

class_switch_prob_mus[class_switch_prob_mus>0]<-0
class_switch_prob_mus[1,3]<-0.01
class_switch_prob_mus[1,4]<-0.01
class_switch_prob_mus[1,5]<-0.01
class_switch_prob_mus[1,2]<-0.001
class_switch_prob_mus[1,6]<-0.001
class_switch_prob_mus[1,7]<-0.001
class_switch_prob_mus[7,1]<-0.1


trans_switch_prob_b[trans_switch_prob_b>0]<-0
trans_switch_prob_b[1,3]<-0.01
trans_switch_prob_b[1,2]<-0.01
trans_switch_prob_b[2,3]<-0.8
trans_switch_prob_b[3,4]<-0.001
trans_switch_prob_b[4,3]<-0.01

In expanded B cell repertoires, the selected clones tend to expand more and have more IgG+ cells and plasma cells, while the non expanded clones are likely to be naive B cells of the IgM isotype.

In the function, the parameter clonal.selection is for controlling the division rate according to their clonal frequency, and eventually control the final clonal size distribution. If set to be TURE, cells in clones with higher frequency have their division probability proportional to the clonal frequency. If FALSE, clones with higher frequency will have lower probability to expand. To use adjust the different cell division rates, here comes in another parameter cell.division.prob.

The parameter cell.division.prob is the probability of cells to be duplicated in each time step. Default is 0.1. If uneven probability for different clones is needed, the input should be a vector of 2 numeric items, with the first item being the lower bound, the second item being the upper bound of the division rate. The most abundant clone will get the highest division rate, and division rate of other clones will follow arithmetic progression and keep decreasing until the last abundant clone with the lower limit of division rate. If a third value is given, the third value will be the division rate for cells with selected sequences. the selected sequences can have their special division probability. If a fourth number is given, the division probability of selected sequence will be sampled between the third number and the fourth number.

Here we set cell.division.prob as 0.2,0.2,0.5, meaning selected clones will have higher probability (0.5) to undergo cell division than not selected clones, while other cells all have the same cell division rate (0.2). transcriptome.switch.selection.dependent is set to be TRUE to make sure selected cells switch to the next cell state than naive cell.

class.switch.selection.dependent is set to be TRUE to make sure selected cells switch away from IgM to other isotype expressing cells.

expanded<-simulate_repertoire(initial.size.of.repertoire = 50,
                               duration.of.evolution = 30,
                               vdj.branch.prob = 0.1,
                               cell.division.prob = c(0.2,0.2,0.5),
                               max.cell.number = 5000,
                               max.clonotype.number = 5000,
                               complete.duration=T,
                               clonal.selection =T,
                               transcriptome.on = T,
                               death.rate = 0,
                               SHM.nuc.prob = 0.00001,
                               sequence.selection.prob = 0.3,
                               transcriptome.switch.selection.dependent = T,
                               class.switch.selection.dependent = T
)

clonofreq.isotype.plot(expanded[[1]],50,y.limit = 750)

Simulate repertoires with different V gene usage distribution

Originally, V/D/J genes are randomly selected from the gene pools stored in data lists. However, the lists can be modified according to need. To make genes easy to recognize, here we take the first 50 heavy chain V gene and 30 light chain V genes:

mus_b_h[[1]]<-mus_b_h[[1]][1:50,]
mus_b_l[[1]]<-mus_b_l[[1]][1:30,]

In the first case, we simulate a repertoire with all V gene randomly chosen, meaning the probability of V genes being selected are evenly distributed:

even<-simulate_repertoire(initial.size.of.repertoire = 5000,
                                duration.of.evolution = 0,
                                vdj.branch.prob = 0,
                                cell.division.prob = c(0,0),
                                max.cell.number = 5000,
                                max.clonotype.number = 5000,
                                complete.duration=F,
                                clonal.selection = F,
                                death.rate = 0,
                                transcriptome.on = F,
                                igraph.on = F
)

In the second case, we simulate a repertoire with the first 2 heavy chain V genes and the first light chain V gene more likely to be selected. mus_b_h is a list containing 3 dataframes storing the germline V, D, J gene sequences for mouse B cell heavy chains. The first item is for V genes, the second item is for D genes, the third item is for J genes. The corresponding list mus_b_l has 2 items, storing mouse B cell light chain germline V and J genes. We replicate the first 2 rows in the first item in mus_b_h, and the first row in mus_b_l, 30 times and 20 times respectively:

reph<-mus_b_h[[1]][c(1,2),]
repl<-mus_b_l[[1]][c(1),]

mus_b_h[[1]]<-rbind(mus_b_h[[1]],reph[rep(seq_len(nrow(reph)), each = 30), ])
mus_b_l[[1]]<-rbind(mus_b_l[[1]],repl[rep(seq_len(nrow(repl)), each = 20), ])

And simulate based on the new gene pool after modification:

uneven<-simulate_repertoire(initial.size.of.repertoire = 5000,
                                              duration.of.evolution = 0,
                                              vdj.branch.prob = 0,
                                              cell.division.prob = c(0,0),
                                              max.cell.number = 5000,
                                              max.clonotype.number = 5000,
                                              complete.duration=F,
                                              clonal.selection = F,
                                              death.rate = 0,
                                              transcriptome.on = F,
                                              igraph.on = F
)

Use get.vgu.matrix to get the heavy and light chain V gene usage data normalized on cell level or clone level.

even_vgu_cell<-get.vgu.matrix(even[[1]],"cell")
even_vgu_clone<-get.vgu.matrix(even[[1]],"clone")

head(even_vgu_cell)
head(even_vgu_clone)

Make clones with specific V gene have different probability to be chosen in clonal selection

The dataframe special_v is prepared for this propose. Set special.v.gene as TRUE, the simulator will execute the indicated probability to be chosen in clonal selection for clones with the specific V genes listed in the dataframe special_v. The parameter sequence.selection.prob is the probability for every sequence to be selected as expanding variant. Here we set it to be 0, to make only variants with special v gene get chance to be selected.

specialv<-simulate_repertoire(initial.size.of.repertoire = 100,
                                        duration.of.evolution = 30,
                                        vdj.branch.prob = 0.1,
                                        cell.division.prob = c(0.2,0.2,0.5),
                                        max.cell.number = 5000,
                                        max.clonotype.number = 5000,
                                        complete.duration=T,
                                        clonal.selection =T,
                                        death.rate = 0,
                                        transcriptome.on = T,
                                        SHM.nuc.prob = 0.00001,
                                        sequence.selection.prob = 0,
                                        transcriptome.switch.selection.dependent = T,
                                        class.switch.selection.dependent = T,
                                        special.v.gene = T
)

Use get.vgu.matrix to get the heavy and light chain V gene usage data normalized on cell level or clone level.

specialv_vgu_cell<-get.vgu.matrix(specialv[[1]],"cell")
head(specialv_vgu_cell)

specialv_vgu_clone<-get.vgu.matrix(specialv[[1]],"clone")
head(specialv_vgu_clone)

Simulate repertoires with different gene expression pattern

Seurat (Stuart and Butler et al.) is a gene expression data processing tool kit. With the provided tools such as tSNE, Umap, we can see how cell populations are distributed. We can simulate immune cell gene expression data along with their immune repertoire sequences. The repertoire can contain different cell populations, and the gene expression data of these populations might be distinct or somewhat close to each other. trans_switch_prob_t is a matrix indicating the probability of switching between phenotypes. In the first case, we simulate a CD8+ T cell repertoire with distinct cell population distribution based on experimental gene expression data mean values:

trans_switch_prob_t[trans_switch_prob_t>0]<-0
trans_switch_prob_t[4,5]<-0.3
trans_switch_prob_t[5,6]<-0.3
trans_switch_prob_t[6,7]<-0.3
trans_switch_prob_t[7,4]<-0.3

CD8_experimental<-simulate_repertoire(cell.type = "T",
                                  cd4.proportion = 0,
                                 initial.size.of.repertoire = 50,
                                 duration.of.evolution = 20,
                                 vdj.branch.prob = 0.1,
                                 cell.division.prob = c(0.5,0.5),
                                 class.switch.prob = class_switch_prob_mus,
                                 max.cell.number = 5000,
                                 max.clonotype.number = 5000,
                                 complete.duration=T,
                                 clonal.selection =T,
                                 death.rate = 0,
                                 transcriptome.on = T,
                                 igraph.on = F,
                                 SHM.nuc.prob = 0.00001)

In the second case, we simulate a repertoire with cell populations overlapping. The columns in mus_t_trans dataframe are vectors controlling gene expression level. The rownames are gene names, colnames are cell type names. In the default setting, column 4 to 7 are gene expression vectors for CD8+ T cells. First we make some gene expression vectors with close or remote distance from each other by editing mus_t_trans:

base1<-rnorm(nrow(mus_t_trans), mean = 1, sd = 0.3)
base2<-rnorm(nrow(mus_t_trans), mean = 1, sd = 0.3)
base3<-rnorm(nrow(mus_t_trans), mean = 1, sd = 0.3)
base4<-rnorm(nrow(mus_t_trans), mean = 1, sd = 0.5)

mus_t_trans[,4]<-base1
mus_t_trans[,5]<-base1
mus_t_trans[1:400,5]<-base2[1:400]
mus_t_trans[,6]<-mus_t_trans[,5]
mus_t_trans[20000:20200,6]<-base3[20000:20200]
mus_t_trans[,7]<-base4

And edit phenotype switching probability matrix trans_switch_prob_t, to ensure every phenotype have a mediate number of population.

trans_switch_prob_t[4,5]<-0.3
trans_switch_prob_t[5,6]<-0.3
trans_switch_prob_t[6,7]<-0.3
trans_switch_prob_t[7,4]<-0.3

Finally simulate based on the new parameters we just made.

CD8_overlap<-simulate_repertoire(cell.type = "T",
                                 initial.size.of.repertoire = 50,
                                 cd4.proportion = 0,
                                 duration.of.evolution = 20,
                                 vdj.branch.prob = 0.1,
                                 cell.division.prob = c(0.5,0.5),
                                 max.cell.number = 5000,
                                 max.clonotype.number = 5000,
                                 complete.duration=T,
                                 clonal.selection =T,
                                 death.rate = 0,
                                 transcriptome.on = T,
                                 SHM.nuc.prob = 0.00001)

Echidna provide functions that wrapped Seurat functions to visualize simulated gene expression data in few lines. Alternatively, the user can choose to run Seurat follow the instructions (https://satijalab.org/seurat/) starting from "gex <- Seurat::CreateSeuratObject(counts = CD8_overlap[[7]])" First plot elbow to select a proper dimension value for the next step of dimension reduction.

gex<-get.elbow(CD8_overlap[[7]])
Seurat::ElbowPlot(gex)

Then further process the data and get Umap.

gex<-get.umap(gex = gex,d = 4,reso = 0.1)
Seurat::DimPlot(gex, reduction = "umap", label=F, pt.size=0.8)

When needed, we can use function umap.top.highlight to highlight the clones ranking top in clonal frequency:

First prepare a named vector as color list:

col<-c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00")
names(col)<-paste0("CloneRank",c(1:5))
col[6]<-c('#D9D9D9')
names(col)[6]<-'Unselected'

Highlight the top 5 clones in clonal frequency:

gex<-umap.top.highlight(gex = gex,all.contig.annotations = CD8_overlap[[1]],top.n = 5)
Seurat::DimPlot(gex, reduction = "umap", cols=col, label=F,pt.size=0.8, order=as.list(c(paste0("CloneRank",c(5:1)),"Unselected")))

Plot mutational networks with igraph

Simulate B cell networks with moderate size

network<- simulate_repertoire(initial.size.of.repertoire = 50,
                           duration.of.evolution = 10,
                           vdj.branch.prob = 0.1,
                           cell.division.prob = c(0.3,0.5) ,
                           max.cell.number = 10000,
                           max.clonotype.number = 10000,
                           complete.duration=F,
                           clonal.selection =F,
                           death.rate = 0,
                           transcriptome.on = T,
                           igraph.on=T,
                           SHM.nuc.prob = 0.0001,
                           seq.name = 20)

Plot the mutational networks of the first clone colored with isotypes:

igraph::plot.igraph(network[[8]][[1]])

Plot networks colored with phenotypes:

igraph::plot.igraph(network[[9]][[1]])

Sometimes, we want to remove the empty node representing extincted variants in the network.

First use no.empty.node remove all the empty node in the whole simulation.

noempty<-no.empty.node(history = network[[12]],igraph.index = network[[13]])

Then plot the new network with removed empty node.

#x is the mutation network that has empty nodes which we want to see them to be removed.
#igraph::plot.igraph(noempty[[8]][[x]])

We can also use cluster.id.igraph to color the igraph with Seurat cluster ID.

First run Seurat work flow. The ElbowPlot is to determine the parameter d (dimension) value in the next function get.umap().

gex<-get.elbow(network[[7]])
Seurat::ElbowPlot(gex)
gex<-get.umap(gex = gex,d = 4,reso = 0.1)

cluster.id.igraph(meta.data = gex@meta.data,history = network[['history']],igraph.index = network[["igraph.index"]],empty.node = T)

Simulate with variantional autoencoders (VAEs)

The simulate_repertoire function has an option vdj.productive parameter, which controls how the new sequence simulation will happen. It has 3 possible settings "random",V, (D) and J gene segments are randomly chosen and undergo V(D)J recombination process. "naive", each new sequence is sampled from a pool of experimental naive cell sequencing data (productive_seq). * "vae", each new sequence is sample from a pool of sequences simulated by variantional autoencoders (VAEs) based on experimental data.

Simulate strating repertoire by sampling sequence from a vae generated sequence pool:

repertoire_vae<-simulate_repertoire(vdj.productive = "vae")

The vae_generate function provides the possibility for the users to generate data following the same pattern as the input data.

Before using functions with vae, we need to set up interface with python in R. This can be also found in README.md on https://github.com/alexyermanos/Echidna/blob/main/README.md :

Remove the "#" when run it

# install.packages("reticulate")
# reticulate::install_miniconda()
# install.pakcages("keras")
# install.pakcages("tensorflow")
# tensorflow::install_tensorflow()
# keras::install_keras()

Here we use the repertoire we just simulated in the last step as input in function vae_generate, but you can use your own sequencing data as input as well.

Remove the "#" when run it

# sequence<-repertoire_vae[['all_contig']]$seq
# vae_generated_sequences<-vae_generate(sequence=sequence, n.train = 80, n.sample = 10, epochs = 30)
# vae_generated_sequences

alexyermanos/Echidna documentation built on Feb. 13, 2022, 10:29 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com