R/Echidna_simulate_repertoire.R

Defines functions Echidna_simulate_repertoire

Documented in Echidna_simulate_repertoire

#' @title Simulate immune repertoire and transcriptome data
#' @description Simulate repertoire and transcriptome matrix, with igraph tree plot for each clone showing the evolution process. the node in the tree plot are colored with transcriptome state and isotype.
#'
#' @param initial.size.of.repertoire The initial number of existing cells when the evolution starts. Default is 10.
#' @param species The species of the simulated repertoire, can be "mus" for mouse or "hum" for human. Default is "mus".
#' @param cell.type The cell type for the simulation. "B" or "T"
#' @param cd4.proportion A number between 0 and 1 specifying the proportion of Cd4+ T cells, when cell.type is "T" and transcirptime states data is default. Default is 1, all the cells are Cd4. When user specify transciptome data for T cells, mixture of CD4+ and CD8+ T cells are not applicable.
#' @param duration.of.evolution The maxim time steps for simulation.
#' @param complete.duration TRUE or FALSE. Default is TURE. If TURE, after cell number or clone number reaches the upper limit, the evolution(class switch, mutation, transcriptional state switch) will continue until the duration.of.evolution is complete. If FLASE, the evolution will stop when either cell number or clone number reaches the limit.
#' @param vdj.productive "random": the sequence will be generated from random VDJ recombination, there might be a proportion of unproductive sequences.These VDJ genes were taken from IMGT. When more than one allele was present for a given gene, the first was used. "naive": the VDJ sequence be sampled from a pool of productive sequences obtained by filtering randomly simulated sequences with MIXCR. "vae": the VDJ sequence be sampled from a pool of productive sequences obtained by filtering sequences generated from vae models with MIXCR.
#' @param vdj.model Specifies the model used to simulate V-D-J recombination. Can be either "naive" or "data". "naive" is chain independent and does not differentiate between different species. To rely on the default "experimental" options, this should be "data" and the parameter vdj.insertion.mean should be "default". This will allow for different mean additions for either the VD and JD junctions and will differ depending on species.
#' @param vdj.insertion.mean Integer value describing the mean number of nucleotides to be inserted during simulated V-D-J recombination events. If "default" is entered, the mean will be normally distributed.
#' @param vdj.insertion.stdv Integer value describing the standard deviation corresponding to insertions of V-D-J recombination. No "default" parameter currently supported but will be updated with future experimental data. This should be a number if using a custom distribution for V-D-J recombination events, but can be "default" if using the "naive" vdj.model or the "data", with vdj.insertion.mean set to "default".
#' @param vdj.branch.prob Probability of new VDJ recombination event in each time step.when new VDJ recombination happen, a new cell with a new sequence will be generated. Default is 0.2.
#' @param clonal.selection TRUE or FALSE. If TURE, cells in clones with higher frequency have their division probability proportional to the clonal frequency. If FALSE, clones with higher frequency will have lower probability to expand.
#' @param cell.division.prob Probability of cells to be duplicated in each time step. Default is 0.1. If uneven probability for different clones is needed, the input should be a vector of 2 numeric items, with the first item being the lower bound, the second item being the upper bound of the division rate. The most abundant clone will get the highest division rate, and division rate of other clones will follow arithmetic progression and keep decreasing until the last abundant clone with the lower limit of division rate. If input 3 values, the third value will be the division rate for cells with selected sequences. If a fourth number is given, the division probability of selected sequence will be sampled between the third number and the fourth number.
#' @param sequence.selection.prob Probability of each unique sequence to be selected as expanding sequence.Expanding sequences can have their division rate specified in the third element of cell.division.prob.
#' @param special.v.gene If TRUE, simulation will apply special sequence.selection.prob for heavy and light chain v gene combination specified in dataframe "special_v".
#' @param class.switch.prob Probability matrix of class switching for b cells. The row names of the matrix are the isotypes the cell is switching from, the column names are the isotypes the cell is switching to. All B cells start from IGHM, and switch to one of the other isotypes or remain the same. Default values are in the attaching matrix "class_switch_prob_hum" and "class_switch_prob_mus". The order of isotype in rows and columns should be the same.
#' @param class.switch.selection.dependent If TRUE,class switching will happen when the cell is selected,if the cell has IgM or IgD isotype.
#' @param class.switch.independent If TRUE, class switching will happen randomly at each time step for all cells. If FALSE, random class switching will be switched off.
#' @param SHM.method The mode of SHM speciation events. Options are either: "poisson","data","motif","wrc", and "all". Specifying either "poisson" or "naive" will result in mutations that can occur anywhere in the heavy chain region, with each nucleotide having an equal probability for a mutation event. Specifying "data" focuses mutation events during SHM in the CDR regions (based on IMGT), and there will be an increased probability for transitions (and decreased probability for transversions). Specifying "motif" will cause neighbor dependent mutations based on a mutational matrix from high throughput sequencing data sets (Yaari et al., Frontiers in Immunology, 2013). "wrc" allows for only the WRC mutational hotspots to be included (where W equals A or T and R equals A or G). Specifying "all" will use all four types of mutations during SHM branching events, where the weights for each can be specified in the "SHM.nuc.prob" parameter.
#' @param SHM.nuc.prob Specifies the rate at which nucleotides change during speciation (SHM) events. This parameter depends on the type of mutation specified by SHM.method. For both "poisson" and "data", the input value determines the probability for each site to mutate (the whole sequence for "poisson" and the CDRs for "data"). For either "motif" or "wrc", the number of mutations per speciation event should be specified. Note that these are not probabilities, but the number of mutations that can occur (if the mutation is present in the sequence). If "all" is specified, the input should be a vector where the first element controls the poisson style mutations, second controls the "data", third controls the "motif" and fourth controls the "wrc".
#' @param SHM.isotype.dependent If TRUE, somatic hypermutation of certain isotype will happen based on probability specified in dataframe "iso_SHM_prob".
#' @param SHM.phenotype.dependent If TRUE, somatic hypermutation of certain phenotype will happen based on probability specified in dataframe "pheno_SHM_prob".
#' @param max.cell.number Integer value describing maximum number of cells allowed. Default is 1500.
#' @param max.clonotype.number Integer value describing maximum number of clones allowed. cell derived from the same mother cell belong to same clone.
#' @param death.rate Probability of cell death happen to each cell in each time step.
#' @param igraph.on If TRUE, mutational network for every B cell clone will be in the output. If False, the igraphs will not be included.
#' @param transcriptome.on If TRUE, the simulation will include transcriptome data. If FALSE, only vdj sequence will be simulated.
#' @param transcriptome.switch.independent TRUE or FALSE value describing whether transcriptome state is allowed to switch independently, not dependent on class switching or somatic hypermutation. If TURE, transcriptome.switch.prob should be specified to control the probability of transcriptome state switching.
#' @param transcriptome.switch.prob Probability of transcriptome state switching independently. Default values are in the attaching matrix "trans_switch_prob_b" and "trans_switch_prob_t". The order of cell type in rows and columns should be the same, and the order of the cell type in the matrix should match cell type names in transcriptome.states.
#' @param transcriptome.switch.isotype.dependent TRUE or FALSE value describing whether transcriptome state of a cell is allowed to switch depending on isotype switching. If TRUE, transcriptome state will switch once class switching happens.
#' @param transcriptome.switch.SHM.dependent TRUE or FALSE value describing whether transcriptome state of a cell is allowed to switch depending on somatic hypermutation. If TRUE, transcriptome state will switch once somatic hypermutation happens.
#' @param transcriptome.switch.selection.dependent If TRUE, selected cells will undergo transcriptome state switching if their transcriptome state is 1.
#' @param transcriptome.states A data.frame specifying base gene expression for different cell type, with gene names as row names, cell type names as column names. When missing, a default data.frame will be used. Default data.frame includes "GerminalcenterBcell", "NaiveBcell", "Plasmacell", "MemoryBcell" for B cells, and "Naïve Cd4", "ActivatedCd4", "MemoryCd4", "NaiveCd8", "EffectorCd8" ,"MemoryCd8","ExhaustedCd8" for T cells. The order of the cell type names in transcriptome.states should match cell type names in the transcriptome.switch.prob matrix.
#' @param transcriptome.noise A character expression specifying the distribution of noise ratio to be multiplied with the base gene expression for each cell. It should be a text expression that generates a numeric vector, which is of the same length as gene names in the trasncriptome.state input. Default value is "rnorm(nrow(transcriptome.states), mean = 1, sd = 0.3)".
#' @param seq.name Integer specifies how many top-ranking clones are included in Seq_Name dataframe in the output list for phylogenetic tree plotting in other pipeline. If missing, Seq_Name won't be included in the output.
#' @return A list containing the VDJ sequence and corresponding transcriptome data: "all_contig_annotations", "clonotypes", "all_contig", "consensus","reference","reference_real", "transcriptome","igraph_list_iso","igraph_list_trans","Seq_Name","igraph.index.attr","history","igraph.index","selected.seq","version","parameters".
#'
#' @export
Echidna_simulate_repertoire <- function(initial.size.of.repertoire,
                                species,
                                cell.type,
                                cd4.proportion,
                                duration.of.evolution,
                                complete.duration,
                                vdj.productive,
                                vdj.model,
                                vdj.insertion.mean,
                                vdj.insertion.stdv,
                                vdj.branch.prob,
                                clonal.selection,
                                cell.division.prob,
                                sequence.selection.prob,
                                special.v.gene,
                                class.switch.prob,
                                class.switch.selection.dependent,
                                class.switch.independent,
                                SHM.method,
                                SHM.nuc.prob,
                                SHM.isotype.dependent,
                                SHM.phenotype.dependent,
                                max.cell.number,
                                max.clonotype.number,
                                death.rate,
                                igraph.on,
                                transcriptome.on,
                                transcriptome.switch.independent,
                                transcriptome.switch.prob,
                                transcriptome.switch.isotype.dependent,
                                transcriptome.switch.SHM.dependent,
                                transcriptome.switch.selection.dependent,
                                transcriptome.states,
                                transcriptome.noise,
                                seq.name
){

  # load polybox

  load(url("https://polybox.ethz.ch/index.php/s/zETU3ruyfTjj8T8/download"))

  # for cran check, global bind data sets----

  class_switch_prob_hum <- class_switch_prob_hum
  class_switch_prob_mus <- class_switch_prob_mus
  hum_b_h <- hum_b_h
  hum_b_l <- hum_b_l
  hum_b_trans <- hum_b_trans
  hum_t_h <- hum_t_h
  hum_t_l <- hum_t_l
  hum_t_trans <- hum_t_trans
  iso_SHM_prob <- iso_SHM_prob
  mus_b_h <- mus_b_h
  mus_b_l <- mus_b_l
  mus_b_trans <- mus_b_trans
  mus_t_h <- mus_t_h
  mus_t_l <- mus_t_l
  mus_t_trans <- mus_t_trans
  pheno_SHM_prob <- pheno_SHM_prob
  productive_seq <- productive_seq
  special_v <- special_v
  trans_switch_prob_b <- trans_switch_prob_b
  trans_switch_prob_t <- trans_switch_prob_t
  vae_seq <- vae_seq


  if(missing(species)) species<-"mus"
  if(missing(cell.type)) cell.type<-"B"
  if(missing(cd4.proportion)) cd4.proportion<-1
  if(missing(vdj.productive)) vdj.productive<-"random"
  if(missing(initial.size.of.repertoire)) initial.size.of.repertoire <- 10
  if(missing(vdj.model)) vdj.model <- "naive"
  if(missing(vdj.insertion.mean))  vdj.insertion.mean<-0.1
  if(missing(vdj.insertion.stdv))  vdj.insertion.stdv<-0.05
  if(missing(duration.of.evolution))  duration.of.evolution<-20
  if(missing(complete.duration)) complete.duration<-T
  if(missing(vdj.branch.prob))  vdj.branch.prob<-0.2
  if(missing(clonal.selection)) clonal.selection<-F
  if(missing(cell.division.prob))  cell.division.prob<-c(0.1,0.1)
  if(missing(sequence.selection.prob)) sequence.selection.prob<-0.01
  if(missing(special.v.gene)) special.v.gene<-F
  if(missing(class.switch.prob)&species=="mus")  class.switch.prob<-class_switch_prob_mus
  if(missing(class.switch.prob)&species=="hum")  class.switch.prob<-class_switch_prob_hum
  if(missing(class.switch.selection.dependent)) class.switch.selection.dependent<-F
  if(missing(class.switch.independent)) class.switch.independent<-T
  if(missing(SHM.nuc.prob))  SHM.nuc.prob<-0.001
  if(missing(SHM.method))  SHM.method<-"naive"
  if(missing(SHM.isotype.dependent)) SHM.isotype.dependent<-F
  if(missing(SHM.phenotype.dependent)) SHM.phenotype.dependent<-F
  if(missing(max.cell.number))  max.cell.number<-1500
  if(missing(max.clonotype.number))  max.clonotype.number<-20
  if(missing(death.rate)) death.rate<-0.001
  if(missing(igraph.on)&cell.type=="T") igraph.on<-F
  if(missing(igraph.on)&cell.type=="B") igraph.on<-T
  if(missing(transcriptome.on)) transcriptome.on<-T
  if(missing(transcriptome.switch.independent)) transcriptome.switch.independent<-T
  if(missing(transcriptome.switch.prob)&cell.type=="B") transcriptome.switch.prob<-trans_switch_prob_b
  if(missing(transcriptome.switch.prob)&cell.type=="T") transcriptome.switch.prob<-trans_switch_prob_t
  if(missing(transcriptome.switch.isotype.dependent)) transcriptome.switch.isotype.dependent<-F
  if(missing(transcriptome.switch.SHM.dependent)) transcriptome.switch.SHM.dependent<-F
  if(missing(transcriptome.switch.selection.dependent)) transcriptome.switch.selection.dependent<-F
  if(missing(transcriptome.states)&species=="mus"&cell.type=="B") transcriptome.states<-mus_b_trans
  if(missing(transcriptome.states)&species=="hum"&cell.type=="B") transcriptome.states<-hum_b_trans
  if(missing(transcriptome.states)&species=="mus"&cell.type=="T") transcriptome.states<-mus_t_trans
  if(missing(transcriptome.states)&species=="hum"&cell.type=="T") transcriptome.states<-hum_t_trans
  if(missing(transcriptome.noise)) transcriptome.noise<-"rnorm(nrow(transcriptome.states), mean = 1, sd = 0.3)"
  if(missing(seq.name)) seq.name<-F

  if(is.logical(transcriptome.switch.independent) & is.logical(transcriptome.switch.isotype.dependent) &is.logical( transcriptome.switch.SHM.dependent)==F){
    stop("transcriptome.switch.* is a T or F input")
  }
  if(!(nrow(transcriptome.states)==length(eval(parse(text=transcriptome.noise))))){
    stop("transcriptome.noise vector length different from transcriptome gene length")
  }
  if(length(transcriptome.noise)>1&length(transcriptome.noise)!=ncol(transcriptome.states)){
    stop("plase assgin as many transcriptome.noise vector items as the number of transcriptome.states or just assign a common one")
  }
  if(!missing(transcriptome.states)) cd4.proportion<-1

#check the order of cell names in transcriptome.switch.prob and transcriptome.states
 if(all(!(rownames(transcriptome.switch.prob) == colnames(transcriptome.switch.prob)&rownames(transcriptome.switch.prob) == colnames(transcriptome.states)))){
   stop("transcriptome.state colnames and transcriptome.switch.prob colnames and rownames should be the same")
 }

 if(length(cell.division.prob)==1){
   cell.division.prob<-c(cell.division.prob,cell.division.prob)
 }
  if(max(cell.division.prob)>1){
    stop("Any value in cell.division.prob should less than 1")
  }


# require( stringr, reshape2, dplyr, do, igraph, stats, ggplot2)

#initial vectors------
  barcode<-c()
  raw_contig_id<-c()
  chain<-c()
  v_gene<-c()
  d_gene<-c()
  j_gene<-c()
  c_gene<-c()
  v_seq<-c()
  d_seq<-c()
  j_seq<-c()

  ref_seq<-c()
  raw_clonotype_id<-c()
  clonotype_num<-initial.size.of.repertoire

  seq<-c()
  Length<-c()
  #single format
  clonotype_id<-c()
  seq_combi<-c()
  barcode_uniq<-c()
  gen<-c()
  isotype<-c()
  seq.number<-c()
  seq.history<-c()
  igraph.index<-list()
  barcode.history<-c()
  igraph.index.attr<-list()
  igraph_list<-list()
  igraph.index.jr<-list()
  colors<-colors
  pie.values.list<-list()
  pie.values<-list()
  trans_dis<-c()
  trans_ls<-list()
  trans_state<-c()
  trans_state_his<-c()

  igraph_list_trans<-list()
  pie_values_trans_list<-list()
  cd4_prob<-cd4.proportion

  cdr3<-c()
  cdr3_nt<-c()

  selected_seq<-c()
  selected_rate<-c()

#load input----
  if(length(transcriptome.noise)>1){

    for (i in 1:ncol(transcriptome.states)){
      trans_dis[i]<-paste0("transcriptome.states[",i,"]*",transcriptome.noise[i])
    }
  }
  if(length(transcriptome.noise)==1){
    for (i in 1:ncol(transcriptome.states)){
      trans_dis[i]<-paste0("transcriptome.states[",i,"]*",transcriptome.noise)
    }
  }

  if(cell.type=="B" & species=="mus" | species=="mouse" | species=="blc6"){
  if(vdj.productive=="random"){
    seq_input_h<-mus_b_h
    seq_input_l<-mus_b_l
    }
  if(vdj.productive=="naive"){
    seq_input_h<-productive_seq$mus_b_h
    seq_input_l<-productive_seq$mus_b_l
    # ref_seq_h<-mus_b_h
    # ref_seq_l<-mus_b_l
  }
  if(vdj.productive=="vae"){
    seq_input_h<-vae_seq$mus_b_h
    seq_input_l<-vae_seq$mus_b_l
    # ref_seq_h<-mus_b_h
    # ref_seq_l<-mus_b_l
  }
    c_genename_h<-"IGHM"
  }
  if(cell.type=="B" & species=="hum" | species=="human"){
    if(vdj.productive=="random"){
    seq_input_h<-hum_b_h
    seq_input_l<-hum_b_l
    }
    if(vdj.productive=="naive"){
    seq_input_h<-productive_seq$hum_b_h
    seq_input_l<-productive_seq$hum_b_l
    # ref_seq_h<-hum_b_h
    # ref_seq_l<-hum_b_l
    }
    c_genename_h<-"IGHM"
  if(vdj.productive=="vae"){
    seq_input_h<-vae_seq$hum_b_h
    seq_input_l<-vae_seq$hum_b_l
    # ref_seq_h<-hum_b_h
    # ref_seq_l<-hum_b_l
  }
  c_genename_h<-"IGHM"
}
  if(cell.type=="T" & species=="mus" | species=="mouse" | species=="blc6" ){
    if(vdj.productive=="random"){
    seq_input_h<-mus_t_h
    seq_input_l<-mus_t_l
    }
    if(vdj.productive=="naive"){
    seq_input_h<-productive_seq$mus_t_h
    seq_input_l<-productive_seq$mus_t_l
    # ref_seq_h<-mus_t_h
    # ref_seq_l<-mus_t_l
    }
    if(vdj.productive=="vae"){
      seq_input_h<-vae_seq$mus_t_h
      seq_input_l<-vae_seq$mus_t_l
      # ref_seq_h<-mus_t_h
      # ref_seq_l<-mus_t_l
    }
    c_genename_h<-"TRBC"
  }
  if(cell.type=="T" & species=="hum" | species=="human"){
    if(vdj.productive=="random"){
    seq_input_h<-hum_t_h
    seq_input_l<-hum_t_l
    }
    if(vdj.productive=="naive"){
    seq_input_h<-productive_seq$hum_t_h
    seq_input_l<-productive_seq$hum_t_l
    # ref_seq_h<-hum_t_h
    # ref_seq_l<-hum_t_l
    }
    if(vdj.productive=="vae"){
      seq_input_h<-vae_seq$hum_t_h
      seq_input_l<-vae_seq$hum_t_l
      # ref_seq_h<-hum_t_h
      # ref_seq_l<-hum_t_l
    }
    c_genename_h<-"TRBC"
  }



#generate initial barcode vector------
barcode<-.GENERATE.BARCODE(initial.size.of.repertoire)

#initial----
for(i in 1:initial.size.of.repertoire){
  if(vdj.productive=="random"){
  #heavy chain seq
  indV <- sample(x = 1:nrow(seq_input_h[[1]]),size = 1,replace = FALSE)
  indD <- sample(x = 1:nrow(seq_input_h[[2]]),size = 1,replace=FALSE)
  indJ <- sample(x = 1:nrow(seq_input_h[[3]]),size = 1,replace=FALSE)
  #light chain seq
  indVL <- sample(x = 1:nrow(seq_input_l[[1]]),size = 1,replace = FALSE)
  indJL <- sample(x = 1:nrow(seq_input_l[[2]]),size=1,replace=FALSE)
  #seq
  seq[2*i-1]<- as.character(.VDJ_RECOMBIN_FUNCTION(as.character(seq_input_h[[1]][[2]][indV]),as.character(seq_input_h[[2]][[2]][indD]),as.character(seq_input_h[[3]][[2]][indJ]),
                                                   method=vdj.model,
                                                   chain.type="heavy",
                                                   species=species,
                                                   vdj.insertion.mean=vdj.insertion.mean,
                                                   vdj.insertion.stdv=vdj.insertion.stdv))
  seq[2*i] <- as.character(.VDJ_RECOMBIN_FUNCTION(as.character(seq_input_l[[1]][[2]][indVL]),"",as.character(seq_input_l[[2]][[2]][indJL]),
                                                  method=vdj.model,
                                                  chain.type="light",
                                                  species=species,
                                                  vdj.insertion.mean=vdj.insertion.mean,
                                                  vdj.insertion.stdv=vdj.insertion.stdv))

  #vdjc gene names
  v_gene[2*i-1]<-as.character(seq_input_h[[1]][[1]][indV])
  v_gene[2*i]<-as.character(seq_input_l[[1]][[1]][indVL])
  d_gene[2*i-1]<-as.character(seq_input_h[[2]][[1]][indD])
  d_gene[2*i]<-"None"
  j_gene[2*i-1]<-as.character(seq_input_h[[3]][[1]][indJ])
  j_gene[2*i]<-as.character(seq_input_l[[2]][[1]][indJL])
  c_gene[2*i-1]<-c_genename_h
  c_gene[2*i]<-paste0(stringr::str_sub(v_gene[2*i],1,3),"C")
  #ref seq
  v_seq[2*i-1]<-as.character(seq_input_h[[1]][[2]][indV])
  v_seq[2*i]<-as.character(seq_input_l[[1]][[2]][indVL])
  d_seq[2*i-1]<-as.character(seq_input_h[[2]][[2]][indD])
  d_seq[2*i]<-"None"
  j_seq[2*i-1]<-as.character(seq_input_h[[3]][[2]][indJ])
  j_seq[2*i]<-as.character(seq_input_l[[2]][[2]][indJL])
  #the very first ref seq
  ref_seq[2*i-1]<-seq[2*i-1]
  ref_seq[2*i]<-seq[2*i]
  }
  if(vdj.productive=="naive"|vdj.productive=="vae"){
    indH<-sample(x=1:nrow(seq_input_h),size=1)
    indL<-sample(x=1:nrow(seq_input_l),size=1)
    #seq
    seq[2*i-1]<- tolower(seq_input_h$targetSequences[indH])
    seq[2*i] <- tolower(seq_input_l$targetSequences[indL])
    #ref seq
    v_seq[2*i-1]<-tolower(seq_input_h$v_seq[indH])
    v_seq[2*i]<-tolower(seq_input_l$v_seq[indL])
    #which
    indH<-which(toupper(seq[2*i-1])==seq_input_h$targetSequences)
    indL<-which(toupper(seq[2*i])==seq_input_l$targetSequences)
    #vdjc gene names
    v_gene[2*i-1]<-seq_input_h$bestVHit[indH]
    v_gene[2*i]<-seq_input_l$bestVHit[indL]
    d_gene[2*i-1]<-seq_input_h$bestDHit[indH]
    d_gene[2*i]<-"None"
    j_gene[2*i-1]<-seq_input_h$bestJHit[indH]
    j_gene[2*i]<-seq_input_l$bestJHit[indL]
    c_gene[2*i-1]<-c_genename_h
    c_gene[2*i]<-paste0(stringr::str_sub(v_gene[2*i],1,3),"C")

    #cdr3
    cdr3[2*i-1]<-seq_input_h$aaSeqCDR3[indH]
    cdr3[2*i]<-seq_input_h$aaSeqCDR3[indL]
    cdr3_nt[2*i-1]<-seq_input_h$nSeqCDR3[indH]
    cdr3_nt[2*i]<-seq_input_h$nSeqCDR3[indL]
    #ref seq
    ref_seq[2*i-1]<-seq[2*i-1]
    ref_seq[2*i]<-seq[2*i]
  }

  #chain
  chain[2*i-1]<-stringr::str_sub(v_gene[2*i-1],1,3)
  chain[2*i]<-stringr::str_sub(v_gene[2*i],1,3)
  #length
  Length[2*i-1]<-nchar(seq[2*i-1])
  Length[2*i]<-nchar(seq[2*i])
  #contig
  raw_contig_id[2*i-1]<-paste(barcode[2*i-1],"contig_1",sep="_")
  raw_contig_id[2*i]<-paste(barcode[2*i],"contig_2",sep="_")
  #clonotype
  raw_clonotype_id[2*i-1]<-paste0("clonotype",i)
  raw_clonotype_id[2*i]<-paste0("clonotype",i)
  #single
  seq_combi[i]<- paste0(chain[2*i-1],":",seq[2*i-1],";",chain[2*i],":",seq[2*i])
  clonotype_id[i]<- paste0("clonotype",i)
  barcode_uniq[i]<-barcode[2*i]
  gen[i]<-1

  seq.number[i]<-i
  seq.history[i]<-seq_combi[i]
  igraph.index[[i]]<-c(seq.number[i])
  isotype[i]<-1

  trans_state[i]<-1
  if(cell.type=="T"&length(cd4_prob)>0){
    trans_state[i]<-sample(x=c(1,4),size = 1,replace = T, prob = c(cd4_prob,1-cd4_prob))
  }
  trans_state_his[i]<- trans_state[i]

  #special v_h&v_l combi check----

  if(length(cell.division.prob)>2&special.v.gene==T){

    pei<-which(special_v[,1]==v_gene[2*i-1])
    pei2<-which(special_v[,2]==v_gene[2*i])
    exe<-F
    if(length(pei)>0){
      if(length(pei2)>0|any(special_v[pei,2]=="")) exe<-T
    }
    if(length(pei2)>0){
      if(length(pei)>0|any(special_v[pei2,1]=="")) exe<-T
    }
    if(exe==T){
      special_row<-c(pei,pei2)[which(c(pei,pei2)>0)[1]]
      seq_selection<-sample(x=c(0,1),size = 1,replace = T, prob = c(special_v[,3][special_row],1-special_v[,3][special_row]))
      if(seq_selection==0){
        selected_rate<-c(selected_rate,stats::runif(1,cell.division.prob[3],cell.division.prob[length(cell.division.prob)]))
        selected_seq<-c(selected_seq,seq.number[i])
      }
    }
  }
}

  #check sequence selection initial----
  if(length(cell.division.prob)>2){
    seq_selection<-sample(x=c(0,1),size = length(seq.number),replace = T, prob = c(sequence.selection.prob,1-sequence.selection.prob))
    selected_rate<-c(selected_rate,stats::runif(sum(seq_selection==0),cell.division.prob[3],cell.division.prob[length(cell.division.prob)]))
    selected_seq<-c(selected_seq,seq.number[which(seq_selection==0)])
    if(special.v.gene==T){
    selected_rate<-selected_rate[!duplicated(selected_seq)]
    selected_seq<-selected_seq[!duplicated(selected_seq)]
    }
  }

  barcode.history<-barcode_uniq
#evolution----
  if(duration.of.evolution>=1){
    for(i in 1:duration.of.evolution){
  #new vdj----
    is_new_VDJ <- sample(x=c(0,1), replace=TRUE,size = 1, prob=c(vdj.branch.prob,1- vdj.branch.prob))
    if (is_new_VDJ==0 & length(barcode_uniq)<max.cell.number & clonotype_num< max.clonotype.number){
      clonotype_num<-clonotype_num+1
      l<-length(barcode)
      barcode<-.ADD.BARCODE(barcode)

      if(vdj.productive=="random"){
      #heavy chain seq
      indV <- sample(x = 1:nrow(seq_input_h[[1]]),size = 1,replace = FALSE)
      indD <- sample(x = 1:nrow(seq_input_h[[2]]),size = 1,replace=FALSE)
      indJ <- sample(x = 1:nrow(seq_input_h[[3]]),size = 1,replace=FALSE)
      #light chain seq
      indVL <- sample(x = 1:nrow(seq_input_l[[1]]),size = 1,replace = FALSE)
      indJL <- sample(x = 1:nrow(seq_input_l[[2]]),size=1,replace=FALSE)
      #seq
      seq[l+1]<- as.character(.VDJ_RECOMBIN_FUNCTION(as.character(seq_input_h[[1]][[2]][indV]),as.character(seq_input_h[[2]][[2]][indD]),as.character(seq_input_h[[3]][[2]][indJ]),
                                                     method=vdj.model,
                                                     chain.type="heavy",
                                                     species=species,
                                                     vdj.insertion.mean=vdj.insertion.mean,
                                                     vdj.insertion.stdv=vdj.insertion.stdv))
      seq[l+2] <- as.character(.VDJ_RECOMBIN_FUNCTION(as.character(seq_input_l[[1]][[2]][indVL]),"",as.character(seq_input_l[[2]][[2]][indJL]),
                                                      method=vdj.model,
                                                      chain.type="light",
                                                      species=species,
                                                      vdj.insertion.mean=vdj.insertion.mean,
                                                      vdj.insertion.stdv=vdj.insertion.stdv))



      #vdjc gene names
      v_gene[l+1]<-as.character(seq_input_h[[1]][[1]][indV])
      v_gene[l+2]<-as.character(seq_input_l[[1]][[1]][indVL])
      d_gene[l+1]<-as.character(seq_input_h[[2]][[1]][indD])
      d_gene[l+2]<-"None"
      j_gene[l+1]<-as.character(seq_input_h[[3]][[1]][indJ])
      j_gene[l+2]<-as.character(seq_input_l[[2]][[1]][indJL])
      c_gene[l+1]<-c_genename_h
      c_gene[l+2]<-paste0(stringr::str_sub(v_gene[l+2],1,3),"C")

      #vdj ref seq
      v_seq[l+1]<-as.character(seq_input_h[[1]][[2]][indV])
      v_seq[l+2]<-as.character(seq_input_l[[1]][[2]][indVL])
      d_seq[l+1]<-as.character(seq_input_h[[2]][[2]][indD])
      d_seq[l+2]<-"None"
      j_seq[l+1]<-as.character(seq_input_h[[3]][[2]][indJ])
      j_seq[l+2]<-as.character(seq_input_l[[2]][[2]][indJL])
      #another ref seq
      r<-length(ref_seq)
      ref_seq[r+1]<-seq[l+1]
      ref_seq[r+2]<-seq[l+2]
      }
      if(vdj.productive=="naive"|vdj.productive=="vae"){
        indH<-sample(x=1:nrow(seq_input_h),size=1)
        indL<-sample(x=1:nrow(seq_input_l),size=1)
        #seq
        seq[l+1]<- tolower(seq_input_h$targetSequences[indH])
        seq[l+2] <- tolower(seq_input_l$targetSequences[indL])
        #ref seq
        v_seq[l+1]<-tolower(seq_input_h$v_seq[indH])
        v_seq[l+2]<-tolower(seq_input_l$v_seq[indL])
        #vdjc gene names
        v_gene[l+1]<-seq_input_h$bestVHit[indH]
        v_gene[l+2]<-seq_input_l$bestVHit[indL]
        d_gene[l+1]<-seq_input_h$bestDHit[indH]
        d_gene[l+2]<-"None"
        j_gene[l+1]<-seq_input_h$bestJHit[indH]
        j_gene[l+2]<-seq_input_l$bestJHit[indL]
        c_gene[l+1]<-c_genename_h
        c_gene[l+2]<-paste0(stringr::str_sub(v_gene[l+2],1,3),"C")
        #cdr3
        cdr3[l+1]<-seq_input_h$aaSeqCDR3[indH]
        cdr3[l+2]<-seq_input_l$aaSeqCDR3[indL]
        cdr3_nt[l+1]<-seq_input_h$nSeqCDR3[indH]
        cdr3_nt[l+2]<-seq_input_l$nSeqCDR3[indL]
        #ref seq
        r<-length(ref_seq)
        ref_seq[r+1]<-seq[l+1]
        ref_seq[r+2]<-seq[l+2]
     }
    #common updates
      #chain
      chain[l+1]<-stringr::str_sub(v_gene[l+1],1,3)
      chain[l+2]<-stringr::str_sub(v_gene[l+2],1,3)
      #length
      Length[l+1]<-nchar(seq[l+1])
      Length[l+2]<-nchar(seq[l+2])
      #contig
      raw_contig_id[l+1]<-paste(barcode[l+1],"contig_1",sep="_")
      raw_contig_id[l+2]<-paste(barcode[l+2],"contig_2",sep="_")
      #clonotype
      raw_clonotype_id[l+1]<-paste("clonotype",clonotype_num,sep="")
      raw_clonotype_id[l+2]<-paste("clonotype",clonotype_num,sep="")
      #single format used for finding consensus later
      seq_combi[0.5*l+1]<- paste(chain[l+1],":",seq[l+1],";",chain[l+2],":",seq[l+2],sep="")
      clonotype_id[0.5*l+1]<- raw_clonotype_id[l+1]
      barcode_uniq[0.5*l+1]<- barcode[l+2]
      gen[0.5*l+1]<-1
      trans_state[0.5*l+1]<-1
      #history update
      hisl<-length(seq.history)
      barcode.history[hisl+1]<-barcode_uniq[0.5*l+1]
      seq.number[hisl+1]<-max(seq.number)+1
      seq.history[hisl+1]<-seq_combi[0.5*l+1]
      igraph.index[[clonotype_num]]<-c(seq.number[hisl+1])
      isotype[hisl+1]<-1
      ####!!!!
      if(cell.type=="T"&length(cd4_prob)>0){
        trans_state[0.5*l+1]<-sample(x=c(1,4),size = 1,replace = T, prob = c(cd4_prob,1-cd4_prob))
      }
      trans_state_his[hisl+1]<-trans_state[0.5*l+1]
      #check sequence selection in new VDJ####
      if(length(cell.division.prob)>2){
        seq_selection<-sample(x=c(0,1),size = 1, prob = c(sequence.selection.prob,1-sequence.selection.prob))

        if(special.v.gene==T){
          pei<-which(special_v[,1]==v_gene[l+1])
          pei2<-which(special_v[,2]==v_gene[l+2])
          exe<-F
          if(length(pei)>0){
            if(length(pei2)>0|any(special_v[pei,2]=="")) exe<-T
          }
          if(length(pei2)>0){
            if(length(pei)>0|any(special_v[pei2,1]=="")) exe<-T
          }
          if(exe==T){
            special_row<-c(pei,pei2)[which(c(pei,pei2)>0)[1]]
            seq_selection<-sample(x=c(0,1),size = 1,replace = T, prob = c(special_v[,3][special_row],1-special_v[,3][special_row]))
            if(seq_selection==0){
              selected_rate<-c(selected_rate,stats::runif(1,cell.division.prob[3],cell.division.prob[length(cell.division.prob)]))
              selected_seq<-c(selected_seq,seq.number[hisl+1])
            }
          }
        }
        if(special.v.gene==F){
          if(seq_selection==0){
            selected_rate<-c(selected_rate,stats::runif(1,cell.division.prob[3],cell.division.prob[length(cell.division.prob)]))
            selected_seq<-c(selected_seq,seq.number[hisl+1])
          }
        }
      }

    }
    if(complete.duration==F&!(length(barcode_uniq)<max.cell.number & clonotype_num< max.clonotype.number)) break
  #cell division----
    if(length(barcode_uniq)<max.cell.number & clonotype_num< max.clonotype.number){
      #check existing cell division for every cell
      for(j in 1:length(barcode_uniq)){

        if(clonal.selection==F){
        is_cell_division<- .CELL.DIVISION.LINEAR.INVERSE(clonotype_id,j,cell.division.prob)
        }

        if(clonal.selection==T){
        is_cell_division<- .CELL.DIVISION.LINEAR(clonotype_id,j,cell.division.prob)
        }

        if(length(cell.division.prob)>2){
          seq_number_j<-seq.number[which(barcode.history==barcode_uniq[j])]
          if(length(which(selected_seq==seq_number_j))>0){
            rate_j<-selected_rate[which(selected_seq==seq_number_j)]
            #!!!!!wenti
            is_cell_division<-sample(x=c(0,1),size=1,prob =c(rate_j,1-rate_j))
          }
        }
        if (is_cell_division==0&length(barcode_uniq)<max.cell.number & clonotype_num< max.clonotype.number ){

          l<-length(barcode)
          #barcode
          barcode<-.ADD.BARCODE(barcode)
          #seq stays the same
          seq[l+1]<-seq[2*j-1]
          seq[l+2]<-seq[2*j]
          #length
          Length[l+1]<-nchar(seq[2*j-1])
          Length[l+2]<-nchar(seq[2*j])
          #contig
          raw_contig_id[l+1]<-paste(barcode[l+1],"contig_1",sep="_")
          raw_contig_id[l+2]<-paste(barcode[l+2],"contig_2",sep="_")
          #clonotype stay the same
          raw_clonotype_id[l+1]<-clonotype_id[j]
          raw_clonotype_id[l+2]<-clonotype_id[j]
          #vdj seq for reference
          if(vdj.productive=="random"){
            d_seq[l+1]<-d_seq[2*j-1]
            d_seq[l+2]<-"None"
            j_seq[l+1]<-j_seq[2*j-1]
            j_seq[l+2]<-j_seq[2*j]
          }

          v_seq[l+1]<-v_seq[2*j-1]
          v_seq[l+2]<-v_seq[2*j]
          #gene names
          v_gene[l+1]<-v_gene[2*j-1]
          v_gene[l+2]<-v_gene[2*j]
          d_gene[l+1]<-d_gene[2*j-1]
          d_gene[l+2]<-"None"
          j_gene[l+1]<-j_gene[2*j-1]
          j_gene[l+2]<-j_gene[2*j]
          c_gene[l+1]<-c_gene[2*j-1]
          c_gene[l+2]<-c_gene[2*j]
          #cdr3
          if(vdj.productive=="naive"|vdj.productive=="vae"){
            cdr3[l+1]<-cdr3[2*j-1]
            cdr3[l+2]<-cdr3[2*j]
            cdr3_nt[l+1]<-cdr3_nt[2*j-1]
            cdr3_nt[l+2]<-cdr3_nt[2*j]
          }
          #chain names
          chain[l+1]<-stringr::str_sub(v_gene[l+1],1,3)
          chain[l+2]<-stringr::str_sub(v_gene[l+2],1,3)
          #single format
          trans_state[0.5*l+1]<-trans_state[j]
          seq_combi[0.5*l+1]<- paste(chain[l+1],":", seq[l+1],";",chain[l+2],":",seq[l+2],sep ="")
          clonotype_id[0.5*l+1]<- raw_clonotype_id[l+1]
          barcode_uniq[0.5*l+1]<-barcode[l+1]
          gen[0.5*l+1]<-gen[j]+1

          #history format
          n<- which(barcode.history==barcode_uniq[j])
          hisl<-length(seq.history)

          isotype[hisl+1]<-isotype[n]
          barcode.history[hisl+1]<-barcode[l+1]
          seq.number[hisl+1]<-seq.number[n]#seq number should be the same as the mother cell
          seq.history[hisl+1]<-seq.history[n]
          trans_state_his[hisl+1]<-trans_state_his[n]

        }
      }# for loop end
    }#if number check end, cell division end
    if(complete.duration==F&!(length(barcode_uniq)<max.cell.number & clonotype_num< max.clonotype.number)) break
  #transcriptome switch independent----
    trans_switch<-rep(F,length(barcode_uniq))
    if(transcriptome.switch.independent== T){
      for(k in 1: length(barcode_uniq)) {
        switched<-.TRANS.SWITCH(trans_state[k],transcriptome.switch.prob)
        if(switched[[2]]){
        trans_state[k]<-switched[[1]]
        m<-which(barcode.history==barcode_uniq[k])
        trans_state_his[m]<-switched[[1]]
        trans_switch[k]<-T
        }
      }
    }
    if(transcriptome.switch.selection.dependent==T){
      for(k in 1: length(barcode_uniq)) {
        m<-which(barcode.history==barcode_uniq[k])
        if(length(which(selected_seq==seq.number[m]))>0&trans_state[k]==1){
          switched<-.TRANS.SWITCH.DEPENDANT(trans_state[k],transcriptome.switch.prob)
          trans_state[k]<-switched[[1]]
          m<-which(barcode.history==barcode_uniq[k])
          trans_state_his[m]<-switched[[1]]
          trans_switch[k]<-T
        }
      }
    }

  #b cell special

  #class switch & mutation----
    if (cell.type =="B"){
      for(k in 1: length(barcode_uniq)) {
        # class switch----
        n<- which(barcode.history==barcode_uniq[k])
        selected<-(length(which(selected_seq==seq.number[n]))>0)
        if(class.switch.independent==T){
        switched<-.TRANS.SWITCH(isotype[n],class.switch.prob)
          if(switched[[2]]){
          isotype[n]<-switched[[1]]
          c_gene[2*k-1]<-colnames(class.switch.prob)[switched[[1]]]
          }
        }
        #selected sequence make sure switch
        if(class.switch.selection.dependent==T&isotype[n]==1&selected){
          switched<-.TRANS.SWITCH.DEPENDANT(isotype[n],class.switch.prob)
          isotype[n]<-switched[[1]]
          c_gene[2*k-1]<-colnames(class.switch.prob)[switched[[1]]]
        }
        #trans switch isotype dependent----
        if(switched[[2]]&trans_switch[k] == F & transcriptome.switch.isotype.dependent== T&sum(transcriptome.switch.prob[trans_state[k],])>0){
            switched<-.TRANS.SWITCH.DEPENDANT(trans_state[k],transcriptome.switch.prob)
            trans_state[k]<-switched[[1]]
            m<-which(barcode.history==barcode_uniq[k])
            trans_state_his[m]<-switched[[1]]
            trans_switch[k]<-T
          }

        #mutation----

        if((SHM.isotype.dependent==F|length(which(iso_SHM_prob[,1]==c_gene[2*k-1]))==0)&
           (SHM.phenotype.dependent==F|length(which(pheno_SHM_prob[,1]==colnames(transcriptome.switch.prob)[trans_state[k]]))==0)
           ){
          mut_h<-.SHM_FUNCTION_SEQUENCE4(seq[2*k-1],SHM.method,v_seq[2*k-1],SHM.nuc.prob)
          mut_l<-.SHM_FUNCTION_SEQUENCE4(seq[2*k],SHM.method,v_seq[2*k],SHM.nuc.prob)
          Ctrl<-mut_h[[2]]+mut_l[[2]]
        }

        if(!(
          (SHM.isotype.dependent==F|length(which(iso_SHM_prob[,1]==c_gene[2*k-1]))==0)&
          (SHM.phenotype.dependent==F|length(which(pheno_SHM_prob[,1]==colnames(transcriptome.switch.prob)[trans_state[k]]))==0)
           )){
          new.SHM.nuc.prob<-max(iso_SHM_prob[which(iso_SHM_prob[,1]==c_gene[2*k-1]),2],pheno_SHM_prob[which(pheno_SHM_prob[,1]==colnames(transcriptome.switch.prob)[trans_state[k]]),2])
          mut_h<-.SHM_FUNCTION_SEQUENCE4(seq[2*k-1],SHM.method,v_seq[2*k-1],new.SHM.nuc.prob)
          mut_l<-.SHM_FUNCTION_SEQUENCE4(seq[2*k],SHM.method,v_seq[2*k],new.SHM.nuc.prob)
          Ctrl<-mut_h[[2]]+mut_l[[2]]
        }


        if(Ctrl>0){
          #update seq
          seq[2*k-1]<- mut_h[[1]]
          seq[2*k]<-mut_l[[1]]

          #update length
          Length[2*k-1]<-nchar(seq[2*k-1])
          Length[2*k]<-nchar(seq[2*k])

          #update single format
          seq_combi[k]<- paste(chain[2*k-1],":",seq[2*k-1],";",chain[2*k],":",seq[2*k],sep ="")

          #update igraph history
          hisl<-length(seq.history)
          m<-which(barcode.history == barcode_uniq[k])

          barcode.history[m]<-NA
          barcode.history[hisl+1]<-barcode_uniq[k]
          seq.number[hisl+1]<-max(seq.number)+1
          seq.history[hisl+1]<-seq_combi[k]
          isotype[hisl+1]<-isotype[m]
          trans_state_his[hisl+1] <- trans_state_his[m]

          clonotype.number<-as.numeric(stringr::str_sub(clonotype_id[k],10,-1))
          igraph.index[[clonotype.number]]<-c(igraph.index[[clonotype.number]],seq.number[m],seq.number[hisl+1])

          #check sequence selection in mutation
          if(length(cell.division.prob)>2){
            seq_selection<-sample(x=c(0,1),size = 1,prob = c(sequence.selection.prob,1-sequence.selection.prob))
            if(seq_selection==0){
              selected_rate<-c(selected_rate,stats::runif(1,cell.division.prob[3],cell.division.prob[length(cell.division.prob)]))
              selected_seq<-c(selected_seq,seq.number[hisl+1])
            }
          }

        #try: modify cdr3
        #trans switch SHM dependent
        if(trans_switch[k] == F & transcriptome.switch.isotype.dependent== T&sum(transcriptome.switch.prob[trans_state[k],])>0){
          switched<-.TRANS.SWITCH.DEPENDANT(trans_state[k],transcriptome.switch.prob)
          trans_state[k]<-switched[[1]]
          m<-which(barcode.history==barcode_uniq[k])
          trans_state_his[m]<-switched[[1]]
          trans_switch[k]<-T
        }
      }
      }
    }#B cell special end
  #cell death----
    if(length(barcode_uniq<max.cell.number)&clonotype_num<max.clonotype.number){
      for(k in 1:length(barcode_uniq)){
        is_death <- sample(x=c(0,1), replace=TRUE,size = 1, prob=c(death.rate, 1- death.rate))
        if (is_death ==0){
          #fix row for double and history form first
          del<-c(-(2*k-1),-2*k)
          del.his<-which(barcode.history==barcode_uniq[k])
          #delete double format
          barcode<-barcode[del]
          seq<-seq[del]
          Length<-Length[del]
          raw_contig_id<-raw_contig_id[del]
          raw_clonotype_id<-raw_clonotype_id[del]
          v_gene<-v_gene[del]
          d_gene<-d_gene[del]
          j_gene<-j_gene[del]
          c_gene<-c_gene[del]
          chain<-chain[del]
          v_seq<- v_seq[del]
          d_seq<-d_seq[del]
          j_seq<-j_seq[del]
          cdr3<-cdr3[del]
          cdr3_nt<-cdr3_nt[del]
          #single format
          barcode_uniq<-barcode_uniq[-k]
          seq_combi<-seq_combi[-k]
          clonotype_id<-clonotype_id[-k]
          gen<-gen[-k]
          trans_state<-trans_state[-k]

          #history format
          barcode.history[del.his]<-NA
        }
      }
    }#cell death end
  }#evolution ends----
  }
#write data frames----
  #all_contig_annotations

  #dummy
  is_cell<-rep("Ture",length(barcode))
  high_confidence<-is_cell
  full_length<-is_cell
  productive<-is_cell
  reads<-rep(1000,length(barcode))
  umis<-rep(10,length(barcode))

  if(vdj.productive=="random"){
  cdr3<-rep("KKKK",length(barcode))
  cdr3_nt<-rep("AAAAAAAAAAAA",length(barcode))
  all_contig_annotations<-data.frame(barcode,is_cell,raw_contig_id,high_confidence,Length,chain,v_gene,d_gene,j_gene,c_gene,full_length,productive,cdr3,cdr3_nt,reads,umis,raw_clonotype_id)
  all_contig_annotations$raw_consensus_id<-paste(raw_clonotype_id,"consensus",substr(raw_contig_id,nchar(raw_contig_id),nchar(raw_contig_id)),sep = "_")
  }

   #clonotypes from contig annotations
  clonotypes<-as.data.frame(table(clonotype_id))
  colnames(clonotypes)<-c("clonotype_id","frequency")
  clonotypes$proportion<-clonotypes$frequency/sum(clonotypes$frequency)

  if(vdj.productive=="naive"|vdj.productive=="vae"){
    #single format cdr3 for clonotype
    cdr3_combi<-.GENERATE.COMBI(cdr3,chain)
    cdr3_nt_combi<-.GENERATE.COMBI(cdr3_nt,chain)
    cdr3_df<-data.frame(clonotype_id,cdr3_combi,cdr3_nt_combi)
    clonotypes<-merge(clonotypes,cdr3_df,by="clonotype_id",all.x=T)
    clonotypes<-clonotypes[!(duplicated(clonotypes$clonotype_id)),]
    names(clonotypes)[4:5]<-c("cdr3","cdr3_nt")
    all_contig_annotations<-data.frame(barcode,is_cell,raw_contig_id,high_confidence,Length,chain,v_gene,d_gene,j_gene,c_gene,full_length,productive,cdr3,cdr3_nt,reads,umis,raw_clonotype_id)
    all_contig_annotations$raw_consensus_id<-paste(raw_clonotype_id,"consensus",substr(raw_contig_id,nchar(raw_contig_id),nchar(raw_contig_id)),sep = "_")
  }

  raw_consensus<-.WRITE.CONSENSUS(clonotype_id,sequence.combined = seq_combi,barcode.unique = barcode_uniq,clonotypes.dataframe = clonotypes)
  # all_contig_annotations$clonotype_chain<-paste(all_contig_annotations$raw_clonotype_id,stringr::str_sub(all_contig_annotations$raw_contig_id,-1,-1),sep="_")
  # all_contig_annotations<-merge(all_contig_annotations,raw_consensus,by="clonotype_chain",all=TRUE)
  # all_contig_annotations<-subset(all_contig_annotations,select=c("barcode","raw_contig_id","Length","chain","v_gene", "d_gene","j_gene","c_gene","raw_clonotype_id","raw_consensus_id"))
  consensus<-raw_consensus[,-1]


  #all_contig
  all_contig<-data.frame(raw_contig_id,seq)
  #reference
  if(vdj.productive=="random"){
    reference_id<-paste0(raw_clonotype_id,"_concat_ref_",stringr::str_sub(raw_contig_id,-1,-1))
    reference_seq<-c()
    d_seq1<-gsub(pattern = "None",replacement = "",x = d_seq)

    reference_seq<-paste0(v_seq,d_seq1,j_seq)
    reference<-data.frame(reference_id,reference_seq)
    reference<-reference[!(duplicated(reference$reference_id)),]

    reference_id_real<-c()
    for(cn in 1:clonotype_num){
      reference_id_real[2*cn-1]<-paste0("clonotype",cn,"_concat_ref_",1)
      reference_id_real[2*cn]<-paste0("clonotype",cn,"_concat_ref_",2)
    }
    reference_real<-data.frame(reference_id_real,ref_seq)
    colnames(reference_real)<-c("reference_id","reference_seq")
  }
  if(vdj.productive=="naive"|vdj.productive=="vae"){
    reference_id<-c()
    for(cn in 1:clonotype_num){
      reference_id[2*cn-1]<-paste0("clonotype",cn,"_concat_ref_",1)
      reference_id[2*cn]<-paste0("clonotype",cn,"_concat_ref_",2)
    }
    reference_real<-data.frame(reference_id,ref_seq)
    colnames(reference_real)<-c("reference_id","reference_seq")
    reference<-reference_real
  }
  history<-data.frame(seq.number,barcode.history,seq.history,isotype,trans_state_his)

  #igraph
  if(igraph.on==T){
  #igraph
  size.of.vertex<-as.data.frame(stats::aggregate(barcode.history~seq.number,history,length,na.action = stats::na.pass))
  #NA count as 1
  size.of.vertex1<-as.data.frame(stats::aggregate(barcode.history~seq.number,history,length))
  size.of.vertex<-merge(size.of.vertex1,size.of.vertex,all=T, by="seq.number")[,-3]
  size.of.vertex[!(size.of.vertex$seq.number %in% size.of.vertex1$seq.number),2]<-0
  #NA count as 0
  colnames(size.of.vertex)<-c("seq.number","size")
  rm(size.of.vertex1)

  #isotype distribution list
  history$isotype[is.na(history$barcode.history)]<-9999
  history$trans_state_his[is.na(history$barcode.history)]<-9999

  isotype.distribution<-reshape2::dcast(history, seq.number~isotype, value.var ="isotype",length)[,-1]
  if(length(which(colnames(isotype.distribution)==9999))!=0){
    isotype.distribution<-isotype.distribution[,1:ncol(isotype.distribution)-1]
    }
  isotype.name<-colnames(isotype.distribution)
  isotype.distribution <-as.list(as.data.frame(t(isotype.distribution)))
  #transcriptome state distribution list
  trans_state_distribution<-reshape2::dcast(history, seq.number~trans_state_his, value.var ="trans_state_his",length)[,-1]
  if(length(which(colnames(trans_state_distribution)==9999))!=0){
    trans_state_distribution<-trans_state_distribution[,1:ncol(trans_state_distribution)-1]
  }
  trans_state_name<-colnames(trans_state_distribution)
  trans_state_distribution <-as.list(as.data.frame(t(trans_state_distribution)))

  #write igraph list and assign attr
  for (i in 1:length(igraph.index)){

    if (length(igraph.index[[i]])>1){
      igraph.index[[i]]<-c(0,igraph.index[[i]])
      igraph.index.attr[[i]]<-data.frame(igraph.index[[i]])
      colnames(igraph.index.attr[[i]])<-"seq.number"

      igraph.index.jr[[i]]<-data.frame(igraph.index[[i]])
      colnames(igraph.index.jr[[i]])<-"seq.number"

      #delete duplicated for vertex attribute
      igraph.index.attr[[i]]<-subset(igraph.index.attr[[i]],!duplicated(igraph.index.attr[[i]]$seq.number))
      #size of vertex
      igraph.index.attr[[i]]<-merge(size.of.vertex,igraph.index.attr[[i]],by="seq.number",all.y=T)
      igraph.index.attr[[i]]$size[1]<-0
      igraph.index.attr[[i]]$adj_size<-(igraph.index.attr[[i]]$size*60/(range(igraph.index.attr[[i]]$size)[2]-range(igraph.index.attr[[i]]$size)[1]))^0.5+15 #visual size of vertex a number between 20 and 60
      #size of empty
      igraph.index.attr[[i]]$adj_size<-replace(igraph.index.attr[[i]]$adj_size, igraph.index.attr[[i]]$size==0, 10)

      #simplified index
      igraph.index.jr[[i]]$No<-match(x=igraph.index.jr[[i]]$seq.number,table=igraph.index.attr[[i]]$seq.number)

      #find empty
      size0<-which(igraph.index.attr[[i]]$size==0)

      No<- igraph.index.jr[[i]]$No

      #match isotype.distribution to each vertex
      pie.values<-list()
      for(j in 2:length(igraph.index.attr[[i]]$seq.number)){
        pie.values[[j]]<-isotype.distribution[[igraph.index.attr[[i]]$seq.number[j]]]
      }
      pie.values.list[[i]]<-pie.values

      #match trans state distribution to each vertex
      pie_values_trans<-list()
      for(j in 2:length(igraph.index.attr[[i]]$seq.number)){
        pie_values_trans[[j]]<-trans_state_distribution[[igraph.index.attr[[i]]$seq.number[j]]]
      }
      pie_values_trans_list[[i]]<-pie_values_trans

      #write graph list
      g<-igraph::graph(No,directed = T)
      g$layout<-igraph::layout_as_tree
      igraph::V(g)$label<-igraph.index.attr[[i]]$size
      igraph::V(g)$label[1]<-"Germline"
      igraph::V(g)$size<-igraph.index.attr[[i]]$adj_size
      igraph::V(g)$label.dist<-3

      igraph::V(g)$shape<-"pie"

      if(length(size0)>1){
        for(k in 1:length(size0)){
          igraph::V(g)$shape[[size0[k]]]<-"circle"
        }
      }
      if(length(size0)==1 ){
        igraph::V(g)$shape[[size0]]<-"circle"
      }

      igraph::V(g)$pie.color<-list(colors)
      igraph::V(g)$color<-"gray"
      igraph::V(g)$color[1]<-"black"
      p<-g
      ###!!!
      igraph::V(g)$pie<-pie.values.list[[i]]
      igraph::V(p)$pie<-pie_values_trans_list[[i]]



      igraph_list[[i]]<-g
      igraph_list_trans[[i]]<-p
    }

    if (length(igraph.index[[i]])==1){
      igraph.index.attr[[i]]<-size.of.vertex$size[which(size.of.vertex$seq.number==igraph.index[[i]])]#size
      igraph.index.jr[[i]]<-data.frame(igraph.index[[i]])
      pie.values.list[[i]]<-list(isotype.distribution[igraph.index[[i]]][[1]])
      pie_values_trans_list[[i]]<-list(trans_state_distribution[igraph.index[[i]]][[1]])

      g<-igraph::graph(c(1,1), edges = NULL)
      g$layout<-igraph::layout_as_tree
      igraph::V(g)$size<-igraph.index.attr[[i]]/30+20
      igraph::V(g)$label<-igraph.index.attr[[i]]
      igraph::V(g)$label.dist<-5
      igraph::V(g)$shape<-"pie"
      igraph::V(g)$pie.color<-list(colors)
      p<-g
      igraph::V(g)$pie<-pie.values.list[[i]]
      igraph::V(p)$pie<-pie_values_trans_list[[i]]

      igraph_list[[i]]<-g
      igraph_list_trans[[i]]<-p
    }
  }
  }
  if(igraph.on==F){
   igraph_list<-"none"
   igraph_list_trans<-"none"
 }

  #transcriptome
  if(transcriptome.on==T){
    ngene<-nrow(transcriptome.states)
    ncell<-length(barcode_uniq)
    transcriptome<-as.data.frame(matrix(0,ngene,ncell,dimnames = list(rownames(transcriptome.states),barcode_uniq)))
    for(i in 1:length(barcode_uniq)){
      transcriptome[,i]<-eval(parse(text=trans_dis[trans_state[i]]))
    }
    transcriptome<-as.matrix(transcriptome)
    transcriptome[transcriptome<0]<-0
  }
  if(transcriptome.on==F){
    transcriptome<-"none"
  }

  #Seq_Name

  Seq_Name<-"none"

  if(seq.name!=F){
    if(seq.name>clonotype_num){
    warning("seq.name > total number of clones")
    seq.name<-clonotype_num
    }
    c_gene_h<-c_gene[which(nrow(c_gene)%%2!=0)]
    Name<-paste0(clonotype_id,"_",c_gene_h,"_",barcode_uniq,"_cluster",trans_state)

    seq_germ<-c()
    for(i in 1:clonotype_num){
      seq_germ[i]<-paste(reference$reference_seq[2*i-1],reference$reference_seq[2*i],sep = "_")
    }

    clonotype_id_germ<-paste0("clonotype",1:clonotype_num)
    Name_germ<-rep("germline_clusterUnknown",clonotype_num)

    Seq<-c(seq_germ,seq_combi)
    Seq<-gsub(pattern = ";IGK:",replacement = "_",x = Seq)
    Seq<-gsub(pattern = ";IGL:",replacement = "_",x = Seq)
    Seq<-gsub(pattern = ";IGH:",replacement = "",x = Seq)

    Name<-c(Name_germ,Name)
    clonotype_id_temp<-c(clonotype_id_germ,clonotype_id)

    Seq_Name_temp<-data.frame(Seq,Name,clonotype_id_temp)
    frequency<-NULL
    uniq_clone<-dplyr::arrange(clonotypes,dplyr::desc(frequency))[1:seq.name,1]
    Seq_Name<-list()
  for(s in 1:length(uniq_clone)){
    Seq_Name[[s]]<- Seq_Name_temp[which(Seq_Name_temp$clonotype_id_temp==uniq_clone[s]),-3]
    }
  }

  #Version & Parameter
  version<-R.Version()
  parameters<-list(initial.size.of.repertoire,
  species,
  cell.type,
  cd4.proportion,
  duration.of.evolution,
  complete.duration,
  vdj.productive,
  vdj.model,
  vdj.insertion.mean,
  vdj.insertion.stdv,
  vdj.branch.prob,
  clonal.selection,
  cell.division.prob,
  sequence.selection.prob,
  special.v.gene,
  class.switch.prob,
  class.switch.selection.dependent,
  class.switch.independent,
  SHM.method,
  SHM.nuc.prob,
  SHM.isotype.dependent,
  SHM.phenotype.dependent,
  max.cell.number,
  max.clonotype.number,
  death.rate,
  igraph.on,
  transcriptome.on,
  transcriptome.switch.independent,
  transcriptome.switch.prob,
  transcriptome.switch.isotype.dependent,
  transcriptome.switch.SHM.dependent,
  transcriptome.switch.selection.dependent,
  transcriptome.states,
  transcriptome.noise,
  seq.name)

  names(parameters)<-c("initial.size.of.repertoire",
  "species",
  "cell.type",
  "cd4.proportion",
  "duration.of.evolution",
  "complete.duration",
  "vdj.productive",
  "vdj.model",
  "vdj.insertion.mean",
  "vdj.insertion.stdv",
  "vdj.branch.prob",
  "clonal.selection",
  "cell.division.prob",
  "sequence.selection.prob",
  "special.v.gene",
  "class.switch.prob",
  "class.switch.selection.dependent",
  "class.switch.independent",
  "SHM.method",
  "SHM.nuc.prob",
  "SHM.isotype.dependent",
  "SHM.phenotype.dependent",
  "max.cell.number",
  "max.clonotype.number",
  "death.rate",
  "igraph.on",
  "transcriptome.on",
  "transcriptome.switch.independent",
  "transcriptome.switch.prob",
  "transcriptome.switch.isotype.dependent",
  "transcriptome.switch.SHM.dependent",
  "transcriptome.switch.selection.dependent",
  "transcriptome.states",
  "transcriptome.noise",
  "seq.name")
#return----

  output_list<-list(all_contig_annotations,clonotypes,all_contig,consensus,reference,reference_real,transcriptome,igraph_list,igraph_list_trans,Seq_Name,igraph.index.attr,history,igraph.index,selected_seq,version,parameters)
  names(output_list)<-c("all_contig_annotations","clonotypes","all_contig","consensus","reference","reference_real","transcriptome","igraph_list_iso","igraph_list_trans","Seq_Name","igraph.index.attr","history","igraph.index","selected.seq","version","parameters")
  return(output_list)

}#function end

Try the Platypus package in your browser

Any scripts or data that you put into this service are public.

Platypus documentation built on Aug. 15, 2022, 9:08 a.m.