kmer_random_forest: Score kmer-importance through a random forest
In lindberg-m/contextendR: Analyse genomic data with regard to sequence contexts

Kmer "selection" is performed in one step (if kmer candidates are supplied) or two steps. The first step uses odds ratios and p-values from a Fisher test (see 'kmer_freq') to identify kmers that are significantly associated with mutation probability. In the second step, these kmers, together with trinucleotide patterns, are incorporated in a random forest model. The mean decrease in Gini from this forest, together with the odds ratios and p-values from the Fisher test, can be used to estimate kmer importance.
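The preselection step can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the counts are invented, the layout of the 2x2 table and the helper `fisher_one` are assumptions, and ranking the significant kmers by p-value before keeping `n_keep` of them is a guess at how the cutoff and the limit interact.

```r
# Toy counts per kmer (invented for illustration): number of sequence windows
# with and without the kmer, split by whether the window carries a mutation.
counts <- data.frame(
  kmer        = c("ACGTA", "TTCGA", "GGCAT"),
  mut_with    = c(40, 12, 25),  # mutated windows containing the kmer
  mut_without = c(60, 88, 75),  # mutated windows lacking the kmer
  ref_with    = c(10, 11, 24),  # unmutated windows containing the kmer
  ref_without = c(90, 89, 76)   # unmutated windows lacking the kmer
)

# Hypothetical helper: run stats::fisher.test on one 2x2 table and
# return the odds ratio and p-value.
fisher_one <- function(a, b, c, d) {
  ft <- fisher.test(matrix(c(a, b, c, d), nrow = 2))
  c(odds_ratio = unname(ft$estimate), p_value = ft$p.value)
}

# One row per kmer, columns "odds_ratio" and "p_value"
stats <- t(mapply(fisher_one,
                  counts$mut_with, counts$mut_without,
                  counts$ref_with, counts$ref_without))
rownames(stats) <- counts$kmer

# Keep significant kmers, most significant first, at most n_keep of them
pval_cutoff <- 0.001
n_keep <- 2
sig <- stats[stats[, "p_value"] < pval_cutoff, , drop = FALSE]
sig <- sig[order(sig[, "p_value"]), , drop = FALSE]
selected <- head(rownames(sig), n_keep)
print(stats)
print(selected)
```

In this toy data only the first kmer is strongly enriched among mutated windows, so it is the only one that passes the cutoff.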

kmer_random_forest(
  dataset,
  ks = 5,
  kmers = NULL,
  pval_cutoff = 0.001,
  n_keep = 80,
  maxnodes = 20,
  cores = NULL,
  n_trees = 720,
  include_fit = FALSE
)

`dataset`	GRanges object with a 'sequence.pyr' column containing the sequence region and a 'mut.pyr' column containing mutations
`ks`	Int. Size of kmers to be used in the model
`kmers`	Character vector of candidate kmers (optional). Note that the arguments "ks", "pval_cutoff" and "n_keep" are ignored if candidates are supplied
`pval_cutoff`	Numeric. P-value cutoff for the Fisher test used in preselection
`n_keep`	Positive Int. Number of kmers to include after preselection
`maxnodes`	Parameter controlling the depth of the decision trees in the random forest
`cores`	Number of cores to use for parallelization
`n_trees`	Number of trees in forest
`include_fit`	Bool. Include resulting fit from random forest training

A list containing: (1) MeanDecreaseGini information on kmers, (2) a 3D array of p-values and odds ratios of kmers, and optionally (3) the random forest fit
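To show what a mean decrease in Gini looks like, here is a small sketch using the CRAN package randomForest. That this package is the backend of `kmer_random_forest`, and that `n_trees` and `maxnodes` map onto its `ntree` and `maxnodes` arguments, are assumptions; the feature matrix and outcome are simulated and the kmer names are hypothetical.

```r
library(randomForest)  # assumed backend, not confirmed by this page

set.seed(1)
n <- 200

# Hypothetical binary features: does each window contain the kmer?
x <- data.frame(
  ACGTA = rbinom(n, 1, 0.3),
  TTCGA = rbinom(n, 1, 0.3),
  GGCAT = rbinom(n, 1, 0.3)
)

# Simulated outcome in which only ACGTA raises the mutation probability
y <- factor(rbinom(n, 1, ifelse(x$ACGTA == 1, 0.8, 0.2)), levels = c(0, 1))

# Fit a classification forest with the defaults listed in the usage above
fit <- randomForest(x = x, y = y, ntree = 720, maxnodes = 20)

# type = 2 selects the node-impurity measure (MeanDecreaseGini)
gini <- importance(fit, type = 2)[, "MeanDecreaseGini"]
print(sort(gini, decreasing = TRUE))
```

Features that the trees split on often, and that separate mutated from unmutated windows well, receive larger values, so the informative kmer should rank first here.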

lindberg-m/contextendR documentation built on Jan. 8, 2022, 3:16 a.m.