kmer_random_forest: Score kmer-importance through a random forest

View source: R/kmer_random_forest.R

Description

Kmer "selection" is performed in one step (if kmer candidates are supplied) or in two steps. The first step uses odds ratios and p-values from a Fisher test (see 'kmer_freq') to identify kmers that are significantly associated with mutation probability. These kmers, together with trinucleotide patterns, are then incorporated in a random forest model. The mean decrease in Gini from this forest, together with the odds ratios and p-values from the Fisher test, can be used to estimate kmer importance.
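The preselection step described above can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: the counts, the `candidates` table, and the rule of ranking significant kmers by p-value are all assumptions made for illustration. Only `fisher.test` is a real base R function here.

```r
# Hypothetical 2x2 counts for a single kmer (illustrative numbers, not real data)
counts <- matrix(
  c(30,  70,   # kmer present: mutated, unmutated
    10, 190),  # kmer absent:  mutated, unmutated
  nrow = 2, byrow = TRUE,
  dimnames = list(kmer = c("present", "absent"),
                  site = c("mutated", "unmutated"))
)

ft <- fisher.test(counts)
odds_ratio <- unname(ft$estimate)  # conditional maximum-likelihood odds ratio
p_value <- ft$p.value

# Hypothetical per-kmer results, as if the test had been run for each kmer
candidates <- data.frame(
  kmer = c("AACGT", "TTCGA", "GGCAT", "CCTAG"),
  p_value = c(1e-6, 5e-4, 0.02, 1e-8),
  odds_ratio = c(2.1, 1.6, 1.1, 0.4),
  stringsAsFactors = FALSE
)
pval_cutoff <- 0.001
n_keep <- 2

# Assumed rule: keep kmers below the cutoff, ranked by p-value, up to n_keep
significant <- candidates[candidates$p_value < pval_cutoff, ]
significant <- significant[order(significant$p_value), ]
selected <- head(significant$kmer, n_keep)
```

How the package actually ranks kmers before applying 'n_keep' is not stated on this page, so the ordering by p-value should be read as a placeholder.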

Usage

kmer_random_forest(
  dataset,
  ks = 5,
  kmers = NULL,
  pval_cutoff = 0.001,
  n_keep = 80,
  maxnodes = 20,
  cores = NULL,
  n_trees = 720,
  include_fit = FALSE
)

Arguments

dataset

GRanges object with a 'sequence.pyr' column containing the sequence region and a 'mut.pyr' column containing mutations

ks

Int. Size of kmers to be used in the model

kmers

Character vector of candidate kmers (optional). Note that the arguments "ks", "pval_cutoff" and "n_keep" are ignored if candidates are supplied

pval_cutoff

Numeric. P-value cutoff for the Fisher test used in preselection

n_keep

Positive Int. Number of kmers to include after preselection

maxnodes

Parameter controlling the depth of the decision trees in the random forest

cores

Number of cores to use for parallelization

n_trees

Number of trees in forest

include_fit

Bool. Include resulting fit from random forest training
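To make 'maxnodes', 'n_trees' and the mean decrease in Gini concrete, the sketch below fits a small forest on simulated data. It assumes the forest is built with the randomForest package, which this page does not confirm; the feature names and the simulated response are hypothetical. In randomForest, 'maxnodes' caps the number of terminal nodes per tree, which indirectly limits tree depth.

```r
set.seed(1)
n <- 200

# Hypothetical binary features: presence of two kmers and one trinucleotide
features <- data.frame(
  kmer_AACGT = rbinom(n, 1, 0.3),
  kmer_CCTAG = rbinom(n, 1, 0.3),
  tri_ACG    = rbinom(n, 1, 0.3)
)

# Simulated response: mutation is driven by the first kmer only
mutated <- factor(ifelse(features$kmer_AACGT == 1 & runif(n) < 0.8, "yes", "no"))

gini <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(
    x = features, y = mutated,
    ntree = 720,    # plays the role of n_trees
    maxnodes = 20   # maximum number of terminal nodes per tree
  )
  # For classification forests, importance() includes a MeanDecreaseGini column
  gini <- randomForest::importance(fit)[, "MeanDecreaseGini"]
  print(sort(gini, decreasing = TRUE))
}
```

In this simulated setup the feature that drives the response would be expected to rank highest, which is the kind of signal the MeanDecreaseGini output is meant to expose.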

Value

A list containing: (1) MeanDecreaseGini information on kmers, (2) a 3D array of p-values and odds ratios of kmers, and optionally (3) the random forest fit


lindberg-m/contextendR documentation built on Jan. 8, 2022, 3:16 a.m.