run_sl | R Documentation |
Identify biomarkers using logistic regression, random forest, or support vector machine.
run_sl(
ps,
group,
taxa_rank = "all",
transform = c("identity", "log10", "log10p"),
norm = "none",
norm_para = list(),
nfolds = 3,
nrepeats = 3,
sampling = NULL,
tune_length = 5,
top_n = 10,
method = c("LR", "RF", "SVM"),
...
)
ps |
a |
group |
character, the variable to set the group. |
taxa_rank |
character to specify taxonomic rank to perform
differential analysis on. Should be one of
|
transform |
character, the methods used to transform the microbial
abundance. See
|
norm |
the methods used to normalize the microbial abundance data. See
|
norm_para |
named |
nfolds |
the number of splits in CV. |
nrepeats |
the number of complete sets of folds to compute. |
sampling |
a single character value describing the type of additional
sampling that is conducted after resampling (usually to resolve class
imbalances). Values are "none", "down", "up", "smote", or "rose". For
more details see |
tune_length |
an integer denoting the amount of granularity in the
tuning parameter grid. For more details see |
top_n |
an integer denoting the top |
method |
supervised learning method, options are "LR" (logistic regression), "RF" (random forest), or "SVM" (support vector machine). |
... |
extra arguments passed to the classification. e.g., |
Only two-group comparisons are supported in the current version. Markers are selected based on their importance scores. The hyper-parameters are selected automatically by a grid search within repeated K-fold cross-validation (nrepeats complete sets of nfolds folds). Consequently, the identified biomarkers can be biased by model overfitting on small datasets (e.g., fewer than 100 samples).
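To make the resampling scheme concrete, here is a minimal base-R sketch of how repeated K-fold splits can be laid out. The helper make_repeated_folds() is hypothetical and is not part of the package; its arguments only mirror nfolds and nrepeats. It assigns folds at random without stratifying by group, which may differ from what the underlying caret resampling does.

```r
# Hypothetical helper (not part of microbiomeMarker or caret):
# returns one vector of fold labels per repeat.
make_repeated_folds <- function(n, nfolds = 3, nrepeats = 3) {
  lapply(seq_len(nrepeats), function(r) {
    # Shuffle balanced fold labels 1..nfolds across the n samples
    sample(rep(seq_len(nfolds), length.out = n))
  })
}

set.seed(2021)
folds <- make_repeated_folds(n = 12, nfolds = 3, nrepeats = 2)

# Within each repeat, every sample is held out exactly once
for (r in seq_along(folds)) {
  for (k in 1:3) {
    test_idx <- which(folds[[r]] == k)
    train_idx <- which(folds[[r]] != k)
    cat(sprintf(
      "repeat %d, fold %d: %d train / %d test\n",
      r, k, length(train_idx), length(test_idx)
    ))
  }
}
```

With few samples each held-out fold is tiny, which is one reason the tuned model and its importance ranking can be unstable on small datasets.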
The argument top_n sets the number of markers to select, ranked by importance score. There is no rule or principle for choosing top_n; however, it is usually useful to try several values of top_n and compare how well the resulting markers predict the testing data.
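The ranking step itself can be illustrated in base R. The sketch below assumes importance scores are available as a named numeric vector; the scores are made-up numbers and select_top_n() is a hypothetical helper, not the package's internal code.

```r
# Hypothetical importance scores (made-up numbers, not output of run_sl())
importance <- c(
  Bacteroides = 0.42,
  Prevotella = 0.31,
  Ruminococcus = 0.12,
  Faecalibacterium = 0.09,
  Roseburia = 0.06
)

# Hypothetical helper: names of the top_n highest-scoring features
select_top_n <- function(scores, top_n) {
  ranked <- sort(scores, decreasing = TRUE)
  names(utils::head(ranked, top_n))
}

# Candidate marker sets for several values of top_n
sizes <- c(2, 3, 5)
candidates <- lapply(sizes, function(n) select_top_n(importance, n))
names(candidates) <- paste0("top_", sizes)
print(candidates)
```

Each candidate set could then be evaluated on held-out samples to see where adding more markers stops improving prediction.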
a microbiomeMarker object.
Yang Cao
caret::train(), caret::trainControl()
data(enterotypes_arumugam)
# small example phyloseq object for test
ps_small <- phyloseq::subset_taxa(
enterotypes_arumugam,
Phylum %in% c("Firmicutes", "Bacteroidetes")
)
set.seed(2021)
mm <- run_sl(
ps_small,
group = "Gender",
taxa_rank = "Genus",
nfolds = 2,
nrepeats = 1,
top_n = 15,
norm = "TSS",
method = "LR"
)
mm