View source: R/make_heterogeneous_dataset.R
make_heterogeneous_dataset | R Documentation |
This function extracts observations related to heterogeneous organisms from IEDB data and returns a data set that can be used to train machine learning models.
make_heterogeneous_dataset(
  epitopes,
  proteins,
  taxonomy_list,
  nPos,
  nNeg,
  removeIDs = NULL,
  hostIDs = NULL,
  min_epit = 8,
  max_epit = 25,
  only_exact = FALSE,
  pos.mismatch.rm = "all",
  set.positive = "mode",
  window_size = 2 * min_epit - 1,
  max.N = 2,
  save_folder = "./",
  rnd.seed = NULL,
  ncpus = 1
)
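The default for window_size is written in terms of min_epit, and R evaluates default arguments lazily inside the function, so the default window follows whatever min_epit is passed. The sketch below uses a hypothetical stand-in function (show_window, not part of the package) that copies only these two defaults from the usage above.

```r
# Hypothetical stand-in that copies only the two relevant defaults from the
# usage signature; it is NOT the package function.
show_window <- function(min_epit = 8, window_size = 2 * min_epit - 1) {
  window_size
}

default_ws  <- show_window()                                # min_epit = 8
shorter_ws  <- show_window(min_epit = 5)                    # default follows min_epit
explicit_ws <- show_window(min_epit = 5, window_size = 21)  # explicit value wins
```

With the documented defaults, min_epit = 8 gives a window of 15 positions. Passing window_size explicitly overrides the formula.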
epitopes |
data frame of epitope data (returned by |
proteins |
data frame of protein data (returned by |
taxonomy_list |
list containing taxonomy information
(generated by |
nPos |
number of positive examples to extract. NOTE: this refers to
the number of unique positive examples extracted from |
nNeg |
number of negative examples to extract |
removeIDs |
vector of organism IDs to remove (see |
hostIDs |
vector of host IDs to retain (see |
min_epit |
positive integer, shortest epitope to be considered |
max_epit |
positive integer, longest epitope to be considered |
only_exact |
logical, should only sequences labelled as "Exact Epitope"
in variable epit_struc_def (within |
pos.mismatch.rm |
should epitopes with position mismatches be removed? Use "all" (default) to remove any epitope with a position mismatch, or "align" if the routine should attempt to search for the epitope sequence within the protein sequence. |
set.positive |
how to decide whether an observation should be of the "Positive" (+1) class? Use "any" to set a sequence as positive if $n_{positive} > 0$, "mode" to set it if $n_{positive} \geq n_{negative}$, or "all" to set it if $n_{negative} = 0$. Defaults to "mode". |
window_size |
positive integer, size of window to use. |
max.N |
maximum length of N-peptide frequency features to be calculated. |
save_folder |
path to folder for saving the results. |
rnd.seed |
seed for random number generator |
ncpus |
number of cores to use for data windowing and feature calculation. |
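The three set.positive rules described above can be illustrated with a minimal sketch. The helper label_positive is hypothetical and not part of the package. Coding the non-positive class as -1 is also an assumption, since the documentation only states that "Positive" is +1.

```r
# Hypothetical helper, NOT part of the package: applies the three documented
# set.positive rules to per-sequence counts of positive and negative results.
label_positive <- function(n_positive, n_negative,
                           set.positive = c("mode", "any", "all")) {
  set.positive <- match.arg(set.positive)
  is_pos <- switch(set.positive,
                   any  = n_positive > 0,
                   mode = n_positive >= n_negative,
                   all  = n_negative == 0)
  # Assumption: the non-positive class is coded as -1.
  ifelse(is_pos, 1L, -1L)
}

# Toy counts for four sequences.
n_pos <- c(3, 1, 0, 2)
n_neg <- c(0, 2, 4, 2)

lab_any  <- label_positive(n_pos, n_neg, "any")
lab_mode <- label_positive(n_pos, n_neg, "mode")
lab_all  <- label_positive(n_pos, n_neg, "all")
```

On these toy counts, "any" is the most permissive rule and "all" the strictest. Ties count as positive under "mode" because the comparison is non-strict.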
The heterogeneous data set is assembled by sampling entries from epitopes by organism taxID (after filtering using removeIDs and hostIDs) until the desired number of positive and negative observations is reached. Random subsampling is performed if required to return the exact number of unique epitope examples.
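The sampling idea can be sketched on toy data. This is an illustration under stated assumptions, not the package's implementation: the column names (taxid, class), the -1 coding of negatives, and the treatment of each row as one unique example are all hypothetical.

```r
# Toy stand-in for the already-filtered epitope table (hypothetical columns).
set.seed(42)
toy <- data.frame(
  taxid = rep(c("A", "B", "C", "D"), times = c(6, 5, 7, 4)),
  class = c(1, 1, 1, -1, -1, -1,     # A: 3 positive, 3 negative
            1, 1, -1, -1, -1,        # B: 2 positive, 3 negative
            1, 1, 1, 1, -1, -1, -1,  # C: 4 positive, 3 negative
            1, -1, -1, -1)           # D: 1 positive, 3 negative
)
nPos <- 5
nNeg <- 6

# Visit organisms in random order, adding all rows of each organism until
# both targets are met (or no organisms are left).
picked <- toy[0, ]
for (id in sample(unique(toy$taxid))) {
  if (sum(picked$class == 1) >= nPos && sum(picked$class == -1) >= nNeg) break
  picked <- rbind(picked, toy[toy$taxid == id, ])
}

# Randomly subsample each class down to the exact requested size.
take <- function(rows, n) rows[sample.int(length(rows), min(n, length(rows)))]
keep <- c(take(which(picked$class == 1), nPos),
          take(which(picked$class == -1), nNeg))
result <- picked[sort(keep), ]
```

Sampling whole organisms first and trimming afterwards keeps the examples grouped by taxID, at the cost of discarding some rows from the selected organisms.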
Data frame containing the resulting dataset.
Felipe Campelo (f.campelo@aston.ac.uk)