View source: R/make_OrgSpec_datasets.R
make_OrgSpec_datasets | R Documentation |
Extracts organism- or taxon-specific data from IEDB data and returns datasets for the development and assessment of epitope prediction models.
make_OrgSpec_datasets(
  epitopes,
  proteins,
  taxonomy_list,
  orgIDs = NULL,
  hostIDs = NULL,
  removeIDs = NULL,
  save_folder = "./",
  min_epit = 8,
  max_epit = 25,
  only_exact = FALSE,
  pos.mismatch.rm = "all",
  set.positive = "mode",
  window_size = 2 * min_epit - 1,
  max.N = 2,
  split_level = "prot",
  split_perc = c(75, 25),
  split_names = c("01_training", "02_holdout"),
  coverage_threshold = 80,
  identity_threshold = 80,
  ncpus = 1
)
epitopes |
data frame of epitope data (returned by |
proteins |
data frame of protein data (returned by |
taxonomy_list |
list containing taxonomy information
(generated by |
orgIDs |
vector of organism/taxon IDs to retain (see
|
hostIDs |
vector of host IDs to retain (see |
removeIDs |
vector of organism IDs to remove (see |
save_folder |
path to folder for saving the results. |
min_epit |
positive integer, shortest epitope to be considered |
max_epit |
positive integer, longest epitope to be considered |
only_exact |
logical, should only sequences labelled as "Exact Epitope"
in variable epit_struc_def (within |
pos.mismatch.rm |
should epitopes with position mismatches be removed? Use "all" (default) to remove every epitope with a position mismatch, or "align" if the routine should attempt to locate the epitope sequence within the protein sequence. |
set.positive |
how to decide whether an observation should be of the "Positive" (+1) class? Use "any" to set a sequence as positive if $n_positive > 0$, "mode" to set it if $n_positive >= n_negative$, or "all" to set it if $n_negative == 0$. Defaults to "mode". |
window_size |
positive integer, size of window to use. |
max.N |
maximum length of N-peptide frequency features to be calculated. |
split_level |
which level should be used for splitting? Use "org" for splitting by source organism ID, "prot" by protein ID or "epit" by epitope ID. When "prot" is used the routine attempts to identify different protein versions and treat them as a single unit for splitting purposes. |
split_perc |
numeric vector of desired splitting percentages. See Details. |
split_names |
optional character vector with short names for each split. |
coverage_threshold |
coverage threshold for grouping proteins by similarity, see Details. |
identity_threshold |
identity threshold for grouping proteins by similarity, see Details. |
ncpus |
number of cores to use for data windowing and feature calculation. |
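The three set.positive rules described in the arguments above can be sketched as a small helper. This is only an illustration: assign_class is a hypothetical function, not part of the package, and the use of -1 for the non-positive class is an assumption (the documentation only states that the positive class is +1).

```r
# Hypothetical helper (not part of the package): applies the three
# set.positive rules to vectors of positive/negative assay counts.
# Assumption: the non-positive class is coded as -1.
assign_class <- function(n_positive, n_negative,
                         set.positive = c("mode", "any", "all")) {
  set.positive <- match.arg(set.positive)
  is_pos <- switch(set.positive,
                   any  = n_positive > 0,            # at least one positive
                   mode = n_positive >= n_negative,  # positives not outnumbered
                   all  = n_negative == 0)           # no negatives at all
  ifelse(is_pos, 1L, -1L)
}

n_pos <- c(3, 1, 0, 2)
n_neg <- c(0, 2, 4, 2)
assign_class(n_pos, n_neg, "any")   # 1  1 -1  1
assign_class(n_pos, n_neg, "mode")  # 1 -1 -1  1
assign_class(n_pos, n_neg, "all")   # 1 -1 -1 -1
```

Note how "any" is the most permissive rule and "all" the strictest, with the default "mode" in between.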
List containing the resulting datasets.
If the sum of split_perc is less than 100, an extra split is generated with
the remaining observations; e.g., split_perc = c(50, 30) results in three sets
with an approximately 50/30/20% split of the total observations.
If the sum is greater than 100, the splits are linearly scaled down so that
the sum becomes 100. Note that the split percentages correspond to the number
of observations, not the number of unique IDs.
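The handling of split_perc described above can be sketched as follows. The helper name normalise_split_perc is hypothetical and the code only mirrors the documented behaviour; it is not taken from the package source.

```r
# Hypothetical helper (not part of the package): mirrors the documented
# handling of split_perc values that do not sum to 100.
normalise_split_perc <- function(split_perc) {
  total <- sum(split_perc)
  if (total < 100) {
    # Remaining observations form an extra split
    split_perc <- c(split_perc, 100 - total)
  } else if (total > 100) {
    # Linear scaling so that the percentages sum to 100
    split_perc <- 100 * split_perc / total
  }
  split_perc
}

normalise_split_perc(c(50, 30))  # 50 30 20
normalise_split_perc(c(90, 30))  # 75 25
normalise_split_perc(c(75, 25))  # 75 25
```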
This function will attempt to approximate the desired split percentages, but
depending on the size of the data set and the chosen split_level this may
not be possible (e.g., if split_level = "org" and a single organism
corresponds to 90% of the data, one of the splits will necessarily contain
at least 90% of the data, regardless of the values given in split_perc).
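To see why the target percentages cannot always be met, consider a simple greedy allocation in which whole groups (e.g., organisms) are assigned to the split that is currently furthest below its target. This is an illustrative sketch under that assumption only; allocate_groups is a hypothetical helper and the package may use a different allocation strategy.

```r
# Hypothetical helper (not the package's algorithm): assigns indivisible
# groups to splits greedily, largest group first, always filling the split
# with the largest remaining deficit relative to its target size.
allocate_groups <- function(group_sizes, split_perc) {
  target <- sum(group_sizes) * split_perc / 100
  filled <- numeric(length(split_perc))
  assignment <- integer(length(group_sizes))
  for (i in order(group_sizes, decreasing = TRUE)) {
    k <- which.max(target - filled)
    assignment[i] <- k
    filled[k] <- filled[k] + group_sizes[i]
  }
  list(assignment = assignment,
       achieved   = 100 * filled / sum(group_sizes))
}

# One organism holds 90% of the observations
res <- allocate_groups(c(orgA = 900, orgB = 60, orgC = 40), c(75, 25))
res$achieved  # 90 10
```

Because orgA cannot be divided, the requested 75/25 split degrades to 90/10.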
If split_level == "prot", the routine will keep any pair of proteins
having (coverage >= coverage_threshold AND identity >= identity_threshold)
under the same split. This is useful to prevent accidental data leakage due
to quasi-identical proteins with different UIDs.
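The grouping rule above can be illustrated with a small sketch that merges proteins into groups whenever a pairwise comparison passes both thresholds. The function group_similar_proteins and the column names query, subject, coverage and identity are assumptions made for this example; they are not taken from the package, and treating the relation as transitive (A similar to B and B similar to C puts all three together) is also an assumption.

```r
# Hypothetical helper (not part of the package): groups protein IDs so that
# any pair passing both thresholds ends up with the same group label.
group_similar_proteins <- function(ids, hits,
                                   coverage_threshold = 80,
                                   identity_threshold = 80) {
  keep <- hits$coverage >= coverage_threshold &
    hits$identity >= identity_threshold
  hits <- hits[keep, , drop = FALSE]
  group <- setNames(seq_along(ids), ids)  # start with one group per protein
  for (i in seq_len(nrow(hits))) {
    g1 <- group[[hits$query[i]]]
    g2 <- group[[hits$subject[i]]]
    if (g1 != g2) group[group == g2] <- g1  # merge the two groups
  }
  group
}

ids  <- c("P1", "P2", "P3", "P4")
hits <- data.frame(query    = c("P1", "P2", "P3"),
                   subject  = c("P2", "P3", "P4"),
                   coverage = c(95, 85, 90),
                   identity = c(99, 90, 60),
                   stringsAsFactors = FALSE)
groups <- group_similar_proteins(ids, hits)
groups  # P1, P2 and P3 share a group; P4 stays alone (identity 60 < 80)
```

All proteins sharing a group label would then be allocated to the same split.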
NOTE: this requires BLAST+ to be installed on your local machine. For details
on how to set up BLAST+, see
https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download.
This function was developed for BLAST+ version 2.10.0 or later.
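Before calling the function with split_level = "prot", it may help to check that BLAST+ is available and recent enough. The sketch below is one possible check, not something the package provides: blast_version_ok is a hypothetical helper, and it assumes that the first line printed by blastp -version contains a version string such as 2.10.0.

```r
# Hypothetical helper (not part of the package): checks whether the first
# line of the `blastp -version` output reports at least the minimum version.
# Assumption: that line contains a version of the form major.minor.patch.
blast_version_ok <- function(version_lines, minimum = "2.10.0") {
  m <- regmatches(version_lines[1],
                  regexpr("[0-9]+\\.[0-9]+\\.[0-9]+", version_lines[1]))
  length(m) == 1 && numeric_version(m) >= numeric_version(minimum)
}

blastp <- Sys.which("blastp")  # empty string if blastp is not on the PATH
if (nzchar(blastp)) {
  out <- system2(blastp, "-version", stdout = TRUE)
  message("BLAST+ found; version OK: ", blast_version_ok(out))
} else {
  message("blastp not found on PATH; install BLAST+ first.")
}
```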
Felipe Campelo (f.campelo@aston.ac.uk)