make_OrgSpec_datasets: Build organism-specific data sets for training epitope...

make_OrgSpec_datasets    R Documentation

Build organism-specific data sets for training epitope prediction models

Description

This function extracts organism or taxon-specific datasets from IEDB data and returns data sets for the development and assessment of epitope prediction models.

Usage

make_OrgSpec_datasets(
  epitopes,
  proteins,
  taxonomy_list,
  orgIDs = NULL,
  hostIDs = NULL,
  removeIDs = NULL,
  save_folder = "./",
  min_epit = 8,
  max_epit = 25,
  only_exact = FALSE,
  pos.mismatch.rm = "all",
  set.positive = "mode",
  window_size = 2 * min_epit - 1,
  max.N = 2,
  split_level = "prot",
  split_perc = c(75, 25),
  split_names = c("01_training", "02_holdout"),
  coverage_threshold = 80,
  identity_threshold = 80,
  ncpus = 1
)

Arguments

epitopes

data frame of epitope data (returned by get_LBCE()).

proteins

data frame of protein data (returned by get_proteins()).

taxonomy_list

list containing taxonomy information (generated by get_taxonomy()).

orgIDs

vector of organism/taxon IDs to retain (see filter_epitopes()). If NULL then no organism ID filtering is performed.

hostIDs

vector of host IDs to retain (see filter_epitopes()). If NULL then no host filtering is performed.

removeIDs

vector of organism IDs to remove (see filter_epitopes()). Useful for, e.g., using a Class-level orgIDs and removing some species or genera. If NULL then no removal is performed.

save_folder

path to folder for saving the results.

min_epit

positive integer, length of the shortest epitope to be considered.

max_epit

positive integer, length of the longest epitope to be considered.

only_exact

logical, should only sequences labelled as "Exact Epitope" in variable epit_struc_def (within epitopes) be considered?

pos.mismatch.rm

should epitopes with position mismatches be removed? Use "all" (default) to remove any epitope with a position mismatch, or "align" if the routine should instead attempt to locate the epitope sequence within the protein sequence.

set.positive

how to decide whether an observation should be of the "Positive" (+1) class? Use "any" to set a sequence as positive if $n_positive > 0$, "mode" to set it if $n_positive >= n_negative$, or "all" to set it if $n_negative == 0$. Defaults to "mode".
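
The three labelling rules can be sketched as below. This is only an illustration of the rules as documented, not the package's internal code: assign_class() is a hypothetical helper, and the -1 label for the non-positive class is an assumption (the documentation only states +1 for "Positive").

```r
# Hypothetical helper (not part of the package): derive a class label from
# counts of positive and negative assays using the documented rules.
assign_class <- function(n_positive, n_negative,
                         set.positive = c("mode", "any", "all")) {
  set.positive <- match.arg(set.positive)
  is_pos <- switch(set.positive,
                   any  = n_positive > 0,
                   mode = n_positive >= n_negative,
                   all  = n_negative == 0)
  # Assumption: the negative class is coded as -1.
  ifelse(is_pos, 1L, -1L)
}

n_pos <- c(3, 1, 0, 2)
n_neg <- c(0, 2, 4, 2)

assign_class(n_pos, n_neg, "any")   #  1  1 -1  1
assign_class(n_pos, n_neg, "mode")  #  1 -1 -1  1
assign_class(n_pos, n_neg, "all")   #  1 -1 -1 -1
```

Note that ties count as positive under "mode", since the rule uses >= rather than >.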

window_size

positive integer, size of window to use. Defaults to 2 * min_epit - 1 (i.e., 15 when min_epit = 8).

max.N

maximum length of N-peptide frequency features to be calculated.

split_level

which level should be used for splitting? Use "org" for splitting by source organism ID, "prot" by protein ID or "epit" by epitope ID. When "prot" is used the routine attempts to identify different protein versions and treat them as a single unit for splitting purposes.

split_perc

numeric vector of desired splitting percentages. See Details.

split_names

optional character vector with short names for each split.

coverage_threshold

coverage threshold for grouping proteins by similarity, see Details.

identity_threshold

identity threshold for grouping proteins by similarity, see Details.

ncpus

number of cores to use for data windowing and feature calculation.

Value

List containing the resulting datasets.

Data splitting and BLAST requirements

If the sum of split_perc is less than 100 an extra split is generated with the remaining observations - e.g., split_perc = c(50, 30) results in three sets with an approximately 50/30/20% split of the total observations. If the sum is greater than 100 the splits are linearly scaled down so that the sum becomes 100. Note that the split percentages correspond to the number of observations, not the number of unique IDs.
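
The handling of split_perc described above could be sketched as follows. normalise_split_perc() is a hypothetical helper written for illustration only, and it assumes that "linearly scaled down" means proportional rescaling; the package's actual implementation may differ.

```r
# Hypothetical helper (not part of the package): turn a user-supplied
# split_perc into percentages that sum to 100, following the documented rules.
normalise_split_perc <- function(split_perc) {
  stopifnot(is.numeric(split_perc), all(split_perc > 0))
  total <- sum(split_perc)
  if (total < 100) {
    # Remaining observations go into one extra split.
    split_perc <- c(split_perc, 100 - total)
  } else if (total > 100) {
    # Assumption: proportional rescaling so that the sum becomes 100.
    split_perc <- 100 * split_perc / total
  }
  split_perc
}

normalise_split_perc(c(50, 30))  # 50 30 20
normalise_split_perc(c(75, 25))  # 75 25
normalise_split_perc(c(90, 60))  # 60 40
```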

This function will attempt to approximate the desired split percentages, but depending on the size of the data set and the chosen split_level this may not be possible (e.g., if split_level = "org" and a single organism accounts for 90% of the data, one of the splits will necessarily contain at least 90% of the data, regardless of the values given in split_perc).
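
To see why the targets can only be approximated, consider the toy allocation below. It uses a simple greedy heuristic chosen purely for illustration (greedy_group_split() is hypothetical and is not claimed to be the algorithm used by this function): whole groups are assigned, largest first, to whichever split is currently furthest below its target.

```r
# Hypothetical illustration (not the package's algorithm): allocate whole
# groups to splits, largest group first, always picking the split that is
# furthest below its target number of observations.
greedy_group_split <- function(group_sizes, split_perc) {
  target   <- sum(group_sizes) * split_perc / sum(split_perc)
  filled   <- numeric(length(split_perc))
  split_id <- integer(length(group_sizes))
  for (i in order(group_sizes, decreasing = TRUE)) {
    k <- which.max(target - filled)
    split_id[i] <- k
    filled[k]   <- filled[k] + group_sizes[i]
  }
  list(assignment    = split_id,
       achieved_perc = 100 * filled / sum(group_sizes))
}

# One organism holds 90% of the observations.
sizes <- c(orgA = 900, orgB = 60, orgC = 40)
res   <- greedy_group_split(sizes, split_perc = c(75, 25))
res$assignment     # 1 2 2
res$achieved_perc  # 90 10
```

Because orgA cannot be divided, the requested 75/25 split becomes 90/10 in this example.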

If split_level == "prot" the routine will keep any pairs of proteins having (coverage >= coverage_threshold AND identity >= identity_threshold) under the same split. This is useful to prevent accidental data leakage due to quasi-identical proteins with different UIDs. NOTE: this requires BLAST+ to be installed on your local machine. For details on how to set up BLAST+ on your machine, check https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download. This function was developed for BLAST+ version 2.10.0 or later.
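
The effect of the two thresholds can be illustrated with the sketch below, which groups proteins so that every pair passing both thresholds ends up in the same group (and therefore the same split). This is an assumption-laden illustration, not the package's code: group_similar_proteins() is hypothetical, the hits table and its column names (query, subject, coverage, identity) are invented for the example, and chained similarities are assumed to merge groups transitively.

```r
# Hypothetical helper (not part of the package): label proteins so that any
# pair meeting both thresholds shares a group label.
group_similar_proteins <- function(ids, hits,
                                   coverage_threshold = 80,
                                   identity_threshold = 80) {
  keep <- hits$coverage >= coverage_threshold &
          hits$identity >= identity_threshold
  hits <- hits[keep, , drop = FALSE]
  group <- seq_along(ids)
  names(group) <- ids
  # Propagate the smallest label along retained pairs until nothing changes.
  repeat {
    changed <- FALSE
    for (r in seq_len(nrow(hits))) {
      a  <- hits$query[r]
      b  <- hits$subject[r]
      lo <- min(group[[a]], group[[b]])
      if (group[[a]] != lo || group[[b]] != lo) {
        group[[a]] <- lo
        group[[b]] <- lo
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  group
}

# Toy pairwise similarity table (invented values).
ids  <- c("P1", "P2", "P3", "P4")
hits <- data.frame(query    = c("P1", "P2", "P1"),
                   subject  = c("P2", "P3", "P4"),
                   coverage = c(95, 85, 60),
                   identity = c(99, 90, 99),
                   stringsAsFactors = FALSE)
groups <- group_similar_proteins(ids, hits)
groups  # P1, P2 and P3 share label 1; P4 keeps label 4
```

Here the P1/P4 pair is not merged because its coverage (60) is below the threshold, even though its identity is high.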

Author(s)

Felipe Campelo (f.campelo@aston.ac.uk)


fcampelo/epitopes documentation built on April 22, 2023, 12:23 a.m.