make_heterogeneous_dataset: Build a heterogeneous dataset of predefined size

View source: R/make_heterogeneous_dataset.R

make_heterogeneous_dataset R Documentation

Build a heterogeneous dataset of predefined size

Description

This function extracts observations related to heterogeneous organisms from IEDB data and returns a data set that can be used to train machine learning models.

Usage

make_heterogeneous_dataset(
  epitopes,
  proteins,
  taxonomy_list,
  nPos,
  nNeg,
  removeIDs = NULL,
  hostIDs = NULL,
  min_epit = 8,
  max_epit = 25,
  only_exact = FALSE,
  pos.mismatch.rm = "all",
  set.positive = "mode",
  window_size = 2 * min_epit - 1,
  max.N = 2,
  save_folder = "./",
  rnd.seed = NULL,
  ncpus = 1
)

Arguments

epitopes

data frame of epitope data (returned by get_LBCE()).

proteins

data frame of protein data (returned by get_proteins()).

taxonomy_list

list containing taxonomy information (generated by get_taxonomy())

nPos

number of positive examples to extract. NOTE: this refers to the number of unique positive examples extracted from epitopes, not to the size of the data frame returned (which is obtained after windowing using make_window_df()).

nNeg

number of negative examples to extract

removeIDs

vector of organism IDs to remove (see filter_epitopes()). Useful for, e.g., using Class-level orgIDs and removing some species or genera. If NULL, no removal is performed.

hostIDs

vector of host IDs to retain (see filter_epitopes()). If NULL then no host filtering is performed.

min_epit

positive integer, shortest epitope to be considered

max_epit

positive integer, longest epitope to be considered

only_exact

logical, should only sequences labelled as "Exact Epitope" in variable epit_struc_def (within epitopes) be considered?

pos.mismatch.rm

should epitopes with position mismatches be removed? Use "all" (default) to remove any position mismatch, or "align" if the routine should attempt to search for the epitope sequence in the protein sequence.

set.positive

how to decide whether an observation should be of the "Positive" (+1) class? Use "any" to set a sequence as positive if $n_positive > 0$, "mode" to set it if $n_positive >= n_negative$, or "all" to set it if $n_negative == 0$. Defaults to "mode".

window_size

positive integer, size of window to use.

max.N

maximum length of N-peptide frequency features to be calculated.

save_folder

path to folder for saving the results.

rnd.seed

seed for random number generator

ncpus

number of cores to use for data windowing and feature calculation.
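
The three set.positive rules described above can be illustrated with a minimal sketch. The helper label_by_rule() is hypothetical and is not part of the package, and coding the negative class as -1 is an assumption; only the three comparisons are taken from the argument description.

```r
# Hypothetical helper (not from the package): maps per-sequence counts of
# positive and negative assay results to a class label.
# Assumption: the negative class is coded as -1.
label_by_rule <- function(n_positive, n_negative,
                          rule = c("mode", "any", "all")) {
  rule <- match.arg(rule)
  is_pos <- switch(rule,
                   any  = n_positive > 0,            # at least one positive result
                   mode = n_positive >= n_negative,  # positives tie or outnumber negatives
                   all  = n_negative == 0)           # no negative result at all
  ifelse(is_pos, 1L, -1L)
}

# Made-up counts for four sequences
n_pos <- c(3, 1, 0, 2)
n_neg <- c(0, 2, 4, 2)

lab_any  <- label_by_rule(n_pos, n_neg, "any")
lab_mode <- label_by_rule(n_pos, n_neg, "mode")
lab_all  <- label_by_rule(n_pos, n_neg, "all")
```

The second sequence (one positive, two negative results) is labelled positive only under "any", and the fourth (a 2-2 tie) is positive under "any" and "mode" but not under "all".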

Details

The heterogeneous data set is assembled by sampling entries from epitopes by organism taxID (after filtering using removeIDs and hostIDs) until the desired number of positive and negative observations is reached. Random subsampling is performed if required to return the exact number of unique epitope examples.
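
The procedure described above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the column names taxid and class, the random visiting order of organisms, and the helper take() are all hypothetical and may differ from what the package does internally.

```r
set.seed(42)  # fixed seed so the toy example is reproducible

# Toy stand-in for the epitope data; column names are assumptions.
toy <- data.frame(
  taxid = rep(c("A", "B", "C", "D"), times = c(5, 4, 6, 3)),
  class = c(1, 1, -1, 1, -1,   1, -1, -1, 1,   1, 1, 1, -1, -1, 1,   -1, 1, -1)
)
nPos <- 6
nNeg <- 4

# Add all entries of one organism at a time, in random order,
# until both class targets are met.
picked <- toy[0, ]
for (id in sample(unique(toy$taxid))) {
  if (sum(picked$class == 1) >= nPos && sum(picked$class == -1) >= nNeg) break
  picked <- rbind(picked, toy[toy$taxid == id, ])
}

# Randomly subsample each class down to the exact requested size.
take <- function(rows, n) rows[sample.int(length(rows), min(n, length(rows)))]
pos_rows <- take(which(picked$class == 1), nPos)
neg_rows <- take(which(picked$class == -1), nNeg)
result <- picked[c(pos_rows, neg_rows), ]
```

In this sketch the final subsampling step is what guarantees exact class counts, since adding whole organisms will usually overshoot at least one of the two targets.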

Value

Data frame containing the resulting dataset.
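
As noted under nPos, the number of rows returned is determined by windowing rather than by nPos and nNeg directly. The sketch below shows the default window_size and a generic sliding window over a made-up sequence. The helper sliding_windows() is hypothetical; how make_window_df() actually builds and labels its windows is not described on this page.

```r
min_epit <- 8
window_size <- 2 * min_epit - 1  # default from the Usage section: 15

# Hypothetical helper: all contiguous windows of a given size in a sequence.
sliding_windows <- function(sequence, size) {
  n <- nchar(sequence)
  if (n < size) return(character(0))
  starts <- seq_len(n - size + 1)
  substring(sequence, starts, starts + size - 1)
}

toy_seq <- "MKTAYIAKQRQISFVKSHFSRQ"  # made-up sequence, for illustration only
windows <- sliding_windows(toy_seq, window_size)
n_windows <- length(windows)  # nchar(toy_seq) - window_size + 1
```

A single sequence can therefore contribute several rows, which is one way the returned data frame can end up larger than nPos + nNeg.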

Author(s)

Felipe Campelo (f.campelo@aston.ac.uk)


fcampelo/epitopes documentation built on April 22, 2023, 12:23 a.m.