In fcampelo/epitopes: Processing and Feature Extraction for Epitope Data

make_proteins_dataset

R Documentation

Builds a partially labelled dataset from protein and epitope sequences

Description

This function extracts protein sequences and labels them based on IEDB epitope data.

Usage

make_proteins_dataset(
  epitopes,
  proteins,
  taxonomy_list,
  prot_IDs,
  orgIDs = NULL,
  removeIDs = NULL,
  hostIDs = NULL,
  min_epit = 8,
  max_epit = 25,
  only_exact = FALSE,
  pos.mismatch.rm = "all",
  set.positive = "mode",
  window_size = 2 * min_epit - 1,
  max.N = 2,
  save_folder = "./",
  ncpus = 1
)
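
The default `window_size = 2 * min_epit - 1` in the signature above is an expression in another argument. R evaluates default arguments lazily, inside the function, so this default follows whatever `min_epit` the caller supplies. A minimal sketch of that behaviour, using a hypothetical stand-in function (`resolve_window_size()` is not part of the package and mirrors only these two arguments):

```r
# Hypothetical stand-in mirroring only two arguments of the signature;
# it is NOT part of the epitopes package.
resolve_window_size <- function(min_epit = 8,
                                window_size = 2 * min_epit - 1) {
  # The default for window_size is only evaluated here, on first use,
  # so it sees the value of min_epit passed by the caller.
  window_size
}

resolve_window_size()                                  # 15
resolve_window_size(min_epit = 10)                     # 19
resolve_window_size(min_epit = 10, window_size = 15)   # 15
```

Passing `window_size` explicitly overrides the derived default, as in the last call.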

Arguments

`epitopes`	data frame of epitope data (returned by `get_LBCE()`).
`proteins`	data frame of protein data (returned by `get_proteins()`).
`taxonomy_list`	list containing taxonomy information (generated by `get_taxonomy()`).
`prot_IDs`	IDs of proteins to be extracted for the data set.
`orgIDs`	vector of organism/taxon IDs to retain (see `filter_epitopes()`). If `NULL` then no organism ID filtering is performed.
`removeIDs`	vector of organism IDs to remove (see `filter_epitopes()`). Useful for, e.g., using a Class-level `orgIDs` and removing some species or genera. If `NULL` then no removal is performed.
`hostIDs`	vector of host IDs to retain (see `filter_epitopes()`). If `NULL` then no host filtering is performed.
`min_epit`	positive integer, shortest epitope to be considered.
`max_epit`	positive integer, longest epitope to be considered.
`only_exact`	logical, should only sequences labelled as "Exact Epitope" in variable `epit_struc_def` (within `epitopes`) be considered?
`pos.mismatch.rm`	should epitopes with position mismatches be removed? Use "all" (default) to remove every epitope with a position mismatch, or "align" to have the routine attempt to locate the epitope sequence within the protein sequence.
`set.positive`	how to decide whether an observation should be of the "Positive" (+1) class? Use "any" to set a sequence as positive if $n_{positive} > 0$, "mode" to set it if $n_{positive} \geq n_{negative}$, or "all" to set it if $n_{negative} = 0$. Defaults to "mode".
`window_size`	positive integer, size of the window used for data windowing. Defaults to `2 * min_epit - 1`.
`max.N`	maximum length of N-peptide frequency features to be calculated.
`save_folder`	path to folder for saving the results.
`ncpus`	number of cores to use for data windowing and feature calculation.
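
The three `set.positive` rules described above can be sketched in base R as follows. This is only an illustration of the documented conditions, not the package's internal code: `label_by_rule()` is a hypothetical helper, and the use of `-1` for the non-positive class is an assumption (the documentation only states `+1` for "Positive").

```r
# Hypothetical helper illustrating the documented `set.positive` rules;
# the package's own implementation may differ.
label_by_rule <- function(n_positive, n_negative,
                          set.positive = c("mode", "any", "all")) {
  # match.arg() picks the first choice ("mode") when nothing is supplied
  set.positive <- match.arg(set.positive)
  is_positive <- switch(set.positive,
    any  = n_positive > 0,            # at least one positive label
    mode = n_positive >= n_negative,  # positives at least as frequent
    all  = n_negative == 0            # no negative labels at all
  )
  # ASSUMPTION: -1 encodes the negative class
  ifelse(is_positive, 1L, -1L)
}

label_by_rule(n_positive = 2, n_negative = 3, set.positive = "any")   # 1
label_by_rule(n_positive = 2, n_negative = 3, set.positive = "mode")  # -1
label_by_rule(n_positive = 2, n_negative = 0, set.positive = "all")   # 1
```

With the same counts (2 positive, 3 negative), "any" yields a positive label while "mode" does not, which shows how the choice shifts the class balance of the resulting data set.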