make_proteins_dataset: Builds a partially labelled dataset from protein and epitope...

View source: R/make_proteins_dataset.R

make_proteins_dataset    R Documentation

Builds a partially labelled dataset from protein and epitope sequences

Description

This function extracts protein sequences and labels them based on IEDB epitope data.

Usage

make_proteins_dataset(
  epitopes,
  proteins,
  taxonomy_list,
  prot_IDs,
  orgIDs = NULL,
  removeIDs = NULL,
  hostIDs = NULL,
  min_epit = 8,
  max_epit = 25,
  only_exact = FALSE,
  pos.mismatch.rm = "all",
  set.positive = "mode",
  window_size = 2 * min_epit - 1,
  max.N = 2,
  save_folder = "./",
  ncpus = 1
)
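
The default for window_size is written in terms of min_epit, so it follows whatever value of min_epit is passed (R evaluates default arguments lazily). A minimal sketch of that behaviour, using a hypothetical stand-in function (window_default is not part of the package and copies only the two relevant defaults from the signature above):

```r
# Hypothetical stand-in: reproduces only the min_epit / window_size
# defaults from the Usage signature. Not part of the epitopes package.
window_default <- function(min_epit = 8, window_size = 2 * min_epit - 1) {
  window_size
}

w_default <- window_default()                                 # 2 * 8 - 1  = 15
w_longer  <- window_default(min_epit = 10)                    # 2 * 10 - 1 = 19
w_fixed   <- window_default(min_epit = 10, window_size = 15)  # explicit value wins: 15
```

Passing window_size explicitly decouples it from min_epit, which may be useful when comparing datasets built with different epitope length limits.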

Arguments

epitopes

data frame of epitope data (returned by get_LBCE()).

proteins

data frame of protein data (returned by get_proteins()).

taxonomy_list

list containing taxonomy information (generated by get_taxonomy())

prot_IDs

IDs of proteins to be extracted for the data set

orgIDs

vector of organism/taxon IDs to retain (see filter_epitopes()). If NULL then no organism ID filtering is performed.

removeIDs

vector of organism IDs to remove (see filter_epitopes()). Useful, e.g., for passing a Class-level ID in orgIDs and then excluding some of its species or genera. If NULL then no removal is performed.

hostIDs

vector of host IDs to retain (see filter_epitopes()). If NULL then no host filtering is performed.

min_epit

positive integer, shortest epitope to be considered

max_epit

positive integer, longest epitope to be considered

only_exact

logical, should only sequences labelled as "Exact Epitope" in variable epit_struc_def (within epitopes) be considered?

pos.mismatch.rm

should epitopes with position mismatches be removed? Use "all" (default) to remove every epitope with a position mismatch, or "align" if the routine should attempt to locate the epitope sequence within the protein sequence.

set.positive

how to decide whether an observation should be of the "Positive" (+1) class? Use "any" to set a sequence as positive if $n_{positive} > 0$, "mode" to set it if $n_{positive} \geq n_{negative}$, or "all" to set it if $n_{negative} = 0$. Defaults to "mode".

window_size

positive integer, size of window to use.

max.N

maximum length of N-peptide frequency features to be calculated.

save_folder

path to folder for saving the results.

ncpus

number of cores to use for data windowing and feature calculation.
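
For pos.mismatch.rm = "align", the description above says the routine tries to find the epitope sequence inside the protein sequence. How the package does this internally is not documented here; the sketch below only illustrates the general idea with a literal substring search in base R. The sequences are invented toy data.

```r
# Toy sequences, invented for illustration only.
protein <- "MKTAYIAKQRQISFVKSHFSRQ"
epitope <- "AKQRQISF"

# regexpr() with fixed = TRUE performs a literal substring search and
# returns the 1-based start of the first match, or -1 if there is none.
hit <- regexpr(epitope, protein, fixed = TRUE)

if (hit > 0) {
  start_pos <- as.integer(hit)
  end_pos   <- start_pos + nchar(epitope) - 1L
} else {
  # No match: under this sketch the epitope would have no usable position.
  start_pos <- NA_integer_
  end_pos   <- NA_integer_
}
```

Here the epitope is found at positions 7 to 14 of the toy protein. A literal search like this returns only the first occurrence, so repeated motifs would need extra handling.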

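The three set.positive rules described in the Arguments can be written out as a small decision function. This is a sketch of the documented rules only, not the package's code: is_positive is a hypothetical helper, and n_positive and n_negative are assumed to be the counts of positive and negative labels attached to one observation, with at least one label present.

```r
# Hypothetical helper mirroring the documented set.positive rules.
# Assumes n_positive + n_negative > 0 (the observation has some label).
is_positive <- function(n_positive, n_negative,
                        set.positive = c("mode", "any", "all")) {
  set.positive <- match.arg(set.positive)
  switch(set.positive,
    any  = n_positive > 0,            # one positive label is enough
    mode = n_positive >= n_negative,  # ties count as positive
    all  = n_negative == 0            # no negative label allowed
  )
}

# An observation with 3 positive and 1 negative labels:
r_any  <- is_positive(3, 1, "any")   # TRUE
r_mode <- is_positive(3, 1, "mode")  # TRUE
r_all  <- is_positive(3, 1, "all")   # FALSE
```

The rules get stricter from "any" to "mode" to "all", so stricter settings yield fewer positive observations from the same epitope data.
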
Value

Data frame containing the resulting dataset.

Author(s)

Felipe Campelo (f.campelo@aston.ac.uk)


fcampelo/epitopes documentation built on April 22, 2023, 12:23 a.m.