split_epitope_data: Split epitope data based on epitope, protein or organism IDs.


split_epitope_data    R Documentation

Split epitope data based on epitope, protein or organism IDs.

Description

Takes a data.table of class windowed_epit_dt (returned by make_window_df()) and splits it into mutually exclusive sets of observations, based on one of the columns Info_sourceOrg_id, Info_protein_id or Info_epitope_id.

Usage

split_epitope_data(
  wdf,
  split_level = "prot",
  split_perc = c(70, 30),
  split_names = NULL,
  save_folder = NULL,
  blast_file = NULL,
  coverage_threshold = 80,
  identity_threshold = 80
)

Arguments

wdf

data table of class windowed_epit_dt (returned by make_window_df())

split_level

which level should be used for splitting? Use "org" for splitting by source organism ID, "prot" by protein ID or "epit" by epitope ID. When "prot" is used the routine attempts to identify different protein versions and treat them as a single unit for splitting purposes.

split_perc

numeric vector of desired splitting percentages. See Details.

split_names

optional character vector with short names for each split.

save_folder

path to folder for saving the results.

blast_file

path to file containing all-vs-all BLASTp alignment results for all proteins in wdf. See Details.

coverage_threshold

coverage threshold for grouping proteins by similarity, see Details.

identity_threshold

identity threshold for grouping proteins by similarity, see Details.

Details

If the sum of split_perc is less than 100, an extra split is generated with the remaining observations - e.g., split_perc = c(50, 30) results in three sets with an approximately 50/30/20% split of the total observations. If the sum is greater than 100, the splits are linearly scaled down so that the sum becomes 100. Note that the split percentages refer to the number of observations, not the number of unique IDs.
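The handling of split_perc described above can be illustrated with a minimal base R sketch. The helper name normalise_split_perc is hypothetical and is not part of the package; it only reproduces the arithmetic stated in the text and may differ from the actual implementation.

```r
# Hypothetical helper (not part of the epitopes package): mirrors the
# documented treatment of split_perc.
normalise_split_perc <- function(split_perc) {
  stopifnot(is.numeric(split_perc), all(split_perc > 0))
  total <- sum(split_perc)
  if (total < 100) {
    # The remaining observations form one extra split.
    split_perc <- c(split_perc, 100 - total)
  } else if (total > 100) {
    # Linear rescaling so that the percentages sum to 100.
    split_perc <- 100 * split_perc / total
  }
  split_perc
}

under <- normalise_split_perc(c(50, 30))  # 50 30 20
over  <- normalise_split_perc(c(80, 40))  # two values in a 2:1 ratio, summing to 100
```

A vector that already sums to 100, such as the default c(70, 30), is returned unchanged.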

This function attempts to approximate the desired split percentages, but depending on the size of wdf and the chosen split_level this may not be possible. For example, if split_level = "org" and a single organism accounts for 90% of the data, one of the splits will necessarily contain at least 90% of the data, regardless of the values passed in split_perc.
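To see why the target percentages cannot always be met, consider the following sketch, which assigns whole groups (e.g., organisms) to splits. The greedy strategy and the name allocate_groups are assumptions made for illustration only; the allocation method used by the package may be different.

```r
# Hypothetical sketch (not the package's algorithm): assign whole groups to
# splits, largest group first, always filling the split that is furthest
# below its target number of observations.
allocate_groups <- function(group_sizes, split_perc) {
  targets <- sum(group_sizes) * split_perc / 100  # desired observations per split
  filled <- numeric(length(split_perc))           # observations assigned so far
  assignment <- integer(length(group_sizes))
  names(assignment) <- names(group_sizes)
  for (i in order(group_sizes, decreasing = TRUE)) {
    k <- which.max(targets - filled)              # split with the largest deficit
    assignment[i] <- k
    filled[k] <- filled[k] + group_sizes[i]
  }
  list(assignment = assignment,
       achieved_perc = 100 * filled / sum(group_sizes))
}

# One organism holds 90% of the observations (made-up counts).
sizes <- c(orgA = 900, orgB = 60, orgC = 40)
res <- allocate_groups(sizes, c(70, 30))
res$achieved_perc  # 90 10, although 70/30 was requested
```

Because a group is never divided between splits, the split that receives orgA holds 90% of the observations no matter which percentages are requested.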

If a BLASTp file is provided, the routine keeps any pair of proteins having (coverage >= coverage_threshold AND identity >= identity_threshold) in the same split. This helps prevent accidental data leakage due to quasi-identical proteins with different UIDs. NOTE: this only works if split_level == "prot".
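The grouping rule can be sketched in base R as follows. The column names (query, subject, coverage, identity), the helper name group_similar_proteins and the linking of proteins through chains of qualifying pairs are all assumptions for illustration; the layout of the actual BLASTp file and the package's internal grouping logic are not specified here.

```r
# Hypothetical sketch: give the same group label to proteins linked by hits
# that pass both thresholds. Column names are assumed, not taken from the package.
group_similar_proteins <- function(hits, protein_ids,
                                   coverage_threshold = 80,
                                   identity_threshold = 80) {
  keep <- hits$coverage >= coverage_threshold &
    hits$identity >= identity_threshold
  hits <- hits[keep, , drop = FALSE]
  labels <- seq_along(protein_ids)
  names(labels) <- protein_ids
  changed <- TRUE
  while (changed) {                   # propagate the smallest label until stable
    changed <- FALSE
    for (r in seq_len(nrow(hits))) {
      a <- as.character(hits$query[r])
      b <- as.character(hits$subject[r])
      m <- min(labels[[a]], labels[[b]])
      if (labels[[a]] != m || labels[[b]] != m) {
        labels[c(a, b)] <- m
        changed <- TRUE
      }
    }
  }
  labels
}

# Made-up alignment results for five proteins.
hits <- data.frame(query    = c("P1", "P2", "P4"),
                   subject  = c("P2", "P3", "P5"),
                   coverage = c(95, 90, 60),
                   identity = c(99, 85, 99),
                   stringsAsFactors = FALSE)
groups <- group_similar_proteins(hits, c("P1", "P2", "P3", "P4", "P5"))
```

In this example P1, P2 and P3 share one label and would be kept in the same split, while P4 and P5 remain separate because their hit falls below the coverage threshold.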

Value

A list object containing the split data tables.

Author(s)

Felipe Campelo (f.campelo@aston.ac.uk)


fcampelo/epitopes documentation built on April 22, 2023, 12:23 a.m.