split_epitope_data: Split epitope data based on epitope, protein or organism IDs.
In fcampelo/epitopes: Processing and Feature Extraction for Epitope Data

split_epitope_data

R Documentation

Description

Takes a data.table of class windowed_epit_dt (returned by make_window_df()) and splits it into mutually exclusive sets of observations, based on one of the columns Info_sourceOrg_id, Info_protein_id or Info_epitope_id.

Usage

split_epitope_data(
  wdf,
  split_level = "prot",
  split_perc = c(70, 30),
  split_names = NULL,
  save_folder = NULL,
  blast_file = NULL,
  coverage_threshold = 80,
  identity_threshold = 80
)

Arguments

`wdf`	data table of class windowed_epit_dt (returned by `make_window_df()`)
`split_level`	which level should be used for splitting? Use "org" for splitting by source organism ID, "prot" by protein ID or "epit" by epitope ID. When "prot" is used the routine attempts to identify different protein versions and treat them as a single unit for splitting purposes.
`split_perc`	numeric vector of desired splitting percentages. See Details.
`split_names`	optional character vector with short names for each split.
`save_folder`	path to folder for saving the results.
`blast_file`	path to file containing all-vs-all BLASTp alignment results for all proteins in wdf. See Details.
`coverage_threshold`	coverage threshold for grouping proteins by similarity, see Details.
`identity_threshold`	identity threshold for grouping proteins by similarity, see Details.

Details

If the sum of split_perc is less than 100, an extra split is generated with the remaining observations - e.g., split_perc = c(50, 30) results in three sets with an approximately 50/30/20% split of the total observations. If the sum is greater than 100, the splits are linearly scaled down so that the sum becomes 100. Note that the split percentages refer to the number of observations, not the number of unique IDs.
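
The two rules above can be sketched in a few lines of base R. This is only an illustration of the described behaviour, not the package's internal code; the helper name normalise_split_perc is hypothetical.

```r
# Hypothetical helper (not part of the epitopes package): normalise a vector
# of split percentages following the rules described above.
normalise_split_perc <- function(split_perc) {
  stopifnot(is.numeric(split_perc), all(split_perc > 0))
  total <- sum(split_perc)
  if (total < 100) {
    # The remaining observations form an extra split.
    split_perc <- c(split_perc, 100 - total)
  } else if (total > 100) {
    # Scale linearly so that the percentages sum to 100.
    split_perc <- 100 * split_perc / total
  }
  split_perc
}

under <- normalise_split_perc(c(50, 30))  # three splits: 50, 30, 20
over  <- normalise_split_perc(c(80, 40))  # two splits, rescaled to sum to 100
```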

This function will attempt to approximate the desired split percentages, but depending on the size of wdf and the chosen split_level this may not be possible. For example, if split_level = "org" and a single organism corresponds to 90% of the data, one of the splits will necessarily contain at least 90% of the data, regardless of the values given in split_perc.
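
The effect of a dominant group can be seen with a toy example. The sketch below uses made-up organism counts and a simple greedy assignment, which is an assumption chosen for illustration and not necessarily the strategy used by split_epitope_data(); the function name assign_groups is hypothetical.

```r
# Toy data: number of observations per organism ID (made-up values).
group_sizes <- c(org_A = 90, org_B = 6, org_C = 4)
target_perc <- c(70, 30)

# Greedy assignment (an assumption, not the package's algorithm): place each
# group, largest first, in the split with the largest remaining deficit.
assign_groups <- function(group_sizes, target_perc) {
  target <- sum(group_sizes) * target_perc / 100
  filled <- numeric(length(target))
  groups <- sort(group_sizes, decreasing = TRUE)
  assignment <- integer(length(groups))
  names(assignment) <- names(groups)
  for (i in seq_along(groups)) {
    k <- which.max(target - filled)
    assignment[i] <- k
    filled[k] <- filled[k] + groups[[i]]
  }
  list(assignment = assignment,
       achieved_perc = 100 * filled / sum(filled))
}

res <- assign_groups(group_sizes, target_perc)
res$achieved_perc  # 90 and 10, although 70 and 30 were requested
```

Because whole groups are never divided between splits, no assignment of these three organisms can come closer to 70/30 than 90/10.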

If a BLASTp file is provided, the routine will keep any pair of proteins having (coverage >= coverage_threshold AND identity >= identity_threshold) in the same split. This is useful to prevent accidental data leakage due to quasi-identical proteins with different UIDs. NOTE: this only works if split_level == "prot".
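
One way to picture the grouping step is as finding connected components among proteins linked by qualifying alignments. The sketch below is a minimal base R illustration under stated assumptions: the column names (query, subject, coverage, identity) and the function group_similar_proteins are hypothetical, and the actual layout expected for blast_file is not described here. Note that linking pairs this way is transitive, so two proteins can end up in the same group through a shared neighbour even if their own alignment falls below the thresholds.

```r
# Toy all-vs-all hits. Column names are assumptions made for illustration.
hits <- data.frame(
  query    = c("P1", "P1", "P2", "P4"),
  subject  = c("P2", "P3", "P3", "P5"),
  coverage = c(95, 60, 85, 99),
  identity = c(90, 99, 82, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one integer group label per protein ID.
group_similar_proteins <- function(hits, ids,
                                   coverage_threshold = 80,
                                   identity_threshold = 80) {
  # Keep only pairs that satisfy BOTH thresholds.
  keep <- hits$coverage >= coverage_threshold &
    hits$identity >= identity_threshold
  edges <- hits[keep, c("query", "subject")]
  # Start with every protein in its own group, then merge linked groups
  # until no label changes (connected components by label propagation).
  label <- setNames(seq_along(ids), ids)
  repeat {
    changed <- FALSE
    for (i in seq_len(nrow(edges))) {
      a <- edges$query[i]
      b <- edges$subject[i]
      m <- min(label[[a]], label[[b]])
      if (label[[a]] != m || label[[b]] != m) {
        label[[a]] <- m
        label[[b]] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  label
}

grp <- group_similar_proteins(hits, ids = paste0("P", 1:5))
# P1, P2 and P3 share one label; P4 and P5 stay separate because their
# alignment has identity 70, below the threshold of 80.
```

Splitting would then operate on these group labels rather than on individual protein IDs, so that near-identical proteins cannot land in different splits.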