View source: R/split_epitope_data.R
split_epitope_data | R Documentation |
Takes a data.table of class windowed_epit_dt (returned by
make_window_df()) and splits it into mutually exclusive sets of
observations, based on columns Info_sourceOrg_id, Info_protein_id or
Info_epitope_id.
split_epitope_data(
wdf,
split_level = "prot",
split_perc = c(70, 30),
split_names = NULL,
save_folder = NULL,
blast_file = NULL,
coverage_threshold = 80,
identity_threshold = 80
)
wdf |
data table of class windowed_epit_dt (returned by make_window_df()). |
split_level |
which level should be used for splitting? Use "org" for splitting by source organism ID, "prot" by protein ID or "epit" by epitope ID. When "prot" is used the routine attempts to identify different protein versions and treat them as a single unit for splitting purposes. |
split_perc |
numeric vector of desired splitting percentages. See Details. |
split_names |
optional character vector with short names for each split. |
save_folder |
path to folder for saving the results. |
blast_file |
path to file containing all-vs-all BLASTp alignment results for all proteins in wdf. See Details. |
coverage_threshold |
coverage threshold for grouping proteins by similarity, see Details. |
identity_threshold |
identity threshold for grouping proteins by similarity, see Details. |
If the sum of split_perc is less than 100 an extra split is generated
with the remaining observations - e.g., split_perc = c(50, 30) results in
three sets with an approximately 50/30/20% split of the total observations.
If the sum is greater than 100 the splits are linearly scaled down so that
the sum becomes 100. Note that the split percents correspond to the number of
observations, not the number of unique IDs.
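The adjustment of split_perc described above can be sketched in a few lines of base R. This is only an illustration of the stated rule: normalise_split_perc is a hypothetical helper name, not a function exported by the package, and the actual implementation may differ.

```r
# Hypothetical helper (not part of the package): mimic the documented
# handling of split_perc.
normalise_split_perc <- function(split_perc) {
  stopifnot(is.numeric(split_perc), all(split_perc > 0))
  total <- sum(split_perc)
  if (total < 100) {
    # Remaining observations go to one extra split
    split_perc <- c(split_perc, 100 - total)
  } else if (total > 100) {
    # Linearly scale down so that the percentages sum to 100
    split_perc <- 100 * split_perc / total
  }
  split_perc
}

res_under <- normalise_split_perc(c(50, 30))  # 50 30 20
res_over  <- normalise_split_perc(c(80, 80))  # 50 50
```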
This function will attempt to approximate the desired split levels, but
depending on the size of wdf and the desired split_level this may
not be possible (e.g., if split_level = "org" and a single organism
corresponds to 90% of the data, one of the splits will necessarily contain
at least 90% of the data, regardless of the values passed in split_perc).
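To make this limitation concrete, the sketch below assigns whole ID groups to splits with a simple greedy rule (largest group first, always into the split furthest below its target). The greedy rule and the name assign_groups are assumptions made for illustration; the documentation does not state which allocation strategy the package uses.

```r
# Hypothetical sketch (not the package's algorithm): keep every ID in a
# single split while trying to match the target observation counts.
assign_groups <- function(ids, split_perc) {
  counts <- sort(table(ids), decreasing = TRUE)    # observations per ID
  target <- length(ids) * split_perc / sum(split_perc)
  filled <- numeric(length(split_perc))            # observations placed so far
  out <- integer(length(counts))
  names(out) <- names(counts)
  for (i in seq_along(counts)) {
    k <- which.max(target - filled)                # split furthest below target
    out[i] <- k
    filled[k] <- filled[k] + counts[[i]]
  }
  unname(out[as.character(ids)])                   # split index per observation
}

# One organism holds 90 of the 100 observations
ids <- c(rep("orgA", 90), rep("orgB", 6), rep("orgC", 4))
split_id <- assign_groups(ids, c(70, 30))
split_sizes <- table(split_id)
```

With a requested 70/30 split, the first split ends up with 90 observations and the second with 10, because orgA cannot be divided across splits.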
If a BLASTp file is provided the routine will keep any pairs of proteins
having (coverage >= coverage_threshold AND
identity >= identity_threshold) under the same split. This is useful to
prevent accidental data leakage due to quasi-identical proteins with
different UIDs. NOTE: this only works if split_level == "prot".
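The grouping rule can be illustrated as follows: any two proteins linked by a hit that passes both thresholds end up in the same group, and groups are merged transitively. The column names (query, subject, coverage, identity) and the helper group_similar_proteins are assumptions for this sketch only; the expected layout of blast_file is not specified here, and the package may group proteins differently.

```r
# Hypothetical sketch (not the package's code): merge proteins into groups
# whenever a hit satisfies both thresholds.
group_similar_proteins <- function(protein_ids, hits,
                                   coverage_threshold = 80,
                                   identity_threshold = 80) {
  # Each protein starts in its own group, labelled by its own ID
  group <- stats::setNames(protein_ids, protein_ids)
  keep <- hits$coverage >= coverage_threshold &
    hits$identity >= identity_threshold
  hits <- hits[keep, , drop = FALSE]
  for (i in seq_len(nrow(hits))) {
    a <- group[[hits$query[i]]]
    b <- group[[hits$subject[i]]]
    if (a != b) group[group == b] <- a   # merge group b into group a
  }
  group
}

# Toy hits table; all IDs and values are made up for illustration
hits <- data.frame(
  query    = c("P1", "P2", "P4"),
  subject  = c("P2", "P3", "P5"),
  coverage = c(95, 90, 60),
  identity = c(99, 85, 99),
  stringsAsFactors = FALSE
)
groups <- group_similar_proteins(paste0("P", 1:5), hits)
```

Here P1, P2 and P3 share one group label (P1 and P3 are linked through P2), while P4 and P5 stay separate because their hit has only 60% coverage. The sketch assumes every ID in the hits table also appears in protein_ids.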
A list object containing the split data tables.
Felipe Campelo (f.campelo@aston.ac.uk)