filter_netphorest | R Documentation |
NetPhorest will always scan the entire sequence for possible sites, even when the sequence is a short fragment with the desired site in the middle, like when using build_fastas() to build NetPhorest input. filter_netphorest() detects and removes these unwanted sites, in favour of the site in the middle of the original sequence.
Sites at the end of proteins are accounted for (see details), but sites at the beginning (within source_window_size/2) cannot reliably be detected without data on the original position in the protein (see protein_pos_col), due to limitations in NetPhorest's output. Set keep_uncertain to handle these.
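To illustrate why sites near the start of a protein are ambiguous, the sketch below computes where the true site would sit inside its source window. It assumes that windows are truncated (not padded) at the protein start, which this page does not state, and true_site_index() is a hypothetical helper, not part of the package.

```r
# Hypothetical helper (not part of the package): index of the true site
# within its source window, ASSUMING windows are truncated at the protein start.
true_site_index <- function(protein_pos, source_window_size) {
  half <- (source_window_size - 1) / 2  # residues on each side of the site
  centre <- half + 1                    # index of the middle residue
  pmin(protein_pos, centre)             # a truncated window shifts the site left
}

# Sites at protein positions 5, 8 and 120 with a 15-residue window
true_site_index(c(5, 8, 120), source_window_size = 15)
# 5 8 8
```

Under this assumption, only a site whose protein position is known can be told apart from another candidate that also lies left of the centre, which is why the original position is needed for these cases.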
The name_col, seq_col, and pos_col defaults are based on the read_netphorest() default column names.
filter_netphorest(
  data,
  name_col = "fasta_id",
  seq_col = "fragment_11",
  pos_col = "position",
  fragment_col = NULL,
  match_fragments = TRUE,
  source_window_size,
  match_middle = TRUE,
  keep_uncertain = NULL
)
data |
Data frame with data with one possible site per row (wide format),
presumably from |
name_col |
Name of column containing 'true' site names. Data will be filtered down to one row in every group denoted by this column. Default is 'fasta_id'. |
seq_col |
Name of column containing the sequences output by NetPhorest. Default is 'fragment_11'. |
pos_col |
Name of column containing position of detected site in NetPhorest FASTA sequence. Default is 'position'. |
fragment_col |
Optional: name of column containing sequence fragments
surrounding the true site, as generally extracted from the header column.
If not supplied and |
source_window_size |
Length of the original site window sequences fed to NetPhorest. |
keep_uncertain |
One of TRUE/FALSE/NULL (default). Some choices can be uncertain,
especially if there is no extra information from fragments (see details).
If |
match_fragments |
Optional: Whether to use fragments to match the detected site
to the true site. Fragments can be any odd length and are either extracted
from the name column or directly used from |
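The three keep_uncertain settings can be pictured with the sketch below. It is based only on the behaviour described in the examples on this page (NULL picks a best guess, TRUE returns all candidates, FALSE returns none); resolve_uncertain() and its best_guess argument are hypothetical and do not reflect the package's internals.

```r
# Hypothetical helper (not part of the package): decide what to return for one
# group of candidate rows that cannot be resolved with certainty.
resolve_uncertain <- function(candidates, best_guess, keep_uncertain = NULL) {
  if (is.null(keep_uncertain)) {
    candidates[best_guess, , drop = FALSE]  # NULL: keep only the best guess
  } else if (isTRUE(keep_uncertain)) {
    candidates                              # TRUE: keep every candidate
  } else {
    candidates[0, , drop = FALSE]           # FALSE: drop the whole group
  }
}

candidates <- data.frame(pos = c(5, 7), seq = c("-MARGsVSDEE", "ARGSVtDEEMM"))
nrow(resolve_uncertain(candidates, best_guess = 2))                          # 1
nrow(resolve_uncertain(candidates, best_guess = 2, keep_uncertain = TRUE))   # 2
nrow(resolve_uncertain(candidates, best_guess = 2, keep_uncertain = FALSE))  # 0
```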
This function will work with just the default NetPhorest output data of name_col, seq_col and pos_col. However, filtering can be improved by providing the section around the true site (fragment), either extracted by default from the name column (fragment_col kept as NULL) or taken from a separate column (fragment_col supplied).
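As a rough illustration of fragment matching, the base R sketch below takes the fragment from the end of the name and compares it with the residues around the centre of each detected sequence. It assumes the fragment is the text after the last "|" in the name, as in the example data on this page, and it is not the package's own matching logic.

```r
ids <- c("P13796|LCP1|L-plastin|S5|RGSVS", "P13796|LCP1|L-plastin|S5|RGSVS")
seqs <- c("-MARGsVSDEE", "ARGSVtDEEMM")

# Assumption: the fragment around the true site is the text after the last "|".
fragments <- sub("^.*\\|", "", ids)

# Cut the same number of residues around the centre of each detected sequence.
centre <- (nchar(seqs) + 1) / 2
half <- (nchar(fragments) - 1) / 2
detected <- toupper(substr(seqs, centre - half, centre + half))

# TRUE where the detected site's surroundings agree with the fragment
matches <- detected == fragments
matches
# TRUE FALSE
```

Here only the first candidate (position 5) is surrounded by the residues named in the header, so the fragment resolves a case that position alone could not.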
The original dataset, without incorrect sites. Global row order is not preserved.
# Default usage
kinsub_netphorest_path <- system.file('extdata', 'kinsub_human_netphorest', package = 'phosphocie')
kinsub_netphorest <- read_netphorest(kinsub_netphorest_path)
kinsub_filtered <- filter_netphorest(kinsub_netphorest, source_window_size = 15)

# Handle ambiguous sites
ambiguous_data <- data.frame(
  id = c("P13796|LCP1|L-plastin|S5|RGSVS", "P13796|LCP1|L-plastin|S5|RGSVS"),
  pos = c(5, 7),
  seq = c("-MARGsVSDEE", "ARGSVtDEEMM")
)

## Without further info, filter_netphorest() should pick 7
## because it looks like an erroneous non-central site.
filter_netphorest(ambiguous_data, name_col = 'id', seq_col = 'seq', pos_col = 'pos', match_fragments = FALSE)

## Return all or none instead with `keep_uncertain`
filter_netphorest(ambiguous_data, 'id', 'seq', 'pos', match_fragments = FALSE, keep_uncertain = TRUE)
filter_netphorest(ambiguous_data, 'id', 'seq', 'pos', match_fragments = FALSE, keep_uncertain = FALSE)

## Or return the true value by integrating data from the fasta header, manually or automatically:
ambiguous_data_extra <- tidyr::extract(ambiguous_data, id, 'fragment', '\\|([A-Za-z_]{3,9})$', remove = FALSE, convert = TRUE)
## source_window_size is passed by name, since the second positional argument is name_col
filter_netphorest(ambiguous_data_extra, 'id', 'seq', 'pos', source_window_size = 15)
filter_netphorest(ambiguous_data_extra, 'id', 'seq', 'pos', source_window_size = 15, fragment_col = 'fragment')