filter_netphorest: Remove off-target NetPhorest sites

filter_netphorest R Documentation

Remove off-target NetPhorest sites

Description

NetPhorest always scans the entire input sequence for possible sites, even when the sequence is a short fragment with the desired site in the middle, as is the case when build_fastas() is used to build NetPhorest input. filter_netphorest() detects and removes these unwanted sites, keeping the site in the middle of the original sequence.

Sites at the end of proteins are accounted for (see details), but sites at the beginning (within source_window_size/2) cannot reliably be detected without data on the original position in the protein (see protein_pos_col), due to limitations in NetPhorest's output. Set keep_uncertain to handle these.

name_col, seq_col, and pos_col defaults are based on the read_netphorest() default column names.

Usage

filter_netphorest(
  data,
  name_col = "fasta_id",
  seq_col = "fragment_11",
  pos_col = "position",
  fragment_col = NULL,
  match_fragments = TRUE,
  source_window_size,
  match_middle = TRUE,
  keep_uncertain = NULL
)

Arguments

data

Data frame with one possible site per row (wide format), presumably from read_netphorest().

name_col

Name of column containing 'true' site names. Data will be filtered down to one row in every group denoted by this column. Default is 'fasta_id'.

seq_col

Name of column containing the sequences output by NetPhorest. Default is 'fragment_11'.

pos_col

Name of column containing position of detected site in NetPhorest FASTA sequence. Default is 'position'.

fragment_col

Optional: name of column containing sequence fragments surrounding the true site, as generally extracted from the header column. If not supplied and match_fragments is TRUE (both default), will attempt to extract the fragments from name_col by extracting the last 3-9 letter/underscore/dash characters from the fasta header.

source_window_size

Length of the original site window sequences fed to NetPhorest.

keep_uncertain

One of TRUE/FALSE/NULL (default). Some choices can be uncertain, especially if there is no extra information from fragments (see details). If keep_uncertain = TRUE, all possible uncertain values are kept; if keep_uncertain = FALSE, all uncertain groups are fully dropped; if keep_uncertain = NULL (default), a best guess is made based on proximity to the midpoint of the sequence.

match_fragments

Optional: Whether to use fragments to match the detected site to the true site. Fragments can be of any odd length and are either extracted from the name column or taken directly from fragment_col. Default is TRUE.
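
The keep_uncertain = NULL best guess described above can be illustrated with a minimal sketch. closest_to_midpoint() is a hypothetical helper written for this illustration, not a phosphocie function, and it assumes that the midpoint is ceiling(source_window_size / 2) and that ties go to the first candidate; the package's actual rule may differ.

```r
# Hypothetical helper (not part of phosphocie): among candidate site
# positions, keep the one closest to the midpoint of the source window.
closest_to_midpoint <- function(positions, source_window_size) {
  # Assumed midpoint definition: position 8 for a window of 15
  midpoint <- ceiling(source_window_size / 2)
  # which.min() returns the first minimum, so ties go to the earlier candidate
  positions[which.min(abs(positions - midpoint))]
}

closest_to_midpoint(c(5, 7), source_window_size = 15)
# 7
```

With a window of 15, position 7 is one residue from the assumed midpoint and position 5 is three residues away, so 7 is kept.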

Details

This function works with just the default NetPhorest output columns given by name_col, seq_col and pos_col. However, filtering can be improved by providing the sequence around the true site (the fragment), either extracted automatically from name_col (fragment_col kept as NULL) or taken from a separate column (fragment_col supplied).
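
As a rough illustration of the automatic extraction, the sketch below pulls the last 3-9 letter/underscore/dash characters from a FASTA header using base R. extract_fragment() is a hypothetical helper and the regular expression is an assumption based on the fragment_col description; it is not necessarily the pattern used inside filter_netphorest().

```r
# Hypothetical helper (not part of phosphocie): take the trailing 3-9
# letter/underscore/dash characters of each FASTA header as the fragment.
extract_fragment <- function(fasta_id) {
  # regexpr() returns -1 where a header has no matching tail
  match_info <- regexpr("[A-Za-z_-]{3,9}$", fasta_id)
  fragment <- rep(NA_character_, length(fasta_id))
  # regmatches() only returns the matched elements, so fill those in place
  fragment[which(match_info > 0)] <- regmatches(fasta_id, match_info)
  fragment
}

extract_fragment("P13796|LCP1|L-plastin|S5|RGSVS")
# "RGSVS"
```

Headers without a matching tail yield NA in this sketch, which makes it easy to spot rows where no fragment is available.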

Value

The original dataset, without incorrect sites. Global row order is not preserved.

Examples

# Default usage
kinsub_netphorest_path <- system.file('extdata', 'kinsub_human_netphorest', package = 'phosphocie')
kinsub_netphorest <- read_netphorest(kinsub_netphorest_path)
kinsub_filtered <- filter_netphorest(kinsub_netphorest, source_window_size = 15)

# Handle ambiguous sites
ambiguous_data <- data.frame(id = c("P13796|LCP1|L-plastin|S5|RGSVS", "P13796|LCP1|L-plastin|S5|RGSVS"),
                             pos = c(5, 7),
                             seq = c("-MARGsVSDEE", "ARGSVtDEEMM"))


## Without further info, filter_netphorest should pick 7,
## because 5 looks like an erroneous non-central site.
filter_netphorest(ambiguous_data,
                  name_col = 'id',
                  seq_col = 'seq',
                  pos_col = 'pos',
                  match_fragments = FALSE,
                  source_window_size = 15)

## Return all or none instead with `keep_uncertain`
filter_netphorest(ambiguous_data, 'id', 'seq', 'pos', match_fragments = FALSE,
                  source_window_size = 15, keep_uncertain = TRUE)
filter_netphorest(ambiguous_data, 'id', 'seq', 'pos', match_fragments = FALSE,
                  source_window_size = 15, keep_uncertain = FALSE)

## Or return the true value by integrating data from the fasta header, manually or automatically:
ambiguous_data_extra <- tidyr::extract(ambiguous_data, id, 'fragment', '\\|([A-Za-z_]{3,9})$',
                                       remove = FALSE, convert = TRUE)

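## source_window_size is passed by name so it is not mistaken for name_col
filter_netphorest(ambiguous_data_extra, 'id', 'seq', 'pos', source_window_size = 15)
filter_netphorest(ambiguous_data_extra, 'id', 'seq', 'pos', fragment_col = 'fragment',
                  source_window_size = 15)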


casblaauw/phosphocie documentation built on March 30, 2022, 8:28 p.m.