knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
phosphocie
is a package to create kinase colour embeddings: continuous colours that
denote kinase specificities for a phosphorylation site. These colour embeddings are
based on the U-CIE framework:
reduce data down to 3 dimensions and optimally embed it in the CIELAB colourspace.
Specifically, phosphocie
provides functions to handle phosphorylation site data,
create kinase prediction tool input files, handle the resulting kinase prediction output,
and a pre-trained kinase colourspace based on the full
PhosphoSitePlus phosphorylation site dataset
In this vignette, we will demonstrate phosphocie
using the kinase-substrate-annotated
phosphorylation site dataset, from PhosphoSitePlus.
For a less annotated example with a real phosphoproteomics dataset, see the example vignette
library(phosphocie) library(dplyr) library(ggplot2) library(uwot)
An example extract of the kinase-substrate dataset can be loaded in with read_kinsub()
:
kinsub_path <- system.file('extdata', 'kinase_substrate_dataset_head', package = 'phosphocie') kinsub <- read_kinsub(kinsub_path) head(kinsub)
build_fastas()
creates a FASTA file with a separate FASTA entry (header and sequence)
for each site, which can be used for input in NetPhorest. It automatically appends 7-wide cutout
around the site to the header, which enables matching in the next steps.
It also returns the same data, but with the new headers as the fasta_id
column.
tmp <- tempdir() kinsub_headers <- build_fastas(kinsub, name_col = 'unique_id', seq_col = 'fragment_15', path = file.path(tmp, 'kinsub_peptides.fasta'))
Alternatively, retrieve_fastas()
retrieves full fasta files for a list of accession IDs.
However, phosphocie
is geared towards visualising kinase predictions for individual,
possibly unidentified sites, so this option is not recommended and requires more manual processing
to work with the follow-up functions.
# retrieve_fastas(kinsub_human$acc_id, file.path(tmp, 'kinsub.fasta'))
Kinase prediction can be done with your tool of choice, but phosphocie natively supports NetPhorest. For details on how to run NetPhorest locally, see the NetPhorest site.
# Run NetPhorest on the command line cat kinsub_human.fasta | ./netphorest > kinsub_human_netphorest
This output file is also included in the package, and can be read in using
read_netphorest()
and filter_netphorest()
:
kinsub_netphorest_path <- system.file('extdata', 'kinsub_head_netphorest', package = 'phosphocie') kinsub_netphorest_raw <- read_netphorest(kinsub_netphorest_path) kinsub_netphorest <- filter_netphorest(kinsub_netphorest_raw, source_window_size = 15)
filter_netphorest()
attempts to filter out sites in our sequences from build_fasta()
that were detected by NetPhorest, but that aren't the true intended sites in the middle
of the supplied sequence window.
It can determine conclusive sites for any non-truncated windows or windows truncated by the
end of the protein, but struggles to distinguish between true sites at the start of a protein
and false sites at the beginning of the window without additional data.
For example, we can take a look at one uncertain ID in this data:
uncertain <- kinsub_netphorest_raw %>% filter(fasta_id == 'P13796|LCP1|L-plastin|S5|ARGSVSD') uncertain
We know from the metadata encoded in fasta_id
that S5 is the true site.
It's not at position
8, where we normally expect sites in data with a source_window_size
of 15 like PhosphoSitePlus, because the original 'filler' was truncated.
However, without this metadata knowledge, it's impossible to distinguish this site
from a fake detected site at the start of a window.
By default, filter_netphorest()
will use the included site fragment included in the
fasta header by build_fastas()
to detect the true site:
uncertain %>% filter_netphorest(source_window_size = 15)
This can also be passed as a manual column:
uncertain %>% tidyr::extract(fasta_id, 'fragments', '\\|([A-Za-z_-]{3,9})$', remove = FALSE) %>% filter_netphorest(source_window_size = 15, fragment_col = 'fragments')
If this metadata is not available, keep_uncertain
determines how remaining uncertainties are handled:
keep_uncertain = NULL
(default): Keep the site closest to the middle.
keep_uncertain = TRUE
: Don't filter uncertain sites, and keep all possibilities instead.
* keep_uncertain = FALSE
: Drop any uncertain sites completely, only return sites with certainty.
uncertain %>% filter_netphorest(source_window_size = 15, match_fragments = FALSE, keep_uncertain = NULL) uncertain %>% filter_netphorest(source_window_size = 15, match_fragments = FALSE, keep_uncertain = TRUE) uncertain %>% filter_netphorest(source_window_size = 15, match_fragments = FALSE, keep_uncertain = FALSE)
We match by the ID column, which is the primary identifier carried over from original data to the prediction file.
However, if these are distinct but both contain information that can
be used to match (like gene/protein name, accession id, original protein location/residue),
this information can be extracted using a function like tidyr::separate()
and the resulting
columns can be used to match as well (i.e. by = c('acc_id', 'position', 'residue')
)
kinsub_full <- inner_join(kinsub_headers, kinsub_netphorest, by = 'fasta_id') kinsub_missing <- kinsub_headers %>% filter(!fasta_id %in% kinsub_full$fasta_id) # Data for which no prediction could be recovered
kinsub_preds <- select(kinsub_full, fasta_id, Abl_group:YSK_group) %>% tibble::column_to_rownames('fasta_id') %>% as.matrix() kinsub_info <- select(kinsub_full, -c(Abl_group:YSK_group))
This package includes a pre-calculated kinase colour embedding space, based on NetPhorest kinase predictions for all sites in PhosphoSitePlus's phosphorylation site dataset.
# Transform with reference UMAP kinsub_ref_ucie <- kinase2cielab(kinsub_preds)
We can also fit your dataset to a new colourspace, based solely on (the 3D representation of) the dataset.
kinsub_umap_3d <- uwot::tumap(kinsub_preds, n_components = 3) kinsub_transformations <- fit_transform(kinsub_umap_3d) kinsub_ucie <- transform_data(kinsub_umap_3d, kinsub_transformations)
kinsub_umap_2d <- uwot::tumap(kinsub_preds) kinsub_2d <- kinsub_umap_2d %>% magrittr::set_colnames(c('X', 'Y')) kinsub_plot_data <- bind_cols(kinsub_info, kinsub_2d) %>% left_join(kinsub_ucie, by = c('fasta_id' = 'name')) kinsub_ref_plot_data <- bind_cols(kinsub_info, kinsub_2d) %>% left_join(kinsub_ref_ucie, by = c('fasta_id' = 'name'))
kinsub_plot_data %>% ggplot(aes(X, Y, color = colour)) + geom_point(size = 3) + scale_color_identity() + guides(color = 'none') kinsub_plot_data %>% mutate(kin_gene = forcats::fct_lump_n(kin_gene, 15)) %>% ggplot(aes(X, Y, color = colour)) + geom_point(size = 3) + scale_color_identity() + guides(color = 'none') + facet_wrap(~kin_gene) kinsub_ref_plot_data %>% ggplot(aes(X, Y, color = colour)) + geom_point(size = 3) + scale_color_identity() + guides(color = 'none') kinsub_ref_plot_data %>% mutate(kin_gene = forcats::fct_lump_n(kin_gene, 15)) %>% ggplot(aes(X, Y, color = colour)) + geom_point(size = 3) + scale_color_identity() + guides(color = 'none') + facet_wrap(~kin_gene)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.