knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

0. Overview

phosphocie is a package to create kinase colour embeddings: continuous colours that denote kinase specificities for a phosphorylation site. These colour embeddings are based on the U-CIE framework: reduce data down to 3 dimensions and optimally embed it in the CIELAB colourspace. Specifically, phosphocie provides functions to handle phosphorylation site data, create kinase prediction tool input files, handle the resulting kinase prediction output, and a pre-trained kinase colourspace based on the full PhosphoSitePlus phosphorylation site dataset

In this vignette, we will demonstrate phosphocie using the kinase-substrate-annotated phosphorylation site dataset, from PhosphoSitePlus.

For a less annotated example with a real phosphoproteomics dataset, see the example vignette

library(phosphocie)
library(dplyr)
library(ggplot2)
library(uwot)

1. Prepare phosphorylation site data

Read in phosphorylation site data

An example extract of the kinase-substrate dataset can be loaded in with read_kinsub():

kinsub_path <- system.file('extdata', 'kinase_substrate_dataset_head', package = 'phosphocie')
kinsub <- read_kinsub(kinsub_path)

head(kinsub)

Build FASTAs

build_fastas() creates a FASTA file with a separate FASTA entry (header and sequence) for each site, which can be used for input in NetPhorest. It automatically appends 7-wide cutout around the site to the header, which enables matching in the next steps. It also returns the same data, but with the new headers as the fasta_id column.

tmp <- tempdir()
kinsub_headers <- build_fastas(kinsub, name_col = 'unique_id', seq_col = 'fragment_15', path = file.path(tmp, 'kinsub_peptides.fasta'))

Alternatively, retrieve_fastas() retrieves full fasta files for a list of accession IDs. However, phosphocie is geared towards visualising kinase predictions for individual, possibly unidentified sites, so this option is not recommended and requires more manual processing to work with the follow-up functions.

# retrieve_fastas(kinsub_human$acc_id, file.path(tmp, 'kinsub.fasta'))

Predict kinases

Kinase prediction can be done with your tool of choice, but phosphocie natively supports NetPhorest. For details on how to run NetPhorest locally, see the NetPhorest site.

# Run NetPhorest on the command line
cat kinsub_human.fasta | ./netphorest > kinsub_human_netphorest

This output file is also included in the package, and can be read in using read_netphorest() and filter_netphorest():

kinsub_netphorest_path <- system.file('extdata', 'kinsub_head_netphorest', package = 'phosphocie')
kinsub_netphorest_raw <- read_netphorest(kinsub_netphorest_path) 

kinsub_netphorest <- filter_netphorest(kinsub_netphorest_raw, source_window_size = 15)

filter_netphorest() attempts to filter out sites in our sequences from build_fasta() that were detected by NetPhorest, but that aren't the true intended sites in the middle of the supplied sequence window. It can determine conclusive sites for any non-truncated windows or windows truncated by the end of the protein, but struggles to distinguish between true sites at the start of a protein and false sites at the beginning of the window without additional data.

For example, we can take a look at one uncertain ID in this data:

uncertain <- kinsub_netphorest_raw %>% filter(fasta_id == 'P13796|LCP1|L-plastin|S5|ARGSVSD')
uncertain

We know from the metadata encoded in fasta_id that S5 is the true site. It's not at position 8, where we normally expect sites in data with a source_window_size of 15 like PhosphoSitePlus, because the original 'filler' was truncated. However, without this metadata knowledge, it's impossible to distinguish this site from a fake detected site at the start of a window.

By default, filter_netphorest() will use the included site fragment included in the fasta header by build_fastas() to detect the true site:

uncertain %>%
  filter_netphorest(source_window_size = 15)

This can also be passed as a manual column:

uncertain %>%
  tidyr::extract(fasta_id, 'fragments', '\\|([A-Za-z_-]{3,9})$', remove = FALSE) %>% 
  filter_netphorest(source_window_size = 15, fragment_col = 'fragments')

If this metadata is not available, keep_uncertain determines how remaining uncertainties are handled: keep_uncertain = NULL (default): Keep the site closest to the middle. keep_uncertain = TRUE: Don't filter uncertain sites, and keep all possibilities instead. * keep_uncertain = FALSE: Drop any uncertain sites completely, only return sites with certainty.

uncertain %>% filter_netphorest(source_window_size = 15, match_fragments = FALSE, keep_uncertain = NULL)
uncertain %>% filter_netphorest(source_window_size = 15, match_fragments = FALSE, keep_uncertain = TRUE)
uncertain %>% filter_netphorest(source_window_size = 15, match_fragments = FALSE, keep_uncertain = FALSE)

Create colour embeddings

Join together the original data and kinase predictions

We match by the ID column, which is the primary identifier carried over from original data to the prediction file.

However, if these are distinct but both contain information that can be used to match (like gene/protein name, accession id, original protein location/residue), this information can be extracted using a function like tidyr::separate() and the resulting columns can be used to match as well (i.e. by = c('acc_id', 'position', 'residue'))

kinsub_full <- inner_join(kinsub_headers, kinsub_netphorest, by = 'fasta_id') 
kinsub_missing <- kinsub_headers %>% filter(!fasta_id %in% kinsub_full$fasta_id) # Data for which no prediction could be recovered

Separate data into labels and predictions

kinsub_preds <- select(kinsub_full, fasta_id, Abl_group:YSK_group) %>% 
  tibble::column_to_rownames('fasta_id') %>% 
  as.matrix()
kinsub_info <- select(kinsub_full, -c(Abl_group:YSK_group))

Transform data using the reference U-CIE space

This package includes a pre-calculated kinase colour embedding space, based on NetPhorest kinase predictions for all sites in PhosphoSitePlus's phosphorylation site dataset.

# Transform with reference UMAP
kinsub_ref_ucie <- kinase2cielab(kinsub_preds)

Alternative: transform into standalone U-CIE space

We can also fit your dataset to a new colourspace, based solely on (the 3D representation of) the dataset.

kinsub_umap_3d <- uwot::tumap(kinsub_preds, n_components = 3)

kinsub_transformations <- fit_transform(kinsub_umap_3d)
kinsub_ucie <- transform_data(kinsub_umap_3d, kinsub_transformations)

Visualise kinase colour embeddings

Reduce data to 2D

kinsub_umap_2d <- uwot::tumap(kinsub_preds)
kinsub_2d <- kinsub_umap_2d %>% magrittr::set_colnames(c('X', 'Y'))
kinsub_plot_data <- bind_cols(kinsub_info, kinsub_2d) %>% 
  left_join(kinsub_ucie, by = c('fasta_id' = 'name'))
kinsub_ref_plot_data <- bind_cols(kinsub_info, kinsub_2d) %>% 
  left_join(kinsub_ref_ucie, by = c('fasta_id' = 'name'))

Create plots

kinsub_plot_data %>% 
  ggplot(aes(X, Y, color = colour)) + 
  geom_point(size = 3) + 
  scale_color_identity() + 
  guides(color = 'none')

kinsub_plot_data %>% 
  mutate(kin_gene = forcats::fct_lump_n(kin_gene, 15)) %>% 
  ggplot(aes(X, Y, color = colour)) + 
  geom_point(size = 3) + 
  scale_color_identity() + 
  guides(color = 'none') + 
  facet_wrap(~kin_gene)

kinsub_ref_plot_data %>% 
  ggplot(aes(X, Y, color = colour)) + 
  geom_point(size = 3) + 
  scale_color_identity() + 
  guides(color = 'none')

kinsub_ref_plot_data %>% 
  mutate(kin_gene = forcats::fct_lump_n(kin_gene, 15)) %>% 
  ggplot(aes(X, Y, color = colour)) + 
  geom_point(size = 3) + 
  scale_color_identity() + 
  guides(color = 'none') + 
  facet_wrap(~kin_gene)


casblaauw/phosphocie documentation built on March 30, 2022, 8:28 p.m.