KinSwingR: Predicting kinase activity from phosphoproteomics data
In KinSwingR: KinSwingR: network-based kinase activity prediction

Introduction to KinSwing

KinSwingR aims to predict kinase activity from phoshoproteomics data. It implements the alogorithm described in: @Kasper2019 (in greater detail below). KinSwingR predicts kinase activity by integrating kinase-substrate predictions and the fold change and signficance of change for peptide sequences obtained from phospho-proteomics studies. The score is based on the network connectivity of kinase-substrate networks and is weighted for the number of substrates as well as the size of the local network. P-values are provided to assess the significance of the KinSwing scores, which are determined through random permuations of the overall kinase-substrate network.

KinSwingR is implemented as 3 core functions:

buildPWM() builds position weight matrices (PWMs) from known kinase-substrate sequences
scoreSequences() score PWMs build using buildPWM() against input phosphoproteome data
swing() integrates PWM scores, direction of phosphopeptide change and significance of phosphopeptide change into a "swing" score.

The KinSwing score is a metric of kinase activity, ranging from positive to negative, and p-values are provided to determine significance.

Additional functions are also provided:

cleanAnnotation() function to tidy annotations and extract peptide sequences.
viewPWM() function to view PWM models

Detailed information for each of these functions can be accessed using the ? command before the function of interest. E.g. ?buildPWM

KinSwingR example workflow

We will now consider an example dataset to predict kinase activity. Kinase-substrate sequences and phosphoproteomics data are provided as example data in the KinSwingR package.

Begin by loading the KinSwingR library and the two data libraries included in the package.

library(KinSwingR)
data(example_phosphoproteome)
data(phosphositeplus_human)

# View the datasets:
head(example_phosphoproteome)

head(phosphositeplus_human)

## sample 100 data points for demonstration
sample_data <- head(example_phosphoproteome, 1000)

# Random sample for demosntration purposes
set.seed(1234)
sample_pwm <- phosphositeplus_human[sample(nrow(phosphositeplus_human), 1000),]

# for visualising a motif, sample only CAMK2A
CAMK2A_example <- phosphositeplus_human[phosphositeplus_human[,1]== "CAMK2A",]

Extracting peptides for analysis

Where the centered peptide sequences (on the phosphosite of interest) are not provided in the format required for scoreSequences() (see the argument "input_data", in ?scoreSequences), these can be required to be extracted from another column of annotated data. NB. "input_data" table format must contain columns for "annotation", "peptide", "fold-change" and "p-values".

In the example dataset provided, example_phosphoproteome, peptides have not been extracted into a stand-a-lone peptide column. cleanAnnotation() is provided as a function to extract peptides from annotation columns and place into the peptide column.

In the example dataset, example_phosphoproteome, the peptide sequence is the 4th component of the annotation, which corresponds to using the argument seq_number = 4 below, and is seperated by |, which corresponds to the argument annotation_delimiter = "|". In this case, the annotated data also contains multi-mapped and multi-site information. For example the following annotation A1L1I3|Numbl|263;270|PAQPGHVSPTPATTS;SPTPATTSPGEKGEA contains two peptides PAQPGHVSPTPATTS and SPTPATTSPGEKGEA that map to different sites from the same reference gene Numbl, where the peptides are seperated by ;. The annotated data also includes multi-protein mapped (where a peptide could map to more than one protein - not shown) and contains X instead of _ to indicate sequences that were outside of the length of the coding sequences. KinSwingR requires that these sequences outside of the coding region are marked with _ as deafult and therefore replace_search = "X" and replace_with = "_" can be used as arguments in cleanAnnotation() to replace these. This allows for full flexibility of the input data here, depending of the software used to generate determine the peptide sequences. NB: characters other than _ can be used, but these need to be declared when calling buildPWM and scoreSequences functions later (see their help files).

Calling cleanAnnotation() will produce a new table with the unique combinations of peptide sequences extracted from the annotation column into the peptide column:

annotated_data <- cleanAnnotation(input_data = sample_data, 
                                  annotation_delimiter = "|",
                                  multi_protein_delimiter = ":", 
                                  multi_site_delimiter = ";",
                                  seq_number = 4, 
                                  replace = TRUE, 
                                  replace_search = "X",
                                  replace_with = "_")

head(annotated_data)

Build Position Weight Matrices (PWMs)

The first step to inferring kinase activity, is to build Position Weight Matrices (PWMs) for kinases. This can be done using buildPWM() for any table containing centered substrate peptide sequences for a list of kinases. The example data data(phosphositeplus_human) indicates the required format for building PWM models. Below, for demosntration, we use a subset that was sampled above sample_pwm

To generate the PWMs:

pwms <- buildPWM(sample_pwm)

This will build the PWM models, accessible as PWM$pwm and list the number of substrate sequences used to build each PWM, accesible as PWM$kinase.

To view the list of kinases and the number of sequences used:

head(pwms$kinase)

Visualising motifs

Motif amino acids are coloured according to their properties. color_scheme parameter allows options are "lesk" or "shapely" (default). Y-axis is the information content, measured as bits.

CAMK2A_pwm <- buildPWM(CAMK2A_example)
CAMK2A <- viewPWM(CAMK2A_pwm, 
                  which_pwm="CAMK2A",
                  view_pwm = TRUE,
                  color_scheme = "shapely")

Score PWM matches against peptide sequences

Next, we will use the PWM models generated, pwms, to identify matches in the annotated_data table that was cleaned using cleanAnnotation() above. ``scoreSequencessupports multi-core processing - see the example below for setting the number of workers to 4.scoreSequencesdraws a random background by default of sizen = 1000. It is recommended to useset.seed()prior to callingscoreSequencesif you wish to reproduce your results. To access the help file, which explains all the arguments, type?scoreSequences``` into the console.

# As an example of control over multi-core processing
# load BiocParallel library
library(BiocParallel)
# finally set/register the number of cores to use
register(SnowParam(workers = 4))

# set seed for reproducible results
set.seed(1234)
scores <- scoreSequences(input_data = annotated_data, 
                         pwm_in = pwms,
                         n = 100)

The outputs of scores are transparent and accessible. These are however primarily intermediate tables for obtaining swing scores. scores is a simple list object that contains peptide scores (scores$peptide_scores), p-values for the peptide scores (scores$peptide_p) and the background peptides used to score significance (scores$background) for reproducibility (i.e. the background can saved and reused for reproducibility).

In summary, scoreSequences() scores each input sequence for a match against all PWMs provided using `buildPWM() and generates p-values for scores. This is effectively one large network of kinase-substrate edges of dimensions kinase, k, by substrate, s.

Predict kinase activity using swing()

Having built a kinase-substrate network, swing() then integrates the kinase-substrate predictions, directionality and significance of phosphopeptide fold change to assess local connectivity (or swing) of kinase-substrate networks. The final score is a normalised score of predicted kinase activity that is weighted for the number of substrates used in the PWM model and number of peptides in the local kinase-substrate network. By default, this will permute the network 1000 times (here we use 10 for example purposes). It is recommended to use set.seed() prior to calling swing if you wish to reproduce your results. ``swing``` supports multi-core processing - see the example below for setting the number of workers to 4.

# after loading BiocParallel library, set/register the number of cores to use
register(SnowParam(workers = 4))

# set seed for reproducible results
set.seed(1234)
swing_out <- swing(input_data = annotated_data, 
                   pwm_in = pwms, 
                   pwm_scores = scores,
                   permutations = 10)

# This will produce two tables, one is a network for use with e.g. Cytoscape and the other is the scores. To access the scores:
head(swing_out$scores)

The outputs of this table indicate the following:

kinase: The kinase
pos: Number of positively regulated kinase substrates
neg: Number of negatively regulated kinase substrates
all: Total number of regulated kinase substrates
pk: Proportion of positively regulated kinase substrates
nk: Proportion of negatively regulated kinase substrates
swing_raw: Raw - weighted score
n: Number of subtrate sequence in kinase PWM
swing: Normalised (Z-score transformed) - weighted score
p_greater: probability of observing a swing score greater than
p_less: probability of observing a swing score less than

Note that the pos, neg and all include a pseudo-count, that is set in ?swing, note pseudo_count.

See @Kasper2019 for methods description

KinSwingR algorithm

For a full description of the KinSwing algorithm, see @Kasper2019

In brief:

buildPWM() generates Position Weight Matrices (PWMs) for kinases based on known substrate sequence (Equation 1), where each kinase, $K$, is considered as the log-likelihood ratio of the average frequency of amino acid, $a$, at each position, $p$, divided by background frequencies, $B$ ($C$ is a pseudo count to avoid log zero):

Equation 1: $PWM_{(a,p)}=log((1/n∑^n_{i=1}K_i)+C)/B_a+C)$

scoreSequences() scores each kinase, $K$, match to a substrate $S$, given as $S_{score}=∑^n_{(i=1)}f(a,p)$ , which corresponds to the sum of the corresponding amino acid, $a$, of peptide sequence length, $i$, from position, $p$, of $PWM_{(a,p)}$ and $f(a,p)=PWM_{ap}∈PWM_{(a,p)}$. The probability of observing $S_score$ for kinase, $K$, is determined as conditional on a randomly sampled reference distribution of size $N$ sequences $P(S_{score}|R,N)$, where $R$ sequences are determined to have a test statistic less than or equal to $S_{score}$:

Equation 2: $R= ∑^N_{n=1}I((S_{score})n* ≥ (S_{score})i)$

swing() integrates phosphosite data and kinase-substrate scores from scoreSequences() into a network for scoring kinase activity based on local connectivity, $swing_k$, (Equation 3). $swing_k$ is the weighted product of the proportion of positive, $Pos_k$, and negative, $Neg_k$, network edges, determined as the product of a logic function (described here: @Kasper2019) given a local network of size, $C_k$, with $n$ substrates for kinase, $K$:

Equation 3: $swing_k=log_2((Pos_k+c)/(Neg_k+c))log_2(C_k)log_2(S_n)$

$swing_k$, is transformed into a z-score, $Z(swing_k)$, where, $μ$, is the mean and, $σ$, the standard deviation of swing scores, thus allowing for comparison of predicted kinase activity across multiple timepoints and/or conditions.

KinSwingR addresses the question of “how likely is it is observe the predicted activity of kinase, $K$, by random chance?” by computing $swing_k$ given $N$ permutations of kinase node labels, $K$, to substrates, $S$, of the total network, $M_{ks}$. Thus, the probability of observing $swing_k$ is conditional on this permuted reference distribution, of size, $N$ (Equation 2). This is computed for each tail of the distribution, that is, positive and negative $swing_k$ scores.