KinSwingR aims to predict kinase activity from phoshoproteomics data. It implements the alogorithm described in: @Kasper2019 (in greater detail below). KinSwingR predicts kinase activity by integrating kinase-substrate predictions and the fold change and signficance of change for peptide sequences obtained from phospho-proteomics studies. The score is based on the network connectivity of kinase-substrate networks and is weighted for the number of substrates as well as the size of the local network. P-values are provided to assess the significance of the KinSwing scores, which are determined through random permuations of the overall kinase-substrate network.

KinSwingR is implemented as 3 core functions:

builds position weight matrices (PWMs) from known kinase-substrate sequences`buildPWM()`

score PWMs build using`scoreSequences()`

`buildPWM()`

against input phosphoproteome dataintegrates PWM scores, direction of phosphopeptide change and significance of phosphopeptide change into a "swing" score.`swing()`

The KinSwing score is a metric of kinase activity, ranging from positive to negative, and p-values are provided to determine significance.

Additional functions are also provided:

function to tidy annotations and extract peptide sequences.`cleanAnnotation()`

function to view PWM models`viewPWM()`

Detailed information for each of these functions can be accessed using the `?`

command before the function of interest. E.g. `?buildPWM`

We will now consider an example dataset to predict kinase activity. Kinase-substrate sequences and phosphoproteomics data are provided as example data in the KinSwingR package.

Begin by loading the KinSwingR library and the two data libraries included in the package.

library(KinSwingR) data(example_phosphoproteome) data(phosphositeplus_human) # View the datasets: head(example_phosphoproteome) head(phosphositeplus_human) ## sample 100 data points for demonstration sample_data <- head(example_phosphoproteome, 1000) # Random sample for demosntration purposes set.seed(1234) sample_pwm <- phosphositeplus_human[sample(nrow(phosphositeplus_human), 1000),] # for visualising a motif, sample only CAMK2A CAMK2A_example <- phosphositeplus_human[phosphositeplus_human[,1]== "CAMK2A",]

Where the centered peptide sequences (on the phosphosite of interest) are not provided in the format required for `scoreSequences()`

(see the argument "input_data", in ?scoreSequences), these can be required to be extracted from another column of annotated data. NB. "input_data" table format must contain columns for "annotation", "peptide", "fold-change" and "p-values".

In the example dataset provided, `example_phosphoproteome`

, peptides have not been extracted into a stand-a-lone peptide column. `cleanAnnotation()`

is provided as a function to extract peptides from annotation columns and place into the peptide column.

In the example dataset, `example_phosphoproteome`

, the peptide sequence is the 4th component of the annotation, which corresponds to using the argument `seq_number = 4`

below, and is seperated by `|`

, which corresponds to the argument `annotation_delimiter = "|"`

. In this case, the annotated data also contains multi-mapped and multi-site information. For example the following annotation `A1L1I3|Numbl|263;270|PAQPGHVSPTPATTS;SPTPATTSPGEKGEA`

contains two peptides `PAQPGHVSPTPATTS`

and `SPTPATTSPGEKGEA`

that map to different sites from the same reference gene `Numbl`

, where the peptides are seperated by `;`

. The annotated data also includes multi-protein mapped (where a peptide could map to more than one protein - not shown) and contains `X`

instead of `_`

to indicate sequences that were outside of the length of the coding sequences. KinSwingR requires that these sequences outside of the coding region are marked with `_`

as deafult and therefore `replace_search = "X"`

and `replace_with = "_"`

can be used as arguments in `cleanAnnotation()`

to replace these. This allows for full flexibility of the input data here, depending of the software used to generate determine the peptide sequences. NB: characters other than `_`

can be used, but these need to be declared when calling buildPWM and scoreSequences functions later (see their help files).

Calling `cleanAnnotation()`

will produce a new table with the unique combinations of peptide sequences extracted from the annotation column into the peptide column:

annotated_data <- cleanAnnotation(input_data = sample_data, annotation_delimiter = "|", multi_protein_delimiter = ":", multi_site_delimiter = ";", seq_number = 4, replace = TRUE, replace_search = "X", replace_with = "_") head(annotated_data)

The first step to inferring kinase activity, is to build Position Weight Matrices (PWMs) for kinases. This can be done using `buildPWM()`

for any table containing centered substrate peptide sequences for a list of kinases. The example data `data(phosphositeplus_human)`

indicates the required format for building PWM models. Below, for demosntration, we use a subset that was sampled above `sample_pwm`

To generate the PWMs:

pwms <- buildPWM(sample_pwm)

This will build the PWM models, accessible as `PWM$pwm`

and list the number of substrate sequences used to build each PWM, accesible as `PWM$kinase`

.

To view the list of kinases and the number of sequences used:

head(pwms$kinase)

Motif amino acids are coloured according to their properties. `color_scheme`

parameter allows options are "lesk" or "shapely" (default). Y-axis is the information content, measured as bits.

CAMK2A_pwm <- buildPWM(CAMK2A_example) CAMK2A <- viewPWM(CAMK2A_pwm, which_pwm="CAMK2A", view_pwm = TRUE, color_scheme = "shapely")

Next, we will use the PWM models generated, `pwms`

, to identify matches in the `annotated_data`

table that was cleaned using `cleanAnnotation()`

above. ``scoreSequences`supports multi-core processing - see the example below for setting the number of workers to 4.`

scoreSequences`draws a random background by default of size`

n = 1000`. It is recommended to use`

set.seed()`prior to calling`

scoreSequences`if you wish to reproduce your results. To access the help file, which explains all the arguments, type`

?scoreSequences``` into the console.

# As an example of control over multi-core processing # load BiocParallel library library(BiocParallel) # finally set/register the number of cores to use register(SnowParam(workers = 4)) # set seed for reproducible results set.seed(1234) scores <- scoreSequences(input_data = annotated_data, pwm_in = pwms, n = 100)

The outputs of `scores`

are transparent and accessible. These are however primarily intermediate tables for obtaining swing scores. `scores`

is a simple list object that contains peptide scores `(scores$peptide_scores)`

, p-values for the peptide scores `(scores$peptide_p)`

and the background peptides used to score significance `(scores$background)`

for reproducibility (i.e. the background can saved and reused for reproducibility).

In summary, `scoreSequences()`

scores each input sequence for a match against all PWMs provided using ``buildPWM()`

and generates p-values for scores. This is effectively one large network of kinase-substrate edges of dimensions kinase, ** k**, by substrate,

Having built a kinase-substrate network, `swing()`

then integrates the kinase-substrate predictions, directionality and significance of phosphopeptide fold change to assess local connectivity (or swing) of kinase-substrate networks. The final score is a normalised score of predicted kinase activity that is weighted for the number of substrates used in the PWM model and number of peptides in the local kinase-substrate network. By default, this will permute the network 1000 times (here we use 10 for example purposes). It is recommended to use `set.seed()`

prior to calling `swing`

if you wish to reproduce your results. ``swing``` supports multi-core processing - see the example below for setting the number of workers to 4.

# after loading BiocParallel library, set/register the number of cores to use register(SnowParam(workers = 4)) # set seed for reproducible results set.seed(1234) swing_out <- swing(input_data = annotated_data, pwm_in = pwms, pwm_scores = scores, permutations = 10) # This will produce two tables, one is a network for use with e.g. Cytoscape and the other is the scores. To access the scores: head(swing_out$scores)

The outputs of this table indicate the following:

`kinase`

: The kinase`pos`

: Number ofregulated kinase substrates*positively*`neg`

: Number ofregulated kinase substrates*negatively*`all`

: Total number of regulated kinase substrates`pk`

: Proportion ofregulated kinase substrates*positively*`nk`

: Proportion ofregulated kinase substrates*negatively*`swing_raw`

: Raw - weighted score`n`

: Number of subtrate sequence in`kinase`

PWM`swing`

: Normalised (Z-score transformed) - weighted score`p_greater`

: probability of observing a swing score greater than`p_less`

: probability of observing a swing score less than

Note that the `pos`

, `neg`

and `all`

include a pseudo-count, that is set in `?swing`

, note `pseudo_count`

.

* See @Kasper2019 for methods description *

* For a full description of the KinSwing algorithm, see @Kasper2019 *

**In brief:**

`buildPWM()`

generates Position Weight Matrices (PWMs) for kinases based on known substrate sequence (Equation 1), where each kinase, $K$, is considered as the log-likelihood ratio of the average frequency of amino acid, $a$, at each position, $p$, divided by background frequencies, $B$ ($C$ is a pseudo count to avoid log zero):

** Equation 1:** $PWM_{(a,p)}=log((1/n∑^n_{i=1}K_i)+C)/B_a+C)$

`scoreSequences()`

scores each kinase, $K$, match to a substrate $S$, given as $S_{score}=∑^n_{(i=1)}f(a,p)$ , which corresponds to the sum of the corresponding amino acid, $a$, of peptide sequence length, $i$, from position, $p$, of $PWM_{(a,p)}$ and $f(a,p)=PWM_{ap}∈PWM_{(a,p)}$. The probability of observing $S_score$ for kinase, $K$, is determined as conditional on a randomly sampled reference distribution of size $N$ sequences $P(S_{score}|R,N)$, where $R$ sequences are determined to have a test statistic less than or equal to $S_{score}$:

** Equation 2:** $R= ∑^N_{n=1}I((S_{score})n* ≥ (S_{score})i)$

`swing()`

integrates phosphosite data and kinase-substrate scores from `scoreSequences()`

into a network for scoring kinase activity based on local connectivity, $swing_k$, (Equation 3). $swing_k$ is the weighted product of the proportion of positive, $Pos_k$, and negative, $Neg_k$, network edges, determined as the product of a logic function (described here: @Kasper2019) given a local network of size, $C_k$, with $n$ substrates for kinase, $K$:

** Equation 3:** $swing_k=log_2((Pos_k+c)/(Neg_k+c))

$swing_k$, is transformed into a z-score, $Z(swing_k)$, where, $μ$, is the mean and, $σ$, the standard deviation of swing scores, thus allowing for comparison of predicted kinase activity across multiple timepoints and/or conditions.

KinSwingR addresses the question of “how likely is it is observe the predicted activity of kinase, $K$, by random chance?” by computing $swing_k$ given $N$ permutations of kinase node labels, $K$, to substrates, $S$, of the total network, $M_{ks}$. Thus, the probability of observing $swing_k$ is conditional on this permuted reference distribution, of size, $N$ (Equation 2). This is computed for each tail of the distribution, that is, positive and negative $swing_k$ scores.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.