README.md
In edpratt1/KINATESTID: A Pipeline for Kinase-Specific Artificial Peptide Design

KINATESTID

Introduction

KINATEST-ID is a pipeline to design artificial peptide substrates which are kinase-specific. This pipeline employs a combination of curated collections of known endogenous substrate sequences and data from phosphoproteomics-informed kinase assays to develop an in silico peptide library. This library is used for systematic identification of position-specific amino acid preferences. These preferences inform the design of artificial candidate sequences optimized for kinase-specific performance which can then be characterized in vitro or in vivo.

This vignette demonstrates how the KINATEST-ID package to load data, identify relevant residues, and generate and score candidate sequences.

Installing the `KINATEST-ID` package

Using the `devtools` Package

To install the package from a local source is using the install_local() function from devtools package:

library(devtools)
install_local("path_to_source/KINATESTID2.tar.gz", dependencies = TRUE)

Where, path_to_source is absolute path of local source file.

To install the package from the local source file:

install.packages(path_to_source, repos = NULL, type="source")

Another possible way is using RStudio:

Go to Tools.
Go to Install Packages.
In the Install From set it as Package Archive File (.zip; .tar.gz).
Then Browse to KINATESTID2.tar.gz and click Install.

Loading the `KINATEST-ID` package

Once the package has been installed, it can be loaded using:

library(KINATESTID)

Data File Nomenclature

The KINATESTID pipeline accepts both PTM peptide reports generated using the KinaMine Galaxy-P workflow (legacy) and the newer Peaks search program (non-legacy).

The phosphoproteomics file name should follow this format: KINASE_CONDITION_REPLICATE_*.csv.

KINASE: The abbreviation for the kinase used in the enzymatic reaction.
CONDITION: Accepts two values, PLUS indicates an enzyme-treated sample and MINUS indicates an untreated (i.e. negative control) sample.
REPLICATE: Should follow the format of R* where * is a wildcard representing the replicate number (numeric). Any additional annotation can be added with an _ following the replicate number and will be ignored for file processing.

Important: Legacy file names must also contain Substrate for substrate files and SBF or FREQ for substrate background frequency files.

The phosphoproteomics data file should be a .csv and must contain columns for each flank position ranging form -7 to +7. Each position column should contain 1) an amino acid single letter code or 2) a blank space or - to indicate an empty value. Other required column headers are:

Non-Legacy

UniProt Identifier: A column containing the proteome identifier associated with each peptide sequence detected.
File Name: The raw file names from batch processing. File names should follow the same nomenclature listed under Data File Name.
AScore: A column of numerical probability scores used for peptide QC. There is an automatic AScore cutoff of 30 for non-legacy files.
Modification Name: A character column listing the detected PTM type. Accepted modifications must be listed in the data file ptm_key.
Amino Acid Letter: A character column with the single letter code of the modified amino acid.

Legacy

Reference: A column containing the proteome identifier associated with each peptide sequence detected. If the column name is different, it can be manually provided in the import_substrates function.

General

Species: A character column identifying the organism the protein is associated with.

Header names Phosphite and Phosphosite are retained but currently not used. All other columns are ignored during data import.

Quick start using the in-built dataset

Our example uses the btk_kalip dataset, which comes as part of the package. This dataset is the result of a Kinase Assay-Linked Phosphoproteomics (KALIP) experiment. It contains a triplicate treatment of K562 cell-derived peptide libraries with Bruton’s tyrosine kinase (BTK). It can be loaded into the workspace using:

path <- system.file("extdata", package = "KINATESTID")
substrates <- import_substrates(path, ptm_type = "Phosphorylation (STY)-Y",
                                legacy =  FALSE, freq = TRUE)

The main import function import_substrates is a wrapper function with the following essential parameters:

path: the directory containing the analyzed phosphoproteomics data.
ptm_type: the post-translational modification to be analyzed and its targeted the amino acid. Currently, only one ptm type can be analyzed at a time. A list of ptm options can be accessed by typing ptm_key.
legacy: a logical parameter where TRUE is for files generated using the old Galaxy-P KinaMine workflow. Is set to FALSE by default.
freq: a logical parameter where TRUE indicates substrate frequency across technical replicates should be recorded. Is set to TRUE by default.
ref_col: Only applicable to legacy files. A string with the name of the column containing the uniprot identifier information. If none is specified, it will search for a Reference column automatically.

Part of the processed btk_kalip dataset is shown below:

substrate_barcode file_name Row uniprot_id Protein Amino Acid Letter Modification Number Modification Name AScore Position Peptide Orignal Peptide String -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 class replicate control sample ——-YIDQEEEL BTK_PLUS_R1_protein-peptides.csv 324 P07900 HS90A_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 284 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 1 0 3 ——-YIDQEEEL BTK_PLUS_R1_protein-peptides.csv 452 P08238 HS90B_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 276 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 1 0 3 ——-YIDQEEEL BTK_PLUS_R1_run2_protein-peptides.csv 463 P08238 HS90B_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 276 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 1 0 3 ——-YIDQEEEL BTK_PLUS_R1_run2_protein-peptides.csv 661 P07900 HS90A_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 284 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 1 0 3 ——-YIDQEEEL BTK_PLUS_R2_protein-peptides.csv 141 P08238 HS90B_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 276 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 2 0 3 ——-YIDQEEEL BTK_PLUS_R2_protein-peptides.csv 286 P07900 HS90A_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 284 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 2 0 3 ——-YIDQEEEL BTK_PLUS_R3_protein-peptides.csv 219 P08238 HS90B_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 276 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 3 0 3 ——-YIDQEEEL BTK_PLUS_R3_protein-peptides.csv 352 P07900 HS90A_HUMAN Y (+79.97) Phosphorylation (STY) 1000.00 284 YIDQEELNK Y(+79.97)IDQEELNK - - - - - - - Y I D Q E E E L sample 3 0 3 ——DYEEIGPPS BTK_PLUS_R2_protein-peptides.csv 4496 P61158 ARP3_HUMAN Y (+79.97) Phosphorylation (STY) 56.47 400 DYEEIGPSICR DY(+79.97)EEIGPSIC(+57.02)R - - - - - - D Y E E I G P P S sample 2 0 1 ——DYFEQYGGK BTK_PLUS_R2_protein-peptides.csv 1113 F8W6I7 F8W6I7_HUMAN Y (+79.97) Phosphorylation (STY) 42.89 124 DYFEQYGK DY(+79.97)FEQYGK - - - - - - D Y F E Q Y G G K sample 2 0 1

The function bkgrnd_corr will automatically select peptide sequences which are present in maximum number of treated replicates and minimum number of control samples.

uniq_subs <- bkgrnd_corr(substrates)

Generating the Postion-Specific Scoring Matrix (PSSM)

A library of theoretical negative peptide sequences can be generated by extracting full protein sequences from UniprotKb and performing an in silico tryptic digest. There is a pre-loaded digest named uniprot_tryptic.

output_dir <- getwd()
uniprot <- import_uniprot(uniq_subs, uniprot_tryptic, path = output_dir)

Instead of locating the output_dir using getwd(), a user should provide the path, e.g. /path/to/analysis/files. If there are more than 200 candidate sequences, the import_uniprot function will loop to extract, PTM-center, and export the sequences as .csv files before merging and importing the data back into R. This is a workaround for the high memory usage of the current PTM-centering function.

Fisher exact test is used to calculate the odds ratio and significance of each residue across the flanking sequence. This step is performed by the function substrate_fisher_test shown below:

pssm <- substrate_fisher_test(substrates_dt = uniq_subs, uniprot_dt = uniprot,
                              type = "aa")

The output is a list containing two data tables:

Exact Test P-values
Fisher Exact Odds

For the -7 to +7 flanking positions of the designated PTM. We can examine both elements to see if they are consistent in terms of flank order (column) and amino acid (row).

head(pssm[[1]])
#>          -7         -6         -5        -4          -3        -2          -1 0
#> A 0.1133353 0.19067508 0.03525808 0.3503830 0.567168574 0.1193333 0.802610625 1
#> C 1.0000000 0.08136039 0.03201637 0.5575060 1.000000000 1.0000000 0.114075137 1
#> D 0.0884393 1.00000000 0.12416658 0.1333070 0.058179850 0.6127682 0.123481932 1
#> E 1.0000000 0.75219969 0.42872427 1.0000000 0.003951593 0.2872787 0.321567706 1
#> F 1.0000000 1.00000000 1.00000000 1.0000000 0.738902243 0.6099774 0.414531982 1
#> G 1.0000000 0.71780589 1.00000000 0.1084457 0.310722003 0.6248805 0.002175226 1
#>           1         2          3         4          5          6         7
#> A 0.0203472 0.8001227 0.06960006 0.3591426 0.62425552 0.16218362 0.6856293
#> C 0.6211382 0.5369008 0.09627782 1.0000000 1.00000000 1.00000000 1.0000000
#> D 0.8020600 0.4250087 1.00000000 1.0000000 0.05160001 0.13259150 0.2366401
#> E 0.1728406 0.3739120 0.11975221 0.2909379 0.07398821 0.03663005 0.7230380
#> F 0.2025829 0.3967163 0.52852602 0.6971342 0.62818755 1.00000000 0.4439286
#> G 0.5677780 0.3608784 1.00000000 0.7928085 1.00000000 0.76257455 0.2492259

head(pssm[[2]])
#>          -7        -6       -5        -4        -3        -2        -1    0
#> A 2.8828155 2.1572349 2.549460 1.5225831 1.2403152 1.8776049 1.1103800 0.05
#> C 0.0500000 4.9003053 5.087841 1.2824186 0.0500000 0.9968278 2.7832186 0.05
#> D 3.2510456 0.7201938 2.125238 2.0293844 2.1342050 1.1892363 0.2086158 0.05
#> E 0.6949790 1.0931315 1.500962 0.9847433 2.9609795 1.4829928 0.3969203 0.05
#> F 0.0500000 0.0500000 0.050000 0.8705247 1.1913971 1.1016606 1.3991439 0.05
#> G 0.6327807 0.4280962 1.021093 1.8267348 0.4156149 1.2813852 0.0500000 0.05
#>           1         2         3         4         5         6         7
#> A 2.2138842 1.1356438 0.0500000 0.3157393 0.0500000 0.0500000 1.1945659
#> C 0.0500000 1.3625518 3.0297666 1.0327090 0.0500000 0.0500000 0.0500000
#> D 1.1181638 1.3625770 0.7080411 0.9878358 2.4884854 2.0427021 1.9153451
#> E 1.7246087 1.4295457 0.2190738 1.5662301 2.3658769 2.5288035 0.3985889
#> F 1.8824513 0.0500000 1.4393602 1.1358747 0.0500000 0.0500000 1.8303415
#> G 0.7040812 0.5152855 0.9154457 0.7101717 0.8433523 0.6190274 0.0500000

If an amino acid is observed at a specific flank position in only the background frequency tables, but not the sample frequency tables, the fisher odds will be identically zero. This will also occur if the dataset spans a smaller flank around the central ptm (e.g. -4 to +4). This introduces a number of downstream analysis issues, especially for sequence scoring. There are a number of proposed workarounds, ranging from Bayesian predictors to introduction of small arbitrary variable such as a pseudocount. The current implementation uses the simplest approach of adding a negligible pseudo-odds to structural zeros in the fisher odds table. If entire columns are empty (e.g. -4 to +4 flanking sequence), they will not be ignored during peptide scoring.

A similar analysis can be run based on amino acid properties instead of individual residues using the type “aa_property”.

pssm <- substrate_fisher_test(substrates_dt = uniq_subs, uniprot_dt = uniprot,
                              type = "aa_property", 
                              property = "property_chemical")

Available amino acid property types are listed in the data file aa_classificiation.

Setting Thresholds for Kinase-Specific Peptide Scoring

The KINATESTID package contains the data file screener_raw which contains kinase-specific preferred substrates curated from the literature and from phosphoproteomics experiments. The screener_uniprot file is the corresponding protein-matched in silico control peptide library.

The function multi_screener takes as inputs 1) the kinase-specific preferred substrates file and 2) the protein-matched in silico control peptide library. It also accepts the following parameters:

path: the file path where output data should be saved.
method: Peptide scoring algorithim. Accepts product of odds ratios (“prod”), weighted product of odds ratio (“w_prod”), and the sum of log scores (“log2_sum”).
pval_corr: a logical parameter where TRUE will set all fisher odds with non-significant p-values (> 0.05) to 1. Set to FALSE by default.
type: Accepts either aa or aa_property. aa scores peptides based on fisher odds for each amino acid and aa_property scores peptides based on general aa properties.
norm_method: Accepts either none or bkgrndto indicate whether raw scores or background-corrected ones should be used. Background-correction subtracts the mean and divides by the standard deviation of the negative control substrate scores.
property: Define amino acid property type to be analyzed. Set to NULL by default.

screener <- multi_screener(screener_raw, screener_uniprot, 
                           path = output_dir,
                           method = "prod",
                           pval_corr = FALSE,
                           type = "aa",
                           norm_method = "none",
                           constrain = 0.90)

For each kinase, an optimal threshold will be calculated to separate the peptide scores of modified vs control sequences. This is calculated using the package cutpointr. A bootstrap model is used to select an optimal cutpoint to maximize sensitivity with a minimum specificity of 90%.

The performance of the identified threshold will be printed to the command line, as shown here:

# > Subgroup: SYK 
# > -------------------------------------------------------------------------------------------------------------------------------------- 
# >   AUC    n n_pos n_neg
# > 0.9717 1100    62  1038
# > 
# > optimal_cutpoint sens_constrain    acc sensitivity specificity tp fn fp  tn
# > 1.363         0.9516 0.9136      0.9516      0.9114 59  3 92 946
# > 
# > Predictor summary: 
# >   Data         Min.           5%      1st Qu.       Median        Mean      3rd Qu.          95%       Max.          SD NAs
# > Overall 1.610584e-09 7.301556e-06 6.458562e-04 9.655520e-03  558.351221 1.715666e-01    66.379461 309355.741 10083.35581   0
# > sample 1.083963e-02 1.619602e+00 3.780979e+01 3.180207e+02 9792.742795 2.727807e+03 26704.369987 309355.741 41711.57507   0
# > bkgrnd 1.610584e-09 6.547317e-06 5.320628e-04 6.864003e-03    6.778699 9.040747e-02     4.834312   1534.229    74.98865   0
# > 
# > Bootstrap summary: 
# >   Variable Min.   5% 1st Qu. Median Mean 3rd Qu.  95% Max.   SD NAs
# > optimal_cutpoint 0.92 1.05    1.26   1.62 1.79    2.10 3.10 3.64 0.68   0
# > AUC_b 0.94 0.95    0.96   0.97 0.97    0.98 0.98 0.99 0.01   0
# > AUC_oob 0.94 0.95    0.96   0.97 0.97    0.98 0.99 0.99 0.01   0
# > sens_constrain_b 0.85 0.89    0.92   0.95 0.94    0.97 0.98 1.00 0.03   0
# > sens_constrain_oob 0.76 0.82    0.90   0.93 0.93    0.96 1.00 1.00 0.06   0
# > acc_b 0.90 0.91    0.91   0.92 0.92    0.93 0.94 0.95 0.01   0
# > acc_oob 0.88 0.89    0.91   0.92 0.92    0.93 0.94 0.94 0.01   0
# > sensitivity_b 0.85 0.89    0.92   0.95 0.94    0.97 0.98 1.00 0.03   0
# > sensitivity_oob 0.76 0.82    0.90   0.93 0.93    0.96 1.00 1.00 0.06   0
# > specificity_b 0.90 0.91    0.91   0.92 0.92    0.93 0.94 0.94 0.01   0
# > specificity_oob 0.88 0.89    0.91   0.92 0.92    0.93 0.94 0.95 0.02   0
# > cohens_kappa_b 0.46 0.49    0.51   0.53 0.53    0.54 0.62 0.64 0.04   0
# > cohens_kappa_oob 0.39 0.43    0.49   0.51 0.52    0.55 0.59 0.65 0.05   0

The function multi_screener will automatically output a graph showing the distribution of peptide scores for each kinase, with the selected threshold marked with a vertical line.

Generating Candidate Artificial Peptide Sequences

Identification of kinase-preferred amino acids, generation of artificial peptide substrates, and in silico scoring for potential efficiency and specificity are wrapped under the generate_substrates function.

candidates <- generate_substrates(pssm, uniprot, screener, 
                                  target_kinase = "BTK",
                                  screening_kinase = "ALL",
                                  n_hits = 10)

target_kinase: Input is the abbreviation for the kinase the substrate is to be optimized for.
screening_kinase: Input is a vector of kinases to optimize against. Setting screening_kinase = "ALL" will screen against every kinase in the existing screener file.

The output is a list containing four elements:

A list of amino acids used for permutation when creating candidate substrates.
- sig_pos: The number of residues which were statistically significant (p \<= 0.05) at that position
In silico screen for activity against panel of kinases indicated by the user.
- active: If the substrate score is > than the threshold, it will be marked as active for that enzyme.
- perf: Predicted performance is split into “low”, “medium”, and “high” based on the peptide scoring falling above the 10th, 50th or 90th quantile, respectively.
- n_active: The total number of kinases marked as “active” based on the criteria above.
Table of candidate peptide substrates in order of raw score.
The top n_hits substrates scored to be active for `target_kinase and the lowest substrate scores for all other kinases in the screener panel.

An example output is shown below:

#> $candidate_aa
#>    amino_acid flank_pos fisher_odds fisher_pval barcode sig_pos
#> 1:          A        -7    2.882816  0.11333533    A:-7       0
#> 2:          A        -6    2.157235  0.19067508    A:-6       0
#> 3:          A        -5    2.549460  0.03525808    A:-5       2
#> 4:          A        -2    1.877605  0.11933330    A:-2       0
#> 5:          A         1    2.213884  0.02034720     A:1       2
#> 6:          C        -5    5.087841  0.03201637    C:-5       2
#> 
#> [[2]]
#>    kinase active     perf        score  threshold substrate_barcode n_active
#> 1:    ABL   TRUE   medium    13.350626   2.719099                 1       10
#> 2:    ARG   TRUE   medium     6.430851   2.903400                 1       10
#> 3:    AXL   TRUE     high    13.801509   2.428566                 1       10
#> 4:    BTK   TRUE     high 25533.582387   3.111091                 1       10
#> 5:    CSK  FALSE inactive     1.365685 934.714727                 1       10
#> 6:    FES   TRUE   medium  3316.614922  15.360549                 1       10
#> 
#> [[3]]
#>    substrate_barcode raw_score -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 n_active
#> 1:                 1 118551.71  A  A  A  N  D  A  I Y A D V K D E D       10
#> 2:                 2  44676.03  I  A  A  N  D  A  I Y A D V K D E D        9
#> 3:                 3 108698.54  P  A  A  N  D  A  I Y A D V K D E D        8
#> 4:                 4 236588.21  A  A  C  N  D  A  I Y A D V K D E D        9
#> 5:                 5  89157.89  I  A  C  N  D  A  I Y A D V K D E D        5
#> 6:                 6 216924.68  P  A  C  N  D  A  I Y A D V K D E D        9
#> 
#> $top_hits
#>    substrate_barcode raw_score -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 n_active
#> 1:             12308  26321.37  I  A  C  R  Q  A  L Y D E V K D E L        1
#> 2:             12605  37179.56  I  A  C  R  E  A  L Y D L V K D E L        1
#> 3:             12632  38214.22  I  A  C  R  Q  A  L Y D L V K D E L        1
#> 4:             12929  22641.74  I  A  C  R  E  A  L Y D T V K D E L        1
#> 5:             12938  46491.21  I  A  C  N  Q  A  L Y D T V K D E L        1
#> 6:             16196  25024.51  I  A  C  R  Q  A  L Y D E V K E E L        1

Data Visualization

`peptide_intersect`: Peptide Commonality Across Replicates

To better understand the distribution of peptides detected, there are some simple analyses pre-built into KINATEST-ID. The first is looking at overlap in detected peptides across technical replicates.

venn <- peptide_intersect(substrates)[[1]]

The peptide_intersect function output is a list of two elements:

A ggplot object containing the venn diagram for each technical replicate. The number of sample sets is detected automatically (up to three replicates are currently supported).
A nested list containing the substrate_barcode information for each overlap set.

`substrates_heatmap`: Heatmap of PSSM Data

Another visualization option is to view the position-specific frequency of each amino acid using a heatmap.

heatmap <- substrates_heatmap(uniq_subs, scramble = FALSE, seed = 123)

The substrates_heatmap function accepts the following arguments:

scramble: A logical value indicating whether the rows should be randomly shuffled. Set to FALSE by default.
seed: An arbitrary value to generate reproducible random numbers. Only needed in cases where scramble is set to TRUE.

`pssm_volcano`: Create Volcano Plot

A volcano plot can be generated for a more general view of differential residue “preferences”.

volcano <- pssm_volcano(pssm, odds_cutoff = 0.5, pval_cutoff = 0.05)
volcano

The pssm_volcano function takes the following parameters:

odds_cutoff: The log 2 odds ratio threshold for being considered significant. If not specified it is set to 0.5 automatically.
pval_cutoff: The threshold p-value to be considered significant. If not specified it is set to 0.05 automatically.

These analyses and other quality control plots are contained in the wrapper function substrates_qc.

qc <- substrates_qc(substrates, uniprot, output_dir)

Other Functions

The generated PSSM can be readily used to generate motif or affinity logos for presentation. The following is an example using the package motifStack, but many similar R packages exist.

library(motifStack)
motif <- new("psam", mat= pssm[[2]], name = "Affinity Logo", 
             color = colorset(alphabet = "AA", colorScheme = "chemistry"))
plot(motif)

`multi_candidate_screener`: Scoring Existing Peptide Sequences

Existing sequences can be re-scored using the KINATEST-ID algorithim as long as the input .csv file has a similar -7 to +7 input structure:

old_peptides <- fread("path_to_source/data_file.csv")

# 
#   substrate_barcode kinase   -7   -6   -5 -4 -3 -2 -1 0 1 2 3 4    5    6    7
#1:                U6    ALL <NA> <NA> <NA>  D  E  P  I Y D T V E <NA> <NA> <NA>
#2:                U3    ALL <NA> <NA> <NA>  D  E  D  I Y G T P E <NA> <NA> <NA>
#3:                U7    ALL <NA> <NA> <NA>  E  D  D  V Y D S V P <NA> <NA> <NA>
#4:                U2    ALL <NA> <NA> <NA>  E  D  P  I Y V T L E <NA> <NA> <NA>
#5:                U1    ALL <NA> <NA> <NA>  E  D  D  E Y Y V T P    E <NA> <NA>
#6:                U4    ALL <NA> <NA> <NA>  D  E  P  I Y D T P E <NA> <NA> <NA>
#7:                U5    ALL <NA> <NA> <NA>  D  E  A  I Y A T V A <NA> <NA> <NA>
#8:                U8    ALL <NA> <NA> <NA>  E  D  D  E Y I S P E <NA> <NA> <NA>
#9:                U9    ALL <NA> <NA> <NA>  E  D  D  E Y A T P E <NA> <NA> <NA>

Mutate the input data to long format and create a new barcode column which combines flank position and amino acid information.

old_peptides <- data.table(reshape2::melt(old_peptides,
                                          id.vars = c("substrate_barcode", "kinase"), 
                                          value.name = "amino_acid", 
                                          variable.name = c("flank_pos")))
old_peptides[, barcode:= paste0(amino_acid, ":", flank_pos)]
old_peptides <- na.omit(old_peptides)

head(old_peptides)
#    substrate_barcode kinase flank_pos amino_acid barcode
# 1:                U6    ALL        -4          D    D:-4
# 2:                U3    ALL        -4          D    D:-4
# 3:                U7    ALL        -4          E    E:-4
# 4:                U2    ALL        -4          E    E:-4
# 5:                U1    ALL        -4          E    E:-4
# 6:                U4    ALL        -4          D    D:-4

scores <- multi_candidate_screener(screener, old_peptides, "ALL", FALSE)
head(scores)
#        kinase active     perf      score   cutpoint substrate_barcode n_active
#     1:    ABL   TRUE   medium   4.686755   2.696951                U6       11
#     2:    ARG   TRUE   medium   7.352788   2.928542                U6       11
#     3:    AXL   TRUE     high  17.898059   2.401886                U6       11
#     4:    BTK   TRUE     high 162.656945   3.214976                U6       11
#     5:    CSK  FALSE inactive  96.024821 599.754951                U6       11

Raw peptide scores can be translated to an absolute scale based on the range of peptide scores found in the screener file provided. norm_scores will scale raw peptide scores based on the following equation:

$$ score_{scaled} = \frac{log(score_{raw}) - log(score_{min})}{(log(score_{max}) - log(score_{min})}*100 $$ Where scorescaled is the converted score, scoreraw is the raw peptide score, scoremin is the minimum peptide score observed for a kinase substrate and scoremax is the maximum.

normscores <- norm_scores(scores, screener)
lapply(normscores, head)
# [[1]]
#   substrate_barcode ABL ARG AXL BTK CSK FES FLT3 FYN HCK JAK2 LCK LYN PYK2 SRC SYK TYRO3 YES
# 1                U1  75  75  83  71  51  59   96  53  54   86  46  68   68  79  70    82  83
# 2                U2  77  79  86  84  75  76   90  63  62   74  38  79   81  79  75    82  82
# 3                U3  89  91  81  86  73  65   84  69  60   58  37  81   83  84  79    78  91
# 4                U4  82  85  86  88  76  74   76  70  60   54  45  79   84  84  79    78  90
# 5                U5  88  90  87  91  62  84   97  72  64   42  44  84   72  67  70    79  74
# 6                U6  78  82  88  91  72  86   79  74  60   54  36  80   82  83  78    76  90
# 
# [[2]]
# kinase norm_thresholds
# 1:    ABL              74
# 2:    ARG              77
# 3:    AXL              73
# 4:    BTK              74
# 5:    CSK              76
# 6:    FES              69

The norm_scores function output is a list of two elements:

Converted scorescaled for each substrate provided across all kinases listed in the screener file.
The corresponding scaled thresholds for each kinase cutpointscaled.

Session Info

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] magrittr_2.0.2         kableExtra_1.3.4       KINATESTID_2.1.1      
#>  [4] DT_0.20                shiny_1.7.1            stringr_1.4.0         
#>  [7] VennDiagram_1.7.1      futile.logger_1.4.3    RColorBrewer_1.1-2    
#> [10] reshape2_1.4.4         patchwork_1.1.1        pheatmap_1.0.12       
#> [13] ggplotify_0.1.0        EnhancedVolcano_1.12.0 ggrepel_0.9.1         
#> [16] ggplot2_3.3.5          data.table_1.14.2      cutpointr_1.1.1       
#> [19] cowplot_1.1.1          devtools_2.4.3         usethis_2.1.5         
#> 
#> loaded via a namespace (and not attached):
#>  [1] fs_1.5.2             webshot_0.5.2        httr_1.4.2          
#>  [4] ash_1.0-15           rprojroot_2.0.2      tools_4.1.2         
#>  [7] utf8_1.2.2           R6_2.5.1             KernSmooth_2.23-20  
#> [10] vipor_0.4.5          colorspace_2.0-2     withr_2.4.3         
#> [13] tidyselect_1.1.1     prettyunits_1.1.1    ggrastr_1.0.1       
#> [16] processx_3.5.2       ggalt_0.4.0          compiler_4.1.2      
#> [19] extrafontdb_1.0      rvest_1.0.2          cli_3.1.1           
#> [22] formatR_1.11         xml2_1.3.3           desc_1.4.0          
#> [25] scales_1.1.1         proj4_1.0-11         callr_3.7.0         
#> [28] systemfonts_1.0.3    digest_0.6.29        yulab.utils_0.0.4   
#> [31] svglite_2.0.0        rmarkdown_2.11       pkgconfig_2.0.3     
#> [34] htmltools_0.5.2      extrafont_0.17       sessioninfo_1.2.2   
#> [37] highr_0.9            fastmap_1.1.0        maps_3.4.0          
#> [40] htmlwidgets_1.5.4    rlang_1.0.0          rstudioapi_0.13     
#> [43] gridGraphics_0.5-1   generics_0.1.1       dplyr_1.0.7         
#> [46] Rcpp_1.0.8           ggbeeswarm_0.6.0     munsell_0.5.0       
#> [49] fansi_1.0.2          lifecycle_1.0.1      stringi_1.7.6       
#> [52] yaml_2.2.2           MASS_7.3-54          brio_1.1.3          
#> [55] pkgbuild_1.3.1       plyr_1.8.6           promises_1.2.0.1    
#> [58] crayon_1.4.2         knitr_1.37           ps_1.6.0            
#> [61] pillar_1.6.5         codetools_0.2-18     pkgload_1.2.4       
#> [64] futile.options_1.0.1 glue_1.6.1           evaluate_0.14       
#> [67] lambda.r_1.2.4       remotes_2.4.2        vctrs_0.3.8         
#> [70] httpuv_1.6.5         foreach_1.5.1        testthat_3.1.2      
#> [73] Rttf2pt1_1.3.9       gtable_0.3.0         purrr_0.3.4         
#> [76] cachem_1.0.6         xfun_0.29            mime_0.12           
#> [79] xtable_1.8-4         later_1.3.0          viridisLite_0.4.0   
#> [82] tibble_3.1.6         iterators_1.0.13     beeswarm_0.4.0      
#> [85] memoise_2.0.1        ellipsis_0.3.2

edpratt1/KINATESTID documentation built on Feb. 5, 2022, 1:21 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

edpratt1/KINATESTID
A Pipeline for Kinase-Specific Artificial Peptide Design

README.md
In edpratt1/KINATESTID: A Pipeline for Kinase-Specific Artificial Peptide Design

KINATESTID

Introduction

Installing the `KINATEST-ID` package

Using the `devtools` Package

From Local Zip File

From R Studio

Loading the `KINATEST-ID` package

Data File Nomenclature

Data File Name

Data File Structure

Quick start using the in-built dataset

Importing Data Files

Filtering Peptide Substrates

Generating the Postion-Specific Scoring Matrix (PSSM)

Generating an In Silico Control Peptide Library

Fisher Odds Based Position-Specific Scoring Matrix (PSSM)

Handling Structural Zeros

Analyzing Amino Acid Properties

Setting Thresholds for Kinase-Specific Peptide Scoring

Generating Candidate Artificial Peptide Sequences

Data Visualization

`peptide_intersect`: Peptide Commonality Across Replicates

`substrates_heatmap`: Heatmap of PSSM Data

`pssm_volcano`: Create Volcano Plot

Other Functions

Motif Generation

`multi_candidate_screener`: Scoring Existing Peptide Sequences

Scaling Raw Peptide Scores

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

edpratt1/KINATESTID A Pipeline for Kinase-Specific Artificial Peptide Design

README.md In edpratt1/KINATESTID: A Pipeline for Kinase-Specific Artificial Peptide Design

KINATESTID

Introduction

Installing the KINATEST-ID package

Using the devtools Package

From Local Zip File

From R Studio

Loading the KINATEST-ID package

Data File Nomenclature

Data File Name

Data File Structure

Quick start using the in-built dataset

Importing Data Files

Filtering Peptide Substrates

Generating the Postion-Specific Scoring Matrix (PSSM)

Generating an In Silico Control Peptide Library

Fisher Odds Based Position-Specific Scoring Matrix (PSSM)

Handling Structural Zeros

Analyzing Amino Acid Properties

Setting Thresholds for Kinase-Specific Peptide Scoring

Generating Candidate Artificial Peptide Sequences

Data Visualization

peptide_intersect: Peptide Commonality Across Replicates

substrates_heatmap: Heatmap of PSSM Data

pssm_volcano: Create Volcano Plot

Other Functions

Motif Generation

multi_candidate_screener: Scoring Existing Peptide Sequences

Scaling Raw Peptide Scores

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

edpratt1/KINATESTID
A Pipeline for Kinase-Specific Artificial Peptide Design

README.md
In edpratt1/KINATESTID: A Pipeline for Kinase-Specific Artificial Peptide Design

Installing the `KINATEST-ID` package

Using the `devtools` Package

Loading the `KINATEST-ID` package

`peptide_intersect`: Peptide Commonality Across Replicates

`substrates_heatmap`: Heatmap of PSSM Data

`pssm_volcano`: Create Volcano Plot

`multi_candidate_screener`: Scoring Existing Peptide Sequences