In martinry/proteasy: Protease Mapping

knitr::opts_chunk$set(echo = TRUE)

Generation of raw data files included in proteasy

mer.tab.gz - MEROPS source data

The file "meropsweb121.tar.gz" was downloaded from ftp://ftp.ebi.ac.uk/pub/databases/merops/current_release/meropsweb121.tar.gz. This page, on the official MEROPS website, gives the following description of the file: 'MEROPS Release 12.1. The large file "meropsweb121.tar.gz" is a compressed version of all the SQL statements required to build a MySQL instance of the current release of the MEROPS database. Some functions such as the Searches and BLAST server are not supported in this format, however.'
The compressed file was unpacked.
A MySQL database "merops_db" was created using the command: `CREATE DATABASE merops_db;`
The "Substrate_search" table was constructed by importing the relevant SQL statements: `mysql merops_db < Substrate_search.sql -u root -p`
The "Substrate_search" table was exported as a text file: `sudo mysqldump merops_db Substrate_search --fields-terminated-by ',' --fields-enclosed-by '"' --fields-escaped-by '\' --no-create-info --tab /usr/local/var/mysql/`
The resulting tab-delimited file was read into R and filtered so that only column numbers 2, 3, 8, 14, 15, 16, 17, and 23 remained. These columns were assigned the names "Cleaved residue", "Protease (MEROPS)", "Substrate name", "Substrate (Uniprot)", "Residue number", "Substrate organism", "Protease name", and "Cleavage type".
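The column selection and renaming described above could be sketched in base R roughly as follows. The exported file is not reproduced here, so `raw` is a toy stand-in with 23 unnamed columns; the object names `raw`, `keep`, `new_names`, and `mer` are assumptions made for illustration.

```r
# Toy stand-in for the exported "Substrate_search" table: 2 rows, 23 columns.
raw <- as.data.frame(
  matrix(paste0("v", seq_len(46)), nrow = 2, ncol = 23),
  stringsAsFactors = FALSE
)

# Column positions kept, and the names assigned to them (from the text above).
keep <- c(2, 3, 8, 14, 15, 16, 17, 23)
new_names <- c("Cleaved residue", "Protease (MEROPS)", "Substrate name",
               "Substrate (Uniprot)", "Residue number", "Substrate organism",
               "Protease name", "Cleavage type")

# Select by position, then rename.
mer <- raw[, keep]
names(mer) <- new_names
```

Selecting by position is fragile if the table layout changes between MEROPS releases, so checking `ncol(raw)` before subsetting may be worthwhile.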
"Cleaved residue" was converted from Abbr to Letter using the following table:

```r
structure(list(
  Name = c("Alanine", "Arginine", "Asparagine", "Aspartic Acid", "Cysteine",
           "Glutamine", "Glutamic Acid", "Glycine", "Histidine", "Isoleucine",
           "Leucine", "Lysine", "Methionine", "Phenylalanine", "Proline",
           "Serine", "Threonine", "Tryptophan", "Tyrosine", "Valine"),
  Abbr = c("Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
           "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"),
  Letter = c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
             "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
), row.names = c(NA, -20L), class = "data.frame")
```
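One way to apply such a lookup table is with `match()`, sketched below on a shortened three-row version of the table. The names `aa`, `cleaved`, and `one_letter` are illustrative assumptions, and the original conversion may have been done differently.

```r
# Shortened lookup table (three of the twenty amino acids, for illustration).
aa <- data.frame(
  Abbr   = c("Ala", "Arg", "Lys"),
  Letter = c("A", "R", "K"),
  stringsAsFactors = FALSE
)

# Toy "Cleaved residue" values in three-letter form.
cleaved <- c("Lys", "Arg", "Ala", "Lys")

# match() returns the row of the lookup table for each value;
# abbreviations missing from the table would come back as NA.
one_letter <- aa$Letter[match(cleaved, aa$Abbr)]
```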
Rows that contained only the value "\N" in the "Substrate (Uniprot)" or "Substrate organism" column were removed.
Finally, the table derived from MEROPS was written as a compressed tab-separated file named mer.tab.gz.
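The last two steps (dropping "\N" rows and writing the compressed file) might look like the sketch below. It uses made-up accession values, writes to a temporary directory rather than the package folder, and assumes that a row is dropped when either column holds "\N"; none of these details are confirmed by the original workflow.

```r
# Toy table with made-up accessions; "\\N" is the R literal for the text \N.
mer <- data.frame(
  `Substrate (Uniprot)` = c("ACC_A", "\\N", "ACC_C"),
  `Substrate organism`  = c("Homo sapiens", "Mus musculus", "\\N"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Keep rows where neither column is the MySQL NULL marker.
keep <- mer$`Substrate (Uniprot)` != "\\N" & mer$`Substrate organism` != "\\N"
mer <- mer[keep, , drop = FALSE]

# Write a gzip-compressed, tab-separated file through a gzfile() connection.
out <- file.path(tempdir(), "mer.tab.gz")
con <- gzfile(out, "w")
write.table(mer, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Read it back to confirm the round trip.
back <- read.delim(gzfile(out), check.names = FALSE, stringsAsFactors = FALSE)
```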

merops_map.tab.gz - Identifier mapping from UniProt Knowledgebase

The file was generated by the following steps:

Navigated to https://www.uniprot.org
Clicked Search.
Clicked Columns.
Selected only "Entry", "Review status", "Organism", and "MEROPS". Clicked Save.
Clicked Download, Download all, Tab-separated, Compressed, Go.
The downloaded file was read into R. Columns were renamed "Protease (Uniprot)", "Protease status", "Protease organism", and "Protease (MEROPS)".
Rows where "Protease (MEROPS)" was empty were excluded.
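A minimal base R sketch of the renaming and empty-row filtering is given below. The input is a toy table with made-up identifiers, and its original column names (`Entry`, `Status`, `Organism`, `MEROPS`) are assumptions; the headers in the real UniProt download may differ.

```r
# Toy stand-in for the UniProt download (identifiers are made up).
merops_map <- data.frame(
  Entry    = c("ACC_A", "ACC_B", "ACC_C"),
  Status   = c("reviewed", "unreviewed", "reviewed"),
  Organism = c("Homo sapiens (Human)", "Mus musculus (Mouse)",
               "Homo sapiens (Human)"),
  MEROPS   = c("ID_1", "", NA),
  stringsAsFactors = FALSE
)

# Rename the four columns as described in the text.
names(merops_map) <- c("Protease (Uniprot)", "Protease status",
                       "Protease organism", "Protease (MEROPS)")

# Treat both empty strings and NA as "empty" and drop those rows.
is_empty <- is.na(merops_map$`Protease (MEROPS)`) |
  merops_map$`Protease (MEROPS)` == ""
merops_map <- merops_map[!is_empty, , drop = FALSE]
```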
"Protease organism" was formatted to match the format of the "Substrate organism" column in mer.tab.gz. This was done with the following R command.

    library(magrittr)
    # Strip everything from the first " (" onwards, e.g. a trailing common name.
    merops_map$`Protease organism` %<>% sub(" \\(.*", "", .)

Only organisms ("Protease organism") that exist in mer$`Substrate organism` were kept.

    # Row filter in data.table syntax (merops_map is assumed to be a data.table).
    merops_map %<>% .[`Protease organism` %in% unique(mer$`Substrate organism`)]
The resulting object was written as a compressed tab-separated file named merops_map.tab.gz.
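To illustrate what the organism clean-up and filtering steps above do, here is a small base R sketch on toy vectors. The names `organism`, `cleaned`, `substrate_organisms`, and `kept` are hypothetical and stand in for the corresponding table columns.

```r
# Toy "Protease organism" values in the "Species (Common name)" form.
organism <- c("Homo sapiens (Human)", "Mus musculus (Mouse)",
              "Bos taurus (Bovine)")

# Remove the first " (" and everything after it.
cleaned <- sub(" \\(.*", "", organism)

# Toy "Substrate organism" values, standing in for the column in mer.
substrate_organisms <- c("Homo sapiens", "Mus musculus", "Homo sapiens")

# Keep only organisms that also occur among the substrate organisms.
kept <- cleaned[cleaned %in% unique(substrate_organisms)]
```

Because the pattern starts matching at the first space followed by an opening parenthesis, any further parenthesised text after it is removed as well.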