knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# load BCB420-2019.ESA itself for knitr:
pkgName <- trimws(gsub("^Package:", "", readLines("../DESCRIPTION")[1]))
library(pkgName, character.only = TRUE)

 

If any of this information is ambiguous, inaccurate, outdated, or incomplete, please check the [most recent version](https://github.com/hyginn/BCB420.2019.ESA) of the package on GitHub and if the problem has not already been addressed, please [file an issue](https://github.com/hyginn/BCB420.2019.ESA/issues)!

 

About this vignette:

This vignette describes the workflow that was used to prepare the STRINGactions dataset for the package. Source data is action types for protein links from STRING.

STRINGactions Data

STRING is a collection of curated protein-protein action data. STRING protein action data is licensed under the CC license. This document describes work with STRING v11.0 protein actions data for homo sapiens (2018-11-22).

 

Data semantics

STRING interaction data is available on the STRING consortium website.

STRING data comes in only one format, and genes are identified by Ensemble protein IDs (ENSP). We must download this data and process it through the script below to map the ENSPs to HGNCs present a consistent identifier for our package tools to work with. This dataset is further curated during the mapping process by keeping only those rows where combined_score > 800. This allows us to ensure the edges used in our tools' analyses are high-confidence.

We are interested in all the columns contained in the file 9606.protein.actions.v11.0.txt.gz:

  1. Protein 1: The ENSP identifier of the protein action pair
  2. Protein 2: The ENSP identifier of the protein action pair
  3. Mode: the name of the interaction, controlled vocabulary consisting of the set: {"binding", "catalysis", "reaction", "activation", "inhibition", "ptmod", "expression"}
  4. Is_directional: boolean column, answering "does one protein act on the other in this action pair?"
  5. a_is_acting: boolean column, answering: "is Protein 1 acting on Protein 2"? If false and if it is a directional action, the opposite is true of the interaction
  6. Combined_score: the total STRING score computed for the protein action pair (which is in itself a STRING edge).

 

Data download and processing

  1. Navigate to the download directory of the STRING database.
  2. Search for "human" or "homo sapiens" in the "Choose an organism" search bar
  3. Download the following data file:
  4. 9606.protein.actions.v11.0.txt.gz (14.4 Mb);
  5. Uncompress the file and place it into a sister directory of your working directory which is called data. (It should be reachable with file.path("..", "data")). Warning: ../data/9606.protein.actions.v11.0.txt is 211.3 Mb!

 

Preparations: packages, functions, files

To begin processing, we need to make sure the required packages are installed:

readr provides functions to highly suitable for large datasets. These are much faster than the built-in read.X() functions. However, readr functions return "tibbles", not data frames. (Here's the difference.)

if (! requireNamespace("readr")) {
  install.packages("readr")
}

BCB420.2019.STRING provides the data required to map ENSP IDs to HGNC symbols (ensp2sym.RData). You by installing Dr. Steipe's BCB420.2019.STRING package

if (! requireNamespace("devtools")) {
  install.packages("devtools")
  devtools::install_github("hyginn/BCB420.2019.STRING")
}

 

Selecting protein actions

##### Map the protein action dataset mappings ####
tmp <- readr::read_tsv(file.path("./data", "9606.protein.actions.v11.0.txt"),
                         skip = 1,
                         col_names = c("protein1", "protein2", "mode", "action",
                                        "is_directional", "a_is_acting", "combined_score")) # 11,759,454 rows
# Keep "high confidence" interactions, and
# remove "action" col since that information is duplicated in "mode" for our purposes
tmp <- tmp[,c("protein1", "protein2", "mode",
              "is_directional", "a_is_acting", "combined_score")]
tmp <- tmp[tmp$combined_score >= 800, ]

# remove "9606." prefix
tmp$protein1 <- gsub("^9606\\.", "", tmp$protein1)
tmp$protein2 <- gsub("^9606\\.", "", tmp$protein2)

# Map ENSP to HGNC symbols: use Dr. Steipe's mapping tool:
load(file = file.path(".", "data", "ensp2sym.RData"))
tmp$protein1 <- ensp2sym[tmp$protein1]
tmp$protein2 <- ensp2sym[tmp$protein2]

# Validate initial mapping
any(grepl("ENSP", tmp$protein1))  # Nope
any(grepl("ENSP", tmp$protein2))  # None left here either

# Clean duplicate edges (from Dr. Steipe)
sPaste <- function(x, collapse = ":") {
  return(paste(sort(x), collapse = collapse))
}
tmp$key <- apply(tmp[ , c("protein1", "protein2", "mode")], 1, sPaste) # takes a min
length(tmp$key) # 2031426
length(unique(tmp$key)) # 548932
tmp <- tmp[( ! duplicated(tmp$key)),
           c("protein1", "protein2", "mode",
             "is_directional", "a_is_acting", "combined_score") ]

# Remove NA nodes
sum(is.na(tmp$protein1)) # NUM
sum(is.na(tmp$protein2)) # NUM
STRINGactions <- tmp[( ! is.na(tmp$protein1)) & ( ! is.na(tmp$protein2)), ] # 545423

# Save the file
saveRDS(STRINGactions, file = file.path(".", "data", "STRINGactions.RDS"))  # 1.6 Mb

# The dataset was uploaded to the assets server and is available with:
STRINGactions <- fetchData("STRINGactions")

 

References

 

Steipe, Boris (2019). BCB420.2019.STRING (STRING data annotatation of human genes). R package Github repository

Szklarczyk, D., Gable, A.L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N.T., Morris, J.H., Bork, P., Jensen, L.J., & Mering, C.V. (2018). STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research.

 

Acknowledgements

 

Session Info

This release of the BCB420.2019.ESA package was produced in the following context of supporting packages:

sessionInfo()

References



hyginn/BCB420.2019.ESA documentation built on May 29, 2019, 1:23 p.m.