knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# load BCB420-2019.ESA itself for knitr:
pkgName <- trimws(gsub("^Package:", "", readLines("../DESCRIPTION")[1]))
library(pkgName, character.only = TRUE)
This vignette describes the workflow that was used to prepare the STRINGactions dataset for the package. The source data are the action types annotated for protein links in STRING.
STRING is a collection of curated protein-protein action data. STRING protein action data is licensed under a Creative Commons (CC BY 4.0) license. This document describes work with STRING v11.0 protein actions data for Homo sapiens (2018-11-22).
STRING interaction data is available on the STRING consortium website.
STRING data comes in only one format, and genes are identified by Ensembl protein IDs (ENSP). We must download this data and process it through the script below to map the ENSPs to HGNC symbols, which gives our package tools a consistent identifier to work with. This dataset is further curated during the mapping process by keeping only those rows where combined_score >= 800. This ensures that the edges used in our tools' analyses are high-confidence.
We are interested in all the columns contained in the file 9606.protein.actions.v11.0.txt.gz (14.4 Mb). Download this file, uncompress it, and place it into a directory called data. (It should be reachable with file.path("..", "data").) Warning: ../data/9606.protein.actions.v11.0.txt is 211.3 Mb!
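The uncompress step can be done from within R. The following is a minimal sketch using only base R connections; it builds a tiny gzip file in a temporary location as a stand-in for the real download, so the paths and file contents here are illustrative assumptions, not the actual STRING file.

```r
# Toy stand-in for the downloaded archive. For the real data the input
# would be file.path("..", "data", "9606.protein.actions.v11.0.txt.gz").
gzPath  <- tempfile(fileext = ".txt.gz")
txtPath <- sub("\\.gz$", "", gzPath)

# Write a small gzip-compressed file so the sketch is self-contained.
# (Column names and IDs are made up for illustration.)
con <- gzfile(gzPath, open = "wt")
writeLines(c("protein1\tprotein2\tmode",
             "9606.ENSP_A\t9606.ENSP_B\tbinding"), con)
close(con)

# Uncompress: read through a gzfile() connection, write plain text.
con <- gzfile(gzPath, open = "rt")
writeLines(readLines(con), txtPath)
close(con)

file.exists(txtPath)
readLines(txtPath, n = 1)
```

This reads the whole file into memory before writing it out, which should be workable for a file of a few hundred Mb but is not the only option: readr's reading functions can also open .gz files directly, in which case the uncompress step may be skipped altogether.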
To begin processing, we need to make sure the required packages are installed:
readr provides functions to read data that are highly suitable for large datasets. These are much faster than the built-in read.X() functions. However, readr functions return "tibbles", not data frames. (Here's the difference.)
if (! requireNamespace("readr")) {
  install.packages("readr")
}
BCB420.2019.STRING provides the data required to map ENSP IDs to HGNC symbols (ensp2sym.RData). You can obtain it by installing Dr. Steipe's BCB420.2019.STRING package:
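# Install devtools only if it is missing ...
if (! requireNamespace("devtools")) {
  install.packages("devtools")
}
# ... and install the STRING package regardless of whether devtools was
# already present.
if (! requireNamespace("BCB420.2019.STRING")) {
  devtools::install_github("hyginn/BCB420.2019.STRING")
}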
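The mapping itself is a lookup in a named vector. Here is a minimal sketch of that idea, assuming (as the script's use of ensp2sym[...] suggests) that ensp2sym is a named character vector with ENSP IDs as names and HGNC symbols as values. The vector toyMap and all IDs and symbols below are invented for illustration.

```r
# Toy lookup table: names are ENSP IDs, values are HGNC symbols.
# These IDs and symbols are hypothetical, not real STRING entries.
toyMap <- c(ENSP0001 = "GENEA", ENSP0002 = "GENEB")

ids <- c("9606.ENSP0001", "9606.ENSP0002", "9606.ENSP0003")
ids <- gsub("^9606\\.", "", ids)   # strip the taxon prefix

# Index the named vector by ID; IDs without an entry come back as NA.
syms <- unname(toyMap[ids])
syms
```

IDs that are absent from the table yield NA rather than an error, which is why the processing script needs a separate step to remove NA nodes after mapping.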
##### Map the protein action dataset mappings ####

tmp <- readr::read_tsv(file.path("./data", "9606.protein.actions.v11.0.txt"),
                       skip = 1,
                       col_names = c("protein1", "protein2", "mode", "action",
                                     "is_directional", "a_is_acting",
                                     "combined_score"))  # 11,759,454 rows

# Keep "high confidence" interactions, and
# remove "action" col since that information is duplicated in "mode" for our purposes
tmp <- tmp[,c("protein1", "protein2", "mode", "is_directional",
              "a_is_acting", "combined_score")]
tmp <- tmp[tmp$combined_score >= 800, ]

# remove "9606." prefix
tmp$protein1 <- gsub("^9606\\.", "", tmp$protein1)
tmp$protein2 <- gsub("^9606\\.", "", tmp$protein2)

# Map ENSP to HGNC symbols: use Dr. Steipe's mapping tool:
load(file = file.path(".", "data", "ensp2sym.RData"))
tmp$protein1 <- ensp2sym[tmp$protein1]
tmp$protein2 <- ensp2sym[tmp$protein2]

# Validate initial mapping
any(grepl("ENSP", tmp$protein1))  # Nope
any(grepl("ENSP", tmp$protein2))  # None left here either

# Clean duplicate edges (from Dr. Steipe)
sPaste <- function(x, collapse = ":") {
  return(paste(sort(x), collapse = collapse))
}
tmp$key <- apply(tmp[ , c("protein1", "protein2", "mode")], 1, sPaste) # takes a min

length(tmp$key) # 2031426
length(unique(tmp$key)) # 548932

tmp <- tmp[( ! duplicated(tmp$key)),
           c("protein1", "protein2", "mode", "is_directional",
             "a_is_acting", "combined_score") ]

# Remove NA nodes
sum(is.na(tmp$protein1)) # NUM
sum(is.na(tmp$protein2)) # NUM
STRINGactions <- tmp[( ! is.na(tmp$protein1)) & ( ! is.na(tmp$protein2)), ] # 545423

# Save the file
saveRDS(STRINGactions, file = file.path(".", "data", "STRINGactions.RDS")) # 1.6 Mb

# The dataset was uploaded to the assets server and is available with:
STRINGactions <- fetchData("STRINGactions")
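The duplicate-removal step in the script is easiest to follow on a small example. The sketch below applies the same sorted-key idea to a toy data frame; the helper name sortedKey, the data frame edges, and the gene symbols are all hypothetical and stand in for the real table.

```r
# Build an order-independent key from the fields of one row.
sortedKey <- function(x, collapse = ":") {
  paste(sort(x), collapse = collapse)
}

# Toy edge list with made-up symbols.
edges <- data.frame(
  protein1 = c("GENEA", "GENEB", "GENEA", "GENEA"),
  protein2 = c("GENEB", "GENEA", "GENEB", "GENEC"),
  mode     = c("binding", "binding", "catalysis", "binding"),
  stringsAsFactors = FALSE
)

# Rows 1 and 2 describe the same pair with the same mode, just reversed,
# so they receive the same key. Row 3 differs in mode, row 4 in partner.
edges$key <- apply(edges[, c("protein1", "protein2", "mode")], 1, sortedKey)

# Keep the first row for each key and drop the helper column.
unique_edges <- edges[!duplicated(edges$key),
                      c("protein1", "protein2", "mode")]
nrow(unique_edges)
```

Because the key sorts its fields, an edge A-B and its reverse B-A with the same mode collapse into one row, and only the first occurrence is retained. Any per-row values that differ between the two directions (for example the directionality columns in the real data) are therefore taken from whichever row came first.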
Steipe, Boris (2019). BCB420.2019.STRING (STRING data annotation of human genes). R package Github repository
Szklarczyk, D., Gable, A.L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N.T., Morris, J.H., Bork, P., Jensen, L.J., & Mering, C.V. (2018). STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research.
This release of the BCB420.2019.ESA package was produced in the following context of supporting packages:
sessionInfo()