knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# load BCB420-2019.ESA itself for knitr:
pkgName <- trimws(gsub("^Package:", "", readLines("../DESCRIPTION")[1]))
library(pkgName, character.only = TRUE)
This vignette describes the workflow that was used to prepare the STRINGedges0.8 dataset for the package. Source data is interaction scores for protein links from STRING.
STRING is a collection of curated protein-protein interaction scores, calculated in several evidence categories defined by the STRING consortium. STRING data is licensed under the Creative Commons Attribution (CC BY 4.0) license. This document describes work with STRING v11.0 protein network data for Homo sapiens (2018-11-22).
STRING interaction data is available on the STRING consortium website.
STRING data comes in only one format, and genes are identified by Ensembl protein IDs (ENSP). We must download this data and process it with the script below to map the ENSP IDs to HGNC symbols, which gives our package tools a consistent identifier to work with. The dataset is further curated during the mapping process by keeping only those rows where combined_score >= 800. This ensures that the edges used in our tools' analyses are high-confidence, as well as consistent with the STRINGctions dataset (which is the intended purpose of this dataset).
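The filtering and mapping steps described above can be sketched on a toy table. This is only an illustration under stated assumptions: the ENSP IDs, the gene symbols, the scores, and the lookup vector `toyMap` are all invented, and the real workflow uses the `ensp2sym` mapping instead.

```r
# Toy edge table; IDs, symbols and scores are invented for illustration
edges <- data.frame(
  protein1 = c("9606.ENSP00000000001", "9606.ENSP00000000002",
               "9606.ENSP00000000003"),
  protein2 = c("9606.ENSP00000000002", "9606.ENSP00000000003",
               "9606.ENSP00000000001"),
  combined_score = c(950, 640, 800),
  stringsAsFactors = FALSE
)

# Hypothetical lookup vector: names are ENSP IDs, values are symbols
toyMap <- c(ENSP00000000001 = "GENEA",
            ENSP00000000002 = "GENEB",
            ENSP00000000003 = "GENEC")

# Keep only high-confidence edges
edges <- edges[edges$combined_score >= 800, ]

# Strip the taxonomy prefix, then translate IDs by name lookup
edges$protein1 <- unname(toyMap[sub("^9606\\.", "", edges$protein1)])
edges$protein2 <- unname(toyMap[sub("^9606\\.", "", edges$protein2)])

print(edges)
```

An ID that is missing from the lookup vector comes back as NA, which is why the full workflow counts and removes NA nodes at the end.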
We are interested in all the columns contained in the file 9606.protein.links.detailed.v11.0.txt.gz (110.1 Mb). Download this file, uncompress it, and place it in the directory data. (It should be reachable with file.path("..", "data").) Warning: ../data/9606.protein.links.detailed.v11.0.txt is 741 Mb!
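Before running the processing script, it may help to confirm that the uncompressed file is where the script expects it. The following is a minimal sketch; `stringFileStatus()` is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): locate the uncompressed
# STRING file in a directory and report its size in Mb
stringFileStatus <- function(dataDir,
                             fName = "9606.protein.links.detailed.v11.0.txt") {
  path <- file.path(dataDir, fName)
  found <- file.exists(path)
  sizeMb <- if (found) file.size(path) / 1e6 else NA_real_
  list(path = path, found = found, sizeMb = sizeMb)
}

status <- stringFileStatus(file.path("..", "data"))
if (! status$found) {
  message("Not found: ", status$path)
}
```

A reported size far below the expected 741 Mb would suggest an incomplete download or a file that is still compressed.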
To begin processing, we need to make sure the required packages are installed:
readr provides functions for reading data that are highly suitable for large datasets. These are much faster than the built-in read.X() functions. However, readr functions return "tibbles", not data frames. (Here's the difference.)
if (! requireNamespace("readr")) {
  install.packages("readr")
}
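One practical difference between tibbles and data frames is how single-column subsetting behaves. The sketch below uses an invented two-row table; the tibble part only runs if the tibble package happens to be installed.

```r
# Base R data frame: single-column subsetting with [ , ] drops to a vector
df <- data.frame(protein1 = c("A", "B"), score = c(900, 850),
                 stringsAsFactors = FALSE)
dfCol <- df[ , "protein1"]
print(is.character(dfCol))

# Tibble: the same subsetting keeps a one-column table
# (this part only runs if the tibble package is available)
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  tbCol <- tb[ , "protein1"]
  print(is.data.frame(tbCol))
  # $ and [[ ]] still extract a plain vector from a tibble
  print(is.character(tb$protein1))
  # Convert back when a function expects a plain data frame
  df2 <- as.data.frame(tb)
}
```

The workflow in this vignette accesses columns with `$`, which works the same way on both kinds of table.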
BCB420.2019.STRING provides the data required to map ENSP IDs to HGNC symbols (ensp2sym.RData). You can obtain this data by installing Dr. Steipe's BCB420.2019.STRING package:
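# Install devtools only if it is missing
if (! requireNamespace("devtools")) {
  install.packages("devtools")
}
# Install the STRING package whether or not devtools was already present
if (! requireNamespace("BCB420.2019.STRING")) {
  devtools::install_github("hyginn/BCB420.2019.STRING")
}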
# Load raw STRING detailed dataset
tmp <- readr::read_delim(file.path("./data", "9606.protein.links.detailed.v11.0.txt"),
                         delim = " ",
                         skip = 1,
                         col_names = c("protein1", "protein2", "neighborhood",
                                       "fusion", "cooccurence", "coexpression",
                                       "experimental", "database", "textmining",
                                       "combined_score"))  # 11,759,454 rows

# Keep the columns relevant to our analysis (high coexpression edges)
tmp <- tmp[ , c("protein1", "protein2", "coexpression", "combined_score")]
tmp <- tmp[(tmp$combined_score >= 800), ]  # should combined_score be included?

# Do all elements have the right tax id?
all(grepl("^9606\\.", tmp$protein1))  # TRUE
all(grepl("^9606\\.", tmp$protein2))  # TRUE

# Remove "9606." prefix
tmp$protein1 <- gsub("^9606\\.", "", tmp$protein1)
tmp$protein2 <- gsub("^9606\\.", "", tmp$protein2)

# Map ENSP to HGNC symbols: use Dr. Steipe's mapping tool:
load(file = file.path(".", "data", "ensp2sym.RData"))
tmp$protein1 <- ensp2sym[tmp$protein1]
tmp$protein2 <- ensp2sym[tmp$protein2]

# Validate initial mapping
any(grepl("ENSP", tmp$protein1))  # Nope
any(grepl("ENSP", tmp$protein2))  # None left here either

# Clean duplicate edges (from Dr. Steipe)
sPaste <- function(x, collapse = ":") {
  return(paste(sort(x), collapse = collapse))
}
tmp$key <- apply(tmp[ , c("protein1", "protein2")], 1, sPaste)  # takes a minute
length(tmp$key)          # 35072
length(unique(tmp$key))  # 17379
tmp <- tmp[( ! duplicated(tmp$key)),
           c("protein1", "protein2", "coexpression", "combined_score")]

# Remove NA nodes
sum(is.na(tmp$protein1))  # 51
sum(is.na(tmp$protein2))  # 153
STRINGedges <- tmp[( ! is.na(tmp$protein1)) & ( ! is.na(tmp$protein2)), ]  # 17175

# Save the file
saveRDS(STRINGedges, file = file.path(".", "data", "STRINGedges.RDS"))
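The duplicate-edge step relies on building a key that is the same for A-B and B-A. The toy example below shows the idea with invented symbols and scores. The `pmin()`/`pmax()` variant is an alternative offered as an assumption, not the method used in the workflow; it avoids the row-wise `apply()` and may be faster on large tables.

```r
# Toy undirected edges; symbols and scores are invented for illustration
edges <- data.frame(
  protein1 = c("GENEA", "GENEB", "GENEA", "GENEC"),
  protein2 = c("GENEB", "GENEA", "GENEC", "GENEA"),
  combined_score = c(950, 950, 820, 820),
  stringsAsFactors = FALSE
)

# Same helper as in the workflow: order-independent key for a pair
sPaste <- function(x, collapse = ":") {
  return(paste(sort(x), collapse = collapse))
}
keyApply <- apply(edges[ , c("protein1", "protein2")], 1, sPaste)

# Vectorized alternative (an assumption, not the method used above):
# pmin()/pmax() pick the smaller/larger symbol of each row
keyVec <- paste(pmin(edges$protein1, edges$protein2),
                pmax(edges$protein1, edges$protein2),
                sep = ":")

# Keep the first occurrence of each undirected edge
uniqueEdges <- edges[ ! duplicated(keyApply), ]
print(uniqueEdges)
```

Note that `sort()` silently drops NA values, so a row with one unmapped symbol gets a one-symbol key under `sPaste()`; the workflow removes such rows in the following step.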
Steipe, Boris (2019). BCB420.2019.STRING (STRING data annotation of human genes). R package GitHub repository.
Szklarczyk, D., Gable, A.L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N.T., Morris, J.H., Bork, P., Jensen, L.J., & Mering, C.V. (2018). STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research.
This release of the BCB420.2019.ESA package was produced in the following context of supporting packages:
sessionInfo()