knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(igraph)
library(RWRtoolkit)

Introduction

This vignette describes the usage of RWR_CV with respect to validating a multiplex network against an independently curated set of highly connected genes.

RWR-CV can be used in two ways: 1. Using an independently curated set of well connected genes to score a newly generated multiplex. 2. Using a well generated multiplex to explore seed set connectivity.

In this vignette, we will be exploring the first method. To run this algorithm, we will need a multiplex object and an independently curated gene set.

The predictive ability of a multiplex network can be determined using cross validation of gold standard gene sets with the RWR_CV command. A gold standard gene set ought to contain genes that are known to be functionally related (e.g. all are members of a well characterized biosynthetic pathway, sharing the same annotation in a GO/MAPMAN term, etc). In theory, if only some members of the gold set are used as seeds, those that were left out should be found with relatively high precision (i.e. highly ranked by RWR) if the underlying networks are indeed functionally predictive.

The RWR_CV command allows the user to provide a gold set and use k-fold, leave-one-out (LOO), or singleton cross validation methods. Rank metrics and summary statistics such as AUROC and AUPRC are provided in order for the user to determine if their multiplex network is truly predictive. If a multiplex network were constructed from layers containing entirely random edges between genes, then the expectation would be AUROC = 0.50 and AUPRC = g/n where g is the size of the geneset and n is the number of unique genes in the multiplex. If RWR_CV shows significantly higher scores with respect to the gold sets, this indicates a functionally useful biological multiplex.

Four output files are generated by the running of RWR_CV are:

  1. The full ranks file gives the ranks obtained from each particular fold of the cross validation in addition to output statistics.
  2. The mean ranks file: contains the mean rank for each ranked node across each fold. For example, from the example command output, gene PFKP is of rank 2 in the first 2 folds, yet rank 17 in the third fold. While PFKP enjoys the rank of 2 in two of the three folds, it would be spurious to discount the fold in which PFKP's rank is 17. By taking the mean of the ranks across all folds, a more generic scoring metric can be obtained. Note, if there exist any ties within the mean ranks, the rerank column denotes consecutive ranks with respect to those ties.
  3. The metrics file: is comprised of metrics for all mean ranked genes denoting True Positive (TP) (i.e. ranked genes in the supplied geneset), False Positive (FP) (ranked genes that do not exist within the geneset), Cumulative TP and FP as ranks increase, False Positive Rate (i.e. the cumulative number of false positives at row i over the total number of ranked genes), Precision, and Recall.
  4. The summary file: contains summary statistics for each particular fold and the mean of the folds. For each fold, values are reported for varying measures such as Average Precision, AUROC, AUPRC.

Setup

Loading a Multiplex Network:

A multiplex network can be read locally from a file generated with RWR_make_multiplex, or read from a URL. For this vignette, we will be reading our multiplex networks from a URL.

athal_go_multiplex_path <-'https://github.com/dkainer/RWRtoolkit-data/blob/main/GO_AT_d0.5_v01.RData?raw=True'#
athal_multiplex_url <- url(athal_go_multiplex_path)
load(athal_multiplex_url)

For the purposes of this vignette, we'll use an A. thaliana GO multiplex consisting of 1 layer (for further description of the layer, please follow its layer's associated link), however, users are encouraged to substitute other multiplexes in (note: the larger the multiplexes, the more time consuming the randomization of layers will take):

| Network | Nodes | Edges | | ------------------------------------------------------------ | ------ | ------- | | GO semantic similarity (GO) version 0.4 | 7,625 | 2,930,141 |

Creating a network with randomly sampled edges:

In order to illustrate the differences between what is considered to be a "well validated" network and a network hat has a poor validation score, we will need to create a similarly sized multiplex (with a similar amount of nodes and edges) with randomly generated edges. First we need to create a new rewired layer for each layer in the multiplex (in this case, just the GO network) and write it to file. Finally we'll need to collate the layer(s) into a multiplex:

write_table <- function(table, path, row_names = F, col_names = T, verbose = FALSE) {
  if (length(table) == 0 || any(is.na(table))) {
    warning(sprintf("Table to be saved at %s is empty\n", path))
  }
  # Create out_dir if it doesn't exist (avoid warning message if out_dir exists).
  out_dir <- dirname(path)
  if (!dir.exists(out_dir)) {
    dir.create(out_dir, recursive = TRUE)
  }
  # Save the table.
  write.table(table,
    path,
    sep = "\t",
    quote = F,
    col.names = col_names,
    row.names = row_names
  )
  if (verbose) {
    message(sprintf("Saved data to file: %s\n", path))
  }
}

write_networks_to_file <- function(
    network,
    output_file){
    edgelist <- igraph::get.edgelist(network, names=T)
    from <- edgelist[,1]
    to <- edgelist[,2]
    write_table(
        data.frame(
            from = from,
            to = to
        ),
        output_file,
        col_names=F
    )
}

create_random_layer <- function(mpo_layer){
    num_vertices <- vcount(mpo_layer)
    num_edges <- ecount(mpo_layer)
    random_layer <- igraph::erdos.renyi.game(n = num_vertices, p.or.m= num_edges, type = 'gnm')
    V(random_layer)$name <- V(mpo_layer)$name
    return(random_layer)
}

create_layer_and_write_to_file <- function(mpo_layer, file_path){
    new_layer <- create_random_layer(mpo_layer)
    write_networks_to_file(network=new_layer, output_file=file_path)
}

create_random_copy_multiplex <- function(mpo, output_base_path='./'){
    layer_paths <- c()
    layer_names <- c()
    mpo_names <- names(mpo)
    for (layer_number in seq(1, mpo$Number_of_Layers)){
        layer <- mpo[[layer_number]]
        layer_name <- mpo_names[layer_number]
        output_name <- paste(output_base_path, '/', layer_name, '.tsv', sep='')

        create_layer_and_write_to_file(layer, output_name)

        layer_paths <- c(layer_paths, output_name)
        layer_names <- c(layer_names, layer_name)
    }

    flist_df <- data.frame(layer_paths, layer_names)
    write_table(flist_df, paste(output_base_path, 'flist.tsv', sep='/'), row_names=F, col_names=F)
}

output_dir <- './random_layers'
create_random_copy_multiplex(nw.mpo, output_dir)
random_net_filepath <- paste(output_dir, 'randomized_multiplex.Rdata', sep='/')
RWRtoolkit::RWR_make_multiplex(paste(output_dir, 'flist.tsv', sep='/'), output = random_net_filepath)

With our rewired multiplex now created, we can test both our GO_AT Network and the Erdos.Renyi randomized networks with respect to our gold standard gene set. To do so, first we'll need to create our gold standard gene set and write it to file.

Building a "Gold Set"

We will need a "gold set" of genes to test our multiplex networks against. These "gold standard" genes are ones that are independently curated and defined as being highly related in some biological fashion.

For our gold set of genes, we will use the well described Jasmonte pathway genes defined by MapMan. We will then need to annotate those genes with a "set name" as gold set files follow the format (where the set name is some defining string and the gene ids are strings that match to node names within the multiplex):

| Set Name | Gene ID | |:---------|:-------------| | goldSet | node1 | | goldSet | node2 | | goldSet | node3 | | goldSet | node4 |

We can create our own gold set file with:

# Define Genes
jasmonate_seed_genes <- c(
    "AT2G44810", "AT1G17420", "AT1G55020", "AT1G67560",
    "AT1G72520", "AT3G22400", "AT3G45140", "AT5G42650",
    "AT1G13280", "AT3G25760", "AT3G25770", "AT3G25780",
    "AT1G09400", "AT1G17990", "AT1G18020", "AT1G76680",
    "AT1G76690", "AT2G06050", "AT1G17380", "AT1G19180",
    "AT1G48500", "AT1G70700", "AT1G72450", "AT1G74950",
    "AT3G17860", "AT3G43440", "AT5G20900"
)
# Add annotations
seed_gene_setname <- rep("jasmonate_mm", length(jasmonate_seed_genes))
seed_gene_df <- data.frame("set" = seed_gene_setname, "seed" = jasmonate_seed_genes)

# Write File
goldset_filename <- "./goldset_jasomonate.tsv"
write.table(seed_gene_df, goldset_filename, sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)

Multiplex Network Validation Using RWR-CV

In order to validate our multiplex using RWR-CV, we simply need to supply the multiplex in question and an independently defined gold standard gene set. For this network validation we'll use our GO semantic similarity network and the gold set we just recently generated:

gold_net_cv_path <- './go_net_cv_kfold/'

comprehensive_net_cv_validation <- RWRtoolkit::RWR_CV(
  athal_go_multiplex_path, 
  geneset_path = goldset_filename, 
  method = 'kfold',
  folds = 5,
  outdir = gold_net_cv_path, 
  plot = TRUE)
image_path <- paste(gold_net_cv_path, "RWR-CV_jasmonate_mm_GO_AT_d0.5_v01_default.plots.png", sep="/")

knitr::include_graphics(image_path)
random_net_cv_path <- './random_net_cv_kfold/'

comprehensive_net_cv_validation <- RWRtoolkit::RWR_CV(
  random_net_filepath, 
  geneset_path = goldset_filename, 
  method = 'kfold',
  folds = 5,
  outdir = random_net_cv_path, 
  plot = TRUE)
image_path <- paste(random_net_cv_path, "RWR-CV_jasmonate_mm_randomized_multiplex_default.plots.png", sep="/")

knitr::include_graphics(image_path)

Ultimately, as we rewire the networks, we expect to generally see an AUROC around 0.5 (i.e. the ROC curve moves from bottom left to top right) as the results are completely random. Similarly we expect our AUPRC to approach 0 far quicker than our actual networks as similarity



dkainer/RWRtoolkit documentation built on Jan. 11, 2025, 3:26 a.m.