devtools::load_all()

https://doi.org/10.1093/bioinformatics/btz652{width=80%}

Introduction

This document describes the R package mND [@mND], a new network diffusion based method for integrative multi-omics analysis.

Installation

The package can be installed from within R:

install.packages(c("devtools", "igraph"))
devtools::install_github("emosca-cnr/mND", build_vignettes = TRUE)
library(mND)

Definition of the input matrices

The analysis requires the following inputs:

head(mND_demo$W)
mND_demo$X0

IMPORTANT:

Case study: integration of mutation and expression changes in breast cancer

As proof of principle, mND is applied to find functionally related genes on the basis of gene mutations ($L_1$) and gene expression variation ($L_2$) in breast cancer (BC).

Somatic mutations and gene expression data from matched tumour-normal samples (blood for SM and solid tissue for GE) were collected from The Cancer Genome Atlas (TCGA) [@TCGA] for BC, using the R package TCGAbiolinks [@TCGAbiolinks] and considering the human genome version 38 (hg38).

Mutation Annotation Format files were obtained from 4 pipelines: Muse [@Muse], Mutect2 [@Mutect], SomaticSniper [@somaticsniper], Varscan2 [@varscan2]. Only mutation sites detected by at least two variant callers were considered. Gene mutation frequencies ($f$) were calculated as the fraction of subjects in which a gene was associated with at least one mutation.

Gene expression data were obtained using the TCGA workflow ``HTSeq-Counts''. The R package limma [@limma] was used to normalize and quantify differential expression in matched tumor-normal samples, yielding log-fold changes, the corresponding p-values and FDRs (BH method).

We obtained the normalized adjacency matrix ($W$) using the interactions available in STRING [@string], while $X0$ matrix was defined as the two column vectors: $x_1 = f$ and $x_2 = -log_{10}(FDR)$.

Permutations

A set of $k$ permutations can be created using the two functions of NPATools perm_vertices() and perm_X0():

vert_deg <- setNames(rowSums(sign(W)), rownames(W))
perms <- perm_vertices(vert_deg, method = "d", k = 99, cut_par = 9, bin_type = "number")
X0_perm <- perm_X0(X0, perms = perms)

Here's how X0_perm should look like:

mND_demo$X0_perm

Network Diffusion (ND)

Next, we run the ND over the list of permutations X0_perm. This can be time-consuming, so it's a good idea to run it in parallel:

BPPARAM <- MulticoreParam(4)
Xs <- run_ND(X0 = X0_perm, W=W, BPPARAM = BPPARAM)

Here's how $Xs$ should look like:

mND_demo$Xs

Calculation of mND score and significance assessment

Let's use the function neighbour_index to retrieve the list of gene neighbors indexes from the adjacency matrix:

ind_adj <- neighbour_index(W)

Now, we can apply mND function to find functionally related genes on the basis of $x_1^$ and $x_2^$ and their top $k$ neighbours; precomputed mND scores (calculated with $k = 3$) are included in mND package data (i.e. data(mND_score)). The value of $k$ can be optimised exploring the effect of values around 3 on the ranked gene list provided by mND in output (see the optimize_k function).

mND_score <- mND(Xs = Xs, ind_adj = ind_adj, k=3, BPPARAM = BPPARAM)

Empirical $p$-values are calculated by using the pool of $r$ permuted versions of $X0$:

mND_score <- signif_assess(mND_score, BPPARAM = BPPARAM)

You will obtain a list with the following data frames

mND_demo$mND_score 

Gene classification

Lastly, let's classify genes in every layer. We define the set of the high scoring genes as:

Further, we set the cardinalities of gene sets $N_1$ and $N_2$, containing the genes with the highest scoring neighborhoods, as $|H_1|=|N_1|$ and $|H_2|=|N_2|$:

Hl <- list(l1 = rownames(X0[X0[, 1]>0, ]), l2 = names(X0[order(X0[,2], decreasing = T), 2][1:1200]))
top_Nl <- lengths(Hl)

The classification is computed as follows:

class_res <- classification(mND = mND_score, X0 = X0, Hl = Hl, top = top_Nl)
mND_demo$clas_res

References



emosca-cnr/mND documentation built on April 11, 2024, 12:49 p.m.