Description:
Function diffuse takes a network in igraph format (or a graph kernel matrix stemming from a graph) and an initial state to score all the nodes in the network. The seven diffusion scores provided here differ in (a) how they distinguish positives, negatives and unlabelled examples, and (b) their statistical normalisation. The argument method offers the following options:
Methods without statistical normalisation:

raw: positive nodes introduce unitary flow (y_raw[i] = 1) into the network, whereas neither negative nor unlabelled nodes introduce anything (y_raw[j] = 0) [Vandin, 2011]. The scores are computed as

    f_{raw} = K y_{raw}

where K is a graph kernel, see ?kernels. These scores treat negative and unlabelled nodes equivalently.

ml: same as raw, but negative nodes introduce a negative unit of flow [Zoidi, 2015] and are therefore not equivalent to unlabelled nodes.

gm: same as ml, but the unlabelled nodes are assigned a (generally non-null) bias term based on the total number of positives, negatives and unlabelled nodes [Mostafavi, 2008].

ber_s: a quantification of the relative change in the node score before and after the network smoothing. The score for a particular node i can be written as

    f_{ber_s}[i] = f_{raw}[i]/(y_{raw}[i] + eps)

where eps is a parameter controlling the importance of the relative change. A hand-computed sketch of raw and ber_s follows this list.
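To make these definitions concrete, here is a minimal hand-computed sketch of the raw and ber_s scores on the graph_toy data shipped with diffuStats. It assumes the regularised Laplacian kernel (regularisedLaplacianKernel, see ?kernels) as K, and the intermediate variable names (y_all, eps, ...) are illustrative; it is not the package's internal implementation.

library(diffuStats)
data(graph_toy)

# Graph kernel (see ?kernels) and binary input labels, named by node
K <- regularisedLaplacianKernel(graph_toy)
y_raw <- graph_toy$input_vec

# raw: f_raw = K y_raw, using the kernel columns of the labelled nodes
f_raw <- as.numeric(K[, names(y_raw)] %*% y_raw)
names(f_raw) <- rownames(K)

# ber_s: relative change before/after smoothing; eps = 1 is illustrative
eps <- 1
y_all <- setNames(numeric(nrow(K)), rownames(K))
y_all[names(y_raw)] <- y_raw
f_ber_s <- f_raw / (y_all + eps)

head(cbind(f_raw, f_ber_s))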
Methods with statistical normalisation: the raw diffusion score of every node i is computed and compared to its own diffusion scores stemming from a permuted input.

mc: the score of node i is based on its empirical p-value, computed by permuting the input n.perm times:

    p[i] = (r[i] + 1)/(n.perm + 1)

p[i] is roughly the proportion of input permutations that led to a diffusion score as high as or higher than the original diffusion score (a total of r[i] permutations for node i, in absolute terms). This assesses how likely a high diffusion score is to arise by chance, in the absence of signal. To be consistent with the direction of the other scores, mc is defined as:

    f_{mc}[i] = 1 - p[i]

ber_p: as used in [Bersanelli, 2016], this score combines raw and mc, in order to take into account both the magnitude of the raw scores and the effect of the network topology:

    f_{ber_p}[i] = -log10(p[i]) f_{raw}[i]

z: a parametric alternative to mc. The raw score of node i has its mean value subtracted and is divided by its standard deviation. The statistical moments have a closed analytical form, see the main vignette, and are inspired by [Harchaoui, 2013]. Unlike mc and ber_p, the z scores do not require actual permutations, giving them an advantage in terms of speed. A sketch of these normalised scores follows this list.
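A minimal sketch of the permutation-based quantities, written out by hand so the formulas above are explicit. The optimised package calls are diffuse(method = "mc"), diffuse(method = "ber_p") and diffuse(method = "z"); the kernel choice and the intermediate names (y_perm, f_null, r, ...) are illustrative assumptions.

library(diffuStats)
data(graph_toy)
set.seed(1)

K <- regularisedLaplacianKernel(graph_toy)
y_raw <- graph_toy$input_vec
f_raw <- as.numeric(K[, names(y_raw)] %*% y_raw)

# Null diffusion scores: permute the input labels over the labelled nodes
n.perm <- 1000
f_null <- replicate(n.perm, {
  y_perm <- setNames(sample(y_raw), names(y_raw))
  as.numeric(K[, names(y_perm)] %*% y_perm)
})

# r[i]: permutations scoring at least as high as the original score
r <- rowSums(f_null >= f_raw)
p <- (r + 1)/(n.perm + 1)        # empirical p-value
f_mc <- 1 - p                    # mc
f_ber_p <- -log10(p) * f_raw     # ber_p

# z needs no permutations; the package computes it analytically
f_z <- diffuse(graph_toy, scores = y_raw, method = "z")

head(data.frame(mc = f_mc, ber_p = f_ber_p, z = f_z[rownames(K)],
                row.names = rownames(K)))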
If the input labels are not quantitative, i.e. positive (1), negative (0) and possibly unlabelled, all the scores (raw, gm, ml, z, mc, ber_s, ber_p) can be used. Quantitative inputs are naturally defined on raw, z, mc, ber_s and ber_p by extending the definitions above, and are readily available in diffuStats. Further details on the scores can be found in the main vignette.
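As a quick illustration of a quantitative input: the continuous values below are randomly generated for the sake of the example and are not part of the package data.

library(diffuStats)
data(graph_toy)

# Turn the binary toy input into a continuous one by replacing the
# positive labels with random positive values (illustrative only)
set.seed(1)
y_cont <- graph_toy$input_vec
y_cont[y_cont == 1] <- runif(sum(y_cont == 1), min = 0.5, max = 1)

# 'raw', 'z', 'mc', 'ber_s' and 'ber_p' accept this input directly
f_cont <- diffuse(graph_toy, scores = y_cont, method = "raw")
head(f_cont)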
Usage:

diffuse(graph, scores, method, ...)
diffuse_grid(scores, grid_param, ...)
Arguments:

graph: igraph object for the diffusion. Alternatively, a kernel matrix can be provided (see the additional arguments and ?kernels).

scores: scores to be smoothed; either a named numeric vector, a column-wise matrix whose rownames are nodes and colnames are different scores, or a named list of such matrices.

method: character, one of raw, ml, gm, ber_s, ber_p, mc, z.

...: additional arguments for the diffusion method.

grid_param: data frame containing the parameter combinations to explore. The column names should be the names of the parameters. Parameters that have a fixed value can be specified in the grid or through the additional arguments.
Details:

Input scores can be specified in three formats. A single set of scores to smooth can be represented as (1) a named numeric vector; if several such vectors share the node names, they can be provided as (2) a column-wise matrix. However, if the unlabelled entities are not the same from one case to another, (3) a named list of such score matrices can be passed to this function. The input format is kept in the output.
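A small sketch of the last point, checking that the output keeps the input format (the object names are illustrative):

library(diffuStats)
data(graph_toy)

out_vec <- diffuse(graph_toy, scores = graph_toy$input_vec, method = "raw")
out_mat <- diffuse(graph_toy, scores = graph_toy$input_mat, method = "raw")

class(out_vec)  # named numeric vector, like the input vector
class(out_mat)  # matrix with the same columns as the input matrix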
The implementation of mc and ber_p is optimized for sparse inputs; dense inputs might take longer to compute.
Another relevant note: z can give NaN for a particular node when the observed nodes are disconnected from the node being scored. This is because such nodes are backed by neither experimental nor network (topology) data; a quick check for this case is sketched below.
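A sketch of how such nodes could be spotted, scoring only half of the toy nodes so that part of the graph is unobserved (graph_toy happens to be connected, so the check may well come back empty):

library(diffuStats)
library(igraph)
data(graph_toy)

# Score only the first half of the nodes, leaving the rest unobserved
y_half <- head(graph_toy$input_vec, vcount(graph_toy)/2)
f_z <- diffuse(graph_toy, scores = y_half, method = "z")

# Nodes disconnected from every observed node would show up as NaN
names(f_z)[is.nan(f_z)]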
Value:

diffuse returns the diffusion scores, in the same format as scores.

diffuse_grid returns a data frame containing the diffusion scores for the specified combinations of parameters.
Scores "raw": Vandin, F., Upfal, E., & Raphael, B. J. (2011). Algorithms for detecting significantly mutated pathways in cancer. Journal of Computational Biology, 18(3), 507-522.
Scores "ml": Zoidi, O., Fotiadou, E., Nikolaidis, N., & Pitas, I. (2015). Graph-based label propagation in digital media: A review. ACM Computing Surveys (CSUR), 47(3), 48.
Scores "gm": Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C., & Morris, Q. (2008). GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome biology, 9(1), S4.
Scores "mc", "ber_s", "ber_p": Bersanelli, M., Mosca, E., Remondini, D., Castellani, G., & Milanesi, L. (2016). Network diffusion-based analysis of high-throughput data for the detection of differentially enriched modules. Scientific reports, 6.
Scores "z": Harchaoui, Z., Bach, F., Cappe, O., & Moulines, E. (2013). Kernel-based methods for hypothesis testing: A unified view. IEEE Signal Processing Magazine, 30(4), 87-97.
Examples:

##############################
library(igraph)
library(ggplot2)
data(graph_toy)
input_vec <- graph_toy$input_vec
n <- vcount(graph_toy)
##############################
# Examples for 'diffuse':
# Using a binary vector as input
diff_scores <- diffuse(
graph = graph_toy,
scores = input_vec,
method = "raw")
# Using a matrix as input
diff_scores <- diffuse(
graph = graph_toy,
scores = graph_toy$input_mat,
method = "raw")
# Using a list of matrices as input
diff_scores <- diffuse(
graph = graph_toy,
scores = list(myScores1 = graph_toy$input_mat,
myScores2 = head(graph_toy$input_mat, n/2)),
method = "raw")
##############################
# Examples for 'diffuse_grid':
# Using a single vector of scores and comparing the methods
# "raw", "ml", and "z"
df_diff <- diffuse_grid(
graph = graph_toy,
scores = graph_toy$input_vec,
grid_param = expand.grid(method = c("raw", "ml", "z")))
head(df_diff)
# Same settings, but comparing several choices of the
# parameter epsilon ("eps") in the scores "ber_s"
df_diff <- diffuse_grid(
graph = graph_toy,
scores = graph_toy$input_vec,
grid_param = expand.grid(method = "ber_s", eps = 1:5/5))
ggplot(df_diff, aes(x = factor(eps), fill = eps, y = node_score)) +
geom_boxplot()
# Using a matrix with four sets of scores,
# called Single, Row, Small_sample and Large_sample.
# See the 'quickstart' vignette for more details on these toy scores.
# We compute scores for the methods "ber_p" and "mc",
# using both 1e3 and 1e4 permutations in each run
df_diff <- diffuse_grid(
graph = graph_toy,
scores = graph_toy$input_mat,
grid_param = expand.grid(
method = c("mc", "ber_p"),
n.perm = c(1e3, 1e4)))
dim(df_diff)
head(df_diff)
##############################
# Differences when using (1) a quantitative input and
# (2) different backgrounds.
# In this example, the
# small background contains binary scores and continuous scores for
# half of the nodes in the 'graph_toy' example graph.
# (1) Continuous scores have been generated by
# changing the positive labels to a random, positive numeric value.
# The user can see the impact of this in the scores 'raw', 'ber_s',
# 'ber_p', 'mc' and 'z'
# (2) The larger background is just the small background
# completed with zeroes, both for binary and continuous scores.
# This illustrates how 'raw' and 'ber_s' treat unlabelled
# and negative labels equally, whereas 'ml', 'gm', 'ber_p',
# 'mc' and 'z' do not.
# Examples:
# The input:
lapply(graph_toy$input_list, head)
# 'raw' scores treat unlabelled and negative nodes equally,
# and can account for continuous inputs
diff_raw <- diffuse(
graph = graph_toy,
scores = graph_toy$input_list,
method = "raw")
lapply(diff_raw, head)
# 'z' scores distinguish unlabelled from negative nodes and accept
# continuous inputs
diff_z <- diffuse(
graph = graph_toy,
scores = graph_toy$input_list,
method = "z")
lapply(diff_z, head)
# 'ml' and 'gm' are the same score if there are no unobserved nodes
diff_compare <- diffuse_grid(
graph = graph_toy,
scores = input_vec,
grid_param = expand.grid(method = c("raw", "ml", "gm"))
)
df_compare <- reshape2::acast(
diff_compare,
node_id~method,
value.var = "node_score")
head(df_compare)
# 'ml' and 'gm' are different in presence of unobserved nodes
diff_compare <- diffuse_grid(
graph = graph_toy,
scores = head(input_vec, n/2),
grid_param = expand.grid(method = c("raw", "ml", "gm"))
)
df_compare <- reshape2::acast(
diff_compare,
node_id~method,
value.var = "node_score")
head(df_compare)