knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.align = 'center',
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)

Introduction

Network motifs are the building blocks of complex networks, and they can be interpreted as small genetic circuits. Identifying and counting motifs in gene regulatory networks can reveal important aspects of the evolution of transcriptional regulation. In particular, they can be used to explore the impact of gene duplication in the rewiring of regulatory interactions [@mottes2021impact]. magrene aims to identify and analyze motifs in (duplicated) gene regulatory networks to better comprehend the role of gene duplication in network evolution. The figure below shows the networks motifs users can identify with magrene.

knitr::include_graphics("motifs_vignette.png")

Installation

You can install magrene with:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("magrene")

Then, you can load the package with:

library(magrene)
set.seed(123) # for reproducibility

Data description

For this vignette, we will use three example data sets:

  1. gma_grn: A gene regulatory network inferred with BioNERO [@almeida2022bionero] using soybean RNA-seq data from the Soybean Expression Atlas [@machado2020systematic].

  2. gma_ppi: A protein-protein interaction (PPI) network for soybean obtained from the STRING database [@szklarczyk2021string], filtered to contain only physical interactions with confidence score > 0.4.

  3. gma_paralogs: Soybean paralogous gene pairs derived by whole-genome, tandem, proximal, transposed, and dispersed duplications (WGD, TD, PD, TRD, and DD, respectively). This data set was obtained from @almeida2020exploring.

Networks are represented as edge lists. Let's take a look at the data.

data(gma_grn)
head(gma_grn)

data(gma_ppi)
head(gma_ppi)

data(gma_paralogs)
head(gma_paralogs)

Finding motifs

Motifs can be found using find_* motifs, as shown in Figure 1. Each function returns a character vector of motifs, and each motif has its own character representation. Let's demonstrate how they work. For the sake of demonstration, we will only use WGD-derived paralogs. For GRN motifs, we will only use a smaller subset of the edge list.

# Include only WGD-derived paralogs
paralogs <- gma_paralogs[gma_paralogs$type == "WGD", 1:2]

# Keep only the top 30k edges of the GRN, remove "Weight" variable
grn <- gma_grn[1:30000, 1:2]

PPI V

PPI V motifs are paralogous proteins that share an interaction partner. To find them, you will use find_ppi_v(). The character representation of PPI V motifs is:

$$ \text{paralog1-partner-paralog2} $$

# Find PPI V motifs
ppi_v <- find_ppi_v(gma_ppi, paralogs)
head(ppi_v)

V

V motifs are paralogous regulators that regulate the same target. These motifs can be created when a regulator undergoes a small-scale duplication. To find them, you will use find_v(). The character representation of V motifs is:

$$ \text{regulator->target<-regulator} $$

# Find V motifs
v <- find_v(grn, paralogs)
head(v)

Lambda

Lambda motifs are the opposite of V motifs: a single regulator that regulates two target genes that are paralogous. These motifs can be created when an ancestral target gene undergoes a small-scale duplication. To find them you will use find_lambda(). The character representation of lambda motifs is:

$$ \text{target1<-regulator->target2} $$

lambda <- find_lambda(grn, paralogs)
head(lambda)

Delta

Delta motifs are pretty similar to lambda motifs, but here we take protein-protein interactions between targets into account. Thus, they are represented by a regulator that regulates two targets that interact at the protein level. They can be created by the same evolutionary mechanism of lambda motifs. To find them, you will use find_delta(). The character representation of delta motifs is:

$$ \text{target1<-regulator->target2} $$

To find delta motifs, you have two options:

  1. Pass PPI edge list + a vector of previously identified lambda motifs (recommended).

  2. Pass PPI edge list + GRN edge list + paralogs data frame. In this option, find_delta() will find lambda motifs first, then use the lambda vector to find delta motifs. If you have identified lambda motifs beforehand, it is way faster to pass the lambda vector to find_delta(), so you don't have to do double work.

# Find delta motifs from lambda motifs
delta <- find_delta(edgelist_ppi = gma_ppi, lambda_vec = lambda)
head(delta)

Bifan

Bifan motifs are the most complex: they are represented by two paralogous regulators that regulate the same set of two paralogous targets. They can be created when both the ancestral regulator and the ancestral target are duplicated by small-scale duplications, or when the genome undergoes a whole-genome duplication event. To find these motifs, you will use find_bifan(). The character representation of bifan motifs is:

$$ \text{regulator1,regulator2->target1,target2} $$

Under the hood, what find_bifan() does it to find lambda motifs involving the same targets and check if their regulators are paralogs. Thus, if you have identified lambda motifs beforehand, it is much faster to simply give them to find_bifan(), so it doesn't have to find them again.

# Find bifans from lambda motifs
bifan <- find_bifan(paralogs = paralogs, lambda_vec = lambda)
head(bifan)

Counting motifs and evaluating significance

As motifs are simple character vectors, one can count their frequencies with the base R length() function. For example, let's count the frequency of each motif in our example data set:

count_df <- data.frame(
    Motif = c("PPI V", "V", "Lambda", "Delta", "Bifan"),
    Count = c(
        length(ppi_v), length(v), length(lambda), length(delta), length(bifan)
    )
)

count_df

However, unless you have another data set to which you can compare your frequencies, counting is not enough. You need to evaluate the significance of your motif frequencies. One way to do that is by comparing your observed frequencies to a null distribution generated by counting motifs in N (e.g., 1000) simulated networks [^1]. magrene allows you to generate null distributions of motif frequencies for each motif type with the function generate_nulls(). As generating the null distributions takes a bit of time, we will demonstrate generate_nulls() with 5 permutations only. As a rule of thumb, you would probably want N >= 1000.

generate_nulls(grn, paralogs, gma_ppi, n = 5)

[^1]: NOTE: Simulated networks are created by node label permutation (i.e., resampling node labels without replacement). This method allows you to have random networks that preserve the same degree of the original network. Hence, networks are called degree-preserving simulated networks.

As you can see, the output of generate_nulls() is a list of numeric vectors with the frequency of each motif type in the simulated networks[^2]. You can use the null distribution to calculate Z-scores for your observed frequencies, which are defined as below:

[^2]: Note on performance: The function generate_nulls() can be parallelized thanks to the Bioconductor package r BiocStyle::Biocpkg("BiocParallel"). However, keep in mind that parallelization is not always the best choice, because it may take longer to create multiple copies of your data to split into multiple cores than it takes to find motifs with a single core.

$$ Z = \frac{ n_{observed} - \bar{n}{null} }{ \sigma{null} } $$

To calculate Z-scores, you can use the function calculate_Z(). As input, you need to give a list of observed frequencies and a list of nulls. Here, we will load pre-computed null distributions of N = 100.

# Load null distros
data(nulls)
head(nulls)

# Create list of observed frequencies
observed <- list(
    lambda = length(lambda), 
    bifan = length(bifan), 
    V = length(v),
    PPI_V = length(ppi_v),
    delta = length(delta)
)
calculate_Z(observed, nulls)

Now that you have Z-scores, you can use a cut-off of your choice to define significance.

Evaluting interaction similarity

Finally, another interesting pattern you may want to analyze is the interaction similarity between paralogous gene pairs. Previous studies have demonstrated that the Sorensen-Dice similarity is a suitable index for interaction similarity [@defoort2019evolution; @mottes2021impact], which is defined as:

$$ S(A,B) = \frac{2 \left| A \cap B \right|}{ \left|A \right| + \left| B \right|} $$

where A and B are the interacting partners of nodes A and B. To calculate the Sorensen-Dice similarity for paralogous gene pairs, you can use the function sd_similarity(). Let's demonstrate it by calculating the similarity between paralogs in the PPI network.

sim <- sd_similarity(gma_ppi, paralogs)
head(sim)

summary(sim$sorensen_dice)

Session information {.unnumbered}

This document was created under the following conditions:

sessioninfo::session_info()

References {.unnumbered}



almeidasilvaf/magrene documentation built on March 16, 2023, 4:29 a.m.