knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(knitr)
library(ggplot2)

Why an application programming interface?

In addition to the Graphical User Interface, DIANE also comes with all it's server-side functions. They can be called from any R environment or scripts, after loading DIANE.

Using DIANE's functions outside of the interface can be an interesting option for several reasons, among which :

We present in this document how to use those functions on our demo data.

Input data

DIANE's companion dataset is meant to illustrate its functionalities and provide an explicit example of the expected input data.

The chosen dataset contains the transcriptome of Arabidopsis thaliana plants exposed to global warming related conditions. It was generated for the article "Molecular plant responses to combined abiotic stresses put a spotlight on unknown and abundant genes", by Sewelam et Al. in Journal of experimental Botany, 2020. (https://academic.oup.com/jxb/advance-article/doi/10.1093/jxb/eraa250/5842162#204457392).

The experimental perturbations studied are heat, salinity and drought in culture soil. Each factors have two levels, one of them considered as the reference, and the other one as the stress level.

library(DIANE)
data("abiotic_stresses")
data("gene_annotations")
data("regulators_per_organism")

To use DIANE's functions on your data, you should provide R variables that describe your dataset the following formats.

Raw expression counts

It must be an R dataframe with gene IDs as rownames, and condition_replicate as columns. It contains row or normalized expression levels.

In order for your input to be compatible with the proposed organisms for GO terms and annotation, it should contain gene IDs as follow :

Our demo data have raw counts presented as follow :

kable(head(abiotic_stresses$raw_counts))

Experimental design

This is optional, and only used to fit Poisson Generalized linear models to gene clusters. It is shown here to explain our companion dataset.

For each condition name, the level of the factors corresponding to that condition in your study are to be provided.

The experiement we chose for our demo included 3 factors, heat stress (H), mannitol stress(M), as well as salinity stress (S).

You should first identify which level of each factor can be considered as the control level, and which is a perturbation. In our demo, the control condition is referred to as C. Then, the perturbation are specified by their corresponding letter, from simple to triple stress combination. For example, SH corresponds to salt and heat stresses in the control level of mannitol. As a consequence, its levels would be 1,0,1.

The design file is thus a matrix with condition names as rownames, and factor names as columns:

kable(abiotic_stresses$design)

Gene annotation

We provided gene description for Arabidopsis :

kable(tail(gene_annotations[["Arabidopsis thaliana"]], n = 10))

This is not required, but it can be convenient to use this information when displaying lists of genes, to have their associated meaning.

The annotations for other organisms known to DIANE can be retrieved for any list of genes as follows :

knitr::kable(get_gene_information(c("WBGene00000013", "WBGene00000035"), organism = "Caenorhabditis elegans"))

Data pre-processing

Visualization of raw counts

For each condition, we can visualize the distributions of gene counts with boxplots or distributions. To display the distributions, you can use the argument boxplot = FALSE.

DIANE::draw_distributions(abiotic_stresses$raw_counts, boxplot = TRUE)

Normalization

The TCC R package is used for the normalization step here, to make samples comparable by correcting for their differences in sequencing depths. This step is mandatory before further statistical analysis.

You can choose to normalize using the methods implemented in edgeR, referenced as 'tmm', or the one used in DESeq, referenced as 'deseq2'.

Those normalization methods rely on the hypothesis that a very small proportion of genes are differentially expressed between your samples. If you suspect a lot of genes could be differentially expressed in your data, TCC offers the possibility to proceed to a first detection of potential differentially expressed genes, to remove them, and then provide a final less biased normalization.

In that case, enable "prior removal of deferentially expressed genes". TCC will perform the following setp, depending on the normalization method you chose :

+ tmm/deseq2 temporary normalization

+ potential DEG identification and removal using edgeR test method

+ tmm/deseq2 definitive normalization

We use here the default parameters (no prior removal of differentially expressed genes, and TMM method) :

tcc_object <- DIANE::normalize(abiotic_stresses$raw_counts, norm_method = 'tmm', iteration = FALSE)

Note : here, our data should not be normalized, as it is already given as TPMs. Normalization is shown for the purpose of demonstration.

Low counts removal

Removing genes with very low abundance is a common practice in RNA-Seq analysis pipelines for several reasons :

There is no absolute and commonly accepted threshold value, but it is recommended to allow only genes with more than 10 counts per sample in average. DIANE thus proposes a threshold at 10*sampleNumber, but feel free to experiment with other values depending on your dataset and intensions.

threshold = 10*length(abiotic_stresses$conditions)
tcc_object <- DIANE::filter_low_counts(tcc_object, threshold)
normalized_counts <- TCC::getNormalizedData(tcc_object)

We can see the difference with the violin plot view of the effect of low count genes removal :

pre_process <- DIANE::draw_distributions(abiotic_stresses$raw_counts, boxplot = FALSE) + ggtitle("Before")
post_process <- DIANE::draw_distributions(normalized_counts, boxplot = FALSE)+ ggtitle("After")

gridExtra::grid.arrange(pre_process, post_process, ncol = 2)
genes <- sample(abiotic_stresses$heat_DEGs,4)
DIANE::draw_expression_levels(abiotic_stresses$normalized_counts, genes = genes)

Sample homogeneity and global analysis

Principal component analysis

Performing PCA on normalized RNA-Seq counts can be really informative about the conditions that impact gene expression the most. During PCA, new variables are computed, as linear combinations of your initial variables (e.g. experimental conditions). Those new variables, also called principal components, are designed to carry the maximum of the data variability.

We can plot the correlations of the initial variables to the principal components, to see which ones contribute the most to those principal components, and as a consequence, to the overall expression changes.

Each principal component explains a certain amount of the total variability, and those relative percentages are shown in what is called the screeplot.

DIANE::draw_PCA(data = normalized_counts)

The first plane, showing the first two components, is striking in the sense that the conditions are segregated by their heat level. Indeed, we find here that 57.3% of gene expression variability can be linked to heat stress.

The second principal axis, is more correlated to osmotic stress, as conditions with mannitol stress are opposed along this axis, carrying 12.2% of gene expression variance.

Differential expression analysis

Let's say we want to preform differential expression analysis between the conditions C and H, to get genes responding to a simple heat stress :

fit <- DIANE::estimateDispersion(tcc = tcc_object, conditions = abiotic_stresses$conditions)
topTags <- DIANE::estimateDEGs(fit, reference = "C", perturbation = "H", p.value = 0.01, lfc = 2)

# adding annotations
DEgenes <- topTags$table
DEgenes[,c("name", "description")] <- gene_annotations$`Arabidopsis thaliana`[
  match(get_locus(DEgenes$genes, unique = FALSE), rownames(gene_annotations$`Arabidopsis thaliana`)),
  c("label", "description")]

knitr::kable(head(DEgenes, n = 10))

# plots
tags <- DIANE::estimateDEGs(fit, reference = "C", perturbation = "H", p.value = 1)
DIANE::draw_DEGs(tags)
DIANE::draw_DEGs(tags, MA = FALSE)

genes <- topTags$table$genes

DIANE::draw_heatmap(normalized_counts, subset = genes, 
                    title = "Log expression for DE genes under heat stress")

# if we only want the conditions used for differential expression analysis :
DIANE::draw_heatmap(normalized_counts, subset = genes, 
                    title = "Log expression for DE genes under heat stress",
                    conditions = c("C", "H"))

Genes lists comparison

If we want to compare genes differentially expressed between Control and Heat stress, to genes differentially expressed between control and double heat and mannitol stresses :

genes_double_stress <- DIANE::estimateDEGs(fit, reference = "C", perturbation = "MH", p.value = 0.01, lfc = 2)$table$genes

genes_lists <- list("C - H" = genes, "C - HM" = genes_double_stress)
# if we only want the conditions used for differential expression analysis :
DIANE::draw_venn(genes_lists)

Gene Ontology enrichment analysis

Given any set of genes, we can compute which ontologies are significantly enriched. We compare here our list of heat responsive genes to all the genes present in our expression matrix, referred to as background genes.

If the gene IDs have splicing information (e.g. the transcript number), they must be first transformed to classic AGI terms, with the function get_locus, which is the case here.

The package used behind the enrichment analysis function is clusterProfiler. The method uses Fischer test to detect significant GO terms, and they can be either seen as a result dataframe, or in an interactive plot :

genes <- get_locus(topTags$table$genes)

background <- get_locus(rownames(normalized_counts))

genes <- convert_from_agi(genes)
background <- convert_from_agi(background)


go <- enrich_go(genes, background)
DIANE::draw_enrich_go(go, max_go = 30)

DIANE::draw_enrich_go_map(go)

We can limit the number of plotted GO terms with the max go parameter.

As expected, the GO term with the higher gene count is "response to heat". Then, we also observe GO terms linked to oxygen levels and hypoxia, protein folding and degradation, and responses to various other stresses or compounds.

Expression based clustering

The coseq package relies on the statistical framework of mixture models. Each cluster of genes is represented by a Poisson or Gaussian distribution, which parameters is estimated using Expectation-Maximisation algorithms. In the end of the procedure, the number of clusters that maximizes the clustering quality is chosen.

The input genes of a clustering can't be all the genes of the data, we use the output of differential expression analysis instead.

You can also specify a subset of the conditions to be used during the clustering if we're not interested in all the conditions.

Do not forget to use the seed argument to that results can be reproduced.

genes <- topTags$table$genes

clustering <- DIANE::run_coseq(conds = unique(abiotic_stresses$conditions), data = normalized_counts, 
                               genes = genes, K = 6:9, transfo = "arcsin", model = "Normal", seed = 123)
DIANE::draw_coseq_run(clustering$model, plot = "barplots")
DIANE::draw_coseq_run(clustering$model, plot = "ICL")

Profiles visualization

The user can display either a view of all the clusters, or focus on one in particular :

DIANE::draw_profiles(data = normalized_counts, clustering$membership, conds = unique(abiotic_stresses$conditions)) 
DIANE::draw_profiles(data = normalized_counts, clustering$membership, conds = unique(abiotic_stresses$conditions), k = 3) 

All genes in one cluster can be retrieved with :

DIANE::get_genes_in_cluster(membership = clustering$membership, cluster = 3) 

The named vector membership gives, for each gene, its cluster:

head(clustering$membership, n = 40)

Generalized Poisson regression on a cluster of genes

The idea is to extract the importance and effect of each factor. To do so, the expression of each gene is modeled by a Poisson distribution. The log of its parameter (the expected value) is approximated by a linear combination of the factors in the experiment. The coefficients associated to each factors are estimated to fit gene expression, and can be insightful to characterize genes behavior in a particular cluster. The model with interactions is considered automatically. If your design in not a complete crossed design, the interaction term will be null.

The absolute value of a coefficient gives information about the intensity of its effect on gene expression. The highest coefficient(s) are thus the one(s) having the greater impact on a cluster's expression profile.

The sign of a coefficient gives information about the way it impacts expression. If it is positive, it increases the expression when the factor is in its perturbation level. If negative, it decreases it.

genes_cluster <- DIANE::get_genes_in_cluster(
clustering$membership, cluster = 3)
glm <- DIANE::fit_glm(normalized_counts, genes_cluster, abiotic_stresses$design)
summary(glm)

draw_glm(glm)

Network inference

We want to build a Gene Regulatory Network, a graph connecting regulator genes to any other genes (targets or other regulators).

Regulators for your organism

A list of regulators are provided in DIANE for a number of model organisms. For example, here are the ones for Human and Arabidopsis :

print(head(regulators_per_organism[["Arabidopsis thaliana"]]))
print(head(regulators_per_organism[["Homo sapiens"]]))

If your gene IDs are splicing aware, this is not of any use to the inference method, so we want to aggregate the data first. For each gene, we sum all its transcripts to have the final expression levels.

aggregated_data <- aggregate_splice_variants(data = normalized_counts)

Grouping highly correlated regulators

This step is optional, but highly recommended, before the network inference step. Indeed, to leave strong correlations between the inference variables (the regulators), can lead to lose edges. For example, if two regulators are extremely correlated, one could "steal" targets from the other one, that would lose connections to potential target genes, that would be made if the first regulator was not in the dataset.

The information of strongly correlated regulators, is already, in itself, quite insightful and can lead to interesting biological interpretation.

genes <- get_locus(topTags$table$genes)
regressors <- intersect(genes, regulators_per_organism[["Arabidopsis thaliana"]])

# use normalized counts if you did not aggregate splice variants
grouping <- DIANE::group_regressors(aggregated_data, genes, regressors)

grouped_counts <- grouping$counts

grouped_targets <- grouping$grouped_genes

grouped_regressors <- grouping$grouped_regressors

# visNetwork::visNetwork(grouping$correlated_regressors_graph$nodes, 
# grouping$correlated_regressors_graph$edges)

Inference and visualization

The inference can now be performed.

Classic GENIE3 inference

Network inference takes as input a list of genes, generated by differential expression tests, that will be the nodes of the inferred network. Among those genes, some must be identified as potential transcriptional regulators. It is also possible to use a subset of genes for network inference, as clusters determined in the expression based clustering tab.

The GENIE3 package is a method based on Machine Learning (Random Forests) to infer regulatory links between target genes and regulator genes. The advantages of this method is that it returns oriented edges from regulators to targets, which is desired in the context of regulatory networks, and can capture high order interactions between regulators.

For each target gene, the methods uses Random Forests to provide a ranking of all regulators based on their influence on the target expression. This ranking is then merged across all targets, giving a global regulatory links ranking. Settings relative to regulatory weights inference are presented in the left pane.

The idea is then to keep the strongest links to build the gene regulatory network. The way of determining the right number of final edges in the network can be based on a desired network density alone, or followed by edges statistical testing.

Again, we set the seed before using the network inference procedure which is based on random processes :

set.seed(123)
mat <- DIANE::network_inference(grouped_counts, conds = abiotic_stresses$conditions, 
                                targets = grouped_targets, regressors = grouped_regressors, 
                                nCores = 1, verbose = FALSE)

network <- DIANE::network_thresholding(mat, n_edges = length(genes))

data <- network_data(network, regulators_per_organism[["Arabidopsis thaliana"]], 
                     gene_annotations$`Arabidopsis thaliana`)

knitr::kable(head(data$nodes))

DIANE::draw_network(data$nodes, data$edges)

network is an igraph object, and can be analyzed with all the possibilities offered by this complete package about graphs algorithms.

DIANE's edges testing procedure

A matrix of importances is computed, just as in the previous section, but we prefer to use another importance metric for consistency reasons.

set.seed(123)
mat <- DIANE::network_inference(grouped_counts, 
                                conds = abiotic_stresses$conditions, 
                                targets = grouped_targets,
                                regressors = grouped_regressors, 
                                importance_metric = "MSEincrease_oob", 
                                verbose = TRUE)
mat <- abiotic_stresses$heat_DEGs_regulatory_links

Then, we are going to generate a first network, by considering only the number of pairs offering a state of the art density value. The density is usually between 0.1 and 0.001, as reviewed in several papers about biological network topology. This step is crucial, as it prevents from testing all possible pairs, which would be useless for a huge majority of pairs, as well as extremely costly.

nGenes = length(grouped_targets)
nRegulators = length(grouped_regressors)

res <- data.frame(density = seq(0.001, 0.1, length.out = 20),
                  nEdges = sapply(seq(0.001, 0.1, length.out = 20),
                                  get_nEdges, nGenes, nRegulators))
ggplot(res, aes(x = density, y = nEdges)) + geom_line(size = 1) + 
  ggtitle("Number of network edges as a function of its density") + 
  geom_vline(xintercept = 0.03, color = "darkgreen", lty = 2, size = 1.2)

Given this graph, we can consider that a density value of 0.03 would be a good compromise between a small enough subset of edges to test, and a coherent density value, not too restrictive. This step will take a certain time to run, from a few minutes to tens of minutes depending on your hardware.

set.seed(123)
res <- DIANE::test_edges(mat, normalized_counts = grouped_counts, density = 0.03,
                         nGenes = nGenes, 
                         nRegulators = nRegulators, 
                         nTrees = 1000, verbose = TRUE)
res <- abiotic_stresses$heat_edge_tests

The elements contained in the result variable are the network edges associated to their adjusted pvalue, as well as graphics to guide the user in the choice of the adjusted pvalue (fdr) threshold to use for the final network.

res$fdr_nEdges_curve

res$pvalues_distributions + xlim(0,0.1)

kable(head(res$links))

net <- DIANE::network_from_tests(res$links, fdr = 0.05)

All right, the network is ready to be analyzed!

Visualization

The first graph shows, in red, the edges that were not significant according to our perumtation tests.

net_data <- network_data(net, regulators_per_organism[["Arabidopsis thaliana"]], gene_annotations$`Arabidopsis thaliana`)
draw_discarded_edges(res$links, net_data)

The second graph shows to final infered network :

draw_network(net_data$nodes, net_data$edges)

A community discovery is performed using the Louvain algorithm, so we can extract highly connected genes modules. You can see those communities on the graph and plot their profiles:

data$nodes$group <- data$nodes$community

louvain_membership <- data$nodes$community
names(louvain_membership) <- data$nodes$id

knitr::kable(head(louvain_membership, n = 10))

DIANE::draw_network(data$nodes, data$edges)
draw_profiles(aggregated_data, membership = louvain_membership, conds = abiotic_stresses$conditions)

Gene ontology enrichment can also be performed on gene communities (However, if there is too few genes, GO enrichment tests may not be significant).

Analysis

We can plot the network connectivity statistics, and get the regulators and targets of a gene as follows :

DIANE::draw_network_degrees(net_data$nodes, net)

nodes <- data$nodes[order(-data$nodes$degree),]
gene_of_interest <- nodes$id[1] 

print(DIANE::describe_node(network, node = gene_of_interest))


OceaneCsn/DIANE documentation built on Jan. 10, 2024, 6:43 p.m.