knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

This is an example for analyzing a user-provided microarray dataset.

Data preparation

Load in required libraries.

library("ADAGEpath")
library("DT")
library("knitr")

Before any analysis, we need to specify the ADAGE model and the data compendium we want to use.

model <- eADAGEmodel
compendium <- PAcompendium
probe_dist <- probedistribution

Let's load in a sample dataset that come with the package. It's expression data from Pseudomonas aeruginosa wild type and ∆anr grown as biofilms on ∆F508 cystic fibrosis bronchial epithelial cells (CFBEs). Detailed information about the dataset can be found here GSE67006. All its CEL files are stored in the folder "./inst/extdata/anr". To load your own dataset, simply modify this input path.

input_path <- system.file("extdata", "anr/", package = "ADAGEpath")
data_raw <- load_dataset(input = input_path, isProcessed = FALSE,
                         isRNAseq = FALSE, model = model,
                         compendium = compendium, quantile_ref = probe_dist,
                         norm01 = FALSE)

ADAGE only accepts expression values in the (0,1) range. We linearly transform expression values to be between 0 and 1 using the Pa compendium as the reference.

data_normed <- zeroone_norm(input_data = data_raw, use_ref = TRUE,
                            ref_data = compendium)

Now let's specify the phenotypes for each sample. It needs to be a character vector and has the same sample order as the expression data loaded above.

data_pheno <- c("mt", "mt", "mt", "wt", "wt", "wt")

ADAGE signature analysis

Activity calculation

We calculate the activity of each signature for each sample in the dataset.

data_activity <- calculate_activity(input_data = data_normed, model = model)

The returned data_activity is a data.frame with signature names in the first column and activity values per sample starting from the second column.

Active signature detection

We want to find signatures that are differentially active between anr mutant and wildtype samples. We use limma to perform a differential activation test. limma is more robust than a simple t test when sample size is small. A two-group limma analysis is provided in the function build_limma(). You can also build other limma models to test signatures' activities when the experimental design is more complex.

In the limma test, we will use "wt" as the control phenotype. Because there are a lot of signatures passing the significance cutoff, here we use the more stringent Bonferroni procedure instead of the Benjamini–Hochberg procedure for multiple hypothesis correction.

limma_result <- build_limma(data_activity, phenotypes = data_pheno,
                            control_pheno = "wt",
                            use.bonferroni = TRUE)

To take both absolute activity difference and significance into account, we use pareto fronts to pick the most differentially active signatures. We extract differentially active signatures in the first 10 layers of pareto fronts. Modify N_fronts to get more or fewer signatures.

active_sigs <- get_active_signatures(limma_result = limma_result,
                                     pheno_group = "both", 
                                     method = "pareto", N_fronts = 10)

Signatures that are differentially active between anr mutant and wildtype are:

active_sigs

Plot each signature's activity change and significance in the limma test.

plot_volcano(limma_result, highlight_signatures = active_sigs,
             interactive = TRUE)

Look at how the activities of active signature vary across samples.

plot_activity_heatmap(activity = data_activity, signatures = active_sigs)

Combining the volcano plot and the activity heatmap, Node35pos is the most active signature in anr mutant, followed by Node233pos, Node140pos. On the other side, Node38neg, Node31pos, Node269pos, Node205neg are most active in wildtype.

Overlapping signature removal

To reduce the number of signatures to look at, we can check whether these active signatures overlap with each other.

plot_signature_overlap creates a heatmap of odds ratios. The odds ratio represents the odds that two signatures share a specific number of genes.

signature_similarity <- plot_signature_overlap(selected_signatures = active_sigs,
                                               model = model)

Next we calculate the marginal activities of similar signatures. Marginal activity is defined as the activity of signature A after removing genes that it shares with signature B.

marginal_activity <- calculate_marginal_activity(input_data = data_normed,
                                                 selected_signatures = active_sigs,
                                                 model = model)

Again, we build a limma model to test whether these marginal activities are still strongly different between two conditions.

marginal_limma <- build_limma(input_data = marginal_activity,
                              phenotypes = data_pheno, control_pheno = "wt")

Let's visualize the marginal activities in a matrix heatmap. The value in this matrix represents the -log10 transformed adjusted p value in the activation test when the effect of the column signature is removed from the row signature. Values in the diagonal of the heatmap are the activation significance of signatures themselves. Activation significance below the cutoff is marked by a cross sign.

plot_marginal_activation(marginal_limma_result = marginal_limma,
                         signature_order = colnames(signature_similarity),
                         sig_cutoff = 0.05)

Based on this plot, we can see that Node119pos, Node214neg, and Node299pos are completely masked by Node191pos, because after removing Node191pos from them, they become non-significant. Node130pos and Node250neg are masked by Node31pos. Node285neg is masked by Node205neg. Node154pos is masked by Node57neg. Node63neg, Node39neg, Node228pos, Node158neg, Node140pos, Node269neg and Node31neg are masked by Node35pos. Node185neg and Node275pos are masked by Node67pos. Node278pos is masked by Node9pos. We can see that Node67pos, Node35pos, and Node233pos each has unique genes that make them still significant even after removing the effect of another signature. We can safely remove a signature being masked by another signature as long as we keep the second signature.

unique_active_sigs <- remove_redundant_signatures(marginal_limma,
                                                  sig_cutoff = 0.05)
unique_active_sigs

Check out each signature's activity change and significance after removing overlapping signatures.

plot_volcano(limma_result, highlight_signatures = unique_active_sigs,
             interactive = TRUE)

Look at how the activities of active signature vary across samples after removing overlapping signatures.

plot_activity_heatmap(activity = data_activity, signatures = unique_active_sigs)

Active signature interpretation

Associated pathways

We download Pseudomonas aeruginosa KEGG pathway terms from the TRIBE webserver. TRIBE also provides Gene Ontology terms. If you want to evaluate signatures with GO terms, simply replace "KEGG" with "GO" in the following steps. You can also retrieve public genesets created by a user on TRIBE.

KEGG <- fetch_geneset(type = "KEGG")
# we only consider KEGG pathways with more than 5 genes and less than 100 genes
# as meaningful pathways
KEGG_subset <- KEGG[lengths(KEGG) >= 5 & lengths(KEGG) <= 100]

We associate active signatures to known KEGG pathways.

pathway_association <- annotate_signatures_with_genesets(
  selected_signatures = unique_active_sigs, model = model,
  genesets = KEGG_subset)

Calculate the activity of associated pathways inside active signatures. This help differentiate active vs. non-active pathways assoicated with active signatures.

pathway_activity <- signature_geneset_activity(
  signature_geneset_df = pathway_association[, c("signature", "geneset")],
  gene_set_list = KEGG_subset, model = model, input_data = data_normed)
plot_activity_heatmap(pathway_activity)

We run a limma test on pathway activities and find pathways that are truly active in this dataset.

pathway_limma <- build_limma(pathway_activity, phenotypes = data_pheno,
                             control_pheno = "wt", use.bonferroni = TRUE)
# combine pathway association and pathway activation test results
combined_result <- combine_geneset_outputs(
  signature_geneset_association = pathway_association,
  geneset_limma_result = pathway_limma)
knitr::kable(combined_result, digits = 4, row.names = FALSE, align = "c")

There are also signatures uncharacterized by KEGG.

uncharacterized_sigs <- setdiff(unique_active_sigs, pathway_association$signature)
uncharacterized_sigs

Check the signature similarity of the uncharacterized signatures.

plot_signature_overlap(selected_signatures = uncharacterized_sigs, model = model)

Genes in signatures

To review a signature, we can list all genes in it.

DT::datatable(annotate_genes_in_signatures(selected_signatures = "Node35neg",
                                           model = model,
                                           curated_pathways = KEGG))

We can also review a group of signatures

DT::datatable(annotate_genes_in_signatures(selected_signatures = uncharacterized_sigs,
                                           model = model,
                                           curated_pathways = KEGG))

Gene-gene network

We check how genes in the active signatures cluster in the ADAGE gene-gene network.

We calculate an expression fold change for each gene and pass it to the gene-gene network to show as node color. Again, we use limma to test differential expression and get the logFC.

data_raw_limma <- build_limma(input_data = data_raw, phenotypes = data_pheno,
                              control_pheno = "wt")
# build a gene:fold change table from limma result
gene_logFC <- data.frame(geneID = rownames(data_raw_limma),
                         logFC = data_raw_limma$logFC)

Visualize the ADAGE gene-gene network of the active signatures.

visualize_gene_network(selected_signatures = unique_active_sigs,
                       gene_color_value = gene_logFC,
                       model = model, cor_cutoff = 0.5,
                       curated_pathways = KEGG)


greenelab/ADAGEpath documentation built on May 25, 2022, 7:11 a.m.