
# METACLUSTER - an R package for context-specific expression analysis of metabolic gene clusters

Check out our hands-on tutorial on

## About

METACLUSTER facilitates comprehensive condition- and tissue-specific expression analysis of metabolic gene clusters. It is based on a probabilistic framework that characterizes metabolic gene clusters using context-specific gene expression information.

Figure: A) The METACLUSTER framework. B) Cluster diagram and transcriptional activity map of the arabidiol/baruol cluster (Yu et al. 2016) (C463, based on the prediction in Schläpfer et al. 2017). Colors indicate the inferred p-value of the cluster being transcriptionally active per condition and tissue. Gray tiles indicate condition-tissue combinations that are missing from the differential expression dataset. C) Transcriptional activity map of the 317 inferred context-specific gene clusters. Color values denote the number of transcriptionally active gene clusters per condition-tissue combination. Black tiles indicate condition-tissue combinations with no inferred transcriptionally active clusters.

Contact for questions.

## Installation

METACLUSTER is based on R version 3.6.1. The easiest way to install METACLUSTER is through devtools (see the OS-specific notes on installing devtools in the Notes section at the end):

```
library(devtools)

install_github("https://github.com/mbanf/METACLUSTER",
               build_vignettes = TRUE,
               repos = c("http://cran.rstudio.org", "http://bioconductor.org/packages/release/bioc"),
               dependencies = TRUE)
```

## Usage

To run METACLUSTER with the Schläpfer et al. 2017 A. thaliana gene cluster predictions, download all necessary datasets from OneDrive: datasets_athaliana. If you are using your own datasets, see the required format for "custom" datasets in the Notes section.

```
library(METACLUSTER) # load package

setwd("/User/home/METACLUSTER_athaliana_datasets") # set working directory to the dataset files
```

Load individual datasets based on their filenames:

    l.data = load_datasets(input_format = "PCF2017_enzymes_only",
                           filename.geneCluster = "data/ath_geneInCluster_3_aracyc.txt-labeled_NoHypoGenes.txt",
                           filename.genes = "data/genes.txt",
                           filename.sample_ids_differentialExpression = "data/sample_ids_differentialExpression.txt",
                           filename.foldChange_differentialExpression = "data/m.foldChange_differentialExpression.txt",
                           filename.pvalue_differentialExpression = "data/m.pvalue_differentialExpression.txt",
                           filename.experiment_condition_tissue_annotation = "data/experiment_annotation.txt")
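
Before loading, it can help to confirm that every input file is actually present in the working directory. The following is a minimal sketch in base R; the helper `find_missing_files` is hypothetical and not part of METACLUSTER, and the paths are simply copied from the `load_datasets` call above.

```r
# Paths copied from the load_datasets() call, relative to the working directory.
dataset_files <- c(
  geneCluster = "data/ath_geneInCluster_3_aracyc.txt-labeled_NoHypoGenes.txt",
  genes       = "data/genes.txt",
  sample_ids  = "data/sample_ids_differentialExpression.txt",
  foldChange  = "data/m.foldChange_differentialExpression.txt",
  pvalue      = "data/m.pvalue_differentialExpression.txt",
  annotation  = "data/experiment_annotation.txt"
)

# Hypothetical helper (not a METACLUSTER function): returns the paths that do not exist.
find_missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

missing_files <- find_missing_files(dataset_files)
if (length(missing_files) > 0) {
  message("Missing dataset files:\n", paste(missing_files, collapse = "\n"))
} else {
  message("All dataset files found.")
}
```

If any file is reported missing, check that `setwd()` points at the folder containing the downloaded `data/` directory.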

METACLUSTER Parameter sets:

Note: we set b.load_codifferentialAnalysis_monteCarloSimulation = "yes" for the Schläpfer et al. 2017 A. thaliana gene cluster predictions because all co-differential expression datasets for them are pre-computed and provided. For other datasets, set it to "no".

    df.cluster_annotations = run_METACLUSTER(
        m.foldChange_differentialExpression = l.data$m.foldChange_differentialExpression,
        m.pvalue_differentialExpression = l.data$m.pvalue_differentialExpression,
        df.experiment_condition_annotation = l.data$df.experiment_condition_annotation,
        df.geneCluster = l.data$df.geneCluster,
        tb.condition_treatments = l.data$tb.condition_treatments,
        tb.condition_tissues = l.data$tb.condition_tissues,
        n.cpus = 3,
        b.load_codifferentialAnalysis_monteCarloSimulation = "yes",
        pvalue_DifferentialExpression = 0.05,
        probability_codifferentialExpression_MonteCarloSimulation = 0.95,
        pvalue_coexpression_distribution = 0.05,
        pvalue_geneClusterPrediction = 0.05,
        pvalue_geneClusterConsistency = 0.05,
        pvalue_treatment_per_condition = 0.05,
        pvalue_tissue_per_condition = 0.05,
        number_codifferentialExpression_MonteCarloSimulations = 1,
        number_conditionSpecificCoexpressionBackgroundGenePairs = 100,
        min_number_condition_samples = 1,
        seed = 1234,
        heatmap_width = 10,
        heatmap_height = 5,
        foldername.results = "results/",
        foldername.tmp = "tmp/")
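
To give a feel for what a cutoff such as pvalue_DifferentialExpression = 0.05 means for the two input matrices, here is a minimal sketch on toy data. It is not METACLUSTER's internal implementation: the rule used (a gene is called differentially expressed in a sample when its p-value is below the cutoff, with direction taken from the sign of the fold change) and the genes-by-samples layout are assumptions made for illustration, and all values are made up.

```r
# Toy matrices with made-up values; assumed layout: rows are genes, columns are samples.
m.pvalue <- matrix(c(0.01, 0.20, 0.03,
                     0.50, 0.04, 0.90),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
m.foldChange <- matrix(c( 2.1, 0.3, -1.8,
                         -0.2, 1.5,  0.1),
                       nrow = 2, byrow = TRUE,
                       dimnames = dimnames(m.pvalue))

pvalue_DifferentialExpression <- 0.05

# Assumed rule: +1 (up) or -1 (down) where p < cutoff, 0 otherwise.
m.de <- sign(m.foldChange) * (m.pvalue < pvalue_DifferentialExpression)
print(m.de)

# Number of differentially expressed genes per sample under this rule.
print(colSums(m.de != 0))
```

Lowering the cutoff makes the calls stricter, so fewer gene-sample pairs pass the threshold.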

Next, evaluate and store the results:

```
evaluate_and_store_results(df.cluster_annotations = df.cluster_annotations,
                           df.experiment_condition_annotation = l.data$df.experiment_condition_annotation,
                           tb.condition_treatments = l.data$tb.condition_treatments,
                           tb.condition_tissues = l.data$tb.condition_tissues,
                           min_number_of_genes = 3,
                           heatmap_width = 4,
                           heatmap_height = 7,
                           fontsize = 7,
                           fontsize_row = 10,
                           fontsize_col = 10,
                           foldername.results = "results/")
```

## Notes

Installation of devtools dependencies under Ubuntu (prior to installing devtools): sudo apt-get install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev

Then, in R, run install.packages("devtools").

Custom gene cluster data format: a table with the columns "Cluster.ID", "Gene.ID", and "Gene.Name" (see custom_example_data).
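
As an illustration of that column layout, the sketch below builds a small gene cluster table and writes it to disk. The cluster IDs, gene IDs, and gene names are made-up placeholders, and the tab-delimited format with a header row is an assumption; custom_example_data remains the authoritative reference for the exact file format.

```r
# Made-up placeholder rows: two clusters with three genes each.
df.geneCluster <- data.frame(
  Cluster.ID = c("C1", "C1", "C1", "C2", "C2", "C2"),
  Gene.ID    = c("gene01", "gene02", "gene03", "gene04", "gene05", "gene06"),
  Gene.Name  = c("nameA", "nameB", "nameC", "nameD", "nameE", "nameF"),
  stringsAsFactors = FALSE
)

# Assumption: tab-delimited text with a header row (file name is hypothetical).
cluster_file <- file.path(tempdir(), "custom_geneCluster.txt")
write.table(df.geneCluster, cluster_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and confirm the three required column names survived.
df.check <- read.delim(cluster_file, stringsAsFactors = FALSE)
stopifnot(identical(colnames(df.check), c("Cluster.ID", "Gene.ID", "Gene.Name")))

# Genes per cluster.
print(table(df.check$Cluster.ID))
```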

As a prerequisite, the algorithm needs two matrices: m.pvalue_differentialExpression and m.foldChange_differentialExpression.
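
A quick consistency check on the two matrices can catch mismatched inputs early. The sketch below assumes a genes-by-samples layout (rows are genes, columns are differential expression samples), which is inferred from the separate genes and sample ID files in the Usage section and is not confirmed by the package documentation. The helper `check_expression_matrices` is hypothetical and the values are made up.

```r
genes   <- c("gene01", "gene02", "gene03")
samples <- c("sample1", "sample2")

# Made-up values in the assumed genes-by-samples layout.
m.foldChange_differentialExpression <- matrix(
  c( 1.2, -0.4,
     0.0,  2.3,
    -1.1,  0.7),
  nrow = 3, byrow = TRUE, dimnames = list(genes, samples))

m.pvalue_differentialExpression <- matrix(
  c(0.01, 0.60,
    0.95, 0.03,
    0.04, 0.20),
  nrow = 3, byrow = TRUE, dimnames = list(genes, samples))

# Hypothetical helper (not a METACLUSTER function): stops with an error
# if the matrices disagree in shape or labels, or if p-values leave [0, 1].
check_expression_matrices <- function(m.fc, m.p) {
  stopifnot(
    is.matrix(m.fc), is.matrix(m.p),
    identical(dim(m.fc), dim(m.p)),
    identical(dimnames(m.fc), dimnames(m.p)),
    all(m.p >= 0 & m.p <= 1, na.rm = TRUE)
  )
  invisible(TRUE)
}

check_expression_matrices(m.foldChange_differentialExpression,
                          m.pvalue_differentialExpression)
```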

The format of the experimental annotation should be: "series_id" "condition_treatment_1" "condition_treatment_2" "condition_tissue" "unique_ID".
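
The annotation columns listed above can be sketched as a data frame like the following. All values are made-up placeholders, and the idea that each unique_ID identifies exactly one differential expression sample is an assumption rather than documented behavior.

```r
# Made-up placeholder annotation with the five columns named above.
df.annotation <- data.frame(
  series_id             = c("series1", "series1", "series2"),
  condition_treatment_1 = c("heat", "heat", "drought"),
  condition_treatment_2 = c("", "", ""),
  condition_tissue      = c("root", "leaf", "root"),
  unique_ID             = c("sample1", "sample2", "sample3"),
  stringsAsFactors = FALSE
)

required_columns <- c("series_id", "condition_treatment_1", "condition_treatment_2",
                      "condition_tissue", "unique_ID")

# Column names and order match the documented format.
stopifnot(identical(colnames(df.annotation), required_columns))

# Assumption: every unique_ID occurs only once.
stopifnot(anyDuplicated(df.annotation$unique_ID) == 0)

# Samples per treatment-tissue combination.
print(table(df.annotation$condition_treatment_1, df.annotation$condition_tissue))
```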

## References

Banf M, Zhao KM, and Rhee S. METACLUSTER - an R package for context-specific expression analysis of metabolic gene clusters. Bioinformatics, 2019.

Schläpfer P, Zhang P, Wang C, Kim T, Banf M, Chae L, Dreher K, Chavali AK, Nilo-Poyanco, Bernhard T, Kahn D, and Rhee S. Genome-wide prediction of metabolic enzymes, pathways, and gene clusters in plants. Plant Physiology, 2017.



mbanf/METACLUSTER documentation built on Feb. 27, 2020, 1:32 p.m.