knitr::opts_chunk$set(collapse = TRUE, comment = "#>") knitr::opts_chunk$set(fig.width=12, fig.height=8)
This package consists of modularized wrapper functions for multiple genomics analysis packages. It is designed for multi-omics analysis of expression, methylation and/or genotyping data. All modules can also be used individually. This includes (although not all described in this vignette yet):
A flow chart of the package is given here.
The ouput of one function serves directly as input for the next function. For parametrization of the individual functions please see the detailed description for every function accessible via help(function_name)
. Since some of these functions are quite comprehensive, need a certain amount of processing time and generate multiple figures and tables, most figures of this vignette are not produced during vignette compiling but are included as external files. Nevertheless, the function calls presented here will produce the shown figures (plus addititional results), when you run the script with the provided example dataset. All result tables and figures are stored in the designated project folder. Required Bioconductor packages will be installed by the respective function if necessary. For needed packages not yet attached when a function is called, the function will attach them for data processing and detach them afterwards.
Install it from github using the devtools
package:
devtools::install_github("frankRuehle/systemsbio", build_vignettes=TRUE)
library("systemsbio") library("knitr")
There are two functions to load gene expression data into an ExpressionSet object. If you are analyzing Illumina microarrays, use read_Illu_expr_array()
for importing the probe sample profile and the control probe profile exported from GenomeStudio.
gex <- read_Illu_expr_array( dataFile = "SampleProbeProfile.txt", qcFile = "ControlProbeProfile.txt", sampleSheet = "sample_sheet.csv", sampleColumn = "Sample_Name", exprchip= "HumanHT-12 v4", org= "human", covarfile = "covariates.txt", # optional for additional phenotypes method_norm = "none", transform= "log2" )
For other array formats, use readData2eset()
to load numeric expression values and phenotype data. As for read_Illu_expr_array()
, the expression data can directly be log2-transformed.
gex <- readData2eset(exprsData, phenoData, featureData=NULL, ProbeID = "PROBE_ID", sampleColumn = "Sample_Name", groupColumn = "Sample_Group", experimentData = MIAME(), arrayAnnotation= "Humanv4", transform = log2, organism= "human" )
The example dataset gex
was imported using read_Illu_expr_array
and contains 12 samples belonging to 3 experimental groups. The data is log2-transformed but not normalized yet. This is an artificial dataset for demonstration purposes and does not reflect any biological insights. You can access the expression data, phenotype data and feature data using the accessors exprs(gex)
, pData(gex)
and fData(gex)
, respectively.
library(beadarray) library(Biobase) beadarray::exprs(gex)[1:5, 1:5] Biobase::pData(gex)[,1:5] Biobase::fData(gex)[1:5, ]
The function QC_expressionset()
performs quality control of the expression data utilizing Illumina internal control probes (if available) as well as quality control functions of the arrayqualitymetrics
and WGCNA
packages. The result files are stored in the directory given in projectfolder
. Failed samples should be removed from the dataset.
qcMetrics <- QC_expressionset(gex, projectfolder="example_analysis/GEX/QC", projectname="example", phDendro=c("Sample_Group", "Sentrix_ID"), groupColumn = "Sample_Group", cex.dendroLabels = 1)
If internal Illmumina quality control probes are present, boxplots are prepared for all categories of control probes (see below) and are stored in the project folder. A technical note about Illumina control probes is available here.
All results from arrayqualitymetrics
(distance plots, principal component analysis, array intensity distribution, density and MA plots) including outlier detection and further description are stored in the subfolder quality_metrics
.
Finally, a sample dendrogram including outlier detection is prepared using the WGCNA
package.
deg_images<-c("figures/example_boxplot_controlprofile.png", "figures/example_SampleDendrogram_noNorm_adjacency.png") include_graphics(deg_images)
For quality control, look at different quality characteristics and make a decision which samples to remove. Usually, low quality samples look bad in several ways. Here, no sample had to be removed. If you have low quality samples in your dataset, remove them as indicated here:
gex <- gex[, !sampleNames(gex) %in% c("samplenamesToRemove")]
Next, the dataset will be quantile-normalized using the process_eset()
function. log2-transformation is not necessary,
since the data has already been log2-transformed.
gexn <- process_eset(gex, method_norm="quantile", transform="none")
Use DESeq2::DESeqDataSetFromHTSeqCount
function to import count data e.g. from HTSEQ-count into a DESeqDataSet
object. Here, "Sample_Group" given in the design formula is the phenotype used for differential expression analysis
dds <- DESeqDataSetFromHTSeqCount(sampleTable = data.frame(sampleName = your_sample_names, fileName = your_file_names, your_sample_phenotypes, directory = your_directory, design = ~ Sample_Group)
DESeq2 uses un-normalized count data for differential expression analysis. To obtain an expression matrix to be used for PCA, etc, you may process the count data e.g. by filtering low abundance transcripts and transform the data e.g. by applying Variance stabilizing transformation. The function process_dds
processes the DESeqDataSet
and returns a SummarizedExperiment
object to be used for further analysis.
sumexp.vst <- process_dds(dds, min_rowsum = 10, transform.function = varianceStabilizingTransformation, return_SummarizedExperiment = TRUE)
The pcaGoPromoter
package provides functions for PCA analysis as well as for annotated PCA plots of the given expression data. Additionally, the top PC loadings (features most strongly correlated with PCs) are used for enrichment analysis with respect to gene ontology terms (GO) and transcription factor binding sites (TFBS). The wrapper function used here prepares the respective 2d and 3d PCA plots with respect to sample group assignment for the principal components indicated in the function call. The loadings of each PC are used for enrichment analysis and added to the annotated PCA plot. All plots and result tables are stored in the project folder.
pca.gexn <- wrapPCAgoprom(gexn, groupsoi=c('case1', 'case2', 'control'), projectname="example", projectfolder = "example_analysis/GEX/pcaGoPromoter", sample.name.column = "Sample_Name", inputType = "geneSymbol", PCs4table = 3, PCs2plot = c(1,2,3), probes2enrich = 0.025)
deg_images<-c("figures/example_pcainfoplot_PC1_2.png", "figures/example_pcaplot3d_PC1_2_3.png") include_graphics(deg_images)
The WGCNA
package is used for for preparing an unscaled co-expression network and for identifying modules of co-expressed genes. The wrapper function wrapWGCNA()
used here performes all steps in the process automatically.
First, you need to determine the adequate soft threshold power to prepare a scale free network. The function wrapWGCNA()
can be run with parameter softThresholdPower = NULL
to prepare an overview plot for selecting a soft threshold power with r2 > 0.85. Otherwise, you can let the function choose an appropriate threshold with softThresholdPower= "auto"
or set the value directly.
wgcnaSFT.gexn <- wrapWGCNA(gexn, projectfolder= "example_analysis/GEX/WGCNA", softThresholdPower = NULL)
However, since we have a very small and artificial dataset in gexn
, the resulting network does not have a scalefree topology. For this example, we choose a softThresholdPower of 12. In the next run of wrapWGCNA()
, the network is constructed and co-expressed modules are identified. Additional parameters for finetuning the network (e.g. detectCutHeight
, deepSplit
, minModuleSize
) can be transfered to WGCNA::blockwiseModules
via the ...
-parameter. See help(blockwiseModules)
for details. Genes not assigned to a module are collected in the default 'grey' module. A sample dendrogram, a gene dendrogramm(s) with phenotype information as well as boxplots of all modules are prepared and stored in the project folder.
wgcnaNET.gexn <- wrapWGCNA(gexn, projectfolder= "example_analysis/GEX/WGCNA", softThresholdPower= 12, corType= "bicor", networkType = "signed", # inversely correlated genes not in one module TOMType = "signed", maxBlockSize = 5000, # analysis devided into blocks TOMplot= F, MDSplot= F, phDendro= c("Sample_Group"), phModule= c("Sample_Group", "Age"), sampleColumn = "Sample_Name", groupColumn = "Sample_Group", groupsets= c("case1-control", "case2-control"), symbolColumn = "SYMBOL", flashClustMethod = "average", dendroRowText=T, autoColorHeight = FALSE, colorHeight=0.2, cex.labels = 1, detectCutHeight = 0.8, deepSplit = 0, minModuleSize = 75)
deg_images<-c("figures/module_boxplot_1_of_3.png", "figures/ModuleDendrogram_Block.png") include_graphics(deg_images)
The modules are correlated with phenotypes of interest given in phModule
as well as with group assignments given in groupsets
. The correlation results are illustrated in heatmaps. Next, correlated genes of high connectivity (hub genes) within their module are identified and are illustrated in scatter plots (correlation with phenotype/groupset vs. module membership). All result data is also stored as tables within the project folder.
deg_images<-c("figures/Heatmap_Module-Groupset_relationship.png", "figures/Intramodular_analysis_case1-control.png") include_graphics(deg_images)
wrapLimma
uses the limma
package to determine differentially expressed genes according to the given p-value and fold change thresholds for all group comparisons given in comparisons
. Tables containing the filtered as well as unfiltered gene lists are stored in the designated projectfolder
. Heatmaps for each group comparison containg the top differentially expressed genes will be generated automatically. Venn-Diagrams will show the overlap of differentially expressed genes from different group comparisons.
deg <- wrapLimma(gexn, comparisons = c("case1-control", "case2-control"), p.value.threshold = 0.05, FC.threshold = log2(1.5), adjust.method = "BH", projectfolder = "example_analysis/GEX/limma", projectname = "example", Symbol.column = "SYMBOL", geneAnno2keep = c("SYMBOL", "ENTREZID"), venn_comparisons = list(allcomp= c("case1-control", "case2-control")), # Heatmap parameter: maxHM = 50, scale = c("none"), HMincludeRelevantSamplesOnly = TRUE )
deg_images<-c("figures/Heatmap_example_case1-control.png", "figures/Venn_Diagram_example_allcomp.png", "figures/Volcano_example_case1-control.png") include_graphics(deg_images)
Similarily, count data from RNA-Seq can be analysed for differential expression using the DESeq2
package. The wrapper function wrapDESeq2
works analogously to diffLimma
as described above.
deg <- wrapDESeq2(dds, comparisons = c("case1-control", "case2-control"), min_rowsum = 10, p.value.threshold = 0.05, adjust.method ="BH", FC.threshold = log2(1.5), projectfolder = "example_analysis/GEX/deseq", projectname = "example", Symbol.column = NULL, sampleColumn = "Sample_Name", groupColumn= "Sample_Group", venn_comparisons= list(allcomp= c("case1-control", "case2-control")), # Heatmap parameter: maxHM=50, scale = c("none"), HMincludeRelevantSamplesOnly=TRUE )
The function wrapClusterProfiler
takes genelist and optionally background gene list as input to perform enrichment analyses using the clusterProfiler
package. By default, overrepresentation analysis and gene set enrichment analysis (GSEA) is performed for all categories given in enrichmentCat
. While overrepresentation analysis applies a significance threshold to test the top genes, GSEA uses all genes of the list e.g. ranked by p-value. All result tables and plots (cnet plots, enrichment maps, dot plots, GOgraphs and KEGG pathway maps) are stored in the project folder.
clusterp.deg <- wrapClusterProfiler(deg[["DEgenes.unfilt"]], backgroundlist = NULL, projectfolder = "example_analysis/GEX/clusterProfiler", projectname = "example", enrichmentCat = c("GO", "KEGG", "Reactome", "DO"), maxInputGenes = 100, id.type = "SYMBOL", id.column = "SYMBOL", sortcolumn = "adj.P.Val", highValueHighPriority = FALSE, sortcolumn.threshold = 0.05, fun.transform = function(x) {-log10(x)}, FCcolumn = "logFC", threshold_FC = log2(1.5), org = "human", pAdjustMethod = "BH", enrich.p.valueCutoff = 0.05, enrich.q.valueCutoff = 0.05) # access results e.g. for GO term (Biological Process) overrepresentation analysis of the # first entry of deg[["DEgenes.unfilt"]], i.e. comparison case1-control head(as.data.frame(clusterp.gexn[["case1-control"]]["Overrep_BP"]))
Overrep_BP <- read.table("figures/example_case1-control_Overrep_BP_resulttable.txt", sep="\t", header=T) head(Overrep_BP) deg_images<-c("figures/example_case1-control_Overrep_BP_cnetPlot.png", "figures/example_case1-control_Overrep_BP_enrichmentMap.png") include_graphics(deg_images)
The function wrapPWMEnrich
uses the PWMEnrich
package to search for enriched TFBS within an input gene list. It looks up all promotor sequences for theses genes defined by indicated distance from trancription start site (TSS) refering to human genome build hg19. If gene symbols are given instead of entrez IDs, symbols are converted to entrez IDs first. If PromLookup == FALSE
, look up of promotor sequences is omitted and sequences of interest must be given as gene coordinates (hg19) in sequences
instead.
Optionally, preselected motivs given in SearchSelMotifs
are looked up in sequences
. All result tables and plots are stored in the project folder.
tf.deg <- wrapPWMEnrich(deg[["DEgenes"]], annoColumn = NULL, name.organism = "hsapiens", projectfolder= "example_analysis/GEX/TFBS", projectname="example", applyFilter = FALSE, PromLookup = TRUE, id.type = "SYMBOL", id.column = "SYMBOL", PromSeqUpstreamTSS = 2000, PromSeqDownstreamTSS = 200, SearchSelMotifs = NULL, motif.min.score = 0.9)
tf <- read.table("figures/example_case1-control_PWMEnrich_report.txt", sep="\t", header=T) head(tf) deg_images<-c("figures/example_case1-control_PWMEnrich_top.png", "figures/example_case2-control_PWMEnrich_top.png") include_graphics(deg_images)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.