knitr::opts_chunk$set(collapse = TRUE, comment = "#>") knitr::opts_chunk$set(fig.width=12, fig.height=8)
This package consists of modularized wrapper functions for multiple genomics analysis packages. It is designed for multi-omics analysis of expression, methylation and/or genotyping data. All modules can also be used individually. This includes (although possibly not all described in this vignette yet):
A flow chart of the package is given here.
The ouput of one function serves directly as input for the next function. For parametrization of the individual functions please see the comprehensive description for every function accessible via
help(function_name). Required Bioconductor packages will be installed by the respective function if necessary. For needed packages not yet attached when a function is called, the function will attach them for data processing and detach them afterwards. Results are stores as tables or figures within the designated project folder.
Install it from github:
library("systemsbio") library("devtools") library("knitr")
There are two functions to load gene expression data into an ExpressionSet object. If you are analyzing Illumina microarrays, use
read_Illu_expr_array() for importing the probe sample profile and the control probe profile exported from GenomeStudio.
gex <- read_Illu_expr_array( dataFile = "SampleProbeProfile.txt", qcFile = "ControlProbeProfile.txt", sampleSheet = "sample_sheet.csv", sampleColumn = "Sample_Name", groupColumn = "Sample_Group", exprchip= "HumanHT-12 v4", org= "human", covarfile = "covariates.txt", # optionally for additional phenotypes method_norm = "none", transform= "log2" )
For other array formats, use
readData2eset() to load numeric expression values and phenotype data. As for
read_Illu_expr_array(), the expression data can directly be log2-transformed.
gex <- readData2eset(exprsData, phenoData, featureData=NULL, ProbeID = "PROBE_ID", sampleColumn = "Sample_Name", groupColumn = "Sample_Group", experimentData = MIAME(), arrayAnnotation= "Humanv4", transform = log2, organism= "human" )
The example dataset
gex was imported using
read_Illu_expr_array and contains 12 samples belonging to 3 experimental groups. The data is log2-transformed but not normalized yet. This is an artificial dataset for demonstration purposes and does not reflect any biological insights. You can access the expression data, phenotype data and feature data using the accessors
beadarray::exprs(gex)[1:5, 1:5] Biobase::pData(gex)[,1:5] Biobase::fData(gex)[1:5, ]
QC_expressionset() performs quality control of the expression data utilizing Illumina internal control probes (if available) as well as quality control functions of the
WGCNA packages. The result files are stored in the directory given in
projectfolder. Failed samples should be removed from the dataset.
qcMetrics <- QC_expressionset(gex, projectfolder="example_analysis/GEX/QC", projectname="example", phDendro=c("Sample_Group", "Sentrix_ID"), groupColumn = "Sample_Group", cex.dendroLabels = 1)
If internal Illmumina quality control probes are present, boxplots are prepared for all categories of control probes (see below) and are stored in the project folder. A technical note about Illumina control probes is available here.
All results from
arrayqualitymetrics (distance plots, principal component analysis, array intensity distribution, density and MA plots) including outlier detection and further description are given here.
Finally, a sample dendrogram including outlier detection is prepared using the
deg_images<-c("../inst/doc/figures/example_boxplot_controlprofile.png", "../inst/doc/figures/example_SampleDendrogram_noNorm_adjacency.png") include_graphics(deg_images)
For quality control, look at different quality characteristics and make a decision which samples to remove. Usually, low quality samples look bad in several ways. Here, no sample had to be removed. If you have low quality samples in your dataset, remove them as indicated here:
gex <- gex[, !sampleNames(gex) %in% c("samplenamesToRemove")]
Next, the dataset will be quantile-normalized using the
norm_exprs() function. log2-transformation is not necessary,
since the data has already been log2-transformed.
gexn <- norm_exprs(gex, method_norm="quantile", transform="none")
pcaGoPromoter package provides functions for PCA analysis as well as for annotated PCA plots of the given expression data. Additionally, the top PC loadings (features most strongly correlated with PCs) are used for enrichment analysis with respect to gene ontology terms (GO) and transcription factor binding sites (TFBS). The wrapper function used here prepares the respective 2d and 3d PCA plots with respect to sample group assignment for the principal components indicated in the function call. The loadings of each PC are used for enrichment analysis and added to the annotated PCA plot. All plots and result tables are stored in the project folder.
pca.gexn <- wrapPCAgoprom(gexn, groupsoi=c('case1', 'case2', 'control'), projectname="example", projectfolder = "example_analysis/GEX/pcaGoPromoter", sample.name.column = "Sample_Name", inputType = "geneSymbol", PCs4table = 3, PCs2plot = c(1,2,3), probes2enrich = 0.025)
deg_images<-c("../inst/doc/figures/example_pcainfoplot_PC1_2.png", "../inst/doc/figures/example_pcaplot3d_PC1_2_3.png") include_graphics(deg_images)
WGCNA package is used for for preparing an unscaled co-expression network and for identifying modules of co-expressed genes. The wrapper function
wrapWGCNA() used here performes all steps in the process automatically.
First, you need to determine the adequate soft threshold power to prepare a scale free network. The function
wrapWGCNA() can be run with parameter
softThresholdPower = NULL to prepare an overview plot for selecting a soft threshold power with r2 > 0.85. Otherwise, you can let the function choose an appropriate threshold with
softThresholdPower= "auto"or set the value directly.
wgcnaSFT.gexn <- wrapWGCNA(gexn, projectfolder= "example_analysis/GEX/WGCNA", softThresholdPower = NULL)
However, since we have a very small and artificial dataset in
gexn, the resulting network does not have a scalefree topology. For this example, we choose a softThresholdPower of 12. In the next run of
wrapWGCNA(), the network is constructed and co-expressed modules are identified. Additional parameters for finetuning the network (e.g.
minModuleSize) can be transfered to
WGCNA::blockwiseModules via the
help(blockwiseModules) for details. Genes not assigned to a module are collected in the default 'grey' module. A sample dendrogram, a gene dendrogramm(s) with phenotype information as well as boxplots of all modules are prepared and stored in the project folder.
wgcnaNET.gexn <- wrapWGCNA(gexn, projectfolder= "example_analysis/GEX/WGCNA", softThresholdPower= 12, corType= "bicor", networkType = "signed", # inversely correlated genes not in one module TOMType = "signed", maxBlockSize = 5000, # analysis devided into blocks TOMplot= F, MDSplot= F, phDendro= c("Sample_Group"), phModule= c("Sample_Group", "Age"), sampleColumn = "Sample_Name", groupColumn = "Sample_Group", groupsets= c("case1-control", "case2-control"), symbolColumn = "SYMBOL", flashClustMethod = "average", dendroRowText=T, autoColorHeight = FALSE, colorHeight=0.2, cex.labels = 1, detectCutHeight = 0.8, deepSplit = 0, minModuleSize = 75)
deg_images<-c("../inst/doc/figures/module_boxplot_1_of_3.png", "../inst/doc/figures/ModuleDendrogram_Block.png") include_graphics(deg_images)
The modules are correlated with phenotypes of interest given in
phModule as well as with group assignments given in
groupsets. The correlation results are illustrated in heatmaps. Next, correlated genes of high connectivity (hub genes) within their module are identified and are illustrated in scatter plots (correlation with phenotype/groupset vs. module membership). All result data is also stored as tables within the project folder.
deg_images<-c("../inst/doc/figures/Heatmap_Module-Groupset_relationship.png", "../inst/doc/figures/Intramodular_analysis_case1-control.png") include_graphics(deg_images)
diffLimma() uses the
limma package to determine differentially expressed genes according to the given p-value and fold change thresholds for all group comparisons given in
comparisons. Tables containing the filtered as well as unfiltered gene lists are stored in the designated
projectfolder. Heatmaps for each group comparison containg the top differentially expressed genes will be generated automatically. Venn-Diagrams will show the overlap of differentially expressed genes from different group comparisons.
deg <- diffLimma(gexn, comparisons= c("case1-control", "case2-control"), p.value.threshold= 0.05, FC.threshold= log2(1.5), adjust.method="BH", projectfolder = "example_analysis/GEX/Diff_limma", projectname = "example", Symbol.column = "SYMBOL", geneAnno2keep=c("SYMBOL", "ENTREZID"), venn_comparisons=list(allcomp= c("case1-control", "case2-control")), # Heatmap parameter: maxHM=50, HMcexCol=1.5, HMincludeRelevantSamplesOnly=TRUE )
deg_images<-c("../inst/doc/figures/Heatmap_example_case1-control.png", "../inst/doc/figures/Venn_Diagram_example_allcomp.png", "../inst/doc/figures/Volcano_example_case1-control.png") include_graphics(deg_images)
clusterprof takes genelist and optionally background gene list as input to perform enrichment analyses using the
clusterProfiler package. By default, overrepresentation analysis and gene set enrichment analysis (GSEA) is performed for all categories given in
enrichmentCat. While overrepresentation analysis applies a significance threshold to test the top genes, GSEA uses all genes of the list e.g. ranked by p-value. All result tables and plots (cnet plots and enrichment maps) are stored in the project folder.
clusterp.deg <- clusterprof(deg[["DEgenes.unfilt"]], backgroundlist = NULL, projectfolder = "example_analysis/GEX/clusterProfiler", projectname = "example", enrichmentCat = c("GO", "KEGG", "Reactome", "DO"), maxInputGenes = 100, id.type = "SYMBOL", id.column = "SYMBOL", sortcolumn = "adj.P.Val", sortdecreasing = FALSE, sortcolumn.threshold = 0.05, FCcolumn = "logFC", threshold_FC = log2(1.5), org = "human", pAdjustMethod = "BH", enrich.p.valueCutoff = 0.05, enrich.q.valueCutoff = 0.05) # access results e.g. for GO term (Biological Process) overrepresentation analysis of the first entry of deg[["DEgenes.unfilt"]], i.e. comparison case1-control head(as.data.frame(clusterp.gexn[["case1-control"]]["Overrep_BP"]))
Overrep_BP <- read.table("../inst/doc/figures/example_case1-control_Overrep_BP_resulttable.txt", sep="\t", header=T) head(Overrep_BP) deg_images<-c("../inst/doc/figures/example_case1-control_Overrep_BP_cnetPlot.png", "../inst/doc/figures/example_case1-control_Overrep_BP_enrichmentMap.png") include_graphics(deg_images)
TFsearch uses the
PWMEnrich package to search for enriched TFBS within an input gene list. It looks up all promotor sequences for theses genes defined by indicated distance from trancription start site (TSS) refering to human genome build hg19. If gene symbols are given instead of entrez IDs, symbols are converted to entrez IDs first. If
PromLookup == FALSE, look up of promotor sequences is omitted and sequences of interest must be given as gene coordinates (hg19) in
Optionally, preselected motivs given in
SearchSelMotifs are looked up in
sequences. All result tables and plots are stored in the project folder.
tf.deg <- TFsearch(deg[["DEgenes"]], annoColumn = NULL, name.organism = "hsapiens", projectfolder= "example_analysis/GEX/TFBS", projectname="example", applyFilter = FALSE, PromLookup = TRUE, id.type = "SYMBOL", id.column = "SYMBOL", PromSeqUpstreamTSS = 2000, PromSeqDownstreamTSS = 200, SearchSelMotifs = NULL, motif.min.score = 0.9)
tf <- read.table("../inst/doc/figures/example_case1-control_PWMEnrich_report.txt", sep="\t", header=T) head(tf) deg_images<-c("../inst/doc/figures/example_case1-control_PWMEnrich_top.png", "../inst/doc/figures/example_case2-control_PWMEnrich_top.png") include_graphics(deg_images)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.