knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
knitr::opts_chunk$set(fig.width=12, fig.height=8)

Streamlined Analysis and Integration of Systems Biology Data

This package consists of modularized wrapper functions for multiple genomics analysis packages. It is designed for multi-omics analysis of expression, methylation and/or genotyping data. All modules can also be used individually. This includes (although not all described in this vignette yet):

A flow chart of the package is given here. The ouput of one function serves directly as input for the next function. For parametrization of the individual functions please see the detailed description for every function accessible via help(function_name). Since some of these functions are quite comprehensive, need a certain amount of processing time and generate multiple figures and tables, most figures of this vignette are not produced during vignette compiling but are included as external files. Nevertheless, the function calls presented here will produce the shown figures (plus addititional results), when you run the script with the provided example dataset. All result tables and figures are stored in the designated project folder. Required Bioconductor packages will be installed by the respective function if necessary. For needed packages not yet attached when a function is called, the function will attach them for data processing and detach them afterwards.

Installation

Install it from github using the devtools package: devtools::install_github("frankRuehle/systemsbio", build_vignettes=TRUE)

library("systemsbio")
library("knitr")

Expression Data

Import microarray expression data

There are two functions to load gene expression data into an ExpressionSet object. If you are analyzing Illumina microarrays, use read_Illu_expr_array() for importing the probe sample profile and the control probe profile exported from GenomeStudio.

gex <- read_Illu_expr_array(
            dataFile = "SampleProbeProfile.txt", 
            qcFile = "ControlProbeProfile.txt", 
            sampleSheet = "sample_sheet.csv",
            sampleColumn = "Sample_Name", 
            exprchip= "HumanHT-12 v4",
            org= "human",
            covarfile = "covariates.txt",  # optional for additional phenotypes
            method_norm = "none", 
            transform= "log2"  
            ) 

For other array formats, use readData2eset() to load numeric expression values and phenotype data. As for read_Illu_expr_array(), the expression data can directly be log2-transformed.

gex <- readData2eset(exprsData, phenoData, featureData=NULL,
                          ProbeID = "PROBE_ID",
                          sampleColumn = "Sample_Name", 
                          groupColumn  = "Sample_Group",  
                          experimentData = MIAME(),
                          arrayAnnotation= "Humanv4",
                          transform = log2,
                          organism= "human"
                          ) 

The example dataset gex was imported using read_Illu_expr_array and contains 12 samples belonging to 3 experimental groups. The data is log2-transformed but not normalized yet. This is an artificial dataset for demonstration purposes and does not reflect any biological insights. You can access the expression data, phenotype data and feature data using the accessors exprs(gex), pData(gex) and fData(gex), respectively.

library(beadarray)
library(Biobase)
beadarray::exprs(gex)[1:5, 1:5]
Biobase::pData(gex)[,1:5]
Biobase::fData(gex)[1:5, ]

Quality control

The function QC_expressionset() performs quality control of the expression data utilizing Illumina internal control probes (if available) as well as quality control functions of the arrayqualitymetrics and WGCNA packages. The result files are stored in the directory given in projectfolder. Failed samples should be removed from the dataset.

qcMetrics <- QC_expressionset(gex, projectfolder="example_analysis/GEX/QC", 
                    projectname="example", phDendro=c("Sample_Group", "Sentrix_ID"), 
                    groupColumn = "Sample_Group", cex.dendroLabels = 1)

If internal Illmumina quality control probes are present, boxplots are prepared for all categories of control probes (see below) and are stored in the project folder. A technical note about Illumina control probes is available here.

All results from arrayqualitymetrics (distance plots, principal component analysis, array intensity distribution, density and MA plots) including outlier detection and further description are stored in the subfolder quality_metrics. Finally, a sample dendrogram including outlier detection is prepared using the WGCNA package.

deg_images<-c("figures/example_boxplot_controlprofile.png",
                         "figures/example_SampleDendrogram_noNorm_adjacency.png")
include_graphics(deg_images)

For quality control, look at different quality characteristics and make a decision which samples to remove. Usually, low quality samples look bad in several ways. Here, no sample had to be removed. If you have low quality samples in your dataset, remove them as indicated here:

gex <- gex[, !sampleNames(gex) %in% c("samplenamesToRemove")]

Normalize expression set

Next, the dataset will be quantile-normalized using the process_eset() function. log2-transformation is not necessary, since the data has already been log2-transformed.

gexn <- process_eset(gex, method_norm="quantile", transform="none")

Import RNA-Seq count data

Use DESeq2::DESeqDataSetFromHTSeqCount function to import count data e.g. from HTSEQ-count into a DESeqDataSet object. Here, "Sample_Group" given in the design formula is the phenotype used for differential expression analysis

dds <- DESeqDataSetFromHTSeqCount(sampleTable = data.frame(sampleName = your_sample_names, 
                                                           fileName = your_file_names, 
                                                           your_sample_phenotypes,
                                  directory = your_directory,
                                  design = ~ Sample_Group)

DESeq2 uses un-normalized count data for differential expression analysis. To obtain an expression matrix to be used for PCA, etc, you may process the count data e.g. by filtering low abundance transcripts and transform the data e.g. by applying Variance stabilizing transformation. The function process_dds processes the DESeqDataSet and returns a SummarizedExperiment object to be used for further analysis.

sumexp.vst <- process_dds(dds,
                          min_rowsum = 10,
                          transform.function = varianceStabilizingTransformation,
                          return_SummarizedExperiment = TRUE)

Principal component analysis (PCA) and enrichment analysis of PC loadings

The pcaGoPromoter package provides functions for PCA analysis as well as for annotated PCA plots of the given expression data. Additionally, the top PC loadings (features most strongly correlated with PCs) are used for enrichment analysis with respect to gene ontology terms (GO) and transcription factor binding sites (TFBS). The wrapper function used here prepares the respective 2d and 3d PCA plots with respect to sample group assignment for the principal components indicated in the function call. The loadings of each PC are used for enrichment analysis and added to the annotated PCA plot. All plots and result tables are stored in the project folder.

pca.gexn <- wrapPCAgoprom(gexn, groupsoi=c('case1', 'case2', 'control'), 
                           projectname="example", projectfolder = "example_analysis/GEX/pcaGoPromoter",
                           sample.name.column = "Sample_Name", inputType = "geneSymbol",
                           PCs4table = 3, PCs2plot = c(1,2,3), probes2enrich = 0.025)
deg_images<-c("figures/example_pcainfoplot_PC1_2.png",
              "figures/example_pcaplot3d_PC1_2_3.png")
include_graphics(deg_images)

Weighted Gene Co-expression Network Analysis (WGCNA)

The WGCNA package is used for for preparing an unscaled co-expression network and for identifying modules of co-expressed genes. The wrapper function wrapWGCNA() used here performes all steps in the process automatically. First, you need to determine the adequate soft threshold power to prepare a scale free network. The function wrapWGCNA() can be run with parameter softThresholdPower = NULL to prepare an overview plot for selecting a soft threshold power with r2 > 0.85. Otherwise, you can let the function choose an appropriate threshold with softThresholdPower= "auto"or set the value directly.

wgcnaSFT.gexn <- wrapWGCNA(gexn, projectfolder= "example_analysis/GEX/WGCNA", softThresholdPower = NULL)

However, since we have a very small and artificial dataset in gexn, the resulting network does not have a scalefree topology. For this example, we choose a softThresholdPower of 12. In the next run of wrapWGCNA(), the network is constructed and co-expressed modules are identified. Additional parameters for finetuning the network (e.g. detectCutHeight, deepSplit, minModuleSize) can be transfered to WGCNA::blockwiseModules via the ...-parameter. See help(blockwiseModules) for details. Genes not assigned to a module are collected in the default 'grey' module. A sample dendrogram, a gene dendrogramm(s) with phenotype information as well as boxplots of all modules are prepared and stored in the project folder.

wgcnaNET.gexn <- wrapWGCNA(gexn, 
                  projectfolder= "example_analysis/GEX/WGCNA",
                  softThresholdPower= 12, 
                  corType= "bicor", 
                  networkType = "signed", # inversely correlated genes not in one module
                  TOMType = "signed", 
                  maxBlockSize = 5000, # analysis devided into blocks
                  TOMplot= F, MDSplot= F,
                  phDendro= c("Sample_Group"), 
                  phModule= c("Sample_Group", "Age"), 
                  sampleColumn = "Sample_Name", 
                  groupColumn  = "Sample_Group",  
                  groupsets= c("case1-control", "case2-control"),  
                  symbolColumn = "SYMBOL", 
                  flashClustMethod = "average", 
                  dendroRowText=T, autoColorHeight = FALSE, 
                  colorHeight=0.2, cex.labels = 1,
                  detectCutHeight = 0.8,
                  deepSplit = 0,
                  minModuleSize = 75)
deg_images<-c("figures/module_boxplot_1_of_3.png",
              "figures/ModuleDendrogram_Block.png")
include_graphics(deg_images)

The modules are correlated with phenotypes of interest given in phModule as well as with group assignments given in groupsets. The correlation results are illustrated in heatmaps. Next, correlated genes of high connectivity (hub genes) within their module are identified and are illustrated in scatter plots (correlation with phenotype/groupset vs. module membership). All result data is also stored as tables within the project folder.

deg_images<-c("figures/Heatmap_Module-Groupset_relationship.png",
              "figures/Intramodular_analysis_case1-control.png")
include_graphics(deg_images)

Differential gene expression analysis with Limma

wrapLimma uses the limma package to determine differentially expressed genes according to the given p-value and fold change thresholds for all group comparisons given in comparisons. Tables containing the filtered as well as unfiltered gene lists are stored in the designated projectfolder. Heatmaps for each group comparison containg the top differentially expressed genes will be generated automatically. Venn-Diagrams will show the overlap of differentially expressed genes from different group comparisons.

deg <- wrapLimma(gexn, 
                 comparisons = c("case1-control", "case2-control"),
                 p.value.threshold = 0.05, 
                 FC.threshold = log2(1.5), 
                 adjust.method = "BH", 
                 projectfolder = "example_analysis/GEX/limma", 
                 projectname = "example",
                 Symbol.column = "SYMBOL",
                 geneAnno2keep = c("SYMBOL", "ENTREZID"),
                 venn_comparisons = list(allcomp= c("case1-control", "case2-control")),
                 # Heatmap parameter:
                 maxHM = 50,
                 scale = c("none"),
                 HMincludeRelevantSamplesOnly = TRUE
                 )
deg_images<-c("figures/Heatmap_example_case1-control.png",
              "figures/Venn_Diagram_example_allcomp.png",
              "figures/Volcano_example_case1-control.png")
include_graphics(deg_images)

Differential gene expression analysis with DESeq2

Similarily, count data from RNA-Seq can be analysed for differential expression using the DESeq2 package. The wrapper function wrapDESeq2 works analogously to diffLimma as described above.

deg <- wrapDESeq2(dds, 
                  comparisons = c("case1-control", "case2-control"),
                  min_rowsum = 10,
                  p.value.threshold = 0.05, 
                  adjust.method ="BH", 
                  FC.threshold = log2(1.5),
                  projectfolder = "example_analysis/GEX/deseq",
                  projectname = "example", 
                  Symbol.column = NULL,
                  sampleColumn = "Sample_Name",   
                  groupColumn= "Sample_Group", 
                  venn_comparisons= list(allcomp= c("case1-control", "case2-control")), 
                  # Heatmap parameter:
                  maxHM=50, 
                  scale = c("none"),
                  HMincludeRelevantSamplesOnly=TRUE
                  )

Gene Enrichment analysis with clusterProfiler

The function wrapClusterProfiler takes genelist and optionally background gene list as input to perform enrichment analyses using the clusterProfiler package. By default, overrepresentation analysis and gene set enrichment analysis (GSEA) is performed for all categories given in enrichmentCat. While overrepresentation analysis applies a significance threshold to test the top genes, GSEA uses all genes of the list e.g. ranked by p-value. All result tables and plots (cnet plots, enrichment maps, dot plots, GOgraphs and KEGG pathway maps) are stored in the project folder.

clusterp.deg <- wrapClusterProfiler(deg[["DEgenes.unfilt"]], 
                      backgroundlist = NULL, 
                      projectfolder = "example_analysis/GEX/clusterProfiler",
                      projectname = "example", 
                      enrichmentCat = c("GO", "KEGG", "Reactome", "DO"),
                      maxInputGenes = 100,  
                      id.type = "SYMBOL",
                      id.column = "SYMBOL",   
                      sortcolumn = "adj.P.Val",  
                      highValueHighPriority = FALSE, 
                      sortcolumn.threshold = 0.05,
                      fun.transform = function(x) {-log10(x)},
                      FCcolumn = "logFC",
                      threshold_FC = log2(1.5),
                      org = "human",
                      pAdjustMethod = "BH", 
                      enrich.p.valueCutoff = 0.05, 
                      enrich.q.valueCutoff = 0.05) 

# access results e.g. for GO term (Biological Process) overrepresentation analysis of the 
# first entry of deg[["DEgenes.unfilt"]], i.e. comparison case1-control
head(as.data.frame(clusterp.gexn[["case1-control"]]["Overrep_BP"]))
Overrep_BP <- read.table("figures/example_case1-control_Overrep_BP_resulttable.txt", sep="\t", header=T)
head(Overrep_BP)

deg_images<-c("figures/example_case1-control_Overrep_BP_cnetPlot.png",
              "figures/example_case1-control_Overrep_BP_enrichmentMap.png")
include_graphics(deg_images)

Enrichment of Transcription Factor Binding sites (TFBS)

The function wrapPWMEnrich uses the PWMEnrich package to search for enriched TFBS within an input gene list. It looks up all promotor sequences for theses genes defined by indicated distance from trancription start site (TSS) refering to human genome build hg19. If gene symbols are given instead of entrez IDs, symbols are converted to entrez IDs first. If PromLookup == FALSE, look up of promotor sequences is omitted and sequences of interest must be given as gene coordinates (hg19) in sequences instead.
Optionally, preselected motivs given in SearchSelMotifs are looked up in sequences. All result tables and plots are stored in the project folder.

tf.deg <- wrapPWMEnrich(deg[["DEgenes"]], 
                     annoColumn = NULL,  
                     name.organism = "hsapiens", 
                     projectfolder= "example_analysis/GEX/TFBS",
                     projectname="example",
                     applyFilter = FALSE,
                     PromLookup = TRUE,
                     id.type = "SYMBOL",
                     id.column = "SYMBOL",   
                     PromSeqUpstreamTSS = 2000,
                     PromSeqDownstreamTSS = 200,
                     SearchSelMotifs = NULL,
                     motif.min.score = 0.9)
tf <- read.table("figures/example_case1-control_PWMEnrich_report.txt", sep="\t", header=T)
head(tf)

deg_images<-c("figures/example_case1-control_PWMEnrich_top.png",
              "figures/example_case2-control_PWMEnrich_top.png")
include_graphics(deg_images)


frankRuehle/systemsbio documentation built on Sept. 14, 2020, 1:18 a.m.