knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The R package sigident.func
provides functional analysis and is part of the sigident
package framework: https://gitlab.miracum.org/clearly/sigident
In order to use this R package and its functions, you need to prepare a merged gene expression dataset. The workflow to achieve this is presented in the following by making use of the R package sigident.preproc
.
For a detailed background and description of the following steps please view the sigident.preproc
package's vignette.
library(sigident.preproc) library(sigident.func) library(knitr) # initialize filePath: filePath <- tempdir() # define datadir maindir <- "./geodata/" datadir <- paste0(maindir, "data/") dir.create(maindir) dir.create(datadir) # define plotdir plotdir <- "./plots/" dir.create(plotdir) # define idtype idtype = "affy"
studiesinfo <- list( "GSE18842" = list( setid = 1, targetcolname = "source_name_ch1", targetlevelname = "Human Lung Tumor", controllevelname = "Human Lung Control" ), "GSE19804" = list( setid = 1, targetcolname = "source_name_ch1", targetlevelname = "frozen tissue of primary tumor", controllevelname = "frozen tissue of adjacent normal" ), "GSE19188" = list( setid = 1, targetcolname = "characteristics_ch1", controllevelname = "tissue type: healthy", targetlevelname = "tissue type: tumor", use_rawdata = TRUE ) )
All downloaded datasets will be assigned to the global environment.
sigident.preproc::load_geo_data( studiesinfo = studiesinfo, datadir = datadir, plotdir = plotdir, idtype = idtype, viz_batch_boxp = F, viz_batch_gpca = F )
Before using the sigident.func
package, these variables need to be defined properly. One could use them also as arguments directly in the respective function. However, to keep the code clearer we define them here at the beginning and refer at each function to the respective variable.
idtype
and plotdir
have already been defined during the preprocessing steps.
# general csvdir <- "./csv/" dir.create(csvdir) # enrichment analysis species <- "Hs" OrgDb <- "org.Hs.eg.db" organism <- "hsa" #pathwayid <- "hsa04110" # Cell Cycle pathwayid <- "hsa04151" # PI3K-Akt
A common task preceeding expression data analysis is to perform statistical analysis to discover quantitative changes in expression levels between experimental groups. For this reason we here offer the identification of differentially expressed genes (DEGs) based on the limma package. In order correct for multiple testing we considered the p-value adjustment by conducting the "BH" method, which controls the expected false discovery rate (FDR).
A heatmap based on the selected DEGs is created and clear differences regarding the expression profiles between the groups ("Controls" [blue] and "Lung Cancer" [red]) can be recognized.
# define the false discovery rate here fdr <- 0.05 genes <- sigident.func::identify_degs(mergeset = mergeset, diagnosis = diagnosis, q_value = fdr) # heatmap creation filename <- paste0(plotdir, "DEG_heatmap.png") # create colors for map ht_colors <- sigident.func::color_heatmap(sample_metadata = sample_metadata) sigident.func::plot_deg_heatmap(mergeset = mergeset, genes = genes, patientcolors = ht_colors, filename = filename)
knitr::include_graphics(filename)
Some investigations may have been finished at this point yielding the DEGs. We provide two tables containing annotations, like gene symbols (DEG_info.csv) and the results from the limma analysis including log2 fold changes and the adjusted p-values (DEG_results.csv).
deg_info <- sigident.func::export_deg_annotations(mergedset = mergedset, genes = genes, idtype = idtype) data.table::fwrite(deg_info, paste0(csvdir, "DEG_info.csv"))
dim(deg_info) knitr::kable(head(deg_info))
deg_results <- sigident.func::limma_top_table(mergeset = mergeset, diagnosis = diagnosis, q_value = fdr) data.table::fwrite(deg_results, paste0(csvdir, "DEG_results.csv"))
dim(deg_results) knitr::kable(head(deg_results))
rm(deg_info, deg_results, studiesinfo, affinity.spline.coefs) gc()
A common investigation regarding differentially expressed genes analysis is the functional annotation of the DEGs. Furthermore, it is useful to find out if the DEGs are associated with biological processes or molecular functions. The functional analysis supports GO annotation from OrgDb object.
deg_entrez <- unique(mergedset@featureData@data$ENTREZ_GENE_ID) deg_entrez <- deg_entrez[deg_entrez != ""]
For the identification of enriched GO terms and KEGG pathways an over-representation analysis is performed based on linear models. As input the Entrez-IDs and the definition of the species are needed. extract_go_terms
and extract_kegg_terms
output tables containing the most significant top GO terms and KEGG terms respectively. In either case, rows are sorted by the minimum p-value with
- Term: GO term
- (Pathway: KEGG pathway)
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- DE: number of genes in the input Entrez Gene IDs (deg_entrez)
- P.DE: p-value for over-representation of the GO term in the set
enr_topgo <- sigident.func::extract_go_terms(gene = deg_entrez, species = species) data.table::fwrite(enr_topgo, paste0(csvdir, "Top_GO.csv"))
dim(enr_topgo) knitr::kable(head(enr_topgo))
enr_topkegg <- sigident.func::extract_kegg_terms(gene = deg_entrez, species = species) data.table::fwrite(enr_topkegg, paste0(csvdir, "Top_KEGG.csv"))
dim(enr_topkegg) knitr::kable(head(enr_topkegg))
The following test for over-representation is based on the same method, but also taking into account for over expression and under expression in enriched terms with
- Term: GO term
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- Up: number of up-regulated DEGs
- Down: number of down-regulated DEGs
- P.Up: p-value for over-representation of GO term in up-regulated genes
- P.Down: p-value for over-representation of GO term in down-regulated genes
enr_fitlm <- sigident.func::go_diff_reg( mergeset = mergeset, idtype = idtype, diagnosis = diagnosis, entrezids = mergedset@featureData@data$ENTREZ_GENE_ID ) enr_fitlm_topgo <- sigident.func::extract_go_terms( gene = enr_fitlm, species = species, fdr = 0.01 ) data.table::fwrite(enr_fitlm_topgo, paste0(csvdir, "Top_GO_fitlm.csv"))
dim(enr_fitlm_topgo) knitr::kable(head(enr_fitlm_topgo))
enr_fitlm_topkegg <- sigident.func::extract_kegg_terms( gene = enr_fitlm, species = species ) data.table::fwrite(enr_fitlm_topkegg, paste0(csvdir, "Top_KEGG_fitlm.csv"))
dim(enr_fitlm_topkegg) knitr::kable(head(enr_fitlm_topkegg))
GO enrichment analysis of a given set of genes (deg_entrez
) with the defined organism is based on the fitted linear models of the object enr_fitlm
as output of the go_diff_reg
function.
goEnrichmentAnalysis
provides the presentation of the desired KEGG pathway ID (in this case hsa04151
- PI3K-Akt-Pathway) including under expressed (red) and over expressed (green) genes inside the pathway as png images in the folder named 'plots' (object 'plotdir').
enr_analysis <- sigident.func::go_enrichment_analysis( gene = deg_entrez, org_db = OrgDb, organism = organism, fitlm = enr_fitlm, pathwayid = pathwayid, species = organism, plotdir = plotdir )
filename <- paste0(plotdir, "/", pathwayid, ".png")
knitr::include_graphics(filename)
filename <- paste0(plotdir, "/", pathwayid, ".pathview.png")
knitr::include_graphics(filename)
Using plot_enriched_barplot
, a bar plot is created depicting the enrichment scores (adj. p-value) as bar color and gene count as bar height. The parameter type
controls whether to depict enriched GO terms or enriched KEGG terms.
filename <- paste0(plotdir, "Enriched_GO.png") sigident.func::plot_enriched_barplot( enrichmentobj = enr_analysis$go, type = "GO", filename = filename )
knitr::include_graphics(filename)
filename <- paste0(plotdir, "Enriched_KEGG.png") sigident.func::plot_enriched_barplot( enrichmentobj = enr_analysis$kegg, type = "KEGG", filename = filename )
knitr::include_graphics(filename)
rm(bods, enr_analysis, enr_fitlm, enr_fitlm_topgo, enr_fitlm_topkegg, enr_topgo, enr_topkegg, gene.idtype.bods, korg, cpd.simtypes, deg_entrez, gene.idtype.list, ht_colors, mergedset) gc()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.