knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The R package sigident.func provides functional analysis and is part of the sigident package framework: https://gitlab.miracum.org/clearly/sigident

Preprocessing

In order to use this R package and its functions, you need to prepare a merged gene expression dataset. The workflow to achieve this is presented in the following by making use of the R package sigident.preproc.

For a detailed background and description of the following steps please view the sigident.preproc package's vignette.

Initialization of important variables

library(sigident.preproc)
library(sigident.func)
library(knitr)

# initialize filePath:
filePath <- tempdir()

# define datadir
maindir <- "./geodata/"
datadir <- paste0(maindir, "data/")
dir.create(maindir)
dir.create(datadir)

# define plotdir
plotdir <- "./plots/"
dir.create(plotdir)

# define idtype
idtype = "affy"

Define list that contains a representation of the studies metadata

studiesinfo <- list(
  "GSE18842" = list(
    setid = 1,
    targetcolname = "source_name_ch1",
    targetlevelname = "Human Lung Tumor",
    controllevelname = "Human Lung Control"
    ),

  "GSE19804" = list(
    setid = 1,
    targetcolname = "source_name_ch1",
    targetlevelname = "frozen tissue of primary tumor",
    controllevelname = "frozen tissue of adjacent normal"
  ),

  "GSE19188" = list(
    setid = 1,
    targetcolname = "characteristics_ch1",
    controllevelname = "tissue type: healthy",
    targetlevelname = "tissue type: tumor",
    use_rawdata = TRUE
  )
)

Load GEO datasets

All downloaded datasets will be assigned to the global environment.

sigident.preproc::load_geo_data(
  studiesinfo = studiesinfo,
  datadir = datadir,
  plotdir = plotdir,
  idtype = idtype,
  viz_batch_boxp = F,
  viz_batch_gpca = F
) 

Preparations for utilizing the sigident.func package

Before using the sigident.func package, these variables need to be defined properly. One could use them also as arguments directly in the respective function. However, to keep the code clearer we define them here at the beginning and refer at each function to the respective variable. idtype and plotdir have already been defined during the preprocessing steps.

# general
csvdir <- "./csv/"
dir.create(csvdir)

# enrichment analysis
species <- "Hs"
OrgDb <- "org.Hs.eg.db"
organism <- "hsa"
#pathwayid <- "hsa04110" # Cell Cycle
pathwayid <- "hsa04151" # PI3K-Akt

DEG analysis

A common task preceeding expression data analysis is to perform statistical analysis to discover quantitative changes in expression levels between experimental groups. For this reason we here offer the identification of differentially expressed genes (DEGs) based on the limma package. In order correct for multiple testing we considered the p-value adjustment by conducting the "BH" method, which controls the expected false discovery rate (FDR).
A heatmap based on the selected DEGs is created and clear differences regarding the expression profiles between the groups ("Controls" [blue] and "Lung Cancer" [red]) can be recognized.

# define the false discovery rate here
fdr <- 0.05

genes <- sigident.func::identify_degs(mergeset = mergeset,
                                      diagnosis = diagnosis,
                                      q_value = fdr)

# heatmap creation
filename <- paste0(plotdir, "DEG_heatmap.png")
# create colors for map
ht_colors <- sigident.func::color_heatmap(sample_metadata = sample_metadata)
sigident.func::plot_deg_heatmap(mergeset = mergeset,
                                genes = genes,
                                patientcolors = ht_colors,
                                filename = filename)
knitr::include_graphics(filename)

Some investigations may have been finished at this point yielding the DEGs. We provide two tables containing annotations, like gene symbols (DEG_info.csv) and the results from the limma analysis including log2 fold changes and the adjusted p-values (DEG_results.csv).

deg_info <- sigident.func::export_deg_annotations(mergedset = mergedset,
                                                  genes = genes,
                                                  idtype = idtype)
data.table::fwrite(deg_info, paste0(csvdir, "DEG_info.csv"))
dim(deg_info)
knitr::kable(head(deg_info))
deg_results <- sigident.func::limma_top_table(mergeset = mergeset,
                                              diagnosis = diagnosis,
                                              q_value = fdr)
data.table::fwrite(deg_results, paste0(csvdir, "DEG_results.csv"))
dim(deg_results)
knitr::kable(head(deg_results))
rm(deg_info, deg_results, studiesinfo, affinity.spline.coefs)
gc()

Functional analysis

Gene enrichment

A common investigation regarding differentially expressed genes analysis is the functional annotation of the DEGs. Furthermore, it is useful to find out if the DEGs are associated with biological processes or molecular functions. The functional analysis supports GO annotation from OrgDb object.

deg_entrez <- unique(mergedset@featureData@data$ENTREZ_GENE_ID)
deg_entrez <- deg_entrez[deg_entrez != ""]

Test for over-representation

For the identification of enriched GO terms and KEGG pathways an over-representation analysis is performed based on linear models. As input the Entrez-IDs and the definition of the species are needed. extract_go_terms and extract_kegg_terms output tables containing the most significant top GO terms and KEGG terms respectively. In either case, rows are sorted by the minimum p-value with
- Term: GO term
- (Pathway: KEGG pathway)
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- DE: number of genes in the input Entrez Gene IDs (deg_entrez)
- P.DE: p-value for over-representation of the GO term in the set

enr_topgo <- sigident.func::extract_go_terms(gene = deg_entrez,
                                             species = species)
data.table::fwrite(enr_topgo, paste0(csvdir, "Top_GO.csv"))
dim(enr_topgo)
knitr::kable(head(enr_topgo))
enr_topkegg <- sigident.func::extract_kegg_terms(gene = deg_entrez,
                                                 species = species)
data.table::fwrite(enr_topkegg, paste0(csvdir, "Top_KEGG.csv"))
dim(enr_topkegg)
knitr::kable(head(enr_topkegg))

The following test for over-representation is based on the same method, but also taking into account for over expression and under expression in enriched terms with
- Term: GO term
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- Up: number of up-regulated DEGs
- Down: number of down-regulated DEGs
- P.Up: p-value for over-representation of GO term in up-regulated genes
- P.Down: p-value for over-representation of GO term in down-regulated genes

enr_fitlm <- sigident.func::go_diff_reg(
  mergeset = mergeset,
  idtype = idtype,
  diagnosis = diagnosis,
  entrezids = mergedset@featureData@data$ENTREZ_GENE_ID
)

enr_fitlm_topgo <- sigident.func::extract_go_terms(
  gene = enr_fitlm,
  species = species,
  fdr = 0.01
)
data.table::fwrite(enr_fitlm_topgo, paste0(csvdir, "Top_GO_fitlm.csv"))
dim(enr_fitlm_topgo)
knitr::kable(head(enr_fitlm_topgo))
enr_fitlm_topkegg <- sigident.func::extract_kegg_terms(
  gene = enr_fitlm,
  species = species
)
data.table::fwrite(enr_fitlm_topkegg, paste0(csvdir, "Top_KEGG_fitlm.csv"))
dim(enr_fitlm_topkegg)
knitr::kable(head(enr_fitlm_topkegg))

GO enrichment analysis

GO enrichment analysis of a given set of genes (deg_entrez) with the defined organism is based on the fitted linear models of the object enr_fitlm as output of the go_diff_reg function.
goEnrichmentAnalysis provides the presentation of the desired KEGG pathway ID (in this case hsa04151 - PI3K-Akt-Pathway) including under expressed (red) and over expressed (green) genes inside the pathway as png images in the folder named 'plots' (object 'plotdir').

enr_analysis <- sigident.func::go_enrichment_analysis(
  gene = deg_entrez,
  org_db = OrgDb,
  organism = organism,
  fitlm = enr_fitlm,
  pathwayid = pathwayid,
  species = organism,
  plotdir = plotdir
)
filename <- paste0(plotdir, "/", pathwayid, ".png")
knitr::include_graphics(filename)
filename <- paste0(plotdir, "/", pathwayid, ".pathview.png")
knitr::include_graphics(filename)

Using plot_enriched_barplot, a bar plot is created depicting the enrichment scores (adj. p-value) as bar color and gene count as bar height. The parameter type controls whether to depict enriched GO terms or enriched KEGG terms.

filename <- paste0(plotdir, "Enriched_GO.png")
sigident.func::plot_enriched_barplot(
  enrichmentobj = enr_analysis$go,
  type = "GO",
  filename = filename
)
knitr::include_graphics(filename)
filename <- paste0(plotdir, "Enriched_KEGG.png")
sigident.func::plot_enriched_barplot(
  enrichmentobj = enr_analysis$kegg,
  type = "KEGG",
  filename = filename
)
knitr::include_graphics(filename)
rm(bods, enr_analysis, enr_fitlm, enr_fitlm_topgo, enr_fitlm_topkegg,
   enr_topgo, enr_topkegg, gene.idtype.bods, korg, cpd.simtypes,
   deg_entrez, gene.idtype.list, ht_colors, mergedset)
gc()


miracum/clearly-sigident.func documentation built on June 28, 2022, 3:16 p.m.