In miracum/clearly-sigident.preproc: Sigident Preprocessing

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The R package sigident.preproc provides preprocessing functionalities and is part of the sigident package framework: https://gitlab.miracum.org/clearly/sigident

In order to apply the R package sigident and its functions on GEO datasets, these datasets have first to be downloaded and preprocessed. The preprocessing includes the creation of a merged gene expression data set, extraction of sample metadata and batch correction. The R package sigident can be found here.

GEO platforms

Please note, that the GEO functionality of this R package has only been tested with the platforms, listed below:

GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array

Setup

Initially, the R package needs to be loaded and the variable datadir and plotdir need to be defined. datadir points to the folder, where the GEO datasets are downloaded and stored. plotdir points to the folder, where the resulting plots created during the preprocessing are stored. idtype can be either "affy" or "entrez", indicating, if either Affy-IDs or Entrez-IDs are to be used as feature names for the subsequent analyses. Selecting "entrez" usually results in a reduction of total features due to the removing of empty strings and duplicate IDs. We here use idtype = "affy" for all subsequent analyses.

library(sigident.preproc)

# initialize filePath:
filePath <- tempdir()

# define datadir
maindir <- "./geodata/"
datadir <- paste0(maindir, "data/")
dir.create(maindir)
dir.create(datadir)

# define plotdir
plotdir <- "./plots/"
dir.create(plotdir)

# define idtype
idtype <- "affy"

Definition of a list that contains a representation of the studie's / studies' metadata

This example uses the GEO lung cancer studies "GSE18842", "GSE19804" and "GSE19188", which contain in total 367 samples (197 tumor; 170 non-tumor) and 54,675 transcripts obtained by using Affymetrix GeneChip Human Genome U133 Plus 2.0 Array (platform GPL570).

The mapping information include the setid (which GEO set should be used, normally '1'), the targetcolname (the name of the column holding the variable of interest), the targetlevelname (the value of the positive outcomes in targetcolname) and the controllevelname (the value of the negative outcomes in targetcolname) need to be represented in the studiesinfo list object. If specifying 'use_rawdata' = TRUE, raw the data is downloaded in CEL file format, uncompressed and subsequently normalized with a GCRMA normalization. The directory specified in datadir will be used for storing downloaded files.

studiesinfo <- list(
  "GSE18842" = list(
    setid = 1,
    targetcolname = "source_name_ch1",
    targetlevelname = "Human Lung Tumor",
    controllevelname = "Human Lung Control"
    ),

  "GSE19804" = list(
    setid = 1,
    targetcolname = "source_name_ch1",
    targetlevelname = "frozen tissue of primary tumor",
    controllevelname = "frozen tissue of adjacent normal"
  ),

  "GSE19188" = list(
    setid = 1,
    targetcolname = "characteristics_ch1",
    controllevelname = "tissue type: healthy",
    targetlevelname = "tissue type: tumor",
    use_rawdata = TRUE
  )
)

In order to find these mapping information in the pheno data, one has probably to extract them manually from the studies, e.g. with the following commands for the study "GSE18842":

setid <- 1
datadir <- "./geodata/data/"
name <- "GSE18842"

# download eset
eset <- GEOquery::getGEO(name, destdir = datadir)[[setid]]

# View eset in Rstudio
View(Biobase::pData(eset))

Please be aware that currently, this package works only with binary classification, meaning, the target variable (specified in targetcolname) may only contain two levels. There are scenarios, where your target contains more than two levels. In such cases, this package supports two possible approaches:

1: You can specify only those two levels of interest with controllevelname and targetlevelname. During the loading process, the GEO sets will be reduced to only contain cases with these levels of interest.
2: You can specify more than one level for controllevelname and targetlevelname, repspectively. During the loading process, the levels specified in those variables will be summarized to one level, either "Control" or "Target", in the merged data set. In order to achieve a such summarization of levels, please separate them by using three vertical lines (e.g. '|||', unicode character 'U+007C'). There MUST NOT be any addional character between the levels to be separated and the three vertical lines. The following code gives an example on how a splitting could look like for the GEO set 'GSE58095':

studiesinfo <- list(
  "GSE58095" <- list(
    setid = 1,
    targetcolname = "characteristics_ch1.1",
    controllevelname = "group: Control|||group: SSc|||time point: Early|||time point: Late",  
    targetlevelname = "tissue: skin biopsy"
  )
)

You can also specify all categories of one GEO set to belong to one group. For example, by setting controllevelname to NULL, all categories specified in targetlevelname are treated as targets and will internally be renamed accordingly.

studiesinfo <- list(
  "GSE58095" <- list(
    setid = 1,
    targetcolname = "characteristics_ch1.1",
    controllevelname = NULL,  
    targetlevelname = "tissue: skin biopsy|||group: Control|||group: SSc|||time point: Early"
  )
)

Caution: categories not specified in either targetlevelname or controllevelname will be excluded from the merging!

Load GEO datasets

The function load_geo_data downloads the GEO datasets specified in the list studiesinfo. Further preprocessing steps are performed, such as the discovering, visualizing and removing of batch effectsa and the merging of the data.

Batch effects are systematic non-biological variation between studies due to experimental and technical artifacts. As first visualization a boxplot is created with the included samples on the x-axis and the expression values on the y-axis. Thereby, considerable discrepancies between the three studies used in this example can already be recognized.

A more powerfull and also here included approach for batch effect detection is conducting a guided Principle Component Analysis (gPCA) implemented in the gPCA package.

In order to correct for occurring batch effects and other unwanted variation in high-throughput experiments, the ComBat function from the sva package is also applied. The ComBat function adjusts for known batches using an empirical Bayesian framework [1].

mergeset results as output of the above described merging approach and represents a matrix containing batch corrected expression data with genes in the rows and samples in the columns, not an ExpressionSet anymore.

load_geo_data(studiesinfo = studiesinfo,
              datadir = datadir,
              plotdir = plotdir,
              idtype = idtype)

Visualize batch effect

These plots are created when executing the function sigident.preproc::load_geo_data and stored in the directory specified in plotdir.

Before batch correction

knitr::include_graphics(paste0(plotdir, "GSE18842_batch_effect_boxplot.jpg"))
knitr::include_graphics(paste0(plotdir, "GSE19804_batch_effect_boxplot.jpg"))
knitr::include_graphics(paste0(plotdir, "GSE19188_batch_effect_boxplot.jpg"))
knitr::include_graphics(paste0(plotdir, "Merged_before_batch_effect_boxplot.jpg"))
knitr::include_graphics(paste0(plotdir, "PCplot_before.png"))

After batch correction

knitr::include_graphics(paste0(plotdir, "Merged_after_batch_effect_boxplot.jpg"))
knitr::include_graphics(paste0(plotdir, "PCplot_after.png"))

List files in .Globalenv

All downloaded datasets and resulting objects, including diagnosis, mergedset and mergeset are assigned to the global environment and are suitable to be used in the subsequent analyses implemented in the R package sigident: https://gitlab.miracum.org/clearly/sigident.

ls()

References

[1] W.E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray data using empirical bayes methods. Biostatistics, 8(1):118–127, 2007.

miracum/clearly-sigident.preproc documentation built on June 28, 2022, 3:17 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com