knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The R package sigident.preproc
provides preprocessing functionalities and is part of the sigident
package framework: https://gitlab.miracum.org/clearly/sigident
In order to apply the R package sigident
and its functions on GEO datasets, these datasets have first to be downloaded and preprocessed. The preprocessing includes the creation of a merged gene expression data set, extraction of sample metadata and batch correction.
The R package sigident
can be found here.
Please note, that the GEO functionality of this R package has only been tested with the platforms, listed below:
Initially, the R package needs to be loaded and the variable datadir
and plotdir
need to be defined. datadir
points to the folder, where the GEO datasets are downloaded and stored. plotdir
points to the folder, where the resulting plots created during the preprocessing are stored. idtype
can be either "affy" or "entrez", indicating, if either Affy-IDs or Entrez-IDs are to be used as feature names for the subsequent analyses. Selecting "entrez" usually results in a reduction of total features due to the removing of empty strings and duplicate IDs. We here use idtype = "affy"
for all subsequent analyses.
library(sigident.preproc) # initialize filePath: filePath <- tempdir() # define datadir maindir <- "./geodata/" datadir <- paste0(maindir, "data/") dir.create(maindir) dir.create(datadir) # define plotdir plotdir <- "./plots/" dir.create(plotdir) # define idtype idtype <- "affy"
This example uses the GEO lung cancer studies "GSE18842", "GSE19804" and "GSE19188", which contain in total 367 samples (197 tumor; 170 non-tumor) and 54,675 transcripts obtained by using Affymetrix GeneChip Human Genome U133 Plus 2.0 Array (platform GPL570).
The mapping information include the setid
(which GEO set should be used, normally '1'), the targetcolname
(the name of the column holding the variable of interest), the targetlevelname
(the value of the positive outcomes in targetcolname
) and the controllevelname
(the value of the negative outcomes in targetcolname
) need to be represented in the studiesinfo
list object. If specifying 'use_rawdata' = TRUE
, raw the data is downloaded in CEL file format, uncompressed and subsequently normalized with a GCRMA normalization. The directory specified in datadir
will be used for storing downloaded files.
studiesinfo <- list( "GSE18842" = list( setid = 1, targetcolname = "source_name_ch1", targetlevelname = "Human Lung Tumor", controllevelname = "Human Lung Control" ), "GSE19804" = list( setid = 1, targetcolname = "source_name_ch1", targetlevelname = "frozen tissue of primary tumor", controllevelname = "frozen tissue of adjacent normal" ), "GSE19188" = list( setid = 1, targetcolname = "characteristics_ch1", controllevelname = "tissue type: healthy", targetlevelname = "tissue type: tumor", use_rawdata = TRUE ) )
In order to find these mapping information in the pheno data, one has probably to extract them manually from the studies, e.g. with the following commands for the study "GSE18842":
setid <- 1 datadir <- "./geodata/data/" name <- "GSE18842" # download eset eset <- GEOquery::getGEO(name, destdir = datadir)[[setid]] # View eset in Rstudio View(Biobase::pData(eset))
Please be aware that currently, this package works only with binary classification, meaning, the target variable (specified in targetcolname
) may only contain two levels. There are scenarios, where your target contains more than two levels. In such cases, this package supports two possible approaches:
controllevelname
and targetlevelname
. During the loading process, the GEO sets will be reduced to only contain cases with these levels of interest. controllevelname
and targetlevelname
, repspectively. During the loading process, the levels specified in those variables will be summarized to one level, either "Control" or "Target", in the merged data set. In order to achieve a such summarization of levels, please separate them by using three vertical lines (e.g. '|||', unicode character 'U+007C'). There MUST NOT be any addional character between the levels to be separated and the three vertical lines. The following code gives an example on how a splitting could look like for the GEO set 'GSE58095': studiesinfo <- list( "GSE58095" <- list( setid = 1, targetcolname = "characteristics_ch1.1", controllevelname = "group: Control|||group: SSc|||time point: Early|||time point: Late", targetlevelname = "tissue: skin biopsy" ) )
You can also specify all categories of one GEO set to belong to one group. For example, by setting controllevelname
to NULL, all categories specified in targetlevelname
are treated as targets and will internally be renamed accordingly.
studiesinfo <- list( "GSE58095" <- list( setid = 1, targetcolname = "characteristics_ch1.1", controllevelname = NULL, targetlevelname = "tissue: skin biopsy|||group: Control|||group: SSc|||time point: Early" ) )
Caution: categories not specified in either targetlevelname
or controllevelname
will be excluded from the merging!
The function load_geo_data
downloads the GEO datasets specified in the list studiesinfo
. Further preprocessing steps are performed, such as the discovering, visualizing and removing of batch effectsa and the merging of the data.
Batch effects are systematic non-biological variation between studies due to experimental and technical artifacts. As first visualization a boxplot is created with the included samples on the x-axis and the expression values on the y-axis. Thereby, considerable discrepancies between the three studies used in this example can already be recognized.
A more powerfull and also here included approach for batch effect detection is conducting a guided Principle Component Analysis (gPCA) implemented in the gPCA package.
In order to correct for occurring batch effects and other unwanted variation in high-throughput experiments, the ComBat
function from the sva package is also applied. The ComBat
function adjusts for known batches using an empirical Bayesian framework [1].
mergeset
results as output of the above described merging approach and represents a matrix containing batch corrected expression data with genes in the rows and samples in the columns, not an ExpressionSet anymore.
load_geo_data(studiesinfo = studiesinfo, datadir = datadir, plotdir = plotdir, idtype = idtype)
These plots are created when executing the function sigident.preproc::load_geo_data
and stored in the directory specified in plotdir
.
knitr::include_graphics(paste0(plotdir, "GSE18842_batch_effect_boxplot.jpg")) knitr::include_graphics(paste0(plotdir, "GSE19804_batch_effect_boxplot.jpg")) knitr::include_graphics(paste0(plotdir, "GSE19188_batch_effect_boxplot.jpg")) knitr::include_graphics(paste0(plotdir, "Merged_before_batch_effect_boxplot.jpg")) knitr::include_graphics(paste0(plotdir, "PCplot_before.png"))
knitr::include_graphics(paste0(plotdir, "Merged_after_batch_effect_boxplot.jpg")) knitr::include_graphics(paste0(plotdir, "PCplot_after.png"))
All downloaded datasets and resulting objects, including diagnosis
, mergedset
and mergeset
are assigned to the global environment and are suitable to be used in the subsequent analyses implemented in the R package sigident
: https://gitlab.miracum.org/clearly/sigident.
ls()
[1] W.E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray data using empirical bayes methods. Biostatistics, 8(1):118–127, 2007.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.