loadEData: Loading pre-defined and user-defined expression data
In waldronlab/GSEABenchmarkeR: Reproducible GSEA Benchmarking

loadEData

R Documentation

Loading pre-defined and user-defined expression data

Description

This function implements a general interface for loading the pre-defined GEO2KEGG microarray compendium and the TCGA RNA-seq compendium. It also allows loading of user-defined data from file.

Usage

loadEData(edata, nr.datasets = NULL, cache = TRUE, ...)

Arguments

`edata`	Expression data compendium. A character vector of length 1 that must be one of: 'geo2kegg' (loads the GEO2KEGG microarray compendium), 'tcga' (loads the TCGA RNA-seq compendium), or an absolute file path pointing to a directory in which a user-defined compendium has been saved as RDS files. See Details.
`nr.datasets`	Integer. Number of datasets that should be loaded from the compendium. This is mainly for demonstration purposes.
`cache`	Logical. Should an already cached version be used if available? Defaults to `TRUE`.
`...`	Additional arguments passed to the internal loading routines of the GEO2KEGG and TCGA compendia.

For loading of the GEO2KEGG compendium, this currently includes:

- `preproc`: logical. Should probe level data automatically be summarized to gene level data? Defaults to `FALSE`.
- `de.only`: logical. Include only datasets in which differentially expressed genes have been found? Defaults to `FALSE`.
- `excl.metac`: logical. Exclude datasets for which MetaCore rather than KEGG pathways have been assigned as target pathways? Defaults to `FALSE`.

For loading of the TCGA compendium, this currently includes:

- `mode`: character. Determines how TCGA RNA-seq datasets are obtained. To obtain raw read counts from GSE62944 use either `'ehub'` (default, via ExperimentHub) or `'geo'` (direct download from GEO, slow). Alternatively, use `'cTD'` to obtain normalized log2 TPM values from curatedTCGAData.
- `data.dir`: character. Absolute file path indicating where processed RDS files for each dataset are written to. Defaults to `NULL`, which will then write to `tools::R_user_dir("GSEABenchmarkeR")`.
- `min.ctrls`: integer. Minimum number of controls, i.e. adjacent normal samples, for a cancer type to be included. Defaults to 9.
- `paired`: logical. Should the pairing of samples (tumor and adjacent normal) be taken into account? Defaults to `TRUE`, which reduces the data for each cancer type to patients for which both sample types (tumor and adjacent normal) are available. Use `FALSE` to obtain all samples in an unpaired manner.
- `min.cpm`: integer. Minimum counts-per-million reads mapped. See the edgeR vignette for details. The default filter is to exclude genes with cpm < 2 in more than half of the samples.
- `with.clin.vars`: logical. Should clinical variables (>500) be kept to allow for more advanced sample groupings in addition to the default binary grouping (tumor vs. normal)?
- `map2entrez`: Should human gene symbols be automatically mapped to Entrez Gene IDs? Defaults to `TRUE`.
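As a rough illustration of how the additional arguments are forwarded through `...`, the sketch below collects a few of the options described for each compendium into argument lists and passes them to `loadEData` via `do.call`. The particular combinations (`preproc = TRUE`, `paired = FALSE`) are arbitrary choices for demonstration. The actual loading step needs the package and a network connection, so it is guarded and only runs in an interactive session.

```r
# Argument lists using option names taken from the argument descriptions;
# the chosen values are illustrative only.
geo.args <- list(edata = "geo2kegg", nr.datasets = 2, preproc = TRUE)
tcga.args <- list(edata = "tcga", nr.datasets = 2, mode = "ehub", paired = FALSE)

# Loading downloads data, so only attempt it interactively and
# when GSEABenchmarkeR is installed.
if (interactive() && requireNamespace("GSEABenchmarkeR", quietly = TRUE)) {
    geo2kegg <- do.call(GSEABenchmarkeR::loadEData, geo.args)
    tcga <- do.call(GSEABenchmarkeR::loadEData, tcga.args)
}
```

Calling `loadEData("geo2kegg", nr.datasets = 2, preproc = TRUE)` directly is equivalent; the list form simply makes it easier to keep one set of options per compendium.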

Details

The pre-defined GEO2KEGG microarray compendium consists of 42 datasets investigating a total of 19 different human diseases as collected by Tarca et al. (2012 and 2013).

The pre-defined TCGA RNA-seq compendium consists of datasets from The Cancer Genome Atlas (TCGA, 2013) investigating a total of 34 different cancer types.

User-defined data can also be loaded, given that datasets, preferably of class SummarizedExperiment, have been saved as RDS files.
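A user-defined compendium could be prepared along the following lines. This is a minimal sketch under stated assumptions: the toy count matrices, the dataset names `dsA` and `dsB`, the helper `make.toy`, and the `.rds` file extension are all hypothetical choices, not requirements confirmed by this page. Each toy dataset is wrapped as a SummarizedExperiment when that package is installed, and the final `loadEData` call only runs if GSEABenchmarkeR is available.

```r
set.seed(1)

# Hypothetical helper: build a small toy dataset (10 genes x 4 samples).
make.toy <- function() {
    m <- matrix(rpois(40, lambda = 10), nrow = 10,
                dimnames = list(paste0("g", 1:10), paste0("s", 1:4)))
    # Prefer class SummarizedExperiment when the package is installed.
    if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
        m <- SummarizedExperiment::SummarizedExperiment(assays = list(counts = m))
    }
    m
}

# tempdir() returns an absolute path, as required for `edata`.
data.dir <- file.path(tempdir(), "myEData")
dir.create(data.dir, showWarnings = FALSE)

# Save one RDS file per dataset (file naming is an assumption).
for (id in c("dsA", "dsB")) {
    saveRDS(make.toy(), file = file.path(data.dir, paste0(id, ".rds")))
}
rds.files <- list.files(data.dir, pattern = "\\.rds$")

# Load the directory as a compendium if the package is installed.
if (requireNamespace("GSEABenchmarkeR", quietly = TRUE)) {
    edat <- GSEABenchmarkeR::loadEData(data.dir)
}
```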

Value

A list of datasets, typically of class SummarizedExperiment.

Note that loadEData("geo2kegg", preproc = FALSE) (the default) returns the original microarray probe level data as a list of ExpressionSet objects. Use preproc = TRUE or the maPreproc function to summarize the probe level data to gene level data and to obtain a list of SummarizedExperiment objects.
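To see which class a loaded compendium actually contains, one option is a small helper that reports the first class of each list element. The helper name `datasetClasses` is hypothetical and not part of the package; it is demonstrated here on a toy list. The guarded part assumes that `maPreproc` accepts the list of datasets as its first argument, and it only runs interactively because it downloads data.

```r
# Hypothetical helper: first class of every element in a list of datasets.
datasetClasses <- function(exp.list) {
    vapply(exp.list, function(d) class(d)[1], character(1))
}

# Toy demonstration with base R objects.
toy <- list(a = matrix(1), b = data.frame())
toy.classes <- datasetClasses(toy)

# With the real compendium (requires the package and a network connection).
if (interactive() && requireNamespace("GSEABenchmarkeR", quietly = TRUE)) {
    geo2kegg <- GSEABenchmarkeR::loadEData("geo2kegg", nr.datasets = 2)
    print(datasetClasses(geo2kegg))       # probe level data
    geo2kegg <- GSEABenchmarkeR::maPreproc(geo2kegg)
    print(datasetClasses(geo2kegg))       # after summarization to gene level
}
```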

Author(s)

Ludwig Geistlinger <Ludwig.Geistlinger@sph.cuny.edu>

References

Tarca et al. (2012) Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13:136.

Tarca et al. (2013) A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11):e79217.

The Cancer Genome Atlas Research Network (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 45(10):1113-20.

Rahman et al. (2015) Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics, 31(22):3666-72.

Examples


    # (1) Loading the GEO2KEGG microarray compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=2)

    # (2) Loading the TCGA RNA-seq compendium
    tcga <- loadEData("tcga", nr.datasets=2)

    # (3) reading user-defined expression data from file
    data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
    edat <- loadEData(data.dir)

waldronlab/GSEABenchmarkeR documentation built on June 12, 2025, 7:32 p.m.