suppressMessages(library(knitr)) suppressMessages(library(GenomicRanges)) suppressMessages(library(limma)) suppressMessages(library(minfi)) suppressMessages(library(ExperimentHub)) knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE)
recountmethylation package provides access to databases of DNA
methylation (DNAm) data from over 35,000 sample records with IDATs in the Gene
Expression Omnibus (GEO, available through March 31, 2019) run using the
Illumina HM450K BeadArray platform, including metadata, raw/unnormalized red
and green signals, raw/unnormalized methylated and unmethylated signals, and
normalized DNAm fractions or Beta-values (@maden_human_2020). Normalization
was performed using the out-of-band signal correction (a.k.a. "noob") method,
a type of within-sample normalization (@triche_low-level_2013).
This User's Guide shows how to use the
recountmethylation package to obtain,
load, and query the DNAm databases with 2 small example files. Background about
DNAm arrays, DNAm measurement,
SummarizedExperiment objects, database file
types, and samples metadata is also provided. Further analysis examples are
contained in the
Database access, including downloads and file loads, are managed by the
get_db functions. These download and access the latest available database
?get_db and below examples for details). Note you will need between
90-120 Gb of disk space to store a single database file. Files pair sample
metadata with assays including red and green channel signals, methylated and
unmethylated level signals, and DNAm fractions in 3
entities, and red and green signals in an
The database files are contained at the file server, located at the URL: https://recount.bio/data/. Details about the latest files are as follows.
url = "https://recount.bio/data/" sslver <- FALSE ftpuseopt <- FALSE dirlistopt <- FALSE dn <- RCurl::getURL(url, ftp.use.epsv = ftpuseopt, dirlistonly = dirlistopt, .opts = list(ssl.verifypeer = sslver)) sm <- as.data.frame(servermatrix(dn, sslver), stringsAsFactors = FALSE) sdf <- as.data.frame(sm) tsv <- as.numeric(gsub("(.*_|\\.h5)", "", sdf[,1])) sdff <- sdf[which(tsv == max(tsv, na.rm = TRUE)),] rownames(sdff) <- NULL knitr::kable(sdff, align = "c")
The DNAm array database files are indexed on
ExperimentHub, and are viewable as follows.
hub = ExperimentHub::ExperimentHub() # connect to the hubs rmdat <- AnnotationHub::query(hub, "recountmethylation") # query the hubs rmdat
In addition to using the
getdb functions, the
.h5 extension) files may alternatively be downloaded from the hubs, as follows.
eid <- "EH3778" # h5 test file id fpath <- rmdat[[eid]] # download with default caching rhdf5::h5ls(fpath)
Note that whether downloads use the hubs or
getdb functions, caching is implemented to check for previously downloaded database files.
This section includes essential background about DNAm array platforms, assays and file types, and sample metadata.
Databases include human samples run on the Illumina Infinium HM450K BeadArray platform. HM450K is a popular 2-channel platform that probes over 480,000 CpG loci genome-wide, with enriched coverage at CG islands, genes, and enhancers (@sandoval_validation_2011). Array processing generates 2 intensity files (IDATs) per sample, one each for the red and green color channels. These raw files also contain control signals useful for quality evaluations @noauthor_illumina_2010.
HM450K probes use either of 2 bead technologies, known as Type I and Type II,
where the majority (72%) of probes use the latter. For Type II probes, a single
bead assay informs a single probe, while Type I probes use 2 beads each.
Practically, this means the bead-specific matrices found in
objects are larger than the probe-specific matrices found in derived object
types (e.g. 622,399 assays for red/green signal matrices versus 485,512 assays
for methylated/unmethylated signal, DNAm fractions matrices, see below).
DNAm array sample IDATs can be read into an R session as an object of class
RGChannelSet, a type of
SummarizedExperiment. These objects support
analyses of high-throughput genomics datasets, and they include slots for
assay matrices, sample metadata, and experiment metadata. During a typical
workflow, normalization and preprocessing convert
RGChannelSet objects into
new types like
RatioSet. While not all IDAT information is
accessible from every object type (e.g. only
RGChannelSets can contain
control assays), derived objects like
RatioSets may be
smaller and/or faster to access.
SummarizedExperiment databases are provided as
HDF5-SummarizedExperiment files, including an unnormalized
(red/green signals), an unnormalized
signals) and a normalized
GenomicRatioSet (DNAm fractions). For the latter,
DNAm fractions (logit2 Beta-values, or M-values) were normalized using the
out-of-band signal or "noob" method, an effective within-sample normalization
that removes signal artifacts (@triche_low-level_2013).
Database files are stored as either
most R users, the latter files will be most convenient to work with.
hierarchical data format 5, combines compression and chunking for convenient
handling of large datasets.
HDF5-SummarizedExperiment files combine the
SummarizedExperiment entities using a
DelayedArray-powered backend. Once an
HDF5-SummarizedExperiment file is
loaded, it can be treated similarly to a
SummarizedExperiment object in
active memory. That is, summary and subset operations execute rapidly, and
realization of large data chunks in active memory is delayed until called for
by the script (see examples).
Sample metadata are included with DNAm assays in the database files. Currently,
metadata variables include GEO record IDs for samples (GSM) and studies (GSE),
sample record titles, learned labels for tissue and disease, sample type
predictions from the MetaSRA-pipeline, and DNAm model-based predictions for
age, sex, and blood cell types. Access sample metadata from
SummarizedExperiment objects using the
pData minfi function (see examples).
Examples in the
data_analyses vignette illustrate some ways to utilize the
provided sample metadata.
Provided metadata derives from the GSE-specific SOFT files, which contain
experiment, sample, and platform metadata. Considerable efforts were made to
learn, harmonize, and predict metadata labels. Certain types of info lacking
recountmethylation metadata may be available in the SOFT files,
especially if it is sample non-specific (e.g. methods text, PubMed ID, etc.)
or redundant with DNAm-derived metrics (e.g. DNAm summaries, predicted sex,
It is good practice to validate the harmonized metadata with original metadata records, especially where labels are ambiguous or there is insufficient information for a given query. GEO GSM and GSE records can be viewed from a browser, or SOFT files may be downloaded directly. Packages like GEOmetadb and GEOquery are also useful to query and summarize GEO metadata.
This example shows basic handling for
"h5se") files. For these files, the
getdb function returns the loaded file.
Thanks to a
DelayedArray backend, even full-sized
h5se databases can be
treated as if they were fully loaded into active memory.
The test h5se dataset includes sample metadata and noob-normalized
DNAm fractions (Beta-values) for chromosome 22 probes for 2 samples.
Datasets can be downloaded using the
getdb series of functions
?getdb for details), where the
dfp argument specifies the
download destination. The test h5se file is included in the package
"inst" directory, and can be loaded as follows.
dn <- "remethdb-h5se_gr-test_0-0-1_1590090412" path <- system.file("extdata", dn, package = "recountmethylation") h5se.test <- HDF5Array::loadHDF5SummarizedExperiment(path)
Common characterization functions can be used on the dataset after it has been
loaded. These include functions for
SummarizedExperiment-like objects, such
getAnnotation minfi functions. First, inspect
the dataset using standard functions like
Access the sample metadata for the 2 available samples using
h5se.md <- minfi::pData(h5se.test) dim(h5se.md)
Next get CpG probe-specific DNAm fractions, or "Beta-values", with
(rows are probes, columns are samples).
h5se.bm <- minfi::getBeta(h5se.test) dim(h5se.bm)
colnames(h5se.bm) <- h5se.test$gsm knitr::kable(head(h5se.bm), align = "c")
Access manifest information for probes with
getAnnotation. This includes the
bead addresses, probe type, and genome coordinates and regions. For full details
about the probe annotations, consult the minfi and Illumina platform documentation.
an <- minfi::getAnnotation(h5se.test) dim(an)
ant <- as.matrix(t(an[c(1:4), c(1:3, 5:6, 9, 19, 24, 26)])) knitr::kable(ant, align = "c")
To provide more workflow options, bead-specific red and green signal data have
been provided with sample metadata in an
HDF5/h5 file. This example shows
how to handle objects of this type with
The test h5 file includes metadata and bead-specific signals from
chromosome 22 for the same 2 samples as in the h5se test file.
getdb functions for h5 files simply return the database path.
Since the test h5 file has also been included in the package "inst" folder,
get the path to load the file as follows.
dn <- "remethdb-h5_rg-test_0-0-1_1590090412.h5" h5.test <- system.file("extdata", "h5test", dn, package = "recountmethylation")
Use the file path to read data into an
RGChannelSet with the
all.gsm = TRUE obtains data for all samples in the
database files, while passing a vector of GSM IDs to
will query a subset of available samples. Signals from all available probes are retrieved by default, and probe subsets can be obtained by passing a vector of
valid bead addresses to the
h5.rg <- getrg(dbn = h5.test, all.gsm = TRUE)
To avoid exhausting active memory with the full-sized
h5 dataset, provide
getrg, and set either
?getrg for details).
As in the previous example, use
getAnnotation to get sample
metadata and array manifest information, respectively. Access the green and
red signal matrices in the
RGChannelSet with the
h5.red <- minfi::getRed(h5.rg) h5.green <- minfi::getGreen(h5.rg) dim(h5.red)
knitr::kable(head(h5.red), align = "c")
knitr::kable(head(h5.green), align = "c")
Rows in these signal matrices map to bead addresses rather than probe IDs. These matrices have more rows than the h5se test Beta-value matrix because any type I probes use data from 2 beads each.
This section demonstrates validation using the test databases. As the disclaimer notes, it is good practice to validate data against the latest available GEO files. This step may be most useful for newer samples published close to the end compilation date (end of March 2019 for current version), which may be more prone to revisions at initial publication.
gds_idat2rg function to download IDATs for the 2 test samples
and load these into a new
RGChannelSet object. Do this by passing a vector
of GSM IDs to
gsmv and the download
dlpath <- tempdir() gsmv <- c("GSM1038308", "GSM1038309") geo.rg <- gds_idat2rg(gsmv, dfp = dlpath) colnames(geo.rg) <- gsub("\\_.*", "", colnames(geo.rg))
Extract the red and green signal matrices from
geo.red <- minfi::getRed(geo.rg) geo.green <- minfi::getGreen(geo.rg)
Match indices and labels between the GEO and
h5 test signal matrices.
int.addr <- intersect(rownames(geo.red), rownames(h5.red)) geo.red <- geo.red[int.addr,] geo.green <- geo.green[int.addr,] geo.red <- geo.red[order(match(rownames(geo.red), rownames(h5.red))),] geo.green <- geo.green[order(match(rownames(geo.green), rownames(h5.green))),] identical(rownames(geo.red), rownames(h5.red)) identical(rownames(geo.green), rownames(h5.green)) class(h5.red) <- "integer" class(h5.green) <- "integer"
Finally, compare the signal matrix data.
Before comparing the GEO-downloaded data to data from the
h5se.test database, normalize the data using the same out-of-band or "noob" normalization technique
that was used to generate data in the h5se database.
geo.gr <- minfi::preprocessNoob(geo.rg)
Next, extract the Beta-values.
geo.bm <- as.matrix(minfi::getBeta(geo.gr))
Now match row and column labels and indices.
h5se.bm <- as.matrix(h5se.bm) int.cg <- intersect(rownames(geo.bm), rownames(h5se.bm)) geo.bm <- geo.bm[int.cg,] geo.bm <- geo.bm[order(match(rownames(geo.bm), rownames(h5se.bm))),]
Finally, compare the two datasets.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.