DeepBlueR - DeepBlue Epigenomic Data Server - R package

Introduction

The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online focus on the Python language. The R package presented here enables access to the DeepBlue API directly from within the R statistical environment and provides convenient functionality for triggering operations on the DeepBlue server as well as for data retrieval using R functions. In the following, we give a brief introduction to the package and subsequently show how Python examples from the online documentation can be reproduced with it.

What is DeepBlue?

A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Even though most of these data are publicly available, the task of identifying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata. An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that is helpful in epigenomic analysis, such as CpG islands, promoter regions, and genes. DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical work-flow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue, we refer to the DeepBlue publication in the 2016 NAR web server issue. If you find DeepBlue useful and use it in your project, please consider citing this paper.

Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.

Getting started

Installation

Installation of DeepBlueR and its companion packages can be performed using the Bioconductor install method in the BiocManager package:

install.packages("BiocManager") 
BiocManager::install("DeepBlueR")

The package name is DeepBlueR and it can be loaded via:

library(DeepBlueR)

You can test your installation and connectivity by saying hello to the DeepBlue server:

deepblue_info("me")

Overview of DeepBlue commands

DeepBlue provides a comprehensive programmatic interface for finding, selecting, filtering, summarizing and downloading annotated genomic region sets. Downloaded region sets are stored as GenomicRanges objects, which allows them to be further processed, visualized and analyzed with existing R/Bioconductor packages such as LOLA or Gviz.
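Since results arrive as GRanges objects, standard GenomicRanges operations apply to them directly. Below is a minimal sketch; the small GRanges object is a hypothetical stand-in for a region set obtained with deepblue_download_request_data:

```r
library(GenomicRanges)

# Hypothetical stand-in for a region set downloaded with
# deepblue_download_request_data()
regions <- GRanges(seqnames = "chr1",
                   ranges = IRanges(start = c(100, 500, 550), width = 100))

summary(width(regions))  # distribution of region lengths
reduce(regions)          # collapse overlapping/adjacent regions
```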

A list of all commands provided by DeepBlue is available on its API page. The vast majority of these commands are also available through this R package and can be listed as follows:

help(package="DeepBlueR")

In the following, we list the most frequently used DeepBlue commands; the full list of commands is available on the API page. Note that each command in the following two tables carries the prefix 'deepblue_', e.g. deepblue_select_genes.

| Category        | Command                | Description                                              |
|-----------------|------------------------|----------------------------------------------------------|
| Information     | info                   | Information about an entity                              |
| List and search | list_genomes           | List registered genomes                                  |
|                 | list_biosources        | List registered biosources                               |
|                 | list_samples           | List registered samples                                  |
|                 | list_epigenetic_marks  | List registered epigenetic marks                         |
|                 | list_experiments       | List available experiments                               |
|                 | list_annotations       | List available annotations                               |
|                 | search                 | Perform a full-text search                               |
| Selection       | select_regions         | Select regions from experiments                          |
|                 | select_experiments     | Select regions from experiments                          |
|                 | select_annotations     | Select regions from annotations                          |
|                 | select_genes           | Select genes as regions                                  |
|                 | select_expressions     | Select expression data                                   |
|                 | tiling_regions         | Generate tiling regions                                  |
|                 | input_regions          | Upload and use a small region set                        |
| Operation       | aggregate              | Aggregate and summarize regions                          |
|                 | filter_regions         | Filter regions by their attributes                       |
|                 | flank                  | Generate flanking regions                                |
|                 | intersection           | Filter for intersecting regions                          |
|                 | overlap                | Filter for regions overlapping by at least a given size  |
|                 | merge_queries          | Merge two region sets                                    |
| Result          | count_regions          | Count selected regions                                   |
|                 | score_matrix           | Request a score matrix                                   |
|                 | get_regions            | Request the selected regions                             |
|                 | binning                | Bin results according to counts                          |
| Request         | get_request_data       | Obtain the requested data                                |

In addition, this package provides a set of convenience functions not part of the DeepBlue API, such as:

| Category | Command               | Description                                          |
|----------|-----------------------|------------------------------------------------------|
| Request  | batch_export_results  | Download the results for a list of requests          |
|          | download_request_data | Download and convert the requested data (blocking)   |
|          | export_meta_data      | Export metadata to a tab delimited file              |
|          | export_tab            | Export any result as tab delimited file              |
|          | export_bed            | Export GenomicRanges results as BED file             |

DeepBlue usage examples

In the following we give a number of increasingly complex examples illustrating what DeepBlue can achieve in your epigenomic data analysis work-flow. We go beyond the online description of these examples by showing how the retrieved information can be further used in R.

One of the first tasks in DeepBlue is finding the data of interest. This can be achieved in three ways:

Full-text search

In this example, we use the command deepblue_search to find experiments that contain the keywords 'H3k27AC', 'blood', and 'peak' in their metadata. The terms are put in single quotes to indicate that they must appear in the metadata.

# We are selecting the experiments with terms 'H3k27AC', 'blood', and
# 'peak' in the metadata.
experiments_found = deepblue_search(
    keyword="'H3k27AC' 'blood' 'peak'", type="experiments")

custom_table = do.call("rbind", apply(experiments_found, 1, function(experiment){
  experiment_id = experiment[1]
  # Obtain the information about the experiment_id
  info = deepblue_info(experiment_id)

  # Collect the experiment name, project, biosource, and epigenetic mark.
  with(info, { data.frame(name = name, project = project,
    biosource = sample_info$biosource_name, epigenetic_mark = epigenetic_mark)
      })
}))

head(custom_table)

Listing experiments

We use the deepblue_list_experiments command to list all experiments with the corresponding values in their metadata.

experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")

Accessing the extra-metadata

The extra-metadata is important because it contains information that is not stored in the mandatory metadata fields. We use the deepblue_info command to access an experiment's metadata and extra-metadata fields. The following example prints the file_url attribute that is contained in the data imported from the ENCODE project.

info = deepblue_info("e30000")
print(info$extra_metadata$file_url)

Select epigenomic data

We use the deepblue_select_experiments command to select all genomic regions from the two specified experiments. We use the deepblue_count_regions command with the query_id value returned by the deepblue_select_experiments command.

The deepblue_count_regions command is executed asynchronously. This means that the user receives a request_id and should check the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name=c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))

# Count how many regions were selected
request_id = deepblue_count_regions(query_id=query_id)

# Download the request data as soon as processing is finished
requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
```
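If you prefer not to block, you can also poll the status of a request yourself, since deepblue_info works on request ids as well. A minimal sketch, assuming request_id holds the id returned by an asynchronous command such as deepblue_count_regions above:

```r
# Check the current state of the request
deepblue_info(request_id)$state

# Once processing has finished, the result can be fetched with
# deepblue_download_request_data(request_id) as shown above.
```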

Output with selected columns

We use the deepblue_select_experiments command to select genomic regions from the experiments that are in chromosome 1, position 0 to 50,000,000.

We then use the deepblue_get_regions command with the query_id value returned by the deepblue_select_experiments command to request the regions with the selected columns. The columns @NAME and @BIOSOURCE represent the experiment name and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)

# Retrieve the experiment data. The @NAME meta-column holds the experiment
# name and @BIOSOURCE the experiment's biosource.
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```

Filter epigenomic data by metadata

We use the deepblue_list_samples command to obtain all samples with the biosource 'myeloid cell' from the BLUEPRINT project. The deepblue_list_samples command returns a list of samples with their IDs and content. We extract the sample IDs from this list and use them in the deepblue_select_regions command to select genomic regions that are in chromosome 1, position 0 to 50,000.

Then, we use the deepblue_get_regions command with the query_id returned by the deepblue_select_regions command and the columns @NAME, @SAMPLE_ID, and @BIOSOURCE, representing the experiment name, the sample ID, and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
samples = deepblue_list_samples(
    biosource="myeloid cell",
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)
query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
    chromosome="chr1", start=0, end=50000)
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
```

Filter epigenomic data by region attributes

We use the deepblue_select_experiments command to select genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE > 10 and PEAK > 1000.

Then, we use the deepblue_get_regions command with the query_id returned by the deepblue_filter_regions command and the columns @NAME and @BIOSOURCE, representing the experiment name and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
    query_id=query_id, field="SIGNAL_VALUE", operation=">",
    value="10", type="number")
query_id_filters = deepblue_filter_regions(
    query_id=query_id_filter_signal, field="PEAK", operation=">",
    value="1000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filters,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```

Find intersecting regions

We use the deepblue_select_experiments command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE > 10 and PEAK > 1000.

The command deepblue_intersection filters for all regions of the query_id that intersect with at least one region in promoters_id.

Then, we use the deepblue_get_regions command with the query_id returned by the deepblue_intersection command and the columns @NAME and @BIOSOURCE, representing the experiment name and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
promoters_id = deepblue_select_annotations(annotation_name="promoters",
    genome="GRCh38", chromosome="chr1")
intersect_id = deepblue_intersection(
    query_data_id=query_id, query_filter_id=promoters_id)
request_id = deepblue_get_regions(
    query_id=intersect_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```

```r
library(Gviz)
atrack <- AnnotationTrack(regions, name = "Intersecting regions",
    group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
plotTracks(list(itrack, atrack, gtrack), groupAnnotation="group", fontsize=18,
           background.panel = "#FFFEDB", background.title = "darkblue")
```

Retrieve DNA sequences

The meta-column @LENGTH contains the genomic region length, and we filter the genomic regions where this value is smaller than 2,000.

The meta-column @SEQUENCE includes the DNA sequence of each genomic region in the output.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(query_id=query_id,
    field="SIGNAL_VALUE", operation=">", value="10", type="number")
query_id_filters = deepblue_filter_regions(query_id=query_id_filter_signal,
    field="PEAK", operation=">", value="1000", type="number")
query_id_filter_length = deepblue_filter_regions(query_id=query_id_filters,
    field="@LENGTH", operation="<", value="2000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filter_length,
    output_format="CHROMOSOME,START,END,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
```

DNA pattern matching operations

We use the deepblue_find_motif command to find all positions of a given pattern in the genome. An example is finding all locations of 'TATAAA' in the genome assembly GRCh38.

We use the deepblue_select_experiments command to select genomic regions that are in chromosome 1, position 0 to 50,000,000 from the selected experiments.

The command deepblue_intersection selects all regions of the query_id that intersect with at least one tataa_regions region.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
tataa_regions = deepblue_find_motif(motif="TATAAA", genome="GRCh38", chromosomes="chr1")
query_id = deepblue_select_experiments(
    experiment_name= c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
overlapped = deepblue_intersection(query_data_id=query_id,
                                   query_filter_id=tataa_regions)
request_id=deepblue_get_regions(overlapped,
    "CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE,@PROJECT")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
```

Counting motifs in a region

The meta-column @COUNT.MOTIF() allows counting how many times a motif appears in the selected genomic regions. For example, the following code returns the experiment regions with the DNA sequence length, the counts of the motifs C, G, CG, and GC, and the DNA sequence itself.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiment_data = deepblue_select_experiments(
    "DG-75_c01.ERX297417.H3K27ac.bwa.GRCh38.20150527.bed")
fmt = "CHROMOSOME,START,END,@LENGTH,@COUNT.MOTIF(C),@COUNT.MOTIF(G),@COUNT.MOTIF(CG),@COUNT.MOTIF(GC),@SEQUENCE"
request_id = deepblue_get_regions(experiment_data, fmt)
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
```

Genes

We use the deepblue_select_genes command to select the gene RP11-34P13 from GENCODE v23.

The selected genes behave like regular genomic regions and can, for example, be filtered by their attributes. We use the @GENE_ATTRIBUTE meta-column to select the genomic regions that are annotated as lincRNAs.

```{R, echo=TRUE, eval=FALSE, warning=FALSE, message=FALSE}
q_genes = deepblue_select_genes(genes="RP11-34P13", gene_model="gencode v23")
q_filter = deepblue_filter_regions(query_id=q_genes,
    field="@GENE_ATTRIBUTE(gene_type)", operation="==",
    value="lincRNA", type="string")
request_id=deepblue_get_regions(q_filter, "CHROMOSOME,START,END,GTF_ATTRIBUTES")
regions = deepblue_download_request_data(request_id=request_id)
regions
```

Aggregate and summarize regions

The command deepblue_aggregate summarizes the query_id regions using the cpg_islands regions defined by the corresponding annotation as boundaries.

The aggregated values can be accessed through the @AGG.* meta-columns.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1", start=0, end=50000000)
cpg_islands = deepblue_select_annotations(annotation_name="CpG Islands",
    genome="GRCh38", chromosome="chr1", start=0, end=50000000)

# Aggregate
overlapped = deepblue_aggregate(data_id=query_id, ranges_id=cpg_islands,
    column="VALUE")

# Retrieve the aggregated regions together with the @AGG.* statistics
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MIN,@AGG.MAX,@AGG.MEAN,@AGG.VAR")
regions = deepblue_download_request_data(request_id=request_id)
```

Gene expression

In the following example we obtain the gene expression levels of three genes, i.e., NOX3, NOXA1, and NOX4, from all biosources related to the hematopoietic stem cell biosource in the BLUEPRINT project. By related, we refer to the children of this biosource term in the ontologies used by DeepBlue.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
hsc_children = deepblue_get_biosource_children("hematopoietic stem cell")

hsc_children_name = deepblue_extract_names(hsc_children)

hsc_children_samples = deepblue_list_samples(
    biosource = hsc_children_name,
    extra_metadata = list(source="BLUEPRINT Epigenome"))

hsc_samples_ids = deepblue_extract_ids(hsc_children_samples)

# Note that BLUEPRINT uses Ensembl Gene IDs
gene_exprs_query = deepblue_select_expressions(
    expression_type = "gene",
    sample_ids = hsc_samples_ids,
    identifiers = c("ENSG00000074771.3", "ENSG00000188747.7", "ENSG00000086991.11"),
    gene_model = "gencode v22")

request_id = deepblue_get_regions(
    query_id = gene_exprs_query,
    output_format ="@GENE_NAME(gencode v22),CHROMOSOME,START,END,FPKM,@BIOSOURCE")

regions = deepblue_download_request_data(request_id = request_id)
regions
```

Tiling regions

We use the deepblue_tiling_regions command to generate a set of consecutive genomic regions of size 100,000 from chromosome 1 of the genome assembly GRCh38.

The command deepblue_aggregate summarizes the query_id regions using the column VALUE and the tiling regions as boundaries.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Selecting the data from the experiment
#    GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiment name, we keep all other fields empty.
# We are selecting all regions of chromosome 1.
query_id = deepblue_select_experiments(
    experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1")

# Tiling regions of 100,000 base pairs
tiling_id = deepblue_tiling_regions(size=100000, genome="GRCh38",
    chromosome="chr1")

# Aggregate
overlapped = deepblue_aggregate(data_id=query_id, ranges_id=tiling_id,
    column="VALUE")

# Retrieve the aggregated regions together with the @AGG.* statistics
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")

regions = deepblue_download_request_data(request_id=request_id)
regions
```

Such data can now be plotted using any of the common R plotting mechanisms
and packages. An example is shown here:

```r
library(ggplot2)
plot_data <- as.data.frame(regions)
plot_data[,grepl("X.", colnames(plot_data))] <-
    apply(plot_data[,grepl("X.", colnames(plot_data))], 2, as.numeric)
AGG.plot <- ggplot(plot_data, aes(start)) +
    geom_ribbon(aes(ymin = X.AGG.MEAN - (X.AGG.SD / 2),
        ymax = X.AGG.MEAN + (X.AGG.SD / 2)), fill = "grey70") +
    geom_line(aes(y = X.AGG.MEAN))
print(AGG.plot)
```

Flanking regions

We use the deepblue_select_genes command to generate a set of genes from the gene model GENCODE v19.

The deepblue_flank command derives flanking regions from existing regions. First, we derive regions that start 2500bp before the initially selected regions with a length of 2000bp. Next, we derive the regions that start 1500 base pairs after the initially selected regions with 500 base pairs length. For each region, we consider the DNA strand.

The deepblue_merge_queries command merges the region sets defined by two query IDs. Here, we merge the two flanking regions sets we created based on the initially selected genes.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Select the gene locations from gencode v19
q_genes = deepblue_select_genes(
    genes=c("RNU6-1100P", "CICP7", "MRPL20", "ANKRD65", "HES2", "ACOT7",
        "HES3", "ICMT"),
    gene_model="gencode v19")

# Obtain the regions that start 2500 base pairs before the regions' start
# and are 2000 base pairs long.
# The use_strand argument tells DeepBlue to consider the region strand
# (column STRAND) when calculating the new region.
before_flank_id = deepblue_flank(query_id=q_genes, start=-2500, length=2000,
    use_strand=TRUE)

# Obtain the regions that start 1500 base pairs after the regions' end
# and are 500 base pairs long.
# Again, use_strand tells DeepBlue to consider the region strand
# (column STRAND) when calculating the new region.
after_flank_id = deepblue_flank(query_id=q_genes, start=1500, length=500,
    use_strand=TRUE)

# Merge both flanking region sets and the gene set
flank_merge_id = deepblue_merge_queries(query_a_id=before_flank_id,
    query_b_id=after_flank_id)
all_merge_id = deepblue_merge_queries(query_a_id=q_genes,
    query_b_id=flank_merge_id)

# Request the regions
request_id = deepblue_get_regions(query_id=all_merge_id,
    output_format="CHROMOSOME,START,END,STRAND,@LENGTH")

regions = deepblue_download_request_data(request_id=request_id)
regions
```

Calculated columns

Here, we summarize DNA methylation levels for CpG islands of a specific experiment. Next, we remove those CpG islands for which no values were found by requiring @AGG.COUNT > 0.

We use the @CALCULATED meta-column to transform the @AGG.MEAN value to log scale.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Selecting the data from the experiment
#    GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiment name, we keep all other fields empty.
# We are selecting all regions of chromosome 1.
query_id = deepblue_select_experiments(
    experiment="GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    chromosome="chr1")

# Select the CpG Islands annotation from GRCh38
cpg_islands = deepblue_select_annotations(
    annotation_name="CpG Islands", genome="GRCh38", chromosome="chr1")

# Aggregate
overlapped = deepblue_aggregate(
    data_id=query_id, ranges_id=cpg_islands, column="VALUE")

# Select the aggregated regions that aggregated at least one region from the
# selected experiments (@AGG.COUNT > 0)
filtered = deepblue_filter_regions(query_id=overlapped,
    field="@AGG.COUNT", operation=">", value="0", type="number")

# We remove all regions where the aggregation mean is zero.
filtered_zeros = deepblue_filter_regions(query_id=filtered,
    field="@AGG.MEAN", operation="!=", value="0.0", type="number")

# Retrieve the aggregated regions; the @CALCULATED meta-column is used to
# compute the log of the @AGG.MEAN value on the server.
request_id = deepblue_get_regions(query_id=filtered_zeros,
    output_format=
    "CHROMOSOME,START,END,@CALCULATED(return math.log(value_of('@AGG.MEAN'))),@AGG.MEAN,@AGG.COUNT")

regions = deepblue_download_request_data(request_id=request_id)

# We have to perform a manual conversion because the
# package can't know the type for calculated columns
regions$`@CALCULATED(return math.log(value_of('@AGG.MEAN')))` =
    as.numeric(regions$`@CALCULATED(return math.log(value_of('@AGG.MEAN')))`)

head(regions, 5)
```

Any numerical values returned by DeepBlue can also be conveniently displayed using, for example, the DataTrack feature of the Gviz Bioconductor package, as shown here:

library(Gviz)
atrack <- AnnotationTrack(regions,
    name = "CpGs", group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
dtrack <- DataTrack(regions,
    data="@AGG.MEAN", name = "Log of average methylation value")
plotTracks(list(itrack, atrack, dtrack, gtrack), type="histogram", fontsize=18,
           background.panel = "#FFFEDB", background.title = "darkblue")

Score matrix

Here, we select a small number of experiments for which we want to build a score matrix based on the column VALUE.

We use CpG islands as the aggregation region boundaries.

The deepblue_score_matrix command expects a named list with the experiment names and the columns that will be used for aggregation, the id of the regions that serve as aggregation boundaries, and the aggregation operation to be performed (min, max, mean, var, sd, median, count).

The deepblue_score_matrix command is executed asynchronously. The command deepblue_download_request_data will return a matrix in which the first three columns correspond to the chromosome, start position and end position. The remaining columns carry the names of the experiments and hold the corresponding aggregated values.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiments = c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "C003N351.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "C005VG51.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "S002R551.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "NBC_NC11_41.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "bmPCs-V156.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "S00BS451.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "S00D1DA1.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    "S00D39A1.CPG_methylation_calls.bs_call.GRCh38.20160531.wig")

experiments_columns = list()
for (experiment_name in experiments) {
    experiments_columns[[experiment_name]] = "VALUE"
}

cpgs = deepblue_select_annotations(annotation_name="Cpg Islands",
    chromosome="chr22", start=0, end=18000000, genome="GRCh38")

request_id = deepblue_score_matrix(
    experiments_columns=experiments_columns,
    aggregation_function="mean",
    aggregation_regions_id=cpgs)

score_matrix = deepblue_download_request_data(request_id=request_id)
head(score_matrix, 5)
```

```r
library(ggplot2)
score_matrix_plot = tidyr::gather(score_matrix,
    "experiment", "methylation", -CHROMOSOME, -START, -END)
score_matrix_plot$START <- as.factor(score_matrix_plot$START)
ggplot(score_matrix_plot, aes(x=START, y=experiment, fill=methylation)) +
    geom_tile() +
    theme(axis.text.x=element_text(angle=-90))
```

Data Export

DeepBlueR allows you to conveniently save results to disk. Any result can be saved as a tab delimited file using deepblue_export_tab. For example, we can save the score matrix generated in the above example:

deepblue_export_tab(score_matrix, file.name = "my_score_matrix")

Results obtained with deepblue_get_regions are of type GenomicRanges and can be exported as tab delimited files preserving all columns or as BED files, where a specific column can optionally be selected to populate the 'score' column of the BED file. To demonstrate this, we use the result from the tiling regions example further above:

request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")

regions = deepblue_download_request_data(request_id=request_id)
deepblue_export_bed(regions,
                    file.name = "my_tiling_regions",
                    score.field = "@AGG.MEAN")

Furthermore, metadata associated with any id can be stored locally using the deepblue_export_meta_data command. To this end, we first obtain the experiment id of the file we used in the tiling regions example.

exp_id <- deepblue_name_to_id(
    "GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    collection = "experiments")$id

deepblue_export_meta_data(exp_id, file.name = "GC_T14")

This command can also handle lists of ids, for instance:

deepblue_export_meta_data(list("e30035", "e30036"),
file.name = "test_export")

In some cases, users will perform a series of requests. We provide the command deepblue_batch_export_results to save these results and their associated metadata to disk in one go. This method will save each file as it becomes available, i.e. it will be saved once the request has been successfully processed by DeepBlue:

experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")
experiment_names = deepblue_extract_names(experiments)

library(foreach)

request_ids = foreach(experiment = experiment_names) %do% {
  query_id = deepblue_select_experiments(experiment_name = experiment,
                                        chromosome = "chr21")

  request_id = deepblue_get_regions(query_id =query_id,
    output_format = "CHROMOSOME,START,END")
}
request_data = deepblue_batch_export_results(request_ids,
                             target.directory = "BLUEPRINT macrophages chr21")

Options

DeepBlueR comes with default options that can be changed by the user. To list the current options use the following command:

deepblue_options()

Changing an option works as follows:

deepblue_options(do_not_cache = TRUE)

Another example (replace 'my_user_key' with the actual key):

deepblue_options(user_key = "my_user_key")

In case you wish to restore the default options simply call

deepblue_reset_options()

Caching

DeepBlueR by default creates a file 'DeepBlueR.cache' in the current working directory. Downloaded results / regions are stored there and can be instantly retrieved, which is particularly useful for users with limited network bandwidth. In case caching is not desired, it can be switched off (see the Options section above).

To check the status of the cache you can use the following command:

deepblue_cache_status()

This will report the cache size and the number of requests currently stored. Alternatively, users can list the request ids for which results are available:

deepblue_list_cached_requests()

Over time, the cache can quickly grow in size. It is possible to remove individual requests from the cache if the request id is known:

deepblue_delete_request_from_cache("r123")

In most cases it will be easier to simply clear the entire cache:

deepblue_clear_cache()

Splitting requests to improve performance for genome-scale analyses

DeepBlue has a memory limit on individual requests. As a consequence, some operations may not be executed successfully by the DeepBlue web server. To avoid this, large requests should be split by chromosome. This has another advantage: if each chromosome is a separate request, processing will be parallelized on DeepBlue and finish in a fraction of the time (depending on the queuing status) compared to the same operation without splitting. Here is an example for obtaining a score matrix for each chromosome individually:

library(foreach)

chromosomes_mm10 <- deepblue_extract_ids(deepblue_chromosomes(genome = "mm10"))

request_ids <- foreach(chromosome = chromosomes_mm10, .combine = c) %do% {

    tiling_regions = deepblue_tiling_regions(
        size=100000, genome="mm10", chromosome=chromosome)

    deepblue_score_matrix(
        experiments_columns = list(ENCFF721EKA="VALUE", ENCFF781VVH="VALUE"),
        aggregation_function = "mean",
        aggregation_regions_id = tiling_regions
    )
}

Now each chromosome is processed individually and efficiently by DeepBlue. We can use the deepblue_batch_export_results function to download the individual matrices:

list_of_score_matrices <- deepblue_batch_export_results(request_ids)

Next, we can simply merge them to obtain the final matrix:

library(data.table)
genome_wide_score_matrix <- data.table::rbindlist(list_of_score_matrices,
                                                  use.names = TRUE)
genome_wide_score_matrix

Large-scale analysis of DNA methylation across 212 samples from the BLUEPRINT consortium

Aim

Here, we will show how DeepBlueR can be used to generate an overview heatmap of variable positions in more than 200 BLUEPRINT DNA methylation experiments. The amount of data considered here would normally be too large to be processed on a local R installation. However, using DeepBlue and server-side processing of the data, this large-scale analysis can be performed easily.

Dependencies

In the first step, we load the DeepBlueR package, as well as packages for data retrieval, matrix operations and plotting.

library(DeepBlueR)
library(gplots)
library(RColorBrewer)
library(matrixStats)
library(stringr)

Select experiments

Next, we list all available BLUEPRINT DNA methylation experiments. (412 files matching the required metadata were available when this vignette was written.)

blueprint_DNA_meth <- deepblue_list_experiments(genome = "GRCh38",
                          epigenetic_mark = "DNA Methylation",
                          technique = "Bisulfite-Seq",
                          project = "BLUEPRINT EPIGENOME")

blueprint_DNA_meth

We are only interested in a subset of those files and filter for methylation call files (as opposed to coverage files).

blueprint_DNA_meth <- blueprint_DNA_meth[grep("CPG_methylation_calls.bs_call",
    deepblue_extract_names(blueprint_DNA_meth)),]

blueprint_DNA_meth

Select experiment column

Each of these files has a column, named VALUE, that holds the DNA methylation beta values. There are two possibilities to select this column across several files.

First, we assume that the column in question has a different name in each file. We thus have to create a list that holds the column name for each of them. Such a list can be generated using standard R commands:

exp_columns <- vector("list", nrow(blueprint_DNA_meth))

for(i in 1:nrow(blueprint_DNA_meth)){
    exp_columns[[i]] <- "VALUE"
}
names(exp_columns) <- deepblue_extract_names(blueprint_DNA_meth)

In most cases, the same column name will apply to each file. We thus implemented a shorthand function that generates the above list with a single column name for all files.

exp_columns <- deepblue_select_column(blueprint_DNA_meth, "VALUE")

Filter for genomic regions of interest using annotations

In the next operation, we consider that not all methylation sites will be informative for clustering the data. We thus filter for those regions that are part of the BLUEPRINT regulatory build, a modified version of the ENSEMBL regulatory build that contains promoters, promoter flanking regions, enhancers, CTCF binding sites, transcription factor binding sites and open chromatin regions. As we can see in this example, DeepBlueR returns a vector of query ids (one for each chromosome) which we store for later use.

#list all available chromosomes in GRCh38
chromosomes_GRCh38 <- deepblue_extract_ids(
    deepblue_chromosomes(genome = "GRCh38")
)

#keep only the essential ones
chromosomes_GRCh38 <- 
    grep(pattern = "chr([0-9]{1,2}|X)$", chromosomes_GRCh38, value = TRUE)

#we split the request by chromosome to avoid hitting the memory limit of 
#DeepBlue
blueprint_regulatory_regions <- 
    foreach(chr = chromosomes_GRCh38, .combine = c) %do%
        deepblue_select_annotations(
            annotation_name = "Blueprint Ensembl Regulatory Build",
            chromosome = chr,
            genome = "GRCh38"
        )

blueprint_regulatory_regions

DeepBlue has several annotations that can be used to filter for informative sites. We could, for example, also filter for CpG islands.

deepblue_select_annotations(annotation_name = "Cpg Islands",
                            genome = "GRCh38")

A list of all annotations currently available for a genome is given by the following command.

deepblue_list_annotations(genome = "GRCh38")

New annotations may be included upon user request.

In case we want to include the entire genome in an aggregated form, DeepBlue supports the concept of tiling regions. In this process, the genomic range of interest is binned into tiles of a given size (here 5 kb).

tiling_regions <- deepblue_tiling_regions(size=5000,
                                          genome="GRCh38")

Generate a score matrix

In the above step we have defined a set of regions of interest that we want to interrogate in R to, for example, cluster samples. To this end, DeepBlue can build a score matrix, in which the selected genomic regions are aggregated on the server to reduce the complexity and size of the data. We request such a score matrix in which regulatory regions are aggregated by the mean as follows. Note that we use the variables 'exp_columns' and 'blueprint_regulatory_regions' that we have defined above. Since we had to split our request by chromosome, we need to make multiple requests.

request_ids <- foreach(query_id = blueprint_regulatory_regions,
                       .combine = c) %do% 
    deepblue_score_matrix(
        experiments_columns = exp_columns,
        aggregation_function = "mean",
        aggregation_regions_id = query_id)

request_ids

After triggering this function, DeepBlue queues our task and will execute it when resources become available. We also observe that DeepBlue returns a request id, which we can use to query the status of the operation.

foreach(request = request_ids, .combine = c) %do% {
  deepblue_info(request)$state
}

When the operation is finished, we can download the score matrix and store it in a local variable. For DeepBlueR, we implemented several strategies to improve the performance of data retrieval. For instance, we modified the existing XML-RPC package to be more efficient in the context of DeepBlue when it comes to parsing nested XML data. Moreover, we retrieve tabular data directly in a tab separated file format, which can be processed much faster in R. Finally, we also compress data on the server side to reduce download time. Here, we only show the first five columns out of 215.

score_matrix <- data.table::rbindlist(
    deepblue_batch_export_results(request_ids),
    use.names = TRUE)
score_matrix[,1:5, with=FALSE]

The download is 212.8 MB in size. The size of the data we handled on DeepBlue to extract this information is roughly 212 x ~450 MB ~= 95 GB and thus more than can be handled in R on most desktop computers. We next show how this score matrix can be used to plot a heatmap where samples are clustered by the Pearson correlation coefficient, revealing that samples originating from the same cell type are more similar in DNA methylation.

Generating a heatmap

Metadata and colors

In preparation for the heatmap plot, we need to generate an RColorBrewer palette. This allows us to create a color palette with more than 9 colors.

getPalette <- colorRampPalette(brewer.pal(9, "Set1"))

For each experiment, we collect metadata.

experiments_info <- deepblue_info(deepblue_extract_ids(blueprint_DNA_meth))

All metadata is parsed into a nested R list. We refer to the DeepBlue paper for a description of the available metadata. Here, we show the metadata associated with just one of the samples.

head(experiments_info[[1]], 10)

For this analysis, we are only interested in the biosource name, i.e. the cell type. We can retrieve this information using standard R syntax. Note that we show only the first 6 entries here.

biosource <- unlist(lapply(experiments_info, function(x){ x$sample_info$biosource_name}))
head(biosource)

To save some space on the plot, we replace '-positive' with '+' and '-negative' with '-'.

biosource <- str_replace_all(biosource, "-positive", "+")
biosource <- str_replace_all(biosource, "-negative", "-")

For the same reason, we remove the words 'terminally differentiated' from one of the cell types.

biosource <- str_replace(biosource, ", terminally differentiated", "")

Using the above color palette, we can now assign a unique color to each cell type.

color_map <- data.frame(biosource = unique(biosource),
                        color = getPalette(length(unique(biosource))))

head(color_map)

Using the above table, we can now assign the colors to each experiment according to its cell type / biosource.

exp_names <- unlist(lapply(experiments_info, function(x){ x$name}))

biosource_colors <- data.frame(name = exp_names, biosource = biosource)
biosource_colors <- dplyr::left_join(biosource_colors, color_map, by = "biosource")
head(biosource_colors)

Finally, we transform this data frame into a vector that is compatible with the heatmap function.

color_vector <- as.character(biosource_colors$color)
names(color_vector) <-  biosource_colors$biosource
head(color_vector)

Processing the input data

We remove the first three columns (CHROMOSOME, START, END) and convert the data frame to a numeric matrix.

filtered_score_matrix <- as.matrix(score_matrix[,-c(1:3), with=FALSE])
head(filtered_score_matrix[,1:3])

Next, we compute the variance of each row and retain only genomic regions with variance > 0.05 for plotting. Plotting all regions would consume too much memory and more importantly, regions that do not show variance also do not allow us to spot differences between cell types in the heatmap.

message("regions before: ", nrow(filtered_score_matrix))
filtered_score_matrix_rowVars <- rowVars(filtered_score_matrix, na.rm = TRUE)
filtered_score_matrix <- filtered_score_matrix[which(filtered_score_matrix_rowVars > 0.05),]
message("regions after: ", nrow(filtered_score_matrix))

To be able to cluster samples, we remove regions that have missing values in at least one of the experiments.

message("regions before: ", nrow(filtered_score_matrix))
filtered_score_matrix <- filtered_score_matrix[which(complete.cases(filtered_score_matrix)),]
message("regions after: ", nrow(filtered_score_matrix))

IMPORTANT: The order of columns in the score matrix is not the same as in the exp_columns list used in the request. We thus have to order the matrix by the experiment names in the color map. This is crucial to make sure we assign the correct cell type to each sample!

filtered_score_matrix <- filtered_score_matrix[,exp_names]

Plotting

We plot a heatmap in which the variable regions are shown across all samples. On top of the columns, we create a dendrogram based on Pearson correlation. More precisely, we convert the Pearson correlation, a similarity measure, into a distance, such that it can be used with hierarchical clustering.

heatmap.2(filtered_score_matrix, labRow = NA, labCol = NA,
          trace = "none", ColSideColors = color_vector,
          hclust = function(x) hclust(x, method = "complete"),
          distfun = function(x) as.dist(1 - cor(t(x), method = "pearson")),
          Rowv = TRUE, dendrogram = "column",
          key.xlab = "beta value", denscol = "black", keysize = 1.5,
          key.title = NA)

plot.new()

legend(x = 0, y = 1,
       legend = color_map$biosource,
       col = as.character(color_map$color),
       text.width = 0.6,
       lty = 1,
       lwd = 6,
       cex = 0.7,
       y.intersp = 0.7,
       x.intersp = 0.7,
       inset = c(-0.21, -0.11))

Further reading material

To obtain a general overview of DeepBlue, we recommend starting with the DeepBlue publication. A list of all DeepBlue commands is available on its API page.

You can have a look at the other use cases included in the R package and list them with

demo(package = "DeepBlueR")

Individual use cases can be triggered with

demo("use_case1", package = "DeepBlueR")

Note that the example presented here corresponds to use case 4 in the R package.

Final remarks

We encourage you to try to reproduce the Python examples from the online documentation in R and to read the DeepBlue manual.

We also want to highlight the possibility of browsing and accessing existing data in DeepBlue conveniently through the web interface, which also allows you to select experiments in a grid-like view.

Should you encounter any problems with DeepBlueR, we kindly ask you to create an issue in the BioConductor DeepBlueR support page.

The R code in the DeepBlueR package is under the GPLv3 license and we welcome contributions of other developers. Finally, we would like to thank the Bioconductor team for its support in making DeepBlueR available to a wide audience of users.


