---
title: "DeepBlueR - DeepBlue Epigenomic Data Server - R package"
author: "Felipe Albrecht, Markus List"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    number_of_sections: true
vignette: >
  %\VignetteIndexEntry{The DeepBlue epigenomic data server - R package}
  %\VignetteDepends{DeepBlueR,ggplot2,Gviz,tidyr}
  %\VignettePackage{DeepBlueR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online focus on the Python language. The R package presented here enables access to the DeepBlue API directly within the R statistical environment and provides convenient functionality for triggering operations on the DeepBlue server as well as for data retrieval using R functions. In the following, we give a brief introduction to the package and subsequently show how the Python examples from the online documentation can be reproduced with it.

A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Even though most of these data are publicly available, the task of identifying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata. An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that are helpful in epigenomic analysis, such as CpG Islands, promoter regions, and genes. DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical work-flow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue we refer to the DeepBlue publication in the 2016 NAR webserver issue. If you find DeepBlue useful and use it in your project, please consider citing this paper.

Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.

# Getting started

Installation of DeepBlueR and its companion packages can be performed using the Bioconductor install method in the BiocManager package:

```{r, eval = FALSE, echo=TRUE, warning=FALSE, message=FALSE}
install.packages("BiocManager")
BiocManager::install("DeepBlueR")
```

The package name is ```DeepBlueR``` and it can be loaded via:
```{r, echo = TRUE, warning=FALSE, message=FALSE, error=FALSE}
library(DeepBlueR)
```

You can test your installation and connectivity by saying hello to the DeepBlue server:
```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_info("me")
```


## Overview of DeepBlue commands

DeepBlue provides a comprehensive programmatic interface for finding, selecting,
filtering, summarizing and downloading annotated genomic region sets. Downloaded
region sets are stored using the GenomicRanges R package, which allows them to
be further processed, visualized and analyzed with existing R packages such as
LOLA or Gviz.

A list of all commands provided by DeepBlue is available on its
[API page](http://deepblue.mpi-inf.mpg.de/api.php). The vast majority of
these commands are also available through this R package and can be listed as follows:

 ```
 help(package="DeepBlueR")
 ```

In the following we list the most frequently used DeepBlue commands. The full
list of commands is available [here](http://deepblue.mpi-inf.mpg.de/api.php).
Note that each command in the following two tables has the prefix
```deepblue_```, e.g. ```deepblue_select_genes```.

| Category        | Command                 | Description                        |
|-----------------|-------------------------|------------------------------------|
| Information     | info                    | Information about an entity        |
| List and search | list_genomes            | List registered genomes            |
|                 | list_biosources         | List registered biosources         |
|                 | list_samples            | List registered samples            |
|                 | list_epigenetic_marks   | List registered epigenetic marks   |
|                 | list_experiments        | List available experiments         |
|                 | list_annotations        | List available annotations         |
|                 | search                  | Perform a full-text search         |
| Selection       | select_regions          | Select regions by metadata criteria|
|                 | select_experiments      | Select regions by experiment name  |
|                 | select_annotations      | Select regions from annotations    |
|                 | select_genes            | Select genes as regions            |
|                 | select_expressions      | Select expression data             |
|                 | tiling_regions          | Generate tiling regions            |
|                 | input_regions           | Upload and use a small region-set  |
| Operation       | aggregate               | Aggregate and summarize regions    |
|                 | filter_regions          | Filter regions by their attributes |
|                 | flank                   | Generate flanking regions          |
|                 | intersection            | Filter for intersecting regions    |
|                 | overlap                 | Filter for regions overlapping by at least a specific size |
|                 | merge_queries           | Merge two region sets              |
| Result          | count_regions           | Count selected regions             |
|                 | score_matrix            | Request a score matrix             |
|                 | get_regions             | Request the selected regions       |
|                 | binning                 | Bin results according to counts    |
| Request         | get_request_data        | Obtain the requested data          |


In addition, this package provides a set of convenience functions not part of
the DeepBlue API, such as:

| Category | Command               | Description                                       |
|----------|-----------------------|---------------------------------------------------|
| Request  | batch_export_results  | Download the results for a list of requests       |
|          | download_request_data | Download and convert the requested data (blocking)|
|          | export_meta_data      | Export metadata to a tab delimited file           |
|          | export_tab            | Export any result as tab delimited file           |
|          | export_bed            | Export GenomicRanges results as BED file          |

# DeepBlue usage examples

In the following we give a number of increasingly complex examples illustrating
what DeepBlue can achieve in your epigenomic data analysis work-flow.
We go beyond the online description of these examples by showing how the
retrieved information can be further used in R.

One of the first tasks in DeepBlue is finding the data of interest.
This can be achieved in three ways:

* Using full-text search with the ```deepblue_search``` command
* Listing the available data with the
```deepblue_list_{experiments, annotations, ...}``` commands
* Using the companion [DeepBlue web interface](http://deepblue.mpi-inf.mpg.de/)
site for listing the data

### Full-text search

In this example, we use the command ```deepblue_search``` to
find experiments that contain the keywords 'H3k27AC', 'blood', and 'peak' in
their metadata. We put each keyword in single quotes to indicate that it must
appear in the metadata.

```{r, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# We are selecting the experiments with terms 'H3k27AC', 'blood', and
# 'peak' in the metadata.
experiments_found = deepblue_search(
    keyword="'H3k27AC' 'blood' 'peak'", type="experiments")

custom_table = do.call("rbind", apply(experiments_found, 1, function(experiment){
  experiment_id = experiment[1]
  # Obtain the information about the experiment_id
  info = deepblue_info(experiment_id)

  # Print the experiment name, project, biosource, and epigenetic mark.
  with(info, { data.frame(name = name, project = project,
    biosource = sample_info$biosource_name, epigenetic_mark = epigenetic_mark)
      })
}))
head(custom_table)
```
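
The quoting convention used for the search keywords can also be assembled programmatically. The following is a minimal sketch in base R; `build_keyword_query` is a hypothetical helper that is not part of DeepBlueR, and it only builds the string that would be passed as the `keyword` argument.

```r
# Hypothetical helper: wrap each term in single quotes so that it is treated
# as a required keyword, and separate the terms with spaces.
build_keyword_query <- function(terms) {
  paste0("'", terms, "'", collapse = " ")
}

query <- build_keyword_query(c("H3k27AC", "blood", "peak"))
print(query)
```

The resulting string has the same form as the literal used in the search example, so it could be handed to ```deepblue_search``` unchanged.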

### Listing experiments

We use the ```deepblue_list_experiments``` command to list all experiments with the corresponding values in their metadata.

```{r, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")
```


### Accessing the extra-metadata
The extra-metadata is important because it contains information that is not
stored in the mandatory metadata fields. We use the ```deepblue_info``` command
to access an experiment's metadata and extra-metadata fields.
The following example prints the ```file_url``` attribute that is contained
in the data imported from the ENCODE project.

```{r, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
info = deepblue_info("e30000")
print(info$extra_metadata$file_url)
```

### Select and count regions

We use the ```deepblue_select_experiments``` command to select all genomic regions from the two specified experiments. We then pass the ```query_id``` value returned by ```deepblue_select_experiments``` to the ```deepblue_count_regions``` command.

The ```deepblue_count_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` and should check the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` will wait for the processing to finish before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{r, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Select all regions from two experiments (assumed here to be the same two
# experiments that are used in the following examples)
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))

# Count how many regions were selected
request_id = deepblue_count_regions(query_id=query_id)

# Download the request data as soon as processing is finished
requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
```
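
The asynchronous pattern described above boils down to polling a request until it is finished. The sketch below illustrates that idea in base R without contacting the server. It is not the package's implementation: `wait_for_request`, `make_fake_state`, and the state names "running", "done", and "failed" are assumptions made for illustration, and how the state is obtained from the server is deliberately left out.

```r
# Poll a state-reporting function until it reports completion or failure.
# get_state: function without arguments that returns the current state string.
wait_for_request <- function(get_state, sleep_time = 0, max_tries = 100) {
  for (i in seq_len(max_tries)) {
    state <- get_state()
    if (state %in% c("done", "failed")) {
      return(state)
    }
    Sys.sleep(sleep_time)
  }
  stop("request did not finish within max_tries polls")
}

# Stand-in for the server: reports "running" twice, then "done".
make_fake_state <- function() {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls < 3) "running" else "done"
  }
}

final_state <- wait_for_request(make_fake_state())
print(final_state)
```

A blocking download function saves the user from writing such a loop by hand, which is the convenience ```deepblue_download_request_data``` offers over ```deepblue_get_request_data```.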


### Output with selected columns
We use the ```deepblue_select_experiments``` command to select genomic
regions from the experiments that are in chromosome 1, position 0 to 50,000,000.

We then use the ```deepblue_get_regions``` command with the ```query_id``` value
 returned by the ```deepblue_select_experiments``` command to request the
regions with the selected columns.
The columns ```@NAME``` and ```@BIOSOURCE``` represent the experiment
name and the experiment biosource.

The ```deepblue_get_regions``` command is executed asynchronously. This means
that the user receives a ```request_id``` to be able to check for the status of
this request. In contrast to the command ```deepblue_get_request_data```,
the DeepBlueR package-specific command ```deepblue_download_request_data```
will wait for the processing to finish, before downloading the data. Moreover,
this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments (
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
        chromosome="chr1", start=0, end=50000000)

# Retrieve the experiments data (the @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for the experiment's biosource)
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
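
The ```output_format``` argument is a single comma-separated string that mixes regular columns with meta-columns, which start with `@`. As a small sketch in base R, the string can be built from a character vector, which makes it easier to add or remove columns; the variable names are illustrative only.

```r
# Columns to request, in output order
columns <- c("CHROMOSOME", "START", "END", "SIGNAL_VALUE", "PEAK",
             "@NAME", "@BIOSOURCE")

# Join the columns into the comma-separated format string
output_format <- paste(columns, collapse = ",")
print(output_format)

# Meta-columns are the entries that start with "@"
meta_columns <- columns[startsWith(columns, "@")]
print(meta_columns)
```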

### Select regions by sample

We use the ```deepblue_list_samples``` command to obtain all samples with the biosource 'myeloid cell' from the BLUEPRINT project. The ```deepblue_list_samples``` command returns a list of samples with their IDs and content. We extract the sample IDs from this list and use them in the ```deepblue_select_regions``` command to select genomic regions that are in chromosome 1, position 0 to 50,000.

Then, we use the ```deepblue_get_regions``` command with the parameter ```query_id``` returned by the ```deepblue_select_regions``` command and the columns ```@NAME```, ```@SAMPLE_ID```, and ```@BIOSOURCE``` representing the experiment name, the sample ID, and the experiment biosource.

The ```deepblue_get_regions``` command is executed asynchronously. This means that the user receives a ```request_id``` to be able to check for the status of this request. In contrast to the command ```deepblue_get_request_data```, the DeepBlueR package-specific command ```deepblue_download_request_data``` will wait for the processing to finish before downloading the data. Moreover, this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
samples = deepblue_list_samples(
    biosource="myeloid cell",
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)
query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
    chromosome="chr1", start=0, end=50000)
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions,1)
```


### Filter epigenomic data by region attributes

We use the ```deepblue_select_experiments``` command for selecting genomic
regions from two specific experiments that are in
chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with
```SIGNAL_VALUE``` > 10 and ```PEAK``` > 1000.

Then, we use the ```deepblue_get_regions``` command with the parameter
```query_id``` returned by the last ```deepblue_filter_regions``` command and the
columns ```@NAME``` and ```@BIOSOURCE``` representing the
experiment name and the experiment biosource.

The ```deepblue_get_regions``` command is executed asynchronously. This means
that the user receives a ```request_id``` to be able to check for the status of
this request. In contrast to the command ```deepblue_get_request_data```,
the DeepBlueR package-specific command ```deepblue_download_request_data```
will wait for the processing to finish, before downloading the data. Moreover,
this command will convert any regions to a GRanges object.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
    query_id=query_id, field="SIGNAL_VALUE", operation=">",
    value="10", type="number")
query_id_filters = deepblue_filter_regions(
    query_id=query_id_filter_signal, field="PEAK", operation=">",
    value="1000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filters,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
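
To make the effect of chaining two ```deepblue_filter_regions``` calls concrete, the sketch below reproduces the same logic locally on a small, made-up data frame. It runs entirely in base R and does not contact the server; the region values are hypothetical.

```r
# Toy region set with hypothetical values
toy_regions <- data.frame(
  CHROMOSOME = "chr1",
  START = c(100, 5000, 9000, 20000),
  END = c(600, 5800, 9500, 21000),
  SIGNAL_VALUE = c(5.2, 12.7, 30.1, 11.0),
  PEAK = c(1500, 800, 2400, 1200)
)

# First filter: keep regions with SIGNAL_VALUE > 10
by_signal <- toy_regions[toy_regions$SIGNAL_VALUE > 10, ]

# Second filter, applied to the result of the first: keep regions with PEAK > 1000
by_signal_and_peak <- by_signal[by_signal$PEAK > 1000, ]
print(by_signal_and_peak)
```

Feeding the result of the first filter into the second acts as a logical AND, so only regions that satisfy both conditions remain.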

### Intersection

We use the ```deepblue_select_experiments``` command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we use the ```deepblue_select_annotations``` command to select the promoter regions of chromosome 1.

The command ```deepblue_intersection``` filters for all regions of the ```query_id``` that intersect with at least one region in ```promoters_id```.

Then, we use the ```deepblue_get_regions``` command with the parameter ```query_id``` returned by the ```deepblue_intersection``` command and the columns ```@NAME``` and ```@BIOSOURCE``` representing the experiment name and the experiment biosource.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
promoters_id = deepblue_select_annotations(annotation_name="promoters",
    genome="GRCh38", chromosome="chr1")
intersect_id = deepblue_intersection(
    query_data_id=query_id, query_filter_id=promoters_id)
request_id = deepblue_get_regions(
    query_id=intersect_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
```
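
For intuition, the intersection keeps a data region if it overlaps at least one filter region. The base R sketch below shows this on toy coordinates. The helper `overlaps_any` and the half-open interval convention are assumptions for illustration, not a description of how the server computes overlaps.

```r
# TRUE for each data region that overlaps at least one filter region.
# Regions are treated as half-open intervals [start, end) (an assumption).
overlaps_any <- function(data_start, data_end, filter_start, filter_end) {
  vapply(seq_along(data_start), function(i) {
    any(data_start[i] < filter_end & filter_start < data_end[i])
  }, logical(1))
}

# Toy peaks and promoter regions with hypothetical coordinates
peaks <- data.frame(start = c(100, 1000, 5000), end = c(300, 1500, 5200))
promoters <- data.frame(start = c(250, 4000), end = c(400, 4500))

keep <- overlaps_any(peaks$start, peaks$end, promoters$start, promoters$end)
intersecting <- peaks[keep, ]
print(intersecting)
```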


```{r, message=FALSE, error=FALSE, warning=FALSE}
library(Gviz)
atrack <- AnnotationTrack(regions, name = "Intersecting regions",
    group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
plotTracks(list(itrack, atrack, gtrack), groupAnnotation="group", fontsize=18,
           background.panel = "#FFFEDB", background.title = "darkblue")
```

### Filter by region length and include the DNA sequence

The meta-column ```@LENGTH``` contains the genomic region length, and we keep only the genomic regions where this value is smaller than 2,000.

The meta-column ```@SEQUENCE``` includes the DNA sequence in the genomic region output.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
    query_id=query_id, field="SIGNAL_VALUE", operation=">",
    value="10", type="number")
query_id_filters = deepblue_filter_regions(
    query_id=query_id_filter_signal, field="PEAK", operation=">",
    value="1000", type="number")
query_id_filter_length = deepblue_filter_regions(
    query_id=query_id_filters, field="@LENGTH", operation="<",
    value="2000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filter_length,
    output_format="CHROMOSOME,START,END,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
```


### DNA pattern matching operations
We use the ```deepblue_find_motif``` command to find all positions of a given
pattern in the genome. An example is finding all locations of 'TATAAA'
in chromosome 1 of the genome assembly GRCh38.

We use the ```deepblue_select_experiments``` command to select genomic regions
that are in chromosome 1, position 0 to 50,000,000 from the selected
experiments.

The command ```deepblue_intersection``` selects all regions of the ```query_id```
that intersect with at least one `tataa_regions` region.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
tataa_regions = deepblue_find_motif(motif="TATAAA", genome="GRCh38", chromosomes="chr1")
query_id = deepblue_select_experiments(
    experiment_name= c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
overlapped = deepblue_intersection(query_data_id=query_id,
                                   query_filter_id=tataa_regions)
request_id=deepblue_get_regions(overlapped,
    "CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE,@PROJECT")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
```

### Counting motifs

The meta-column ```@COUNT.MOTIF()``` allows for counting how many times a motif appears in the selected genomic region. For example, the following code returns the experiment regions with the DNA sequence length, the counts of C, G, CG, and GC, and the DNA sequence itself.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiment_data = deepblue_select_experiments(
    "DG-75_c01.ERX297417.H3K27ac.bwa.GRCh38.20150527.bed")
fmt = "CHROMOSOME,START,END,@LENGTH,@COUNT.MOTIF(C),@COUNT.MOTIF(G),@COUNT.MOTIF(CG),@COUNT.MOTIF(GC),@SEQUENCE"
request_id=deepblue_get_regions(experiment_data, fmt)
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
```
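
To illustrate what a motif count means, the sketch below counts motif occurrences in a short, made-up sequence using base R. The helper `count_motif` is hypothetical and counts overlapping occurrences; whether the server-side meta-column treats overlapping matches the same way is not stated here, so this is an assumption.

```r
# Count occurrences of a motif in a sequence, including overlapping matches.
count_motif <- function(sequence, motif) {
  # A lookahead matches at every start position without consuming characters.
  hits <- gregexpr(paste0("(?=", motif, ")"), sequence, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

sequence <- "GCGCGTACG"
counts <- vapply(c("C", "G", "CG", "GC"),
                 function(motif) count_motif(sequence, motif),
                 integer(1))
print(counts)
```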


### Genes
We use the ```deepblue_select_genes``` command to select the gene `RP11-34P13`
from GENCODE v23.

The selected genes behave like regular genomic regions, which,
for example, can be filtered by their attributes. We use the
```@GENE_ATTRIBUTE``` meta-column to select the genomic regions
that are annotated as lincRNAs.

```{R, echo=TRUE, eval=FALSE, warning=FALSE, message=FALSE}
q_genes = deepblue_select_genes(genes="RP11-34P13", gene_model="gencode v23")
q_filter = deepblue_filter_regions(query_id=q_genes,
    field="@GENE_ATTRIBUTE(gene_type)", operation="==",
    value="lincRNA", type="string")
request_id=deepblue_get_regions(q_filter, "CHROMOSOME,START,END,GTF_ATTRIBUTES")
regions = deepblue_download_request_data(request_id=request_id)
regions
```

### Aggregation

The command ```deepblue_aggregate``` summarizes the ```query_id``` regions using the ```cpg_islands``` regions defined by the corresponding annotation as boundaries.

The aggregated values can be accessed through the ```@AGG.*``` meta-columns.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
query_id = deepblue_select_experiments(
    experiment_name=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1", start=0, end=50000000)
cpg_islands = deepblue_select_annotations(annotation_name="CpG Islands",
    genome="GRCh38", chromosome="chr1", start=0, end=50000000)

# Aggregate
overlapped = deepblue_aggregate(
    data_id=query_id, ranges_id=cpg_islands, column="VALUE")

# Retrieve the aggregated values through the @AGG.* meta-columns
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MIN,@AGG.MAX,@AGG.MEAN,@AGG.VAR")
regions = deepblue_download_request_data(request_id=request_id)
```


### Gene expression
In the following example we obtain the gene expression levels of three genes,
i.e., NOX3, NOXA1, and NOX4, from all biosources related to
the ```hematopoietic stem cell``` biosource from the BLUEPRINT project. By
"related" we mean the children of this biosource term in the ontologies used by
DeepBlue.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
hsc_children = deepblue_get_biosource_children("hematopoietic stem cell")

hsc_children_name = deepblue_extract_names(hsc_children)

hsc_children_samples = deepblue_list_samples(
    biosource = hsc_children_name,
    extra_metadata = list(source="BLUEPRINT Epigenome"))

hsc_samples_ids = deepblue_extract_ids(hsc_children_samples)

# Note that BLUEPRINT uses Ensembl Gene IDs
gene_exprs_query = deepblue_select_expressions(
    expression_type = "gene",
    sample_ids = hsc_samples_ids,
    identifiers = c("ENSG00000074771.3", "ENSG00000188747.7", "ENSG00000086991.11"),
    gene_model = "gencode v22")

request_id = deepblue_get_regions(
    query_id = gene_exprs_query,
    output_format ="@GENE_NAME(gencode v22),CHROMOSOME,START,END,FPKM,@BIOSOURCE")

regions = deepblue_download_request_data(request_id = request_id)
regions
```

### Tiling regions

We use the ```deepblue_tiling_regions``` command to generate a set of consecutive genomic regions of size 100,000 from chromosome 1 of the genome assembly GRCh38.

The command ```deepblue_aggregate``` summarizes the ```query_id``` regions using the column ```VALUE``` and the tiling regions as boundaries.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Selecting the data from one experiment:
#    GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiment name, we keep all other fields empty.
# We are selecting all regions of chromosome 1
query_id = deepblue_select_experiments(
    experiment_name=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1")

# Tiling regions of 100,000 base pairs
tiling_id = deepblue_tiling_regions(size=100000,
    genome="GRCh38", chromosome="chr1")

# Aggregate
overlapped = deepblue_aggregate(
    data_id=query_id, ranges_id=tiling_id, column="VALUE")

# Retrieve the aggregated mean and standard deviation for each tile
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")

regions = deepblue_download_request_data(request_id=request_id)
regions
```
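
The combination of tiling and aggregation can be pictured as assigning each value to the tile it falls into and then summarizing per tile. The base R sketch below does this for a handful of made-up positions and a tile size of 100. It is a local illustration under these assumptions and not the server's implementation; in particular, tiles without any value are simply absent from the result.

```r
tile_size <- 100

# Toy positions with a value each (hypothetical data)
values <- data.frame(
  start = c(10, 40, 120, 180, 310),
  value = c(0.2, 0.4, 0.9, 0.7, 0.5)
)

# Index of the tile each position falls into; tile i covers
# [(i - 1) * tile_size, i * tile_size)
values$tile <- values$start %/% tile_size + 1

# Summarize the values per tile
tile_mean <- tapply(values$value, values$tile, mean)
tile_count <- tapply(values$value, values$tile, length)

summary_table <- data.frame(
  START = (as.numeric(names(tile_mean)) - 1) * tile_size,
  END = as.numeric(names(tile_mean)) * tile_size,
  AGG.MEAN = as.vector(tile_mean),
  AGG.COUNT = as.vector(tile_count)
)
print(summary_table)
```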


Such data can now be plotted using any of the common R plotting mechanisms
and packages. An example is shown here:

```{r}
library(ggplot2)
plot_data <- as.data.frame(regions)
plot_data[,grepl("X.", colnames(plot_data))] <-
    apply(plot_data[,grepl("X.", colnames(plot_data))], 2, as.numeric)
AGG.plot <- ggplot(plot_data, aes(start)) +
    geom_ribbon(aes(ymin = X.AGG.MEAN - (X.AGG.SD / 2),
        ymax = X.AGG.MEAN + (X.AGG.SD / 2)), fill = "grey70") +
    geom_line(aes(y = X.AGG.MEAN))
print(AGG.plot)
```

### Flanking regions

We use the ```deepblue_select_genes``` command to select a set of genes from the gene model GENCODE v19.

The ```deepblue_flank``` command derives flanking regions from existing regions. First, we derive regions that start 2,500 base pairs before the start of the initially selected regions and have a length of 2,000 base pairs. Next, we derive the regions that start 1,500 base pairs after the end of the initially selected regions and have a length of 500 base pairs. For each region, we consider the DNA strand.

The ```deepblue_merge_queries``` command merges the region sets defined by two query IDs. Here, we first merge the two flanking region sets and then merge the result with the initially selected genes.

```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Select a set of gene locations from gencode v19
q_genes = deepblue_select_genes(
    genes= c("RNU6-1100P", "CICP7", "MRPL20", "ANKRD65",
        "HES2", "ACOT7", "HES3", "ICMT"),
    gene_model="gencode v19")

# Obtain the regions that start 2500 base pairs before the regions' start and
# have 2000 base pairs.
# The use_strand argument tells DeepBlue to consider the region strand
# (column STRAND) when calculating the new region
before_flank_id = deepblue_flank(query_id=q_genes,
    start=-2500, length=2000, use_strand=TRUE)

# Obtain the regions that start 1500 base pairs after the
# regions' end and have 500 base pairs.
# The use_strand argument tells DeepBlue to consider the
# region strand (column STRAND) when calculating the new region
after_flank_id = deepblue_flank(query_id=q_genes,
    start=1500, length=500, use_strand=TRUE)

# Merge both flanking region sets and the gene set
flank_merge_id = deepblue_merge_queries(
    query_a_id=before_flank_id, query_b_id=after_flank_id)
all_merge_id = deepblue_merge_queries(
    query_a_id=q_genes, query_b_id=flank_merge_id)

# Request the regions
request_id = deepblue_get_regions(query_id=all_merge_id,
    output_format="CHROMOSOME,START,END,STRAND,@LENGTH")

regions = deepblue_download_request_data(request_id=request_id)
regions
```
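
The coordinate arithmetic behind strand-aware flanking can be sketched locally. The helper `flank_regions` below is hypothetical and encodes one interpretation of the description above: a negative offset yields an upstream flank anchored at the region start, a positive offset yields a downstream flank anchored at the region end, and both are mirrored on the minus strand. This is an assumption for illustration and not the server's implementation.

```r
# Hypothetical helper: derive flanking regions of a given length.
# offset < 0: upstream flank; offset >= 0: downstream flank.
flank_regions <- function(regions, offset, length) {
  minus <- regions$strand == "-"
  if (offset < 0) {
    # Upstream: anchored at the region start (plus) or region end (minus)
    new_start <- ifelse(minus,
                        regions$end - offset - length,
                        regions$start + offset)
  } else {
    # Downstream: anchored at the region end (plus) or region start (minus)
    new_start <- ifelse(minus,
                        regions$start - offset - length,
                        regions$end + offset)
  }
  data.frame(start = new_start, end = new_start + length,
             strand = regions$strand)
}

# Two toy genes on opposite strands (hypothetical coordinates)
genes <- data.frame(start = c(10000, 50000), end = c(12000, 53000),
                    strand = c("+", "-"))

before <- flank_regions(genes, offset = -2500, length = 2000)
after <- flank_regions(genes, offset = 1500, length = 500)

# Combining both flank sets corresponds to merging two queries
merged <- rbind(before, after)
print(merged)
```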


### Calculated columns

Here, we summarize DNA methylation levels for CpG islands of a specific
experiment. Next, we remove those CpG islands for which no values were
found using ```@AGG.COUNT``` > 0.

We use the ```@CALCULATED``` meta-column to transform the ```@AGG.MEAN```
value to log scale.


```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# Selecting the data from one experiment:
#    GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiment name, we keep all other fields empty.
# We are selecting all regions of chromosome 1
query_id = deepblue_select_experiments(
    experiment_name="GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    chromosome="chr1")

# Select the CpG Islands annotation from GRCh38
cpg_islands = deepblue_select_annotations(
    annotation_name="CpG Islands", genome="GRCh38", chromosome="chr1")

# Aggregate
overlapped = deepblue_aggregate(
    data_id=query_id, ranges_id=cpg_islands, column="VALUE")

# Select the aggregated regions that aggregated at least one region from the
# selected experiments (@AGG.COUNT > 0)
filtered = deepblue_filter_regions(query_id=overlapped,
    field="@AGG.COUNT", operation=">", value="0", type="number")

# We remove all regions where the aggregation mean is zero.
filtered_zeros = deepblue_filter_regions(query_id=filtered,
    field="@AGG.MEAN", operation="!=", value="0.0", type="number")

# Retrieve the regions with the log-transformed mean (via @CALCULATED),
# the aggregated mean, and the aggregated count
request_id = deepblue_get_regions(query_id=filtered_zeros,
    output_format=
    "CHROMOSOME,START,END,@CALCULATED(return math.log(value_of('@AGG.MEAN'))),@AGG.MEAN,@AGG.COUNT")

regions = deepblue_download_request_data(request_id=request_id)

# We have to perform a manual conversion because the
# package can't know the type for calculated columns
regions$`@CALCULATED(return math.log(value_of('@AGG.MEAN')))` =
    as.numeric(regions$`@CALCULATED(return math.log(value_of('@AGG.MEAN')))`)

head(regions, 5)
```
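
The two filters applied before the log transformation have a simple motivation: regions without any aggregated value carry no information, and the logarithm of a zero mean is not finite. The base R sketch below replays these steps on made-up aggregation results. It assumes that `math.log` in the calculated column denotes the natural logarithm, which is what R's `log` computes by default.

```r
# Toy aggregation results (hypothetical values)
agg <- data.frame(
  AGG.MEAN = c(0, 0.25, 1, 2.5),
  AGG.COUNT = c(3, 0, 5, 2)
)

# Keep regions that aggregated at least one value and have a non-zero mean
kept <- agg[agg$AGG.COUNT > 0 & agg$AGG.MEAN != 0, ]

# Natural logarithm of the aggregated mean
kept$LOG_MEAN <- log(kept$AGG.MEAN)
print(kept)

# Without the filter, a zero mean would produce -Inf
print(log(0))
```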

Any numerical values returned by DeepBlue can also be conveniently displayed using, for example, the DataTrack feature of the Gviz Bioconductor package as shown here:

```{r, warning=FALSE, error=FALSE}
library(Gviz)
atrack <- AnnotationTrack(regions, name = "CpGs",
    group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
dtrack <- DataTrack(regions, data="@AGG.MEAN",
    name = "Log of average methylation value")
plotTracks(list(itrack, atrack, dtrack, gtrack), type="histogram", fontsize=18,
    background.panel = "#FFFEDB", background.title = "darkblue")
```



### Score matrix

Here, we select a small number of experiments for
which we want to build a score matrix based on the column ```VALUE```.

We use CpG islands as aggregated regions boundaries.

The ```deepblue_score_matrix``` command expects a named list with the
experiments names and columns that will be used for aggregation,
the regions' boundaries, and the operation that will be performed
(min, max, mean, var, sd, median, count).

The ```deepblue_score_matrix``` command is executed asynchronously. The
command ```deepblue_download_request_data``` will return a matrix in which the
first three columns correspond to the chromosome, start position and end position.
The remaining columns will carry the names of the experiments and hold the
corresponding aggregated values.


```{R, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiments =
    c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "C003N351.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "C005VG51.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "S002R551.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "NBC_NC11_41.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "bmPCs-V156.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "S00BS451.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "S00D1DA1.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
        "S00D39A1.CPG_methylation_calls.bs_call.GRCh38.20160531.wig")

experiments_columns = list()
for (experiment_name in experiments) {
    experiments_columns[[experiment_name]] = "VALUE"
}

cpgs = deepblue_select_annotations(
    annotation_name="Cpg Islands",
    chromosome="chr22", start=0, end=18000000, genome="GRCh38")

request_id = deepblue_score_matrix(
    experiments_columns=experiments_columns,
    aggregation_function="mean", aggregation_regions_id=cpgs)

score_matrix = deepblue_download_request_data(request_id=request_id)
head(score_matrix, 5)
```

```{r, fig.width = 11, warning=FALSE, echo=TRUE, error=FALSE, eval=TRUE}
library(ggplot2)
score_matrix_plot = tidyr::gather(score_matrix,
    "experiment", "methylation", -CHROMOSOME, -START, -END)
score_matrix_plot$START <- as.factor(score_matrix_plot$START)
ggplot(score_matrix_plot, aes(x=START, y=experiment, fill=methylation)) +
    geom_tile() +
    theme(axis.text.x=element_text(angle=-90))
```


### Data Export

DeepBlueR allows you to conveniently save results to disk. Any result can be
saved as a tab-delimited file using ```deepblue_export_tab```. For example, we
can save the score matrix generated in the above example:

```{r, eval = FALSE}
deepblue_export_tab(score_matrix, file.name = "my_score_matrix")
```

Results obtained with ```deepblue_get_regions``` are of type GenomicRanges and
can be exported as tab-delimited files preserving all columns or as BED files,
where a specific column can optionally be selected to populate the 'score'
column of the BED file. To demonstrate this, we use the result from the tiling
regions example further above:

```{r, eval=FALSE}
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")

regions = deepblue_download_request_data(request_id=request_id)
deepblue_export_bed(regions, file.name = "my_tiling_regions",
    score.field = "@AGG.MEAN")
```
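For readers unfamiliar with the BED layout, here is a minimal base R sketch of what filling the 'score' column from a selected field amounts to. It is not the implementation of ```deepblue_export_bed```; the toy table, the column name ```AGG_MEAN``` and the generated region names are invented for illustration.

```r
# Toy tiling regions with an aggregated mean column (all values invented).
# Coordinates are integers so that write.table() does not print 1e+05.
toy_tiles <- data.frame(
    CHROMOSOME = c("chr1", "chr1"),
    START      = c(0L, 100000L),
    END        = c(100000L, 200000L),
    AGG_MEAN   = c(0.71, 0.64)
)

# First five BED columns: chrom, start, end, name, score.
# The selected field (here AGG_MEAN) populates the score column.
toy_bed <- data.frame(
    chrom = toy_tiles$CHROMOSOME,
    start = toy_tiles$START,
    end   = toy_tiles$END,
    name  = paste0("region_", seq_len(nrow(toy_tiles))),
    score = toy_tiles$AGG_MEAN
)

# BED files are tab-delimited and have no header line.
bed_file <- tempfile(fileext = ".bed")
write.table(toy_bed, bed_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
bed_lines <- readLines(bed_file)
bed_lines
```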


Furthermore, metadata associated with any id can be stored locally using the
```deepblue_export_meta_data``` command. To this end, we first obtain the
experiment id of the file we used in the tiling regions example.

```{r, eval=FALSE}
exp_id <- deepblue_name_to_id(
    "GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig",
    collection = "experiments")$id

deepblue_export_meta_data(exp_id, file.name = "GC_T14")
```

This command can also handle lists of ids, for instance:

```{r, eval=FALSE}
deepblue_export_meta_data(list("e30035", "e30036"), file.name = "test_export")
```


In some cases, users will perform a series of requests. We provide the
command ```deepblue_batch_export_results``` to save these results and their
associated metadata to disk in one go. This command saves each file as soon as
it becomes available, i.e. once the corresponding request has been
successfully processed by DeepBlue:

```{r, eval=FALSE}
library(foreach)  # provides foreach() and %do% used below

experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")
experiment_names = deepblue_extract_names(experiments)

request_ids = foreach(experiment = experiment_names) %do%{
  query_id = deepblue_select_experiments(experiment_name = experiment,
                                        chromosome = "chr21")

  request_id = deepblue_get_regions(query_id =query_id,
    output_format = "CHROMOSOME,START,END")
}
request_data = deepblue_batch_export_results(request_ids,
                             target.directory = "BLUEPRINT macrophages chr21")
```

# Options

DeepBlueR comes with default options that can be changed by the user. To list
the current options use the following command:

```{r, echo=TRUE, eval = TRUE, warning=FALSE, message=FALSE}
deepblue_options()
```


- ```url``` The URL of the DeepBlue application server. It should not be changed.
- ```user_key``` This option can be replaced by the personal key of the user after
successful registration at http://deepblue.mpi-inf.mpg.de. The key can be found
by logging into the web application and clicking on the user name in the top
left corner. Registered users have access to advanced features of DeepBlue, e.g.
they can review previous requests.
- ```do_not_cache``` Allows users to switch off the caching functionality of
DeepBlueR. See [Caching].
- ```force_download``` If the user wishes to overwrite cached results for the
following requests, this option can be switched on. See [Caching].
- ```debug``` Switching on this option enables verbose output that is only
useful for debugging.

Changing an option works as follows:

```{r, echo=TRUE, eval = TRUE, warning=FALSE, message=FALSE}
deepblue_options(do_not_cache = TRUE)
```

Another example (replace 'my_user_key' with the actual key):

```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_options(user_key = "my_user_key")
```


If you wish to restore the default options, simply call:

```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_reset_options()
```

# Caching

DeepBlueR by default creates a file 'DeepBlueR.cache' in the current working
directory. Downloaded results and regions are stored there and can be
retrieved instantly, which is particularly useful for users with limited
network bandwidth. However, if caching is not desired, it can be switched off
(see [Options]).

To check the status of the cache you can use the following command:

```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_cache_status()
```


This will report the cache size and the number of requests currently stored.
Alternatively, users can list the request ids for which results are available:

```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_list_cached_requests()
```

Over time, the cache can quickly grow in size. It is possible to remove
individual requests from the cache if the request id is known:

```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_delete_request_from_cache("r123")
```


In most cases it will be easier to clear the whole cache:
```{r, echo=TRUE, eval = FALSE, warning=FALSE, message=FALSE}
deepblue_clear_cache()
```

# Splitting requests for improving the performance in genome-scale requests

DeepBlue has a memory limit on individual requests. As a consequence, some
operations may not be executed successfully by the DeepBlue web server. To
avoid this, large requests should be split by chromosome. This has another
advantage: if each chromosome is a separate request, processing will be
parallelized on DeepBlue and finish in a fraction of the time (depending on
the queuing status) compared to the same operation without splitting. Here is
an example for obtaining a score matrix for each chromosome individually:

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
library(foreach)

chromosomes_mm10 <- deepblue_extract_ids(deepblue_chromosomes(genome = "mm10"))

request_ids <- foreach(chromosome = chromosomes_mm10, .combine = c) %do% {
    tiling_regions = deepblue_tiling_regions(
        size=100000, genome="mm10", chromosome=chromosome)

    deepblue_score_matrix(
        experiments_columns = list(ENCFF721EKA="VALUE", ENCFF781VVH="VALUE"),
        aggregation_function = "mean",
        aggregation_regions_id = tiling_regions
    )
}
```


Now each chromosome is processed individually and efficiently by DeepBlue. We
can use the ```deepblue_batch_export_results``` function to download the
individual matrices:

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
list_of_score_matrices <- deepblue_batch_export_results(request_ids)
```

Next, we can simply merge them to obtain the final matrix:

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
library(data.table)
genome_wide_score_matrix <- data.table::rbindlist(
    list_of_score_matrices, use.names = TRUE)
genome_wide_score_matrix
```
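The split-then-merge pattern used above can be tried without server access. The following base R sketch uses an invented toy table and a hypothetical function ```process_chromosome``` as a stand-in for one per-chromosome request; it only illustrates the control flow, not the server-side computation.

```r
# Toy per-position scores on two chromosomes (values invented).
toy_scores <- data.frame(
    CHROMOSOME = c("chr1", "chr1", "chr2", "chr2", "chr2"),
    START      = c(0L, 100L, 0L, 100L, 200L),
    VALUE      = c(0.1, 0.3, 0.5, 0.7, 0.9)
)

# Hypothetical stand-in for one request restricted to a single chromosome.
process_chromosome <- function(chromosome) {
    part <- toy_scores[toy_scores$CHROMOSOME == chromosome, ]
    data.frame(CHROMOSOME = chromosome, MEAN = mean(part$VALUE))
}

# Split: one independent piece of work per chromosome.
per_chromosome <- lapply(unique(toy_scores$CHROMOSOME), process_chromosome)

# Merge: stack the per-chromosome results into one table.
merged <- do.call(rbind, per_chromosome)
merged
```

Because each piece depends only on its own chromosome, the pieces can be processed in any order or in parallel before being stacked.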



# Large-scale analysis of DNA methylation across 212 samples from the BLUEPRINT consortium

## Aim

Here, we will show how DeepBlueR can be used to generate an
overview heatmap of variable positions in more than 200
BLUEPRINT DNA methylation experiments. The amount of data considered here would
normally be too large to be processed on a local R installation. However, by
processing the data on the DeepBlue server, we can carry out this
large-scale analysis easily.

## Dependencies

In the first step, we load the DeepBlueR package, as well as packages for data
retrieval, matrix operations and plotting.

```{r dependencies, message=FALSE, warning=FALSE}
library(DeepBlueR)
library(gplots)
library(RColorBrewer)
library(matrixStats)
library(stringr)
```

## Select experiments

Next, we list all available BLUEPRINT DNA methylation experiments. (412 files
that matched the required metadata were available when this vignette was
written.)

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
blueprint_DNA_meth <- deepblue_list_experiments(
    genome = "GRCh38",
    epigenetic_mark = "DNA Methylation",
    technique = "Bisulfite-Seq",
    project = "BLUEPRINT EPIGENOME")

blueprint_DNA_meth
```


We are only interested in a subset of those files and filter for call files
(as opposed to coverage files).

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
blueprint_DNA_meth <- blueprint_DNA_meth[grep("CPG_methylation_calls.bs_call",
    deepblue_extract_names(blueprint_DNA_meth)),]

blueprint_DNA_meth
```

## Select experiment column

Each of these files has a column named VALUE that holds the DNA methylation
beta values. There are two ways to select this column across several files.

First, we assume that the column in question may have a different name in each
file. We thus have to create a list that holds the column name for each of
them. Such a list can be generated using standard R commands:

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# pre-allocate one list slot per experiment
exp_columns <- vector("list", nrow(blueprint_DNA_meth))

for(i in seq_len(nrow(blueprint_DNA_meth))){
    exp_columns[[i]] <- "VALUE"
}
names(exp_columns) <- deepblue_extract_names(blueprint_DNA_meth)
```
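The same named list can also be built without an explicit loop. The sketch below uses invented experiment names in place of the output of ```deepblue_extract_names```, so it runs without server access.

```r
# Hypothetical experiment names standing in for deepblue_extract_names().
toy_names <- c("exp_1.wig", "exp_2.wig", "exp_3.wig")

# One list entry per experiment, each holding the column to aggregate.
toy_columns <- setNames(as.list(rep("VALUE", length(toy_names))), toy_names)
str(toy_columns)
```

If the column differs between files, the ```rep("VALUE", ...)``` part can be replaced by a character vector with one column name per experiment.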


In most cases, the same column name will apply to each
file. We thus implemented a shorthand function for generating the above list
with a single column name for all files.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
exp_columns <- deepblue_select_column(blueprint_DNA_meth, "VALUE")
```

## Filter for genomic regions of interest using annotations

In the next operation, we take into account that not all methylation sites
will be informative for clustering the data. We thus filter for those regions
that are part of the BLUEPRINT regulatory build, a modified version of the
ENSEMBL regulatory build that contains promoters, promoter flanking regions,
enhancers, CTCF binding sites, transcription factor binding sites and open
chromatin regions. As this example shows, DeepBlueR returns a vector of query
ids (one for each chromosome), which we store for later use.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
# list all available chromosomes in GRCh38
chromosomes_GRCh38 <- deepblue_extract_ids(
    deepblue_chromosomes(genome = "GRCh38")
)

# keep only the essential ones
chromosomes_GRCh38 <- grep(pattern = "chr([0-9]{1,2}|X)$",
    chromosomes_GRCh38, value = TRUE)

# we split the request by chromosome to avoid hitting the memory limit of
# DeepBlue
blueprint_regulatory_regions <- foreach(chr = chromosomes_GRCh38,
    .combine = c) %do%
    deepblue_select_annotations(
        annotation_name = "Blueprint Ensembl Regulatory Build",
        chromosome = chr,
        genome = "GRCh38"
    )

blueprint_regulatory_regions
```
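To see which names the chromosome filter keeps, the pattern can be applied to a small hand-written vector. The names below are illustrative examples of typical GRCh38 sequence names, not the output of ```deepblue_chromosomes```.

```r
# Illustrative chromosome and contig names (not retrieved from DeepBlue).
toy_chromosomes <- c("chr1", "chr10", "chr22", "chrX", "chrY", "chrM",
                     "chr1_KI270706v1_random", "chrUn_KI270302v1")

# Same pattern as above: "chr" followed by one or two digits or X, then the
# end of the string.
kept <- grep(pattern = "chr([0-9]{1,2}|X)$", toy_chromosomes, value = TRUE)
kept
```

With this pattern, the autosomes and chrX pass, while chrY, chrM and the unplaced or random contigs are dropped.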


DeepBlue has several annotations that can be used to filter for informative
sites. We could, for example, also filter for CpG islands.

```{r, eval=FALSE}
deepblue_select_annotations(annotation_name = "Cpg Islands",
                            genome = "GRCh38")
```

A list of all annotations currently available for a genome is given by the
following command.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
deepblue_list_annotations(genome = "GRCh38")
```


New annotations may be included upon user request.

In case we want to include the entire genome in an aggregated form,
DeepBlue supports the concept of tiling regions. In this process, the genomic
range of interest is binned into tiles of a given size (here 5 kb).
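The idea of tiling can be illustrated locally with a few lines of base R. The chromosome length below is invented, and clipping the last tile at the chromosome end is an assumption of this sketch; how DeepBlue treats a final partial tile is not described here.

```r
# Tile a hypothetical 22,000 bp chromosome into 5 kb bins.
chromosome_length <- 22000L
tile_size <- 5000L

# Start coordinate of each tile: 0, 5000, 10000, ...
tile_starts <- seq(0L, chromosome_length - 1L, by = tile_size)

# Assumption: the last tile is clipped at the end of the chromosome.
toy_tiles <- data.frame(
    START = tile_starts,
    END   = pmin(tile_starts + tile_size, chromosome_length)
)
toy_tiles
```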

```{r, eval=FALSE}
tiling_regions <- deepblue_tiling_regions(size=5000,
                                          genome="GRCh38")
```

## Generate a score matrix

In the above steps we have defined a set of regions of interest that we want
to interrogate in R, for example to cluster samples. To this end, DeepBlue can
build a score matrix, in which the selected genomic regions are aggregated on
the server to reduce the complexity and size of the data. We request such a
score matrix, in which regulatory regions are aggregated by the mean, as
follows. Note that we use the variables 'exp_columns' and
'blueprint_regulatory_regions' that we have defined above. Since we had to
split our request by chromosome, we need to make multiple requests.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
request_ids <- foreach(query_id = blueprint_regulatory_regions,
    .combine = c) %do%
    deepblue_score_matrix(
        experiments_columns = exp_columns,
        aggregation_function = "mean",
        aggregation_regions_id = query_id)

request_ids
```


After triggering this function, DeepBlue queues our task and will execute it
when resources become available. We also observe that DeepBlue returns a
request id, which we can use to query the status of the operation.

```{r eval=TRUE, echo=TRUE, message=FALSE, warning=FALSE}
foreach(request = request_ids, .combine = c) %do% {
  deepblue_info(request)$state
}
```

When the operation is finished, we can download the score matrix and store it
in a local variable. For DeepBlueR, we implemented several strategies to
improve the performance of data retrieval. For instance, we modified the
existing XML-RPC package to be more efficient in the context of DeepBlue when
it comes to parsing nested XML data. Moreover, we retrieve tabular data
directly in a tab-separated file format, which can be processed much faster in
R. Finally, we also compress data on the server side to reduce download time.
Here, we only show the first five columns out of 215.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
score_matrix <- data.table::rbindlist(
    deepblue_batch_export_results(request_ids),
    use.names = TRUE)
score_matrix[,1:5, with=FALSE]
```


The download is 212.8 MB in size. The size of the data we handled on DeepBlue
to extract this information is roughly 212 x ~450 MB ~= 95 GB and thus more
than can be handled in R on most desktop computers. We next show how this score
matrix can be used to plot a heatmap where samples are clustered by the
Pearson correlation coefficient, revealing that samples originating from
the same cell type are more similar in DNA methylation.

## Generating a heatmap

### Metadata and colors

In preparation for the heatmap plot, we build a palette function that
interpolates between the colors of an RColorBrewer palette.
This allows us to create a color palette with more than 9 colors.
```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
```

For each experiment, we collect metadata.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
experiments_info <- deepblue_info(deepblue_extract_ids(blueprint_DNA_meth))
```


All metadata is parsed to a nested R list. We refer to the DeepBlue paper for
a description of available metadata. Here, we show the metadata associated
with just one of the samples.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
head(experiments_info[[1]], 10)
```

For this analysis, we are only interested in the biosource name, i.e. the cell
type. We can retrieve this information using standard R syntax. Note that we
show only the first 6 entries here.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
biosource <- unlist(lapply(experiments_info, function(x){
    x$sample_info$biosource_name}))
head(biosource)
```


To save some space on the plot, we replace '-positive' with '+' and '-negative' with '-'.
```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
biosource <- str_replace_all(biosource, "-positive", "+")
biosource <- str_replace_all(biosource, "-negative", "-")
```

For the same reason, we remove the words 'terminally differentiated' from one
of the cell types.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
biosource <- str_replace(biosource, ", terminally differentiated", "")
```


Using the above color palette, we can now assign a unique color to each cell type.
```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
color_map <- data.frame(biosource = unique(biosource),
                        color = getPalette(length(unique(biosource))))

head(color_map)
```

Using the above table, we can now assign a color to each experiment
according to its cell type / biosource.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
exp_names <- unlist(lapply(experiments_info, function(x){ x$name}))

biosource_colors <- data.frame(name = exp_names, biosource = biosource)
biosource_colors <- dplyr::left_join(biosource_colors, color_map,
    by = "biosource")
head(biosource_colors)
```
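The join above is a lookup of each experiment's cell type in the color table. As a sketch of that idea in base R, the same lookup can be expressed with ```match```; the cell types and colors below are invented toy values.

```r
# Toy cell types for four hypothetical experiments.
toy_biosource <- c("monocyte", "T cell", "monocyte", "B cell")

# One color per unique cell type (colors chosen arbitrarily).
toy_color_map <- data.frame(
    biosource = unique(toy_biosource),
    color     = c("#E41A1C", "#377EB8", "#4DAF4A"),
    stringsAsFactors = FALSE
)

# match() gives, for each experiment, the row of its cell type in the map.
toy_colors <- toy_color_map$color[match(toy_biosource,
                                        toy_color_map$biosource)]
toy_colors
```

The result keeps the order of the experiments, which is what matters when the colors are later passed to the heatmap.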


Finally, we transform this data frame into a vector that is compatible with the
heatmap function.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
color_vector <- as.character(biosource_colors$color)
names(color_vector) <-  biosource_colors$biosource
head(color_vector)
```

### Processing the input data

We remove the first three columns (CHROMOSOME, START, END) and convert the
data table to a numeric matrix.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
filtered_score_matrix <- as.matrix(score_matrix[,-c(1:3), with=FALSE])
head(filtered_score_matrix[,1:3])
```


Next, we compute the variance of each row and retain only genomic regions
with variance > 0.05 for plotting. Plotting all regions would consume too much
memory and more importantly, regions that do not show variance also do not
allow us to spot differences between cell types in the heatmap.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
message("regions before: ", nrow(filtered_score_matrix))
filtered_score_matrix_rowVars <- rowVars(filtered_score_matrix, na.rm = TRUE)
filtered_score_matrix <- filtered_score_matrix[which(filtered_score_matrix_rowVars > 0.05),]
message("regions after: ", nrow(filtered_score_matrix))
```

To be able to cluster samples, we remove regions that have missing values in
at least one of the experiments.

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
message("regions before: ", nrow(filtered_score_matrix))
filtered_score_matrix <- filtered_score_matrix[
    which(complete.cases(filtered_score_matrix)),]
message("regions after: ", nrow(filtered_score_matrix))
```
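Both filtering steps can be checked on a tiny matrix. The sketch below uses invented values and base R's ```apply``` with ```var``` in place of ```rowVars``` from matrixStats, so it runs without extra packages.

```r
# Three toy regions (rows) measured in four toy samples (columns).
toy <- rbind(
    flat     = c(0.50, 0.50, 0.50, 0.50),  # no variance, uninformative
    variable = c(0.00, 1.00, 0.00, 1.00),  # variable and complete
    gappy    = c(0.00, 1.00, NA,   1.00)   # variable but has a missing value
)
colnames(toy) <- paste0("sample_", 1:4)

# Step 1: keep rows whose variance exceeds the threshold.
row_vars <- apply(toy, 1, var, na.rm = TRUE)
toy <- toy[which(row_vars > 0.05), , drop = FALSE]

# Step 2: keep rows without missing values.
toy <- toy[complete.cases(toy), , drop = FALSE]
rownames(toy)
```

The ```drop = FALSE``` argument is a defensive choice in this sketch: it keeps the result a matrix even when only one row survives the filter.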


IMPORTANT: The order of columns in the score matrix is not the same as in the
exp_columns list used in the request. We thus have to order the columns of the
matrix by the experiment names used to build the color vector. This is crucial
to make sure we assign the correct cell type to each sample!
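A small base R sketch of this reordering, with invented column names, shows the effect and adds a sanity check that both sides refer to the same set of experiments before indexing by name.

```r
# Toy matrix whose columns arrive in a different order than the metadata.
toy_matrix <- matrix(1:6, nrow = 2,
    dimnames = list(NULL, c("exp_c", "exp_a", "exp_b")))

# Order of experiments in the (hypothetical) metadata / color vector.
toy_exp_names <- c("exp_a", "exp_b", "exp_c")

# Guard against silent mismatches before reordering.
stopifnot(setequal(colnames(toy_matrix), toy_exp_names))

# Index by name so that column i matches metadata entry i.
toy_matrix <- toy_matrix[, toy_exp_names]
colnames(toy_matrix)
```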

```{r echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
filtered_score_matrix <- filtered_score_matrix[,exp_names]
```

### Plotting

We plot a heatmap in which the variable regions are shown across all samples.
On top of the columns, we draw a dendrogram based on the Pearson correlation.
More precisely, we convert the Pearson correlation, a similarity measure, to a
distance (one minus the correlation), such that it can be used with
hierarchical clustering.

```{r, fig.width=11, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
heatmap.2(filtered_score_matrix, labRow = NA, labCol = NA,
    trace = "none", ColSideColors = color_vector,
    hclustfun = function(x) hclust(x, method = "complete"),
    distfun = function(x) as.dist(1 - cor(t(x), method = "pearson")),
    Rowv = TRUE, dendrogram = "column",
    key.xlab = "beta value", denscol = "black", keysize = 1.5,
    key.title = NA)
```
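The conversion from correlation to distance can be examined in isolation. The following base R sketch clusters four invented sample profiles; the profile shapes and names are arbitrary and only serve to produce two clearly separated groups.

```r
# Four toy samples (columns) over 40 toy regions (rows): a1/a2 follow one
# profile, b1/b2 another. Small deterministic perturbations avoid duplicates.
x <- seq(0, 2 * pi, length.out = 40)
toy <- cbind(
    a1 = sin(x),
    a2 = sin(x) + 0.05 * cos(5 * x),
    b1 = cos(x),
    b2 = cos(x) + 0.05 * sin(5 * x)
)

# cor() correlates columns, i.e. samples. Highly correlated samples get a
# distance near 0, uncorrelated ones a distance near 1.
pearson_dist <- as.dist(1 - cor(toy, method = "pearson"))

# Complete-linkage hierarchical clustering on that distance.
tree <- hclust(pearson_dist, method = "complete")
groups <- cutree(tree, k = 2)
groups
```

In the heatmap call, the distance function receives a matrix whose rows are the items to cluster, which is why the correlation is computed on ```t(x)``` there, whereas this sketch passes the samples as columns directly.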


```{r, fig.height=6, echo=TRUE, eval=TRUE, warning=FALSE, message=FALSE}
  plot.new()

  legend(x = 0, y = 1,
       legend = color_map$biosource,
       col = as.character(color_map$color),
       text.width = 0.6,
       lty= 1,
       lwd = 6,
       cex = 0.7,
       y.intersp = 0.7,
       x.intersp = 0.7,
       inset=c(-0.21,-0.11))
```

# Final remarks

We encourage you to try to reproduce the Python examples in R and to read the
DeepBlue manual.

We also want to highlight the possibility of browsing and accessing existing
data in DeepBlue conveniently through the web interface, which also allows you
to select experiments in a grid-like view.

Should you encounter any problems with DeepBlueR, we kindly ask you to create
an issue on the Bioconductor DeepBlueR support page.

The R code in the DeepBlueR package is under the GPLv3 license, and we welcome
contributions from other developers. Finally, we would like to thank the
Bioconductor team for its support in making DeepBlueR available to a wide
audience of users.

MPIIComputationalEpigenetics/DeepBlue-R documentation built on Aug. 11, 2021, 3:18 p.m.
