suppressMessages({
suppressPackageStartupMessages({
library(AnnotationHub)
library(rtracklayer)
library(DT)
library(tibble)
library(BiocStyle)
library(lubridate)
library(ensembldb)
library(EnsDb.Hsapiens.v79)
library(GenomicFiles)
library(ontoProc)
})
})

Introduction

The Encyclopedia of DNA Elements (ENCODE) is a collection of experimental results and computing resources devoted to elaboration of mechanisms of gene regulation. The project main page is a good overview of its scope, involving four main organisms (human, mouse, worm and fly) and dozens of biotechnological innovations that are used to analyze gene structure and regulation. We will illustrate ways of using Bioconductor to acquire and interpret ENCODE resources.

Main aims for learners using this document:

This document is part 1 of three parts -- the second part addresses ATAC-seq, and the third addresses RNA-seq preceded by CRISPR-based interference.

Built-in surveys through ENCODExplorer metadata

We can use AnnotationHub to get information on ENCODE results that have been curated for direct use with Bioconductor. The r Biocpkg("ENCODExplorer") package provides a metadata table, that we retrieve using r Biocpkg("AnnotationHub"). We use the "full" table because it provides links to raw and processed data files curated in the cloud by the ENCODE project.

Using a local metadata cache

We'll begin with a query to AnnotationHub to learn about resources whose metadata mentions "ENCODExplorerData".

library(AnnotationHub)
ah = AnnotationHub()
AnnotationHub::query(ah, "ENCODExplorerData")

The recent 'Full' metadata table is over 90MB in size, so it takes a few moments to download from cloud (or load from cache, in case you have retrieved it previously).

fm = ah[["AH75132"]] # implicitly retrieves if needed, or loads
dim(fm)
table(fm$organism)

We are working with r nrow(fm) records on r length(table(fm$organism)) different organisms.

Filtering the metadata table

We focus on human samples, and limit to ChIP-seq experiments for now.

# filter
hfm = fm[which(fm$organism == "Homo sapiens"),]
tail(sort(table(hfm$assay)))
hfmt = hfm[which(hfm$assay == "TF ChIP-seq"),]

The proteins for which binding patterns can be assessed are in the antibody_target field:

tail(sort(table(hfmt$antibody_target)))

Managing the information with a GenomicFiles instance

We'll manage the ChIP-seq data references with a r Biocpkg("GenomicFiles") object. For convenience, we'll focus on bigWig files.

hfmtbw = hfmt[hfmt$file_type == "bigWig",]
dim(hfmtbw)
table(hfmtbw$output_type)
library(GenomicFiles)
gf1 = GenomicFiles(files=hfmtbw$cloud_metadata.url)
colData(gf1) = as(as.data.frame(hfmtbw), "DataFrame")
gf1

The gf1 object reports only the "basename" component of the names of files managed. The actual filename (URL) for the first file is

GenomicFiles::files(gf1)[1]

A virtues of the GenomicFiles discipline is that we keep together the file identifiers and metadata about them. The following information is crucial to downstream use of the files.

table(gf1$assembly)

This table is very important, indicating that we are managing information for which two genomic coordinate systems are in play. In the introduction to Bioconductor, it is shown how liftOver can be used to translate between coordinate systems if needed.

We'll focus on data collected with GRCh38 coordinates.

gf1 = gf1[, which(gf1$assembly == "GRCh38")]
gf1
gf = GenomicFiles(files=hfmtbw$cloud_metadata.url)
colData(gf) = as(as.data.frame(hfmtbw), "DataFrame")

Anatomic sources of samples

We'll define a helper function to report on discrete properties of samples. This gives us clues on the contents of listings of biosample type and associated ontology labels.

tls = function(x) t(t(head(sort(table(x), decreasing=TRUE))))
tls(gf1$biosample_name)
tls(gf1$biosample_ontology)

The use of ontology labeling helps us to organize information on samples. Bioconductor's r Biocpkg("ontoProc") package can be used to develop a simple hierarchical display. We focus on the Experimental Factor Ontology (EFO) tags, and take a very small subset of tags in use to obtain a tractable display in this document.

library(ontoProc)
ee = getEFOOnto()
tail(sort(table(hfmt$biosample_ontology)),12) -> ioi
nn = names(ioi[grep("EFO", names(ioi))])
tails = gsub(".*_", "", nn)
tailss = gsub("/", "", tails)
tt = paste0("EFO:", tailss)
onto_plot2(ee, tt, cex=.6)

It is noteworthy that in this display HEK293 is not linked to "Homo sapiens cell line". This can be confirmed by visiting the ontology lookup service. It is not uncommon to find gaps in curated data resources. Whether there is a justification for this particular omission is unclear.

Targeted sketching of peak scores

We'll focus on the transcription factor CREB1. There are a modest number of samples in different cell lines. Some are treated with ethanol.

gf1_creb1 = gf1[, which(gf1$target == "CREB1")]
table(gf1_creb1$treatment, gf1_creb1$biosample_name, exclude=NULL)

We'll focus on samples with type signal p-value.

table(gf1_creb1$output_type, gf1_creb1$biosample_name, exclude=NULL)
gf1_creb1_use = gf1_creb1[, 
   which(gf1_creb1$output_type == "signal p-value")]

We will import a megabase of bigWig content from the files selected in this way, using reduceByFile:

myr = GRanges("chr17", IRanges(66e6,67e6))
genome(myr) = "GRCh38"
rowRanges(gf1_creb1_use) = myr
if (.Platform$OS.type == "windows") {
  sels = edxAdvBioc::creb1_sels
  } else sels = reduceByFile(gf1_creb1_use, MAP=function(range, file, ...) {
    sel = BigWigSelection(range)
    import.bw(file, selection=sel, genome=genome(myr)[1])
})

Here is our targeted sketch:

lk1 = sels[[1]][[1]]
lk2 = sels[[2]][[1]]
plot(start(lk1)+.5*(width(lk1)), lk1$score, pch=19,
 xlab="midpoint of scored interval, chr17", ylab="-log10 CREB1 signal p-value",
 cex=.5, col=lava::Col("black", alpha=.3))
points(start(lk2)+.5*width(lk2), lk2$score, pch=19, 
 col=lava::Col("blue", alpha=.3), cex=.5)

Summary

Review the following concepts before proceeding to exercises.

Exercises

Basic descriptive statistics

  1. gf1 represents data on r ncol(gf1) ChIP-seq experiments for organism 'Homo sapiens' in bigWig format. Use the cloud_metadata.file_size colData component to state a) the median file size, and b) the total number of gigabytes of cloud-resident data to which gf1 mediates access.
median(as.numeric(gf1$cloud_metadata.file_size))
sum(as.numeric(gf1$cloud_metadata.file_size))/1e9
  1. What is the most common ChIP-seq target protein studied in the r ncol(gf1) experiments?
tls(gf1$target)
  1. What is the most commonly studied cell type in the r ncol(gf1) experiments?
tls(gf1$biosample_name)
  1. Use the fm metadata table. For the application of the ATAC-seq assay to human samples, what is the most common biosample_name?
 as_tibble(fm) %>% dplyr::filter(assay == "ATAC-seq" & organism=="Homo sapiens") %>% 
    dplyr::select(biosample_name) %>% dplyr::group_by(biosample_name) %>%
    dplyr::summarise(n=dplyr::n()) %>% dplyr::arrange(desc(n))
  1. Chronology and scope of experimental agenda. We can use the date_created field to document submission of assays of diverse types. We'll use the hfm table for this.
newt = as_tibble(hfm) %>% 
  dplyr::select(assay, date_created) %>% 
  dplyr::mutate(ldate=lubridate::as_date(date_created)) 
par(las=2, mar=c(12,4,2,2))
ameds = sapply(split(newt$ldate, newt$assay), median)
with(newt, boxplot(split(ldate, assay)[order(ameds)]))

How many assays of type long read RNA-seq are present?

sum(hfm$assay == "long read RNA-seq", na.rm=TRUE)
  1. RNA-seq preceded by CRISPR interference. We use the dataset_description to find out about genes knocked down in CRISPRi RNA-seq experiments. The language in that field is repetitious, and we massage its values a bit, using gsub and mutate, to get the name of the gene that was putatively knocked down.
ii = (hfm %>% dplyr::filter(assay == "CRISPRi RNA-seq" & file_format=="tsv") %>% 
  dplyr::mutate(targ = gsub(
    "RNA-seq on K562 cells treated by CRISPR interference targeting (..*).", 
     "\\1", dataset_description)) %>%
  dplyr::select(investigated_as, targ) %>% 
  dplyr::group_by(investigated_as, targ) %>% dplyr::summarise(n=dplyr::n()))
ii

For how many different transcription factors, interfered with by CRISPR prior to RNA-seq, have tsv files been prepared?

unique((ii %>% dplyr::filter(investigated_as == "transcription factor"))$targ)

Retention of metadata; plotting

The following function, annotated in roxygen format, gathers the task of filtering bigWig data of a specific output type and target in a given genomic interval.

#' import bigwig in an interval, given GenomicFiles for ENCODE bigWig files
#' @param gf GenomicFiles instance
#' @param target character(1) ChIP-seq target
#' @param biosample_names character() vector of cell types to retain
#' @param output_type character(1) defaults to "signal p-value"
#' @param selection GRanges instance, one range allowed
#' @export
 import_enc_bw = function(gf, target="CREB1", 
   biosample_names = c("HepG2", "MCF-7", "A549"), output_type="signal p-value",
   selection= GRanges("chr17", IRanges(66e6,67e6), genome="GRCh38")) {
  stopifnot(length(selection)==1)
  gf_use = gf[, which(gf$target == target)]
  gf_use = gf_use[, 
     which(gf_use$output_type == "signal p-value")]
  gf_use = gf_use[, 
     which(gf_use$biosample_name %in% biosample_names)]
  rowRanges(gf_use) = selection
  ans = reduceByFile(gf_use, MAP=function(range, file, ...) {
      sel = BigWigSelection(range)
      import.bw(file, selection=sel, genome=genome(myr)[1])
  })
  ans = GRangesList(unlist(ans, recursive=FALSE))
  mcols(ans) = colData(gf_use)
  names(ans) = make.unique(mcols(ans)$biosample_name)
  ans
}

We can generate our sketch data with a pair of simple calls, assuming the construction of gf1 is accomplished as above.

First, we generate the requested data on the available biosamples.

if (.Platform$OS.type == "windows") {
  skd = edxAdvBioc::skd
  } else skd = import_enc_bw(gf1)

A virtue of this approach is that the cell types (and all metadata) are bound to the scores:

table(mcols(skd)$biosample_name)

To complete the task, we define a plotting function. We'll call it to compare HepG2 and MCF-7 patterns in the selected interval.

#' plot the result of a pair of import_enc_bw runs
#' @param impeb output of import_enc_bw
#' @param ylim limits of y axis
#' @param xlim limits of x axis
#' @param logy logical(1) whether or not to use log="y" in plot call
#' @param leg_frac_x numeric(1) fudge factor to move legend to the right of natural
#' position at xlim[1]
#' @export
plot_pair = function(impeb, ylim=c(1,500), xlim=c(66.1e6,66.3e6),
    logy=TRUE, leg_frac_x=.015) {
  stopifnot(length(impeb)==2)
  lk1 = impeb[[1]]
  lk2 = impeb[[2]]
  if (logy) {
   lk1 = lk1[which(lk1$score>0)]
   lk2 = lk2[which(lk2$score>0)]
   }
  seqn = as.character(seqnames(impeb[[1]]))[1]
  g = grep("chr", seqn)
  if (length(g)==0) seqn=paste0("chr", seqn)
  targ = mcols(impeb)$target[1]
  oty = mcols(impeb)$output_type[1]
  pspec = list(x=start(lk1)+.5*(width(lk1)), y=lk1$score, pch=19,
   xlab=sprintf("midpoint of scored interval, %s", seqn), ylab=sprintf("bigWig '%s' for %s", oty, targ),
   cex=.5, col=lava::Col("orange", alpha=.1), ylim=ylim, xlim=xlim, log=ifelse(logy, "y", ""))
  do.call(plot, pspec)
  points(start(lk2)+.5*width(lk2), lk2$score, pch=19, 
   col=lava::Col("blue", alpha=.1), cex=.5)
  legend(xlim[1]+leg_frac_x*diff(xlim), ylim[2], pch=19, col=lava::Col(c("orange", "blue"), alpha=.4),
    legend=mcols(impeb)$biosample_name)
}

In the following calls, we select distinct pairs of samples from different cell types for overlaid visualization within pairs.

plot_pair(skd[c("HepG2.2", "MCF-7")], ylim=c(1,1500), xlim=c(66e6, 67e6), leg_frac_x=.04)
plot_pair(skd[c("HepG2.1", "MCF-7.1")], ylim=c(1,1500), xlim=c(66e6, 67e6), leg_frac_x=.04)
  1. What is the number of positions at which binding of CREB1 to HepG2 exhibits signal p-values consistently greater than those seen for MCF-7 in the two displays here? (0, 1, 4, 10)?

  2. Use the following code to simplify the data on one HepG2 sample:

s5 = skd[["HepG2.2"]]
GenomeInfoDb::seqlevelsStyle(s5) = "NCBI"
seqlevels(s5) = "17"

Using genes(EnsDb.Hsapiens.v79), find the gene nearest the strongest peak for CREB1 in HepG2.

g79 = ensembldb::genes(EnsDb.Hsapiens.v79::EnsDb.Hsapiens.v79)
g79[ nearest(s5[which.max(s5$score)], g79) ]

A different TF target, assayed in a breast cancer cell line.

Run the following code.

if (.Platform$OS.type == "windows") {
  uu = edxAdvBioc::uu
} else uu = import_enc_bw(gf1, target="FOXA1", 
       selection=GRanges("chr6", IRanges(151.5e6,152.5e6)), 
       biosample_names=c("A549", "MCF-7"))
plot_pair(uu[c("A549", "MCF-7")], xlim=c(151.5e6,152.5e6))

What gene is nearest to the strongest FOXA1 binding location seen in the sample named MCF-7?

top = which.max(uu[["MCF-7"]]$score)
cand = uu[[7]][top]
seqlevelsStyle(cand) = "NCBI"
seqlevels(cand) = "6"
g79[nearest(cand, g79)]


vjcitn/edx_adv_bioc documentation built on March 17, 2020, 3:58 p.m.