knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
suppressPackageStartupMessages({
library(ggplot2)
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(AnnotationHub)
library(biocwk312)
library(RNAseqData.HNRNPC.bam.chr14)
library(GenomicAlignments)
library(curatedTCGAData)
})

Prerequisites and motivating examples

Pre-requisites

Programming and package use with R 4.0.3

R is a programming language focusing on activities important in statistical data analysis, but it is often thought of more as an interactive environment for working with data. This concept of "environment" will become clearer as we use R to perform progressively mode complex tasks.

Simple illustrations

(7+5)/(3*6)
s = "abc"
s
paste(s, s, sep="")
set.seed(1234) # setting a seed for reproducibility
x = rnorm(5)
x
y = rnorm(5)
plot(x,y)

Making new variables

In the example s = "abc", we formed a string constant by enclosing abc in quotes and assigned it to the variable s. This operation could also be written s <- "abc" or "abc" -> s.

Calling functions

The most common syntax for working with R was illustrated in the string operation illustration: paste(s, s, sep=""). This operation took the string variables s and concatenated it with itself. The additional function argument sep="" arranges that the concatenation is to occur without any intervening text.

The examples illustrate the use of two more functions: rnorm, and plot.

Packages

The family of functions and data available in an R session depends upon the collection of packages attached in the session. Packages are used to collect related functions together with their documentation. We can add to the collection of functions available for plotting, and use the ggplot approach to designing visualizations, as follows.

library(ggplot2)
simpdf = data.frame(x,y)
ggplot(simpdf, aes(x=x, y=y)) + geom_point()

Summary: strategies for bioinformatics with R/Bioconductor

These questions should be considered in preparing to work with R for bioinformatics analysis.

The Bioconductor software/analysis/documentation ecosystem can help with answers to these questions. There are many facets to the ecosystem. Bioconductor users form an active community and post questions and answers at support.bioconductor.org. You can often get input on answers to some of the questions above by browsing and searching that site and posting your own questions.

Basics of genome annotation: reference genomes, transcriptomes, functional annotation

Reference genomic sequence

The BSgenome package defines infrastructure that supports compact representation of genomic sequences. We'll have a look at a recent reference build for baker's yeast. For this to work, we need to know the name of the package that provides the genomic sequence, install it, and use appropriate functions to work with the BSgenome representation.

# BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3") if necessary
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeastg = BSgenome.Scerevisiae.UCSC.sacCer3
yeastg

Evaluation of yeastg in the R session produces a brief report on the BSgenome representation. We can look at chromosomal sequence with

yeastg$chrI

What we call yeastg is a Bioconductor representation of the genome of a model organism. Similar representations are available for more species:

length(available.genomes())
head(available.genomes())

Each of the genomes listed in the output of available_genomes() can be obtained by installing the associated package, attaching the package and then operating with the eponymous object, as in yeastg = BSgenome.Scerevisiae.UCSC.sacCer3. All of the objects representing the reference genomes of model organisms are instances of a 'class'. Class definitions are used to group related data elements together.

class(yeastg)

One way of discovering the family of functions that have been defined to work with instances of a class is to use the methods function:

methods(class="BSgenome")

Let's try one:

atgm = vmatchPattern("ATG", yeastg)
atgm

This gives us the locations of all occurrences of ATG in the yeast genome.

The TxDb family

Model organism genomes and transcriptomes are catalogued at UCSC and Ensembl. The UCSC catalog is attached and used as follows:

library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
ytx = TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
class(ytx)
ytx

As before, we can discover available functions for working with such catalogues:

methods(class="TxDb")

To obtain the DNA sequences of all yeast genes, use getSeq: First, we need the addresses of the genes. We use a special representation called GRanges.

y_g = genes(ytx)
y_g

Instances of the GRanges class are used to collect and annotate regions defined in genomic coordinates. getSeq(x,y) extracts nucleotide sequences from a genome x, at the addresses given by y:

getSeq(yeastg, y_g)

For human studies, the same concepts apply.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
transcripts(txdb)

Using a transcript catalog to visualize RNA-seq data

We've pre-packaged a slice of the human transcriptome (for chr14) in the workshop-based package biocwk312. This slice was massaged for visualization by functions in the TnT package.

library(biocwk312)
library(TnT)
data(tnt_txtr_14)
tnt_txtr_14

Bioconductor has a number of packages that supply aligned RNA-seq reads in BAM format. We'll use one that investigated the effects of knocking down a gene called HNRNPC in HeLa cells.

library(RNAseqData.HNRNPC.bam.chr14)

A character vector provides the paths to the BAM files:

length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]

The 8 files are grouped: the first four are for wild-type cells, the next four are for the cells treated to knock down HNRNPC.

labels = c("wt1", "wt2", "wt3", "wt4", "kd1", "kd2", "kd3", "kd4")

The GenomicAlignments package provides functions to parse and ingest information from BAM files.

library(GenomicAlignments)
readGAlignments(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])

We'll now select four of the BAM files to display coverage in the vicinity of the knocked-down gene. The function linetrack_covg_by_symbol was written specifically to combine information on reads with positions of genes and transcripts.

allfi = lapply(c(1,2,5,6), function(x) {
     curf = RNAseqData.HNRNPC.bam.chr14_BAMFILES[x]
     linetrack_covg_by_symbol("HNRNPC", curf, 
             color="gray", radius=3e5, label=labels[x])
     })
mytr = c(allfi, tnt_gt_sym, tnt_txtr_14)
TnTGenome(mytr)

AnnotationHub example: Epigenetically defined functional annotation

Finally, the AnnotationHub package defines a database with many types of biological annotation. For our last example in this section, we consider how to obtain the ChromHmm cell type annotation for diverse cell types. The AnnotationHub can be updated in real time. We obtain an image of the hub, and then "query".

library(AnnotationHub)
ah = AnnotationHub()
chq = query(ah, "ChromHmm")
chq

The result here can be very large and so an abbreviated view is dumped. To learn about the cell types studied, we use a technical call that takes advantage of knowledge that mcols on a query result gives a table of features for each resource.

mcols(chq)["AH46861",]$tags
head(table(vapply(mcols(chq)[-1,]$tags, 
   function(x)x[8],  character(1))))

Now we can acquire the ChromHmm labeling of genomic regions for the H1-derived mesenchymal stem cells:

h1msc = ah[["AH46861"]]
h1msc
table(h1msc$name)

The Cancer Genome Atlas, a paradigm of federated multiomics

The curatedTCGAData package provides convenient access to all open data in TCGA. Here we'll request available data on tumor mutation profiles and RNA-seq studies available for breast cancer (BRCA).

library(curatedTCGAData)
brca_demo = curatedTCGAData("BRCA", c("Mutation", "RNASeq2GeneNorm"),
   dry.run=FALSE)
brca_demo

The MultiAssayExperiment structure is described in a JCO CCI paper.
Timestamped tags are provided for each assay type.

names(experiments(brca_demo))

Mutations in a GRangesList

We can get immediate access to genomic coordinates and contents of tumor mutations by coercing the mutations component to GRangesList. There is one list element per participant.

grm = as(experiments(brca_demo)[["BRCA_Mutation-20160128"]], "GRangesList")
names(grm)[1:3]
grm[[1]][1:4,1:8]

A SummarizedExperiment for RNA-seq

rnaseq = experiments(brca_demo)[["BRCA_RNASeq2GeneNorm-20160128"]]
rnaseq
head(names(colData(rnaseq)))

The OSCA book, a paradigm of computable scientific literature



vjcitn/biocwk312 documentation built on Jan. 1, 2021, 12:41 p.m.