knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) suppressPackageStartupMessages({ library(ggplot2) library(BSgenome) library(BSgenome.Scerevisiae.UCSC.sacCer3) library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(AnnotationHub) library(biocwk312) library(RNAseqData.HNRNPC.bam.chr14) library(GenomicAlignments) library(curatedTCGAData) })
R is a programming language focusing on activities important in statistical data analysis, but it is often thought of more as an interactive environment for working with data. This concept of "environment" will become clearer as we use R to perform progressively mode complex tasks.
(7+5)/(3*6)
s = "abc" s paste(s, s, sep="")
set.seed(1234) # setting a seed for reproducibility x = rnorm(5) x
y = rnorm(5) plot(x,y)
In the example s = "abc"
, we formed a string constant
by enclosing abc
in quotes and assigned it to the
variable s
. This operation could also be written
s <- "abc"
or "abc" -> s
.
The most common syntax for working with R was illustrated
in the string operation illustration: paste(s, s, sep="")
.
This operation took the string variables s
and
concatenated it with itself. The
additional function argument sep=""
arranges
that the concatenation is to occur without any intervening text.
The examples illustrate the use of two more functions: rnorm
,
and plot
.
The family of functions and data available in an R session depends upon the collection of packages attached in the session. Packages are used to collect related functions together with their documentation. We can add to the collection of functions available for plotting, and use the ggplot approach to designing visualizations, as follows.
library(ggplot2) simpdf = data.frame(x,y) ggplot(simpdf, aes(x=x, y=y)) + geom_point()
These questions should be considered in preparing to work with R for bioinformatics analysis.
The Bioconductor software/analysis/documentation ecosystem
can help with answers to these questions. There are many
facets to the ecosystem. Bioconductor users form an
active community and post questions and answers at
support.bioconductor.org
. You can often get input
on answers to some of the questions above by browsing and
searching that site and posting your own questions.
The BSgenome
package defines infrastructure that supports compact
representation of genomic sequences. We'll have a look at
a recent reference build for baker's yeast. For this to work,
we need to know the name of the package that provides the
genomic sequence, install it, and use appropriate functions
to work with the BSgenome representation.
# BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3") if necessary library(BSgenome) library(BSgenome.Scerevisiae.UCSC.sacCer3) yeastg = BSgenome.Scerevisiae.UCSC.sacCer3 yeastg
Evaluation of yeastg
in the R session produces a brief report on the
BSgenome representation. We can look at chromosomal sequence with
yeastg$chrI
What we call yeastg
is a Bioconductor representation of the
genome of a model organism. Similar representations are available
for more species:
length(available.genomes()) head(available.genomes())
Each of the genomes listed in the output of available_genomes()
can be obtained by
installing the associated package, attaching the package and then operating
with the eponymous object, as in yeastg = BSgenome.Scerevisiae.UCSC.sacCer3
.
All of the objects representing the reference genomes of model organisms
are instances of a 'class'. Class definitions are used to group related
data elements together.
class(yeastg)
One way of discovering the family of functions that have
been defined to work with instances of a class is to use
the methods
function:
methods(class="BSgenome")
Let's try one:
atgm = vmatchPattern("ATG", yeastg) atgm
This gives us the locations of all occurrences of ATG
in the yeast genome.
Model organism genomes and transcriptomes are catalogued at UCSC and Ensembl. The UCSC catalog is attached and used as follows:
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene) ytx = TxDb.Scerevisiae.UCSC.sacCer3.sgdGene class(ytx) ytx
As before, we can discover available functions for working with such catalogues:
methods(class="TxDb")
To obtain the DNA sequences of all yeast genes, use getSeq
:
First, we need the addresses of the genes. We use a
special representation called GRanges
.
y_g = genes(ytx) y_g
Instances of the GRanges
class are used to collect and annotate regions
defined in genomic coordinates.
getSeq(x,y)
extracts nucleotide sequences from a genome x
, at the
addresses given by y
:
getSeq(yeastg, y_g)
For human studies, the same concepts apply.
library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb = TxDb.Hsapiens.UCSC.hg19.knownGene transcripts(txdb)
We've pre-packaged a slice of the human transcriptome (for chr14) in
the workshop-based package biocwk312
. This slice was massaged
for visualization by functions in the TnT
package.
library(biocwk312) library(TnT) data(tnt_txtr_14) tnt_txtr_14
Bioconductor has a number of packages that supply aligned RNA-seq reads in BAM format. We'll use one that investigated the effects of knocking down a gene called HNRNPC in HeLa cells.
library(RNAseqData.HNRNPC.bam.chr14)
A character vector provides the paths to the BAM files:
length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
The 8 files are grouped: the first four are for wild-type cells, the next four are for the cells treated to knock down HNRNPC.
labels = c("wt1", "wt2", "wt3", "wt4", "kd1", "kd2", "kd3", "kd4")
The GenomicAlignments
package provides functions to
parse and ingest information from BAM files.
library(GenomicAlignments) readGAlignments(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])
We'll now select four of the BAM files to display coverage in
the vicinity of the knocked-down gene. The function linetrack_covg_by_symbol
was written specifically to combine information on reads with
positions of
genes and transcripts.
allfi = lapply(c(1,2,5,6), function(x) { curf = RNAseqData.HNRNPC.bam.chr14_BAMFILES[x] linetrack_covg_by_symbol("HNRNPC", curf, color="gray", radius=3e5, label=labels[x]) }) mytr = c(allfi, tnt_gt_sym, tnt_txtr_14) TnTGenome(mytr)
Finally, the AnnotationHub package defines a database with many types of biological annotation. For our last example in this section, we consider how to obtain the ChromHmm cell type annotation for diverse cell types. The AnnotationHub can be updated in real time. We obtain an image of the hub, and then "query".
library(AnnotationHub) ah = AnnotationHub() chq = query(ah, "ChromHmm") chq
The result here can be very large and so an abbreviated
view is dumped. To learn about the cell types studied,
we use a technical call that takes advantage of knowledge
that mcols
on a query result gives a table of features
for each resource.
mcols(chq)["AH46861",]$tags head(table(vapply(mcols(chq)[-1,]$tags, function(x)x[8], character(1))))
Now we can acquire the ChromHmm labeling of genomic regions for the H1-derived mesenchymal stem cells:
h1msc = ah[["AH46861"]] h1msc table(h1msc$name)
The curatedTCGAData package provides convenient access to all open data in TCGA. Here we'll request available data on tumor mutation profiles and RNA-seq studies available for breast cancer (BRCA).
library(curatedTCGAData) brca_demo = curatedTCGAData("BRCA", c("Mutation", "RNASeq2GeneNorm"), dry.run=FALSE) brca_demo
The MultiAssayExperiment structure is described in a
JCO CCI paper.
Timestamped tags are provided for each assay type.
names(experiments(brca_demo))
We can get immediate access to genomic coordinates and contents of tumor mutations by coercing the mutations component to GRangesList. There is one list element per participant.
grm = as(experiments(brca_demo)[["BRCA_Mutation-20160128"]], "GRangesList") names(grm)[1:3] grm[[1]][1:4,1:8]
rnaseq = experiments(brca_demo)[["BRCA_RNASeq2GeneNorm-20160128"]] rnaseq head(names(colData(rnaseq)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.