BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
suppressPackageStartupMessages({ library(BiocEMBO2015) })
Version: r packageDescription("BiocEMBO2015")$Version
Compiled: r date()
This case study servers as a refresher / tutorial on basic input and manipulation of data.
Input a file that contains ALL (acute lymphoblastic leukemia) patient information
fname <- file.choose() ## "ALLphenoData.tsv" stopifnot(file.exists(fname)) pdata <- read.delim(fname)
fname <- "ALLphenoData.tsv" stopifnot(file.exists(fname)) pdata <- read.delim(fname)
Check out the help page ?read.delim
for input options, and explore
basic properties of the object you've created, for instance...
class(pdata) colnames(pdata) dim(pdata) head(pdata) summary(pdata$sex) summary(pdata$cyto.normal)
Remind yourselves about various ways to subset and access columns of a data.frame
pdata[1:5, 3:4] pdata[1:5, ] head(pdata[, 3:5]) tail(pdata[, 3:5], 3) head(pdata$age) head(pdata$sex) head(pdata[pdata$age > 21,])
It seems from below that there are 17 females over 40 in the data set,
but when sub-setting pdata
to contain just those individuals 19 rows
are selected. Why? What can we do to correct this?
idx <- pdata$sex == "F" & pdata$age > 40 table(idx) dim(pdata[idx,])
Use the mol.biol
column to subset the data to contain just
individuals with 'BCR/ABL' or 'NEG', e.g.,
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
The mol.biol
column is a factor, and retains all levels even after
subsetting. How might you drop the unused factor levels?
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
The BT
column is a factor describing B- and T-cell subtypes
levels(bcrabl$BT)
How might one collapse B1, B2, ... to a single type B, and likewise for T1, T2, ..., so there are only two subtypes, B and T
table(bcrabl$BT) levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1) table(bcrabl$BT)
Use xtabs()
(cross-tabulation) to count the number of samples with
B- and T-cell types in each of the BCR/ABL and NEG groups
xtabs(~ BT + mol.biol, bcrabl)
Use aggregate()
to calculate the average age of males and females in
the BCR/ABL and NEG treatment groups.
aggregate(age ~ mol.biol + sex, bcrabl, mean)
Use t.test()
to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using boxplot()
. In both cases,
use the formula
interface. Consult the help page ?t.test
and re-do
the test assuming that variance of ages in the two groups is
identical. What parts of the test output change?
t.test(age ~ mol.biol, bcrabl) boxplot(age ~ mol.biol, bcrabl)
Option 1: fastqc
Start fastqc
Select fastq.gz files from the File --> Open menu. Files are in
/mnt/nfs/practicals/day1/martin_morgan/
Press OK
Study plots and the Help -> Contents menu
Option 2: ShortRead
## 1. attach ShortRead and BiocParallel library(ShortRead) library(BiocParallel) ## 2. create a vector of file paths ## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/' fls <- dir("bigdata", pattern="*fastq.gz", full=TRUE) stopifnot(all(file.exists(fls))) ## 3. collect statistics stats <- qa(fls) ## 4. generate and browse the report browseURL(report(stats))
Check out the qa report from all lanes
## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/' load("bigdata/qa_all.Rda") browseURL(report(qa_all))
org
packages
symbol mapping
r
library(airway)
data(airway)
library(org.Hs.eg.db)
ensid <- head(rownames(airway))
mapIds(org.Hs.eg.db, ensid, "SYMBOL", "ENSEMBL")
keytypes(org.Hs.eg.db)
TxDb
packages
Easy to make your own, from GFF files
GenomicFeatures::makeTxDbFromGFF()
& friends
r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exons(txdb)
exonsBy(txdb, "tx")
p <- promoters(txdb)
BSgenome
Possible to make your own, or use other formats --
Rsamtools::FaFile()
, tracklayer::TwoBitFile()
r
library(BSgenome.Hsapiens.UCSC.hg19)
bsgenome <- BSgenome.Hsapiens.UCSC.hg19
ps <- getSeq(bsgenome, p)
ps
hist(letterFrequency(ps, "GC", as.prob=TRUE))
Example: Ensembl 'GTF' files to R / Bioconductor GRanges and TxDb
library(AnnotationHub) hub <- AnnotationHub() hub query(hub, c("Ensembl", "80", "gtf")) ## ensgtf = display(hub) # visual choice hub["AH47107"] gtf <- hub[["AH47107"]] gtf txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf)
Example: non-model organism OrgDb
packages
library(AnnotationHub) hub <- AnnotationHub() query(hub, "OrgDb")
Example: Map Roadmap epigenomic marks to hg38
Roadmap BED file as GRanges
r
library(AnnotationHub)
hub <- AnnotationHub()
query(hub , c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126 <- hub[["AH29817"]]
UCSC 'liftOver' file to map coordinates
r
query(hub , c("hg19", "hg38", "chainfile"))
chain <- hub[["AH14150"]]
lift over -- possibly one-to-many mapping, so GRanges to GRangesList
r
library(rtracklayer)
E126hg38 <- liftOver(E126, chain)
E126hg38
Integrative Genomics Viewer
Create an 'igv' directory (if it does not already exist) and add the file hg19_alias.tab to it. This is a simple tab-delimited file that maps between the sequence names used by the alignment, and the sequence names known to IGV.
Start igv.
Choose hg19 from the drop-down menu at the top left of the screen
Use File -> Load from File menu to load a bam file, e.g.,
/mnt/nfs/practicals/day1/martin_morgan/SRR1039508_sorted.bam
Zoom in to a particular gene, e.g., SPARCL1, by entering the gene symbol in the box toward the center of the browser window. Adjust the zoom until reads come in to view, and interpret the result.
mkdir -p ~/igv/genomes cp bigdata/hg19_alias.tab ~/igv/genomes/ igv
Bioconductor: we'll explore how to map between different types of identifiers, how to navigate genomic coordinates, and how to query BAM files for aligned reads.
Attach 'Annotation' packages containing information about gene
symbols r Biocannopkg("org.Hs.eg.db")
and genomic coordinates
(e.g., genes, exons, cds, transcripts) r
Biocannopkg(TxDb.Hsapiens.UCSC.hg19.knownGene)
. Arrange for the
'seqlevels' (chromosome names) in the TxDb package to match those
in the BAM files.
Use the org.*
package to map from gene symbol to Entrez gene id,
and the TxDb.*
package to retrieve gene coordinates of the
SPARCL1 gene. N.B. -- The following uses a single gene symbol, but
we could have used 1, 2, or all gene symbols in a vectorized fashion.
Attach the GenomicAlignments package for working with aligned
reads. Use range()
to get the genomic coordinates spanning the
first and last exon of SPARCL1. Input paired reads overlapping
SPARCL1.
What questions can you easily answer about these alignments? E.g., how many reads overlap this region of interest?
```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
fl <- "~/igv/genomes/hg19_alias.tab" map <- with(read.delim(fl, header=FALSE, stringsAsFactors=FALSE), setNames(V1, V2)) seqlevels(txdb, force=TRUE) <- map
sym2eg <- mapIds(org.Hs.eg.db, "SPARCL1", "ENTREZID", "SYMBOL") exByGn <- exonsBy(txdb, "gene") sparcl1exons <- exByGn[[sym2eg]]
library(GenomicAlignments)
fl <- "bigdata/SRR1039508_sorted.bam" sparcl1gene <- range(sparcl1exons) param <- ScanBamParam(which=sparcl1gene) aln <- readGAlignmentPairs(fl, param=param) ```
As another exercise we ask how many of the reads we've input are compatible with the known gene model. We have to find the transcripts that belong to our gene, and then exons grouped by transcript.
```r
txids <- select(txdb, sym2eg, "TXID", "GENEID")$TXID exByTx <- exonsBy(txdb, "tx")[txids]
hits <- findCompatibleOverlaps(query=aln, subject=exByTx) good <- seq_along(aln) %in% queryHits(hits) table(good) ```
Finally, let's go from gene model to protein coding sequence. (a) Extract CDS regions grouped by transcript, select just transcripts we're interested in, (b) attach and then extract the coding sequence from the appropriate reference genome. Translating the coding sequences to proteins.
```r
restoreSeqlevels(txdb)
txids <- mapIds(txdb, sym2eg, "TXID", "GENEID") cdsByTx <- cdsBy(txdb, "tx")[txids]
library(BSgenome.Hsapiens.UCSC.hg19) dna <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cdsByTx) protein <- translate(dna) ```
Exercises Visit the biomart web service to explore the diversity of annotation offerings available.
Load the biomaRt package and list the available marts. Choose the ensembl mart and list the datasets for that mart. Set up a mart to use the ensembl mart and the hsapiens_gene_ensembl dataset.
A biomaRt dataset can be accessed via getBM()
. In addition to
the mart to be accessed, this function takes filters and attributes as
arguments. Use filterOptions()
and listAttributes()
to discover
values for these arguments. Call getBM()
using filters and
attributes of your choosing.
Solutions
library(biomaRt) head(listMarts(), 3) ## list the marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" head(filterOptions(myFilter, ensembl), 3) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.