Week 1: Visualization for genomic data science

Related resources: Griffith lab course

Overview: base graphics, ggplot2, and graphic file formats

Displaying structure in multivariate data

Setting up the airway dataset

library(airway)
data(gse)
gse # uses gencode gene names
rownames(gse) = gsub("\\..*", "", rownames(gse)) # ENS
library(org.Hs.eg.db)
ss = select(org.Hs.eg.db, keys=rownames(gse), 
   keytype="ENSEMBL", columns=c("GENENAME", "SYMBOL"))
dr = which(duplicated(ss[,1]) | is.na(ss[,1]))
ss = ss[-dr,]
rownames(ss) = ss[,1]
ok = intersect(rownames(ss), rownames(gse))
gse = gse[ok,]
rownames(gse) = ss[rownames(gse),]$SYMBOL
gse = gse[-which(is.na(rownames(gse))),]

Visualizing one-dimensional data

We will work with a reduction of the information in the experiment, summing columns of the assay matrix. We want to present the information for rapid uptake in context. We could just use a table

data.frame(colsum=colSums(assay(gse)), donor=gse$donor, trt=gse$condition)

but the numbers are unwieldy, and patterns are hard to extract from the text.

Questions

  1. What is the interpretation of colSums(assay(gse))?

  2. Execute [when we go live the results of code chunks in problems is suppressed]

plot(colSums(assay(gse)))
text(1:8, colSums(assay(gse)), 
  labels=paste0(gse$donor, "\n", 
             gse$names, "\n", 
             substr(gse$condition,1,3)), cex=.6)

Which of the following is not true?

plot(colSums(assay(gse)), 
    pch=" ", xlim=c(0,9), axes=FALSE, xlab=" ")
text(colSums(assay(gse)), 
    label=paste0(gse$donor, "\n", gse$condition), 
    col=as.numeric(gse$donor), font=2)
axis(2)

Two of the plotted data elements have defects. What graphical parameters can be used to achieve a better rendering?

Easiest answer: ylim

plot(colSums(assay(gse)), 
    pch=" ", xlim=c(0,9), axes=FALSE, xlab=" ", ylim=c(1.25e7,3.2e7))
text(colSums(assay(gse)), 
    label=paste0(gse$donor, "\n", gse$condition), 
    col=as.numeric(gse$donor), font=2)
axis(2)

Sketching densities

Here we'll work with some RNA-seq data developed in the Cancer Genome Atlas.

suppressMessages({
library(curatedTCGAData)
ex = curatedTCGAData(c("PRAD", "ACC"), assay="RNAseq2GeneNorm", 
  dry.run=FALSE)
})
ex
cs = lapply(experiments(ex), function(x) colSums(assay(x)))
plot(density(cs[[2]]))
lines(density(cs[[1]]), lty=2)
legend(2.3e7, 4e-7, legend=c("type 1", "type 2"))

Question: What should you substitute for 'type 1' in the legend so that the tumor type on which the solid density trace is based appears in the legend? Answer should be either 'PRAD' or 'ACC'.

Question: Which axis records the depth of sequencing for the various experiments? x or y?

Visualizing categorical data

Networks for linked data

suppressPackageStartupMessages({
library(graph)
library(SingleR)
})
mon = MonacoImmuneData()
labs = unique(c(mon$label.main, mon$label.fine))
myg = new("graphNEL", nodes=labs, edgemode="directed")
myg2 = addEdge(mon$label.main, mon$label.fine, myg)
myg2

Given a set of nodes, we want to extract a sliceGraph = function(g, n) { a = lapply(n, function(x) adj(g, x)) nn = unique(c(n, unlist(a))) subGraph(nn, g) } myg3 = sliceGraph(myg2, unique(mon$label.main)[1:4]) library(igraph) ig = igraph.from.graphNEL(myg3) plot(ig, vertex.size=0, label.cex=.05, label.dist=1) ```

Multivariate visualization

Week 2: Interactive apps with Bioconductor and shiny

Week 3: Memory-sparing solutions for large data with Bioconductor

Week 4: Using python resources with Bioconductor



vjcitn/edx_adv_bioc documentation built on March 17, 2020, 3:58 p.m.