The biobroom
package contains methods for converting standard objects in
Bioconductor into a "tidy format". It serves as a complement to the popular broom
package, and follows the same division (tidy
/augment
/glance
) of tidying
methods.
"Tidy data" is a data analysis paradigm that focuses on keeping data formatted as a single observation per row of a data table. For further information, please see Hadley Wickham's seminal paper on the subject. "Tidy" is not a normative statement about the quality of an object's structure. Rather, it is a technical specification about the choice of rows and columns. A tidied data frame is not "better" than an S4 object; it simply allows analysis with a different set of tools.
Tidying data makes it easy to recombine, reshape and visualize bioinformatics analyses. Objects that can be tidied include
ExpressionSet
object, GRanges
and GRangesList
objects,RangedSummarizedExperiment
object,MSnSet
object,limma
, edgeR
, and DESeq2
,qvalue
object for multiple hypothesis testing.We are currently working on adding more methods to existing Bioconductor objects. If any bugs are found please contact the authors or visit our github page. Otherwise, any questions can be answered on the Bioconductor support site.
knitr::opts_chunk$set(fig.width=12, fig.height=6, out.width='700in', out.height='350in', echo=TRUE, warning=FALSE, message=FALSE, cache=FALSE, dev='png')
The biobroom
package can be installed by typing in a R terminal:
source("http://bioconductor.org/biocLite.R") biocLite("biobroom")
To find out more about the provided objects:
library(biobroom) ?edgeR_tidiers ?DESeq2_tidiers ?limma_tidiers ?ExpressionSet_tidiers ?MSnSet_tidiers ?qvalue_tidiers
The qvalue package
is a popular package to estimate q-values and local false discovery rates. To get started,
we can load the hedenfalk
dataset included in the qvalue
package:
library(qvalue) data(hedenfalk) qobj <- qvalue(hedenfalk$p) names(qobj)
qobj
is a qvalue
object, generated from the p-values contained in the hedenfalk
dataset. If we wanted to use a package such as dplyr
or ggplot
,
we would need to convert the results into a data frame. The biobroom
package
makes this conversion easy by using the tidy
, augment
and glance
functions:
tidy
returns one row for each choice of the tuning parameter lambda.augment
returns one row for each provided p-value, including the computed q-value and local false discovery rate.glance
returns a single row containing the estimated pi0
.Applying these functions to qobj
:
library(biobroom) # use tidy/augment/glance to restructure the qvalue object head(tidy(qobj)) head(augment(qobj)) head(glance(qobj))
The original data, or in this example the gene names, can be inputted into augment
using the data
argument:
# create sample names df <- data.frame(gene = 1:length(hedenfalk$p)) head(augment(qobj, data = df))
The tidied data can be used to easily create plots:
library(ggplot2) # use augmented data to compare p-values to q-values ggplot(augment(qobj), aes(p.value, q.value)) + geom_point() + ggtitle("Simulated P-values versus Computed Q-values") + theme_bw()
Additionally, we can extract out important information such as significant genes under a false discovery rate threshold:
library(dplyr) # Find significant genes under 0.05 threshold sig.genes <- augment(qobj) %>% filter(q.value < 0.05) head(sig.genes)
To demonstrate tidying on DESeq2
objects we have used the published airway
RNA-Seq experiment, available as a package from Bioconductor:
source("https://bioconductor.org/biocLite.R") biocLite("airway")
Import the airway
dataset:
library(DESeq2) library(airway) data(airway) airway_se <- airway
airway_se
is a SummarizedExperiment
object, which is a type of object used by the DESeq2
package. Next, we create a DESeqDataSet
object
and show the output of tidying this object:
airway_dds <- DESeqDataSet(airway_se, design = ~cell + dex) head(tidy(airway_dds))
Only the gene counts are outputted since there has been no analysis performed.
We perform an analysis on the data and then tidy
the resulting object:
# differential expression analysis deseq <- DESeq(airway_dds) results <- results(deseq) # tidy results tidy_results <- tidy(results) head(tidy_results)
As an example to show how easy it is to manipulate the resulting object,
tidy_results
, we can use ggplot2
to create a volcano plot of the p-values:
ggplot(tidy_results, aes(x=estimate, y=log(p.value), color=log(baseMean))) + geom_point(alpha=0.5) + ggtitle("Volcano Plot For Airway Data via DESeq2") + theme_bw()
Here we use the hammer
dataset included in biobroom
package. edgeR
can be used to perform differential
expression analysis as follows:
library(edgeR) data(hammer) hammer.counts <- Biobase::exprs(hammer)[, 1:4] hammer.treatment <- Biobase::phenoData(hammer)$protocol[1:4] y <- DGEList(counts=hammer.counts,group=hammer.treatment) y <- calcNormFactors(y) y <- estimateCommonDisp(y) y <- estimateTagwiseDisp(y) et <- exactTest(y)
The results of the analysis are stored in et
, which is an DGEExact
object. We can tidy
this object using biobroom
:
head(tidy(et))
glance
shows a summary of the experiment: the number of genes found significant (at a specified alpha
), and which contrasts were compared to get the results.
glance(et, alpha = 0.05)
Additionally, we can can easily manipulate the resulting object and create
a volcano plot of the p-values using ggplot2
:
ggplot(tidy(et), aes(x=estimate, y=log(p.value), color=logCPM)) + geom_point(alpha=0.5) + ggtitle("Volcano Plot for Hammer Data via EdgeR") + theme_bw()
To demonstrate how biobroom
works with limma
objects, we generate some simulated data to test the tidier for limma
objects.
# create random data and design dat <- matrix(rnorm(1000), ncol=4) dat[, 1:2] <- dat[, 1:2] + .5 # add an effect rownames(dat) <- paste0("g", 1:nrow(dat)) des <- data.frame(treatment = c("a", "a", "b", "b"), confounding = rnorm(4))
We then use lmFit
and eBayes
(functions included in limma
) to fit a linear model and use tidy
to convert the
resulting object into tidy format:
lfit <- lmFit(dat, model.matrix(~ treatment + confounding, des)) eb <- eBayes(lfit) head(tidy(lfit)) head(tidy(eb))
Analysis can easily be performed from the tidied data. The package ggplot2
can
be used to make a volcano plot of the p-values:
ggplot(tidy(eb), aes(x=estimate, y=log(p.value), color=statistic)) + geom_point() + ggtitle("Nested Volcano Plots for Simulated Data Processed with limma") + theme_bw()
tidy
can also be run directly on ExpressionSet
objects, as described in another popular Bioconductor
package Biobase.
The
hammer
dataset we used above (which is included in the biobroom
package) is an ExpressionSet
object, so we'll use that to demonstrate.
library(Biobase) head(tidy(hammer))
We can add the phenotype information by using the argument addPheno
:
head(tidy(hammer, addPheno = TRUE))
Now we can easily visualize the distribution of counts for each protocol by using
ggplot2
:
ggplot(tidy(hammer, addPheno=TRUE), aes(x=protocol, y=log(value))) + geom_boxplot() + ggtitle("Boxplot Showing Effect of Protocol on Expression")
tidy
can also be run directly on MSnSet
objects from MSnbase
, which as specialised containers for quantitative proteomics data.
library(MSnbase) data(msnset) head(tidy(msnset)) head(tidy(msnset, addPheno = TRUE))
All biobroom tidy
and augment
methods return a tbl_df
by default (this
prevents them from printing many rows at once, while still acting like a
traditional data.frame
). To change this to a data.frame
or data.table
,
you can set the biobroom.return
option:
options(biobroom.return = "data.frame") options(biobroom.return = "data.table")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.