We will provide a detailed example of a case where data pre-processing in R results in a complex, 'untidy' data format, walking through the application of automated gating to flow cytometry data. We will demonstrate the complex initial format of this pre-processed data and then show trainees how a 'tidy' dataset can be extracted and used for further data analysis and visualization with the popular R 'tidyverse' tools. This example will use real experimental data from the research of one of our Co-Is on the immunology of tuberculosis.
Objectives. After this module, the trainee will be able to:
```r
library(magrittr)
library(tidyverse)
```
The tidyverse approach in R is based on keeping data in a dataframe structure. By keeping this common structure, the tidyverse allows for straightforward but powerful work with your data by chaining together simple, single-purpose functions. This approach is widely covered in introductory R programming courses and books; a great starting point is the book R Programming for Data Science, which is available both in print and freely online at [site]. Many excellent resources exist for learning this approach, so we won't repeat that material here. Instead, we will focus on how to interface between this approach and the object-based approach that is more common with Bioconductor packages. Bioconductor packages often take an object-based approach, with good reason, given the complexity and size of biomedical data in the early stages of pre-processing. There are also resources for learning to use specific Bioconductor packages, as well as some general resources on Bioconductor, like R Programming for Bioinformatics [ref]. However, there are fewer resources that teach how to coordinate between these two approaches within a single pipeline of code, so that you can leverage the power of Bioconductor approaches early in your pipeline, as you pre-process large and complex data, and then shift to a tidyverse approach once your data are amenable to this more straightforward style of analysis and visualization.
The heart of making this shift is learning how to convert data, when possible, from a more complex, class-type data structure (built on the flexible list data structure) to the simpler, more standardized two-dimensional dataframe structure that is required for the tidyverse approach. In this subsection, we'll cover approaches for converting your data from Bioconductor data structures to dataframes.
If you are lucky, this might be very straightforward. A pair of packages called `broom` and `biobroom` have been created specifically to facilitate the conversion of data from more complex structures to dataframes. The `broom` package was created first, by David Robinson, to convert the data stored in the objects created by fitting statistical models into tidy dataframes. Many of the functions in R that run statistical tests or fit statistical models output results in a more complex, list-based data structure. These structures have nice "print" methods, so if fitting the model or running the test is the very last step of your pipeline, you can just read the printed output from R. However, you often want to include these results in further code---for example, to create plots or tables that show results from several statistical tests or models. The `broom` package includes several functions for pulling out different pieces of the data stored in the complex structure created by fitting the model or running the test and converting those pieces into a tidy dataframe. This tidy dataframe can then easily be used in further code following a tidyverse approach.
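As a minimal sketch of this idea (using one of R's built-in datasets and a simple t-test, chosen only for illustration):

```r
library(broom)

# Run a statistical test; the result is a complex, list-based "htest" object
test_result <- t.test(extra ~ group, data = sleep)

# tidy() pulls the key pieces (estimates, test statistic, p-value, ...) into a
# one-row tidy dataframe that is easy to use in further tidyverse code
tidy(test_result)
```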
The `biobroom` package was created to meet a similar need for data stored in some of the complex structures commonly used in Bioconductor packages. [More about `biobroom`.]
[How to convert data if there isn't a `biobroom` method.] If you are unlucky, there may not be a `broom` or `biobroom` method for the particular class-based data structure that your data are in, or the data might be in a more general list rather than a specific class with a `biobroom` method. In this case, you'll need to extract the data "by hand" to move it into a dataframe once your data are simple enough to work with using a tidyverse approach. If you've mastered how to explore data stored in a list (covered in the last subsection), you'll have a head start on how to do this. Once you know where to find each element of the data within the structure of the list, you can assign those specific pieces to their own R objects using typical R assignment (e.g., with the gets arrow, `<-`, or with `=`, depending on your preferred R programming style). ...
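As a hypothetical sketch of this "by hand" approach (the list `results` and its element names are invented here purely for illustration):

```r
library(tibble)

# A made-up list, standing in for output from a more complex pipeline
results <- list(gene_ids = c("geneA", "geneB", "geneC"),
                counts   = c(120, 15, 98),
                run_date = "2021-05-01")

# Pull out the elements you need with standard assignment ...
gene_ids <- results$gene_ids
counts <- results$counts

# ... and combine them into a tidy dataframe for tidyverse-style work
results_df <- tibble(gene = gene_ids, count = counts)
results_df
```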
- Use complex data structures early and convert to the tidyverse approach when possible
- "Converting" as restructuring the data
- Converting along the way for exploration
- May "drop" some data when converting
- Accessor functions
- `broom` and `biobroom`
- "Tidy" data structures for single-cell gene expression data
These data come from a study related to vaccine development for COVID-19 [@ragan2021whole]. This study tested a COVID-19 vaccine candidate using hamsters as the animal model. It included single-cell RNA-sequencing as one of its assays, with the goal of clarifying immunological responses in the animals to COVID-19 with and without vaccination [@ragan2021whole].
One of the goals of single-cell RNA-seq is to characterize the populations of cells in a sample. All cells within an organism have the same DNA (short of some mutations). They differ and gain different functions based on how they express that genetic code. While there are "housekeeping" genes that are used by all cells, there are other genes that are expressed only by some types of cells, and others that are expressed at different levels by different types of cells. Therefore, patterns of gene expression can be used to group cells into cell types and to create a census of the types of cells within one sample compared to another. This can be used, for example, to determine if there are differences in the immune cell populations within an animal that is vaccinated versus one that is not.
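For example, once each cell has been assigned a type and those labels are stored in a tidy dataframe, comparing cell populations across groups reduces to a simple count. The dataframe and column names below (`cell_type`, `vaccinated`) are hypothetical placeholders, not the actual study data:

```r
library(dplyr)

# Hypothetical tidy dataframe: one row per cell, with its assigned type
# and the vaccination status of the animal it came from
cells <- tibble::tibble(
  cell_type  = c("T cell", "T cell", "B cell", "Monocyte", "B cell", "T cell"),
  vaccinated = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# A census of cell types within each experimental group
cells %>%
  count(vaccinated, cell_type)
```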
In the previous modules, we have talked about two topics. First, we talked about the convenience and power of tidyverse tools, which can be used at points in your workflow when the data can be stored in a simple, standard format: the tidy dataframe. Second, we talked about why there are advantages to using more complex data storage formats earlier in the process.
Work with research data will typically require a series of steps for pre-processing, analysis, exploration, and visualization. Collectively, these form a workflow or pipeline for the data analysis. With large, complex biological data, early steps in this workflow might need to use a Bioconductor approach, given the size and complexity of the data. However, this doesn't mean that you must completely give up the power and efficiency of the tidyverse approach described in earlier modules. Instead, you can combine the two, in a workflow like that shown in Figure \@ref(fig:combinedworkflow). In this combined approach, you start the workflow with a Bioconductor approach and transition when possible to a tidyverse approach, "tidying" the data from a more complex structure to a simpler dataframe along the way. In this module, we will describe how you can make this transition to create this type of combined workflow. This is a useful approach because, once your workflow has advanced to a stage where it is straightforward to store the data in a dataframe, there are large advantages to shifting into the tidyverse approach rather than continuing with more complex object-oriented classes, particularly for data analysis and visualization in the later stages of your workflow.
knitr::include_graphics("figures/workflow.png")
Two key tools have been developed as R packages to facilitate the shift of data out of a more customized object-oriented class---for example, one of the S4-type classes that we discussed when talking about complex data formats in Bioconductor. These packages move data from one of those storage containers into a tidy dataframe format. By doing this, they put the data in a format that is very easy to use in conjunction with tidyverse tools and the tidyverse approach.
In this module we will focus specifically on the `biobroom` package. Of the two packages, this one focuses on moving data out of many of the common Bioconductor classes and into tidy dataframes. The package draws on an object-oriented approach in that it provides generic functions for extracting data from many different object classes that are common in Bioconductor. You call the same function regardless of the class that the data are in; if that object class has a `biobroom` method for that generic function, then the function will be able to extract parts of the data into a tidy dataframe.
In this module we will also discuss another tool from the tidyverse---or rather, a tool that draws on the tidyverse approach---that can easily be used in conjunction with biomedical data that has been processed using Bioconductor tools. This is a package called `ggbio` that facilitates the visualization of biomedical data. It includes functions and specialized geoms (geometrical objects) that are customized for some of the tasks you might want to carry out when visualizing biomedical data in R. By drawing on tools and an approach from `ggplot2`, which is part of the tidyverse, these tools allow you to work with this data while still leveraging the powerful visualization tools and philosophy underlying the `ggplot2` package.
Finally, it is quite likely that R packages will continue to evolve, and in the future there may be tidy dataframe formats that are adaptable enough to handle earlier stages of data pre-processing. Tidyverse dataframes have already been adapted so that they can include more complex types of data within certain columns of the dataframe, in special list-type columns. This functionality is leveraged, for example, by the `sf` package to enable a tidy approach to working with geographical data. It allows those working with geographical data---for example, data from shapefiles for creating maps---to use standard tidyverse approaches while still containing the complex data needed for this geographical information. It seems very possible that similar approaches may be adopted in the near future to allow biomedical or genomic data to be stored in a way that both accommodates the complexity of early pre-processing and allows for a more natural integration with the wealth of powerful tools available through the tidyverse approach.
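To make the list-column idea concrete, here is a minimal sketch of a tibble in which one column holds a whole vector in each cell (the values are made up):

```r
library(tibble)

# A dataframe where one column is a list-column: each cell holds a
# whole vector rather than a single value
df <- tibble(
  sample = c("sample_1", "sample_2"),
  reads  = list(c(10, 52, 31), c(8, 97))
)
df
```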
## The `biobroom` package

The `biobroom` package includes three main generic functions (methods), which can be used on a number of Bioconductor object classes. When applied to an object stored in one of these Bioconductor classes, these functions will extract part of the data into a tidy dataframe format. In this format, it is easy to use the tools from the tidyverse to further explore, analyze, and visualize the data.

The three generic functions of `biobroom` are `tidy`, `augment`, and `glance`. These function names mimic the names of the three main functions in the `broom` package, which is a more general-purpose package for extracting tidy datasets from more complex R object containers. The `broom` package focuses on the output from R functions for statistical testing and modeling, while the newer `biobroom` package replicates this idea for many of the common object classes used to store data through Bioconductor packages and workflows.
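To make the parallel concrete, here is a minimal sketch of the three generics from `broom` applied to an ordinary linear model (an arbitrary example using a built-in dataset); `biobroom` follows the same pattern for Bioconductor objects:

```r
library(broom)

# Fit a simple linear model with a built-in dataset (arbitrary example)
fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)    # component level: one row per model term (estimate, p-value, ...)
augment(fit) # observation level: original data plus fitted values and residuals
glance(fit)  # model level: a one-row summary (R-squared, AIC, ...)
```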
The `biobroom` package includes methods for the following object classes:

- `qvalue` objects, which are used ...
- `DESeqDataSet` objects, which are used ...
- `DGEExact` objects, which are used ...
- `MSnSet` objects, which are used ...

As an example, we can look at how the `biobroom` package can be used to convert output generated by functions in the `edgeR` package into a tidy dataframe, and how that output can then be explored and visualized using functions from the tidyverse.
The `edgeR` package is a popular Bioconductor package that can be used on gene expression data to explore which genes are expressed differently across experimental groups (differential expression analysis) [@edgeR]. Before using the functions in the package, the data must be pre-processed to align sequence reads from the raw data and then to create a table with the counts of each read at each gene across each sample. The `edgeR` package also includes its own functions for pre-processing, including capabilities for filtering out genes with low read counts across all samples and model-based normalization across samples to help handle technical bias, including differences in sequencing depth [@chen2014edger].
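The exact calls depend on your data, but a sketch of those pre-processing steps in `edgeR` might look like the following; here `counts` and `group` are placeholder names for a count matrix and a vector of group labels, not objects defined in this module:

```r
library(edgeR)

# counts: a matrix of read counts (genes in rows, samples in columns)
# group:  a factor giving the experimental group of each sample
y <- DGEList(counts = counts, group = group)

# Filter out genes with consistently low counts across samples
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# Model-based normalization to account for differences in library size
# and RNA composition across samples
y <- calcNormFactors(y)
```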
The `edgeR` package operates on data stored in a special object class defined by the package, the `DGEList` object class [@chen2014edger]. This object class includes areas for storing the table of read counts, in the form of a matrix appropriate for analysis by other functions in the package, as well as other spots for storing information about each sample and, if needed, a space to store annotations of the genes [@chen2014edger].
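As a minimal sketch of this structure, using a tiny made-up count matrix (the numbers are meaningless and only for illustration):

```r
library(edgeR)

# A tiny made-up count matrix: 3 genes x 4 samples
counts <- matrix(c(10, 0, 45,
                   12, 2, 50,
                   90, 1, 40,
                   85, 3, 38),
                 nrow = 3,
                 dimnames = list(c("gene1", "gene2", "gene3"),
                                 c("s1", "s2", "s3", "s4")))

y <- DGEList(counts = counts,
             group = c("control", "control", "treated", "treated"))

# The object is a list with named components for the counts and sample info
names(y)
y$counts
y$samples
```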
[Example from the `biobroom` help documentation---uses the `hammer` data that come with the package. These data are stored in an `ExpressionSet` object, an object class defined by the `Biobase` package. You can see how the `tidy` function extracts these data in a tidy format. Then, the data are put in a `DGEList` class so they are in the right container for operations from `edgeR`. Then functions from the `edgeR` package are run to perform differential expression analysis on the data. The result is an object in the `DGEExact` class, which is defined by the `edgeR` package. To extract data from this class in a tidy format, you can use the `tidy` and `glance` functions from `biobroom`.]
```r
library(biobroom)
library(Biobase)
library(edgeR)

data(hammer)
class(hammer)
tidy(hammer)

## Example from `biobroom` help documentation
hammer.counts <- exprs(hammer)[, 1:4]
hammer.treatment <- phenoData(hammer)$protocol[1:4]

y <- DGEList(counts = hammer.counts, group = hammer.treatment)
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
et <- exactTest(y)

class(et)
tidy(et)
glance(et)
```
The creator of the `broom` package listed some of the common ways that statistical model output objects---the focus of 'tidying' in `broom`---tend to be untidy [@robinson2014broom]. These include:

- important information is stored in the row names, where it is harder to access;
- the names of some columns can be tricky to work with because they use non-standard conventions (i.e., they don't follow the rules for naming objects in R);
- some desired information is not available in the object itself, but rather is typically computed by later methods for the object (for example, when `summary` is run on the object) or is only available as the result of a `print` method run on the object; and
- vectors that a user may want to explore in tandem are stored in different places in the object.
[Examples of these in Bioconductor objects?]
These 'messy' characteristics show up in the data stored in Bioconductor objects as well, in terms of characteristics that impede easily working with data stored in these formats using tools from the tidyverse.

As an example, one common class for storing data in Bioconductor work is the `ExpressionSet` object class, defined in the `Biobase` package [@biobase]. This object class can be used to store the data from high-throughput assays. It includes slots for the assay data, as well as slots for storing metadata about the experiment, which could include information like sampling time points or sample strains, as well as the experimental group of each sample (control versus treated, for example).
Data from the assay for the experiment---for example, gene expression or intensity [?] measurements for each gene and each sample [?]---can be extracted from an `ExpressionSet` object using an extractor function called `exprs`. Here is an example using the `hammer` example dataset available with the `biobroom` package. The code call here extracts the assay data from the `hammer` R object, which is an instance of the `ExpressionSet` object class. It uses indexing (`[1:10, 1:3]`) to limit printing to the first ten rows and first three columns of the output, so we can investigate a small snapshot of the data:
exprs(hammer)[1:10, 1:3]
These data are stored in a matrix format. The gene identifiers [?] are given in the rownames and the samples in the column names. Each cell of the matrix provides the expression level (number of reads [?]) of a specific gene in a specific sample.
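The sample-level metadata live in a different part of the same object and have their own accessor function; for example, `pData` returns the phenotype data for an `ExpressionSet` as a dataframe:

```r
# Phenotype (sample-level) metadata stored in the same ExpressionSet object
pData(hammer)
```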
These data are structured and stored in such a way that they have some of the characteristics that can make data difficult to work with using tidyverse tools. For example, they store gene identifiers in the rownames, rather than in a separate column where they can be easily accessed when using tidyverse functions. Also, there are phenotype / metadata that are stored in other parts of the `ExpressionSet` object but that may be interesting to explore in conjunction with these assay data, including the experimental group of each sample (control versus animals in which chronic neuropathic pain was induced, in these example data).
The `tidy` function from the `biobroom` package extracts these data and restructures them into a 'tidy' format, ready to use easily with tidyverse tools.
tidy(hammer)
This output is a tidy dataframe object, with three columns providing the gene name, the sample identifier, and the expression level. In this format, the data can easily be explored and visualized with tidyverse tools. For example, you could easily create a set of histograms, one per sample, showing the distribution of expression levels across all genes in each sample: [better example visualization here?]
```r
tidy(hammer) %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ sample) +
  scale_x_continuous(trans = "log1p")
```
You can also incorporate data that are stored in the `phenoData` slot of the `ExpressionSet` object by specifying `addPheno = TRUE`:
tidy(hammer, addPheno = TRUE)
With this addition, visualizations can easily be changed to also show the experimental group of each sample:
```r
tidy(hammer, addPheno = TRUE) %>%
  ggplot(aes(x = value, color = protocol)) +
  geom_histogram() +
  facet_wrap(~ sample) +
  scale_x_continuous(trans = "log1p")
```
You can also do things like look at differences in values for specific genes, pairing tools for exploring data with tools for visualization, both from the tidyverse:
```r
library(ggbeeswarm)
library(ggrepel)

tidy(hammer, addPheno = TRUE) %>%
  filter(gene == "ENSRNOG00000004805") %>%
  ggplot(aes(x = protocol, y = value)) +
  geom_beeswarm() +
  geom_label_repel(aes(label = sample)) +
  ggtitle("Values by sample for gene ENSRNOG00000004805",
          subtitle = "Samples are labeled with their sample ID")
```
The data for a subset of the samples can be analyzed using functions from the `edgeR` package, to complete needed pre-processing (for example, calculating normalizing factors with `calcNormFactors`, to reduce impacts from technical bias [?]), estimate dispersion using conditional maximum likelihood and empirical Bayesian methods (`estimateCommonDisp` and `estimateTagwiseDisp`), and then perform a statistical analysis, conducting a differential expression analysis (`exactTest`) [@chen2014edger].
The data are stored in a special Bioconductor class, as an instance of `DGEList`, throughout most of this process. This special class can be initialized with data from the original `ExpressionSet` object---specifically, assay data with the counts per gene in each sample and data on the experimental phenotypes for the experiment, in this case the protocol for each sample, in terms of whether it was a control or whether the sample was from an animal in which chronic neuropathic pain was induced [@hammer2010mrna].
```r
## Based on example from `biobroom` help documentation
hammer.counts <- exprs(hammer)[, 1:4]
hammer.treatment <- phenoData(hammer)$protocol[1:4]

y <- DGEList(counts = hammer.counts, group = hammer.treatment)
class(y)
```
Once the data are stored in this special `DGEList` class, different steps of the pre-processing can be conducted. In each case, the results are stored in special slots of the `DGEList` object. In this way, the original data and the results from pre-processing are all kept together in a single object, each in a special slot within the object's structure.
```r
y <- calcNormFactors(y)
class(y)

y <- estimateCommonDisp(y)
class(y)

y <- estimateTagwiseDisp(y)
class(y)
```
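If you want to confirm what each step added, the results sit in named components of the same object; for example, a quick sketch of peeking into those slots:

```r
# Normalization factors added by calcNormFactors (one per sample)
y$samples$norm.factors

# Dispersion estimates added by the two estimation steps
y$common.dispersion
head(y$tagwise.dispersion)
```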
After the pre-processing, the data can be analyzed using the `exactTest` function. This takes the data stored in a `DGEList` object as input and outputs results in a different object class, a `DGEExact` object.
```r
et <- exactTest(y)
class(et)
```
The `DGEExact` class is defined by the `edgeR` package and was created specifically to store the results from a differential expression analysis [@chen2014edger]. It has a slot for a dataframe giving the estimates of differential change in expression across the experimental groups for each gene, within a `table` slot. Again, in this output, gene identifiers are stored as rownames---which makes them hard to access with tidyverse tools---rather than in their own column:
head(et$table)
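If you did not have a `biobroom` method available, you could still tidy this table by hand, for example by moving the rownames into their own column; a minimal sketch using `tibble::rownames_to_column()`:

```r
library(tibble)

# Manually move the gene identifiers out of the rownames into a column
et$table %>%
  rownames_to_column(var = "gene") %>%
  head()
```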
The `DGEExact` object also has a slot that contains a vector with identifiers for the two experimental groups being compared in the differential expression analysis, under the slot `comparison`:
et$comparison
There is also a space in this object class where information about each gene can be stored, if desired.
Two `biobroom` methods are defined for the `DGEExact` object class: `glance` and `tidy`. The `tidy` method extracts the results from the differential expression analysis, but moves these results into a dataframe where the gene names are given their own column, rather than being stored in the hard-to-access rownames:
tidy(et)
Now that the data are in this tidy format, tools from the tidyverse can easily be applied. For example, you could use functions from the `dplyr` package to see the genes for which the differential expression analysis resulted in both a very low p-value and a large difference in expression across the experimental groups:
```r
tidy(et) %>%
  filter(p.value <= 1e-20) %>%
  arrange(desc(abs(estimate)))
```
Other exploratory analysis will also be straightforward with the data using tidyverse tools, now that they are in a "tidy" format.
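For example, you could add a column of false-discovery-rate-adjusted p-values and filter on that instead; a minimal sketch:

```r
# Add FDR-adjusted p-values and keep genes below a chosen threshold
tidy(et) %>%
  mutate(fdr = p.adjust(p.value, method = "BH")) %>%
  filter(fdr < 0.05) %>%
  arrange(fdr)
```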
The `glance` method can also be applied to data that are stored in a `DGEExact` class. In this case, the method will extract the names of the experimental groups being compared (from the `comparison` slot of the object), as well as count the number of genes with statistically significant differences in expression level, based on the values in the `table` slot of the object.
```r
glance(et)
glance(et, alpha = 0.001)
```
As another example, you can now use tools from `ggplot2`, as well as extensions built on this package, to do things like create a volcano plot of the data with highlighting of noteworthy genes on the plot:
```r
library(gghighlight)

# Building on an example in the biobroom documentation
tidy(et) %>%
  ggplot(aes(x = estimate, y = p.value)) +
  geom_point() +
  scale_y_continuous(trans = "log") +
  gghighlight(p.value < 1e-150, label_key = gene)
```
If you wanted to access the help files for these `biobroom` methods for this object class, you could do so by calling help in R (`?`) using the name of the method, a dot, and then the name of the object class, e.g., `?tidy.DGEExact`.
## The `ggbio` package

"The biobroom package contains methods for converting standard objects in Bioconductor into a 'tidy format'. It serves as a complement to the popular broom package, and follows the same division (tidy/augment/glance) of tidying methods." @biobroom
package"The biobroom package contains methods for converting standard objects in Bioconductor into a 'tidy format'. It serves as a complement to the popular broom package, and follows the same division (tidy/augment/glance) of tidying methods." @biobroom
"Tidying data makes it easy to recombine, reshape and visualize bioinformatics analyses. Objects that can be tidied include: ExpressionSet object, GRanges and GRangesList objects, RangedSummarizedExperiment object, MSnSet object, per-gene differential expression tests from limma, edgeR, and DESeq2, qvalue object for multiple hypothesis testing." @biobroom
"We are currently working on adding more methods to existing Bioconductor objects." @biobroom
"All biobroom tidy and augment methods return a tbl_df by default (this prevents them from printing many rows at once, while still acting like a traditional data.frame)." @biobroom
"The concept of 'tidy data' offers a powerful framework for structuring data to ease manipulation, modeling and visualization. However, most R functions, both those builtin and those found in third-party packages, produce output that is not tidy, and that is therefore difficult to reshape, recombine, and otherwise manipulate. Here I introduce the broom package, which turns the output of model objects into tidy data frames that are suited to further analysis, manipulation, and visualization with input-tidy tools." [@robinson2014broom]
"Tools are classified as 'messy-output' if their output does not fit into this [tidy] framework. Unfortunately, the majority of R modeling tools, both from the built-in stats package and those in common third party packages, are messy-output. This means the data analyst must tidy not only the original data, but the results at each intermediate stage of an analysis." [@robinson2014broom]
"The broom package is an attempt to solve this issue, by bridging the gap from untidy outputs of predictions and estimations to create tidy data that is easy to manipulate with standard tools. It centers around three S3 methods, tidy, augment, and glance, that each take an object produced by R statistical functions (such as lm, t.test, and nls) or by popular third-party packages (such as glmnet, survival, lme4, and multcomp) and convert it into a tidy data frame without rownames (Friedman et al., 2010; Therneau, 2014; Bates et al., 2014; Hothorn et al., 2008). These outputs can then be used with input-tidy tools such as dplyr or ggplot2, or downstream statistical tests. broom should be distinguished from packages such as reshape2 and tidyr, which rearrange and reshape data frames into different forms (Wickham, 2007b, 2014b). Those packages perform essential tasks in tidy data analysis but focus on manipulating data frames in one specific format into another. In contrast, broom is designed to take data that is not in a data frame (sometimes not anywhere close) and convert it to a tidy data frame." [@robinson2014broom]
"
tidy
constructs a data frame that summarizes the model’s statistical components, which we refer to as the component level. In a regression such as the above it may refer to coefficient estimates, p-values, and standard errors for each term in a regression. The tidy generic is flexible- in other models it could represent per-cluster information in clustering applications, or per-test information for multiple comparison functions. ...augment
add columns to the original data that was modeled, thus working at the observation level. This includes predictions, residuals and prediction standard errors in a regression, and can represent cluster assignments or classifications in other applications. By convention, each new column starts with . to ensure it does not conflict with existing columns. To ensure that the output is tidy and can be recombined, rownames in the original data, if present, are added as a column called .rownames. ... Finally,glance
constructs a concise one-row summary of the model level values. In a regression this typically contains values such as R2 , adjusted R2 , residual standard error, Akaike Information Criterion (AIC), or deviance. In other applications it can include calculations such as cross validation accuracy or prediction error that are computed once for the entire model. ... These three methods appear across many analyses; indeed, the fact that these three levels must be combined into a single S3 object is a common reason that model outputs are not tidy. Importantly, some model objects may have only one or two of these methods defined. (For example, there is no sense in which a Student’s T test or correlation test generates information about each observation, and therefore no augment method exists). " [@robinson2014broom]"While model inputs usually require tidy inputs, such attention to detail doesn’t carry over to model outputs. Outputs such as predictions and estimated coefficients aren’t always tidy. For example, in R, the default representation of model coefficients is not tidy because it does not have an explicit variable that records the variable name for each estimate, they are instead recorded as row names. In R, row names must be unique, so combining coefficients from many models (e.g., from bootstrap resamples, or subgroups) requires workarounds to avoid losing important information. This knocks you out of the flow of analysis and makes it harder to combine the results from multiple models." [@wickham2014tidy]
"edgeR can be applied to differential expression at the gene, exon, transcript or tag level. In fact, read counts can be summarized by any genomic feature. edgeR analyses at the exon level are easily extended to detect differential splicing or isoform-specific differential expression." [@chen2014edger]
"edgeR provides statistical routines for assessing differential expression in RNA-Seq experiments or differential marking in ChIP-Seq experiments. The package implements exact statistical methods for multigroup experiments developed by Robinson and Smyth [33, 34]. It also implements statistical methods based on generalized linear models (glms), suitable for multifactor experiments of any complexity, developed by McCarthy et al. [22], Lund et al. [20], Chen et al. [5] and Lun et al. [19]. ... A particular feature of edgeR functionality, both classic and glm, are empirical Bayes methods that permit the estimation of gene-specific biological variation, even for experiments with minimal levels of biological replication." [@chen2014edger]
"edgeR performs differential abundance analysis for pre-defined genomic features. Although not strictly necessary, it usually desirable that these genomic features are non-overlapping. For simplicity, we will hence-forth refer to the genomic features as 'genes', although they could in principle be transcripts, exons, general genomic intervals or some other type of feature. For ChIP-seq experiments, abundance might relate to transcription factor binding or to histone mark occupancy, but we will henceforth refer to abundance as in terms of gene expression." [@chen2014edger]
"edgeR stores data in a simple list-based data object called a DGEList. This type of object is easy to use because it can be manipulated like any list in R. ... The main components of an DGEList object are a matrix counts containing the integer counts, a data.frame samples containing information about the samples or libraries, and a optional data.frame genes containing annotation for the genes or genomic features. The data.frame samples contains a column lib.size for the library size or sequencing depth for each sample. If not specified by the user, the library sizes will be computed from the column sums of the counts. For classic edgeR the data.frame samples must also contain a column group, identifying the group membership of each sample." [@chen2014edger]
"Genes with very low counts across all libraries provide little evidence for differential expression. In the biological point of view, a gene must be expressed at some minimal level before it is likely to be translated into a protein or to be biologically important. In addition, the pronounced discreteness of these counts interferes with some of the statistical approximations that are used later in the pipeline. These genes should be filtered out prior to further analysis. As a rule of thumb, genes are dropped if they can’t possibly be expressed in all the samples for any of the conditions. Users can set their own definition of genes being expressed. Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples." [@chen2014edger]
"The most obvious technical factor that affects the read counts [in data for edgeR and so requires normalizations], other than gene expression levels, is the sequencing depth of each RNA sample. edgeR adjusts any differential expression analysis for varying sequencing depths as represented by differing library sizes. This is part of the basic modeling procedure and flows automatically into fold-change or p-value calculations. It is always present, and doesn’t require any user intervention." [@chen2014edger]
"In edgeR, normalization takes the form of correction factors that enter into the statistical model. Such correction factors are usually computed internally by edgeR functions, but it is also possible for a user to supply them. The correction factors may take the form of scaling factors for the library sizes, such as computed by calcNormFactors, which are then used to compute the effective library sizes. Alternatively, gene-specific correction factors can be entered into the glm functions of edgeR as offsets. In the latter case, the offset matrix will be assumed to account for all normalization issues, including sequencing depth and RNA composition. Note that normalization in edgeR is model-based, and the original read counts are not themselves transformed. This means that users should not transform the read counts in any way before inputing them to edgeR." [@chen2014edger]
"Recent work on a grammar of graphics could be extended for biological data. The grammar of graphics is based on modular components that when combined in different ways will produce different graphics. This enables the user to construct a cominatoric number of plots, including those that were not preconceived by the implementation of the grammar. Most existing tools lack these capabilities." [@yin2012ggbio]
"A new package, ggbio, has been developed and is available on Bioconductor. The package provides the tools to create both typical and non-typical biological plots for genomic data, generated from core Bioconductor data structures by either the high-level autoplot function, or the combination of low-level components of the grammar of graphics. Sharing data structures with the rest of Bioconductor enables direct integration with Bioconductor workflows." [@yin2012ggbio]
"In ggbio, most of the functionality is available through a single command, autoplot, which recognizes the data structure and makes a best guess of the appropriate plot. ... Compared to the more general qplot API of ggplot2, autoplot facilitates the creation of specialized biological graphics and reacts to the specific class of object passed to it. Each type of object has a specific set of relevant graphical parameters, and further customization is possible through the low-level API." [@yin2012ggbio]
The `ggbio` package has `autoplot` functions for many Bioconductor object classes, including `GRangesList`, ... [@yin2012ggbio]
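As a small, hedged sketch of the `autoplot` idea, applied to a toy `GRanges` object built on the spot (the ranges are arbitrary and only for illustration):

```r
library(ggbio)
library(GenomicRanges)

# A small, made-up set of genomic ranges
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(100, 400, 900), width = c(150, 200, 50)),
              strand = c("+", "-", "+"))

# autoplot() recognizes the GRanges class and picks a sensible default plot
autoplot(gr)
```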
"The grammar [of graphics] is composed of interchangeable components that are combined according to a flexible set of rules to produce plots from a wide range of types." [@yin2012ggbio]
"Data are the first component of the grammar... The ggbio package attempts to automatically load files of specific formats into common Bioconductor data structures, using routines provided by Bioconductor packages... The type of data structure loaded from a file or returned by an algorithm depends on the intrinsic structure of the data. For example, BAM files are loaded into a GappedAlignments, while FASTA and 2bit sequences result in a DNAStringSet. The ggbio package handles each type of data structure differently... In summary, this abstraction mechanism allows ggbio to handle multiple file formats, without discarding any intrinsic properties that are critical for effective plotting." [@yin2012ggbio]
"Genomic data have some specific features that are different from those of more conventional data types, and the basic grammar does not conveniently capture such aspects. The grammar of graphics is extended by ggbio in several ways... These extensions are specific to genomic data, that is, genomic sequences and features, like genes, located on those sequences." [@yin2012ggbio]
"A geom is responsible for translating data to a visual, geometric representation according to mappings between variables and aesthetic properties on the geom. In comparison to regular data elements that might be mapped to the ggplot2 geoms of points, lines, and polygons, genomic data has the basic currency of a range. Ranges underlie exons, introns, and other features, and the genomic coordinate system forms the reference frame for biological data. We have introduced or extended several geoms for representing ranges and gaps between ranges. ... For example, the alignment geom delegates to two other geoms for drawing the ranges and gaps. These default to rectangles and chevrons, respectively. Having specialized geoms for commonly encountered entities, like genes, relegates the tedious coding of primatives, and makes use code simpler and more maintainable." [@yin2012ggbio]
"Coordinate systems locate points in space, and we use coordinate transformations to map from data coordinates to plot coordinates. The most common coordinate system in statistical graphics is cartesian. The transformation of data to cartesian coordinates involves mapping points onto a plane specified by two perpendicular axes (x and y). Why would two plots transform the coordinates differently for the same data? The first reason is to simplify, such as changing curvilinear graphics to linear, and the second reason is to reshape a graphic so that the most important information jumps out at the viewer or can be more accurately perceived. Coordinate transformations are also important in genomic data visualization. For instance, features of interest are often small compared to the intervening gaps, especially in gene models. The exons are usually much smaller than the introns. If users are generally interested in viewing exons and associated annotations, we could simply cut or shrink the intervening introns to use the plot space efficiently." [@yin2012ggbio]
"Almost all experimental outputs are associated with an experimental design and other meta-data, for example, cancer types, gender, and age. Faceting allows users to subset the data by a combination of factors and then lay out multiple plots in a grid, to explore relationships between factors and other variables. The ggplot2 package supports various types of faceting by arbitrary factors. THe ggbio package extends this notion to facet by a list of ranges of interest, for example, a list of gene regions. There is always an implicit faceting by sequence (chromosome), because when the x axis is the chromosomal coordinate, it is not sensible to plto data from different chromosomes on the same plot." [@yin2012ggbio]
"For custom use cases, ggbio provides a low-level API that maps more directly to components of the grammar and thus expresses the plot more explicitly. Generally speaking, we strive to provide sensible, overrideable defaults at the high-level entry points, such as autoplot, while still supporting customizability through the low-level API. All lower level functions have a special prefix to indicate their role in the grammalr, like layout, geom, stat, coord, and theme. The objects returned by the low-level API may be added together via the conventional + syntax. This facilitates the creation of new types of plots. A geom in ggplot2 may be extended to work with more biological data model, for example, geom rect will automatically figure out the boundary of rectangles when the data is a GRanges, as to geom bar, geom segment, and so on." [@yin2012ggbio]
"We use ggplot2 as the foundation for ggbio, due to its principled style, intellegent defaults and explicit orientation towards the grammar of graphics model." [@yin2012ggbio] [@yin2012ggbio]
When you have different types and formats of input, you need different tools to operate on them. For example, you'd (ideally) use different types of scissors depending on whether you needed to cut paper or cloth or metal [@savage2020every]. In a similar way, you can think of these "classes" of objects in R as different materials or types of inputs, and therefore a function that works when you input data in one object class won't work with a different object class. The beauty of the tidyverse set of tools is that their system enforces a common input "material"---the tidy dataframe---and as a result all the functions work on this type of input. You can always, in other words, assume that you are working with one material (say paper), and what's more, you can (almost) always assume that you'll get your output in that "material" as well. That means that you have a set of tools that works on one type of material (scissors for paper, glue for paper, tape for paper, pens for writing on paper), but you can combine them in different ways across your pipeline, because no matter how much you've cut or glued or written on your paper at any given point, it's still paper, and so you can still use paper-focused tools.