Many R extension packages for pre-processing experimental data use complex (rather than 'tidy') data formats within their code, and many output data in complex formats. Very recently, the broom and biobroom R packages have been developed to extract a 'tidy' dataset from a complex data format. These tools create a clean, simple connection between the complex data formats often used in pre-processing experimental data and the 'tidy' format required to use the 'tidyverse' tools now taught in many introductory R courses. In this module, we will describe the 'list' data structure, the common backbone for complex data structures in R, and provide tips on how to explore and extract data stored in R in this format, including through the broom and biobroom packages.
Objectives. After this module, the trainee will be able to:
Come in packages
Help files for data structures
Data structure often changes over pipeline
Generic versus structure-specific functions
How to access data in a complex structure
A tour of the Seurat data structure
Some object classes in Bioconductor:

- eSet (from Biobase)
- Sequence (from IRanges)
- MAList (from limma)
- ExpressionSet (from Biobase)
Some of the most important data structures in Bioconductor are [@huber2015orchestrating] (from Table 2 in this reference):

- ExpressionSet (Biobase package)
- SummarizedExperiment (GenomicRanges package)
- GRanges (GenomicRanges package)
- VCF (VariantAnnotation package)
- VRanges (VariantAnnotation package)
- BSgenome (BSgenome package)

Structures for sequence data
Structures for mass spectrometry data
Structures for flow cytometry data
Structures for gene expression data
Structures for single-cell gene expression data
BiocViews to find more tools
How to explore data in Bioconductor structures
While it is a bit trickier to explore your data when it is stored in a list---either a general list you created, or one that forms the base for a specialized class structure through functions from a Bioconductor package---you can certainly learn how to do this navigation. This is a powerful and critical tool for you to learn as you learn to preprocess your data in R, as you should never feel like your data is stored in a "black box" structure, where you can't peek in and explore it. You should always feel like you can take a look at any part of your data at any step in the process of preprocessing, analyzing, and visualizing it.
You can use generic functions, like `View()` and `str()`. You can use `typeof()` to determine the data type and the `is.[x]` functions (`is.logical()`, `is.character()`, `is.double()`, and `is.integer()`) to test whether data has a certain type [@wickham2019advanced].
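As a quick illustration, here is a minimal sketch of how these generic functions might be used on a small, made-up list (the object and element names here are hypothetical):

```r
# A small made-up list standing in for data you might be exploring
mouse_study <- list(site = "fort_collins",
                    weights = c(21.3, 19.8, 22.1),
                    complete = TRUE)

str(mouse_study)                    # compact overview of the structure
typeof(mouse_study)                 # "list"
typeof(mouse_study$weights)         # "double"
is.character(mouse_study$site)      # TRUE
is.logical(mouse_study$complete)    # TRUE
```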
To feel comfortable exploring your data at any stage during the preprocessing steps, you should learn how to investigate and explore data that's stored in a list structure in R. Because the list structure is the building block for complex data structures, including Bioconductor class structures, this will serve you well throughout your work. You should get in the habit of checking the structure and navigating where each piece of data is stored in the data structure at each step in preprocessing your data. Also, by checking your data throughout preprocessing, you might find that there are bits of information tucked into your data at early stages that you aren't yet using. For example, many file formats for laboratory equipment include slots for information about the equipment and its settings when running the sample. This information might be read into R from the file, but you might not know it's there for you to use if you'd like, for example to help you create reproducible reports that include this metadata about the experimental equipment and settings.
First, you will want to figure out whether your data is stored in a generic list, or if it's stored in a specific class-based data structure, which means it will have a bit more of a standardized structure. To do this, you can run the `class()` function on your data object. The output of this might be a single value (e.g., "list" [?]) or a short list. If it's a short list, it will include both the specific class of the object and, as you go down the list, the more general data structure types that this class is built on. For example, if the `class()` function returns this list:

[Example list of data types---maybe some specific class, then "list"?]

it means that the data's in a class-based structure called ..., which is built on the more general structure of a list. You can apply to this data any of the functions that are specifically built for ... data structures, but you can also apply functions built for the more general list data structure.
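For instance, here is a rough sketch of what this check might look like (the ExpressionSet example is hypothetical and assumes the Biobase package is installed):

```r
# A plain list: class() returns a single value
x <- list(a = 1:3, b = "cat")
class(x)
#> [1] "list"

# A class-based object: class() returns the specific class name.
# For example, with an ExpressionSet from Biobase (assuming it's installed):
# library(Biobase)
# eset <- ExpressionSet(assayData = matrix(rnorm(20), nrow = 5))
# class(eset)   # "ExpressionSet"
# is(eset)      # also lists the more general classes it is built on
```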
There are several tools you can use to explore data structured as lists in R. R lists can sometimes be very large---in terms of the amount of data stored in them---particularly for some types of biomedical data. With some of the tools covered in this subsection, that will mean that your first look might seem overwhelming. We'll also cover some tools, therefore, that will let you peel away levels of the data in a bit more manageable way, which you can use when you encounter list-structured data that at first feels overwhelming.
First, if your data is stored in a specific class-based data structure, there likely will also be help files specifically for the class structure that can help you navigate it and figure out where things are. [Example]
[More about exploring data in list structures.]
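As one possible sketch of this kind of exploration, the example below peels back a larger list one level at a time (the `raw_run` object and its element names are made up for illustration):

```r
# A made-up list standing in for data read from an instrument file
raw_run <- list(
  instrument = list(model = "hypothetical-MS", software_version = "2.1"),
  spectra    = replicate(3, rnorm(100), simplify = FALSE)
)

names(raw_run)               # just the names of the top-level slots
str(raw_run, max.level = 1)  # the structure, one level deep
str(raw_run$instrument)      # then drill into one slot at a time
```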
You can use the `getSlots()` function with S4 objects to see all the slots within the object.
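For example, a sketch of this (assuming the Biobase package is installed, using its ExpressionSet class):

```r
library(Biobase)

getSlots("ExpressionSet")    # named vector: slot names and the classes they must hold
slotNames("ExpressionSet")   # just the slot names
```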
How to extract data from Bioconductor structures
By using the accessor function, instead of `@`, your code will be more robust to changes that the developers make. They will be careful to ensure that the accessor function for a particular part of the data continues to work regardless of changes they make to the structure that is used to store data in objects of that class. They will be less committed, however, to keeping the same slots, in the same positions, as they develop the software. The "contract" with the user, in other words, is through the accessor function rather than through the slot name in the object.
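Here is a small sketch of the difference, using a made-up ExpressionSet (this assumes the Biobase package is installed; the accessor shown, `exprs()`, is the documented way to get the expression matrix from this class):

```r
library(Biobase)

eset <- ExpressionSet(assayData = matrix(rnorm(20), nrow = 5,
                                         dimnames = list(paste0("gene", 1:5),
                                                         paste0("sample", 1:4))))

exprs(eset)      # accessor function: the stable "contract" with users
eset@assayData   # direct slot access: exposes internal storage, which the
                 # developers may reorganize without warning
```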
Finding functions that work with a data structure
Chaining together preprocessing steps with Bioconductor structures
When you are writing scripts in R to work with your data, if you are at a point in your pipeline where you can use a "tidyverse" approach, then you will "keep" your data in a dataframe, as your data structure, throughout your work. However, at earlier stages in your preprocessing, you may need to use tools that use other data structures. It's helpful to understand the basic building blocks of R data structures, so you can find elements of your data in these other, more customized data structures.
For example, metabolomics data can be collected from the mass spectrometer with the goal of measuring levels of a large number of metabolite features in each sample. The data collected from the mass spectrometer will be very large, as these data describe the full spectra [?] measured for each sample. Through pre-processing, these data can be used to align peaks across different samples and measure the area under each peak [?] to estimate the level of each metabolite feature in each sample. This pre-processing will produce a much smaller table of data, with a structure that can be easily stored in a dataframe structure (for example, a row for each sample and a column for each metabolite feature, with the cell values giving the level of each metabolite feature in each sample). Therefore, before pre-processing, the data will be too complex and large to reasonably be stored in a dataframe structure, but instead will require a Bioconductor approach and the use of more complex data structures, while after pre-processing, the workflow can move into a tidyverse approach, centered on keeping the data in a dataframe structure.
Many R data structures are built on a general structure called a "list". The list is a useful general-purpose data structure because it is extraordinarily flexible. It is flexible in two important ways: it allows you to include data of different types in the same data structure, and it allows you to include data with different dimensions---and data stored hierarchically, including various other data structures---within the list structure. We'll cover each of these points a bit more below and describe why they're helpful in making the list a very good general-purpose data structure.
In R, your data can be stored as different types of data: whole numbers can be stored as an integer data type, continuous [?] numbers through a few types of floating-point data types, character strings as a character data type, and logical data (which can only take the two values of "TRUE" and "FALSE") as a logical data type. More complex data types can be built using these---for example, there's a special data type for storing dates that's based on a combination of an [integer?] data type, with added information counting the number of days [?] from a set starting date (called the [Unix epoch?]), January 1, 1970. (This set-up for storing dates allows them to be printed to look like dates, rather than numbers, but at the same time allows them to be manipulated through operations like finding out which date comes earliest in a set, determining the number of days between two dates, and so on.) R uses these different data types for several reasons. First, by using different data types, R can improve its efficiency [?] in storing data. Each piece of data must---as you go deep into the heart of how the computer works---be stored as a series of binary digits (0s and 1s). Some types of data can be stored using fewer of these bits (binary digits). Each measurement of logical data, for example, can be stored in a single bit, since it can only take one of two values (0 or 1, for FALSE and TRUE, respectively). Character strings can be divided into each character in the string for storage (for example, "cat" can be stored as "c", "a", "t"). There is a set of characters called the ASCII character set that includes the lowercase and uppercase letters and punctuation that you see on a standard US keyboard [?], and if a character string only uses these characters, it can be stored in [x] bits per character. For numeric data types, integers can typically be stored in [x] bits per number, while continuous [?] numbers, stored in single or double floating-point notation [?], are stored in [x] and [x] bits respectively. When R stores data in specific types, it can be more memory efficient by packing the types of data that can be stored in less space (like logical data) into very compact structures.
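A short sketch of how these types look in practice (the date example shows how R stores dates as a count of days from the starting date mentioned above):

```r
typeof(TRUE)     # "logical"
typeof(2L)       # "integer"
typeof(2.5)      # "double"
typeof("cat")    # "character"

today <- Sys.Date()
typeof(today)    # "double": under the hood, a count of days since 1970-01-01
class(today)     # "Date": the class attribute makes it print and behave like a date
today - as.Date("1970-01-01")   # the number of days since that starting date
```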
The second advantage of the list structure in R is that it has enormous flexibility in terms of storing lots of data in lots of possible places. This data can have different types and even different substructures. Some data structures in R are very constrained in what type of data they can store and what structure they use to store it. For example, one of the "building block" data structures in R is the vector. This data structure is one dimensional and can only contain data that have the same data type---you can think of this as a bead string of values, each of the same type. For example, you could have a vector that gives a series of names of study sites (each a character string), or a vector that gives the dates of time points in a study (each a date data type), or a vector that gives the weights of mice in a study (each a numeric data type). You cannot, however, have a vector that includes some study site names and then some dates and then some weights, since these should be in different data types. Further, you can't arrange the data in any structure except a straight, one-dimensional series if you are using a vector. The dataframe structure provides a bit more flexibility---you can expand into two dimensions, rather than one, and you can have different data types in different columns of the dataframe (although each column must itself have a single data type).
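A quick sketch of this difference (the site names and weights are made up):

```r
sites   <- c("fort_collins", "ames", "davis")   # character vector
weights <- c(21.3, 19.8, 22.1)                  # double vector

c(sites, weights)    # forcing them into one vector coerces everything to character

data.frame(site = sites, weight_g = weights)    # each column keeps its own type
```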
The list data structure is much more flexible. It essentially allows you to create different "slots", and you can store any type of data in each of these slots. In each slot you can store any of the other types of data structures in R---for example, vectors, dataframes, or other lists. You can even store unusual things like R environments [?] or pointers that give the directions to where data is stored on the computer without reading the data into R (and so saving room in the RAM memory, which is used when data is "ready to go" in R, but which has much more limited space than the mass [?] storage on your computer).
Since you can put a list into the slot of a list, it allows you to create deep, layered structures of data. For example, you could have one slot in a list where you store the metadata for your experiment, and this slot might itself be a list where you store one dataframe with some information about the settings of the laboratory equipment you used to collect the data, and another dataframe that provides information about the experimental design variables (e.g., which animal received which treatment). Another slot in the larger list then might have experimental measurements, and these might either be in a dataframe or, if the data are very large, might be represented through pointers to where the data is stored in memory, rather than having the data included directly in the data structure.
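Here is a rough sketch of that kind of nested list (all of the names and values are made up for illustration):

```r
experiment <- list(
  metadata = list(
    equipment = data.frame(instrument = "hypothetical-MS", ion_mode = "positive"),
    design    = data.frame(mouse_id = 1:4, treatment = c("A", "A", "B", "B"))
  ),
  measurements = matrix(rnorm(8), nrow = 4,
                        dimnames = list(paste0("mouse", 1:4),
                                        c("feature1", "feature2")))
)

str(experiment, max.level = 2)   # look at the layered structure
experiment$metadata$design       # drill down to one piece of the data
```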
Given all these advantages of the list data structure, then, why not use it all the time? While it is a very helpful building block, it turns out that its flexibility can have some disadvantages in some cases. This flexibility means that you can't always assume that certain bits of data are in a certain spot in each instance of a list in R. Conversely, if you have data stored in a less flexible structure, you can often rely on certain parts of the data always being in a certain part of the data structure. In a "tidy" dataframe, for example, you can always assume that each row represents the measurements for one observation at the unit of observation for that dataframe, and that each column gives values across all observations for one particular value that was measured for all the observations. For example, if you are conducting an experiment with mice, where a certain number of mice were sacrificed at certain time points, with their weight and the bacteria load in their lungs measured when the mouse was sacrificed, then you could store the data in a dataframe, with a row for each mouse, and columns giving the experimental characteristics for each mouse (e.g., treatment status, time point when the mouse was sacrificed), the mouse's weight, and the mouse's bacteria load when sacrificed. You could store all of this information in a list as well, but the defined, two-dimensional structure of the dataframe makes it much clearer where each piece of data goes, while you could arrange the same data in many different ways within a list.
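A sketch of that mouse experiment stored as a tidy dataframe, with one row per mouse (all values are invented for illustration):

```r
mice <- data.frame(
  mouse_id       = 1:4,
  treatment      = c("control", "control", "drug", "drug"),
  day_sacrificed = c(7, 14, 7, 14),
  weight_g       = c(21.3, 20.8, 19.9, 20.2),
  lung_cfu       = c(1e5, 3e4, 2e3, 8e2)
)
mice
```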
There is a big advantage to having stricter standards for what parts of data go where when it comes to writing functions that can be used across a lot of data. You can think of this in terms of how cars are set up versus how kitchens are set up. Cars are very standardized in the "interface" that you get when you sit down to drive them. The gas and brakes are typically floor pedals, with the gas to the right of the brake. The steering is almost always provided through a wheel centered in front of the driver's torso. The mechanism for shifting gears (e.g., forward, reverse) is typically to the right of the steering wheel, while mechanisms for features like lights and windshield wipers are typically to the left of the steering wheel. Because this interface is so standardized, you can get into a car you've never driven before and typically figure out how to drive it very quickly. You don't need a lot of time exploring where everything is or a lot of directions from someone familiar with the car to figure out where things are. Think of the last time that you drove a rental car---within five minutes, at most, you were probably able to orient yourself and figure out where everything you needed was. This is like a dataframe in R---you can pretty quickly figure out where everything you might need is stored in the data structure, and people can write functions to use with these dataframes that work well generally across lots of people's data because they can assume that certain pieces of data are in certain places.
By contrast, think about walking into someone else's kitchen and orienting yourself to use it. Kitchen designs do tend to have some general features---most will have a few common large elements, like a stove somewhere, a refrigerator somewhere, a pantry somewhere, and storage for pots, pans, and utensils somewhere. However, there is a lot of flexibility in where each of these is in the kitchen design, and further flexibility in how things are organized within each of these structures. If you cook in someone else's kitchen, it is easy to find yourself disoriented in the middle of cooking a recipe, when a utensil that you can grab almost without thinking in your own kitchen requires you to stop and search many places in someone else's kitchen. This is like a list in R---there are so many places that you can store data in a list, and so much flexibility, that you often find yourself having to dig around to find a certain element in a list data structure that someone else has created, and you often can't assume that certain pieces are in certain places if you are writing your own functions, so it becomes hard to write functions that are "general purpose" for generic list structures in R.
There is a way that list structures can be used in R that retains some of their flexibility while also leveraging some of the benefits of standardization. This is R's system for creating objects. These object structures are built on the list data structure, but each object is constrained to have certain elements of data in certain parts of the structure. These structures cannot be used as easily as dataframes in a "tidyverse" approach, since the tidyverse tools are built based on the assumption that data is stored in a tidy dataframe. However, they are used in many of the Bioconductor approaches that allow powerful tools for the earlier stages in preprocessing biological data. The types of standards that are imposed in the more specialized objects include which slots the list can have, the names they have, what order they're in (e.g., in a certain object, the metadata about the experiment might always be stored in the first slot of the list), and what structures and/or data types the data in each slot should have.
R programmers get a lot of advantages from using these classes because they can write functions under the assumption that certain pieces of the data will always be in the same spot for that type of object. There is still flexibility in the object, in that it can store lots of different types of data, in a variety of different structures. While this "object oriented" approach in R data structures does provide great advantages for programmers, and allow them to create powerful tools for you to use in R, it does make it a little trickier in some cases for you to explore your data by hand as you work through preprocessing. This is because there typically are a variety of these object classes that your data will pass through as you go through different stages of preprocessing, because different structures are suited to different stages of analysis. Functions often can only be used for a single class of objects, and so you have to keep track of which functions pair up with which classes of data. Further, it can be a bit tricky---at least in comparison to when you have data in a dataframe---to explore your data by hand, because you have to navigate through different slots in the object. By contrast, a dataframe always has the same two-dimensional, rectangular structure, and so it's very easy to navigate and explore data in this structure, and there are a large number of functions that are built to be used with dataframes, providing enormous flexibility in what you can do with data stored in this structure.
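As a small, hypothetical sketch of how a class constrains its slots (the class name and slots here are invented, not from any Bioconductor package):

```r
# Define a tiny S4 class with fixed slot names and required types
setClass("MouseExperiment",
         slots = c(metadata = "list", measurements = "data.frame"))

expt <- new("MouseExperiment",
            metadata     = list(equipment = "hypothetical scale"),
            measurements = data.frame(mouse_id = 1:2, weight_g = c(21.3, 20.8)))

expt@measurements   # the measurements always live in the slot the class defines
# new("MouseExperiment", metadata = "not a list") would throw an error,
# because the slot types are enforced by the class definition
```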
"There are four primary types of atomic vectors: logical, integer, double, and character (which contains strings). Collectively, integer and double vectors are known as numeric vectors." [@wickham2019advanced]
"... the most important family of data types in base R [is] vectors. ... Vectors come in two flavours: atomic vectors and lists. They differ in terms of their elements' types: for atomic vectors, all elements must have the same type; for lists, elements can have different types. ... Each vector can also have attributes, which you can think of as a named list of arbitrary metadata. Two attributes are particularly important. The dimension attribute turns vectors into matrices and arrays and the class attribute powers the S3 object system." [@wickham2019advanced]
"A few places in R's documentation call lists generic vectors to emphasise their difference from atomic vectors." [@wickham2019advanced]
"Some of the most important S3 vectors [are] factors, dates and times, data frames, and tibbles." [@wickham2019advanced]
"You may have noticed that the set of atomic vectors does not include a number of important data structures like matrices, arrays, factors, or date-times. These types are all built on top of atomic vectors by adding attributes." [@wickham2019advanced]
"Adding a
dim
attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array. Matrices and arrays are primarily mathematical and statistical tools, not programming tools..." [@wickham2019advanced]"One of the most important vector attributes is
class
, which underlies the S3 object system. Having a class attribute turns an object into a S3 object, which means it will behave differently from a regular vector when passed to a generic function. Every S3 object is built on top of a base type, and often stores additional information in other attributes." [@wickham2019advanced]"Lists are a step up in complexity from atomic vectors: each element can be any type, not just vectors." [@wickham2019advanced]
"Lists are sometimes called recursive vectors because a list can contain other lists. This makes them fundamentally different from atomic vectors." [@wickham2019advanced]
"The two most important S3 vectors built on top of lists are data frames and tibbles. If you do data analysis in R, you're going to be using data frames. A data frame is a named list of vectors with attributes for (column)
names
,row.names
, and its class, "data.frame"... In contrast to a regular list, a data frame has an additional constraint: the length of each of its vectors must be the same. This gives data frames their rectangular structure..." [@wickham2019advanced]"Data frames are one of the biggest and most important ideas in R, and one of the things that makes R different from other programming languages. However, in the over 20 years since their creation, the ways that people use R have changed, and some of the design decisions that made sense at the time data frames were created now cause frustration. This frustration led to the creation of the tibble [Muller and Wickham, 2018], a modern reimagining of the data frame. Tibbles are meant to be (as much as possible) drop-in replacements for data frames that fix those frustrations. A concise, and fun, way to summarise the main differences is that tibbles are lazy and surly: they do less and complain more." [@wickham2019advanced]
"Tibbles are provided by the tibble package and share the same structure as data frames. The only difference is that the class vector is longer, and includes
tbl_df
. This allows tibbles to behave differentlyin [several] key ways. ... tibbles never coerce their input (this is one feature that makes them lazy)... Additionally, while data frames automatically transform non-syntactic names (unlesscheck.names = FALSE
), tibbles do not... While every element of a data frame (or tibble) must have the same length, bothdata.frame()
andtibble()
will recycle shorter inputs. However, while data frames automatically recycle columns that are an integer multiple of the longest column, tibbles will only recycle vectors of length one. ... There is one final different:tibble()
allows you to refer to variables created during construction. ... [Unlike data frames,] tibbles do not support row names. ... One of the most obvious differences between tibbles and data frames is how they print... Tibbles tweak [a data frame's subsetting] behaviours so that a[
always returns a tibble, and a$
doesn't do partial matching and warns if it can't find a variable (this is what makes tibbles surly). ... List columns are easier to use with tibbles because they can be directly included insidetibble()
and they will be printed tidily." [@wickham2019advanced]"Since the elements of lists are references to values, the size of a list might be much smaller than you expect." [@wickham2019advanced]
"[The] behavior [of environments] is different from that of other objects: environments are always modified in place. This property is sometimes described as reference semantics because when you modify an environment all existing bindings to that environment continue to have the same reference. ... This basic idea can be used to create functions that 'remember' their previous state... This property is also used to implement the R6 object-oriented programming system..." [@wickham2019advanced]
"Use
[
to select any number of elements from a vector. ... Positive integers return elements at the specified positions. ... Negative integers exclude elements at the specified positions... Logical vectors select elements where the corresponding logical vector isTRUE
. This is probably the most useful type of subsetting because you can write an expression that uses a logical vector... If the vector is named, you can also use character vectors to return elements with matching names." [@wickham2019advanced]"Subsetting a list works in the same way as subsetting an atomic vector. Using
[
always returns a list;[[
and$
... lets you pull out elements of a list." [@wickham2019advanced]"
[[
is most important when working with lists because subsetting a list with[
always returns a smaller list. ... Because[[
can return only a single item, you must use it with either a single positive integer or a single string." [@wickham2019advanced]"
$
is a shorthand operator:x$y
is roughly equivalent tox[["y"]]
. It's often used to access variables in a data frame... The one important difference between$
and[[
is that$
does (left-to-right) partial matching [which you likely want to avoid to be safe]." [@wickham2019advanced]"There are two additional subsetting operators, which are needed for S4 objects:
@
(equivalent to$
), andslot()
(equivalent to[[
)." [@wickham2019advanced]"The environment is the data structure that powers scoping. ... Understanding environments is not necessary for day-to-day use of R. But they are important to understand because they power many important features like lexical scoping, name spaces, and R6 classes, and interact with evaluation to give you powerful tools for making domain specific languages, like
dplyr
andggplot2
." [@wickham2019advanced]"The job of an environment is to associate, of bind, a set of names to a set of values. You can think of an environment as a bag of names, with no implied order (i.e., it doesn't make sense to ask which is the first element in an environment)." [@wickham2019advanced]
"... environments have reference semantics: unlike most R objects, when you modify them, you modify them in place, and don't create a copy." [@wickham2019advanced]
"As well as powering scoping, environments are also useful data structures in their own right because they have reference semantics. There are three common problems that they can help solve: Avoiding copies of large data. Since environments have reference semantics, you'll never accidentally create a copy. But bare environments are painful to work with, so I instead recommend using R6 objects, which are built on top of environments. ..." [@wickham2019advanced]
"Generally in R, functional programming is much more important than object-oriented programming, because you typically solve complex problems by decomposing them into simple functions, not simple objects. Nevertheless, there are important reasons to learn each of the three [object-oriented programming] systems [S3, R6, and S4]: S3 allows your functions to return rich results with user-friendly display and programmer-friendly internals. S3 is used throughout base R, so it's important to master if you want to extend base R functions to work with new types of input. R6 provides a standardised way to escape R's copy-on-modify semantics. This is particularly important if you want to model objects that exist independently of R. Today, a common need for R6 is to model data that comes from a web API, and where changes come from inside or outside R. S4 is a rigorous system that forces you to thing carefully about program design. It's particularly well-suited for building large systems that evolve over time and will receive contributions from many programmers. This is why it's used by the Bioconductor project, so another reason to learn S4 is to equip you to contribute to that project." [@wickham2019advanced]
"The main reason to use OOP is polymorphism (literally: many shapes). Polymorphism means that a developer can consider a function's interface separately from its implementation, making it possible to use the same function form for different types of input. This is closely related to the idea of encapsulation: the user doesn't need to worry about details of an object because they are encapsulated behind a standard interface. ... To be more precise, OO systems call the type of an object its class, and an implementation for a specific class is called a method. Roughly speaking, a class defines what an object is and methods describe what that object can do. The class defines the fields, the data possessed by every instance of that class. Classes are organised in a hierarchy so that if a method does not exist for one class, its parent's method is used, and the child is said to inherit behaviour. ... The process of finding the correct method given a class is called method dispatch." [@wickham2019advanced]
"There are two main paradigms of object-oriented programming which differ in how methods and classes are related. In this book, we'll borrow the terminology of Extending R [Chambers 2016] and call these paradigms encapsulated and functional: In encapsulated OOP, methods belong to objects or classes, and method calls typically look like
object.method(arg1, arg2)
. This is called encapsulated because the object encapsulates both data (with fields) and behaviour (with methods), and is the paradigm found in most popular languages. In functional OOP, methods belong to generic functions, and method calls look like ordinary function calls:generic(object, arg2, arg3)
. This is called functional because from the outside it looks like a regular function call, and internally the components are also functions." [@wickham2019advanced]"S3 is R's first OOP system... S3 is an informal implementation of functional OOP and relies on common conventions rather than ironclad guarantees. This makes it easy to get started with, providing a low cost way of solving many simple problems. ... S4 is a formal and rigorous rewrite of S3... It requires more upfront work than S3, but in return provides more guarantees and greater encapsulation. S4 is implemented in the base methods package, which is always installed with R." [@wickham2019advanced]
"While everything is an object, not everything is object-oriented. This confusion arises because the base objects come from S, and were developed before anyone thought that S might need an OOP system. The tools and nomenclature evolved organically over many years without a single guiding principle. Most of the time, the distinction between objects and object-oriented objects is not important. But here we need to get into the nitty gritty details so we'll use the terms base objects and OO objects to distinguish them. ... Techinally, the difference between base and OO objects is that OO objects have a 'class attribute'." [@wickham2019advanced]
"An S3 object is a base type with at least a
class
attribute (other attributes may be used to store other data). ... An S3 object behaves differently from its underlying base type whenever it's passed to a generic (short for generic function). ... A generic function defines an interface, which uses a different implementation depending on the class of an argument (almost always the first argument). Many base R functions are generic, including the important"If you have done object-oriented programming in other languages, you may be surprised to learn that S3 has no formal definition of a class: to make an object an instance of a class, you simply set the class attribute. ... You can determine the class of an S3 object with
class(x)
, and see if an object is an instance of a class usinginherits(x, "classname")
." [@wickham2019advanced]"The job of an S3 generic is to perform method dispatch, i.e., find the specific implementation for a class." [@wickham2019advanced]
"An important new componenet of S4 is the slot, a named component of the object that is accessed using the specialised subsetting operator
@
(pronounced 'at'). The set of slots, and their classes, forms an important part of the definition of an S4 class." [@wickham2019advanced]"Given an S4 object you can see its class with
is()
and access slots with@
(equivalent to$
) andslot()
(equivalent to[[
) ... Generally, you should only use@
in your methods. If you're working with someone else's class, look for accessor functions that allow you to safely set and get slot values. ... Accessors are typically S4 generics allowing multiple classes to share the same external interface." [@wickham2019advanced]"If you're using an S4 class defined in a package, you can get help on it with
class?Person
. To get help for a method, put?
in front of a call (e.g.,?age(john)
) and?
will use the class of the arguments to figure out which help file you need." [@wickham2019advanced]"Slots [in S4 objects] should be considered an internal implementation detail: they can change without warning and user code should avoid accessing them directly. Instead, all user-accessible slots should be accompanied by a pair of accessors. If the slot is unique to the class, this can just be a function... Typically, however, you'll define a generic so that multiple classes can used the same interface" [@wickham2019advancedr]
"The strictness and formality of S4 make it well suited for large teams. Since more structure is provided by the system itself, there is less need for convention, and new contributors don't need as much training. S4 tends to require more upfront design than S3, and this investment is more likely to pay off on larger projects where greater resources are available. One large team where S4 is used to good effect is Bioconductor. Bioconductor is similar to CRAN: it's a way of sharing packages amongst a wider audient. Bioconductor is smaller than CRAN (~1,300 versus ~10,000 packages, July 2017) and the packages tend to be more tightly integrated because of the shared domain and because Bioconductor has a stricter review process. Bioconductor packages are not required to use S4, but most will because the key data structures (e.g., Summarized Experiment, IRanges, DNAStringSet) are built using S4." [@wickham2019advanced]
"The biggest challenge to using S4 is the combination of increased complexity and absence of a single source of documentation. S4 is a complex system and it can be challenging to use effectively in practice. This wouldn't be such a problem if S4 wasn't scattered through R documentation, books, and websites. S4 needs a book length treatment, but that book does not (yet) exist. (The documentation for S3 is no better, but the lack is less painful because S3 is much simpler.)" [@wickham2019advanced]
The tidyverse approach in R is based on keeping data in a dataframe structure. By keeping this common structure, the tidyverse allows for straightforward but powerful work with your data by chaining together simple, single-purpose functions. This approach is widely covered in introductory R programming courses and books. A great starting point is the book R Programming for Data Science, which is available both in print and freely online at [site]. Many excellent resources exist for learning this approach, and so we won't repeat that information here. Instead, we will focus on how to interface between this approach and the object-based approach that's more common with Bioconductor packages. Bioconductor packages often take an object-based approach, and with good reason, given the complexity and size of biomedical data at the early stages of preprocessing. There are also resources for learning to use specific Bioconductor packages, as well as some general resources on Bioconductor, like R Programming for Bioinformatics [ref]. However, there are fewer resources available online that teach how to coordinate between these two approaches in a pipeline of code, so that you can leverage the power of Bioconductor approaches early in your pipeline, as you preprocess large and complex data, and then shift to a tidyverse approach once your data is amenable to this more straightforward approach to analysis and visualization.
The heart of making this shift is learning how to convert data, when possible, from a more complex, class-type data structure (built on the flexible list data structure) to the simpler, more standardized two-dimensional dataframe structure that is required for the tidyverse approach. In this subsection, we'll cover approaches for converting your data from Bioconductor data structures to dataframes.
If you are lucky, this might be very straightforward. A pair of packages called `broom` and `biobroom` have been created specifically to facilitate the conversion of data from more complex structures to dataframes. The `broom` package was created first, by David Robinson, to convert the data stored in the objects that are created by fitting statistical models into tidy dataframes. Many of the functions in R that run statistical tests or fit statistical models output results in a more complex, list-based data structure. These structures have nice "print" methods, so if fitting the model or running the test is the very last step of your pipeline, you can just read the printed output from R. However, often you want to include these results in further code---for example, creating plots or tables that show results from several statistical tests or models. The `broom` package includes several functions for pulling out different bits of data that are stored in the complex data structure created by fitting the model or running the test and converting those pieces of data into a tidy dataframe. This tidy dataframe can then be easily used in further code using a tidyverse approach.
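For example, here is a short sketch using a simple linear model (broom's `tidy()`, `glance()`, and `augment()` functions are its core verbs):

```r
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)   # a fitted model: an "lm" object built on a list
class(fit)

tidy(fit)      # one row per model term (estimate, standard error, p-value)
glance(fit)    # a single row of model-level summaries (R-squared, AIC, ...)
augment(fit)   # the original data plus fitted values and residuals
```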
The `biobroom` package was created to meet a similar need with data stored in some of the complex structures commonly used in Bioconductor packages. [More about `biobroom`.]
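As a hedged sketch, biobroom provides `tidy()` methods for several Bioconductor classes; the ExpressionSet method shown below is one example, and the exact set of supported classes depends on the version of biobroom you have installed:

```r
library(Biobase)
library(biobroom)

eset <- ExpressionSet(assayData = matrix(rnorm(20), nrow = 5,
                                         dimnames = list(paste0("gene", 1:5),
                                                         paste0("sample", 1:4))))

tidy(eset)   # a tidy data frame with one row per gene-sample combination
```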
[How to convert data if there isn't a `biobroom` method.] If you are unlucky, there may not be a `broom` or `biobroom` method that you can use for the particular class-based data structure that your data's in, or it might be in a more general list, rather than a specific class with a `biobroom` method. In this case, you'll need to extract the data "by hand" to move it into a dataframe once your data is simple enough to work with using a tidyverse approach. If you've mastered how to explore data stored in a list (covered in the last subsection), you'll have a head start on how to do this. Once you know where to find each element of the data in the structure of the list, you can assign these specific pieces to their own R objects using typical R assignment (e.g., with the gets arrow, `<-`, or with `=`, depending on your preferred R programming style). ...
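A sketch of this "by hand" extraction (the `peak_results` list and its element names are hypothetical):

```r
# A made-up list standing in for output from a preprocessing function
peak_results <- list(
  feature_ids = c("mz101.2_rt35", "mz205.6_rt60"),
  peak_areas  = c(1.8e6, 4.2e5),
  settings    = list(alignment = "hypothetical-method")
)

feature_ids <- peak_results$feature_ids       # assign the pieces you need ...
peak_areas  <- peak_results[["peak_areas"]]

tidy_peaks <- data.frame(feature = feature_ids,   # ... then combine them into a dataframe
                         area    = peak_areas)
tidy_peaks
```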
[Comparison of complexity of biological systems versus complexity of code and algorithms for data pre-processing---for the latter, nothing is unknowable or even unknown. Someone somewhere is guaranteed to know exactly how it works, what it's doing, and why. By contrast, with biological systems, there are still things that no one anywhere completely understands. It's helpful to remember that all code and algorithms for data pre-processing are knowable, and that the details are all there if and when you want to dig in to figure out what's going on.]
[There are ways to fully package up and save the computer environment used to run a pipeline of pre-processing and analysis, including any system settings, all different software used in analysis steps, and so on. Some of the approaches that are being explored for this include the use of "containers", including Docker containers. This does allow, typically, for full reproducibility of the workflow. However, this approach isn't very proactive in emphasizing the robustness of a workflow or its comprehensibility to others---instead, it makes the workflow reproducible by putting everything in a black box that must be carefully unpackaged and explored if someone wants to understand or adapt the pipeline.]
"Object-oriented design doesn't have to be over-complicated design, but we've observed that too often it is. Too many OO designs are spaghetti-like tangles of is-a and has-a relationships, or feature thick layers of glue in which many of the objects seem to exist simply to hold places in a steep-sided pyramid of abstractions. Such designs are the opposite of transparent; they are (notoriously) opaque and difficult to debug." [@raymond2003art]
"Unix programmers are the original zealots about modularity, but tend to go about it in a quiter way [that with OOP]. Keeping glue layers thin is part of it; more generally, our tradition teaches us to build lower, hugging the ground with algorithms and structures that are designed to be simple and transparent." [@raymond2003art]
"A standard is a precise and detailed description of how some artifact is built or is supposed to work. Examples of software standards include programming languages (the definition of syntax and semantics), data formats (how information is represented), algorithmic processing (the steps necessary to do a computation), and the like. Some standards, like the Word
.doc
file format, are de facto standards---they have no official standing but everyoen uses them. The word 'standard' is best reserved for formal descriptions, often developed and maintained by a quasi-neutral party like a government or a consortium, that define how something is built or operated. The definition is sufficiently complete and precise that separate entities can interact or provide independent implementations. We benefit from hardware standards all the time, though we may not notice how many there are. If I buy a new television set, I can plug it inot the electrical outlets in my home thanks to standards for the size and shape of plugs and the voltage they provide. The set itself will receive signals and display pictures because of standards for broadcast and cable television. I can plug other devices into it through standard cables and connectors like HDMI, USB, S-video and so on. But every TV needs its own remote control and every cell phone needs a different charger because those have not been standardized. Computing has plenty of standards as well, including character sets like ASCII and Unicode, programming languages like C and C++, algorithms for encryption and compression, and protocols for exchanging information over networks." [@kernighan2011d]"Standards are important. They make it possible for independently created things to cooperate, and they open an area to competition from multiple suppliers, while proprietary systems tend to lock everyone in. ... Standards have disadvantages, too---a standard can impede progress if it is inferior or outdated yet everyone is forced to use it. But these are modest drawbacks compared to the advantages." [@kernighan2011d]
"A class is a blueprint for constructing a particular package of code and data; each variable created according to a class's blueprint is known as an object of that class. Code outside of a class that creates and uses an object of that class is known as a client of the class. A class declaration names the class and lists all of the members, or items inside that class. Each item is either a data member---a variable declared within the class---or a method (also known as a member function), which is a function declared within the class. Member functions can include a special type called a constructor, which has the same name as the class and is invoked implicitly when an object of the class is declared. In addition to the normal attributes of a variable or function declaration (such as type, and for functions, the parameter list), each member has an access specifier, which indicates what functions can access the member. A public member can be accessed by any code using the object: code inside the class, a client of the class, or code in a subclass, which is a class that 'inherits' all the code and data of an existing class. A private member can be accessed only by the code inside the class. Protected members ... are similar to private members, except that methods in subclasses can also reference them. Both private and protected members, though, are inaccessible from client code." [@spraul2012think]
"An object should be a meaningful, closely knit collection of data and code that operates on the data." [@spraul2012think]
"Recognizing a situation in which a class would be useful is essential to reaching the higher levels of programming style, but it's equally important to recognize situations in which a class is going to make things worse." [@spraul2012think]
"The word encapsulation is a fancy way of saying that classes put multiple pieces of data and code together in a single package. If you've ever seen a gelatin medicine capsule filled with little spheres, that's a good analogy: The patient takes one capsule and swallows all the individual ingredient spheres inside. ... From a problem-solving standpoint, encapsulation allows us to more easily reuse the code from previous problems to solve current problems. Often, even though we have worked on a problem similar to our current project, reusing what we learned before still takes a lot of work. A fully encapsulated class can work like an external USB drive; you just plug it in and it works. FOr this to happen, though, we must design the class correctly to make sure that the code and data is truly encapsulated and as independent as possible from anything outside of the class. For example, a class that references a global variable can't be copied into a new project without copying the global variable, as well." [@spraul2012think]
"Beyond reusing classes from one program to the next, classes offer the potential for a more immediate form of code reuse: inheritance. ... Using inheritance, we create parent classes with methods common to two or more child classes, thereby 'factoring out' not just a few lines of code [as with helper functions in procedural code] but whole methods." [@spraul2012think]
"One technique we're returned to again and again is dividing a complex problem into smaller, more manageable pieces. Classes are great at dividing programs up into functional units. Encapsulation not only holds data and code together in a reusable package; it also cordons off that data and code from the rest of the program, allowing us to work on that class, and everything else separately. The more classes we make in a program, the greater the problem-dividing effect." [@spraul2012think]
"Some people use the terms information hiding and encapsulation interchangeable, but we'll separate the ideas here. As described previously ..., encapsulation is packaging data and code together. Information hiding means separating the interface of a data structure---the definition of the operations and their parameters---from the implementation of a data structure, or the code inside the functions. If a class has been written with information hiding as a goal, then it's possible to change the implementation of the methods without requiring any changes in the client code (the code that uses the class). Again, we have to be clear on the term interface; this means not only the name of the methods and their parameter list but also the explanation (perhaps expressed in code documentation) of what the different methods do. When we talk about changing the implementation without changing the interface, we mean that we change how the class methods work but not what they do. Some programming authors have referred to this as a kind of implicit contract between the class and the client: The class agrees never to change the effects of existing operations, and the client agrees to use the class strictly on the basis of its interface and to ignore any implementation details." [@spraul2012think]
"So how does information hiding affect problem solving? The principle of information hiding tells the programmer to put aside the class implementation details when working on the client code, or more broadly, to be concerned about a particular class's implementation only when working inside that class. When you can put implementation details out of your mind, you can eliminate distracting thoughts and concentrate on solving the problem at hand." [@spraul2012think]
"A final goal of a well-designed class is expressiveness, or what might be broadly called writability---the ease with which code can be written. A good class, once written, makes the rest of the code simpler to write in the way that a good function makes code simpler to write. Classes effectively extend a language, becoming high-level counterparts to basic low-level features such as loops, if statements, and so forth. ... With classes, programming actions that previously took many steps can be done in just a few steps or just one." [@spraul2012think]
"Right now, in labs across the world, machines are sequencing the genomes of the life on earth. Even with rapidly decreasing costs and huge technological advancements in genome sequencing, we're only seeing a glimpse of the biological information contained in every cell, tissue, organism, and ecosystem. However, the smidgen of total biological information we're gathering amounts to mountains of data biologists need to work with. At no other point in human history has our ability to understand life's complexities been so dependent on our skills to work with and analyze data." [@buffalo2015bioinformatics]
"Bioinformaticians are concerned with deriving biological understanding from large amounts of data with specialized skills and tools. Early in biology's history, the datasets were small and manageable. Most biologists could analyze their own data after taking a statistics course, using Microsoft Excel on a personal desktop computer. However, this is all rapidly changing. Large sequencing datasets are widespread, and will only become more common in the future. Analyzing this data takes different tools, new skills, and many computers with large amounts of memory, processing power, and disk space." [@buffalo2015bioinformatics]
"In a relatively short period of time, sequencing costs dropped drastically, allowing researchers to utilize sequencing data to help answer important biological questions. Early sequencing was low-throughput and costly. Whole genome sequencing efforts were expensive (the human genome cost around $2.7 billion) and only possible through large collaborative efforts. Since the release of the human genome, sequencing costs have decreased explonentially until about 2008 ... With the introduction of next-generation sequencing technologies, the cost of sequencing a megabase of DNA dropped even more rapidly. At this crucial point, a technology that was only affordable to large collaborative sequencing efforts (or individual researchers with very deep pockets) became affordable to researchers across all of biology. ... What was the consequence of this drop in sequencing costs due to these new technologies? As you may have guessed, lots and lots of data. Biological databases have swelled with data after exponential growth. Whereas once small databases shared between collaborators were sufficient, now petabytes of useful data are sitting on servers all over the world. Key insights into biological questions are stored not just in the unanalyzed experimental data sitting on your hard drive, but also spinning around a disk in a data center thousands of miles away." [@buffalo2015bioinformatics]
"To make matters even more complicated, new tools for analyzing biological data are continually being created, and their underlying algorithms are advancing. A 2012 review listed over 70 short-read mappers ... Likewise, our approach to genome assembly has changed considerably in the past five years, as methods to assemble long sequences (such as overlap-layout-consensus algorithms) were abandoned with the emergence of short high-throughput sequencing reads. Now, advances in sequencing chemistry are leading to longer sequencing read lengths and new algorithms are replacing others that were just a few years old. Unfortunately, this abundance and rapid development of bioinformatics tools has serious downsides. Often, bioinformatics tools are not adequately benchmarked, or if they are, they are only benchmarked in one organism. This makes it difficult for new biologists to find and choose the best tool to analyze their data. To make matters more difficult, some bioinformatics programs are not actively developed so that they lose relevance or carry bugs that could negatively affect results. All of this makes choosing an appropriate bioinformatics program in your own research difficult. More importantly, it's imperative to critically assess the output of bioinformatics programs run on your own data." [@buffalo2015bioinformatics]
"With the nature of biological data changing so rapidly, how are you supposed to learn bioinformatics? With all of the tools out there and more continually being created, how is a biologist supposed to know whether a program will work appropriately on her organism's data? The solution is to approach bioinformatics as a bioinformatician does: try stuff, and assess the results. In this way, bioinformatics is just about having the skills to experiment with data using a computer and understanding your results. The experimental part is easy: this comes naturally to most scientists. The limiting factor for most biologists is having the data skills to freely experiment and work with large data on a computer." [@buffalo2015bioinformatics]
"Unfortunately, many of the biologist's common computational tools can't scale to the size and complexity of modern biological data. Complex data formats, interfacing numerous programs, and assessing software and data make large bioinformatics datasets difficult to work with." [@buffalo2015bioinformatics]
"In 10 years, bioinformaticians may only be using a few of the bioinformatics software programs around today. But we most certainly will be using data skills and experimentation to assess data and methods of the future." [@buffalo2015bioinformatics]
"Biology's increasing use of large sequencing datasets is changing more that the tools and skills we need: it's also changing how reproducible and robust our scientific findings are. As we utilize new tools and skills to analyze genomics data, it's necessary to ensure that our approaches are still as reproducible and robust as any other experimental approaches. Unfortunately, the size of our data and the complexity of our analysis workflows make these goals especially difficult in genomics." [@buffalo2015bioinformatics]
"The requisite of reproducibility is that we share our data and methods. In the pre-genomics era, this was much easier. Papers coule include detailed method summaries and entire datasets---exactly as Kreitman's 1986 paper did with a 4,713bp Adh gene flanking sequence (it was embedded in the middle of the paper). Now papers have long supplementary methods, code, and data. Sharing data is no longer trivial either, as sequencing projects can include terabytes of accompanying data. Reference genomes and annotation datasets used in analyses are constantly updated, which can make reproducibility tricky. Links to supplemental materials, methods, and data on journal websites break, materials on faculty websites disappear when faculty members move or update their sites, and software projects become stale when developers leave and don't update code. ... Additionally, the complexity of bioinformatics analyses can lead to findings being susceptible to errors and technical confounding. Even fairly routine genomics projects can use dozens of different programs, complicated input paramter combinations, and many sample and annotation datasets; in addition, work may be spread across servers and workstations. All of these computational data-processing steps create results used in higher-level analyses where we draw our biological conclusions. The end result is that research findings may rest on a rickety scaffold of numerous processing steps. To make matters worse, bioinformatics workflows and analyses are usually only run once to produce results for a publication, and then never run or tested again. These analyses may rely on very specific versions of all software used, which can make it difficult to reproduce on a different system. In learning bioinformatics data skills, it's necessary to concurrently learn reproducibility and robust best practices." [@buffalo2015bioinformatics]
"When we are writing code in a programming language, we work most of the time with RAM, combining and restructuring data values to produce new values in RAM. ... The computer memory in RAM is a series of 0's and 1's, just like the computer memory used to store files in mass storage. In order to work with data values, we need to get those values into RAM in some format. At the basic level of representing a single number or a single piece of text, the solution is the same as it was in Chapter 5 [on file formats for mass storage]. Everything is represented as a pattern of bits, using various numbers of bytes for different sorts of values. In R, in an English locale, and on a 32-bit operating system, a single character usually takes up one byte, an integer takes up four bytes, and a real number 8 bytes. Data values are stored in different ways depending on the data type---whether the values are numbers or texts." [@murrell2009introduction]
"ALthough we do not often encounter the details of the memory representation, except when we need a rough estimate of how much RAM a data set might require, it is important to keep in mind what sort of data type we are working with because the computer code that we will produce different results for different data types. For example, we can only calculate an average if we are dealing with values that have been stored as text." [@murrell2009introduction]
"Another important issue is how collections of values are stored in memory. The tasks that we will consider will typically involve working with an entire data set, or an entire variable from a data set, rather than just a single value, so we need to have a way to represent several related values in memory. This is similar to the problem of deciding on a storage format for a data set... However, rather than talking about different file formats, [in this case] we will talk about different data structures for storing a collection of data values in RAM. ... It will be important to always keep close in our minds what data type we are working with and what sort of data structure we are working with." [@murrell2009introduction]
"Every individual data value has a data type that tells us what sort of value it is. The most common data types are numbers, which R calls numeric values, and text, which R calls character values." [@murrell2009introduction]
"Vectors: A collection of values that all have the same data type. The elements of a vector are all numbers, giving a number vector, or all character values, giving a character vector." [@murrell2009introduction]
"Data frames: A collection of vectors that all have the same length. This is like a matrix, except that each column can contain a different data type." [@murrell2009introduction]
"Lists: A collection of data structures. The components of a list can be simply vectors---similar to a data frame, but with each column allowed to have a different length. However, a list can also be a much more complicated structure. This is a very flexible data structure. Lists can be used to store any combination of data values together." [@murrell2009introduction]
"Notice the way that lists are displayed. The first component of the list starts with the component indes,
[[1]]
, followed by the contents of this component...The second component of the list starts with the component index[[2
]] followed by the contenets of this component..." [@murrell2009introduction]"A list is a very flexible data structure. It can have any number of components, each of which can be any data structure of any length or size. A simple example is a data-frame-like structure where each column can have a different length, but much more complex structures are also possible. For example, it is possible for a component of a list to be another list." [@murrell2009introduction]
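For example, printing a two-component list shows each component under its index:

```r
l <- list(c(1, 5, 9), "a single character value")
l
## [[1]]
## [1] 1 5 9
##
## [[2]]
## [1] "a single character value"
```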
"Anyone who has worked with a computer should be familiar with the idea of a list containing another list because a directory or folder of files has this sort of structure: a folder contains multiple files of different kinds and sizes and a folder can contain other folders, which can contain more files or even more folders, and so on. Lists allow for this kind of hierarchical structure." [@murrell2009introduction]
"One of the most basic ways that we can manipulate data structures is to subset them---select a smaller portion from a larger data structure. This is analogous to performing a query on a database. ... R has very powerful mechanisms for subsetting... A subset from a vector may be obtained by appending an index within square brackets to the end of a symbol name. ... The index can be a vector of any length ... The index does not have to be a contiguous sequence, and it can include repetitions... As well as using integers for indices, we can use logical values... A data frame can also be indexed using square brackets, though slightly differently because we have to specify both which rows and which columns we want ... When a data structure has named components, a subset may be selected using those names." [@murrell2009introduction]
"Single square bracket subsetting on a data frame is like taking an egg container that contains a dozen eggs and chopping up the container so that we are left with a smaller egg container that contains just a few eggs. Double square bracket subsetting on a data frame is like selecting just one egg from an egg container." [@murrell2009introduction]
"We can often get some idea of what sort of data structure we are working with by simply viewing how the data are displayed on screen. However, a more definitive answer can be obtained by calling the
class()
function. ... Many R functions return a data structure that is not one of the basic data structures that we have already seen [like the 'xtabs' and 'table' classes]. ... We have not seen either of these data structures before. However, much of what we know about working with the standard data structures ... will work with any new class that we encounter. For example, it is usually possible to subset any class using the standard square bracket syntax. ... Where appropriate, arithmetic and comparisons will also generally work... Furthermore, if necessary, we can ofter resort to coercing a class to something more standard and familiar." [@murrell2009introduction]"Dates are an important example of a special data structure. Representing dates as just text is convenient for humans to view, but other representations are better for computers to work with. ... Having a special class for dates means that we can perform tasks with dates, such as arithmetic and comparisons, in a meaningful way, something we could not do if we stored the date as just a character value." [@murrell2009introduction]
"The Date class stores date values as integer values, representing the number of days since January 1st 1970, and automatically converts the numbers to a readable text value to display the dates on the screen." [@murrell2009introduction]
"When working with anything but tiny data sets, basic reatures of the data set cannot be determined by just viewing the data values. [There are] a number of functions that are useful for obtaining useful summary features from a data structure. The
summary()
function produces summary information for a data structure... Thelength()
function is useful for determining the number of values in a vector or the number of components in a list. ... Thestr()
function (short for 'structure') is useful when dealing with large objects because it only shows a sample of the values in each part of the object, although the display is very low-level so it may not always make things clearer. ... Another function that is useful for inspecting a large object is thehead()
function. This shows just the first few elemeents of an object, so we can see the basic structure without seeing all of the values." [@murrell2009introduction]"Generic functions ... will accept many different data structures as arguments. ... a generic function adapts itself to the data structure it is given. Generic functions do different things when given different data structures." [@murrell2009introduction]
"An example of a generic function is the
summary()
function. The result of a call tosummary()
sill depend on what sort of data structure we provide." [@murrell2009introduction]"Generic functions are another reason why it is easy to work with data in R; a single function will produce a sensible result no matter what data structure we provide. However, generic functions are also another reason why it is so important to be aware of what data structures we are working with. Without knowing what sort of data we are using, we cannot know what sort of result to expect from a generic function." [@murrell2009introduction]
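A short base-R illustration of how these inspection functions adapt to the structure they are given (the objects are invented for illustration):

```r
num_vec <- rnorm(100)
df <- data.frame(group = rep(c("a", "b"), 50), value = num_vec)

summary(num_vec)   # quartiles and mean for a numeric vector
summary(df)        # per-column summaries for a data frame
length(num_vec)    # 100: number of values in the vector
length(df)         # 2: number of components (columns) in the data frame
str(df)            # compact overview of the structure
head(df)           # the first six rows
```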
"R has become very popular and is now being used for projects that require substantial software engineering as well as its continued widespread use as an interactive environment for data analysis. This essentially means that there are two masters---reliability and ease of use. S3 is indeed easy to use, but can be made unreliable through nothing other than bad luck, or a poor choice of names, and hence is not a suitable paradigm for constructing large systems. S4, on the other hand, is better suited for developing large software projects but has an increased complexity of use." [@gentleman2008r]
"Object-oriented programming has become a widely used and valuable tool for software engineering. Much of its value derives from the fact that it is often easier to design, write, and maintain software when there is some clear separation of the data representation from the operations that are to be performed on it. In an OOP system, real physical things ... are generally represented by classes, and methods (functions) are written to handle the different manipulations that need to be performed on the objects." [@gentleman2008r]
"The views that many people have of OOP have been based largely on exposure to languages like Java, where the system can be described as class-centric. In a a class-centric system, classes define objects and are repositories for the methods that act on those objects. In contrast, languages such as ... R separate the class specification from the specification of generic functions, and could be described as function-centric systems." [@gentleman2008r]
"The genome of every organism is encoded in chromosomes that consist of either DNA or RNA. High throughput sequencing technology has made it possible to determine the sequence of the genome for virtually any organism, and there are many that are currently available. ... However, in many cases, either the exact nucleotide at any location is unknown, or is variable, and the International Union of Pure and Applied Chemistry (IUPAC) has provided a standard nomenclature suitable for representing such sequences. The alphabet for dealing with protein sequences is based on the 20 amino acids. ... The basic class used to hold strings [in the Biostrings package] is the BString class, which has been designed to be efficient in its handling of large character strings. Subclasses include DNAString, RNAString, and AAString (for holding amino acid sequences). The BStringViews class holds a set of views on a single BString instance; each view is essentially a substring of the underlying BString instance. Alignments are stored using the BStringAlign class." [@gentleman2008r] [More on functions that work with these classes on p. 171]
"A number of complete genomes, represented as DNAString objects, are provided through the Bioconductor project. They rely on the infrastructure in the BSgenome package, and all such packages have names that begin with
BSgenome
. You can find the list of available genomes using theavailable.genomes
function." [@gentleman2008r]"Atomic vectors are the most basic of all data structures. An atomic vector contains some number of values of the same type; that number could be zero. Atomic vectors can contain integers, doubles, logicals or character strings. Both complex numbers and raw (pure bytes) have atomic representations ... Character vectors in the S language are vectors of character strings, not the vectors of characters. For example, the string 'super' would be represented as a character vector of length one, not of lenth five..." [@gentleman2008r]
"Lists can be used to store items that are not all of the same type. ... Lists are also referred to as generic vectors since they share many of the properties of vectors, but the elements are allowed to have different types." [@gentleman2008r]
"Lists can be of any length, and the elements of a list can be named, or not. Any R object can be an element of a list, including another list..." [@gentleman2008r]
"A
data.frame
is a special kind of list. Data frames were created to provide a common structure for storing rectangular data sets and for passing them to different functions for modeling and visualization. In many cases a data set can be thought of as a rectangular structure with rows corresponding to cases and columns corresponding to the different variables that were measured on each of the cases. One might think that a matrix would be the appropriate representation, but that is only true if all of the variables are of the same type, and this is seldom the case." [@gentleman2008r]"[Data frames] are essentially a list of vectors, with one vector for each variable. It is an error if the vectors are not all of the same length." [@gentleman2008r]
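A small base-R illustration of a data frame behaving as a list of equal-length vectors:

```r
df <- data.frame(sample = c("s1", "s2", "s3"),
                 conc   = c(0.12, 0.53, 0.31))
is.list(df)   # TRUE: a data frame is a special kind of list
length(df)    # 2: one component per column
df$conc       # one column extracted as a vector
# data.frame(a = 1:3, b = 1:2) would fail: the columns must have equal lengths
```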
"Sometimes it will be helpful to find out about an object. Obvious functions to try are
class
andtypeof
. But many find that bothstr
andobject.size
are more useful. ... The functionshead
andtail
are convenience functions that list the first few, or the last few, rows of a matrix." [@gentleman2008r]"The S langauge has its roots in the Algol family of languages and has adopted some of the general vector subsetting and subscripting techniques that were available in languages such as APL. This is perhaps one area wehre programmers more familiar with other languages fail to make appropriate use of the available functionality. ... There are slight differents between subsetting of vectors, arrays, lists, data.frames, and enviroments that can sometimes catch the unwary. But there are also many commonalities. ... Subsetting can be carried out by three different operators: the single square bracket
[
, the double square bracket[[
, and the dollar,$
. We note that each of these three operators are actually generic functions and users can write methods that extend and override them... One way of describing the behavior of the single bracket operator is that the type of the return value matches the type of the value it is applied to. Thus, a single bracket subset of a list is a list itself. ... Both[[
and$
extract a single value. There are some differences between the two;$
does not evaluate its second argument while[[
does, and hence one can use expressions. The$
operator uses partial matching when extracting named elements but[
and[[
do not." [@gentleman2008r]"Subsetting plays two roles in the S language. One is an extraction role, where a subset of a vector is identified by a set of supplied indices and the resulting subset is returned as a value. Venables and Ripley (2000) refer to this as indexing. The second purpose is subset assignment, where the goal is to identify a subset of values that should have their values changed; we call this subset assignment." [@gentleman2008r]
"There are four basic types of subscript indices: positive integers, negative integers, logical vectors, and character vectors. These four types cannot be mixed... For matrix and array subscripting, one can use different types of subscripts for the different dimensions. Not all vectors, or recursive objects, support all types of subscripting indices. For example, atomic vectors cannot be subscripted using
$
, while environments cannot be subscripted using[
." [@gentleman2008r]"In bioinformatics, the plain-text data we work with is often encoded in ASCII. ASCII is a character encoding scheme that uses 7 bits to represent 128 different values, including letters (upper- and lowercase), numbers, and special nonvisible characters. While ASCII only uses 8 bits, nowadays computers use an 8-bit byte (a unit representing 8 bits) to store ASCII characters. More information about ASCII is available in your terminal through
man ascii
." [@buffalo2015bioinformatics]"Some files will have non-ASCII encoding schemes, and may contain special characters. The most common character encoding scheme is UTF-8, which is a superset of ASCII but allows for special characters." [@buffalo2015bioinformatics]
"Bioinformatics data is often text---for example, the As, Cs, Ts, and Gs in sequencing read files or reference genomes, or tab-delimited files fo gene coordinates. The text data in bioinformatics is often large, too (gigabytes or more that can't fit into your computer's memory at once). This is why Unix's philosophy of handling text streams is useful to bioinformatics: text streams allow us to do processing on a stream of data rather than holding it all in memory." [@buffalo2015bioinformatics]
"Exploratory data analysis plays an integral role throughout an entire bioinformatics project. Exploratory data analysis skills are just as applicable in analyzing intermediate bioinformatics data (e.g., are fewer reads from this sequencing lane aligning?) as they are in making sense of results from statistical analyses (e.g., what's the distribution of these p-values, and do they correlate with possible confounders like gene length?). These exploratory analyses need not be complex or exceedingly detailed (many patterns are visible with simple analyses and visualization); it's just about wanting to look into the data and having the skill set to do so." [@buffalo2015bioinformatics]
"Functions like
table()
are generic---they are designed to work with objects of all kinds of classes. Generic functions are also designed to do the right thing depending on the class of the object they're called on (in programming lingo, we say that the function is polymorphic)." [@buffalo2015bioinformatics]"It's quite common to encounter genomics datasets that are difficult to load into R because they're large files. This is either because it takes too long to load the entire dataset into R, or your machine simply doesn't have enough memory. In many cases, the best strategy is to reduce the size of your data somehow: summarizing data in earlier processing steps, omitting unnecessary columns, splitting your data into chunks (e.g., working with a chromosome at a tiem), or working on a random subset of your data. Many bioinformatics analyses do not require working on an entire genomic dataset at once, so these strategies can work quite well. These approaches are also the only way to work with data that is truly too large to fit in your machine's memory (apart from getting a machine with more memory)." [@buffalo2015bioinformatics]
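A small base-R sketch of both ideas, a generic `table()` call and working on a random subset of rows (the data are invented for illustration):

```r
calls <- c("ref", "alt", "ref", "ref", "alt")
chrom <- c("chr1", "chr1", "chr2", "chr2", "chr2")
table(calls)          # counts of each value in a vector
table(chrom, calls)   # a two-way contingency table

df <- data.frame(chrom = chrom, call = calls)
df_sub <- df[sample(nrow(df), 3), ]   # a random subset of rows
```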
"If your data is larger than the available memory on your machine, you'll need to use a strategy that keeps the bulk of your data out of memory, but still allows for each access from R. A good solution for moerately large data is to use SQLite and query out subsets for computation using the R package
RSQLite
. ... Finally ... many Unix data tools have versions that work on gzipped files:zless
,zcat
(gzcat
on BSD-derived systems like Mac OS X), and others. Likewise, R's data-reading functions can also read gzipped files directly---there's some slight performance gains in reading in gzipped files, as there are fewer bytes to read off of (slow) hard disks." [@buffalo2015bioinformatics]"Quite often, data we load in to R will be in the wrong shape for what we want to do with it. Tabular data can come in two different formats: long and wide. ... In many cases, data is recorded by humans in wide format, but we need data in long format when working with and plotting statistical modeling functions." [@buffalo2015bioinformatics]
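A hedged sketch of both points; the gzipped file name is a hypothetical placeholder, and the reshaping uses the tidyr package as one common approach:

```r
# Hypothetical file name: R's read functions decompress .gz files on the fly
counts <- read.csv("expression_counts.csv.gz")

library(tidyr)
wide <- data.frame(gene = c("g1", "g2"),
                   sample_A = c(10, 3),
                   sample_B = c(7, 12))
long <- pivot_longer(wide, cols = starts_with("sample_"),
                     names_to = "sample", values_to = "count")
```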
"Exploratory data analysis emphasizes visualization as the best tool to understand and explore our data---both to learn what the data says and what its limitations are." [@buffalo2015bioinformatics]
"R vectors require all elements to have the same data type (that is, vectors are homogeneous). They only support the six data types discussed earlier (integer, double, character, logical, complex, and raw). In contrast, R's lists are more versatile: Lists can contain elements of different types (they are heterogenesou); Elements can be any object in R (vectors with different types, other lists, environments, dataframes, matrices, functions, etc.); Because lists can store other lists, they allow for storing data in a recursive way (in contrast, vectors cannot contain other vectors." [@buffalo2015bioinformatics]
"The versatility of lists make them indispensable in programming and data analysis with R." [@buffalo2015bioinformatics]
"As with R's vectors, we can extract subsets of a list or change values of specific elements using indexing. However, accessing elements from an R list is slightly different than with vectors. Because R's list can contain objects with different types, a subset containing multiple list elements could contain objects with different types. Consequently, the only way to return a subset of more than one list element is with another list. As a result, there are two indexing operators for lists: one for accessing a subset of multiple elements as a list (the single bracket...) and one for accessing an element within a list (the double bracket...)." [@buffalo2015bioinformatics]
"Because R's lists can be nested and contain any type of data, list-based data structures can grow to be quite complex. In some cases, it can be difficult to understand the overall structure of some lists. The function
str()
is a convenient R function for inspecting complex data structures.str()
prints a line for each contained data structure, complete with its type, length (or dimensions), and the first few elements it contains. ... For deeply nested lists, you can simplifystr()''s output by specifying the maximum depth of nested structure to return with
str()'s second argument,
max.level. By default,
max.levelis
NA`, which returns all nested structures." [@buffalo2015bioinformatics]"Understanding R's data structures and how subsetting works are fundamental to having the freedom in R to explore data any way you like." [@buffalo2015bioinformatics]
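For example, with a small nested list:

```r
nested <- list(meta  = list(run = "r1", date = "2020-01-01"),
               stats = list(n = 150, qc = list(pass = TRUE)))
str(nested)                  # the full nested structure
str(nested, max.level = 1)   # only the top-level components
```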
"Some of Bioconductor's core packages: GenomicRanges: Used to represent and work with genomic ranges; GenomicFeatures: used to represent and work with ranges that represent gene models and other features of a genome (genes, exons, UTRs, transcripts, etc.); Biostrings and BSgenome: Used for manipulating genome sequence data in R... rtracklayer: Used for reading in common bioinformatics formats like BED, GTF/GFF, and WIG." [@buffalo2015bioinformatics]
"The
GenomicRanges
package introduces a new class calledGRanges
for storing genomic ranges. TheGRanges
builds off ofIRanges
.IRanges
objects are used to store ranges of genomic regions on a single sequence, andGRanges
objects contain the two other pieces of information necessary to specify a genomic location: sequence name (e.g., which chromosome) and strand.GRanges
objects also have metadata columns, which are the data linked to each genomic range." [@buffalo2015bioinformatics]"All metadata attached to a
GRanges
object are stored in aDataFrame
, which behaves identically to R's basedata.frame
but supports a wider variety of column types. For example,DataFrames
allow for run-length encoded vectors to save memory ... in practice, we can store any type of data: identifiers and names (e.g., for genes, transcripts, SNPs, or exons), annotation data (e.g., conservation scores, GC content, repeat content, etc.), or experimental data (e.g., if ranges correspond to alignments, data like mapping quality and the number of gaps). ... the union of genomic location with any type of data is what makesGRanges
so pwoerful." [@buffalo2015bioinformatics]
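A minimal sketch of building a `GRanges` object with metadata columns, assuming the GenomicRanges package is installed (the coordinates and annotations are invented for illustration):

```r
library(GenomicRanges)

gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
              ranges   = IRanges(start = c(100, 500, 200), width = 50),
              strand   = c("+", "-", "+"),
              gene_id  = c("g1", "g2", "g3"),   # metadata column
              gc       = c(0.41, 0.55, 0.38))   # metadata column
mcols(gr)      # the metadata columns, stored in a DataFrame
seqnames(gr)   # accessor for the sequence names
```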
Some object classes in BioConductor:

- `eSet` (from `Biobase`)
- `Sequence` (from `IRanges`)
- `MAList` (from `limma`)
- `ExpressionSet` (from `Biobase`)
You can use the `getSlots` function with S4 objects to see all the slots within the object.
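For example, with a toy S4 class invented for illustration:

```r
setClass("SampleSet",
         slots = c(counts = "matrix", sample_info = "data.frame"))

ss <- new("SampleSet",
          counts      = matrix(1:6, nrow = 2),
          sample_info = data.frame(id = c("s1", "s2", "s3")))

getSlots("SampleSet")   # named vector: slot names and the class of each slot
slotNames(ss)           # just the slot names
ss@counts               # direct slot access (accessor functions are usually preferred)
```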
"Methods and classes in the S language are essentially programming concepts to enable good organization of functions and of general objects, respectively." [@chambers2006s4]
"Programming in R starts out usually as writing functions, at least once we get past the strict cut-and-paste stage. Functions are the actions of the language; calls to them express what the user wants to happen. The arguments to the functions and the values returned by function calls are the objects. These objects represent everything we deal with. Actions create new objects (such as summaries and models) or present the information in the objects (by plots, printed summaries, or interfaces to other software). R is a functional, object-based system where users program to extend the capacity of the system in terms of new functionality and new kinds of objects." [@chambers2006s4]
"Languages to which the object-oriented programming (OOP) term is typically applied mostly support what might better be called class-oriented programming, well-known examples being C++ and Java. In these languages the essential programming unit is the class definition. Objects are generated as instances of a class and computations on the objects consist of invoking methods on that object. Depending on how strict the language is, all or most of the computations must be expressed in this form. Method invocation is an operator, operating on an instance of a class. Software organization is essentially simple and hierarchical, in the sense that all methods are defined as part of a particular class. That's not how S works; as mentioned, the first and most important programming unit is the function. From the user’s perspective, it's all done by calling functions (even if some of the functions are hidden in the form of operators). Methods and classes provide not class-oriented programming but function- and class-oriented programming. It’s a richer view, but also a more complicated one." [@chambers2006s4]
"A generic function will collect or cache all the methods for that function belonging to all the R packages that have been loaded in the session. When the function is called, the R evaluator then selects a method from those available, by examining how well different methods match the actual arguments in the call." [@chambers2006s4]
"From the users' view, the generic function has (or at least should have) a natural definition in terms of what it is intended to do:
plot()
displays graphics to represent an object or the relation between two objects; arithmetic operators such as '+' carry out the corresponding intuitive numerical computations or extensions of those. Methods should map those intuitive notions naturally and reliably into the concepts represented by the class definitions." [@chambers2006s4]"The class definition contains a definition of the slots in objects from the class and other information of various kinds, but the most important information for the present discussion defines what other classes this class extends; that is, the inheritance or to use the most common term, the superclasses of this class. In R, the names of the superclasses can be seen as the value of
extends(thisClass)
. By definition, an object from any class can be used in a computation designed for any of the superclasses of that class. Therefore, it's precisely the superclasses of the class of an argument that define candidate methods in a particular function call." [@chambers2006s4]"Conceptually, a generic function extends the idea of a function in R by allowing different methods to be selected corresponding to the classes of the objects supplied as arguments in a call to the function." [@chambers2006s4]
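A toy sketch of class inheritance (both classes are invented for illustration):

```r
setClass("Assay", slots = c(values = "numeric"))
setClass("ChipAssay", contains = "Assay",   # ChipAssay extends Assay
         slots = c(antibody = "character"))

extends("ChipAssay")   # "ChipAssay" "Assay": the class and its superclass
is(new("ChipAssay", values = c(1, 2.5), antibody = "STAT1"), "Assay")   # TRUE
```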
The code for different implementations of a method (in other words, the different ways it will run for new object classes) can come in different R packages. This allows developers to add their own methods, suited to the object classes they create.
A class defines the structure for a way of storing data. When you create
an object that follows this structure, it's an instance of that class.
The `new` function is used to create new instances of a class.
When a generic function determines what code to run based on the class of the object, it's called method dispatch.
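A toy sketch of method dispatch (the class, generic, and methods are invented for illustration):

```r
setClass("QCReport", slots = c(pass_rate = "numeric"))

setGeneric("describe", function(x, ...) standardGeneric("describe"))

setMethod("describe", "QCReport", function(x, ...) {
  cat("QC report: ", round(100 * x@pass_rate), "% of reads passed\n", sep = "")
})

setMethod("describe", "numeric", function(x, ...) {
  cat("numeric vector of length", length(x), "\n")
})

describe(new("QCReport", pass_rate = 0.93))   # dispatches to the QCReport method
describe(c(1, 2, 3))                          # dispatches to the numeric method
```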
By using the accessor function, instead of `@`, your code will be more robust to changes that the developers make. They will be careful to ensure that the accessor function for a particular part of the data continues to work regardless of changes they make to the structure used to store data in objects of that class. They will be less committed, however, to keeping the same slots, in the same positions, as they develop the software. The "contract" with the user, in other words, is through the accessor function rather than through the slot name in the object.
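As a hedged illustration with a `GRanges` object (the exact slot layout is the package's internal business and may differ across versions, which is exactly the point):

```r
library(GenomicRanges)
gr <- GRanges("chr1", IRanges(start = c(100, 200), width = 10))

start(gr)         # accessor function: part of the documented interface
gr@ranges@start   # reaching into slots directly: brittle if the internals change
```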
"Bioconductor is an open-source, open-development software project for the analysis and comprehension of high-throughput data in genomics and molecular biology. The project aims to enable interdisciplinary research, collaboration and rapid development of scientific software. Based on the statistical programming language R, Bioconductor comprises 934 interoperable packages contributed by a large, diverse community of scientists. Packages cover a range of bioinformatic and statistical applications. They undergo formal initial review and continuous automated testing." [@huber2015orchestrating]
"Bioconductor provides core data structures and methods that enable genome-scale analysis of highthroughput data in the context of the rich statistical programming environment offered by the R project. It supports many types of high-throughput sequencing data (including DNA, RNA, chromatin immunoprecipitation, Hi-C, methylomes and ribosome profiling) and associated annotation resources; contains mature facilities for microarray analysis; and covers proteomic, metabolomic, flow cytometry, quantitative imaging, cheminformatic and other high-throughput data. Bioconductor enables the rapid creation of workflows combining multiple data types and tools for statistical inference, regression, network analysis, machine learning and visualization at all stages of a project from data generation to publication." [@huber2015orchestrating]
"Bioconductor is also a flexible software engineering environment in which to develop the tools needed, and it offers users a framework for efficient learning and productive work. The foundations of Bioconductor and its rapid coevolution with experimental technologies are based on two motivating principles. The first is to provide a compelling user experience. Bioconductor documentation comes at three levels: workflows that document complete analyses spanning multiple tools; package vignettes that provide a narrative of the intended uses of a particular package, including detailed executable code examples; and function manual pages with precise descriptions of all inputs and outputs together with working examples. In many cases, users ultimately become developers, making their own algorithms and approaches available to others. The second is to enable and support an active and open scientific community developing and distributing algorithms and software in bioinformatics and computational biology. The support includes guidance and training on software development and documentation, as well as the use of appropriate programming paradigms such as unit testing and judicious optimization. A primary goal is the distributed development of interoperable software components by scientific domain experts. In part we achieve this by urging the use of common data structures that enable workflows integrating multiple data types and disciplines. To facilitate research and innovation, we employ a high-level programming language. This choice yields rapid prototyping, creativity, flexibility and reproducibility in a way that neither point-and-click software nor a general-purpose programming language can. We have embraced R for its scientific and statistical computing capabilities, for its graphics facilities and for the convenience of an interpreted language. R also interfaces with low-level languages including C and C++ for computationally intensive operations, Java for integration with enterprise software and JavaScript for interactive web-based applications and reports." [@huber2015orchestrating]
"Case study: high-throughput sequencing data analysis. Analysis of large-scale RNA or DNA sequencing data often begins with aligning reads to a reference genome, which is followed by interpretation of the alignment patterns. Alignment is handled by a variety of tools, whose output typically is delivered as a BAM file. The Bioconductor packages Rsamtools and GenomicAlignments provide a flexible interface for importing and manipulating the data in a BAM file, for instance for quality assessment, visualization, event detection and summarization. The regions of interest in such analyses are genes, transcripts, enhancers or many other types of sequence intervals that can be identified by their genomic coordinates. Bioconductor supports representation and analysis of genomic intervals with a 'Ranges' infrastructure that encompasses data structures, algorithms and utilities including arithmetic functions, set operations and summarization (Fig. 1). It consists of several packages including IRanges, GenomicRanges, GenomicAlignments, GenomicFeatures, VariantAnnotation and rtracklayer. The packages are frequently updated for functionality, performance and usability. The Ranges infrastructure was designed to provide tools that are convenient for end users analyzing data while retaining flexibility to serve as a foundation for the development of more complex and specialized software. We have formalized the data structures to the point that they enable interoperability, but we have also made them adaptable to specific use cases by allowing additional, less formalized userdefined data components such as application-defined annotation. Workflows can differ vastly depending on the specific goals of the investigation, but a common pattern is reduction of the data to a defined set of ranges in terms of quantitative and qualitative summaries of the alignments at each of the sites. Examples include detecting coverage peaks or concentrations in chromatin immunoprecipitation–sequencing, counting the number of cDNA fragments that match each transcript or exon (RNA-seq) and calling DNA sequence variants (DNA-seq). Such summaries can be stored in an instance of the class GenomicRanges." [@huber2015orchestrating]
"To facilitate the analysis of experiments and studies with multiple samples, Bioconductor defines the SummarizedExperiment class. The computed summaries for the ranges are compiled into a rectangular array whose rows correspond to the ranges and whose columns correspond to the different samples .. . For a typical experiment, there can be tens of thousands to millions of ranges and from a handful to hundreds of samples. The array elements do not need to be single numbers: the summaries can be multivariate. The SummarizedExperiment class also stores metadata on the rows and columns. Metadata on the samples usually include experimental or observational covariates as well as technical information such as processing dates or batches, file paths, etc. Row metadata comprise the start and end coordinates of each feature and the identifier of the containing polymer, for example, the chromosome name. Further information can be inserted, such as gene or exon identifiers, references to external databases, reagents, functional classifications of the region (e.g., from efforts such as the Encyclopedia of DNA Elements (ENCODE)) or genetic associations (e.g., from genome-wide association studies, the study of rare diseases, or cancer genetics). The row metadata aid integrative analysis, for example, when matching two experiments according to overlap of genomic regions of interest. Tight coupling of metadata with the data reduces opportunities for clerical errors during reordering or subsetting operations." [@huber2015orchestrating]
"The integrative data container SummarizedExperiment. Its assays component is one or several rectangular arrays of equivalent row and column dimensions. Rows correspond to features, and columns to samples. The component rowData stores metadata about the features, including their genomic ranges. The colData component keeps track of samplelevel covariate data. The exptData component carries experiment-level information, including MIAME (minimum information about a microarray experiment)-structured metadata. The R expressions exemplify how to access components. For instance, provided that these metadata were recorded, rowData(se)$entrezId returns the NCBI Entrez Gene identifiers of the features, and se$tissue returns the tissue descriptions for the samples. Range-based operations, such as %in%, act on the rowData to return a logical vector that selects the features lying within the regions specified by the data object CNVs. Together with the bracket operator, such expressions can be used to subset a SummarizedExperiment to a focused set of genes and tissues for downstream analysis." [@huber2015orchestrating]
"A genomics-specific visualization type is plots along genomic coordinates. There are several packages that create attractive displays of along-genome data tracks, including Gviz and ggbio ... These packages operate directly on common Bioconductor data structures and thus integrate with available data manipulation and modeling functionality. A basic operation underlying such visualizations is computing with genomic regions, and the biovizBase package provides a bridge between the Ranges infrastructure and plotting packages." [@huber2015orchestrating]
"Genomic data set sizes sometimes exceed what can be managed with standard in-memory data models, and then tools from high performance computing come into play. An example is the use of rhdf5---an interface to the HDF5 large data management system (http://www.hdfgroup.org/HDF5)—by the h5vc package to slice large, genome-size data cubes into chunks that are amenable for rapid interactive computation and visualization. Both ggbio and Gviz issue range-restricted queries to file formats including BAM, BGZIP/Tabix and BigWig via Rsamtools and rtracklayer to quickly integrate data from multiple files over a specific genomic region." [@huber2015orchestrating]
"Developers are constantly updating their packages to extend capabilities, improve performance, fix bugs and enhance documentation. These changes are introduced into the development branch of Bioconductor and released to end users every 6 months; changes are tracked using a central, publicly readable Subversion software repository, so details of all changes are fully accessible. Simultaneously, R itself is continually changing, typically around performance enhancements and increased functionality. Owing to this dynamic environment, all packages undergo a daily testing procedure. Testing is fully automated and ensures that all code examples in the package documentation, as well as further unit tests, run without error. Successful completion of the testing will result in the package being built and presented to the community." [@huber2015orchestrating]
"Interoperability between software components for different stages and types of analysis is essential to the success of Bioconductor. Interoperability is established through the definition of common data structures that package authors are expected to use ... Technically, Bioconductor’s common data structures are implemented as classes in the S4 object-oriented system of the R language. In this manner, useful software concepts including encapsulation, abstraction of interface from implementation, polymorphism, inheritance and reflection are directly available. It allows core tasks such as matching of sample data and metadata to be adopted across disciplines, and it provides a foundation on which community development is based. It is instructive to compare such a representation to popular alternatives in bioinformatics: file-based data format conventions and primitive data structures of a language such as matrices or spreadsheet tables. With file-based formats, operations such as subsetting or data transformation can be tedious and error prone, and the serialized nature of files discourages operations that require a global view of the data. In either case, validity checking and reflection cannot rely on preformed or standardized support and need to be programmed from scratch again for every convention—or are missing altogether. As soon as the data for a project are distributed in multiple tables or files, the alignment of data records or the consistency of identifiers is precarious, and interoperability is hampered by having to manipulate disperse, loosely coordinated data collections." [@huber2015orchestrating]
Some of the most important data structures in Bioconductor are [@huber2015orchestrating] (from Table 2 in this reference):

- `ExpressionSet` (`Biobase` package)
- `SummarizedExperiment` (`GenomicRanges` package)
- `GRanges` (`GenomicRanges` package)
- `VCF` (`VariantAnnotation` package)
- `VRanges` (`VariantAnnotation` package)
- `BSgenome` (`BSgenome` package)

"For Bioconductor, which provides tools in R for analyzing genomic data, interoperability was essential to its success. We defined a handful of data structures that we expected people to use. For instance, if everybody puts their gene expression data into the same kind of box, it doesn’t matter how the data came about, but that box is the same and can be used by analytic tools. Really, I think it’s data structures that drive interoperability." --- Robert Gentleman in [@altschul2013anatomy]
"I have found that real hardcore software engineers tend to worry about problems that are just not existent in our space. They keep wanting to write clean, shiny software, when you know that the software that you’re using today is not the software you’re going to be using this time next year." --- Robert Gentlemen in [@altschul2013anatomy]
"Biology, formerly a science with sparse, often only qualitative data, has turned into a field whose production of quantitative data is on par with high energy physics or astronomy and whose data are wildly more heterogeneous and complex." [@holmes2018modern]
"Any biological system or organism is composed of tens of thousands of components, which can be in different states and interact in multiple ways. Modern biology aims to understand such systems by acquiring comprehensive---and this means high dimensional---data in their temporal and spatial context, with multiple covariates and interactions." [@holmes2018modern]
"Biological data come in all sorts of shapes: nucleic acid and protein sequences, rectagular tables of counts, multiple tables, continuous variables, batch factors, phenotypic images, spatial coordinates. Besides data measured in lab experiments, there are clinical data, longitudinal information, environmental measurements, networks, lineage trees, annotation from biological databases in free text or controlled vocabularies, ..." [@holmes2018modern]
"Bioconductor packages support the reading of many of the data types and formats produced by measurement instruments used in modern biology, as well as the needed technology-specific 'preprocessing' routines. This community is actively keeping these up-to-date with the rapid developments in the instrument market." [@holmes2018modern]
"The Bioconductor project has defined specialized data containers to represent complex biological datasets. These help to keep your data consistent, safe and easy to use." [@holmes2018modern]
"Bioconductor in particular contains packages from diverse authors that cover a wide range of functionalities but still interoperate because of the common data containers." [@holmes2018modern]
"
IRanges
is a general container for mathematical intervals. We create the biological context with the next line [which usesGRanges
]. [Footnote: 'The 'I in IRanges stands for 'interval', the 'G' in GRanges for 'genomic']." [@holmes2018modern]"Here we had to assemble a copy of the expression data (
exprs(x)
) and the sample annotation data (pData(x)
) all together into the dataframedftx
---since this is the data format that ggplot2 functions most easily take as input." [@holmes2018modern]GRanges is "a specialized class from the Bioconductor project for storing data that are associated with genomic coordinated. The first three columns are obligatory:
seqnames
, the name of the containing biopolymer (in our case, the names of human chromosomes);ranges
, the genomic coordinates of the intervals (in this case, the intervals all have lengths 1, as they refer to a single nucleotide), and the DNAstrand
from which the RNA is transcribed. You can find out more on how to use this class and its associated infrastructure in the documentation, e.g., the vignette of theGenomicRanges
package. Learning it is worth the effort if you want to work with genome-associated datasets, as it enables convenient, efficient and safe manipulation of these data and provides many powerful utilities." [@holmes2018modern]ChiP-Seq data "are sequences of pieces of DNA that are obtained from chromatin immunoprecipitation (ChIP). This technology enables the mapping of the locations along genomic data of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, polymerases and other proteins. It was the main technology used by the Encyclopedia of DNA Elements (ENCODE) project. Here we use an example (Kuan et al., 2011) from the
mosaicsExample
package, which shows data measured on chromosome 22 from a ChIP-Seq of antibodies for the STAT1 protein and the H3K4me3 histone modification applied to the GM12878 cell line. Here we do not show the code used to construct thebinTFBS
object that contains the binding sites for one chromosome (22) [in aBinData
class, it looks like from themosaics
package perhaps]." [@holmes2018modern]"At different stages of their development, immune cells express unique combinations of proteins on their surfaces. These protein-markers are called CDs (clusters of differentiation) and are collected by flow cytometry (using fluorescence...) or mass cytometry (using single-cell atomic mass spectrometry of heavy metal reporters). An example of a commonly used CD is CD4; this protein is expressed by helper T cells that are referred to as being 'CD4+'. Note, however, that some cells express CD4 (thus are CD4+) but are not actually helper T cells. We start by loading some useful Bioconductor packages for flow cytometry,
flowCore
andflowViz
." [@holmes2018modern]"Many datasets consist of several variables measured on the same set of subjects: patients, samples or organisms. For instance, we may have biometric characteristics such as height, weight, age as well as clinical variables such as blood pressure, blood sugar, heart rate and genetic data for, say, a thousand patients. The raison d'etre for multivariate analysis is the investigation of connections or associations between the different variables measured. Usually the data are reported in a tabular data structure, with one row for each subject and one column for each variable. ... in the special case where each of the variables is numeric, ... we can represent the data structure as a matrix in R. If the columns of the matrix are independent of each other (unrelated), we can simply study each column separately and do standard 'univariate' statistics on them one by one; there would be no benefit in studying them as a matrix. More often, there will be patterns and dependencies. For instance, in the biology of cells, we know that the proliferation rate will influence the expression of many genes simultaneously. Studying the expression of 25,000 genes (columns) on many samples (rows) of patient-derived cells, we notice that many of the genes act together; either they are positively correlated or they are anti-correlated. We would miss a lot of important information if we were to only study each gene separately. Important connections between genes are detectable only if we consider the data as a whole, each row representing the many measurements made on the same observational unit. However, having 25,000 dimensions of variation to consider at once is daunting; [you can] reduce our data to a smaller number of the most important dimensions without losing too much information." [@holmes2018modern]
"RNA-Seq transcriptome data report the number of sequence reads matching each gene [or sub-gene structure, such as exons] in each of several biological samples... It is customary in the RNA-Seq field ... to report genes in rows and samples in columns. Compared with the other matrices we look at here, this is transposed: rows and columns swapped. Such different conventions easily lead to errors, so they are worth paying attention to. [Footnote: 'The Bioconductor project tries to help users and developers to avoid such ambiguities by defining data containers in which such conventions are explicitly fixed...']" [@holmes2018modern]
"Proteomic profiles: Here the columns are aligned mass spectroscopy peaks or molecules identified through their m/z ratios; the entries in the matrix are measured intensities." [@holmes2018modern]
"... unlike regression, PCA treats all variables equally (to the extent that they were preprocessed to have equivalent standard deviations). However, it is still possible to map other continuous variables or categorical factors onto plots in order to help interpret the results. Often we have supplementary information on the samples, for example diagnostic lables in the diabetes data or cell types in the T-cell gene expression data. Here we see how we can use such extra variables to inform our interpretation. The best place to store such so-called metadata is in approapriate slots of the data object (such as in the Bioconductor
SummarizedExperiment
class); the second best is in additional columns of the data frame that also contains the numeric data. In practice, such information is often stored in a more or less cryptic manner in the row names of the matrix." [@holmes2018modern]"Multivariate data anlayses require 'conscious' preprocessing. After consulting all the means, variances, and one-dimensional histograms, we saw how to rescale and center the data." [@holmes2018modern]
"Many measurement devices in biotechnology are based on massively parallel sampling and counting of molecules. One example is high-throughput DNA sequencing. It's applications fall broadly into two main classes of data output. In the first case, the outputs of interest are the sequences themselves, perhaps also their polymorphisms or differences from other sequences seen before. In the second case, the sequences themselves are more or less well understood (if, say, we have a well-assembled and annotated genome) and our interest is the abundance of different sequence regions in our sample. For instance, in RNA-Seq..., we sequence the RNA molecules found in a population of cells or in a tissue. In ChIP-Seq, we sequence DNA regions that are bound to a particular RNA-binding protein. In DNA-Seq, we sequence genomic DNA and are interested in the prevalence of genetic variants in heterogeneous populations of cells, for instance the clonal composition of a tumor. In high-throughput chromatin conformation capture (HiC) we aim to map the 3D spatial arrangement of DNA. In genetic screens (using, say, RNAi or CRISPR-Cas9 libraries for perturbation and high-throughput sequencing for readout), we're interested in the proliferation or survival of cells upon gene knockdown, knockout, or modification. In microbiome analysis, we study the abundance of different microbial species in complex microbial habitats. Ideally, we might want to sequence and count all molecules of interest in the sample. Generally this is not possible; the biochemical protocols are not 100% efficient, and some molecules or intermediates get lost along the way. Moreover, it's often also not even necessary. Instead, we sequence and count a statistical sample. The sample size will depend on the complexity of the sequence pool assayed; it can go from tens of thousands to billions. This sampling nature of the data is important when it comes to analyzing them. We hope that the sampling is sufficiently representative for us to identify interesting trends and patterns." [@holmes2018modern]
"
DESeq2
uses a specialized data container, calledDESeqDataSet
, to store the datasets it works with. Such use of specialized containers---or, in R terminology, classes---is a common principle of the Bioconductor project, as it helps users keep related data together. While this way of doing things requires users to invest a little more time up front to understand the classes, compared with just using basic R data types like matrix and dataframe, it helps in avoiding bugs due to loss of synchronization between related parts of the data. It also enables the abstraction and encapsulation of common operations that could be quite wordy if always expressed in basic terms [footnote: Another advantage is that classes can contain validity methods, which make sure that the data always fulfill certain expectations, for instance, that the counts are positive integers, or that the columns of the counts matrix align with the rows of the sample annotation dataframe.]DESeqDataSet
is an extension of the classSummarizedExperiment
in Bioconductor. TheSummarizedExperiment
class is also used by many other packages, so learning to work with it will enable you to use a large range of tools. We will use the constructor functionDESeqDataSetFromMatrix
to create aDESeqDataSet
from the count data matrix ... and the sample annotation dataframe ... TheSummarizedExperiment
class---and thereforeDESeqDataSet
---also contains facilities for storing annotations of the rows of the count matrix." [@holmes2018modern]"We introduced the R data.frame class, which allows us to combine heterogeneous data types: categorical factors and continuous measurements. Each row of the dataframe corresponds to an object, or a record, and the columns are the different variables or features. Extra information about sample batches, dates of measurement and different protocols is often misnamed metadata. This information is actually real data that needs to be integrated into analyses. Here we show an example of an analysis that was done by Holmes et al. (2011) on bacterial abundance data from Phylochip microarrays. The experiment was designed to detect differences between a group of healthy rats and a group that had irritable bowel disease. This example shows how nuisance batch effects can become apparent in the analysis of experimental data. It illustrates why best practices in data analysis are sequential and why it is better to analyze data as they are collected---to adjust for severe problems in the experimental design as they occur---instead of trying to deal with deficiencies post mortem. When data collection started on this project, data for days 1 and 2 were delivered and we made the plot ... This shows a definite day effect. When investigating the source of this effect, we found that both the protocol and the array were different on days 1 and 2. This leads to uncertainty about the source of variation; we call this confounding of effects." [@holmes2018modern]
"Many programs and workflows in biological sequence analysis or assays separate the environmental and contextual information they call metadata from the assays or sequence read numbers; we discourage this practice, as the exact connections between the samples and covariates are important. The lost connections between the assays and covariates makes later analyses impossible. Covariates such as clinical history, time, batch and location are important and should be considered components of the data." [@holmes2018modern]
"The data provide and example of an awkward way of combining bach information from the actual data. The day information has been combined with the array data and encoded as a number and could be confused with a continuous variable. We will see in the next section a better practice for storing and manipulating heterogeneous data using a Bioconductor container called
SummarizedExperiment
" [@holmes2018modern]"A more rational way of combining the batch and treatment information into compartments of a composite object is to use
SummarizedExperiment
classes. These include special slots for the assay(s) where rows represent features of interest (e.g., genes, transcripts, exons, etc.) and columns represent samples. Supplementary information about the features can be stored in aDataFrame
object, accessible using the functionrowData
. Each row of theDataFrame
provides informaiton on the feature in the corresponding row of theSummarizedExperiment
object. ... This is the best way to keep all the relevant data together. It will also enable you to quickly filter the data while keeping all the information aligned properly. ... Columns of theDataFrame
represent different attributes of the features of interest, e.g., gene or transcript IDs. This is an example of a hybrid data container from a single-cell experiment..." [@holmes2018modern]"The success of the tidyverse attests to the power of its underlying ideas and the quality of its implementation. ... Nevertheless, dataframes in the long format are not a panacea. ... When we write a function that expects to work on an object like
xldf
, we have no guarantee that the columnprobe
does indeed contain valid probe identifiers, or that such a column even exists. There is not even a proper way to express programmatically what 'an object likexldf
means in the tidyverse. Object-oriented (OO) programming, and its incarnation S4 in R, solves such questions. For instance, the above-mentioned checks could be performed by avalidObject
method for a suitably defined class, and the class definition would formalize the notion of 'an object likexldf
'. Addressing such issues is behind the object-oriented design of the data structures in Bioconductor, such as theSummarizedExperiment
class. Other potentially useful features of OO data representations include: 1. Abstraction of interface from implementation and encapsulation: the user accesses the data only through defined channels and does not need to see how the data are stored 'inside'---which means that inside can be changed and optimized without breaking user-level code. 2. Polymorphism: you can have different functions with the same name, such as plot or filter, for different classes of objects, and R figures out for you which to call. 3. Inheritance: you can build up more complex data representations from simpler ones. 4. Reflection and self-documentation: you can send programmatic queries to an object to ask for more information about itself. All of these make it easier to write high-level code that focuses on the 'big picture' functionality rather than on implementation details of the building blocks---albeit at the cost of more initial investment in infrastructure and 'bureaucracy'." [@holmes2018modern]"Data provenance and metadata. THere is no obvious place in an object like
xldf
to add information about data provenance: e.g., who performed the experiment, where it was published, where the data were downloaded from, or which version of the data we're looking at (data bugs exist ...). Neither are there any explanations of the columns, such as units and assay type. Again, the data classes in Bioconductor try to address this." [@holmes2018modern]Matrix-like data. Many datasets in biology have a natural matrix-like structure, since a number of features (e.g., genes: conventionally, the rows of the matrix) are assayed on several samples (conventionally, the columns of the matrix). Unrolling the matrix into a long form like
xldf
makes some operations (say, PCA, SVD, clustering of features or samples) more awkward." [@holmes2018modern]"Out-of memory data and chunking. Some datasets are too big to load into random access memory (RAM) and manipulate all at once. Chunking means splitting the data into manageable portions ('chunks') and then sequentially loading each portion, computing on it, storing the results and removing it from memory before loading the next portion. R also offers infrastructure for working with large datasets that are stored on disk in a relational database management systems (the DBI package) or in HDF5 (the rhdf5 package). The Bioconductor project provides the class
SummarizedExperiment
, which can store big data matrices either in RAM or in an HDF5 backend in a manner that is transparent to the user of objects of this class." [@holmes2018modern]
Many of the statistical algorithms rely on matrices, which store data that are all of the same type (e.g., numeric or counts). If you store extra variables in the same object, like binary outcome classifications (sick/well; alive/dead) or categorical variables, it will complicate these operations. Further, if these aren't to be used in things like dimension reduction and clustering, then you will continuously need to subset as you perform those matrix-based processes. Conversely, once you move to using ggplot to visualize your data and other tidy tools to create summary tables and other output for reports, it's handy to have all the information relevant to each of your observations together within a dataframe---a structure that can hold and align data of many different types in its different columns. It therefore makes sense to evolve from more complex object types, in which different types of variables for each observation are stored in their own places, and where variables with similar types can be collected in a matrix that is ready for statistical processing, to the simpler dataframe at later stages in the pipeline, when working on publication-ready tables and figures. This requires a switch at some point in the pipeline from a coding approach that stores data in more complex Bioconductor S4 objects to one that stores data in a simple and straightforward tidy dataframe.
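To make this transition concrete, here is a minimal sketch of moving from a matrix (ready for matrix-based statistics) to a long, tidy dataframe (ready for ggplot2 and summary tables). The objects `metab_matrix` and `sample_info` are made up for illustration and stand in for post-processing results.

```r
## Made-up post-processing results: a small metabolite matrix plus sample metadata
library(tidyverse)

metab_matrix <- matrix(rnorm(12), nrow = 3,
                       dimnames = list(paste0("sample_", 1:3),
                                       paste0("feature_", 1:4)))
sample_info <- data.frame(sample = paste0("sample_", 1:3),
                          group  = c("control", "control", "treated"))

## Matrix form: ready for PCA, clustering, and other matrix-based statistics
pca <- prcomp(metab_matrix, scale. = TRUE)

## Dataframe form: everything aligned in one tidy table, ready for ggplot2
metab_df <- metab_matrix %>%
  as.data.frame() %>%
  rownames_to_column("sample") %>%
  pivot_longer(-sample, names_to = "feature", values_to = "abundance") %>%
  left_join(sample_info, by = "sample")

head(metab_df)
```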
One file format used to store DNA sequence data is the fasta file. The Biostrings package has a function for reading these data in from a fasta file. It stores the data in an instance of the DNAStringSet class from that package. Within this object, the sequence for each sample is stored as a DNAString object.
Bioconductor hosts many of the R packages for working with genomic and other bioinformatics data. One characteristic of packages on Bioconductor is that they make heavy use of a system for object-oriented programming in R. There are several systems for object-oriented programming in R; Bioconductor relies heavily on one called S4.
Object-oriented programming allows developers to create methods. These are functions in R that first check the class of the object that is input, and then run different code for different classes. For example, summary is one of these method-style functions. If you call the summary function with a numeric vector as input, one set of code will be run: you will get numeric summaries of the values in that vector, including the minimum, maximum and median. However, if you run the same function, summary, on a dataframe with columns of factor data, the output will be a small summary for each column, giving the levels of the factor in each column and the number of column values in the most common of those levels.
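You can see this dispatch in action with a small example:

```r
## The same generic, summary(), runs different methods for different classes
x <- c(2.1, 5.4, 3.3, 8.7)
summary(x)    # numeric method: min, quartiles, median, mean, max

df <- data.frame(group = factor(c("a", "a", "b")),
                 weight = c(10, 12, 9))
summary(df)   # dataframe method: a per-column summary, with counts for the factor
```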
With this system of writing methods, the same function call can be used for many different object types. By contrast, other approaches to programming might constrain a function to work with a single class of object---for example, a function might work on a dataframe and only a dataframe, not a vector, matrix, or other more complex types of objects.
These methods often have short, general names. Examples of these method-style functions include plot, summary, and head [?], [others]. You can try running these method-style functions on just about any object class that you're using to store your data, and chances are good that it will work on the object and output something interesting.
The S4 system of object-oriented programming in R allows for something called inheritance. [More on this.]
As you use R with the Bioconductor packages, you often might not notice how much S4 objects are being used "under the hood" as you pre-process and analyze data. By contrast, you may have learned the "tidyverse" approach in R, which is a powerful general approach for working with data. The tidyverse approach is centered on the object class it predominantly uses, the dataframe, and so a lot of attention is given to thinking about that style of data storage when learning the approach.
A pre-processing pipeline in Bioconductor might take the data through a number of different object classes over the course of the pipeline. Different functions within a Bioconductor package may manage this progression without you being very aware of it. For example, one function may read the data from the equipment's file format and move it directly into a certain complex object class, while a second function might take that object class as input, do some actions on the data, and then output the result in a different class.
Generally, if the functions in a pipeline handle these object transitions gracefully, you may not feel the need to dig more deeply into the object types. However, ideally you should feel comfortable taking a peek at your data at any step in the process. This can include seeing snippets of the data in its object (e.g., the first few elements in each component of the data at that stage), as well as visualizing parts of the data in simple ways.
This is certainly possible even when data are stored in complex or unfamiliar object classes. However, it's a bit less natural than exploring your data when it's stored in an object class that you feel very comfortable with. For example, most R programmers have several go-to strategies for checking any data they have stored in a dataframe. You can develop these same go-to strategies for data in more complex object classes once you understand a few basics about the S4 system and the object classes created using this system.
First, there are a few methods you can use to figure out what's in a data object. [More on this. str, some on interactive ways to look at objects?] Further, most S4 objects will have their own help files [doublecheck], and you can use this resource to learn more about what it's storing and where it puts each piece of data. [More on accessing these help files. ?ExpressionSet, for example.]
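As one concrete sketch, using the small example ExpressionSet that ships with the Biobase package:

```r
## Exploring an unfamiliar S4 object: class, slots, structure, and help file
library(Biobase)

data(sample.ExpressionSet)

class(sample.ExpressionSet)               # which class is this object?
slotNames(sample.ExpressionSet)           # which slots does that class define?
str(sample.ExpressionSet, max.level = 2)  # a compact outline of the structure
?ExpressionSet                            # open the help file for the class
```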
Once you know what's in your object, there are a few ways that you can pull out different elements of the data. One basic way (it's a bit heavy-handed, and best to save for when you're struggling with other methods) is to extract slots from the object using the @ symbol. If you have worked much with base R, you will be familiar with pulling out elements of more basic object classes using $. For example, if you wanted to extract a column named weight from a dataframe object called experiment_1, you could do so using the syntax experiment_1$weight. The $ operator does not work in the same way with S4 objects. With these, we say that the elements are stored in different slots of the object, and each slot can be extracted using @. So if you had an S4 object with data on animal weights stored in a slot called weight, you could extract it from an S4 object instance named experiment_2 with experiment_2@weight.
A more elegant approach is to access elements stored in the object using a special type of function called an accessor function.
One important object class in Bioconductor is ExpressionSet. This object class helps to keep different elements of data from an experiment aligned---for example, it helps ensure that higher-level data about each sample are kept well-aligned with data on more specific values---like measurements from each metabolite feature [? better example? gene expression values for each gene?] specific to each sample. The key slots in this object class are assayData, phenoData, and featureData, and the data in these slots can be accessed with the accessor functions exprs, pData, and fData, respectively.
Often, the contents of the slots within a Bioconductor class will be a more generic object type that you're familiar with, like a matrix or vector.
S4 objects can be set to check that the inputs are valid for that object class when someone creates a new object of that class. This helps with quality control in creating new objects, where these issues can be caught early, before functions are run on the object that assume certain characteristics of its data.
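As a minimal sketch of how this works, here is a hypothetical S4 class with a validity check; the class name and the check itself are invented for illustration:

```r
## A made-up S4 class with a validity check: new() will refuse to build an
## object whose data break the stated expectation
setClass("CountMatrix",
         slots = c(counts = "matrix"),
         validity = function(object) {
           if (any(object@counts < 0)) "counts must be non-negative" else TRUE
         })

cm <- new("CountMatrix", counts = matrix(0:3, nrow = 2))    # passes the check
## new("CountMatrix", counts = matrix(-1, nrow = 1))         # would raise an error
```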
A note on vocabulary: in the S4 system, the shared function name (for example, summary) is called a generic function, while the class-specific implementations that the generic dispatches to are called methods.
"the whole point of OOP is not to have to worry about what is inside an object. Objects made on different machines and with different languages should be able to talk to each other" --- Alan Kay
This idea in object-oriented programming can be very helpful for large, multi-developer projects, since different people, or even whole teams, can develop their parts independently. As long as everyone agrees on the way that messages will be passed between different objects and parts of the code, each team has independence in how it works on its own objects. In other words, there are fixed rules for how the parts connect, and freedom in how each part works internally: the code for any one object can be written and changed without breaking the whole system.
However, the idea of not worrying about what's inside an object is at odds with some basic principles for working with experimental data. Exploratory data analysis is a key principle for improving quality control, rigor, and even creativity in working with scientific data sets. [More on EDA, including from Tukey] EDA requires a researcher to be able to explore the data stored inside an object, ideally at any stage along a pipeline of pre-processing and then analyzing those data. Therefore, there's a bit of tension in the S4 approach in R, between using a system that allows for powerful development of tools to explore data and the fundamental needs of the researcher to access and explore their data as they work with it---to "see inside" the objects storing their data at every step.
Objects store data. They are data structures, with certain rules for where they store different elements of the data. They also are associated with specific functions that work with the way they store the data.
"A programming language serves two related purposes: it provides a vehicle for the programmer to specify actions to be executed and a set of concepts for the programmer to use when thinking about what can be done. The first aspect ideally requires a language that is 'close to the machine', so that all important aspects of a machine are handled simply and efficiently in a way that is reasonably obvious to the programmer. The C language was primarily designed with this in mind. The second aspect ideally requires a language that is 'close to the problem to be solved' so that the concepts of a solution can be expressed directly and concisely. The facilities added to C to create C++ were primarily designed with this in mind." --- Bjarne Stroustrup, The C++ Programming Language, Addison-Wesley, 1986
"The basis for OOP started in the early 1960s. A breakthrough involving instances and objects was achieved at MIT with the PDP-1, and the first programming language to use objects was Simula 67. It was designed for the purpose of creating simulations, and was developed by Kristen Nygaard and Ole-Johan Dahl in Norway. They were working on simulations that deal with exploding ships, and realized they could group the ships into different categories. Each ship type would have its own class, and the class would generate its unique behavior and data. Simula was not only responsible for introducing the concept of a class, but it also introduced the instance of a class. The term 'object oriented programming' was first used by Xerox PARC in their Smalltalk programming language. The term was used to refer to the process of using objects as the foundation for computation. The Smalltalk team was inspired by the Simula 67 project, but they designed Smalltalk so that it would be dynamic. The objects could be changed, created, or deleted, and this was different from the static systems that were commonly used. Smalltalk was also the first programming language to introduce the inheritance concept. It is this feature that allowed Smalltalk to surpass both Simula 67 and the analog programming systems. While these systems were advanced for their time, they did not use the inheritance concept." --- http://www.exforsys.com/tutorials/oops-concepts/the-history-of-object-oriented-programming.html
"Object-oriented programming is first and foremost about objects. Initially object-oriented languages were geared toward modeling real world objects so the objects in a program corresponded to real world objects. Examples might include: 1. Simulations of a factory floor–objects represent machines and raw materials 2. Simulations of a planetary system–objects represent celestial bodies such as planets, stars, asteroids, and gas clouds 3. A PC desktop–objects represent windows, documents, programs, and folders 4. An operating system–objects represent system resources such as the CPU, memory, disks, tapes, mice, and other I/O devices" https://www.ephemeralobjects.org/2014/02/03/a-brief-history-of-object-oriented-programming/
"The idea with an object is that it advertises the types of data that it will store and the types of operations that it allow to manipulate that data. However, it hides its implementation from the user. For a real world analogy, think of a radio. The purpose of a radio is to play the program content of radio stations (actually translate broadcast signals into sounds that humans can understand). A radio has various dials that allow you to control functions such as the station you are tuned to, the volume, the tone, the bass, the power, and so on. These dials represent the operations that you can use to manipulate the radio. The implementation of the radio is hidden from you. It could be implemented using vacuum tubes or solid state transistors, or some other technology. The point is you do not need to know. The fact that the implementation is hidden from you allows radio manufacturers to upgrade the technology within radios without requiring you to relearn how to use a radio." https://www.ephemeralobjects.org/2014/02/03/a-brief-history-of-object-oriented-programming/
"The set of operations provided by an object is called its interface. The interface defines both the names of the operations and the behavior of these operations. In essence the interface is a contract between the object and the program that uses it. The object guarantees that it will provide the advertised set of operations and that they will behave in a specified fashion. Any object that adheres to this contract can be used interchangeably by the program. Hence the implementation of an object can be changed without affecting the behavior of a program." https://www.ephemeralobjects.org/2014/02/03/a-brief-history-of-object-oriented-programming/
"An object is not much good if each one must be custom crafted. For example, radios would not be nearly as prevalent if each one was handcrafted. What is needed is a way to provide a blueprint for an object and a way for a 'factory' to use this blueprint to mass produce objects. Classes provide this mechanism in object-oriented programming. A class is a factory that is able to mass produce objects. The programmer provides a class with a blueprint of the desired type of object. A 'blueprint' is actually composed of: 1. A declaration of a set of variables that the object will possess, 2. A declaration of the set of operations that the object will provide, and 3. A set of function definitions that implements each of these operations. The set of variables possessed by each object are called instance variables. The set of operations that the object provides are called methods. For most practical purposes, a method is like a function. When a program wants a new instance of an object, it asks the appropriate class to create a new object for it. The class allocates memory to hold the object’s instance variables and returns the object to the program. Each object knows which class created it so that when an operation is requested for that object, it can look up in the class the function that implements that operation and call that function." https://www.ephemeralobjects.org/2014/02/03/a-brief-history-of-object-oriented-programming/
"In object-oriented programming, inheritance means the inheritance of another object’s interface, and possibly its implementation as well. Inheritance is accomplished by stating that a new class is a subclass of an existing class. The class that is inherited from is called the superclass. The subclass always inherits the superclass’s complete interface. It can extend the interface but it cannot delete any operations from the interface. The subclass also inherits the superclass’s implementation, or in other words, the functions that implement the superclass’s operations. However, the subclass is free to define new functions for these operations. This is called overriding the superclass’s implementation. The subclass can selectively pick and choose which functions it overrides. Any functions that are not overridden are inherited." https://www.ephemeralobjects.org/2014/02/03/a-brief-history-of-object-oriented-programming/
For xcms, the basic object class is now XCMSnExp. [@holmes2018modern] This is the container the data are stored in while pre-processing LC-MS data with the xcms package.
"xcms supports analysis of LC/MS data from files in (AIA/ANDI) NetCDF, mzML/mzXML and mzData format. For the actual data import Bioconductor’s mzR is used." [@smith2013lc]
"Subsequently we load the raw data as an OnDiskMSnExp object using the readMSData method from the MSnbase package. The MSnbase provides based structures and infrastructure for the processing of mass spectrometry data. ... The resulting OnDiskMSnExp object contains general information about the number of spectra, retention times, the measured total ion current etc, but does not contain the full raw data (i.e. the m/z and intensity values from each measured spectrum). Its memory footprint is thus rather small making it an ideal object to represent large metabolomics experiments while allowing to perform simple quality controls, data inspection and exploration as well as data sub-setting operations. The m/z and intensity values are imported from the raw data files on demand, hence the location of the raw data files should not be changed after initial data import." [@smith2013lc]
Important skills for working with data in Bioconductor classes:
Running ?`Chromatogram-class` accesses the help file that describes the Chromatogram class in the MSnbase package. It includes info on what is typically stored in objects of this class (retention time-intensity value pairs for chromatographic mass spectrometry data). It tells how to create a new object of that class using its constructor function. It lists accessor functions for objects in that class: rtime to get retention times, intensity to get the intensities, mz to get the m/z range of the chromatogram, etc. It also lists some functions, including generic functions like length, that can be used with objects in that class, as well as some details on how the class's method for that generic function works (in terms of what it will return). It provides the usage, and defines the parameters, for functions that work with this object class.

Often, you'll have a class that stores data for one sample (e.g., Chromatogram from the MSnbase package, which stores chromatographic mass spectrometry data), and then another class that will collectively store these sample-specific data in a larger object (e.g., the Chromatograms class, also from the MSnbase package, which stores multiple Chromatogram objects, from different samples, in a structure derived from the matrix structure).
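As a small sketch, assuming the constructor and accessor functions listed in that help file (the retention times and intensities here are made-up values):

```r
## Build a tiny Chromatogram by hand and use the documented accessors
library(MSnbase)

chr <- Chromatogram(rtime = c(1.1, 1.2, 1.3, 1.4),
                    intensity = c(100, 250, 400, 180))

rtime(chr)       # retention times
intensity(chr)   # intensities
mz(chr)          # m/z range of the chromatogram
length(chr)      # a generic function that also has a method for this class
```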
You can use the pipe operator from magrittr in Bioconductor workflows, too. It works by "piping" the output from one function call in as the input to the next function call (typically, as the argument in the first position of that function call).
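For example, a sketch using the example ExpressionSet that ships with Biobase:

```r
## Piping also works with Bioconductor objects
library(magrittr)
library(Biobase)

data(sample.ExpressionSet)

sample.ExpressionSet %>%
  exprs() %>%    # extract the expression matrix
  head()         # then look at the first few rows
```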
Calling the object name at the R console will run the print method for that object's class on the object. Often, this will provide a printout of useful metadata, descriptions, and summaries for the data stored in that object. If you want a more granular look at what's contained in the object, you can use the str function.
Object classes are often set up to inherit from another class. This means that a method that works for one class might also work for a similar class, if the second inherits from the first. "The results are returned as an XCMSnExp object which extends the OnDiskMSnExp object by storing also LC/GC-MS preprocessing results. This means also that all methods to sub-set and filter the data or to access the (raw) data are inherited from the OnDiskMSnExp object and can thus be re-used. Note also that it is possible to perform additional rounds of peak detection (e.g. on MS level > 1 data) on the xdata object by calling findChromPeaks with the parameter add = TRUE." [@smith2013lc]
Sometimes there will be a class just for storing the parameters for running an algorithm---for example, the CentWaveParam and MergeNeighboringPeaksParam classes in the xcms package. Presumably this is to allow the parameter values to be validity-checked before they are used in the algorithm.
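For example, a sketch of creating one of these parameter objects; the specific parameter values here are purely illustrative, not recommendations, and `raw_data` is a hypothetical input object:

```r
## Bundle algorithm settings in a parameter object before running peak detection
library(xcms)

cwp <- CentWaveParam(ppm = 25, peakwidth = c(20, 80))
cwp   # printing the object shows the stored (and validity-checked) settings

## xdata <- findChromPeaks(raw_data, param = cwp)  # raw_data: an OnDiskMSnExp (hypothetical)
```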
Moving into a more general object class after pre-processing:
"Results from the xcms-based preprocessing can be summarized into a SummarizedExperiment object from the SummarizedExperiment package with the quantify method. This object will contain the feature abundances as the assay matrix, the feature definition (their m/z, retention time and other metadata) as rowData (i.e. row annotations) and the sample/phenotype information as colData (i.e. column annotations). All the processing history will be put into the object’s metadata. This object can then be used for any further (xcms-independent) processing and analysis." [@smith2013lc]
"The concept in R of attributes of an object allows an exceptionally rich set of data objects. S3 methods make the class attribute the driver of an object-oriented system. It is an optional system. Only if an object has a class attribute do S3 methods really come into effect." [@burns2011r]
"There are some functions that are generic. Examples include print, plot, summary. These functions look at the class attribute of their first argument. If that argument does have a class attribute, then the generic function looks for a method of the generic function that matches the class of the argument. If such a match exists, then the method function is used. If there is no matching method or if the argument does not have a class, then the default method is used. Let’s get specific. The lm (linear model) function returns an object of class 'lm'. Among the methods for print are print.lm and print.default. The result of a call to lm is printed with print.lm. The result of 1:10 is printed with print.default." [@burns2011r]
"S3 methods are simple and powerful. Objects are printed and plotted and summarized appropriately, with no effort from the user. The user only needs to know print, plot and summary." [@burns2011r]
"If your mystery number is in obj, then there are a few ways to look for it:
print.default(obj)
print(unclass(obj))
str(obj)
The first two print the object as if it had no class, the last prints an outline of the structure of the object. You can also do:names(obj)
to see what components the object has---this can give you an overview of the object." [@burns2011r]"median is a generic function as evidenced by the appearance of UseMethod. What the new user meant to ask was, 'How can I find the default method for median?' The most sure-fire way of getting the method is to use getS3method: getS3method(’median’, ’default’)." [@burns2011r]
"The methods function lists the methods of a generic function [for classes loaded in the current session]. Alternatively given a class it returns the generic functions that have methods for the class." [@burns2011r]
head(methods(print))
library(Biobase)
methods(class = "ExpressionSet")
"Inheritance should be based on similarity of the structure of the objects, not similarity of the concepts for the objects. Matrices and data frames have similar concepts. Matrices are a specialization of data frames (all columns of the same type), so conceptually inheritance makes sense. However, matrices and data frames have completely different implementations, so inheritance makes no practical sense. The power of inheritance is the ability to (essentially) reuse code." [@burns2011r]
"S3 methods are simple and powerful, and a bit ad hoc. S4 methods remove the ad hoc—they are more strict and more general. The S4 methods technology is a stiffer rope—when you hang yourself with it, it surely will not break. But that is basically the point of it---the programmer is restricted in order to make the results more dependable for the user. That’s the plan anyway, and it often works." [@burns2011r]
"S4 is quite strict about what an object of a specific class looks like. In contrast S3 methods allow you to merely add a class attribute to any object—as long as a method doesn’t run into anything untoward, there is no penalty. A key advantage in strictly regulating the structure of objects in a particular class is that those objects can be used in C code (via the .Call function) without a copious amount of checking." [@burns2011r]
"Along with the strictures on S4 objects comes some new vocabulary. The pieces (components) of the object are called slots. Slots are accessed by the
@
operator." [@burns2011r]"By now you will have noticed that S4 methods are driven by the class attribute just as S3 methods are. This commonality perhaps makes the two systems appear more similar than they are. In S3 the decision of what method to use is made in real-time when the function is called. In S4 the decision is made when the code is loaded into the R session—there is a table that charts the relation. [@burns2011r]
[Try using biobroom to extract pieces of data in the Bioconductor dataset as tidy dataframes. Try using this with further tidyverse code to create a nice table/visualization.]
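A sketch of what this might look like, using the example ExpressionSet from Biobase; the `addPheno` argument is documented for biobroom's ExpressionSet tidier, and the summary shown afterward is just one illustrative follow-up:

```r
## Tidy a Bioconductor object with biobroom, then summarize with dplyr
library(Biobase)
library(biobroom)
library(dplyr)

data(sample.ExpressionSet)
tidy_expr <- tidy(sample.ExpressionSet, addPheno = TRUE)

tidy_expr %>%
  group_by(sample) %>%
  summarize(mean_expression = mean(value))
```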