startQuest("Bioconductor website")
h1("Introduction to R / Bioconductor")
This chapter[^3] introduces R and Bioconductor in detail. During the course we will only skim through it, as you are expected to be familiar with most of the concepts presented here. If that were not the case - and even if it were - consider it homework.
h2("Bioconductor")
Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor started more than 10 years ago and gained credibility for its statistically rigorous approach to microarray pre-processing and analysis of designed experiments, and integrative and reproducible approaches to bioinformatic tasks. There are now more than 800 Bioconductor packages for expression and other microarrays, sequence analysis, flow cytometry, imaging, and other domains. The Bioconductor web site provides installation, package repository, help, and other documentation.
The Bioconductor web site is at bioconductor.org. Features include:
- Introductory workflows.
- A manifest of Bioconductor packages arranged in BiocViews.
- Annotation (databases of relevant genomic information, e.g.: Entrez gene identifiers in model organisms, KEGG pathways) and experiment data (containing relatively comprehensive data sets and their analyses) packages.
- Mailing lists, including searchable archives, as the primary source of help.
- Course and conference information, including extensive reference material.
- General information about the project.
- Package developer resources, including guidelines for creating and submitting new packages.
quest(1) endQuest()
startQuest("Bioconductor rationale")
h2("Statistical programming")
You will have realized that all the links above are useful ones (and one is a shameless self-citation).
Many academic and commercial software products are available; why would one use R and Bioconductor? One answer comes from considering the demands that high-throughput genomic data places on effective computational biology software.
h3("Effective computational biology software")
High-throughput questions make use of large data sets. This applies both to the primary data (microarray expression values, sequenced reads, etc.) and also to the annotations on those data (coordinates of genes and features such as exons or regulatory regions; participation in biological pathways, etc.). Large data sets place demands on our tools that preclude some standard approaches, such as spreadsheets. Likewise, intricate relationships between data and annotation, and the diversity of research questions, require flexibility typical of a programming language rather than a narrowly-enabled graphical user interface.
Analysis of high-throughput data is necessarily statistical. The volume of data requires that it be appropriately summarized before any sort of comprehension is possible. The data are produced by advanced technologies, and these introduce artifacts (e.g.: probe-specific bias in microarrays; sequence or base calling bias in RNA-seq experiments) that need to be accommodated to avoid incorrect or inefficient inference. Data sets typically derive from designed experiments, requiring a statistical approach both to account for the design and to correctly address the large number of observed values (e.g.: gene expression or sequence tag counts) and small number of samples accessible in typical experiments.
Research needs to be reproducible. Reproducibility is both an ideal of the scientific method, and a pragmatic requirement. The latter comes from the long-term and multi-participant nature of contemporary science. An analysis will be performed for the initial experiment, revisited again during manuscript preparation, and revisited during reviews or in determining next steps. Likewise, analyses typically involve a team of individuals with diverse domains of expertise. Effective collaborations result when it is easy to reproduce, perhaps with minor modifications, an existing result, and when sophisticated statistical or bioinformatic analysis can be effectively conveyed to other group members.
Science moves very quickly. This is driven by the novel questions that are the hallmark of discovery, and by technological innovation and accessibility. Rapidity of scientific development places significant burdens on software, which must also move quickly. Effective software cannot be too polished, because that requires that the correct analyses are ‘known’ and that significant resources of time and money have been invested in developing the software; this implies software that is tracking the trailing edge of innovation. On the other hand, leading-edge software cannot be too idiosyncratic; it must be usable by a wider audience than the creator of the software, and fit in with other software relevant to the analysis.
Effective software must be accessible. Affordability is one aspect of accessibility. Another is transparent implementation, where the novel software is sufficiently documented and the source code accessible enough for the assumptions, approaches, practical implementation decisions, and inevitable coding errors to be assessed by other skilled practitioners. A final aspect of accessibility is that the software is actually usable. This is achieved through adequate documentation, support forums, and training opportunities.
h3("Bioconductor as effective computational biology software")
What features of R and Bioconductor contribute to their effectiveness as software tools?
Bioconductor is well suited to handle extensive data and annotation. Bioconductor ‘classes’ represent high-throughput data and their annotation in an integrated way. Bioconductor methods use advanced programming techniques or R resources (such as transparent database or network access) to minimize memory requirements and integrate with diverse resources. Classes and methods coordinate complicated data sets with extensive annotation. Nonetheless, the basic model for object manipulation in R involves vectorized in-memory representations. For this reason, particular programming paradigms (e.g.: block processing of data streams; explicit parallelism) or hardware resources (e.g.: large-memory computers) are sometimes required when dealing with extensive data.
R is ideally suited to addressing the statistical challenges of high-throughput data. Three examples include the development of the ‘RMA’ (`affy`) and other normalization algorithms for microarray pre-processing, the use of moderated \(t\)-statistics for assessing microarray differential expression (`limma`), and the development of negative binomial approaches to estimating the dispersion of read counts, necessary for the appropriate analysis of designed RNA-seq experiments [@AndersHuber2010], [@pmid19910308].
Many of the ‘old school’ aspects of R and Bioconductor facilitate reproducible research. An analysis is often represented as a text-based script. Reproducing the analysis involves re-running the script; adjusting how the analysis is performed involves simple text-editing tasks. Beyond this, R has the notion of a ‘vignette’, which represents an analysis as a LaTeX document with embedded R commands. The R commands are evaluated when the document is built, thus reproducing the analysis. The use of LaTeX, and more recently of simpler markdown languages in conjunction with pandoc to generate HTML-formatted reports or vignettes, means that the symbolic manipulations in the script are augmented with textual explanations and justifications for the approach taken; these include graphical and tabular summaries at appropriate places in the analysis.

R includes facilities for reporting the exact version of R and associated packages used in an analysis so that, if needed, discrepancies between software versions can be tracked down and their importance evaluated. While users often think of R packages as providing new functionality, packages are also used to enhance reproducibility by encapsulating a single analysis. The package can contain data sets, vignette(s) describing the analysis, R functions that might have been written, scripts for key data processing stages, and documentation (via standard R help mechanisms) of what the functions, data, and packages are about.
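As a minimal sketch of this practice (the file name and analysis step here are invented for illustration), a script can end by recording the exact software versions used:

```r
## analysis.R -- hypothetical script ending (illustration only)
x <- rnorm(100)                        # the 'analysis'
write.csv(data.frame(x), "result.csv") # persist the result
sessionInfo()                          # record R and package versions used
```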
The Bioconductor project adopts practices that facilitate reproducibility. Versions of R and Bioconductor are released once and twice each year, respectively. Each Bioconductor release is the result of development, in a separate branch, during the previous six months. The release is built daily against the corresponding version of R on Linux, Mac, and Windows platforms, with an extensive suite of tests performed. The `biocLite` function ensures that each release of R uses the corresponding Bioconductor packages. The user thus has access to stable and tested package versions. R and Bioconductor are effective tools for reproducible research.
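As a sketch of how this works in practice (the package name is chosen only as an example; the `biocLite` mechanism described above was the installer current at the time of writing):

```r
## not evaluated
source("http://bioconductor.org/biocLite.R") # defines the biocLite() installer
biocLite("GenomicRanges")                    # installs the version matching this R release
```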
R and Bioconductor exist on the leading portion of the software life cycle. Contributors are primarily from academic institutions, and are directly involved in novel research activities. New developments are made available in a familiar format, i.e.: the R language, packaging, and build systems. The rich set of facilities in R - e.g.: for advanced statistical analysis or visualization - and the extensive resources in Bioconductor - e.g.: for annotation using third-party data such as Biomart (www.biomart.org [@Kasprzyk:2011p5853]) or UCSC genome browser tracks (http://genome.ucsc.edu/ [@Meyer:2013p5854]) - mean that innovations can be directly incorporated into existing workflows. The ‘development’ branches of R and Bioconductor provide an environment where contributors can explore new approaches without alienating their user base.
R and Bioconductor also fare well in terms of accessibility. The software is freely available. The source code is easily and fully accessible for critical evaluation. The R packaging and check system requires that all functions are documented. Bioconductor requires that each package contain vignettes to illustrate the use of the software. There are very active R and Bioconductor mailing lists for immediate support, and regular training and conference activities for professional development.
quest(2) endQuest()
startQuest("Bioconductor for NGS")
h2("Bioconductor for high-throughput sequence analysis")
The table below enumerates many of the packages available for sequence analysis. The table includes packages for representing sequence-related data (e.g.: `GenomicRanges`, `Biostrings`), as well as for domain-specific analyses such as RNA-seq (e.g.: `edgeR`, `DEXSeq`), ChIP-seq (e.g.: `ChIPpeakAnno`, `DiffBind`), and SNPs and copy number variation (e.g.: `genoset`, `GGtools`, `VariantAnnotation`).
| Concept | Packages |
|---------|----------|
| Data representation | `IRanges`, `GenomicRanges`, `GenomicFeatures`, `Biostrings`, `BSgenome`, `girafe`, `genomeIntervals` |
| Input / output | `ShortRead` (fastq), `Rsamtools` (bam), `rtracklayer` (gff, wig, bed), `VariantAnnotation` (vcf), `R453Plus1Toolbox` (454) |
| Annotation | `biomaRt` [@Durinck:2005p1990], `GenomicFeatures`, `ChIPpeakAnno`, `VariantAnnotation` |
| Alignment | `Biostrings`, `gmapR`, `Rsubread`, `Rbowtie` |
| Visualization | `ggbio`, `Gviz` |
| Quality assessment | `qrqc`, `seqbias`, `ReQON`, `htSeqTools`, `TEQC`, `Rolexa`, `ShortRead` |
| RNA-seq | `BitSeq`, `cqn`, `cummeRbund`, `DESeq` [@AndersHuber2010], `DESeq2` [@Love:2014p6358], `DEXSeq` [@dexseq], `EDASeq`, `edgeR` [@pmid19910308], `gage`, `goseq`, `iASeq`, `tweeDEseq` |
| ChIP-seq | `BayesPeak`, `baySeq`, `ChIPpeakAnno`, `chipseq`, `ChIPseqR`, `ChIPsim`, `CSAR`, `DiffBind`, `MEDIPS`, `mosaics`, `NarrowPeaks`, `nucleR`, `PICS`, `PING`, `REDseq`, `Repitools`, `TSSi` |
| Motifs | `BCRANK`, `cosmo`, `cosmoGUI`, `MotIV`, `seqLogo`, `rGADEM` |
| 3C | `HiTC`, `r3Cseq` |
| Copy number | `cn.mops`, `CNAnorm`, `exomeCopy`, `segmentSeq` |
| Microbiome | `phyloseq`, `DirichletMultinomial` [@Holmes2012], `clstutils`, `manta`, `mcaGUI` |
| Workflows | `ArrayExpressHTS`, `Genominator`, `easyRNASeq`, `oneChannelGUI`, `QuasR`, `rnaSeqMap` [@Lesniewska:2011p4758] |
| Database | `SRAdb` |
quest(3) endQuest()
startQuest("R basics")
h2("R")
R is an open-source statistical programming language. It is used to manipulate data, to perform statistical analysis, and to present graphical and other results. R consists of a core language, additional ‘packages’ distributed with the R language, and a very large number of packages contributed by the broader community. Packages add specific functionality to an R installation. R has become the primary language of academic statistical analysis, and is widely used in diverse areas of research, government, and industry.
R has several unique features. It has a surprisingly ‘old school’ interface: users type commands into a console; scripts in plain text represent workflows; tools other than R are used for editing and other tasks. R is a flexible programming language, so while one person might use functions provided by R to accomplish advanced analytic tasks, another might implement their own functions for novel data types. As a programming language, R adopts syntax and grammar that differ from many other languages: objects in R are ‘vectors’, and functions are ‘vectorized’ to operate on all elements of the object; R objects have ‘copy on change’ and ‘pass by value’ semantics, reducing unexpected consequences for users at the expense of less efficient memory use; common paradigms in other languages, such as the ‘for’ loop, are encountered much less commonly in R. Many authors contribute to R, so there can be a frustrating inconsistency of documentation and interface. R grew up in the academic community, so authors have not shied away from trying new approaches. Common statistical analysis functions are very well-developed.
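A brief illustration of two of these points, vectorization and ‘copy on change’ semantics:

```r
x <- 1:5
x * 2       # vectorized: every element is multiplied, no 'for' loop needed
y <- x      # 'y' behaves as an independent copy...
y[1] <- 10L
x           # ...so 'x' is unchanged
```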
h3("R data types")
Opening an R session results in a prompt. The user types instructions at the prompt. Here is an example:
```r
## assign values 5, 4, 3, 2, 1 to variable 'x'
x <- c(5, 4, 3, 2, 1)
x
```
The first line starts with a `#` to represent a comment; the line is ignored by R. The next line creates a variable `x`. The variable is assigned (using `<-`; we could have used `=` almost interchangeably) a value. The value assigned is the result of a call to the `c` function. That it is a function call is indicated by the symbol name followed by parentheses, `c()`. The `c` function takes zero or more arguments, and returns a vector. The vector is the value assigned to `x`. R responds to this line with a new prompt, ready for the next input. The next line asks R to display the value of the variable `x`. R responds by printing `[1]` to indicate that the subsequent number is the first element of the vector. It then prints the value of `x`.
R has many features to aid common operations. Entering sequences is a very common operation, and expressions of the form `2:4` create a sequence from 2 to 4. Sub-setting one vector by another is enabled with `[`. Here we create an integer sequence from 2 to 4, and use the sequence as an index to select the second, third, and fourth elements of `x`:

```r
x[2:4]
```
Index values can be repeated, and indices outside the domain of `x` return the special value `NA`. Negative index values remove elements from the vector. Logical and character vectors (described below) can also be used for sub-setting.
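The following short examples illustrate these behaviors, continuing with the vector `x` defined above:

```r
x[c(1, 1, 2)]  # index values can be repeated
x[10]          # outside the domain of 'x': NA
x[-(2:4)]      # negative indices remove elements
x[x > 3]       # logical subsetting
```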
R functions operate on variables. Functions are usually vectorized, acting on all elements of their argument and obviating the need for explicit iteration. Functions can generate warnings when performing suspect operations, or errors if evaluation cannot proceed; try `log(-1)`.

```r
log(x)
```
h4("Essential data types")
R has a number of standard data types, to represent `integer`, `numeric` (floating point), `complex`, `character`, `logical` (Boolean), and `raw` (byte) data. It is possible to convert between data types, and to discover the type or mode of a variable.

```r
c(1.1, 1.2, 1.3)       # numeric
c(FALSE, TRUE, FALSE)  # logical
c("foo", "bar", "baz") # character, single or double quote ok
as.character(x)        # convert 'x' to character
typeof(x)              # the number 5 is numeric, not integer
typeof(2L)             # append 'L' to force integer
typeof(2:4)            # ':' produces a sequence of integers
```
R includes data types particularly useful for statistical analysis, including `factor` to represent categories and `NA` (usable in any vector) to represent missing values.

```r
sex <- factor(c("Male", "Female", NA), levels=c("Female", "Male"))
sex
```
h4("Lists, data frames, and matrices")
All of the vectors mentioned so far are homogeneous, consisting of a single type of element. A `list` can contain a collection of different types of elements and, like all vectors, these elements can be named to create a key-value association.

```r
lst <- list(a=1:3, b=c("foo", "bar"), c=sex)
lst
```
Lists can be subset like other vectors to get another list, or subset with `[[` to retrieve the actual list element; as with other vectors, sub-setting can use names.

```r
lst[c(3, 1)]  # another list
lst[["a"]]    # the element itself, selected by name
```
A `data.frame` is a list of equal-length vectors, representing a rectangular data structure not unlike a spreadsheet. Each column of the data frame is a vector, so data types must be homogeneous within a column. A `data.frame` can be subset by row or column using `[,]`, and columns can be accessed with `$` or `[[`. In the `[,]` notation, the row subset is specified before the comma and the column subset after it, i.e.: `df[i, j]`. Either subset can be left empty.

```r
df <- data.frame(age=c(27L, 32L, 19L),
                 sex=factor(c("Male", "Female", "Male")))
df
df[c(1, 3),]
df[df$age > 20,]
```
A `matrix` is also a rectangular data structure, but subject to the constraint that all elements are the same type. A matrix is created by taking a vector, and specifying the number of rows or columns the vector is to represent. On sub-setting, R coerces a single-column `data.frame` or a single row or column of a `matrix` to a vector if possible; use `drop=FALSE` to stop this behavior.

```r
m <- matrix(1:12, nrow=3)
m
m[c(1, 3), c(2, 4)]
m[, 3]
m[, 3, drop=FALSE]
```
An `array` is a data structure for representing homogeneous, rectangular data in higher dimensions.
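A brief sketch of `array` creation and sub-setting:

```r
a <- array(1:24, dim=c(2, 3, 4)) # 2 rows, 3 columns, 4 'slices'
dim(a)
a[1, 2, 3]                       # a single element
a[, , 1]                         # the first slice, a 2 x 3 matrix
```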
h4("S3 and S4 classes")
More complicated data structures are represented using the ‘S3’ or ‘S4’ object system. Objects are often created by functions (for example, `lm`, below), with parts of the object extracted or assigned using accessor functions. The following generates 1000 random normal deviates as `x`, and uses these to create another 1000 deviates `y` that are linearly related to `x` but with some error. We fit a linear regression using a ‘formula’ to describe the relationship between variables, summarize the results in a familiar ANOVA table, and access `fit` (an S3 object) for the residuals of the regression, using these as input first to the `var` (variance) and then `sqrt` (square-root) functions. Objects can be interrogated for their class.

```r
x <- rnorm(1000, sd=1)
y <- x + rnorm(1000, sd=.5)
fit <- lm(y ~ x)       # formula describes linear regression
fit                    # an 'S3' object
anova(fit)
sqrt(var(resid(fit))) # residuals accessor and subsequent transforms
class(fit)
```
Many Bioconductor packages implement S4 objects to represent data. S3 and S4 systems are quite different from a programmer’s perspective, but fairly similar from a user’s perspective: both systems encapsulate complicated data structures, and allow for methods specialized to different data types; accessors are used to extract information from the objects.
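As a minimal sketch of the S4 machinery from the user's perspective (the class, slots, and accessor below are invented purely for illustration; Bioconductor classes are far richer):

```r
library(methods)
## hypothetical S4 class, for illustration only
setClass("MiniExpt", representation(values="numeric", samples="character"))
setGeneric("values", function(x) standardGeneric("values"))
setMethod("values", "MiniExpt", function(x) x@values) # accessor method
expt <- new("MiniExpt", values=c(1.2, 3.4), samples=c("s1", "s2"))
values(expt) # extract data via the accessor rather than '@' directly
class(expt)
```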
h4("Functions")
R functions accept arguments, and return values. Arguments can be required or optional. Some functions may take a variable number of arguments, e.g.: the columns in a `data.frame`.

```r
y <- 5:1
log(y)
args(log)        # arguments 'x' and 'base'; see ?log
log(y, base=2)   # 'base' is optional, with default value
try(log())       # 'x' required; 'try' continues even on error
args(data.frame) # ... represents a variable number of arguments
```
Arguments can be matched by name or position. If an argument appears after `...`, it must be named.

```r
log(base=2, y) # match argument 'base' by name, 'x' by position
```
A function such as `anova` is a generic that provides an overall signature but dispatches the actual work to the method corresponding to the class(es) of the arguments used to invoke the generic. A generic may have fewer arguments than a method, as with the S3 function `anova` and its method `anova.glm`.

```r
args(anova)
args(stats:::anova.glm)
```
The `...` argument in the `anova` generic means that additional arguments are possible; the `anova` generic hands these arguments to the method it dispatches to. Note that in the previous example the notation `library:::function` is necessary, as the `anova.glm` function is only present within the `stats` package namespace (i.e.: the package self-environment) and is not exported to the user environment. See the Packages section below for more information.
quest(4) endQuest()
startQuest("R functions")
h3("Useful functions")
R has a very large number of functions. The following is a brief list of those that might be commonly used and particularly useful.
- `dir`, `read.table` (and friends), `scan`: List files in a directory, read spreadsheet-like data into R, efficiently read homogeneous data (e.g.: a file of numeric values) to be represented as a matrix.
- `c`, `factor`, `data.frame`, `matrix`: Create a vector, factor, data frame or matrix.
- `summary`, `table`, `xtabs`: Summarize, create a table of the number of times elements occur in a vector, cross-tabulate two or more variables.
- `t.test`, `aov`, `lm`, `anova`, `chisq.test`: Basic comparison of two groups (`t.test`), or of several groups via analysis of variance / linear models (`aov` output is probably more familiar to biologists), or compare simpler with more complicated models (`anova`); chi-squared tests.
- `dist`, `hclust`: Cluster data.
- `plot`: Plot data.
- `ls`, `str`, `library`, `search`: List objects in the current (or specified) workspace, or peek at the structure of an object; add a library to, or describe, the search path of attached packages.
- `lapply`, `sapply`, `mapply`, `aggregate`: Apply a function to each element of a list (`lapply`, `sapply`), to elements of several lists (`mapply`), or to elements of a list partitioned by one or more factors (`aggregate`).
- `with`: Conveniently access columns of a data frame or other element without having to repeat the name of the data frame.
- `match`, `%in%`: Report the index or existence of elements from one vector that match another.
- `split`, `cut`: Split one vector by an equal-length factor, cut a single vector into intervals encoded as levels of a factor.
- `strsplit`, `grep`, `sub`: Operate on character vectors, splitting them into distinct fields, searching for the occurrence of patterns using regular expressions (see `?regex`), or substituting a string for a regular expression.
- `install.packages`: Install a package from an on-line repository into your R installation.
- `traceback`, `debug`, `browser`: Report the sequence of functions under evaluation at the time of the error; enter a debugger when a particular function or statement is invoked.
- `?`, `example`: See the help pages (e.g.: `?lm`) and examples (`example(match)`) for each of these functions.

A few of these functions are exercised in the sketch below.
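A self-contained sketch exercising a few of the functions above on a toy vector:

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
summary(x)                    # numeric summary
table(x > 3)                  # tabulate a logical condition
match(c(4, 9), x)             # positions of first matches: 3, 6
c(4, 7) %in% x                # TRUE FALSE
sapply(split(x, x > 3), mean) # split by a factor, then group means
strsplit("a,b;c", "[,;]")     # split on a regular expression
```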
h4("Exemplary use case")
This example uses data describing 128 microarray samples as a basis for exploring R functions. Covariates such as age, sex, type, stage of the disease, etc., are in the data file pData.csv.
The following command creates a variable that is the location of a comma-separated value (‘csv’) file to be used in the exercise. A csv file can be created using, e.g.: ‘Save as...’ in spreadsheet software.
```r
pdataFile <- system.file(package="RnaSeqTutorial", "extdata", "pData.csv")
```
Input the csv file using `read.table`, assigning the input to a variable `pdata`. Use `dim` to find out the dimensions (number of rows, number of columns) in the object. Are there 128 rows? Use `names` or `colnames` to list the columns' names. Use `summary` to summarize each column of the data. What are the data types of each column in the data frame?
quest(5) endQuest()
startQuest("R functions part 2")
h4("Exemplary use case (c'ed)")
```r
pdata <- read.table(pdataFile)
dim(pdata)
names(pdata)
summary(pdata)
```
Let's move on to the next topic: columns of a data frame can be accessed with `[[` or `$`. Pause to explain to your neighbor why this sub-setting works. Since a data frame is a list, use `sapply` to ask about the class of each column in the data frame. Explain to your neighbor what this produces, and why. The solution is that a data frame can be subset as if it were a matrix, or as a list of column vectors.

```r
head(pdata[,"sex"], 3)
head(pdata$sex, 3)
head(pdata[["sex"]], 3)
sapply(pdata, class)
```
Use `table` to summarize the number of males and females in the sample. Consult the help page `?table` to figure out the additional arguments required to include `NA` values in the tabulation. The number of males and females, including `NA`, is

```r
table(pdata$sex, useNA="ifany")
```
An alternative version of this uses the `with` function:

```r
with(pdata, table(sex, useNA="ifany"))
```
The `mol.biol` column summarizes molecular biological attributes of each sample. Use `table` to summarize the different molecular biology levels in the sample. The `mol.biol` column contains the following samples:

```r
with(pdata, table(mol.biol, useNA="ifany"))
```
Use `%in%` to create a logical vector of the samples that are either `BCR/ABL` or `NEG`. A logical vector indicating that the corresponding row is either `BCR/ABL` or `NEG` is constructed as

```r
ridx <- pdata$mol.biol %in% c("BCR/ABL", "NEG")
```
We can get a sense of the number of rows selected, i.e.: the samples that are either `BCR/ABL` or `NEG`, via `table` or `sum` (discuss with your neighbor what `sum` does, and why the answer is the same as the number of `TRUE` values in the result of the `table` function).

```r
table(ridx)
sum(ridx)
```
What are the levels of the `mol.biol` column? Set the levels to be `BCR/ABL` and `NEG`, i.e.: the levels in the subset. The original data frame can be subset to contain only `BCR/ABL` or `NEG` samples using the logical vector that we created.

```r
pdata1 <- pdata[ridx,]
```
The levels of each factor reflect the levels in the original object, rather than the levels in the subset object, e.g.:

```r
levels(pdata$mol.biol)
```

These can be re-coded by updating the new data frame to contain a factor with the desired levels.

```r
pdata1$mol.biol <- factor(pdata1$mol.biol)
table(pdata1$mol.biol)
```
Use `t.test` to assess whether `BCR/ABL` and `NEG` individuals have similar ages. To do this, use a formula that describes the response `age` in terms of the predictor `mol.biol`. If age is not independent of molecular biology, what complications might this introduce into subsequent analysis? Use a `boxplot` to represent age depending on the molecular biology. To ask whether age differs between molecular biology types, we use the formula `age ~ mol.biol` (‘age as a function of molecular biology’) to describe the relationship that we wish to test.

```r
with(pdata1, t.test(age ~ mol.biol))
```
This summary can be visualized with, e.g.: the `boxplot` function.

```r
boxplot(age ~ mol.biol, pdata1)
```
For a fancier plot, see the following, which uses some additional arguments of the `boxplot` function.

```r
boxplot(age ~ mol.biol, pdata1, notch=TRUE, ylab="age (yr)",
        main="Age distribution by genotype", xlab="genotype")
```
quest(6) endQuest()
startQuest("Essential R")
h2("Packages")
Packages provide functionality beyond that available in base R. There are almost 5,500 packages on CRAN (the Comprehensive R Archive Network) and more than 800 Bioconductor software packages. Packages are contributed by diverse members of the community; they vary in quality (many are excellent) and sometimes contain idiosyncratic aspects to their implementation. The following table outlines key base packages and selected contributed packages; see a local CRAN mirror (including the task views summarizing packages in different domains) and Bioconductor for additional contributed packages.
| Package | Description |
|---------|-------------|
| `base` | Data input and essential manipulation; scripting and programming concepts. |
| `stats` | Essential statistical and plotting functions. |
| `lattice`, `ggplot2` | Approaches to advanced graphics. |
| `methods` | ‘S4’ classes and methods. |
| `parallel` | Facilities for parallel evaluation. |
The `lattice` package illustrates the value packages add to base R. `lattice` is distributed with R but not loaded by default. It provides a very expressive way to visualize data. The following example plots yield for a number of barley varieties, conditioned on site and grouped by year. The figure is read from the lower left corner. Note the common scales, efficient use of space, and not-too-pleasing default color palette. The Morris sample appears to be mis-labeled for ‘year’, an apparent error in the original data. Find out about the built-in data set used in this example with `?barley`.

```r
library(lattice)
plt <- dotplot(variety ~ yield | site, data = barley, groups = year,
               xlab = "Barley Yield (bushels/acre)", ylab = NULL,
               key = simpleKey(levels(barley$year), space = "top", columns = 2),
               aspect = 0.5, layout = c(2, 3))
print(plt)
```
New packages can be added to an R installation using `install.packages`. A package is installed only once per R installation, but needs to be loaded (with `library`) in each session in which it is used. Loading a package also loads any package that it depends on. Packages loaded in the current session are displayed with `search`. The ordering of packages returned by `search` represents the order in which the global environment (where commands entered at the prompt are evaluated) and attached packages are searched for symbols; it is possible for a package earlier in the search path to mask symbols later in the search path; these can be disambiguated using `::`.

```r
length(search())
search()
base::log(1:3)
```
Use the `library` function to load the `RnaSeqTutorial` package. Use the `sessionInfo` function to verify that you are using R version 3.2.0 and current packages, similar to those reported here. What other packages were loaded along with `RnaSeqTutorial`?

```r
library(RnaSeqTutorial)
sessionInfo()
```
h3("Help")
Find help using the R help system. Start a web browser with:
```r
help.start()
```
The ‘Search Engine and Keywords’ link is helpful in day-to-day use.
h4("Manual pages")
Use manual pages to find detailed descriptions of the arguments and return values of functions, and the structure and methods of classes. Find help within an R session as:
```r
?data.frame
?lm
?anova    # a generic function
?anova.lm # an S3 method, specialized for 'lm' objects
```
S3 methods can be queried interactively:

```r
methods(anova)
methods(class="glm")
```
It is often useful to view a method definition, either by typing the method name at the command line or, for ‘non-visible’ methods, using `getAnywhere`:

```r
ls                         # typing a function name prints its definition
getAnywhere("anova.loess") # retrieve a non-visible method
```
Here we discover that the function `head` (which returns the first 6 elements of anything), defined in the `utils` package, is an S3 generic (indicated by `UseMethod`) and has several methods. We use `head` to look at the first six lines of the `head` method specialized for `matrix` objects.

```r
utils::head
methods(head)
head(head.matrix)
```
S4 classes and generics are queried in a similar way to S3 classes and generics, but with different syntax, as for the `complement` generic in the `Biostrings` package:

```r
library(Biostrings)
showMethods(complement)
```
(Most) methods defined on the `DNAStringSet` class of `Biostrings` and available on the current search path can be found with

```r
showMethods(class="DNAStringSet", where=search())
```
Obtaining help on S4 classes and methods requires syntax such as

```r
class ? DNAStringSet
method ? "complement,DNAStringSet"
```
The specification of method and class in the latter must not contain a space after the comma. The definition of a method can be retrieved as
```r
selectMethod(complement, "DNAStringSet")
```
h4("Vignettes")
Vignettes, especially in Bioconductor packages, provide an extensive narrative describing overall package functionality. Use

```r
vignette(package="RnaSeqTutorial")
```

to see, in your web browser, vignettes available in the `RnaSeqTutorial` package. Vignettes usually consist of text with embedded R code, a form of literate programming. The vignette can be read as a PDF document, while the R source code is present as a script file ending with extension `.R`. The script file can be sourced or copied into an R session to evaluate exactly the commands used in the vignette.
Use the following to open the vignette of interest:
vignette("RnaSeqTutorial")
Scavenger hunt. Spend five minutes tracking down the following information.

- The package containing the `library` function.
- The author of the `alphabetFrequency` function, defined in the `Biostrings` package.
- A description of the `GAlignments` class.
- The number of vignettes in the `GenomicRanges` package.

Possible solutions are found with the following R commands:

```r
?library
library(Biostrings)
?alphabetFrequency
class?GAlignments
vignette(package="GenomicRanges")
```
h3("Efficient scripts")
There are often many ways to accomplish a result in R, but these different ways often have very different speed or memory requirements. For small data sets these performance differences are not that important, but for large data sets (e.g.: high-throughput sequencing; genome-wide association studies, GWAS) or complicated calculations (e.g.: bootstrapping) performance can be important. There are several approaches to achieving efficient R programming.
h4("Easy solutions")
Several common performance bottlenecks often have easy solutions; these are outlined here.
Text files often contain more information, for example 1000's of individuals at millions of SNPs, than is required, e.g.: during algorithm development, when only a subset of the data is needed. Reading in all the data can be demanding in terms of both memory and time. A solution is to use arguments such as `colClasses` to specify the columns and their data types that are required, and to use `nrows` to limit the number of rows input. For example, the following ignores the first and fourth columns, reading in only the second and third (as types `integer` and `numeric`).

```r
## not evaluated
colClasses <- c("NULL", "integer", "numeric", "NULL")
df <- read.table("myfile", colClasses=colClasses)
```
R is vectorized, so traditional programming `for` loops are often not necessary. Rather than calculating 100000 random numbers one at a time, or squaring each element of a vector, or iterating over rows and columns in a matrix to calculate row sums, invoke the single function that performs each of these operations.

```r
x <- runif(100000)
x2 <- x^2
m <- matrix(x2, nrow=1000)
y <- rowSums(m)
```
This often requires a change of thinking, turning the sequence of operations ‘inside-out’. For instance, calculate the log of the square of each element of a vector by calculating the square of all elements, followed by the log of all elements: `x2 <- x^2; x3 <- log(x2)`, or simply `logx2 <- log(x^2)`.
It may sometimes be natural to formulate a problem as a `for` loop, or the formulation of the problem may require that a `for` loop be used. In these circumstances the appropriate strategy is to pre-allocate the result object, and to fill it in during loop iteration.

```r
## not evaluated
result <- numeric(nrow(df))
for (i in seq_len(nrow(df)))
    result[[i]] <- some_calc(df[i,])
```
Some R operations are helpful in general, but misleading or inefficient in particular circumstances. An example is the behavior of `unlist` when the list is named: R creates new names that have been made unique. This can be confusing (e.g.: when Entrez gene identifiers are ‘mangled’ to unintentionally look like other identifiers) and expensive (when a large number of new names need to be created). Avoid creating unnecessary names, e.g.:

```r
unlist(list(a=1:2))                  # name 'a' becomes 'a1', 'a2'
unlist(list(a=1:2), use.names=FALSE) # no names
```
Names can be very useful for avoiding book-keeping errors, but are inefficient for repeated look-ups; use vectorized access or numeric indexing.
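An illustrative timing of this point, comparing element-by-element name look-up with a single vectorized access (numbers will vary by machine):

```r
x <- setNames(runif(10000), paste0("id", seq_len(10000)))
ids <- sample(names(x), 1000)
system.time(for (i in ids) x[i])     # repeated name look-ups
system.time(x[ids])                  # one vectorized access
system.time(x[match(ids, names(x))]) # numeric indexing via 'match'
```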
h4("Moderate solutions")
Several solutions to inefficient code require greater knowledge to implement.
Using appropriate functions can greatly influence performance; it takes experience to know when an appropriate function exists. For instance, the `lm` function could be used to assess differential expression of each gene on a microarray, but the `limma` package implements this operation in a way that takes advantage of the experimental design that is common to each probe on the microarray, and does so in a very efficient manner.

```r
## not evaluated
library(limma) # microarray linear models
fit <- lmFit(eSet, design)
```
Using appropriate algorithms can have significant performance benefits, especially as data becomes larger. This solution requires moderate skills, because one has to be able to think about the complexity (e.g.: expected number of operations) of an algorithm, and to identify algorithms that accomplish the same goal in fewer steps. For example, a naive way of identifying which of 100 numbers are in a set of size 10 might look at all 100 combinations of numbers (i.e.: polynomial time), but a faster way is to create a ‘hash’ table of one of the set of elements and probe that for each of the other elements (i.e.: linear time). The latter strategy is illustrated with
```r
x <- 1:100
s <- sample(x, 10)
inS <- x %in% s
```
R is an interpreted language, and for very challenging computational problems it may be appropriate to write critical stages of an analysis in a compiled language like C or Fortran, or to use an existing programming library (e.g.: the BOOST graph library) that efficiently implements advanced algorithms. R has a well-developed interface to C or Fortran, so it is ‘easy’ to do this. This places a significant burden on the person implementing the solution, requiring knowledge of two or more computer languages and of the interface between them.
h4("Measuring performance")
When trying to improve performance, one wants to ensure (a) that the new code is actually faster than the previous code, and (b) both solutions arrive at the same, correct, answer.
The `system.time` function is a straight-forward way to measure the length of time a portion of code takes to evaluate. Here we see that the use of `apply` to calculate row sums of a matrix is much less efficient than the specialized `rowSums` function.

```r
m <- matrix(runif(200000), 20000)
replicate(5, system.time(apply(m, 1, sum))[[1]])
replicate(5, system.time(rowSums(m))[[1]])
```
Usually it is appropriate to replicate timings to average over vagaries of system use, and to shuffle the order in which timings of alternative algorithms are calculated to avoid artifacts such as initial memory allocation.
Speed is an important metric, but equivalent results are also needed.
The functions `identical` and `all.equal` provide different levels of assessing equivalence, with `all.equal` providing the ability to ignore some differences, e.g.: in the names of vector elements.

```r
res1 <- apply(m, 1, sum)
res2 <- rowSums(m)
identical(res1, res2)
identical(c(1, -1), c(x=1, y=-1))
all.equal(c(1, -1), c(x=1, y=-1), check.attributes=FALSE)
```
Two additional functions for assessing performance are `Rprof` and `tracemem`; these are mentioned only briefly here. The `Rprof` function profiles R code, presenting a summary of the time spent in each part of several lines of R code. It is useful for gaining insight into the location of performance bottlenecks when these are not readily apparent from direct inspection. Memory management, especially copying large objects, can frequently contribute to poor performance. The `tracemem` function allows one to gain insight into how R manages memory; insights from this kind of analysis can sometimes be useful in restructuring code into a more efficient sequence.
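A minimal sketch of an `Rprof` session (the file name is arbitrary):

```r
## not evaluated
Rprof("profile.out")        # start profiling to a file
m <- matrix(runif(2e5), 2000)
res <- apply(m, 1, sum)     # code whose bottlenecks we want to locate
Rprof(NULL)                 # stop profiling
summaryRprof("profile.out") # time spent in each function
```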
h3("Warnings, errors, and debugging")
R signals unexpected results through warnings and errors. Warnings occur when the calculation produces an unusual result that nonetheless does not preclude further evaluation. For instance `log(-1)` results in a value `NaN` (‘not a number’) that allows computation to continue, but at the same time signals a warning.

```
> log(-1)
[1] NaN
Warning message:
In log(-1) : NaNs produced
```
Errors result when the inputs or outputs of a function are such that no further action can be taken, e.g.: trying to take the square root of a character vector.

```
> sqrt("two")
Error in sqrt("two") : Non-numeric argument to mathematical function
```
Warnings and errors occurring at the command prompt are usually easy to diagnose. They can be more enigmatic when occurring in a function, and exacerbated by sometimes cryptic (when read out of context) error messages.
An initial step in coming to terms with errors is to simplify the problem as much as possible, aiming for a ‘reproducible’ error. The reproducible error might involve a very small (even trivial) data set that immediately provokes the error. Often the process of creating a reproducible example helps to clarify what the error is, and what possible solutions might be.
Invoking `traceback()` immediately after an error occurs provides a ‘stack’ of the function calls that were in effect when the error occurred. This can help understand the context in which the error occurred. Knowing the context, one might use `debug` to enter into a browser (see `?browser`) that allows one to step through the function in which the error occurred.
It can sometimes be useful to use global options (see `?options`) to influence what happens when an error occurs. Two common global options are `error` and `warn`. Setting `error=recover` combines the functionality of `traceback` and `debug`, allowing the user to enter the browser at any level of the call stack in effect at the time the error occurred. Default error behavior can be restored with `options(error=NULL)`. Setting `warn=2` causes warnings to be promoted to errors. For instance, initial investigation of an error might show that the error occurs when one of the arguments to a function has value `NaN`. The error might be accompanied by a warning message that the `NaN` has been introduced, but because warnings are by default not reported immediately it is not clear where the `NaN` comes from. `warn=2` means that the warning is treated as an error, and hence can be debugged using `traceback`, `debug`, and so on.
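A short sketch of these options in use:

```r
## not evaluated
options(warn=2)             # promote warnings to errors...
log(-1)                     # ...so the NaN warning now stops evaluation
traceback()                 # where did it occur?
options(error=recover)      # on error, browse any frame of the call stack
options(warn=0, error=NULL) # restore default behavior
```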
Additional useful debugging functions include `browser`, `trace`, and `setBreakpoint`.
h3("Asking for help")
As already mentioned, help can be sought on mailing lists. Address questions about R to the R mailing lists (http://www.r-project.org) and Bioconductor questions to the Bioconductor mailing lists (http://bioconductor.org/help/mailing-list/). When posting, include a small reproducible example and the output of `sessionInfo`; providing this information will speed up help and the resolution of the issue, if there is one.
h3("Resources")
h4("Publications and Books")
Dalgaard [@R:Dalgaard:2008] provides an introduction to statistical analysis with R.
Kabacoff [@R:Kabacoff:2010] provides a broad survey of R.
Matloff [@Matloff2011] introduces R programming concepts.
Chambers [@R:Chambers:2008] provides more advanced insights into R.
Gentleman [@R:Gentleman:2008b] emphasizes use of R for bioinformatic programming tasks.
The R web site enumerates additional publications from the user community.
h4("Integrated Development Environment (IDE)")
RStudio provides a nice, cross-platform environment for working in R.
h4("Other resources")
- Bioconductor: http://bioconductor.org
- Quick-R: http://www.statmethods.net
- R web site: http://r-project.org
quest(7) endQuest()