startQuest("Bioconductor website")
    h1("Introduction to R / Bioconductor")

This chapter[^3] introduces R and Bioconductor in detail. During the course we will only skim through it, as you are expected to be familiar with most of the concepts presented here. If that were not the case - and even if it were - consider it homework.

  h2("Bioconductor")

Bioconductor is a collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor started more than 10 years ago and gained credibility for its statistically rigorous approach to microarray pre-processing and analysis of designed experiments, and for its integrative and reproducible approaches to bioinformatic tasks. There are now more than 800 Bioconductor packages for expression and other microarrays, sequence analysis, flow cytometry, imaging, and other domains. The Bioconductor web site provides installation instructions, a package repository, help, and other documentation.

The Bioconductor web site is at bioconductor.org. Features include:

quest(1)
endQuest()
startQuest("Bioconductor rationale")
    h2("Statistical programming")

You will have realized that all the links above are useful ones (and one is a shameless self-citation).

Many academic and commercial software products are available; why would one use R and Bioconductor? One answer comes from considering the demands that high-throughput genomic data place on effective computational biology software.

    h3("Effective computational biology software")
  1. High-throughput questions make use of large data sets. This applies both to the primary data (microarray expression values, sequenced reads, etc.) and also to the annotations on those data (coordinates of genes and features such as exons or regulatory regions; participation in biological pathways, etc.). Large data sets place demands on our tools that preclude some standard approaches, such as spreadsheets. Likewise, intricate relationships between data and annotation, and the diversity of research questions, require the flexibility typical of a programming language rather than a narrowly-enabled graphical user interface.

  2. Analysis of high-throughput data is necessarily statistical. The volume of data requires that it be appropriately summarized before any sort of comprehension is possible. The data are produced by advanced technologies, and these introduce artifacts (e.g.: probe-specific bias in microarrays; sequence or base calling bias in RNA-seq experiments) that need to be accommodated to avoid incorrect or inefficient inference. Data sets typically derive from designed experiments, requiring a statistical approach both to account for the design and to correctly address the large number of observed values (e.g.: gene expression or sequence tag counts) and small number of samples accessible in typical experiments.

  3. Research needs to be reproducible. Reproducibility is both an ideal of the scientific method, and a pragmatic requirement. The latter comes from the long-term and multi-participant nature of contemporary science. An analysis will be performed for the initial experiment, revisited again during manuscript preparation, and revisited during reviews or in determining next steps. Likewise, analyses typically involve a team of individuals with diverse domains of expertise. Effective collaborations result when it is easy to reproduce, perhaps with minor modifications, an existing result, and when sophisticated statistical or bioinformatic analysis can be effectively conveyed to other group members.

  4. Science moves very quickly. This is driven by the novel questions that are the hallmark of discovery, and by technological innovation and accessibility. Rapidity of scientific development places significant burdens on software, which must also move quickly. Effective software cannot be too polished, because that requires that the correct analyses are ‘known’ and that significant resources of time and money have been invested in developing the software; this implies software that is tracking the trailing edge of innovation. On the other hand, leading-edge software cannot be too idiosyncratic; it must be usable by a wider audience than the creator of the software, and fit in with other software relevant to the analysis.

  5. Effective software must be accessible. Affordability is one aspect of accessibility. Another is transparent implementation, where the novel software is sufficiently documented and the source code accessible enough for the assumptions, approaches, practical implementation decisions, and inevitable coding errors to be assessed by other skilled practitioners. A final aspect of accessibility is that the software is actually usable. This is achieved through adequate documentation, support forums, and training opportunities.

  h3("Bioconductor as effective computational biology software")

What features of R and Bioconductor contribute to their effectiveness as software tools?

  1. Bioconductor is well suited to handle extensive data and annotation. Bioconductor ‘classes’ represent high-throughput data and their annotation in an integrated way. Bioconductor methods use advanced programming techniques or R resources (such as transparent data base or network access) to minimize memory requirements and integrate with diverse resources. Classes and methods coordinate complicated data sets with extensive annotation. Nonetheless, the basic model for object manipulation in R involves vectorized in-memory representations. For this reason, particular programming paradigms (e.g.: block processing of data streams; explicit parallelism) or hardware resources (e.g.: large-memory computers) are sometimes required when dealing with extensive data.

  2. R is ideally suited to addressing the statistical challenges of high-throughput data. Three examples include the development of the ‘RMA’ and other normalization algorithms for microarray pre-processing (e.g.: in the affy package), the use of moderated \(t\)-statistics for assessing microarray differential expression (the limma package), and the development of negative binomial approaches to estimating the dispersion of read counts, necessary for the appropriate analysis of designed RNA-seq experiments [@AndersHuber2010],[@pmid19910308].

  3. Many of the ‘old school’ aspects of R and Bioconductor facilitate reproducible research. An analysis is often represented as a text-based script. Reproducing the analysis involves re-running the script; adjusting how the analysis is performed involves simple text-editing tasks. Beyond this, R has the notion of a ‘vignette’, which represents an analysis as a LaTeX document with embedded R commands. The R commands are evaluated when the document is built, thus reproducing the analysis. The use of LaTeX, and more recently of simpler markdown languages in conjunction with pandoc to generate HTML formatted reports or vignettes, means that the symbolic manipulations in the script are augmented with textual explanations and justifications for the approach taken; these include graphical and tabular summaries at appropriate places in the analysis. R includes facilities for reporting the exact version of R and associated packages used in an analysis so that, if needed, discrepancies between software versions can be tracked down and their importance evaluated. While users often think of R packages as providing new functionality, packages are also used to enhance reproducibility by encapsulating a single analysis. The package can contain data sets, vignette(s) describing the analysis, R functions that might have been written, scripts for key data processing stages, and documentation (via standard R help mechanisms) of what the functions, data, and packages are about.

  4. The Bioconductor project adopts practices that facilitate reproducibility. Versions of R and Bioconductor are released once and twice each year, respectively. Each Bioconductor release is the result of development, in a separate branch, during the previous six months. The release is built daily against the corresponding version of R on Linux, Mac, and Windows platforms, with an extensive suite of tests performed. The biocLite function ensures that each release of R uses the corresponding Bioconductor packages. The user thus has access to stable and tested package versions. R and Bioconductor are effective tools for reproducible research.
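
At the time this material targets (R 3.2.0), Bioconductor packages were installed with the biocLite script; a minimal sketch (the package name is just an example):

    ## not evaluated -- sources the installer, then installs a package
    source("http://bioconductor.org/biocLite.R")
    biocLite("GenomicRanges")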

  5. R and Bioconductor exist on the leading portion of the software life cycle. Contributors are primarily from academic institutions, and are directly involved in novel research activities. New developments are made available in a familiar format, i.e.: the R language, packaging, and build systems. The rich set of facilities in R - e.g.: for advanced statistical analysis or visualization - and the extensive resources in Bioconductor - e.g.: for annotation using third-party data such as Biomart (www.biomart.org [@Kasprzyk:2011p5853]) or UCSC genome browser tracks (http://genome.ucsc.edu/ [@Meyer:2013p5854]) - mean that innovations can be directly incorporated into existing workflows. The ‘development’ branches of R and Bioconductor provide an environment where contributors can explore new approaches without alienating their user base.

  6. R and Bioconductor also fare well in terms of accessibility. The software is freely available. The source code is easily and fully accessible for critical evaluation. The R packaging and check system requires that all functions are documented. Bioconductor requires that each package contain vignettes to illustrate the use of the software. There are very active R and Bioconductor mailing lists for immediate support, and regular training and conference activities for professional development.

quest(2)
endQuest()
startQuest("Bioconductor for NGS")
    h2("Bioconductor for high-throughput sequence analysis")

The table below enumerates many of the packages available for sequence analysis. The table includes packages for representing sequence-related data (e.g.: GenomicRanges, Biostrings), as well as domain-specific analyses such as RNA-seq (e.g.: edgeR, DEXSeq), ChIP-seq (e.g.: ChIPpeakAnno, DiffBind), and SNPs and copy number variation (e.g.: genoset, GGtools, VariantAnnotation).

| Concept | Packages |
|---------|----------|
| Data representation | IRanges, GenomicRanges, GenomicFeatures, Biostrings, BSgenome, girafe, genomeIntervals |
| Input / output | ShortRead (fastq), Rsamtools (bam), rtracklayer (gff, wig, bed), VariantAnnotation (vcf), R453Plus1Toolbox (454) |
| Annotation | biomaRt[@Durinck:2005p1990], GenomicFeatures, ChIPpeakAnno, VariantAnnotation |
| Alignment | Biostrings, gmapR, Rsubread, Rbowtie |
| Visualization | ggbio, Gviz |
| Quality assessment | qrqc, seqbias, ReQON, htSeqTools, TEQC, Rolexa, ShortRead |
| RNA-seq | BitSeq, cqn, cummeRbund, DESeq[@AndersHuber2010], DESeq2[@Love:2014p6358], DEXSeq[@dexseq], EDASeq, edgeR[@pmid19910308], gage, goseq, iASeq, tweeDEseq |
| ChIP-seq | BayesPeak, baySeq, ChIPpeakAnno, chipseq, ChIPseqR, ChIPsim, CSAR, DiffBind, MEDIPS, mosaics, NarrowPeaks, nucleR, PICS, PING, REDseq, Repitools, TSSi |
| Motifs | BCRANK, cosmo, cosmoGUI, MotIV, seqLogo, rGADEM |
| 3C | HiTC, r3Cseq |
| Copy number | cn.mops, CNAnorm, exomeCopy, segmentSeq |
| Microbiome | phyloseq, DirichletMultinomial[@Holmes2012], clstutils, manta, mcaGUI |
| Workflows | ArrayExpressHTS, Genominator, easyRNASeq, oneChannelGUI, QuasR, rnaSeqMap[@Lesniewska:2011p4758] |
| Database | SRAdb |

quest(3)
endQuest()
startQuest("R basics")
  h2("R")

R is an open-source statistical programming language. It is used to manipulate data, to perform statistical analysis, and to present graphical and other results. R consists of a core language, additional ‘packages’ distributed with the R language, and a very large number of packages contributed by the broader community. Packages add specific functionality to an R installation. R has become the primary language of academic statistical analysis, and is widely used in diverse areas of research, government, and industry.

R has several unique features. It has a surprisingly ‘old school’ interface: users type commands into a console; scripts in plain text represent workflows; tools other than R are used for editing and other tasks. R is a flexible programming language, so while one person might use functions provided by R to accomplish advanced analytic tasks, another might implement their own functions for novel data types. As a programming language, R adopts syntax and grammar that differ from many other languages: objects in R are ‘vectors’, and functions are ‘vectorized’ to operate on all elements of the object; R objects have ‘copy on change’ and ‘pass by value’ semantics, reducing unexpected consequences for users at the expense of less efficient memory use; common paradigms in other languages, such as the ‘for’ loop, are encountered much less commonly in R. Many authors contribute to R, so there can be a frustrating inconsistency of documentation and interface. R grew up in the academic community, so authors have not shied away from trying new approaches. Common statistical analysis functions are very well-developed.

  h3("R data types")

Opening an R session results in a prompt. The user types instructions at the prompt. Here is an example:

    ## assign values 5, 4, 3, 2, 1 to variable 'x'
    x <- c(5, 4, 3, 2, 1)
    x

The first line starts with a # to represent a comment; the line is ignored by R. The next line creates a variable x. The variable is assigned (using <-; we could have used = almost interchangeably) a value. The value assigned is the result of a call to the c function. That it is a function call is indicated by the symbol name followed by parentheses, c(). The c function takes zero or more arguments, and returns a vector. The vector is the value assigned to x. R responds to this line with a new prompt, ready for the next input. The next line asks R to display the value of the variable x. R responds by printing [1] to indicate that the subsequent number is the first element of the vector. It then prints the value of x.

R has many features to aid common operations. Entering sequences is a very common operation, and expressions of the form 2:4 create a sequence from 2 to 4. Sub-setting one vector by another is enabled with [. Here we create an integer sequence from 2 to 4, and use the sequence as an index to select the second, third, and fourth elements of x.

    x[2:4]

Index values can be repeated, and if outside the domain of x return the special value NA. Negative index values remove elements from the vector. Logical and character vectors (described below) can also be used for sub-setting.
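For example, using the vector x defined above:

    x[c(3, 3)]          # index values can repeat
    x[6]                # outside the domain of 'x': NA
    x[-1]               # negative index removes the first element
    x[x > 3]            # logical sub-setting selects elements 5 and 4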

R functions operate on variables. Functions are usually vectorized, acting on all elements of their argument and obviating the need for explicit iteration. Functions can generate warnings when performing suspect operations, or errors if evaluation cannot proceed; try log(-1).

    log(x)
    h4("Essential data types")

R has a number of standard data types, to represent integer, numeric (floating point), complex, character, logical (Boolean), and raw (byte) data. It is possible to convert between data types, and to discover the type or mode of a variable.

    c(1.1, 1.2, 1.3)         # numeric
    c(FALSE, TRUE, FALSE)    # logical
    c("foo", "bar", "baz")   # character, single or double quote ok
    as.character(x)          # convert 'x' to character
    typeof(x)                # the number 5 is numeric, not integer
    typeof(2L)               # append 'L' to force integer
    typeof(2:4)              # ':' produces a sequence of integers

R includes data types particularly useful for statistical analysis, including factor to represent categories and NA (used in any vector) to represent missing values.

    sex <- factor(c("Male", "Female", NA), levels=c("Female", "Male"))
    sex
    h4("Lists, data frames, and matrices")

All of the vectors mentioned so far are homogeneous, consisting of a single type of element. A list can contain a collection of different types of elements and, like all vectors, these elements can be named to create a key-value association.

    lst <- list(a=1:3, b=c("foo", "bar"), c=sex)
    lst

Lists can be subset like other vectors to get another list, or subset with [[ to retrieve the actual list element; as with other vectors, sub-setting can use names

    lst[c(3, 1)]             # another list
    lst[["a"]]               # the element itself, selected by name

A data.frame is a list of equal-length vectors, representing a rectangular data structure not unlike a spreadsheet. Each column of the data frame is a vector, so data types must be homogeneous within a column. A data.frame can be subset by row or column using [,], and columns can be accessed with $ or [[. In the [,] notation, the row subset comes before the comma and the column subset after it, i.e.: df[rows, columns]. Either subset can be left empty.

    df <- data.frame(age=c(27L, 32L, 19L),
                     sex=factor(c("Male", "Female", "Male")))
    df
    df[c(1, 3),]
    df[df$age > 20,]

A matrix is also a rectangular data structure, but subject to the constraint that all elements are the same type. A matrix is created by taking a vector, and specifying the number of rows or columns the vector is to represent. On sub-setting, R coerces a single column data.frame or single row or column matrix to a vector if possible; use drop=FALSE to stop this behavior.

    m <- matrix(1:12, nrow=3)
    m
    m[c(1, 3), c(2, 4)]
    m[, 3]
    m[, 3, drop=FALSE]

An array is a data structure for representing homogeneous, rectangular data in higher dimensions.
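
For example:

    a <- array(1:24, dim=c(2, 3, 4))   # a 2 x 3 x 4 array
    dim(a)
    a[1, 2, 3]                         # a single element
    a[, , 1]                           # a 2 x 3 matrix 'slice'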

    h4("S3 and S4 classes")

More complicated data structures are represented using the ‘S3’ or ‘S4’ object system. Objects are often created by functions (for example, lm, below), with parts of the object extracted or assigned using accessor functions. The following generates 1000 random normal deviates as x, and uses these to create another 1000 deviates y that are linearly related to x but with some error. We fit a linear regression using a ‘formula’ to describe the relationship between variables, summarize the results in a familiar ANOVA table, and access fit (an S3 object) for the residuals of the regression, using these as input first to the var (variance) and then sqrt (square-root) functions. Objects can be interrogated for their class.

    x <- rnorm(1000, sd=1)
    y <- x + rnorm(1000, sd=.5)
    fit <- lm(y ~ x)       # formula describes linear regression 
    fit                    # an 'S3' object
    anova(fit)
    sqrt(var(resid(fit)))  # residuals accessor and subsequent transforms
    class(fit)

Many Bioconductor packages implement S4 objects to represent data. S3 and S4 systems are quite different from a programmer’s perspective, but fairly similar from a user’s perspective: both systems encapsulate complicated data structures, and allow for methods specialized to different data types; accessors are used to extract information from the objects.
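
For illustration only - the following minimal S4 class is made up for this tutorial, not taken from any Bioconductor package - a class, generic, and accessor method might look like:

    ## an illustrative S4 class -- not from any Bioconductor package
    setClass("MiniSeq",
             representation(seq="character", quality="numeric"))
    setGeneric("quality", function(x) standardGeneric("quality"))
    setMethod("quality", "MiniSeq", function(x) x@quality)  # accessor
    ms <- new("MiniSeq", seq="ACGT", quality=c(30, 32, 28, 35))
    quality(ms)    # use the accessor rather than '@' directly
    class(ms)      # "MiniSeq"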

    h4("Functions")

R functions accept arguments, and return values. Arguments can be required or optional. Some functions may take variable numbers of arguments, e.g.: the columns in a data.frame

    y <- 5:1
    log(y)
    args(log)        # arguments 'x' and 'base'; see ?log
    log(y, base=2)   # 'base' is optional, with default value
    try(log())       # 'x' required; 'try' continues even on error
    args(data.frame) # ... represents variable number of arguments

Arguments can be matched by name or position. If an argument appears after ..., it must be named.

    log(base=2, y)   # match argument 'base' by name, 'x' by position

A function such as anova is a generic that provides an overall signature but dispatches the actual work to the method corresponding to the class(es) of the arguments used to invoke the generic. A generic may have fewer arguments than a method, as with the S3 function anova and its method anova.glm.

    args(anova)
    args(stats:::anova.glm)

The ... argument in the anova generic means that additional arguments are possible; the anova generic hands these arguments to the method it dispatches to. Note that in the previous example, the notation package:::function is necessary because the anova.glm function is only present within the stats package namespace (i.e.: the package's own environment) and is not exported to the user environment. See the section on packages below for more information.

quest(4)
endQuest()
startQuest("R functions")
    h3("Useful functions")

R has a very large number of functions. The following is a brief list of those that might be commonly used and particularly useful.

dir, read.table (and friends), scan

: List files in a directory, read spreadsheet-like data into R, efficiently read homogeneous data (e.g.: a file of numeric values) to be represented as a matrix.

c, factor, data.frame, matrix

: Create a vector, factor, data frame or matrix.

summary, table, xtabs

: Summarize, create a table of the number of times elements occur in a vector, cross-tabulate two or more variables.

t.test, aov, lm, anova, chisq.test

: Basic comparison of two groups (t.test), or of several groups via analysis of variance / linear models (aov output is probably more familiar to biologists), or comparison of simpler with more complicated models (anova); \(\chi^2\) tests (chisq.test).

dist, hclust

: Cluster data.

plot

: Plot data.

ls, str, library, search

: List objects in the current (or specified) workspace, or peek at the structure of an object; add a library to, or describe, the search path of attached packages.

lapply, sapply, mapply, aggregate

: Apply a function to each element of a list (lapply, sapply), to elements of several lists (mapply), or to elements of a list partitioned by one or more factors (aggregate).

with

: Conveniently access columns of a data frame or other element without having to repeat the name of the data frame.

match, %in%

: Report the index or existence of elements from one vector that match another.

split, cut

: Split one vector by an equal length factor, cut a single vector into intervals encoded as levels of a factor.

strsplit, grep, sub

: Operate on character vectors: split them into distinct fields, search for the occurrence of a pattern using regular expressions (see ?regex), or substitute a string for a regular expression match.

install.packages

: Install a package from an on-line repository into your R installation.

traceback, debug, browser

: Report the sequence of functions under evaluation at the time of the error; enter a debugger when a particular function or statement is invoked.

?, example

: See the help pages (e.g.: ?lm) and examples (example(match)) for each of these functions
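
A brief illustration of a few of these functions:

    v <- c(10, 20, 30)
    match(c(20, 30), v)                  # indices: 2 3
    c(20, 99) %in% v                     # TRUE FALSE
    split(1:6, c("a", "b", "a", "b", "a", "b"))   # two groups, 'a' and 'b'
    strsplit("foo,bar,baz", ",")         # split a string on ','
    sub("o", "0", "foo")                 # substitute first match: "f0o"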

  h4("Exemplary use case")

This example uses data describing 128 microarray samples as a basis for exploring R functions. Covariates such as age, sex, type, stage of the disease, etc., are in the data file pData.csv.

The following command creates a variable that is the location of a comma-separated value (‘csv’) file to be used in the exercise. A csv file can be created using, e.g.: ‘Save as...’ in spreadsheet software.

      pdataFile <- system.file(package="RnaSeqTutorial", "extdata", "pData.csv")

Input the csv file using read.table, assigning the input to a variable pdata. Use dim to find out the dimensions (number of rows, number of columns) in the object. Are there 128 rows? Use names or colnames to list the column names. Use summary to summarize each column of the data. What are the data types of each column in the data frame?

quest(5)
endQuest()
startQuest("R functions part 2")
    h4("Exemplary use case (c'ed)")
    pdata <- read.table(pdataFile)  
    dim(pdata)
    names(pdata)
    summary(pdata)

Let's move on to the next topic:

The solution is that a data frame can be subset as if it were a matrix, or as a list of column vectors.

    head(pdata[,"sex"], 3)
    head(pdata$sex, 3)
    head(pdata[["sex"]], 3)
    sapply(pdata, class)

The number of males and females, including NA, is

    table(pdata$sex, useNA="ifany")

An alternative version of this uses the with function: with(pdata, table(sex, useNA="ifany")).

The mol.biol column contains the following samples:

    with(pdata, table(mol.biol, useNA="ifany"))

A logical vector indicating that the corresponding row is either BCR/ABL or NEG is constructed as

    ridx <- pdata$mol.biol %in% c("BCR/ABL", "NEG")

We can get a sense of the number of rows selected via table or sum (discuss with your neighbor what sum does, and why the answer is the same as the number of TRUE values in the result of the table function).

    table(ridx)
    sum(ridx)

The original data frame can be subset to contain only BCR/ABL or NEG samples using the logical vector that we created.

    pdata1 <- pdata[ridx,]

The levels of each factor reflect the levels in the original object, rather than the levels in the subset object, e.g.:

    levels(pdata$mol.biol)

These can be re-coded by updating the new data frame to contain a factor with the desired levels.

    pdata1$mol.biol <- factor(pdata1$mol.biol)
    table(pdata1$mol.biol)

To ask whether age differs between molecular biology types, we use a formula age ~ mol.biol to describe the relationship (‘age as a function of molecular biology’) that we wish to test

    with(pdata1, t.test(age ~ mol.biol))

This summary can be visualized with, e.g.: the boxplot function

    boxplot(age ~ mol.biol, pdata1)

For a fancier plot, see the following that uses some additional arguments of the boxplot function.

    boxplot(age ~ mol.biol, pdata1, notch=TRUE, ylab="age (yr)",
            main="Age distribution by genotype", xlab="genotype")
quest(6)
endQuest()
startQuest("Essential R")
    h2("Packages")

Packages provide functionality beyond that available in base R. There are almost 5,500 packages in CRAN (the comprehensive R archive network) and more than 800 Bioconductor software packages. Packages are contributed by diverse members of the community; they vary in quality (many are excellent) and sometimes contain idiosyncratic aspects to their implementation. The following table outlines key base packages and selected contributed packages; see a local CRAN mirror (including the task views summarizing packages in different domains) and Bioconductor for additional contributed packages.

| Package | Description |
|---------|-------------|
| base | Data input and essential manipulation; scripting and programming concepts. |
| stats | Essential statistical and plotting functions. |
| lattice, ggplot2 | Approaches to advanced graphics. |
| methods | ‘S4’ classes and methods. |
| parallel | Facilities for parallel evaluation. |

The lattice package illustrates the value packages add to base R. lattice is distributed with R but not loaded by default. It provides a very expressive way to visualize data. The following example plots yield for a number of barley varieties, conditioned on site and grouped by year. The following figure is read from the lower left corner. Note the common scales, efficient use of space, and the not-too-pleasing default color palette. The Morris sample appears to be mis-labeled for ‘year’, an apparent error in the original data. Find out about the built-in data set used in this example with ?barley.

    library(lattice)
    plt <- dotplot(variety ~ yield | site, data = barley, groups = year,
                   xlab = "Barley Yield (bushels/acre)" , ylab=NULL,
                   key = simpleKey(levels(barley$year), space = "top", 
                     columns=2),
                   aspect=0.5, layout = c(2,3))
    print(plt)

New packages can be added to an R installation using install.packages. A package is installed only once per R installation, but needs to be loaded (with library) in each session in which it is used. Loading a package also loads any package that it depends on. Packages loaded in the current session are displayed with search. The ordering of packages returned by search represents the order in which the global environment (where commands entered at the prompt are evaluated) and attached packages are searched for symbols; it is possible for a package earlier in the search path to mask symbols later in the search path; these can be disambiguated using ::.

    length(search())
    search()
    base::log(1:3)

Use the library function to load the RnaSeqTutorial package. Use the sessionInfo function to verify that you are using R version 3.2.0 and current packages, similar to those reported here. What other packages were loaded along with RnaSeqTutorial?

    library(RnaSeqTutorial)
    sessionInfo()
    h3("Help")

Find help using the R help system. Start a web browser with:

    help.start()

The ‘Search Engine and Keywords’ link is helpful in day-to-day use.

    h4("Manual pages")

Use manual pages to find detailed descriptions of the arguments and return values of functions, and the structure and methods of classes. Find help within an R session as:

    ?data.frame
    ?lm
    ?anova             # a generic function
    ?anova.lm          # an S3 method, specialized for 'lm' objects

S3 methods can be queried interactively:

    methods(anova)
    methods(class="glm")

It is often useful to view a method definition, either by typing the method name at the command line or, for ‘non-visible’ methods, using getAnywhere:

    ls
    getAnywhere("anova.loess")

Here we discover that the function head (which returns the first 6 elements of anything), defined in the utils package, is an S3 generic (indicated by UseMethod) and has several methods. We use head to look at the first six lines of the head method specialized for matrix objects.

    utils::head
    methods(head)
    head(head.matrix)

S4 classes and generics are queried in a similar way to S3 classes and generics, but with different syntax, as for the complement generic in the Biostrings package:

    library(Biostrings)
    showMethods(complement)

(Most) methods defined on the DNAStringSet class of Biostrings and available on the current search path can be found with

    showMethods(class="DNAStringSet", where=search())

Obtaining help on S4 classes and methods requires syntax such as

    class ? DNAStringSet
    method ? "complement,DNAStringSet"

The specification of method and class in the latter must not contain a space after the comma. The definition of a method can be retrieved as

    selectMethod(complement, "DNAStringSet")
    h4("Vignettes")

Vignettes, especially in Bioconductor packages, provide an extensive narrative describing overall package functionality. Use

    vignette(package="RnaSeqTutorial")

to see, in your web browser, vignettes available in the RnaSeqTutorial package. Vignettes usually consist of text with embedded R code, a form of literate programming. The vignette can be read as a PDF document, while the R source code is present as a script file ending with extension .R. The script file can be sourced or copied into an R session to evaluate exactly the commands used in the vignette.

Use the following to open the vignette of interest:

    vignette("RnaSeqTutorial")

Scavenger hunt. Spend five minutes tracking down the following information.

  1. The package containing the library function.

  2. The author of the alphabetFrequency function, defined in the Biostrings package.

  3. A description of the GAlignments class.

  4. The number of vignettes in the GenomicRanges package.

Possible solutions are found with the following R commands

    ?library
    library(Biostrings)
    ?alphabetFrequency
    class?GAlignments
    vignette(package="GenomicRanges")
    h3("Efficient scripts")

There are often many ways to accomplish a result in R, but these different ways often have very different speed or memory requirements. For small data sets these performance differences are not that important, but for large data sets (e.g.: high-throughput sequencing; genome-wide association studies, GWAS) or complicated calculations (e.g.: bootstrapping) performance can be important. There are several approaches to achieving efficient R programming.

    h4("Easy solutions")

Several common performance bottlenecks often have easy solutions; these are outlined here.

Text files often contain more information than is needed, for example 1000’s of individuals at millions of SNPs when only a subset of the data is required, e.g.: during algorithm development. Reading in all the data can be demanding in terms of both memory and time. A solution is to use arguments such as colClasses to specify the columns and their data types that are required, and to use nrows to limit the number of rows input. For example, the following ignores the first and fourth columns, reading in only the second and third (as type integer and numeric).

    ## not evaluated
    colClasses <- c("NULL", "integer", "numeric", "NULL")
    df <- read.table("myfile", colClasses=colClasses)

R is vectorized, so traditional programming for loops are often not necessary. Rather than calculating 100000 random numbers one at a time, or squaring each element of a vector, or iterating over rows and columns in a matrix to calculate row sums, invoke the single function that performs each of these operations.

    x <- runif(100000); x2 <- x^2
    m <- matrix(x2, nrow=1000); y <- rowSums(m)

This often requires a change of thinking, turning the sequence of operations ‘inside-out’. For instance, calculate the log of the square of each element of a vector by squaring all elements and then taking the log of all elements, x2 <- x^2; logx2 <- log(x2), or simply logx2 <- log(x^2).

It may sometimes be natural to formulate a problem as a for loop, or the formulation of the problem may require that a for loop be used. In these circumstances the appropriate strategy is to pre-allocate the object, and to fill the result in during loop iteration.

    ## not evaluated
    result <- numeric(nrow(df))
    for (i in seq_len(nrow(df)))
        result[[i]] <- some_calc(df[i,])

Some R operations are helpful in general, but misleading or inefficient in particular circumstances. An example is the behavior of unlist when the list is named – R creates new names that have been made unique. This can be confusing (e.g.: when Entrez gene identifiers are ‘mangled’ to unintentionally look like other identifiers) and expensive (when a large number of new names need to be created). Avoid creating unnecessary names, e.g.:

    unlist(list(a=1:2)) # name 'a' becomes 'a1', 'a2'
    unlist(list(a=1:2), use.names=FALSE)   # no names

Names can be very useful for avoiding book-keeping errors, but are inefficient for repeated look-ups; use vectorized access or numeric indexing.
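
A minimal sketch contrasting the two approaches (the names and sizes are illustrative):

    x <- setNames(runif(10000), paste0("gene", seq_len(10000)))
    wanted <- sample(names(x), 1000)
    system.time(for (w in wanted) x[[w]])      # repeated look-up by name
    system.time(x[match(wanted, names(x))])    # single vectorized look-up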

    h4("Moderate solutions")

Several solutions to inefficient code require greater knowledge to implement.

Using appropriate functions can greatly influence performance; it takes experience to know when an appropriate function exists. For instance, the lm function could be used to assess differential expression of each gene on a microarray, but the limma package implements this operation in a way that takes advantage of the experimental design that is common to each probe on the microarray, and does so in a very efficient manner.

    ## not evaluated
    library(limma) # microarray linear models
    fit <- lmFit(eSet, design)

Using appropriate algorithms can have significant performance benefits, especially as data becomes larger. This solution requires moderate skills, because one has to be able to think about the complexity (e.g.: expected number of operations) of an algorithm, and to identify algorithms that accomplish the same goal in fewer steps. For example, a naive way of identifying which of 100 numbers are in a set of size 10 might compare each of the 100 numbers with each member of the set (i.e.: polynomial time), but a faster way is to create a ‘hash’ table of one of the sets of elements and probe it for each of the other elements (i.e.: linear time). The latter strategy is illustrated with

    x <- 1:100; s <- sample(x, 10)
    inS <- x %in% s

R is an interpreted language, and for very challenging computational problems it may be appropriate to write critical stages of an analysis in a compiled language like C or Fortran, or to use an existing programming library (e.g.: the BOOST graph library) that efficiently implements advanced algorithms. R has a well-developed interface to C or Fortran, so it is ‘easy’ to do this. This places a significant burden on the person implementing the solution, requiring knowledge of two or more computer languages and of the interface between them.

    h4("Measuring performance")

When trying to improve performance, one wants to ensure (a) that the new code is actually faster than the previous code, and (b) both solutions arrive at the same, correct, answer.

The system.time function is a straightforward way to measure the length of time a portion of code takes to evaluate. Here we see that the use of apply to calculate row sums of a matrix is much less efficient than the specialized rowSums function.

    m <- matrix(runif(200000), 20000)
    replicate(5, system.time(apply(m, 1, sum))[[1]])
    replicate(5, system.time(rowSums(m))[[1]])

Usually it is appropriate to replicate timings to average over vagaries of system use, and to shuffle the order in which timings of alternative algorithms are calculated to avoid artifacts such as initial memory allocation.

Speed is an important metric, but equivalent results are also needed. The functions identical and all.equal provide different levels of assessing equivalence, with all.equal providing the ability to ignore some differences, e.g.: in the names of vector elements.

    res1 <- apply(m, 1, sum)
    res2 <- rowSums(m)
    identical(res1, res2)
    identical(c(1, -1), c(x=1, y=-1))
    all.equal(c(1, -1), c(x=1, y=-1),
              check.attributes=FALSE)

Two additional functions for assessing performance are Rprof and tracemem; these are mentioned only briefly here. The Rprof function profiles R code, presenting a summary of the time spent in each part of several lines of R code. It is useful for gaining insight into the location of performance bottlenecks when these are not readily apparent from direct inspection. Memory management, especially copying large objects, can frequently contribute to poor performance. The tracemem function allows one to gain insight into how R manages memory; insights from this kind of analysis can sometimes be useful in restructuring code into a more efficient sequence.
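
A minimal sketch of Rprof usage (the output file name is arbitrary, and m is the matrix from the timing example above):

    ## not evaluated -- profile, then summarize time spent per function
    Rprof("profile.out")            # start writing profiling data
    res <- apply(m, 1, sum)         # code being profiled
    Rprof(NULL)                     # stop profiling
    summaryRprof("profile.out")     # time spent in each function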

 h3("Warnings, errors, and debugging")

R signals unexpected results through warnings and errors. Warnings occur when the calculation produces an unusual result that nonetheless does not preclude further evaluation. For instance, log(-1) results in a value NaN (‘not a number’) that allows computation to continue, but at the same time signals a warning

    > log(-1)
    [1] NaN
    Warning message:
    In log(-1) : NaNs produced

Errors result when the inputs or outputs of a function are such that no further action can be taken, e.g.: trying to take the square root of a character vector

    > sqrt("two")
    Error in sqrt("two") : non-numeric argument to mathematical function

Warnings and errors occurring at the command prompt are usually easy to diagnose. They can be more enigmatic when occurring in a function, and exacerbated by sometimes cryptic (when read out of context) error messages.

An initial step in coming to terms with errors is to simplify the problem as much as possible, aiming for a ‘reproducible’ error. The reproducible error might involve a very small (even trivial) data set that immediately provokes the error. Often the process of creating a reproducible example helps to clarify what the error is, and what possible solutions might be.

Invoking traceback() immediately after an error occurs provides a ‘stack’ of the function calls that were in effect when the error occurred. This can help understand the context in which the error occurred. Knowing the context, one might use debug to enter into a browser (see ?browser) that allows one to step through the function in which the error occurred.
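
A minimal sketch of this workflow, using a made-up function f:

    ## not evaluated -- 'f' is a made-up function for illustration
    f <- function(x) log(x) + sqrt(x)
    f("two")        # error: non-numeric argument
    traceback()     # the call stack at the time of the error
    debug(f)        # enter the browser on the next call to 'f'
    f(4)            # step through with 'n', quit with 'Q'
    undebug(f)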

It can sometimes be useful to use global options (see ?options) to influence what happens when an error occurs. Two common global options are error and warn. Setting error=recover combines the functionality of traceback and debug, allowing the user to enter the browser at any level of the call stack in effect at the time the error occurred. Default error behavior can be restored with options(error=NULL). Setting warn=2 causes warnings to be promoted to errors. For instance, initial investigation of an error might show that the error occurs when one of the arguments to a function has value NaN. The error might be accompanied by a warning message that the NaN has been introduced, but because warnings are by default not reported immediately it is not clear where the NaN comes from. warn=2 means that the warning is treated as an error, and hence can be debugged using traceback, debug, and so on.
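
For example:

    ## not evaluated
    options(error=recover)   # browse the call stack on error
    options(error=NULL)      # restore default error behavior
    options(warn=2)          # promote warnings to errors
    options(warn=0)          # restore default warning behavior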

Additional useful debugging functions include browser, trace, and setBreakpoint.

    h3("Asking for help")

As already mentioned, help can be sought from the mailing lists. Address questions about R to the R mailing lists (http://www.r-project.org), and Bioconductor questions to the Bioconductor mailing list (http://bioconductor.org/help/mailing-list/). When posting, include a minimal reproducible example and the output of sessionInfo. Providing this information will speed up the help and, if there is an issue, its resolution.

    h3("Resources")
    h4("Publications and Books")
    h4("Integrated Development Environment (IDE)")

The RStudio environment provides a nice, cross-platform IDE for working in R.

    h4("Other resources")
quest(7)
endQuest()

