SparseSummarizedExperiment: SparseSummarizedExperiment Container

Combining SummarizedExperiment objects

Peter Hickey 09 December 2015

NOTE

As of 11efea7, combine,GRanges,GRanges-method has been modified and generalised to combine,GenomicRanges,GenomicRanges-method and combine,SummarizedExperiment,SummarizedExperiment-method has been simplified. These changes are not yet reflected in the following.

Motivation

I often find myself with multiple SE objects (I'm using SE as a shorthand for the SummarizedExperiment and RangedSummarizedExeriment classes), each with potentially non-distinct samples and potentially non-overlapping features/ranges. Currently, it is difficult to combine these objects; rbind() can only combine objects with the same samples but distinct features/ranges and cbind() can only combine objects with the same features/ranges but distinct samples. I think it would be useful to have a "combine" method for SE objects that handles the most general situation where each object has potentially non-distinct samples and potentially non-overlapping features/ranges.

NOTE: All RangedSummarizedExperiment objects are SummarizedExperiment objects but the converse is not true.

Methods

My initial attempt was posted to https://gist.github.com/PeteHaitch/8993b096cfa7ccd08c13/2c82a279ee7d56cec3e9dadb8024cbb9183da576 and cross-posted to the Bioconductor development mailing list (https://stat.ethz.ch/pipermail/bioc-devel/2015-October/008128.html). Following feedback from Martin Morgan (https://stat.ethz.ch/pipermail/bioc-devel/2015-October/008144.html), I revised to make the following major changes:

Provide a combine() method for each component of the SE object. This means I wrote: combine,DataFrame,DataFrame-method (for the colData and elementMetadata slots), combine,GRanges,GRanges-method and combine,GRangesList,GRangesList-method (for the rowRanges slot), and combine,SimpleList,SimpleList-method (for the assays slot). There is also a helper function .combine.NAMES() to combine the NAMES slot of a non-ranged SE.
All combine() methods are initially designed to "combine" based on the names/rownames/dimnames of the objects. This follows the existing combine,data.frame,data.frame-method and combine,matrix,matrix-method from the BiocGenerics package.

With respect to (2), in some cases I realised there were equivalent definitions that did not rely on these "names" that were faster, and other scenarios where it might be desirable not to rely on these "names". For example, in my experience, it is common for a RangedSummarizedExperiment to have NULL names, i.e. identical(names(x), NULL) is TRUE where x is a RangedSummarizedExperiment (note that names(x) calls names(rowRanges(x)) and so it is the rowRanges(x) that commonly have NULL names). A consequence of this is that we cannot combine these RangedSummarizedExperiment objects based on "names" alone. However, there is an obvious way to construct meaningful rownames for these objects by a findOverlaps()-based naming scheme; this is implemented in combine,SummarizedExperiment,SummarizedExperiment-method for RangedSummarizedExperiment objects.

Comments on all aspects of the implementation are appreciated. I have highlighted particular concerns with RFC.

suppressPackageStartupMessages(library(SummarizedExperiment))
#> Warning: multiple methods tables found for 'as.vector'

`combine,DataFrame,DataFrame-method`

This coerces the DataFrame to a data.frame, calls the combine,data.frame,data.frame-method, and then coerces back to a DataFrame.

RFC: Do these coercions incur a copy operation(s)? Can these be avoided, e.g., should I basically copy the code from combine,data.frame-method to define combine,DataFrame,DataFrame-method?

setMethod("combine", c("DataFrame", "DataFrame"),
  function(x, y, ...) {

    if (all(dim(x) == 0L) && all(dim(y) == 0L)) {
      return(x)
    } else if (all(dim(x) == 0L)) {
      return(y)
    } else if (all(dim(y) == 0L)) {
      return(x)
    }

    if (is.null(row.names(x)) || is.null(row.names(y))) {
      stop("'row.names' of 'x' and 'y' must be non-NULL")
    }
    as(combine(as.data.frame(x), as.data.frame(y)), "DataFrame")
  }
)
#> [1] "combine"

`combine,GRanges,GRanges-method`

There are two strategies available (let x and y be GRanges objects in what follows):

Rely on names(x) and names(y) to identify "shared" elements.
Ignore names(x) and names(y) and simply identify unique elements of a combined x with y.

Option 1 is more restrictive (since it requires names(x) and names(y) to be non-NULL) but matches the behaviour of other combine methods that are defined in terms of the names/rownames/dimnames of the objects.

Option 2 will work on all GRanges objects even when names(x) or names(y) are NULL since unique,GRanges-method ignores names. The names of the "shared" elements are taken from x in the returned object. This option is approximately 2x faster.

I prefer option 2 since it is more general and faster.

RFC: Which option is preferrable?

RFC: Could/should we add a ignore.mcols argument to combine,GRanges,GRanges-method like that available in c,GRanges,GRanges-method? Also, should this argument be ignore.mcols or use.mcols (like that available in granges,GRanges-method)?

setMethod("combine", c("GRanges", "GRanges"),
          function(x, y, ...) {

            if (length(y) == 0L) {
              return(x)
            } else if (length(x) == 0L) {
              return(y)
            }

            if (is.null(names(x)) || is.null(names(y))) {
              stop("'names' of 'x' and 'y' must be non-NULL")
            }

            shared_elements <- intersect(names(x), names(y))
            ok <- all.equal(x[shared_elements], y[shared_elements])
            if (!isTRUE(ok)) {
              stop("GRanges shared elements differ: ", ok)
            }

            c(x, y[setdiff(names(y), shared_elements)], ignore.mcols = FALSE)
          }
)
#> [1] "combine"

setMethod("combine", c("GRanges", "GRanges"),
          function(x, y, ...) {

            if (length(y) == 0L) {
              return(x)
            } else if (length(x) == 0L) {
              return(y)
            }

            unique(c(x, y))
          }
)
#> [1] "combine"

`combine,GRangesList,GRangesList-method`

RFC: What does it mean to combine two GRangesList objects? For example, consider two GRangesList objects x and y; what should the result of calling combine(x, y) be? Since these are List-derived objects, it seems to make sense to mendoapply() combine to each element of x and y. This works when x and y have identical element names (case A), but not when x or y contains elements not found in the other (case B).

x <- GRangesList(gr1 = GRanges("chr1", IRanges(1, 10)))
x
#> GRangesList object of length 1:
#> $gr1 
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   [1, 10]      *
#> 
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths

y <- GRangesList(gr1 = GRanges("chr1", IRanges(12, 16)))
y
#> GRangesList object of length 1:
#> $gr1 
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1  [12, 16]      *
#> 
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Case A (works)
mendoapply(combine, x, y)
#> GRangesList object of length 1:
#> $gr1 
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1  [ 1, 10]      *
#>   [2]     chr1  [12, 16]      *
#> 
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Case B (expect error)
y[["gr2"]] <- GRanges("chr1", IRanges(66, 99))
mendoapply(combine, x, y)
#> Error in validObject(.Object): invalid class "GRangesList" object: 1: 'x@elementMetadata' is not parallel to 'x'
#> invalid class "GRangesList" object: 2: 'names(x)' must be NULL or have the length of 'x'

It seems like in case B that the user might want for y$gr2 to be included in the returned value, so perhaps the following definition is suitable:

setMethod("combine", c("GRangesList", "GRangesList"),
          function(x, y, ...) {

            if (length(y) == 0L) {
              return(x)
            } else if (length(x) == 0L) {
              return(y)
            }           

            if (is.null(names(x)) || is.null(names(y))) {
              stop("'names' of 'x' and 'y' must be non-NULL")
            }

            shared_elements <- intersect(names(x), names(y))
            x[shared_elements] <- mendoapply(combine, x[shared_elements], 
                                             y[shared_elements])
            c(x, y[setdiff(names(y), shared_elements)])
          }
)
#> [1] "combine"

This will appropriately combine GRangesList objects in case B but requires that both objects have non-NULL names:

# Works
combine(x, y)
#> GRangesList object of length 2:
#> $gr1 
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1  [ 1, 10]      *
#>   [2]     chr1  [12, 16]      *
#> 
#> $gr2 
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames   ranges strand
#>   [1]     chr1 [66, 99]      *
#> 
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Expect error
combine(unname(x), unname(y))
#> Error in combine(unname(x), unname(y)): 'names' of 'x' and 'y' must be non-NULL

Unlike combine,GRanges,GRanges-method, it seems difficult to extend combine,GRangesList,GRangesList-method to work when either names(x) or names(y) is NULL.

RFC: Any suggestions on how to extend combine,GRangesList,GRangesList-method to work when either names(x) or names(y) is NULL? Is this even desirable? Without this, we won't be able to use combine() on DESeqDataSet objects, for example.

`combine,SimpleList,SimpleList-method`

This calls mendoapply() with combine() on x and y. This means it will only work if a suitable combine() method is defined for class(x) and class(y). Unlike the proposed combine,GRangesList,GRangesList-method, combine,SimpleList,SimpleList-method requires that x and y have an identical number of elements with identical names.

setMethod("combine", c("SimpleList", "SimpleList"),
          function(x, y, ...) {

            if (length(y) == 0L) {
              return(x)
            } else if (length(x) == 0L) {
              return(y)
            }

            if (length(names(x)) != length(names(y))) {
              stop("'", class(x), "' objects have different number of elements:",
                   "\n\t", paste(names(x), collapse = " "), "\n\t",
                   paste(names(y), collapse = " "))
            }

            if (!all(names(x) == names(y))) {
              stop("'", class(x), "' objects have different element names:\n\t",
                   paste(names(x), collapse = " "), "\n\t",
                   paste(names(y), collapse = " "))
            }

            mendoapply(combine, x, y)
          }
)
#> [1] "combine"

RFC: Should this be combine,List,List-method or might this make it too general?

`combine,SummarizedExperiment,SummarizedExperiment-method`

This applies the combine to each slot of the SummarizedExperiment objects. There is a helper function, .combine.NAMES() to combine the NAMES slot of non-ranged SummarizedExperiment objects; this isn't a method since the NAMES slot is NULL or a character() object rather than a formal S4 class.

.combine.NAMES <- function(x, y, ...) {
  shared_names <- intersect(x, y)
  c(x, setdiff(y, shared_names))
}

There are two strategies available (let x and y be SummarizedExperiment objects in what follows):

Rely on dimnames(x) and dimnames(y) to identify "shared" features/ranges and samples.
If x and y are RangedSummarizedExperiment-based objects, then ignore rownames(x) and rownames(y) and simply identify unique ranges of a combined x with y. This still requires that both x and y have colnames. Currently, this will not work for a RangedSummarizedExperiment-based object with GRangesList-based rowRanges (this would require a version of combine,GRangesList,GRangesList-method that doesn't rely on names).

Each option has limitations.

Option 1 is restrictive since it requires dimnames(x) and dimnames(y) to be non-NULL, but matches the behaviour of other combine methods that are defined in terms of the names/rownames/dimnames of the objects.

Option 2 will work on wider class of RangedSummarizedExperiments objects since the names of such objects will commonly be NULL (it will not work for RangedSummarizedExperiment objects with GRangesList-derived rowRanges slots since the combine method only ). This option is generally slower. The names of the "shared" elements are taken from x in the returned object. Option 2 will effectively use Option 1 for non-ranged SummarizedExperiment-based objects.

I prefer option 2 since it is more general for a common use case.

A third option is a hybrid: use the first option if names(x) and names(y) are both non-NULL and the second option otherwise. This could lead to surprising behaviour, however, if the user expects the option 2 to be used (e.g., they are unaware that their objects have non-NULL names).

RFC: Which option is preferrable?

setMethod("combine", c("SummarizedExperiment", "SummarizedExperiment"),
          function(x, y, ...) {

            if (any(dim(y) == 0L)) {
              return(x)
            } else if (any(dim(x) == 0L)) {
              return(y)
            }

            # Give a helpful error message if the user tries to combine a
            # RangedSummarizedExperiment object to an non-ranged
            # SummarizedExperiment.
            # NOTE: Can't simply check if object is SumamrizedExperiment0
            #       object since all RangedSummarizedExperiments are
            #       SummarizedExperiment objects but not all
            #       SummarizedExperiment objects are
            #       RangedSummarizedExperiment objects.
            if ((is(x, "RangedSummarizedExperiment") &&
                 !is(y, "RangedSummarizedExperiment")) ||
                (!is(x, "RangedSummarizedExperiment") &&
                 is(y, "RangedSummarizedExperiment"))) {
              stop("Cannot combine '", class(x), "' and '", class(y),
                   "' objects because only one of these has 'rowRanges'.")
            }

            # Check dimnames are set
            x_dim <- dimnames(x)
            y_dim <- dimnames(y)
            if (is.null(x_dim) || is.null(y_dim) ||
                any(sapply(x_dim, is.null)) || any(sapply(y_dim, is.null))) {
              stop("'", class(x), "' must have dimnames for 'combine'")
            }

            # Combine slots
            if (is(x, "RangedSummarizedExperiment")) {
              # NOTE: Can't handle internal duplicate ranges in x or y.
              if (any(duplicated(rowRanges(x))) || 
                  any(duplicated(rowRanges(y)))) {
                stop("Cannot combine '", class(x), "' with internal ", 
                     "duplicate 'rowRanges'")
              }
              rowRanges <- combine(rowRanges(x), rowRanges(y))
            } else {
              if (anyDuplicated(names(x)) || anyDuplicated(names(y))) {
                stop("Cannot combine '", class(x), "' with internal ", 
                     "duplicate 'names'")
              }
              NAMES <- .combine.NAMES(names(x), names(y))
            }
            colData <- combine(colData(x), colData(y))
            assays <- Assays(combine(assays(x, withDimnames = TRUE),
                                     assays(y, withDimnames = TRUE)))
            if (is(x, "RangedSummarizedExperiment")) {
              # NOTE: elementMetadata slot of a RangedSummarizedExperiment
              #       object must be a zero-column DataFrame with nrow equal to 
              #       the length of the RangedSummarizedExperiment objects at 
              #       all times.
              elementMetadata <- DataFrame()
              elementMetadata@nrows <- length(rowRanges)
            } else {
              # NOTE: Using mcols() rather than slot(x, "elementMetadata") so
              #       that the combined objects have rownames on which to combine.
              elementMetadata <- combine(mcols(x, use.names = TRUE),
                                         mcols(y, use.names = TRUE))
              # NOTE: Once combined, drop rownames of elementMetadata since these 
              #       are given by rownames(z) when z is a SummarizedExperiment 
              #       object.
              rownames(elementMetadata) <- NULL
            }
            metadata <- c(metadata(x), metadata(y))

            # Construct the combined SE
            if (is(x, "RangedSummarizedExperiment")) {
              # NOTE: No need to update NAMES slot since it must be NULL in a
              #       valid RangedSummarizedExperiment object.
              BiocGenerics:::replaceSlots(x,
                                          rowRanges = rowRanges,
                                          colData = colData,
                                          assays = assays,
                                          elementMetadata = elementMetadata,
                                          metadata = metadata)
            } else {
              BiocGenerics:::replaceSlots(x,
                                          colData = colData,
                                          assays = assays,
                                          NAMES = NAMES,
                                          elementMetadata = elementMetadata,
                                          metadata = metadata)
            }
          }
)
#> [1] "combine"

setMethod("combine", c("SummarizedExperiment", "SummarizedExperiment"),
          function(x, y, ...) {

            if (any(dim(y) == 0L)) {
              return(x)
            } else if (any(dim(x) == 0L)) {
              return(y)
            }

            # Give a helpful error message if the user tries to combine a
            # RangedSummarizedExperiment object to an non-ranged
            # SummarizedExperiment.
            # NOTE: Can't simply check if object is SumamrizedExperiment0
            #       object since all RangedSummarizedExperiments are
            #       SummarizedExperiment objects but not all
            #       SummarizedExperiment objects are
            #       RangedSummarizedExperiment objects.
            if ((is(x, "RangedSummarizedExperiment") &&
                 !is(y, "RangedSummarizedExperiment")) ||
                (!is(x, "RangedSummarizedExperiment") &&
                 is(y, "RangedSummarizedExperiment"))) {
              stop("Cannot combine '", class(x), "' and '", class(y),
                   "' objects because only one of these has 'rowRanges'")
            }

            # Check colnames are set
            x_cn <- colnames(x)
            y_cn <- colnames(y)
            if (is.null(x_cn) || is.null(y_cn)) {
              stop("Cannot combine '", class(x), "' with NULL 'colnames'")
            }

            # Combine slots
            if (is(x, "RangedSummarizedExperiment")) {
              # NOTE: This method doesn't work if rowRanges(x) or rowRanges(y) 
              #       are GRangesList-derived objects since the 
              #       combine,GRangesList,GRangesList-method currently requires 
              #       that these have non-NULL names.
              if (is(rowRanges(x), "GRangesList") || 
                  is(rowRanges(y), "GRangesList")) {
                stop("Cannot combine '", class(x), "' objects with ", 
                     "'GRangesList'-based 'rowRanges'")
              }
              # NOTE: Can't handle internal duplicate ranges in x or y.
              if (any(duplicated(rowRanges(x))) || 
                  any(duplicated(rowRanges(y)))) {
                stop("Cannot combine '", class(x), "' with internal ", 
                     "duplicate 'rowRanges'")
              }
              # NOTE: mcols(x) and mcols(y) 
              #       [i.e. mcols(rowRanges(x)) and mcols(rowRanges(y))] 
              #       are separately combined. This could be simplified 
              #       if combine,GRanges,GRanges-method had an 
              #       ignore.mcols argument.
              x_em <- mcols(x, use.names = TRUE)
              mcols(x) <- NULL
              y_em <- mcols(y, use.names = TRUE)
              mcols(y) <- NULL

              # NOTE: names(rowRanges) are set to NULL
              rowRanges <- combine(rowRanges(x), rowRanges(y))
              # Set rownames based on findOverlaps() of rowRanges()
              # NOTE: We want ranges that are identical, hence 
              #       type = "equal". Also, we want to identify identical 
              #       zero-width ranges, hence minoverlap = 0
              x_ol <- findOverlaps(rowRanges(x), rowRanges, 
                                   type = "equal", minoverlap = 0L)
              rownames(x) <- subjectHits(x_ol)
              rownames(x_em) <- subjectHits(x_ol)
              y_ol <- findOverlaps(rowRanges(y), rowRanges, 
                                   type = "equal", minoverlap = 0L)
              rownames(y) <- subjectHits(y_ol)
              rownames(y_em) <- subjectHits(y_ol)
              mcols(rowRanges) <- combine(x_em, y_em)
            } else {
              if (anyDuplicated(names(x)) || anyDuplicated(names(y))) {
                stop("Cannot combine '", class(x), "' with internal ", 
                     "duplicate 'names'")
              }
              NAMES <- .combine.NAMES(names(x), names(y))
            }
            colData <- combine(colData(x), colData(y))
            assays <- Assays(combine(assays(x, withDimnames = TRUE),
                                     assays(y, withDimnames = TRUE)))
            if (is(x, "RangedSummarizedExperiment")) {
              # NOTE: elementMetadata slot of a RangedSummarizedExperiment
              #       object must be a zero-column DataFrame with nrow equal to 
              #       the length of the RangedSummarizedExperiment objects at 
              #       all times.
              elementMetadata <- DataFrame()
              elementMetadata@nrows <- length(rowRanges)
            } else {
              # NOTE: Using mcols() rather than slot(x, "elementMetadata") so
              #       that the combined objects have rownames on which to combine.
              elementMetadata <- combine(mcols(x, use.names = TRUE),
                                         mcols(y, use.names = TRUE))
              # NOTE: Once combined, drop rownames of elementMetadata since these 
              #       are given by rownames(z) when z is a SummarizedExperiment 
              #       object.
              rownames(elementMetadata) <- NULL
            }
            metadata <- c(metadata(x), metadata(y))

            # Construct the combined SE
            if (is(x, "RangedSummarizedExperiment")) {
              # NOTE: No need to update NAMES slot since it must be NULL in a
              #       valid RangedSummarizedExperiment object.
              BiocGenerics:::replaceSlots(x,
                                          rowRanges = rowRanges,
                                          colData = colData,
                                          assays = assays,
                                          elementMetadata = elementMetadata,
                                          metadata = metadata)
            } else {
              BiocGenerics:::replaceSlots(x,
                                          colData = colData,
                                          assays = assays,
                                          NAMES = NAMES,
                                          elementMetadata = elementMetadata,
                                          metadata = metadata)
            }
          }
)
#> [1] "combine"

RFC: Should the check of whether the user is trying to combine a non-ranged SummarizedExperiment-based object with a RangedSummarizedExeriment-based object be the more restrictive if(class(x) != class(y)) stop() (this is what combine,eSet,eSet-method uses)?

RFC: Are non-matrix objects allowed as elements in an Assays object? If so, will need a combine() method for each of these allowed classes.

RFC: metadata from all objects are combined into a list with no name checking (following cbind,SummarizedExperiment-method and rbind,SummarizedExperiment-method). Does this seem reasonable?

RFC: If nrow(x) (resp. nrow(y)) is zero (i.e. no features/ranges in that object) or ncol(x) (resp. ncol(y)) is zero (i.e. no samples in that object) then y (resp. x) is returned (or x if the condition occurs for both samples); does this seem reasonable? In particular, the zero-column object could have features/ranges that the user wants to add to the other object (or the zero-row object could have samples that the user wants to add to the other object). How to handle these scenarios?

RFC: I thought that colnames were a requirement of a SummarizedExperiment object, but this is not true, at least according to the validity method (the SummarizedExperiment() constructor does enforce non-NULL colnames). Is this a bug in the validity method?

example("RangedSummarizedExperiment", echo = FALSE)
rse
#> class: RangedSummarizedExperiment 
#> dim: 200 6 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(1): feature_id
#> colnames(6): A B ... E F
#> colData names(1): Treatment
validObject(rse)
#> [1] TRUE

colnames(rse) <- NULL
rse
#> class: RangedSummarizedExperiment 
#> dim: 200 6 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(1): feature_id
#> colnames: NULL
#> colData names(1): Treatment
validObject(rse)
#> [1] TRUE

RFC: Option 2 uses findOverlaps() with type = "equal" (because we want identical ranges) and minoverlap = 0 (because otherwise zero-width identical ranges are not found). Are the missed ranges when minoverlap = 1 (the default) intended behaviour? For example:

x <- GRanges("chr1", IRanges(10, width = 0))
x
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1   [10, 9]      *
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# No hits
findOverlaps(x, x, type = "equal")
#> Hits object with 0 hits and 0 metadata columns:
#>    queryHits subjectHits
#>    <integer>   <integer>
#>   -------
#>   queryLength: 1
#>   subjectLength: 1

# 1 hit
findOverlaps(x, x, type = "equal", minoverlap = 0L)
#> Hits object with 1 hit and 0 metadata columns:
#>       queryHits subjectHits
#>       <integer>   <integer>
#>   [1]         1           1
#>   -------
#>   queryLength: 1
#>   subjectLength: 1

RFC: Both Option 1 and 2 will return an error if either x or y contains "internal" duplicate features/ranges. I don't know how else you would handle this situation; suppose x contains an "internal" duplicate, it would first require an "internal combine" of x followed by a "combine" with y (see example below).

Discuss the signature of the `combine` generic

My understanding is that the combine() generic defined in BiocGenerics does not allow for additional arguments (at least, additional arguments that are not objects to be "combined"). It might be useful to be able to pass ignore.mcols to combine,GRanges,GRanges-method, combine,GRangesList,GRangesList, and combine,SummarizedExperiment,SummarizedExperiment-method. I believe this would require the combine() generic to be redefined.

Examples

I use data from the SummarizedExperiment::SummarizedExeriment and SummarizedExperimentRangedSummarizedExperiment man pages.

# Need to set seed because example("RangedSummarizedExperiment") and 
# example("SummarizedExperiment") both call runif().
set.seed(667)

`SummarizedExperiment`

# Get some example data
example("SummarizedExperiment", echo = FALSE)

# NOTE: SummarizedExperiment objects must have non-NULL NAMES slot in order 
#       to combine
names(se) <- as.character(1:200)

# Create data to combine
A <- se[1:4, "A"]
B <- se[3:6, "B"]
C <- se[5:8, "C"]
BC <- se[c(4:7, 9), c("B", "C")]
D <- se[7:10, "D"]
E <- se[9:12, "E"]
F <- se[11:14, "F"]

# Sanity check: identical to cbind when given compatible arguments
all.equal(cbind(se[, 1], se[, 2], se[, 3]),
          combine(se[, 1], se[, 2], se[, 3]))
#> [1] TRUE

# Combining objects with non-overlapping features and non-distinct samples
z <- combine(A, B, C, BC)
z
#> class: RangedSummarizedExperiment 
#> dim: 9 3 
#> metadata(0):
#> assays(1): counts
#> rownames(9): 1 2 ... 8 9
#> rowData names(1): feature_id
#> colnames(3): A B C
#> colData names(1): Treatment
assay(z)
#>          A        B        C
#> 1 8.274185       NA       NA
#> 2 9.165680       NA       NA
#> 3 9.654497 9.698581       NA
#> 4 9.554457 9.773346 9.774785
#> 5       NA 7.065869 9.516848
#> 6       NA 9.552036 9.338527
#> 7       NA 5.872714 9.788340
#> 8       NA       NA 9.424306
#> 9       NA 9.901994 9.297545

# Incomplete elementMetadata are elegantly handled
a <- A
mcols(a) <- DataFrame(J = seq_len(nrow(a)))
z <- combine(a, B)
assay(z)
#>          A        B
#> 1 8.274185       NA
#> 2 9.165680       NA
#> 3 9.654497 9.698581
#> 4 9.554457 9.773346
#> 5       NA 7.065869
#> 6       NA 9.552036
mcols(z)
#> DataFrame with 6 rows and 2 columns
#>           J  feature_id
#>   <integer> <character>
#> 1         1          NA
#> 2         2          NA
#> 3         3       ID003
#> 4         4       ID004
#> 5        NA       ID005
#> 6        NA       ID006

# "Internal duplicates" are not allowed
x <- se[1:2, 1]
# Create an "internal duplicate" feature name
names(x)[2] <- names(x)[1]
y <- se[10, 2]
# Expect error
combine(x, y)
#> Error in `rownames<-`(`*tmp*`, value = c("1", "1")): duplicate rownames not allowed

`RangedSummarizedExperiment`

# Get some example data
example("RangedSummarizedExperiment", echo = FALSE)

# Create 
A <- rse[1:4, "A"]
B <- rse[3:6, "B"]
C <- rse[5:8, "C"]
BC <- rse[c(4:7, 9), c("B", "C")]
D <- rse[7:10, "D"]
E <- rse[9:12, "E"]
F <- rse[11:14, "F"]

# Sanity check: identical to cbind when given compatible arguments
all.equal(cbind(rse[, 1], rse[, 2], rse[, 3]),
          combine(rse[, 1], rse[, 2], rse[, 3]))
#> [1] TRUE

# Combining objects with non-overlapping features and non-distinct samples
z <- combine(A, B, C, BC)
z
#> class: RangedSummarizedExperiment 
#> dim: 9 3 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(1): feature_id
#> colnames(3): A B C
#> colData names(1): Treatment
assay(z)
#>              A        B        C
#>  [1,] 8.257307       NA       NA
#>  [2,] 6.378278       NA       NA
#>  [3,] 9.686589 9.883666       NA
#>  [4,] 9.770541 8.775756 9.585409
#>  [5,]       NA 6.651706 8.814927
#>  [6,]       NA 8.819111 8.129262
#>  [7,]       NA 6.344939 8.816540
#>  [8,]       NA       NA 9.321034
#>  [9,]       NA 9.425177 9.486555

# Incomplete elementMetadata are elegantly handled
a <- A
colnames(a) <- "a"
mcols(a) <- DataFrame(J = seq_len(nrow(a)))
z <- combine(a, B)
assay(z)
#>             a        B
#> [1,] 8.257307       NA
#> [2,] 6.378278       NA
#> [3,] 9.686589 9.883666
#> [4,] 9.770541 8.775756
#> [5,]       NA 6.651706
#> [6,]       NA 8.819111
mcols(z)
#> DataFrame with 6 rows and 2 columns
#>           J  feature_id
#>   <integer> <character>
#> 1         1          NA
#> 2         2          NA
#> 3         3       ID003
#> 4         4       ID004
#> 5        NA       ID005
#> 6        NA       ID006

# "Internal duplicates"" are not allowed
x <- rse[1:2, 1]
# Create an "internal duplicate" ranges within x
rowRanges(x)[2] <- rowRanges(x)[1]
y <- rse[10, 2]
# Expect error
combine(x, y)
#> Error in combine(x, y): Cannot combine 'RangedSummarizedExperiment' with internal duplicate 'rowRanges'

# Some examples of what happens when rownames are available on only on x or y
A <- rse[1:4, "A"]
B <- rse[3:6, "B"]
z <- combine(A, B)
rownames(z)
#> NULL
assay(z)
#>             A        B
#> [1,] 8.257307       NA
#> [2,] 6.378278       NA
#> [3,] 9.686589 9.883666
#> [4,] 9.770541 8.775756
#> [5,]       NA 6.651706
#> [6,]       NA 8.819111

# "Shared" ranges have names taken from 'x'
names(A) <- letters[seq_along(A)]
names(B) <- LETTERS[seq_along(B)]
zz <- combine(A, B)
rownames(zz)
#> [1] "a" "b" "c" "d" "C" "D"
assay(zz)
#>          A        B
#> a 8.257307       NA
#> b 6.378278       NA
#> c 9.686589 9.883666
#> d 9.770541 8.775756
#> C       NA 6.651706
#> D       NA 8.819111

# elementMetadata are still combined even though rownames differ for "shared"
# ranges
mcols(A) <- DataFrame(J = seq_along(A))
zzz <- combine(A, B)
rownames(zzz)
#> [1] "a" "b" "c" "d" "C" "D"
assay(zzz)
#>          A        B
#> a 8.257307       NA
#> b 6.378278       NA
#> c 9.686589 9.883666
#> d 9.770541 8.775756
#> C       NA 6.651706
#> D       NA 8.819111

# Can end up with an object with "incomplete" rownames
rownames(A) <- NULL
zzzz <- combine(A, B)
rownames(zzzz)
#> [1] ""  ""  ""  ""  "C" "D"
assay(zzzz)
#>          A        B
#>   8.257307       NA
#>   6.378278       NA
#>   9.686589 9.883666
#>   9.770541 8.775756
#> C       NA 6.651706
#> D       NA 8.819111

Summary

The combine() method for SE objects addresses the aim of being able to combine multiple SE objects when they have potentially different features/ranges and potentially different samples. Furthermore, I have also defined combine,DataFrame,DataFrame-method, combine,GRanges,GRanges-method, combine,GRangesList,GRangesList-method, and combine,SimpleList,SimpleList-method.

The chief limitation is an incomplete combine,GRangesList,GRangesList-method; we need to be able to appropriately "combine" GRangesList objects when names are NULL. Without this, the combine,SummarizedExperiment,SummarizedExperiment-method is also incomplete.

It is perhaps worth noting that identical(combine(x, y), combine(y, x)) is generally FALSE.

Complete support of RangedSummarizedExperiment objects where the rowRanges are GRangesList-derived objects, e.g., DESeqDataSet objects.
Unit tests
Documentation

Session info

devtools::session_info()
#> Session info --------------------------------------------------------------
#>  setting  value                                             
#>  version  R Under development (unstable) (2015-11-28 r69714)
#>  system   x86_64, darwin13.4.0                              
#>  ui       X11                                               
#>  language (EN)                                              
#>  collate  en_AU.UTF-8                                       
#>  tz       Australia/Melbourne                               
#>  date     2015-12-09
#> Packages ------------------------------------------------------------------
#>  package              * version date      
#>  Biobase              * 2.31.1  2015-12-06
#>  BiocGenerics         * 0.17.2  2015-12-06
#>  devtools               1.9.1   2015-09-11
#>  digest                 0.6.8   2014-12-31
#>  evaluate               0.8     2015-09-18
#>  formatR                1.2.1   2015-09-18
#>  GenomeInfoDb         * 1.7.3   2015-11-03
#>  GenomicRanges        * 1.23.4  2015-11-25
#>  htmltools              0.2.6   2014-09-08
#>  IRanges              * 2.5.10  2015-12-07
#>  knitr                  1.11    2015-08-14
#>  magrittr               1.5     2014-11-22
#>  memoise                0.2.1   2014-04-22
#>  rmarkdown              0.8.1   2015-10-10
#>  S4Vectors            * 0.9.12  2015-12-07
#>  stringi                1.0-1   2015-10-22
#>  stringr                1.0.0   2015-04-30
#>  SummarizedExperiment * 1.1.6   2015-12-09
#>  XVector                0.11.1  2015-11-19
#>  yaml                   2.1.13  2014-06-12
#>  zlibbioc               1.17.0  2015-10-14
#>  source                                                   
#>  Bioconductor                                             
#>  Bioconductor                                             
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  Bioconductor                                             
#>  Bioconductor                                             
#>  CRAN (R 3.3.0)                                           
#>  Github (Bioconductor-mirror/IRanges@c935622)             
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  Github (Bioconductor-mirror/S4Vectors@800a65a)           
#>  CRAN (R 3.3.0)                                           
#>  CRAN (R 3.3.0)                                           
#>  Github (Bioconductor-mirror/SummarizedExperiment@c3de85c)
#>  Bioconductor                                             
#>  CRAN (R 3.3.0)                                           
#>  Bioconductor

PeteHaitch/SparseSummarizedExperiment documentation built on May 8, 2019, 1:31 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

PeteHaitch/SparseSummarizedExperiment
SparseSummarizedExperiment Container

ideas/combining-SummarizedExperiment-objects.md
In PeteHaitch/SparseSummarizedExperiment: SparseSummarizedExperiment Container

Combining SummarizedExperiment objects

NOTE

Motivation

Methods

Setup

`combine,DataFrame,DataFrame-method`

`combine,GRanges,GRanges-method`

`combine,GRangesList,GRangesList-method`

`combine,SimpleList,SimpleList-method`

`combine,SummarizedExperiment,SummarizedExperiment-method`

Discuss the signature of the `combine` generic

Examples

`SummarizedExperiment`

`RangedSummarizedExperiment`

Summary

Futher work

Session info

R Package Documentation

Browse R Packages

We want your feedback!

PeteHaitch/SparseSummarizedExperiment SparseSummarizedExperiment Container

ideas/combining-SummarizedExperiment-objects.md In PeteHaitch/SparseSummarizedExperiment: SparseSummarizedExperiment Container

Combining SummarizedExperiment objects

NOTE

Motivation

Methods

Setup

combine,DataFrame,DataFrame-method

combine,GRanges,GRanges-method

combine,GRangesList,GRangesList-method

combine,SimpleList,SimpleList-method

combine,SummarizedExperiment,SummarizedExperiment-method

Discuss the signature of the combine generic

Examples

SummarizedExperiment

RangedSummarizedExperiment

Summary

Futher work

Session info

R Package Documentation

Browse R Packages

We want your feedback!

PeteHaitch/SparseSummarizedExperiment
SparseSummarizedExperiment Container

ideas/combining-SummarizedExperiment-objects.md
In PeteHaitch/SparseSummarizedExperiment: SparseSummarizedExperiment Container

`combine,DataFrame,DataFrame-method`

`combine,GRanges,GRanges-method`

`combine,GRangesList,GRangesList-method`

`combine,SimpleList,SimpleList-method`

`combine,SummarizedExperiment,SummarizedExperiment-method`

Discuss the signature of the `combine` generic

`SummarizedExperiment`

`RangedSummarizedExperiment`