knitr::opts_chunk$set(crop = NULL)

Overview

The goal of the r Githubpkg("kevinrue/unisets") package is to provide a collection of S4 classes to store relationships between elements and sets, with a particular emphasis on gene sets.

Getting started

The Sets class {#basesets-class}

This is a basic example which shows you how to create a Sets object, to store simple associations between elements and sets, along with optional metadata associated with each relation:

library(unisets)
sets_list <- list(
    geneset1 = c("A", "B"),
    geneset2 = c("B", "C", "D")
)
relations_table <- DataFrame(
    element = unlist(sets_list),
    set     = rep(names(sets_list), lengths(sets_list)),
    extra1  = rep(c("ABC", "DEF"), c(3L, 2L)),
    extra2  = seq(0, 1, length.out = 5L)
)
base_sets <- Sets(relations_table)
base_sets

Metadata for each element and set can be provided as separate IdVector objects. The IdVector class stores a vector of identifiers as a character vector, and associated metadata as a DataFrame.

element_data <- IdVector(ids = c("A", "B", "C", "D"))
mcols(element_data) <- DataFrame(
    GeneStat1     = c( 1,   2,   3,   4 ),
    GeneInfo1     = c("a", "b", "c", "d")
)
set_data <- IdVector(ids = c("geneset1", "geneset2"))
mcols(set_data) <- DataFrame(
    SetStat1     = c( 100,   200 ),
    SetInfo1     = c("abc", "def")
)
base_sets <- Sets(relations_table, element_data, set_data)
base_sets

The elementInfo and setInfo slots each store an IdVector that describes the identifier and metadata associated with each unique element and set, respectively. Those metadata can be directly accessed and updated using the corresponding accessor methods.

elementInfo(base_sets)
setInfo(base_sets)

Note that relations between elements and sets are internally stored as an r Biocpkg("S4Vectors") Hits object. This container efficiently represents edges between a set of left nodes and a set of right nodes, with optional metadata that describe each edge.

To do so, the DataFrame provided as the relations argument of the Sets constructor is divided in two pieces of information:

relations(base_sets)
mcols(relations(base_sets))

Conveniently, the as method can be used to format relations and associated metadata as a DataFrame substituting hits for their corresponding element and set identifiers. Metadata for relations, elements, and sets are returned as DataFrame nested in the "relationData", "elementInfo", and "setInfo" columns.

as(base_sets, "DataFrame")

Similarly, as.data.frame can be used to obtain a flattened data.frame, with columns "element", "set", and any column in the relation metadata columns.

as.data.frame(base_sets)

The FuzzySets class {#fuzzysets-class}

Classes derived from Hits may add additional constraints on the relations to define special types of relationships between elements and sets.

For instance, the FuzzyHits class is a direct extension of the Hits class where the metadata accompanying each relation must include at least a column called "membership" that holds the "membership function", a numeric value in the interval [0,1] that provides a measure of partial membership between elements and sets.

Simultaneously, the FuzzySets class is a direct extension of the Sets class where the relations slot must contain FuzzyHits. As such, FuzzySets can be constructed exactly like Sets, with the only additional constraint that the relations table must contains a "membership" column with numeric values in the interval [0,1].

relations_table$membership <- round(runif(nrow(relations_table)), 2)
fuzzy_sets <- FuzzySets(relations_table, element_data, set_data)
fuzzy_sets

The membership function associated with each relation can be directly obtained and modified using the corresponding accessor.

membership(fuzzy_sets)

Identically to Sets, the relations accessor returns fuzzy relations as Hits, while the as method may be used to format the information as a DataFrame, both of which include the "membership" column, as metadata column and nested under "relationData", respectively.

relations(fuzzy_sets)
as(fuzzy_sets, "DataFrame")

The GOSets class

The GOSets class is another direct extension of the Sets class where the relations slot must contain GOHits. Similary to FuzzyHits, the GOHits class extends the Hits class, but with the distinct contraint that each relation metadata must include at least 2 columns called "evidence" and "ontology" holding the Gene Ontology evidence code and ontology code, respectively.

Examples of GOSets usage are described in a dedicated vignette.

Subsetting

The subset method can be applied to Sets objects and derivatives (e.g. FuzzySets, GOSets), using a logical expression that may refer to the "element" and "set" columns as well as any metadata associated with the relations, indicating rows to keep.

subset(base_sets, set == "geneset1" & element %in% c("B") & extra1 == "ABC")

Similarly, the subset method can be also applied to objects derived from Sets, such as FuzzySets, in which case the logical expression may also refer to the additional "membership" metadata that is guaranted by the class validity method.

subset(fuzzy_sets, set == "geneset2" & membership > 0.3)

Note that the default behaviour of the subset method is to drop elements and sets that are not represented in the relations from the elementInfo and setInfo slots, respectively. This behaviour can be controlled using the drop argument, which accepts a single logical value.

out1 <- subset(base_sets, set == "geneset1", drop=TRUE)
ids(setInfo(out1))
out2 <- subset(base_sets, set == "geneset1", drop=FALSE)
ids(setInfo(out2))

Converting to other formats {#convert-to}

List {#convert-to-list}

It is possible to extract the gene sets as a list, for use with functions such as lapply.

as(fuzzy_sets, "list")

Matrix {#convert-to-matrix}

It is also possible to visualize membership between gene and gene sets as a matrix.

Notably, Sets objects produce a logical matrix of binary membership that indicates whether each element is associated at least once with each set:

base_matrix <- as(base_sets, "matrix")
base_matrix

In contrast, FuzzySets objects produce a double matrix displaying the membership function for each relation. Relations that are not described in the FuzzySets are filled with NA, to contrast with relations explictly associated with a membership function of 0.

membership(fuzzy_sets)[1] <- 0
fuzzy_matrix <- as(fuzzy_sets, "matrix")
fuzzy_matrix

Converting from other formats {#convert-from}

Matrix {#convert-from-matrix}

It is possible to convert incidence matrices into objects derived from the Sets class.

Notably, the Sets class is suitable for logical matrices indicating binary membership.

as(base_matrix, "Sets")

Similarly, the FuzzySets class is suitable for double matrices indicating the membership function for each relation. Importantly, relations described as NA are not imported into the FuzzySets object (consistently with the as.matrix method described above). In contrast, relations with a membership function of 0 are imported and described as such.

fuzzy_matrix[1, 1] <- 0
as(fuzzy_matrix, "FuzzySets")

Additional information

Dimensions: count of relations, elements, and sets

The count of relations between elements and sets can be obtained using the length method.

length(base_sets)

The count of unique elements and sets can be obtained using the nElements and nSets methods.

nElements(base_sets)
nSets(base_sets)

The size of each gene set can be obtained using the setLengths method.

setLengths(fuzzy_sets)

Conversely, the number of sets associated with each gene is returned by the elementLengths function.

elementLengths(fuzzy_sets)

Names of elements and sets

The identifiers of elements and sets can be inspected and renamed using ids accessor on the IdVector object returned by each of the elementInfo or setInfo accessors.

ids(elementInfo(base_sets)) <- paste0("Gene", seq_len(nElements(base_sets)))
ids(elementInfo(base_sets))
ids(setInfo(base_sets)) <- paste0("Geneset", seq_len(nSets(base_sets)))
ids(setInfo(base_sets))

Importing and exporting sets

Gene Matrix Transpose (GMT) Format

A common representation of gene sets is the GMT format, which is a non-rectangular format where each line is a set. The first column is the name of the set, the second column is a description of the source of the set (such as a URL), and the third column onwards are the elements of the set, such that each set may have a variable number of elements.

Importing from and exporting to GMT files is performed using the generic import and export methods, which recognize the ".gmt" file extenson as a trigger to import from and export to the GMT file format. Alternatively, the import.gmt and import.gmt functions may be used to explicitly export to the GMT file format.

Any object that inherits from the Sets class may be exported to the GMT file format. However, any information that is not supported by the GMT file format will be lost during the export. Reciprocally, the import function produces a Sets object, which adequately represents all the information present in the GMT file format.

gmt_file <- system.file(package="unisets", "extdata", "example.gmt")
base_sets_from_gmt <- import(gmt_file)
base_sets_from_gmt

The additional metadata corresponding to the source (second column of the GMT) per set is also added as metadata corresponding to the sets, accessible via setInfo, which returns an IdVector class object.

setInfo(base_sets_from_gmt)

To access the internal DataFrame representation, the accessor mcols can additionally be applied.

mcols(setInfo(base_sets_from_gmt))
## elementMetadata(setInfo(base_sets_from_gmt)) # equivalent to above

To export Sets objects in GMT file format, the export generic may be used if the file extension is ".gmt". Alternatively, data in GMT format may be exported to files with different extensions (e.g., ".txt") using the export.gmt function. Note that if "source" heading is not found in the set metadata (i.e., mcols(setInfo(x))), this value will be filled with "unisets" in the exported file.

tmp_file <- tempfile(fileext=".gmt")
export(base_sets_from_gmt, tmp_file)

Additional information

Bug reports can be posted as issues in the r Githubpkg("kevinrue/unisets") GitHub repository. The GitHub repository is the primary source for development versions of the package, where new functionality is added over time. The authors appreciate well-considered suggestions for improvements or new features, or even better, pull requests.

If you use r Githubpkg("kevinrue/unisets") for your analysis, please cite it as shown below:

citation("unisets")

Session info {.unnumbered}

sessionInfo()


kevinrue/unisets documentation built on May 15, 2020, 10:48 p.m.