Create a gene set collection


Builds an object containing the collection of all gene sets to be used by the setRankAnalysis function.


buildSetCollection(..., referenceSet = NULL, maxSetSize = 500)



referenceSet: Optional, but very strongly recommended. A vector of gene IDs specifying the background gene set against which to test for over-representation of gene sets. The default is to use all genes present in the supplied gene annotation tables. However, many experiments are intrinsically biased towards certain pathways, e.g. because they only contain samples from a specific tissue. Supplying a suitable reference set will remove this bias. See the vignette for more details.


maxSetSize: The maximum number of genes in a gene set. Gene sets with more genes are not considered during the analysis.


...: One or more data frame objects containing the annotation of genes with pathway identifiers and descriptions. The idea is to provide one data frame per pathway database. Several gene set databases are provided in the organism-specific GeneSets packages. Alternatively, you can specify your own annotation tables. See the Details section for more information.
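As a rough illustration of the reference set argument, the sketch below derives a background from the genes actually detected in an experiment. The expressionTable data frame and its column names are hypothetical and not part of SetRank; only the idea of passing a character vector of gene IDs comes from the text above.

```r
# Hypothetical table of measured genes (IDs stored as integers here on purpose).
expressionTable <- data.frame(
  geneID   = c(7157L, 672L, 675L, 7157L, 5290L),
  detected = c(TRUE, TRUE, FALSE, TRUE, TRUE)
)

# Keep only the detected genes, drop duplicates, and convert to character,
# since gene IDs should be passed as character rather than integer values.
detectedIDs  <- expressionTable$geneID[expressionTable$detected]
referenceSet <- unique(as.character(detectedIDs))

print(referenceSet)
```

A vector built this way could then be supplied through the referenceSet argument instead of relying on the default background.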


A gene set collection which is a list object containing the following fields:

  • maxSetSize: The maximum set size applied when constructing the collection.

  • referenceSet: A vector listing all gene IDs that are part of the reference.

  • sets: A list of vectors. The list names are the pathway IDs as supplied in the termID column of the annotation frame(s). Each vector contains all gene IDs of the gene set and has three attributes set: ID, name, and db, which correspond respectively to the termID, termName, and dbName fields of the annotation frame.

  • g: The size of the reference set.

  • bigSets: A list of pathway IDs of gene sets with sizes bigger than the specified maximum set size.

  • intersection.p.cutoff: The p-value cutoff used to determine which intersections of pairs of gene sets (see Details) are significant.

  • intersections: A data frame listing all significant intersections together with the p-value.
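To make the layout of the returned list more concrete, here is a minimal sketch that builds a hand-made stand-in with a few of the fields listed above and reads them back. The mockCollection object and its contents are invented for illustration; a real collection is produced by buildSetCollection and also carries the remaining fields.

```r
# Hand-built stand-in that mimics the documented layout of the "sets" field.
geneSet <- c("1017", "1019", "1021")
attr(geneSet, "ID")   <- "DEMO:0001"
attr(geneSet, "name") <- "dummy cell cycle set"
attr(geneSet, "db")   <- "demoDB"

mockCollection <- list(
  maxSetSize   = 500,
  referenceSet = c("1017", "1019", "1021", "7157"),
  sets         = list("DEMO:0001" = geneSet)
)
# g is documented as the size of the reference set.
mockCollection$g <- length(mockCollection$referenceSet)

# Look up one gene set by its pathway ID and read its attributes.
firstSet <- mockCollection$sets[["DEMO:0001"]]
setName  <- attr(firstSet, "name")

# Number of genes in every set, named by pathway ID.
setSizes <- vapply(mockCollection$sets, length, integer(1))
```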

Execution time

This function typically takes some time to execute, as it pre-calculates all significant intersections between pairs of gene sets in the collection. An intersection between two gene sets is considered significant if it contains more elements than expected by chance, given the sizes of both sets. Computation can be sped up dramatically by running this function on multiple CPU cores. To do so, set the mc.cores option to the desired number of cores, like so: options(mc.cores = 4)

Performing this calculation beforehand allows the same setCollection object to be reused for different analyses. It is therefore recommended to separate the creation of the setCollection object and the actual analysis into different scripts. Once the collection is created, it can be stored on disk using the save command. The analysis script can then load the collection using the load command.
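The workflow described above might look roughly like the following sketch. It uses only base R: the collection object here is a small placeholder list standing in for the result of buildSetCollection, and the file name is an arbitrary choice.

```r
# Request two cores for parallel-aware code (adjust to your machine).
options(mc.cores = 2)
coresInUse <- getOption("mc.cores")

# Placeholder standing in for the result of buildSetCollection.
collection <- list(maxSetSize = 500, g = 50L)

# "Build" script: store the collection on disk.
collectionFile <- file.path(tempdir(), "collection.RData")
save(collection, file = collectionFile)

# "Analysis" script: load restores the object under its original name.
rm(collection)
load(collectionFile)
```

Because load recreates the object under the name it was saved with, the analysis script can refer to collection directly after loading.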

Creation of custom annotation tables


geneID: The gene identifier. Can be any type of identifier, but make sure that all annotation frames passed to buildSetCollection use the same type of identifier. As the packages created by the GeneSets package use Entrez Gene identifiers, it is best to use these in your own annotation frames as well. Also, make sure the identifiers are passed as character and not as integer values.


termID: Pathway identifier. Make sure each pathway identifier is unique across all pathway databases used. You can do this by prefixing the IDs with a namespace identifier like "REACTOME:".


termName: Name of the pathway. A string describing the pathway, e.g. "negative regulation of sterol metabolism".


description: Pathway description. A longer description of the pathway. This field can be a full paragraph describing what this pathway does.


dbName: A short string giving the name of the pathway database used for the annotation, e.g. "KEGG".
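A custom annotation table following the field descriptions above could be assembled as in this minimal sketch. The pathway, the gene IDs, and the "DEMO:" namespace are made up for illustration, and the geneID column name is an assumption inferred from the field descriptions, so check it against the package documentation.

```r
# One row per (pathway, gene) pair. All values are illustrative only.
annotationTable <- data.frame(
  geneID      = as.character(c(3156, 3157, 1717)),  # IDs as character, not numeric
  termID      = "DEMO:0001",                        # namespaced to stay unique across databases
  termName    = "sterol biosynthesis (demo)",
  description = "A made-up pathway used only to illustrate the table layout.",
  dbName      = "demoDB",
  stringsAsFactors = FALSE
)

str(annotationTable)
```

Setting stringsAsFactors = FALSE keeps the identifier columns as plain character vectors on older R versions, where data.frame would otherwise convert them to factors.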


Cedric Simillion


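library(SetRank)

# Reference set of 50 dummy gene IDs
referenceSet = sprintf("gene_%02d", 1:50)
# Nine gene sets of five genes each, sampled from overlapping windows of the reference
geneSets = lapply(1:9, function(i) sample(referenceSet[((i-1)*5):((i+1)*5)], 5))
# One row per (gene set, gene) pair; geneID links each gene to its set
annotationTable = data.frame(termID = sprintf("set_%02d", rep(1:9, each=5)),
        geneID = unlist(geneSets),
        termName = sprintf("dummy gene set %d", rep(1:9, each=5)),
        dbName = "dummyDB",
        description = "A dummy gene set DB for testing purposes",
        stringsAsFactors = FALSE)
collection = buildSetCollection(annotationTable, referenceSet=referenceSet)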