Builds an object containing the collection of all gene sets to be used.
Optional, but very strongly recommended. A vector of geneIDs specifying the background gene set against which to test for over-representation of gene sets. The default is to use all genes present in the supplied gene annotation tables. However, many experiments are intrinsically biased towards certain pathways, e.g. because they only contain samples from a specific tissue. Supplying a suitable reference set will remove this bias. See the vignette for more details.
The maximum number of genes in a gene set. Any gene sets with more genes will not be considered during the analysis.
One or more data frame objects containing the annotation of genes with pathway identifiers and descriptions. The idea is to provide one data frame per pathway database. Several gene set databases are provided in the organism-specific GeneSets packages. Alternatively, you can specify your own annotation tables. See the Details section for more information.
A gene set collection which is a list object containing the following fields:
maxSetSize: The maximum set size applied when constructing the collection.
referenceSet: A vector listing all gene IDs that are part of the reference.
sets: A list of vectors. The list names are the pathway IDs as supplied in the
termID column of the annotation frame(s). Each
vector contains all geneIDs of the gene set and has three attributes set:
ID, name, and db, which correspond respectively to the
termID, termName, and dbName fields of the annotation frame.
g: The size of the reference set.
bigSets: A list of pathway IDs of gene sets with sizes bigger than the specified maximum set size.
intersection.p.cutoff: The p-value cutoff used to determine which intersections of pairs of gene sets (see Details) are significant.
intersections: A data frame listing all significant intersections together with the p-value.
This function typically takes some time to execute as it pre-calculates all
significant intersections between pairs of gene sets in the collection. An
intersection between two gene sets is considered significant if it contains
more elements than expected by chance, given the sizes of both sets.
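To make the notion of "more elements than expected by chance" concrete, here is a minimal sketch that scores the overlap of two gene sets with a hypergeometric test. This is an assumption made for illustration: the text does not state which test buildSetCollection uses internally, and intersectionPValue is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper (not part of the package): p-value for observing an
# overlap at least as large as the one seen, assuming both sets are subsets
# of the reference set and membership is otherwise random.
intersectionPValue <- function(setA, setB, referenceSet) {
  g <- length(referenceSet)            # size of the reference set
  a <- length(setA)                    # size of the first gene set
  b <- length(setB)                    # size of the second gene set
  k <- length(intersect(setA, setB))   # observed overlap
  # P(overlap >= k) when b genes are drawn from g genes, a of which are in setA
  phyper(k - 1, a, g - a, b, lower.tail = FALSE)
}

referenceSet <- sprintf("gene_%02d", 1:50)
setA <- referenceSet[1:10]
setB <- referenceSet[6:15]    # shares 5 genes with setA
setC <- referenceSet[41:50]   # shares no genes with setA

pOverlap  <- intersectionPValue(setA, setB, referenceSet)
pDisjoint <- intersectionPValue(setA, setC, referenceSet)
```

With these made-up sets, the expected overlap between setA and setB is 2 genes, so an observed overlap of 5 yields a small p-value, while the disjoint pair yields a p-value of 1.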
Computation time can be reduced dramatically by running this function on
multiple CPU cores. To do so, set the
mc.cores option to the
desired number of cores with options(), for example options(mc.cores=2).
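As a small sketch of how the mc.cores option might be chosen on a given machine, the snippet below asks the base parallel package how many cores are available and leaves one free. Leaving one core free is an arbitrary choice made for illustration, not a recommendation from the package.

```r
library(parallel)

# detectCores() may return NA on some platforms, so fall back to a single core
detected <- detectCores()
if (is.na(detected)) detected <- 1L

# Keep one core free for the rest of the system (illustrative choice)
nCores <- max(1L, detected - 1L)
options(mc.cores = nCores)

getOption("mc.cores")
```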
Performing this calculation beforehand makes it possible to re-use the same
setCollection object for different analyses. It is therefore recommended to
separate the creation of the setCollection object and the actual analysis into
different scripts. Once the collection is created, it can be stored on disk
using the save command. The analysis script can then load the
collection using the load command.
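The two-script workflow can be sketched as follows. To keep the snippet runnable on its own, a small placeholder list with arbitrary values stands in for the real result of buildSetCollection, and the file name is hypothetical.

```r
## Script 1: build the collection once and store it on disk
# Placeholder with arbitrary values; in practice this would be the object
# returned by buildSetCollection()
collection <- list(maxSetSize = 500, g = 50)
collectionFile <- file.path(tempdir(), "setCollection.RData")  # hypothetical path
save(collection, file = collectionFile)

## Script 2: the analysis script restores the stored object
rm(collection)
load(collectionFile)   # re-creates the variable 'collection'
str(collection)
```

Note that load restores the object under the name it had when it was saved, so the analysis script can refer to collection directly.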
Each annotation frame should contain the following columns:
geneID: The gene identifier. Can be any type of identifier, but one
must make sure that all annotation frames passed to
buildSetCollection use the same identifier. As the packages
created by the GeneSets package use Entrez Gene identifiers, it
is best to use these in your own annotation frames as well. Also, make
sure the identifiers are passed as character and not as integer values.
termID: Pathway identifier. Make sure each pathway identifier is
unique across all pathway databases used. You can do this by prefixing
the IDs with a namespace identifier.
termName: Name of the pathway. A string describing the pathway, e.g. "negative regulation of sterol metabolism".
description: Pathway description. A longer description of the pathway. This field can be a full paragraph describing what this pathway does.
dbName: A short string giving the name of the pathway database used for the annotation, e.g. "KEGG".
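The following sketch shows one way to prepare a custom annotation frame along these lines: gene identifiers are converted to character, and pathway IDs are prefixed with the database name to keep them unique across databases. All identifiers and pathway names are made up, and using the dbName value as the namespace prefix is an illustrative choice.

```r
# Made-up raw annotation in which the gene IDs arrive as integers
rawAnnotation <- data.frame(
  geneID      = c(1001L, 1002L, 1003L, 1002L),
  termID      = c("00010", "00010", "00020", "00020"),
  termName    = c("pathway A", "pathway A", "pathway B", "pathway B"),
  description = c("First dummy pathway", "First dummy pathway",
                  "Second dummy pathway", "Second dummy pathway"),
  dbName      = "KEGG",
  stringsAsFactors = FALSE
)

annotation <- rawAnnotation
# Gene identifiers must be character values, not integers
annotation$geneID <- as.character(annotation$geneID)
# Prefix each pathway ID with its database name as a namespace
annotation$termID <- paste0(annotation$dbName, ":", annotation$termID)

str(annotation)
```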
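# Run on a single core
options(mc.cores=1)
# Dummy reference set of 50 gene IDs
referenceSet = sprintf("gene_%02d", 1:50)
# Nine dummy gene sets of 5 genes each, sampled from overlapping windows
geneSets = lapply(1:9, function(i) sample(referenceSet[((i-1)*5):((i+1)*5)], 5))
# Annotation table with one row per gene/gene-set pair
annotationTable = data.frame(termID=sprintf("set_%02d", rep(1:9, each=5)),
        geneID=unlist(geneSets),
        termName = sprintf("dummy gene set %d", rep(1:9, each=5)),
        dbName = "dummyDB",
        description = "A dummy gene set DB for testing purposes")
# Build the collection against the explicit reference set
collection = buildSetCollection(annotationTable, referenceSet=referenceSet)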