readGMT: Import Gene Sets from a GMT File
In rcastelo/GSVA: Gene Set Variation Analysis for Microarray and RNA-Seq Data

readGMT

R Documentation

Description

Imports a list of gene sets from a GMT (Gene Matrix Transposed) format file, offering a choice of ways to handle duplicated gene set names.
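For orientation, a GMT file stores one gene set per line: the gene set name, a description field, and then the member gene IDs, all separated by tabs. The following is a minimal base R sketch of that layout, not the implementation used by `readGMT`; the toy file contents and the helper name `parse_gmt_lines` are assumptions made up for illustration.

```r
# Hypothetical toy GMT content: name <TAB> description <TAB> gene IDs...
gmt_lines <- c(
  "SET_A\thttp://example.org/a\tTP53\tBRCA1\tEGFR",
  "SET_B\tna\tMYC\tKRAS"
)

# Illustrative helper (not part of GSVA): split each line on `sep`,
# use field 1 as the set name, skip field 2 (the description),
# and keep the remaining fields as gene IDs.
parse_gmt_lines <- function(lines, sep = "\t") {
  fields <- strsplit(lines, sep, fixed = TRUE)
  sets <- lapply(fields, function(f) f[-(1:2)])
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

toy_sets <- parse_gmt_lines(gmt_lines)
str(toy_sets)
```

The result is a plain named list of character vectors, which is roughly the shape that `valueType = "list"` describes, minus any gene ID type metadata.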

Usage

readGMT(
  con,
  sep = "\t",
  geneIdType = "auto",
  collectionType = NullCollection(),
  valueType = c("GeneSetCollection", "list"),
  deduplUse = c("first", "drop", "union", "smallest", "largest"),
  ...
)

Arguments

`con`	A connection object or a non-empty character string of length 1 containing e.g. the filename or URL of a (possibly compressed) GMT file.
`sep`	The character string separating members of each gene set in the GMT file.
`geneIdType`	Either a character vector of length 1 with the special value `"auto"` (the default) or an object of a subclass of `GeneIdentifierType`. If set to `"auto"`, the function will try to derive the gene ID type from the gene IDs read from the GMT file using `guessGeneIdType`. Other values, including `NULL`, will be ignored with a warning and `geneIdType=NullIdentifier()` will be used instead. Depending on the value of argument `valueType`, the gene ID type of the resulting list or of all `GeneSet` objects in the resulting `GeneSetCollection` will be set to this value.
`collectionType`	Only used when `valueType == "GeneSetCollection"`. See `getGmt` for more information.
`valueType`	A character vector of length 1 specifying the desired type of return value. It must be one of: `GeneSetCollection` (the default): a `GeneSetCollection` object as defined and described by package `GSEABase`. `list`: a named list of gene sets represented as character vectors of gene IDs. This format is much simpler and cannot store the metadata required for automatic mapping of gene IDs.
`deduplUse`	A character vector of length 1 specifying one of several methods to handle duplicated gene set names. Duplicated gene set names are explicitly forbidden by the GMT file format specification but can nevertheless be encountered in the wild. The available choices are: `first` (the default): drops all gene sets whose names are `duplicated` according to the base R function and retains only the first occurrence of a gene set name. `drop`: removes all gene sets that have a duplicated name, including their first occurrence. `union`: replaces gene sets with duplicated names by a single gene set containing the union of all their gene IDs. `smallest`: drops gene sets with duplicated names and retains only the smallest of them, i.e. the one with the fewest gene IDs. If there are several smallest gene sets, the first will be selected. `largest`: drops gene sets with duplicated names and retains only the largest of them, i.e. the one with the most gene IDs. If there are several largest gene sets, the first will be selected.
`...`	Further arguments passed on to `readLines()`.
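To make the five `deduplUse` choices concrete, here is a small base R sketch that applies each documented rule to a toy named list. It is an illustration under assumptions, not the code `readGMT` runs internally: the helper `dedup_sets` and the toy gene sets are hypothetical, and the order of gene IDs in the `union` result is a choice of this sketch.

```r
# Hypothetical toy input: a named list in which "SET_A" occurs three times.
sets <- list(
  SET_A = c("TP53", "BRCA1"),
  SET_B = c("MYC", "KRAS"),
  SET_A = c("TP53", "EGFR", "PTEN"),
  SET_A = c("ATM")
)

# Illustrative helper (not part of GSVA) mimicking the documented choices.
dedup_sets <- function(sets, use = c("first", "drop", "union",
                                     "smallest", "largest")) {
  use <- match.arg(use)
  nms <- names(sets)
  dup_names <- unique(nms[duplicated(nms)])
  # "first": keep only the first occurrence of every name.
  if (use == "first") return(sets[!duplicated(nms)])
  # "drop": remove every set whose name occurs more than once.
  if (use == "drop") return(sets[!(nms %in% dup_names)])
  # Remaining methods: collapse each group of same-named sets into one.
  out <- lapply(unique(nms), function(n) {
    grp <- unname(sets[nms == n])
    switch(use,
      union    = unique(unlist(grp)),
      smallest = grp[[which.min(lengths(grp))]],  # ties: first wins
      largest  = grp[[which.max(lengths(grp))]])  # ties: first wins
  })
  names(out) <- unique(nms)
  out
}

names(dedup_sets(sets, "first"))    # first SET_A and SET_B are kept
names(dedup_sets(sets, "drop"))     # every SET_A is removed
dedup_sets(sets, "union")$SET_A     # all gene IDs seen under SET_A
dedup_sets(sets, "smallest")$SET_A  # the one-gene set
dedup_sets(sets, "largest")$SET_A   # the three-gene set
```

`which.min()` and `which.max()` return the first position among ties, which matches the documented rule that the first of several equally sized gene sets is selected.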

Value

The gene sets imported from the GMT file, with duplicate gene sets resolved according to argument `deduplUse` and in the format determined by argument `valueType`.

Examples

library(GSVA)
suppressPackageStartupMessages(library(GSVAdata))

fname <- file.path(system.file("extdata", package="GSVAdata"),
   "c2.subsetdups.v7.5.symbols.gmt.gz")

## by default, guess geneIdType from content and return a GeneSetCollection
genesets <- readGMT(fname)
genesets

## how to manually override the geneIdType
genesets <- readGMT(fname, geneIdType=NullIdentifier())
genesets

## how to drop *all* gene sets with duplicated names (instead of keeping
## only the first occurrence of each duplicated name)
genesets <- readGMT(fname, deduplUse="drop")
genesets

## return a simple list instead of a GeneSetCollection
genesets <- readGMT(fname, valueType="list")
head(genesets, 2)

## the list has a geneIdType, too
gsvaAnnotation(genesets)

rcastelo/GSVA documentation built on June 14, 2025, 6:38 p.m.