readGMT: Import Gene Sets from a GMT File

readGMT    R Documentation

Import Gene Sets from a GMT File

Description

Imports a list of gene sets from a GMT (Gene Matrix Transposed) format file, offering a choice of ways to handle duplicated gene set names.

Usage

readGMT(
  con,
  sep = "\t",
  geneIdType = "auto",
  collectionType = NullCollection(),
  valueType = c("GeneSetCollection", "list"),
  deduplUse = c("first", "drop", "union", "smallest", "largest"),
  ...
)

Arguments

con

A connection object or a non-empty character string of length 1 containing e.g. the filename or URL of a (possibly compressed) GMT file.

sep

The character string separating members of each gene set in the GMT file.
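
For orientation, a GMT file stores one gene set per line: the set name, a description field, and then the member gene IDs, all separated by sep. The following base R sketch parses two made-up records through a text connection. It only illustrates the file layout and is not the parser used by readGMT; the set names, the description values and the variable names are assumptions chosen for the example.

```r
## Two toy GMT records: set name, description, then gene IDs (tab-separated).
gmtLines <- c("SET_A\thttp://example.org/a\tTP53\tBRCA1",
              "SET_B\tna\tEGFR\tMYC\tKRAS")

## readLines() accepts a connection, here an in-memory text connection.
con <- textConnection(gmtLines)
lines <- readLines(con)
close(con)

## Split each record on the separator; fields 1 and 2 are name and description.
fields <- strsplit(lines, "\t", fixed = TRUE)
genesets <- lapply(fields, function(x) x[-(1:2)])
names(genesets) <- vapply(fields, `[`, character(1), 1)

genesets
```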

geneIdType

Either a character vector of length 1 with the special value "auto" (the default) or an object of a subclass of GeneIdentifierType. If set to "auto", the function will try to derive the gene ID type from the imported gene sets using guessGeneIdType. Other values, including NULL, will be ignored with a warning and geneIdType=NullIdentifier() will be used instead. Depending on the value of argument valueType, the gene ID type of the resulting list or of all GeneSet objects in the resulting GeneSetCollection will be set to this value.

collectionType

Only used when valueType == "GeneSetCollection". See getGmt for more information.

valueType

A character vector of length 1 specifying the desired type of return value. It must be one of:

  • GeneSetCollection (the default): a GeneSetCollection object as defined and described by package GSEABase.

  • list: a named list of gene sets represented as character vectors of gene IDs. This format is much simpler and cannot store the metadata required for automatic mapping of gene IDs.

deduplUse

A character vector of length 1 specifying one of several methods to handle duplicated gene set names. Duplicated gene set names are explicitly forbidden by the GMT file format specification but can nevertheless be encountered in the wild. The available choices are:

  • first (the default): drops all gene sets whose names are flagged as duplicated by the base R function duplicated() and so retains only the first occurrence of each gene set name.

  • drop: removes all gene sets that have a duplicated name, including the first occurrence of that name.

  • union: replaces gene sets with duplicated names by a single gene set containing the union of all their gene IDs.

  • smallest: drops gene sets with duplicated names and retains only the smallest of them, i.e. the one with the fewest gene IDs. If there are several smallest gene sets, the first will be selected.

  • largest: drops gene sets with duplicated names and retains only the largest of them, i.e. the one with the most gene IDs. If there are several largest gene sets, the first will be selected.
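
The five choices can be illustrated on a toy named list in base R. This is a minimal sketch of the behaviour described above, not the code readGMT uses internally; the gene sets and the helper pickBySize() are hypothetical and exist only for this example.

```r
## Toy gene sets in which the name "SET_A" occurs twice.
sets <- list(SET_A = c("TP53", "BRCA1"),
             SET_B = c("EGFR"),
             SET_A = c("TP53", "MYC", "KRAS"))
nms <- names(sets)

## first: keep only the first occurrence of each name.
first <- sets[!duplicated(nms)]

## drop: remove every set whose name occurs more than once.
dupNames <- unique(nms[duplicated(nms)])
dropped <- sets[!(nms %in% dupNames)]

## union: merge all sets that share a name into one set.
merged <- lapply(split(unname(sets), factor(nms, levels = unique(nms))),
                 function(x) unique(unlist(x)))

## smallest / largest: keep one set per name, chosen by size.
## which.min() and which.max() return the first match, so ties go to the
## set that appears first.
pickBySize <- function(sets, pick) {
  nms <- names(sets)
  sizes <- lengths(sets)
  keep <- vapply(unique(nms), function(n) {
    idx <- which(nms == n)
    idx[pick(sizes[idx])]
  }, integer(1))
  sets[keep]
}
smallest <- pickBySize(sets, which.min)
largest  <- pickBySize(sets, which.max)

lengths(first); lengths(dropped); lengths(merged)
lengths(smallest); lengths(largest)
```

In every case the result contains each gene set name at most once; the strategies differ only in which gene IDs survive under a duplicated name.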

...

Further arguments passed on to readLines().

Value

The gene sets imported from the GMT file, with duplicated gene set names resolved according to argument deduplUse and returned in the format determined by argument valueType.

See Also

readLines, GeneSetCollection, getGmt

Examples

library(GSVA)
library(GSVAdata)

fname <- system.file("extdata", "c7.immunesigdb.v2024.1.Hs.symbols.gmt.gz",
                     package="GSVAdata")

## by default, guess geneIdType from content and return a GeneSetCollection
genesets <- readGMT(fname)
genesets

## how to manually override the geneIdType
genesets <- readGMT(fname, geneIdType=NullIdentifier())
genesets

## return a simple list instead of a GeneSetCollection
genesets <- readGMT(fname, valueType="list")
head(genesets, 2)

## the list has a geneIdType, too
gsvaAnnotation(genesets)


rcastelo/GSVA documentation built on Jan. 18, 2025, 6:36 a.m.