deduplicateGeneSets: Handling of Duplicated Gene Set Names
In rcastelo/GSVA: Gene Set Variation Analysis for Microarray and RNA-Seq Data

Description

Offers several ways of handling duplicated gene set names, which may not be suitable as input to other gene set analysis functions.

Usage

deduplicateGeneSets(
  geneSets,
  deduplUse = c("first", "drop", "union", "smallest", "largest")
)

Arguments

geneSets

A named list of gene sets, each represented as a character vector of gene IDs, e.g. as returned by readGMT.

deduplUse

A character vector of length 1 specifying one of several methods to handle duplicated gene set names. Duplicated gene set names are explicitly forbidden by the GMT file format specification but can nevertheless be encountered in the wild. The available choices are:

first (the default): drops all gene sets whose names are duplicated according to the base R function duplicated() and retains only the first occurrence of each gene set name.
drop: removes all gene sets that have a duplicated name, including the first occurrence of each such name.
union: replaces gene sets with duplicated names by a single gene set containing the union of all their gene IDs.
smallest: drops gene sets with duplicated names and retains only the smallest of them, i.e. the one with the fewest gene IDs. If there are several smallest gene sets, the first will be selected.
largest: drops gene sets with duplicated names and retains only the largest of them, i.e. the one with the most gene IDs. If there are several largest gene sets, the first will be selected.
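
The five choices above can be illustrated with a minimal base R sketch. This is not the GSVA source code: the helper name dedupSketch and the toy gene sets are hypothetical, and the order of the returned gene sets is an assumption of this sketch rather than documented behaviour. With GSVA installed, the corresponding call would follow the Usage section, e.g. deduplicateGeneSets(gs, deduplUse = "union").

```r
# Illustrative re-implementation of the five strategies described above.
# NOT the GSVA implementation; `dedupSketch` is a hypothetical helper.
dedupSketch <- function(geneSets,
                        deduplUse = c("first", "drop", "union",
                                      "smallest", "largest")) {
  deduplUse <- match.arg(deduplUse)
  nms <- names(geneSets)
  isDup <- duplicated(nms)            # TRUE for 2nd, 3rd, ... occurrences
  if (!any(isDup)) return(geneSets)   # no duplicated names: nothing to do
  dupNames <- unique(nms[isDup])      # names occurring more than once

  if (deduplUse == "first") return(geneSets[!isDup])
  if (deduplUse == "drop")  return(geneSets[!(nms %in% dupNames)])

  # Remaining strategies: reduce all sets sharing a name to a single set.
  collapse <- function(sets) {
    switch(deduplUse,
           union    = unique(unlist(sets, use.names = FALSE)),
           smallest = sets[[which.min(lengths(sets))]],  # first on ties
           largest  = sets[[which.max(lengths(sets))]])  # first on ties
  }
  result <- geneSets[!isDup]          # one slot per unique name
  for (nm in dupNames) {
    result[[nm]] <- collapse(geneSets[nms == nm])
  }
  result
}

# Toy input (made up for illustration): "setA" occurs twice.
gs <- list(setA = c("g1", "g2", "g3"),
           setB = c("g4", "g5"),
           setA = c("g3", "g6"),
           setC = "g7")

names(dedupSketch(gs, "first"))     # "setA" "setB" "setC"
names(dedupSketch(gs, "drop"))      # "setB" "setC"
dedupSketch(gs, "union")$setA       # "g1" "g2" "g3" "g6"
dedupSketch(gs, "smallest")$setA    # "g3" "g6"
dedupSketch(gs, "largest")$setA     # "g1" "g2" "g3"
```

Note that in this sketch only drop removes a duplicated name entirely; every other choice keeps exactly one gene set per name.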