remove_duplicate_genes: Remove Duplicate Gene Symbols in Gene Expression Data
In IOBR: Immune Oncology Biological Research

Description

This function resolves duplicate gene symbols in a gene expression dataset by keeping, for each duplicated symbol, the single row that ranks highest under a chosen criterion. The criterion can be the mean, the standard deviation, or the sum of the row's expression values. This is useful when preparing data for downstream analyses that require unique gene identifiers, for example when gene symbols are used as row names.
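Before deduplicating, it can help to see which symbols are actually repeated. The following is a minimal base R sketch on a toy data frame; the object names (`toy`, `dup_symbols`, `symbol_counts`) are illustrative only and are not part of IOBR.

```r
# Toy expression data: GeneA appears on two rows (illustrative values only)
toy <- data.frame(
  symbol = c("GeneA", "GeneA", "GeneB", "GeneC"),
  S1 = c(10, 5, 20, 15),
  S2 = c(12, 7, 22, 17)
)

# Symbols that occur more than once
dup_symbols <- unique(toy$symbol[duplicated(toy$symbol)])

# Number of rows per symbol
symbol_counts <- table(toy$symbol)

print(dup_symbols)
print(symbol_counts)
```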

Usage

remove_duplicate_genes(eset, column_of_symbol, method = c("mean", "sd", "sum"))

Arguments

`eset`	A data frame or matrix of gene expression data, with the gene symbols stored in one of its columns.
`column_of_symbol`	The name of the column in `eset` that contains the gene symbols.
`method`	The ranking criterion used to choose among rows that share a gene symbol: "mean" for mean expression, "sd" for standard deviation, or "sum" for the sum of expression values. Default is "mean".
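The selection described by these arguments can be sketched in base R as follows. This is an illustration under stated assumptions, not IOBR's actual implementation: `select_top_duplicate` is a hypothetical helper, every column other than the symbol column is assumed to be numeric, and the row order of its result (sorted by decreasing score) may differ from what `remove_duplicate_genes` returns.

```r
# Hypothetical helper, NOT part of IOBR: keep the top-ranked row per symbol.
select_top_duplicate <- function(eset, column_of_symbol,
                                 method = c("mean", "sd", "sum")) {
  method <- match.arg(method)
  eset <- as.data.frame(eset)

  # Assumption: all columns except the symbol column hold numeric expression
  expr_cols <- setdiff(colnames(eset), column_of_symbol)
  expr <- as.matrix(eset[, expr_cols, drop = FALSE])

  # One ranking score per row, according to the chosen criterion
  score <- switch(method,
    mean = rowMeans(expr, na.rm = TRUE),
    sd   = apply(expr, 1, sd, na.rm = TRUE),
    sum  = rowSums(expr, na.rm = TRUE)
  )

  # Sort rows by decreasing score, then drop later occurrences of each symbol
  eset <- eset[order(score, decreasing = TRUE), , drop = FALSE]
  eset <- eset[!duplicated(eset[[column_of_symbol]]), , drop = FALSE]

  # Move the symbols into the row names and drop the symbol column
  rownames(eset) <- eset[[column_of_symbol]]
  eset[[column_of_symbol]] <- NULL
  eset
}

toy <- data.frame(
  symbol = c("GeneA", "GeneA", "GeneB", "GeneC"),
  S1 = c(10, 5, 20, 15),
  S2 = c(12, 7, 22, 17)
)
res <- select_top_duplicate(toy, "symbol", method = "mean")
print(res)
```

In this toy data the two GeneA rows have means of 11 and 6, so the sketch keeps the row with S1 = 10 and S2 = 12.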

Value

A modified version of 'eset' where duplicate gene symbols have been reduced to a single entry (the highest-ranking one). The gene symbols are set as row names in the returned data frame.

Note

Important: This function performs selection, not aggregation. For duplicate genes, it retains only the highest-ranking instance (based on the specified method) and discards others.
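To make the distinction concrete, the sketch below contrasts aggregation with selection on a toy data frame using base R only. It does not call IOBR; the object names are illustrative, and the selection step is a simplified stand-in for the documented behavior with `method = "mean"`.

```r
toy <- data.frame(
  symbol = c("GeneA", "GeneA", "GeneB", "GeneC"),
  S1 = c(10, 5, 20, 15),
  S2 = c(12, 7, 22, 17)
)

# Aggregation (NOT what remove_duplicate_genes does):
# duplicated rows are averaged, so GeneA becomes a blend of both rows
aggregated <- aggregate(cbind(S1, S2) ~ symbol, data = toy, FUN = mean)

# Selection: rank rows by mean expression and keep the top row per symbol,
# so GeneA keeps the values of one original row unchanged
row_score <- rowMeans(toy[, c("S1", "S2")])
selected <- toy[order(row_score, decreasing = TRUE), ]
selected <- selected[!duplicated(selected$symbol), ]

print(aggregated)
print(selected)
```

Under aggregation GeneA ends up with S1 = 7.5 and S2 = 9.5, values that appear in neither original row; under selection it keeps S1 = 10 and S2 = 12 and the lower-ranked row is discarded.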

Author(s)

Dongqiang Zeng

Examples

test_eset <- data.frame(
  symbol = c("GeneA", "GeneA", "GeneB", "GeneC"),
  S1 = c(10, 5, 20, 15),
  S2 = c(12, 7, 22, 17)
)
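# Remove duplicates using mean expression.
# GeneA appears twice with row means of 11 and 6, so with method = "mean"
# the first GeneA row (S1 = 10, S2 = 12) is the one retained.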
test_eset_unique <- remove_duplicate_genes(
  eset = test_eset,
  column_of_symbol = "symbol",
  method = "mean"
)
print(test_eset_unique)

IOBR documentation built on May 30, 2026, 5:07 p.m.