count2tpm: Convert read counts to transcripts per million (TPM)

View source: R/count2tpm.R

count2tpm R Documentation

Convert read counts to transcripts per million (TPM)

Description

The count2tpm function transforms gene expression count data into transcripts per million (TPM) values. It supports gene IDs of type "Ensembl", "Entrez", or "Symbol". Gene lengths are taken from the effLength parameter when it is provided; otherwise they are obtained from the source named by the source parameter, either an online connection to the BioMart database or a local dataset. If check_data is set to TRUE, the count data are checked for missing values, which are removed.

Based on idType and org, the function matches the identifiers in the count matrix to the annotation database and replaces them accordingly. After the gene names are processed, gene lengths are looked up according to idType. The function then calculates TPM values and removes any genes that have no corresponding length information. Finally, duplicated genes are removed with the remove_duplicate_genes function, and the resulting TPM values are returned as a data frame.
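The TPM calculation described above follows the standard definition: each count is divided by the length of its gene, and the resulting rates are rescaled so that every sample sums to one million. The base R sketch below illustrates that definition only. It is not IOBR's internal code, and counts_to_tpm, the toy count matrix, and the gene lengths are hypothetical names and values chosen for illustration.

```r
# Hypothetical toy inputs (not taken from IOBR): 3 genes x 2 samples
counts <- matrix(
  c(100, 200, 300,
    400, 500, 600),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 3000)

# Hypothetical helper implementing the standard TPM definition
counts_to_tpm <- function(counts, gene_length) {
  stopifnot(nrow(counts) == length(gene_length))
  # Reads per base: the length vector recycles down each column,
  # so every row is divided by its own gene length
  rate <- counts / gene_length
  # Rescale each sample (column) so that it sums to one million
  t(t(rate) / colSums(rate)) * 1e6
}

# Align lengths to the row order of the count matrix before converting
tpm <- counts_to_tpm(counts, gene_length[rownames(counts)])
colSums(tpm)
```

Because every column is rescaled independently, TPM values are comparable within a sample. Removing genes after the conversion, as the function does for duplicated genes, means the columns of the final result need not sum to exactly one million.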

Usage

count2tpm(
  countMat,
  idType = "Ensembl",
  org = "hsa",
  source = "local",
  effLength = NULL,
  id = "id",
  gene_symbol = "symbol",
  length = "eff_length",
  check_data = FALSE
)

Arguments

countMat

The count matrix that needs to be transformed to TPM.

idType

(Optional, defaults to "Ensembl"): The type of gene identifier used in the count matrix: "Ensembl", "Entrez", or "Symbol".

org

(Optional, defaults to "hsa"): The organism the count data come from. Options include "hsa" (human), "mmus" (mouse), and others.

source

(Optional, defaults to "local"): The source from which gene lengths are retrieved, either "local" or "biomart". Gene lengths can also be supplied manually through effLength. When idType is "Ensembl" and source is "local", the gene lengths are those provided by IOBR, which were estimated with the getGeneLengthAndGCContent function of the EDASeq package on 2023-02-10.

effLength

(Optional, defaults to NULL): The effective gene length used for TPM transformation.

id

(Optional, defaults to "id"): The column name in effLength that represents the gene identifier.

gene_symbol

(Optional, defaults to "symbol"): The column name in effLength that represents the gene symbol.

length

(Optional, defaults to "eff_length"): The column name in effLength that represents the gene length.

check_data

(Optional, defaults to FALSE): Whether to check the count matrix for missing values and remove them.
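As an illustration of how the inputs described by these arguments might be prepared, the base R sketch below drops genes with missing counts and builds a gene-length table that uses the default column names "id", "symbol", and "eff_length". All identifiers and lengths are made-up placeholders, and the row-wise removal of missing values is an assumption about what check_data does, not a statement of IOBR's implementation.

```r
# Hypothetical toy count matrix; the third gene has a missing value in s1
counts <- matrix(
  c(10, 20, NA, 40, 50, 60),
  nrow = 3,
  dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C"), c("s1", "s2"))
)

# Assumption: missing values are handled by dropping whole genes (rows)
counts_clean <- counts[stats::complete.cases(counts), , drop = FALSE]

# Gene-length table with the default column names expected by
# the id, gene_symbol, and length arguments (placeholder values)
eff_length_tbl <- data.frame(
  id         = c("ENSG_A", "ENSG_B", "ENSG_C"),
  symbol     = c("GENE1", "GENE2", "GENE3"),
  eff_length = c(1500, 2300, 900),
  stringsAsFactors = FALSE
)

# Confirm the table has every column the arguments point at
required <- c("id", "symbol", "eff_length")
missing_cols <- setdiff(required, colnames(eff_length_tbl))
stopifnot(length(missing_cols) == 0)

# Genes that have both complete counts and a length entry
shared <- intersect(rownames(counts_clean), eff_length_tbl$id)
shared
```

A table shaped like eff_length_tbl could then be passed as effLength. If its columns were named differently, the id, gene_symbol, and length arguments would name them instead.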

Value

A TPM expression profile, returned as a data frame.

Author(s)

Wubing Zhang

Dongqiang Zeng

Examples

# Using the TCGA count data as an example
data(eset_stad, package = "IOBR")
# Transformation is accompanied by gene annotation
eset <- count2tpm(countMat = eset_stad, source = "local", idType = "ensembl")
head(eset)

# TPM transformation can also be performed using gene symbols, but this is not recommended
data("anno_grch38", package = "IOBR")
eset <- anno_eset(eset = eset_stad, annotation = anno_grch38, probe = "id")
eset <- count2tpm(countMat = eset, source = "local", idType = "symbol")
head(eset)

IOBR/IOBR documentation built on April 4, 2024, 1:07 a.m.