calculate_diversity: Main function for calculating splicing diversity
In esebesty/SplicingFactory: Splicing Diversity Analysis for Transcriptome Data


Main function for calculating splicing diversity

calculate_diversity(
  x,
  genes = NULL,
  method = "laplace",
  norm = TRUE,
  tpm = FALSE,
  assayno = 1,
  verbose = FALSE
)

`x`	A numeric `matrix`, `data.frame`, `tximport` list, `DGEList`, `SummarizedExperiment` or `ExpressionSet`.
`genes`	Character vector with one gene identifier per row of the transcript-level input, so its length equals the number of rows of `x`. The values in `x` are grouped into genes based on this vector.
`method`	Method to use for splicing diversity calculation, including naive entropy (`naive`), Laplace entropy (`laplace`), Gini index (`gini`), Simpson index (`simpson`) and inverse Simpson index (`invsimpson`). The default method is Laplace entropy.
`norm`	If `TRUE`, the entropy values are normalized to the number of transcripts for each gene. The normalized entropy values are always between 0 and 1. If `FALSE`, entropy values cannot be compared between genes, because the maximum possible entropy depends on the number of transcripts of each gene.
`tpm`	In the case of a `tximport` list, either TPM values or raw read counts can serve as input. If `TRUE`, TPM values are used; if `FALSE`, read counts are used.
`assayno`	An integer value. In case of multiple assays in a `SummarizedExperiment` input, the argument specifies the assay number to use for diversity calculations.
`verbose`	If `TRUE`, the function will print additional diagnostic messages, besides the warnings and errors.
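
To illustrate why `norm = TRUE` makes genes comparable, here is a minimal base-R sketch. It assumes Shannon entropy in bits and normalization by the log2 of the transcript count; the helper `shannon_entropy` is hypothetical and is not claimed to be the package's implementation.

```r
# Hypothetical helper, not the package's implementation:
# Shannon entropy (in bits) of a vector of transcript expression values.
shannon_entropy <- function(x, norm = TRUE) {
  p <- x / sum(x)   # transcript frequencies within the gene
  p <- p[p > 0]     # treat 0 * log2(0) as 0
  h <- -sum(p * log2(p))
  # Divide by the maximum possible entropy for this number of transcripts.
  # In this sketch a single-transcript gene would give 0 / 0.
  if (norm) h <- h / log2(length(x))
  h
}

two_isoforms  <- c(50, 50)          # gene with 2 equally expressed transcripts
four_isoforms <- c(25, 25, 25, 25)  # gene with 4 equally expressed transcripts

# Unnormalized maxima differ with the number of transcripts
raw <- c(shannon_entropy(two_isoforms, norm = FALSE),
         shannon_entropy(four_isoforms, norm = FALSE))

# Normalized values share the same 0 to 1 scale
scaled <- c(shannon_entropy(two_isoforms), shannon_entropy(four_isoforms))
```

Both genes are maximally diverse, yet their unnormalized entropies differ; after normalization they receive the same value.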

The function is intended to process transcript-level expression data from RNA-seq or similar datasets.

Given an N x M matrix or similar data structure, where the N rows are transcripts and the M columns are samples, and a vector of gene ids used for aggregating the transcript-level data, the function calculates transcript diversity values for each gene in each sample. These diversity values can be used to investigate the dominance of a specific transcript for a gene, to assess the diversity of transcripts in a gene, and to analyze changes in diversity.
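
The grouping step can be sketched in base R as follows. This is only an illustration of the row-by-gene aggregation under simple assumptions, not the package's internal code; the toy data and the `dominance` statistic are hypothetical stand-ins for a real diversity index.

```r
# Toy N x M matrix: 5 transcripts (rows) in 2 samples (columns)
x <- matrix(c(10, 30,
              10, 10,
               5, 20,
               5, 20,
              10, 40),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("Sample1", "Sample2")))

# One gene id per row of x
genes <- c("GeneA", "GeneA", "GeneB", "GeneB", "GeneB")

# Hypothetical per-gene statistic: share of the most abundant transcript
dominance <- function(v) max(v) / sum(v)

# Split row indices by gene, then apply the statistic to each sample column
rows_by_gene <- split(seq_len(nrow(x)), genes)
per_gene <- t(sapply(rows_by_gene, function(idx) {
  apply(x[idx, , drop = FALSE], 2, dominance)
}))
```

`per_gene` has one row per gene and one column per sample, which mirrors the shape of the gene-level output described here.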

There are a number of diversity values implemented in the package. These include the following:

Naive entropy: Shannon entropy using the transcript frequencies as probabilities. An entropy of 0 means a single dominant transcript; higher values mean a more diverse set of transcripts for a gene.
Laplace entropy: Shannon entropy where the transcript frequencies are replaced by a Bayesian estimate, using Laplace's prior.
Gini index: a measure of statistical dispersion originally used in economics. This measurement ranges from 0 (complete equality) to 1 (complete inequality). A value of 1 means a single dominant transcript.
Simpson index: a measure of diversity, characterizing the number of different species (transcripts of a gene) in a dataset. Originally, this measurement calculates the probability that randomly selected individuals belong to different species. Simpson index ranges between 0 and 1; the higher the value, the higher the diversity.
Inverse Simpson index: A similar concept to the Simpson index; here too, a higher value means greater diversity. It ranges between 1 and the total number of transcripts for a gene.
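
The measures above can be compared on a single gene with a short base-R sketch. The helpers below use textbook formulas and are hypothetical; they are not claimed to match the package's implementation. In particular, the Laplace estimate is assumed to add a pseudocount of 1 to every transcript, and the Gini index is computed from the mean absolute difference without a small-sample correction.

```r
# Hypothetical helpers for one gene in one sample;
# x holds the transcript expression values of that gene.
entropy_bits <- function(p) {
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

naive_entropy   <- function(x) entropy_bits(x / sum(x))
# Assumption: Laplace's prior adds a pseudocount of 1 to every transcript
laplace_entropy <- function(x) entropy_bits((x + 1) / sum(x + 1))
# Gini index: mean absolute difference relative to twice the mean
gini_index      <- function(x) {
  sum(abs(outer(x, x, "-"))) / (2 * length(x)^2 * mean(x))
}
simpson         <- function(x) 1 - sum((x / sum(x))^2)
inverse_simpson <- function(x) 1 / sum((x / sum(x))^2)

even     <- c(10, 10, 10, 10)  # four equally expressed transcripts
dominant <- c(40, 0, 0, 0)     # one dominant transcript

indices <- list(naive = naive_entropy, laplace = laplace_entropy,
                gini = gini_index, simpson = simpson,
                invsimpson = inverse_simpson)

# Rows: expression pattern; columns: diversity measure
comparison <- sapply(indices, function(f) {
  c(even = f(even), dominant = f(dominant))
})
```

For the evenly expressed gene, the entropies and the Simpson-type indices are at their maximum and the Gini index is 0; for the gene with one dominant transcript the pattern reverses. With this uncorrected Gini formula, a single dominant transcript among four gives 0.75, and the value approaches 1 only as the number of transcripts grows.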

The function can calculate the gene-level diversity index using any kind of expression measure, including raw read counts, FPKM, RPKM or TPM values, although the resulting diversity values may differ depending on the measure used.
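
One possible source of such differences is the scale of the input. The sketch below assumes that Laplace entropy adds a pseudocount of 1 to every transcript; under that assumption, rescaling the expression values leaves naive entropy unchanged but shifts Laplace entropy. The helpers are hypothetical and only illustrate the effect.

```r
# Hypothetical helpers, not the package's implementation
entropy_bits <- function(p) {
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}
naive_entropy   <- function(x) entropy_bits(x / sum(x))
# Assumption: pseudocount of 1 per transcript
laplace_entropy <- function(x) entropy_bits((x + 1) / sum(x + 1))

counts <- c(900, 90, 10)  # transcript-level values on a count-like scale
scaled <- counts / 100    # same proportions on a smaller scale

# Naive entropy depends only on proportions, so it is unchanged
naive_same <- isTRUE(all.equal(naive_entropy(counts), naive_entropy(scaled)))

# The pseudocount weighs more on the smaller scale, pulling the
# frequencies toward uniform and raising the entropy
laplace_diff <- laplace_entropy(scaled) - laplace_entropy(counts)
```

Here `naive_same` is `TRUE`, while `laplace_diff` is positive because the fixed pseudocount has a larger relative effect on small values.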

Gene-level splicing diversity values in a SummarizedExperiment object.

# matrix with RNA-seq read counts
x <- matrix(rpois(60, 10), ncol = 6)
colnames(x) <- paste0("Sample", 1:6)

# gene names used for grouping the transcript level data
gene <- c(rep("Gene1", 3), rep("Gene2", 2), rep("Gene3", 3), rep("Gene4", 2))

# calculating normalized Laplace entropy
result <- calculate_diversity(x, gene, method = "laplace", norm = TRUE)