test_codiu_genes: Statistical testing of candidate co-DIU genes

View source: R/coDIU_genes.R

test_codiu_genes    R Documentation

Statistical testing of candidate co-DIU genes

Description

Pairwise statistical testing of co-Differential Isoform Usage relationships. For a selected set of gene pairs showing co-expression of isoforms across clusters (see find_codiu_genes), this function tests the significance of the detected co-DIU patterns.

Warning: this function may take a long time to run, especially if applied to all pairs of co-DIU genes returned by find_codiu_genes.

Usage

test_codiu_genes(
  data,
  cluster_list,
  shared_genes,
  gene_tr_table,
  id_table,
  isoform_col = NULL,
  parallel = TRUE,
  t = 4
)

Arguments

data

A data.frame or tibble object including isoforms as rows and cells as columns. Isoform IDs can be included as row names (data.frame) or as an additional column (tibble).

cluster_list

A list of character vectors containing isoform IDs. Each element of the list represents a cluster of isoforms.

shared_genes

A two-row matrix containing n candidate co-DIU gene pairs as columns, i.e. one gene pair per column. Typically the result of running find_codiu_genes.

gene_tr_table

A data.frame or tibble object containing two columns named transcript_id and gene_id, indicating gene-isoform correspondence.

id_table

A data.frame including two columns named cell and cell_type, providing the correspondence between cell IDs and cell types. The number of rows should equal the total number of cell columns in data, and the order of the cell column should match the column (i.e. cell) order in data.

isoform_col

When a tibble is provided in data, a character object indicating the name of the column where isoform IDs are specified. Otherwise, isoform identifiers will be assumed to be defined as rownames, and this argument will not need to be provided.

parallel

A logical. When TRUE, parallelization is enabled using the future_map_lgl function in the furrr package.

t

An integer indicating the number of threads to be used for parallelization. This will be passed to the plan function from the future package via the workers argument.
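The argument descriptions above can be made concrete with a small set of toy inputs. This is a minimal sketch in base R, not taken from the package: all IDs (g1_iso1, cell_1, typeA, ...) are hypothetical, and the final call is left commented out because it requires acorde to be installed and has not been run on data this small.

```r
# Toy inputs shaped as described in the Arguments section.
# All isoform, gene, cell and cell type IDs below are hypothetical.
set.seed(1)

isoforms <- c("g1_iso1", "g1_iso2", "g2_iso1", "g2_iso2")
cells <- paste0("cell_", 1:6)

# data: isoforms as rows, cells as columns, isoform IDs as row names
data <- as.data.frame(matrix(
  rpois(length(isoforms) * length(cells), lambda = 20),
  nrow = length(isoforms),
  dimnames = list(isoforms, cells)
))

# cluster_list: one character vector of isoform IDs per cluster
cluster_list <- list(
  c("g1_iso1", "g2_iso1"),
  c("g1_iso2", "g2_iso2")
)

# shared_genes: two-row matrix, one candidate gene pair per column
shared_genes <- matrix(c("gene1", "gene2"), nrow = 2)

# gene_tr_table: gene-isoform correspondence
gene_tr_table <- data.frame(
  transcript_id = isoforms,
  gene_id = c("gene1", "gene1", "gene2", "gene2"),
  stringsAsFactors = FALSE
)

# id_table: one row per cell, in the same order as the columns of data
id_table <- data.frame(
  cell = cells,
  cell_type = rep(c("typeA", "typeB", "typeC"), each = 2),
  stringsAsFactors = FALSE
)

# Consistency checks implied by the argument descriptions
stopifnot(
  nrow(id_table) == ncol(data),
  identical(id_table$cell, colnames(data)),
  all(unlist(cluster_list) %in% rownames(data)),
  all(gene_tr_table$transcript_id %in% rownames(data))
)

# With acorde installed, the call would then look like this (not run here):
# results <- test_codiu_genes(data, cluster_list, shared_genes,
#                             gene_tr_table, id_table,
#                             parallel = FALSE)
```

Running the checks before the call is cheap and catches the most likely input mistake, a mismatch between the row order of id_table and the column order of data.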

Details

A set of potentially co-DIU genes will have at least two of their isoforms assigned to the same clusters, i.e. show detectable isoform-level co-expression. However, since clustering allows isoforms with slightly variable expression patterns to be grouped together, some isoforms might be assigned to clusters that do not faithfully represent their expression profile, leading to inaccuracies in co-DIU detection. To avoid false-positive co-DIU genes, the present function applies a regression model and a statistical test to each candidate pair of genes (hereafter gene 1 and gene 2), where at least two of the isoforms of each gene must belong to the same two clusters (hereafter cluster 1 and cluster 2).

Briefly, we need to assess whether expression values for the isoforms follow a correct co-DIU pattern. First, the average profile across cell types of the two isoforms in cluster 1 must be significantly different from the average profile of the two isoforms in cluster 2, indicating distinct expression profiles for the two isoforms of each gene. Second, the average profile of the two isoforms of gene 1 must not be different from the average profile of the two isoforms of gene 2, indicating that co-expression is only detectable when isoform-level expression is considered.
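The pattern described above can be illustrated with hand-picked numbers. The following base R sketch is purely descriptive and is not part of the package: the expression values, isoform IDs and cluster labels are hypothetical, and the function itself relies on a regression model rather than on a direct comparison of averages.

```r
# Hypothetical mean expression per isoform (rows) and cell type (columns)
expr <- rbind(
  g1_iso1 = c(typeA = 10, typeB = 2,  typeC = 2),
  g2_iso1 = c(typeA = 12, typeB = 4,  typeC = 2),
  g1_iso2 = c(typeA = 2,  typeB = 10, typeC = 12),
  g2_iso2 = c(typeA = 4,  typeB = 12, typeC = 10)
)

# Cluster and gene membership of each isoform, in the row order of expr
cluster <- c("cluster1", "cluster1", "cluster2", "cluster2")
gene    <- c("gene1", "gene2", "gene1", "gene2")

# Average profile across cell types: two isoforms per cluster and per gene
cluster_profiles <- rowsum(expr, cluster) / 2
gene_profiles    <- rowsum(expr, gene) / 2

# Difference between the two average profiles in each grouping
cluster_diff <- cluster_profiles["cluster1", ] - cluster_profiles["cluster2", ]
gene_diff    <- gene_profiles["gene1", ] - gene_profiles["gene2", ]

print(cluster_profiles)
print(gene_profiles)
print(rbind(cluster_diff, gene_diff))
```

In this toy case the cluster-level profiles diverge strongly across cell types while the gene-level profiles stay close to each other, which is the combination a co-DIU pair is expected to show.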

Internally, the function fits a generalized linear model (GLM) via the glm function, using the negative.binomial function in the MASS package to set the error distribution and link function of the model via the family argument. To test the significance of the cluster*cell type and gene*cell type interactions (as described above), the function computes type-II analysis-of-variance (ANOVA) tables for the model with likelihood-ratio chi-square tests, using the Anova function in the car package. Type-II tests are used because the design is unbalanced.
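The modelling step described above can be sketched on simulated data. This is an illustration under stated assumptions, not the code in make_test: the model formula, the fixed theta = 10, the simulated counts and the balanced toy design are all hypothetical choices, and the sketch requires the MASS and car packages.

```r
library(MASS)  # negative.binomial()
library(car)   # Anova()

set.seed(42)

# Hypothetical long-format design for one gene pair:
# 2 clusters x 2 genes (4 isoforms), 3 cell types, 10 cells per cell type
design <- expand.grid(
  replicate = 1:10,
  cell_type = c("typeA", "typeB", "typeC"),
  cluster = c("cluster1", "cluster2"),
  gene = c("gene1", "gene2"),
  stringsAsFactors = TRUE
)

# Simulated counts whose profile depends on cluster and cell type only:
# cluster1 isoforms are high in typeA, cluster2 isoforms in typeB and typeC
mu <- ifelse(
  (design$cluster == "cluster1") == (design$cell_type == "typeA"), 50, 5
)
design$expression <- rnbinom(nrow(design), mu = mu, size = 10)

# Negative binomial GLM with both interactions; theta is an assumed value
fit <- glm(
  expression ~ cluster + gene + cell_type +
    cluster:cell_type + gene:cell_type,
  family = negative.binomial(theta = 10),
  data = design
)

# Type-II ANOVA table with likelihood-ratio chi-square tests
aov_tab <- Anova(fit, type = 2, test.statistic = "LR")

# p-values of the two interactions of interest
p_values <- aov_tab[c("cluster:cell_type", "gene:cell_type"), "Pr(>Chisq)"]
names(p_values) <- c("cluster:cell_type", "gene:cell_type")
print(p_values)
```

Because the counts are simulated with a strong cluster by cell type effect and no gene effect, the cluster:cell_type p-value should be very small, while the gene:cell_type p-value carries no signal by construction.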

Value

A list containing one tibble per tested gene pair, as generated by make_test. Each tibble will include two columns, cluster:cell_type and gene:cell_type, containing the p-value obtained when testing each of these interactions in the type-II ANOVA test.

NOTE: In some cases the assumptions required for fitting the GLM are not met, and an NA value is returned instead. These NA values are kept in the output so that users can identify untested gene pairs, and they can easily be removed afterwards.
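Removing untested pairs and reading the remaining p-values could look like the sketch below. It uses a hypothetical stand-in for the returned list (plain data frames instead of tibbles, so that only base R is needed), it assumes that a failed test appears either as NA p-values or as a bare NA element, and the 0.05 threshold is an assumption rather than a package default.

```r
# Helper to build a one-row table with the two documented column names
make_row <- function(p_cluster, p_gene) {
  data.frame(
    `cluster:cell_type` = p_cluster,
    `gene:cell_type` = p_gene,
    check.names = FALSE
  )
}

# Hypothetical stand-in for the output: one element per tested gene pair
results <- list(
  make_row(0.001, 0.60),
  make_row(NA_real_, NA_real_),  # GLM could not be fitted
  make_row(0.200, 0.03),
  NA                             # alternative form of an untested pair
)

# Flag gene pairs with any missing value, then keep the tested ones
untested <- vapply(results, function(x) any(is.na(x)), logical(1))
tested <- results[!untested]

# A pair is consistent with co-DIU when the cluster interaction is
# significant and the gene interaction is not (alpha is an assumed threshold)
alpha <- 0.05
is_codiu <- vapply(
  tested,
  function(x) x[["cluster:cell_type"]] < alpha && x[["gene:cell_type"]] >= alpha,
  logical(1)
)

print(untested)
print(is_codiu)
```

Keeping the logical vector untested alongside the filtered list makes it possible to map the remaining results back to the columns of shared_genes.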

References

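Venables WN, Ripley BD (2002). Modern Applied Statistics with S, 4th edition. Springer, New York.

Fox J, Weisberg S (2019). An R Companion to Applied Regression, 3rd edition. Sage, Thousand Oaks, CA.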

See Also

For details, see internal functions: make_design, make_test.


ConesaLab/acorde documentation built on Feb. 25, 2024, 4:16 a.m.