universal_null: Filtering topological features with the universal null...

View source: R/inference.R

universal_null R Documentation

Filtering topological features with the universal null distribution.

Description

An inference procedure to determine which topological features (if any) of a dataset are likely signal (i.e. significant) and which are noise.

Usage

universal_null(
  X,
  FUN_diag = "calculate_homology",
  maxdim = 1,
  thresh,
  distance_mat = FALSE,
  ripser = NULL,
  ignore_infinite_cluster = TRUE,
  calculate_representatives = FALSE,
  alpha = 0.05,
  return_pvals = FALSE,
  infinite_cycle_inference = FALSE
)

Arguments

X

the input dataset, which must be either a matrix or a data frame.

FUN_diag

a string representing the persistent homology function to use for calculating the full persistence diagram, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.

maxdim

the integer maximum homological dimension for persistent homology, default 1.

thresh

the positive numeric maximum radius of the Vietoris-Rips filtration.

distance_mat

a boolean representing if 'X' is a distance matrix (TRUE) or not (FALSE, default).

ripser

the imported ripser module when 'FUN_diag' is 'PyH'.

ignore_infinite_cluster

a boolean indicating whether or not to ignore the infinitely lived cluster when 'FUN_diag' is 'PyH'. If infinite cycle inference is to be performed, this parameter should be set to FALSE.

calculate_representatives

a boolean representing whether to calculate representative (co)cycles, default FALSE. Note that representatives cannot be calculated when using the 'calculate_homology' function, nor for (significant) infinite cycles.

alpha

the type-1 error threshold, default 0.05.

return_pvals

a boolean representing whether or not to return p-values for features in the subsetted diagram as well as a list of p-value thresholds, default FALSE. Infinite cycles that are significant (see below) will have p-value NA in this list, as the true value is unknown but less than its dimension's p-value threshold.

infinite_cycle_inference

a boolean representing whether or not to perform inference for features with infinite (i.e. 'thresh') death values, default FALSE. If 'FUN_diag' is 'calculate_homology' (the default) then no infinite cycles will be returned by the persistent homology calculation at all.

Details

For each feature in a diagram we compute its persistence ratio \pi = death/birth and a test statistic A log log \pi + B (where A and B are constants). This statistic is compared to a left-skewed Gumbel distribution to get a p-value. A Bonferroni correction is applied to the p-values of all features, so when 'return_pvals' is TRUE a list of p-value thresholds is also returned, one for each dimension, each equal to 'alpha' divided by the number of features in that dimension.

If desired, infinite cycles (i.e. cycles whose death value is equal to the maximum distance threshold parameter for the persistent homology calculation) can be analyzed for significance. For each such cycle we determine the minimum distance threshold at which it might be significant (using the Gumbel distribution again), calculate the persistence diagram up to that threshold, and check whether the cycle is still infinite (i.e. significant) or not.

This function is significantly faster than the bootstrap_persistence_thresholds function. Note that the 'calculate_homology' function does not seem to store infinite cycles (i.e. cycles that have death value equal to 'thresh').
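
The computation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names 'pvals_sketch' and 'min_significant_death' are hypothetical, and the choices A = 1 and B = -(Euler-Mascheroni constant) - A * mean(log log \pi) are assumptions based on the referenced paper rather than values stated on this page. The sketch handles the finite features of a single dimension with positive birth values.

```r
# Hypothetical helpers for illustration only; they are not part of the package.

# p-values for the finite features of one dimension, from birth and death values.
# A = 1 and the centering used for B are assumptions taken from the reference.
pvals_sketch <- function(birth, death, alpha = 0.05, A = 1) {
  stopifnot(all(birth > 0), all(death > birth))
  loglog_ratio <- log(log(death / birth))      # log log of the persistence ratio
  euler_gamma <- -digamma(1)                   # Euler-Mascheroni constant
  B <- -euler_gamma - A * mean(loglog_ratio)   # assumed form of the constant B
  stat <- A * loglog_ratio + B                 # test statistic A log log pi + B
  pvals <- exp(-exp(stat))                     # upper tail of the left-skewed Gumbel
  alpha_thresh <- alpha / length(pvals)        # Bonferroni threshold for this dimension
  list(pvals = pvals, alpha_thresh = alpha_thresh,
       significant = pvals < alpha_thresh, B = B)
}

# Smallest death value at which a feature born at `birth` would reach
# `alpha_thresh`, holding B fixed (this inverts the p-value formula above).
min_significant_death <- function(birth, alpha_thresh, B, A = 1) {
  birth * exp(exp((log(-log(alpha_thresh)) - B) / A))
}

# three short-lived features and one long-lived feature
birth <- c(0.1, 0.2, 0.3, 0.1)
death <- c(0.12, 0.25, 0.33, 1.5)
out <- pvals_sketch(birth, death)
out$significant

# distance threshold an infinite cycle born at 0.5 would need to survive to
min_significant_death(birth = 0.5, alpha_thresh = out$alpha_thresh, B = out$B)
```

The second helper mirrors the infinite-cycle step only loosely: it treats B as fixed, whereas the actual procedure recomputes the persistence diagram up to the new threshold.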

Value

a list containing the full persistence diagram and the subsetted diagram, plus, if requested, the representatives and/or subsetted representatives, the p-values of the subsetted features and the Bonferroni p-value thresholds in each dimension.

Author(s)

Shael Brown - shaelebrown@gmail.com

References

Bobrowski O, Skraba P (2023). "A universal null-distribution for topological data analysis." https://www.nature.com/articles/s41598-023-37842-2.

Examples


if(require("TDA"))
{
  # create dataset
  theta <- runif(n = 100,min = 0,max = 2*pi)
  x <- cos(theta)
  y <- sin(theta)
  circ <- data.frame(x = x,y = y)

  # add noise
  x_noise <- -0.1 + 0.2*stats::runif(n = 100)
  y_noise <- -0.1 + 0.2*stats::runif(n = 100)
  circ$x <- circ$x + x_noise
  circ$y <- circ$y + y_noise

  # determine significant topological features
  # (TDA is already attached by require() above)
  res <- universal_null(circ, thresh = 2, alpha = 0.1,
                        return_pvals = TRUE,
                        FUN_diag = "ripsDiag")
  res$subsetted_diag
  res$pvals
  res$alpha_thresh

  # at a lower threshold we can check for infinite cycles
  res2 <- universal_null(circ, thresh = 1.1,
                         infinite_cycle_inference = TRUE,
                         alpha = 0.1,
                         FUN_diag = "ripsDiag")
  res2$subsetted_diag
}

shaelebrown/TDAML documentation built on Nov. 1, 2024, 8:59 a.m.