permutation_model_inference: Model inference with permutation test.
In shaelebrown/TDAML: Machine Learning and Inference for Topological Data Analysis

Description

An inference procedure for testing whether two datasets were plausibly generated by the same process (i.e. whether the persistence diagram of one dataset is a good model of the persistence diagram of the other dataset).

Usage

permutation_model_inference(
  D1,
  D2,
  iterations,
  num_samples,
  dims = c(0, 1),
  samp = NULL,
  paired = F,
  num_workers = parallelly::availableCores(omit = 1),
  verbose = F,
  FUN_boot = "calculate_homology",
  thresh,
  distance_mat = FALSE,
  ripser = NULL,
  return_diagrams = FALSE
)

Arguments

`D1`	the first dataset (a data frame).
`D2`	the second dataset (a data frame).
`iterations`	the number of iterations for permuting group labels, default 20.
`num_samples`	the number of bootstrap iterations, default 30.
`dims`	a non-negative integer vector of the homological dimensions in which the test is to be carried out, default c(0,1).
`samp`	an optional list of row-number samples of 'D1', default NULL. See details and examples for more information. Ignored when 'paired' is FALSE.
`paired`	a boolean flag for whether there is a second-order pairing between diagrams at the same index in different groups, default FALSE.
`num_workers`	the number of cores used for parallel computation, default is one less than the number of cores on the machine.
`verbose`	a boolean flag for whether the time duration of the function call should be printed, default FALSE.
`FUN_boot`	a string representing the persistent homology function to use for calculating the bootstrapped persistence diagrams, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.
`thresh`	the positive numeric maximum radius of the Vietoris-Rips filtration.
`distance_mat`	a boolean representing if 'D1' and 'D2' are distance matrices (TRUE) or not (FALSE, default).
`ripser`	the imported ripser module when 'FUN_boot' is 'PyH'.
`return_diagrams`	whether or not to return the two lists of bootstrapped persistence diagrams, default FALSE.
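
As a rough illustration of the 'samp' argument described above, the following base R sketch builds a list of row-number samples and checks it against 'num_samples'. The helpers `make_samp` and `check_samp` are hypothetical names introduced here and are not part of the package. The use of unique row indices mirrors the Examples section and is otherwise an assumption about what the function expects.

```r
# Hypothetical helper (not part of the package): draw one bootstrap row
# sample per bootstrap iteration, keeping unique row indices as in the
# Examples section.
make_samp <- function(n_rows, num_samples) {
  lapply(seq_len(num_samples), function(i) {
    unique(sample(seq_len(n_rows), size = n_rows, replace = TRUE))
  })
}

# Hypothetical helper: check that 'samp' is a list with one entry per
# bootstrap iteration and that every entry holds valid row numbers.
check_samp <- function(samp, n_rows, num_samples) {
  is.list(samp) &&
    length(samp) == num_samples &&
    all(vapply(samp, function(s) {
      is.numeric(s) && length(s) > 0 &&
        all(s == round(s)) && all(s >= 1 & s <= n_rows)
    }, logical(1)))
}

set.seed(1)
samp <- make_samp(n_rows = 10, num_samples = 3)
samp_ok <- check_samp(samp, n_rows = 10, num_samples = 3)
```

A list built this way would be passed as 'samp' together with 'paired = TRUE' and a matching 'num_samples'.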

Details

Inference is carried out by generating bootstrap-resampled persistence diagrams from the two datasets and carrying out a permutation test on the resulting two groups. A small p-value in a given dimension suggests that the datasets are not good models of each other in that dimension. 'samp' should only be provided when 'paired' is TRUE, so that the bootstrapped persistence diagrams of 'D1' and 'D2' are generated from the same row samplings. This makes a paired permutation test more appropriate, which has higher statistical power for detecting topological differences. See the examples for how to properly supply 'samp'.
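
The bootstrap-then-permute logic can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: each bootstrapped persistence diagram is replaced by a scalar summary (mean nearest-neighbour distance), the test statistic is the absolute difference of group means, and the p-value uses a common add-one convention. The names `summary_stat`, `bootstrap_group` and `perm_test` are hypothetical.

```r
# ASSUMPTION: a scalar summary stands in for a persistence diagram so the
# sketch needs only base R. The real function compares diagrams instead.
summary_stat <- function(X) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf
  mean(apply(d, 1, min))  # mean nearest-neighbour distance
}

# Bootstrap step: resample rows with replacement, keep the unique rows,
# and summarise each resampled dataset.
bootstrap_group <- function(X, num_samples) {
  vapply(seq_len(num_samples), function(i) {
    rows <- unique(sample(seq_len(nrow(X)), size = nrow(X), replace = TRUE))
    summary_stat(X[rows, , drop = FALSE])
  }, numeric(1))
}

# Unpaired permutation step: shuffle group labels 'iterations' times and
# compare the permuted statistics with the observed one.
perm_test <- function(g1, g2, iterations) {
  observed <- abs(mean(g1) - mean(g2))
  pooled <- c(g1, g2)
  n1 <- length(g1)
  permuted <- vapply(seq_len(iterations), function(i) {
    shuffled <- sample(pooled)
    abs(mean(shuffled[seq_len(n1)]) - mean(shuffled[-seq_len(n1)]))
  }, numeric(1))
  (sum(permuted >= observed) + 1) / (iterations + 1)
}

set.seed(42)
theta <- runif(50, 0, 2 * pi)
circle <- cbind(cos(theta), sin(theta))          # points on a circle
blob <- matrix(rnorm(100, sd = 0.1), ncol = 2)   # a tight Gaussian cluster
g1 <- bootstrap_group(circle, num_samples = 10)
g2 <- bootstrap_group(blob, num_samples = 10)
p_value <- perm_test(g1, g2, iterations = 99)
```

In the sketch, 'num_samples' controls the size of each bootstrapped group and 'iterations' controls how many label permutations are drawn, matching the roles of the two arguments of the same names.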

Value

a list which contains the output of the call to permutation_test and the two groups of bootstrapped persistence diagrams if desired, in entries called 'diagrams1' and 'diagrams2'.

Author(s)

Shael Brown - shaelebrown@gmail.com

References

Robinson T, Turner K (2017). "Hypothesis testing for topological data analysis." https://link.springer.com/article/10.1007/s41468-017-0008-7.

Chazal F et al. (2017). "Robust Topological Inference: Distance to a Measure and Kernel Distance." https://www.jmlr.org/papers/volume18/15-484/15-484.pdf.

Abdallah H et al. (2021). "Statistical Inference for Persistent Homology applied to fMRI." https://github.com/hassan-abdallah/Statistical_Inference_PH_fMRI/blob/main/Abdallah_et_al_Statistical_Inference_PH_fMRI.pdf.

Examples


if(require("TDAstats"))
{
  # create two datasets (data frames), each holding 10 points
  # sampled from the unit circle
  D1 <- as.data.frame(TDAstats::circle2d[sample(1:100,10),])
  D2 <- as.data.frame(TDAstats::circle2d[sample(1:100,10),])

  # do model inference test with 1 iteration (for speed, more
  # iterations should be used in practice)
  model_test <- permutation_model_inference(D1, D2, iterations = 1,
                                            thresh = 1.75,num_samples = 3,
                                            num_workers = 2L)
  
  # with more iterations, p-values show a difference in the 
  # clustering of points but not in the arrangement of loops
  model_test$p_values
  
  # to supply samp, when we believe there is a correspondence between
  # the rows in D1 and the rows in D2
  # note that the number of entries of samp (3 in this case) must
  # match the num_samples parameter to the function call
  samp <- lapply(X = 1:3,FUN = function(X){
            return(unique(sample(1:nrow(D1),size = nrow(D1),replace = TRUE)))
          })
  
  # model inference will theoretically have higher power now for a
  # paired test 
  model_test2 <- permutation_model_inference(D1, D2, iterations = 1,
                                             thresh = 1.75,num_samples = 3,
                                             paired = TRUE,samp = samp,
                                             num_workers = 2L)
  model_test2$p_values
}

shaelebrown/TDAML documentation built on Nov. 1, 2024, 8:59 a.m.