method.table | R Documentation |
The dataset contains the subset of methods that are implemented in the DataSimilarity package from the results table of Stolte et al. (2024).
data("method.table")
A data frame with 42 observations on the following 30 variables, which record whether or not each method fulfills the theoretical criteria of Stolte et al. (2024).
Some criteria are only fulfilled for certain parameter choices of the method ("Conditionally Fulfilled") or do not apply to the method.
NA values mean that there is no information available on whether or not the respective criterion is fulfilled.
Method
a character vector giving the reference or method name
Implementation
a character vector giving the function name of the implementation in the DataSimilarity package
Target.Inclusion
a character vector. Can the method handle datasets that include a target variable in a meaningful way?
Numeric
a character vector. Can the method handle numeric data?
Categorical
a character vector. Can the method handle categorical data?
Unequal.Sample.Sizes
a character vector. Can the method handle datasets of different sample sizes?
p.Larger.N
a character vector. Can the method handle datasets with more variables than observations?
Multiple.Samples
a character vector. Can the method handle k > 2 datasets simultaneously?
Without.training
a character vector. Does the method work without holding out training data?
No.assumptions
a character vector. Does the method work without further assumptions?
No.parameters
a character vector. Does the method work without the specification or tuning of additional parameters?
Implemented
a character vector. Is the method implemented elsewhere? (NA if no other implementations are known)
Complexity
a character vector giving the computational complexity of the method.
Interpretable.units
a character vector. Can a one unit increase of the output value be interpreted?
Lower.bound
a character vector. Are the output values bounded from below? If known, the lower bound is given.
Upper.bound
a character vector. Are the output values bounded from above? If known, the upper bound is given.
Rotation.invariant
a character vector. Is the method invariant to rotation of all datasets?
Location.change.invariant
a character vector. Is the method invariant to shifting all datasets?
Homogeneous.scale.invariant
a character vector. Is the method invariant to scaling all datasets?
Positive.definite
a character vector. Is the method positive definite, i.e. d(F_1, F_2) \ge 0 and d(F_1, F_2) = 0 \Leftrightarrow F_1 = F_2 for any two distributions F_1, F_2?
Symmetric
a character vector. Is the method symmetric, i.e. d(F_1, F_2) = d(F_2, F_1) for any two distributions F_1, F_2?
Triangle.inequality
a character vector. Does the method fulfill the triangle inequality, i.e. d(F_1, F_2) \le d(F_1, F_3) + d(F_3, F_2) for any three distributions F_1, F_2, F_3?
Consistency.N
a character vector. Is the corresponding test consistent for N \to \infty?
Consistency.p
a character vector. Is the corresponding test consistent for p \to \infty?
Number.Fulfilled
a numeric vector. Number of fulfilled criteria.
Number.Cond.Fulfilled
a numeric vector. Number of conditionally fulfilled criteria.
Number.Unfulfilled
a numeric vector. Number of unfulfilled criteria.
Number.NA
a numeric vector. Number of criteria for which it is unknown if they are fulfilled.
Class
a character vector. Class of the taxonomy of Stolte et al. (2024) that the method is assigned to based on its underlying idea.
Subclass
a character vector. Subclass of the taxonomy of Stolte et al. (2024) that the method is assigned to based on its underlying idea.
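The three criteria Positive.definite, Symmetric and Triangle.inequality listed above are the axioms of a metric, so a method that fulfills all three defines a distance on the space of distributions. For reference, they can be written together for a dissimilarity d and arbitrary distributions F_1, F_2, F_3 (same notation as in the variable descriptions):

```latex
\begin{align*}
  &\text{Positive definiteness:} & d(F_1, F_2) &\ge 0
      \quad\text{and}\quad d(F_1, F_2) = 0 \Leftrightarrow F_1 = F_2, \\
  &\text{Symmetry:}              & d(F_1, F_2) &= d(F_2, F_1), \\
  &\text{Triangle inequality:}   & d(F_1, F_2) &\le d(F_1, F_3) + d(F_3, F_2).
\end{align*}
```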
The dataset is based on the results of Stolte et al. (2024). For explanations on the criteria and on the taxonomy and classes, please refer to that publication. A full version of the table can also be found at https://shiny.statistik.tu-dortmund.de/data-similarity/.
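As a minimal sketch of how the table might be explored before choosing a method, the snippet below tabulates one criterion column, counts methods per taxonomy class, and lists the methods with the most fulfilled criteria. It uses only base R and the column names documented above. The exact codings stored in the character columns are not assumed here, which is why they are tabulated rather than matched against fixed strings. The names `ord` and `top` are illustrative only.

```r
# Run only if the DataSimilarity package is installed
if (requireNamespace("DataSimilarity", quietly = TRUE)) {
  data("method.table", package = "DataSimilarity")

  # Which codings occur in one criterion column (missing entries included)?
  print(table(method.table$Symmetric, useNA = "ifany"))

  # How many methods fall into each class of the taxonomy?
  print(table(method.table$Class))

  # Five methods with the highest number of fulfilled criteria
  ord <- order(method.table$Number.Fulfilled, decreasing = TRUE)
  top <- method.table[head(ord, 5),
                      c("Method", "Implementation", "Number.Fulfilled")]
  print(top)
}
```

Sorting on Number.Fulfilled gives only a rough ranking; which criteria matter depends on the data at hand, so the individual criterion columns are usually worth checking as well.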
Article describing the criteria and taxonomy: Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
Full interactive results table: https://shiny.statistik.tu-dortmund.de/data-similarity/
data("method.table")
# Workflow for using the DataSimilarity package:
# Prepare data example: comparing species in iris dataset
data("iris")
iris.split <- split(iris[, -5], iris$Species)
setosa <- iris.split$setosa
versicolor <- iris.split$versicolor
virginica <- iris.split$virginica
# 1. Find appropriate methods that can be used to compare 3 numeric datasets:
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE)
# get more information
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, only.names = FALSE)
# 2. Choose a method and apply it:
# All suitable methods
possible.methods <- findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE,
                                         only.names = FALSE)
# Select, e.g., the method with the highest number of fulfilled criteria
possible.methods$Implementation[which.max(possible.methods$Number.Fulfilled)]
set.seed(1234)
if (requireNamespace("KMD")) {
  DataSimilarity(setosa, versicolor, virginica, method = "KMD")
}
# or directly
set.seed(1234)
if (requireNamespace("KMD")) {
  KMD(setosa, versicolor, virginica)
}