| method.table | R Documentation |
The dataset contains the part of the results table of Stolte et al. (2024) that covers the methods implemented in the DataSimilarity package.
data("method.table")
A data frame with 42 observations on the following 30 variables that include information on whether or not the method fulfills the theoretical criteria of Stolte et al. (2024).
Some criteria are only fulfilled for certain parameter choices of the method ("Conditionally Fulfilled") or do not apply to the method.
NA values mean that there is no information available on whether or not the respective criterion is fulfilled.
Method: a character vector giving the reference or method name.
Implementation: a character vector giving the function name of the implementation in the DataSimilarity package.
Target.Inclusion: a character vector. Can the method handle datasets that include a target variable in a meaningful way?
Numeric: a character vector. Can the method handle numeric data?
Categorical: a character vector. Can the method handle categorical data?
Unequal.Sample.Sizes: a character vector. Can the method handle datasets of different sample sizes?
p.Larger.N: a character vector. Can the method handle datasets with more variables than observations?
Multiple.Samples: a character vector. Can the method handle k > 2 datasets simultaneously?
Without.training: a character vector. Does the method work without holding out training data?
No.assumptions: a character vector. Does the method work without further assumptions?
No.parameters: a character vector. Does the method work without the specification or tuning of additional parameters?
Implemented: a character vector. Is the method implemented elsewhere? (NA if no other implementations are known.)
Complexity: a character vector giving the computational complexity of the method.
Interpretable.units: a character vector. Can a one-unit increase of the output value be interpreted?
Lower.bound: a character vector. Are the output values bounded from below? If known, the lower bound is given.
Upper.bound: a character vector. Are the output values bounded from above? If known, the upper bound is given.
Rotation.invariant: a character vector. Is the method invariant to rotation of all datasets?
Location.change.invariant: a character vector. Is the method invariant to shifting all datasets?
Homogeneous.scale.invariant: a character vector. Is the method invariant to scaling all datasets?
Positive.definite: a character vector. Is the method positive definite, i.e., d(F_1, F_2) \ge 0 and d(F_1, F_2) = 0 \Leftrightarrow F_1 = F_2 for any two distributions F_1, F_2?
Symmetric: a character vector. Is the method symmetric, i.e., d(F_1, F_2) = d(F_2, F_1) for any two distributions F_1, F_2?
Triangle.inequality: a character vector. Does the method fulfill the triangle inequality, i.e., d(F_1, F_2) \le d(F_1, F_3) + d(F_3, F_2) for any three distributions F_1, F_2, F_3?
Consistency.N: a character vector. Is the corresponding test consistent for N \to \infty?
Consistency.p: a character vector. Is the corresponding test consistent for p \to \infty?
Number.Fulfilled: a numeric vector. Number of fulfilled criteria.
Number.Cond.Fulfilled: a numeric vector. Number of conditionally fulfilled criteria.
Number.Unfulfilled: a numeric vector. Number of unfulfilled criteria.
Number.NA: a numeric vector. Number of criteria for which it is unknown whether they are fulfilled.
Class: a character vector. Class of the taxonomy of Stolte et al. (2024) that the method is assigned to based on its underlying idea.
Subclass: a character vector. Subclass of the taxonomy of Stolte et al. (2024) that the method is assigned to based on its underlying idea.
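To illustrate how the four Number.* summary columns relate to the per-criterion columns, the following is a minimal sketch on a toy data frame. It does not use the real method.table: the method names, the choice of three criterion columns, and the labels "Fulfilled" and "Unfulfilled" are assumptions made for illustration (only "Conditionally Fulfilled" is quoted in the description above), and the real table may count criteria differently.

```r
# Toy stand-in for a few criterion columns (NOT the real method.table).
# Assumed labels: "Fulfilled", "Unfulfilled"; "Conditionally Fulfilled" is
# the only label quoted in the documentation above.
toy <- data.frame(
  Method           = c("A", "B", "C"),  # hypothetical method names
  Numeric          = c("Fulfilled", "Fulfilled", "Unfulfilled"),
  Categorical      = c("Unfulfilled", "Conditionally Fulfilled", NA),
  Multiple.Samples = c("Fulfilled", NA, "Conditionally Fulfilled"),
  stringsAsFactors = FALSE
)

# Columns treated as criteria in this sketch
criteria <- c("Numeric", "Categorical", "Multiple.Samples")

# Count, per row, how many criterion columns carry a given label;
# NA entries are skipped here and counted separately below.
count_label <- function(label) {
  unname(rowSums(toy[criteria] == label, na.rm = TRUE))
}

toy$Number.Fulfilled      <- count_label("Fulfilled")
toy$Number.Cond.Fulfilled <- count_label("Conditionally Fulfilled")
toy$Number.Unfulfilled    <- count_label("Unfulfilled")
toy$Number.NA             <- unname(rowSums(is.na(toy[criteria])))

print(toy[c("Method", "Number.Fulfilled", "Number.Cond.Fulfilled",
            "Number.Unfulfilled", "Number.NA")])
```

In this toy example the four counts of each row add up to the number of criterion columns, because every entry is either one of the three labels or NA.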
The dataset is based on the results of Stolte et al. (2024). For explanations of the criteria, the taxonomy, and its classes, please refer to that publication. A full version of the table can also be found at https://shiny.statistik.tu-dortmund.de/data-similarity/.
Article describing the criteria and taxonomy: Stolte, M., Kappenberg, F., Rahnenführer, J., Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
Full interactive results table: https://shiny.statistik.tu-dortmund.de/data-similarity/
data("method.table")
# Workflow for using the DataSimilarity package:
# Prepare data example: comparing species in iris dataset
data("iris")
iris.split <- split(iris[, -5], iris$Species)
setosa <- iris.split$setosa
versicolor <- iris.split$versicolor
virginica <- iris.split$virginica
# 1. Find appropriate methods that can be used to compare 3 numeric datasets:
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE)
# get more information
findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE, only.names = FALSE)
# 2. Choose a method and apply it:
# All suitable methods
possible.methods <- findSimilarityMethod(Numeric = TRUE, Multiple.Samples = TRUE,
                                         only.names = FALSE)
# Select, e.g., the method with the highest number of fulfilled criteria
possible.methods$Implementation[which.max(possible.methods$Number.Fulfilled)]
set.seed(1234)
if (requireNamespace("KMD")) {
  DataSimilarity(setosa, versicolor, virginica, method = "KMD")
}
# or directly
set.seed(1234)
if (requireNamespace("KMD")) {
  KMD(setosa, versicolor, virginica)
}
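The selection step in the example above uses which.max(), which returns only the first position of the maximum. If several methods tie for the highest number of fulfilled criteria, the others are silently dropped. The following minimal sketch shows the difference on a toy data frame; the implementation names and counts are hypothetical and are not taken from method.table.

```r
# Toy stand-in for the output of findSimilarityMethod(..., only.names = FALSE);
# names and counts are hypothetical.
possible <- data.frame(
  Implementation   = c("methodA", "methodB", "methodC"),
  Number.Fulfilled = c(12, 15, 15),
  stringsAsFactors = FALSE
)

# which.max() reports only the first maximum
first.best <- possible$Implementation[which.max(possible$Number.Fulfilled)]

# Comparing against the maximum keeps all tied methods;
# which() drops rows where the count is NA
best.count <- max(possible$Number.Fulfilled, na.rm = TRUE)
all.best <- possible$Implementation[which(possible$Number.Fulfilled == best.count)]

print(first.best)
print(all.best)
```

When ties occur, the remaining columns (for example Number.Cond.Fulfilled or Complexity) can serve as tie-breakers, depending on what matters for the application.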