GGRL (R Documentation)
Calculates Decision-Tree Based Measure of Dataset Distance by Ganti et al. (2002).
GGRL(X1, X2, target1 = "y", target2 = "y", n.perm = 0, m = 1, diff.fun = f.a,
agg.fun = sum, tune = TRUE, k = 5, n.eval = 100, seed = 42, ...)
GGRLCat(X1, X2, target1 = "y", target2 = "y", n.perm = 0, m = 1, diff.fun = f.aCat,
agg.fun = sum, tune = TRUE, k = 5, n.eval = 100, seed = 42, ...)
f.a(sec.parti, X1, X2)
f.s(sec.parti, X1, X2)
f.aCat(sec.parti, X1, X2)
f.sCat(sec.parti, X1, X2)
X1: First dataset as matrix or data.frame
X2: Second dataset as matrix or data.frame
target1: Character specifying the column name of the class variable in the first dataset (default: "y")
target2: Character specifying the column name of the class variable in the second dataset (default: "y")
n.perm: Number of permutations for the permutation test (default: 0, no permutation test performed)
m: Subsampling rate for the bootstrap test (default: 1). Ganti et al. (2002) suggest that 0.2-0.3 is sufficient in many cases. Ignored if n.perm = 0.
diff.fun: Difference function (default: f.a for GGRL, f.aCat for GGRLCat)
agg.fun: Aggregate function (default: sum)
tune: Should the decision tree parameters be tuned? (default: TRUE)
k: Number of folds used in cross-validation for parameter tuning (default: 5). Ignored if tune = FALSE.
n.eval: Number of evaluations of the random search used for parameter tuning (default: 100). Ignored if tune = FALSE.
seed: Random seed (default: 42)
...: Further arguments passed to rpart
sec.parti: Intersected partition (greatest common refinement) of the sample space partitions induced by the two fitted trees
The method first calculates the greatest common refinement (GCR), that is, the intersection of the sample space partitions induced by a decision tree fit to the first dataset and a decision tree fit to the second dataset. The proportion of samples falling into each section of the GCR is calculated for each dataset. These proportions are compared using a difference function, and the results are aggregated by the aggregate function.
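The GCR construction can be illustrated with a minimal sketch. The code below is Python, not the package's R code, and it assumes the simplest possible case: one-dimensional trees whose partitions are given by axis-aligned split points, so that intersecting the two partitions amounts to taking the union of the cut points. The helper names (`gcr_cuts`, `region_proportions`) are hypothetical.

```python
# Minimal sketch of the GCR idea (Python, not part of the R package):
# for one-dimensional trees, each tree's partition is a set of split
# points, and the GCR is the partition induced by the union of all splits.
import bisect

def gcr_cuts(cuts1, cuts2):
    # Intersecting two interval partitions = taking the union of cut points.
    return sorted(set(cuts1) | set(cuts2))

def region_proportions(x, cuts):
    # Proportion of samples in each region (-inf, c1), [c1, c2), ..., [ck, inf).
    counts = [0] * (len(cuts) + 1)
    for v in x:
        counts[bisect.bisect_right(cuts, v)] += 1
    return [c / len(x) for c in counts]

# Tree on dataset 1 splits at 0, tree on dataset 2 splits at 0.5:
cuts = gcr_cuts([0.0], [0.5])  # GCR regions: (-inf, 0), [0, 0.5), [0.5, inf)
x1 = [-1.2, -0.3, 0.1, 0.4, 0.9, 1.5]
x2 = [-0.5, 0.2, 0.3, 0.6, 0.8, 1.1]
p1 = region_proportions(x1, cuts)  # [1/3, 1/3, 1/3]
p2 = region_proportions(x2, cuts)  # [1/6, 1/3, 1/2]
# f.a-style comparison of the proportions, aggregated with sum:
distance = sum(abs(a - b) for a, b in zip(p1, p2))  # 1/3
```

Real decision trees induce axis-aligned boxes in higher dimensions, but the principle is the same: each GCR region is an intersection of one region from each tree's partition.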
The implementation uses rpart for fitting classification trees to each dataset. best.rpart is used for hyperparameter tuning if tune = TRUE. The parameters are tuned using cross-validation and random search: minsplit is tuned over 2^(1:7), minbucket over 2^(0:6), and cp over 10^seq(-4, -1, by = 0.001).
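For concreteness, the tuning grids named above can be enumerated explicitly. The sketch below is Python, not the package's R code (the actual search is carried out by best.rpart); it only shows the size and shape of the search space and one random draw of candidate configurations.

```python
# Sketch of the random-search space described above (Python; the package
# itself tunes via best.rpart in R). Grid sizes: 7 x 7 x 3001 combinations.
import random

minsplit_grid = [2 ** i for i in range(1, 8)]    # 2^(1:7) = 2, 4, ..., 128
minbucket_grid = [2 ** i for i in range(0, 7)]   # 2^(0:6) = 1, 2, ..., 64
cp_grid = [10 ** (-4 + 0.001 * i) for i in range(3001)]  # 10^seq(-4, -1, by = 0.001)

random.seed(42)  # mirrors the seed argument
candidates = [
    {"minsplit": random.choice(minsplit_grid),
     "minbucket": random.choice(minbucket_grid),
     "cp": random.choice(cp_grid)}
    for _ in range(100)  # n.eval candidate configurations
]
```

Each candidate would then be scored by k-fold cross-validation and the best configuration kept for the final tree.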
Pre-implemented methods for the difference function are

f_a(\kappa_1, \kappa_2, n_1, n_2) = |\frac{\kappa_1}{n_1} - \frac{\kappa_2}{n_2}|

and

f_s(\kappa_1, \kappa_2, n_1, n_2) = \frac{|\frac{\kappa_1}{n_1} - \frac{\kappa_2}{n_2}|}{(\frac{\kappa_1}{n_1} + \frac{\kappa_2}{n_2}) / 2} if \kappa_1 + \kappa_2 > 0, and 0 otherwise,

where \kappa_i is the number of observations from dataset i in the respective region of the greatest common refinement and n_i are the sample sizes, i = 1, 2.
The aggregate function aggregates the results of the difference function over all regions in the greatest common refinement.
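As a check on the formulas, both difference functions translate directly into code. The snippet below (Python, mirroring f_a and f_s above; it is not the package code) includes the zero convention for regions that are empty in both datasets.

```python
def f_a(k1, k2, n1, n2):
    # Absolute difference of the region proportions.
    return abs(k1 / n1 - k2 / n2)

def f_s(k1, k2, n1, n2):
    # Absolute difference scaled by the mean proportion;
    # defined as 0 when the region is empty in both datasets.
    if k1 + k2 == 0:
        return 0.0
    p1, p2 = k1 / n1, k2 / n2
    return abs(p1 - p2) / ((p1 + p2) / 2)

# A region holding 30 of 100 samples from dataset 1 and 10 of 100 from
# dataset 2: f_a = |0.3 - 0.1| = 0.2 and f_s = 0.2 / 0.2 = 1.
# Summing these values over all GCR regions (agg.fun = sum) yields the
# final statistic.
```

The scaled version f_s weights a given absolute difference more heavily in sparsely populated regions, since it divides by the mean proportion.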
An object of class htest
with the following components:
statistic: Observed value of the test statistic
p.value: Permutation p value
alternative: The alternative hypothesis
method: Description of the test
data.name: The dataset names
Target variable? | Numeric? | Categorical? | K-sample? |
No | Yes | Yes | No |
The categorical method might not work properly if certain combinations of the categorical variables are not present in both datasets. This can happen, e.g., for a large number of categories or variables, or for small numbers of observations. In that case, the decision tree of the dataset in which the combination is missing may be unable to match a level of the split variable to one of the child nodes. The combination is then not part of the partition of the sample space induced by the tree, and hence also not part of the greatest common refinement. As a consequence, some points of the other dataset cannot be sorted into any region of the greatest common refinement, and the probabilities in the joint distribution calculated over the greatest common refinement no longer sum to one. A warning is printed in these cases. It is unclear how this affects the performance.
Note that for small numbers of categories and deep trees, it might also happen that the greatest common refinement reduces to all observed combinations of categories of the variables. In that case, the dataset distance measure is just a complicated way of measuring the difference in frequencies of all observed combinations.
Ganti, V., Gehrke, J., Ramakrishnan, R. and Loh, W.-Y. (2002). A Framework for Measuring Differences in Data Characteristics. Journal of Computer and System Sciences, 64(3). doi:10.1006/jcss.2001.1808
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149
See Also: NKT
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
# Calculate Ganti et al. statistic (without tuning and testing due to runtime)
if(requireNamespace("rpart", quietly = TRUE)) {
GGRL(X1, X2, "y", "y", tune = FALSE)
}
# Categorical case
set.seed(1234)
X1 <- data.frame(X1 = factor(sample(letters[1:5], 1000, TRUE)),
                 X2 = factor(sample(letters[1:4], 1000, TRUE)),
                 X3 = factor(sample(letters[1:3], 1000, TRUE)),
                 y = sample(0:1, 1000, TRUE))
X2 <- data.frame(X1 = factor(sample(letters[1:5], 1000, TRUE, 1:5)),
                 X2 = factor(sample(letters[1:4], 1000, TRUE, 1:4)),
                 X3 = factor(sample(letters[1:3], 1000, TRUE, 1:3)),
                 y = sample(0:1, 1000, TRUE))
# Calculate Ganti et al. statistic (without tuning and testing due to runtime)
if(requireNamespace("rpart", quietly = TRUE)) {
GGRLCat(X1, X2, "y", "y", tune = FALSE)
}