NKT: Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et...
In DataSimilarity: Quantifying Similarity of Datasets and Multivariate Two- And k-Sample Testing

NKT	R Documentation

Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et al., 2008)

Description

Calculates the decision-tree based measure of dataset similarity by Ntoutsi et al. (2008).

Usage

NKT(X1, X2, target1 = "y", target2 = "y", version = 1, tune = TRUE, k = 5, 
      n.eval = 100, seed = NULL, ...)

Arguments

`X1`	First dataset as matrix or data.frame
`X2`	Second dataset as matrix or data.frame
`target1`	Character specifying the column name of the class variable in the first dataset (default: `"y"`)
`target2`	Character specifying the column name of the class variable in the second dataset (default: `"y"`)
`version`	Number in `1:3` specifying the version for calculating dataset similarity (default: 1). See Details.
`tune`	Should the decision tree parameters be tuned? (default: `TRUE`)
`k`	Number of folds used in cross-validation for parameter tuning (default: 5). Ignored if `tune = FALSE`.
`n.eval`	Number of evaluations for random search used for parameter tuning (default: 100). Ignored if `tune = FALSE`.
`seed`	Random seed (default: NULL). A random seed will only be set if one is provided.
`...`	Further arguments passed to `rpart`. Ignored if `tune = TRUE`.

Details

Ntoutsi et al. (2008) define three measures of dataset similarity based on the joint partition of the sample space, obtained by intersecting the partitions defined by two decision trees, one fit to each dataset. Denote by \hat{P}_X(\mathcal{X}) the proportion of observations in a dataset X that fall into each segment of the joint partition and by \hat{P}_X(Y, \mathcal{X}) the proportion of observations in a dataset X that fall into each segment of the joint partition and belong to each class.

s(p, q) = \sum_{i} \sqrt{p_i \cdot q_i}

defines the similarity index for two vectors p and q. Then the measures of similarity are defined by

\text{NTO1} = s(\hat{P}_{X_1}(\mathcal{X}), \hat{P}_{X_2}(\mathcal{X})),

\text{NTO2} = s(\hat{P}_{X_1}(Y, \mathcal{X}), \hat{P}_{X_2}(Y, \mathcal{X})),

\text{NTO3} = S(Y|\mathcal{X})^{T} \hat{P}_{X_1 \cup X_2}(\mathcal{X}),

where S(Y|\mathcal{X}) is the similarity vector with elements

S(Y|\mathcal{X})_i = s(\hat{P}_{X_1}(Y|\mathcal{X})_{i \bullet}, \hat{P}_{X_2}(Y|\mathcal{X})_{i \bullet})

and index i \bullet denotes the i-th row, i.e. the estimated class proportions within the i-th segment of the joint partition.
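The three definitions above can be traced by hand on a small table of counts. The following is a minimal sketch in base R, not the package's internal code: the count matrices `n1` and `n2` are hypothetical, with classes in rows and segments of the joint partition in columns, and every segment is assumed to be non-empty in both datasets so that the conditional proportions are well defined.

```r
# Similarity index s(p, q) = sum_i sqrt(p_i * q_i)
s <- function(p, q) sum(sqrt(p * q))

# Hypothetical counts: rows = classes, columns = segments of the joint partition
n1 <- matrix(c(30, 10,  5,
                5, 20, 30), nrow = 2, byrow = TRUE,
             dimnames = list(class = c("0", "1"), segment = c("A", "B", "C")))
n2 <- matrix(c(10, 25, 10,
               15, 10, 30), nrow = 2, byrow = TRUE,
             dimnames = dimnames(n1))

# Joint proportions P_X(Y, segment) and marginal proportions P_X(segment)
joint1 <- n1 / sum(n1)
joint2 <- n2 / sum(n2)
seg1 <- colSums(joint1)
seg2 <- colSums(joint2)

nto1 <- s(seg1, seg2)      # version = 1
nto2 <- s(joint1, joint2)  # version = 2

# Conditional class proportions P_X(Y | segment), one row per segment
cond1 <- t(n1) / colSums(n1)
cond2 <- t(n2) / colSums(n2)

# Similarity vector S(Y | segment), weighted by the pooled segment proportions
S <- sapply(seq_len(nrow(cond1)), function(i) s(cond1[i, ], cond2[i, ]))
pooled <- colSums(n1 + n2) / sum(n1 + n2)
nto3 <- sum(S * pooled)    # version = 3

c(NTO1 = nto1, NTO2 = nto2, NTO3 = nto3)
```

Since NTO2 factorizes segment-wise into the NTO1 summand times the within-segment class similarity, it can never exceed NTO1 for the same pair of trees.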

The implementation uses rpart for fitting classification trees to each dataset.
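To illustrate how two fitted trees induce a joint partition, the sketch below fits one classification tree per dataset and crosses the leaf memberships. It is an assumption-laden illustration rather than the code used by NKT: the helper `leaf_id` is hypothetical and relies on `predict(..., type = "vector")` returning the `yval` entry of the terminal node an observation falls into, and the simulated data frames `d1` and `d2` are made up for the example.

```r
if (requireNamespace("rpart", quietly = TRUE)) {
  # Hypothetical helper: relabel every node by its row in the tree's frame,
  # so that the prediction identifies the leaf an observation lands in
  leaf_id <- function(fit, newdata) {
    fit$frame$yval <- seq_len(nrow(fit$frame))
    as.integer(predict(fit, newdata = newdata, type = "vector"))
  }

  set.seed(1)
  n <- 200
  d1 <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d1$y <- factor(d1$x1 > 0)
  d2 <- data.frame(x1 = rnorm(n, mean = 0.5), x2 = rnorm(n))
  d2$y <- factor(d2$x2 > 0)

  # One classification tree per dataset
  fit1 <- rpart::rpart(y ~ ., data = d1, method = "class")
  fit2 <- rpart::rpart(y ~ ., data = d2, method = "class")

  # A segment of the joint partition is a pair (leaf of tree 1, leaf of tree 2)
  both <- rbind(d1, d2)
  segment <- interaction(leaf_id(fit1, both), leaf_id(fit2, both), drop = TRUE)
  origin <- rep(c("X1", "X2"), each = n)

  # Proportion of each dataset falling into each segment
  p_hat <- prop.table(table(origin, segment), margin = 1)
  nto1_sketch <- sum(sqrt(p_hat["X1", ] * p_hat["X2", ]))
}
```

Crossing leaf labels only enumerates the segments that are actually occupied by at least one observation, which is sufficient here because empty segments contribute zero to every sum.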

best.rpart is used for hyperparameter tuning if tune = TRUE. The parameters are tuned using cross-validation and random search. The parameter minsplit is tuned over 2^(1:7), minbucket is tuned over 2^(0:6) and cp is tuned over 10^seq(-4, -1, by = 0.001).
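The size of this search space shows why random search is used instead of an exhaustive grid. The sketch below builds the three grids stated above and draws `n.eval = 100` candidate settings; drawing each parameter independently and uniformly is an assumption made for illustration and may differ from how the tuning routine samples internally.

```r
# Tuning grids as stated in the text
minsplit_grid  <- 2^(1:7)
minbucket_grid <- 2^(0:6)
cp_grid        <- 10^seq(-4, -1, by = 0.001)

# Number of parameter combinations in the full grid
n_combinations <- length(minsplit_grid) * length(minbucket_grid) * length(cp_grid)

# Random search: evaluate only n_eval randomly drawn settings
# (independent uniform draws are an assumption of this sketch)
set.seed(42)
n_eval <- 100
candidates <- data.frame(
  minsplit  = sample(minsplit_grid,  n_eval, replace = TRUE),
  minbucket = sample(minbucket_grid, n_eval, replace = TRUE),
  cp        = sample(cp_grid,        n_eval, replace = TRUE)
)

n_combinations
head(candidates)
```

With 7 x 7 x 3001 combinations in the full grid, evaluating 100 of them by k-fold cross-validation keeps the runtime manageable, at the cost of results that depend on the random seed.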

High values of each measure indicate similarity of the datasets. The measures are bounded between 0 and 1.

Value

An object of class htest with the following components:

`statistic`	Observed value of the test statistic
`p.value`	`NA` (no p-value is calculated)
`method`	Description of the test
`data.name`	The dataset names
`alternative`	The alternative hypothesis

Applicability

Target variable?	Numeric?	Categorical?	K-sample?
Yes	Yes	No	No

References

Ntoutsi, I., Kalousis, A. and Theodoridis, Y. (2008). A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees. Proceedings of the 2008 SIAM International Conference on Data Mining, 810-821. doi:10.1137/1.9781611972788.7

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149

Examples

set.seed(1234)
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
if(requireNamespace("rpart", quietly = TRUE)) {
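  # Calculate all three similarity measures (without tuning the trees due to runtime).
  # print() is needed because only the last value inside a braced block is auto-printed.
  print(NKT(X1, X2, "y", version = 1, tune = FALSE))
  print(NKT(X1, X2, "y", version = 2, tune = FALSE))
  print(NKT(X1, X2, "y", version = 3, tune = FALSE))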
}

DataSimilarity documentation built on June 16, 2025, 5:08 p.m.