NKT: Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et...


NKT    R Documentation

Decision-Tree Based Measure of Dataset Similarity (Ntoutsi et al., 2008)

Description

Calculates Decision-Tree Based Measure of Dataset Similarity by Ntoutsi et al. (2008).

Usage

NKT(X1, X2, target1 = "y", target2 = "y", method = 1, tune = TRUE, k = 5, 
      n.eval = 100, seed = 42, ...)

Arguments

X1

First dataset as matrix or data.frame

X2

Second dataset as matrix or data.frame

target1

Character specifying the column name of the class variable in the first dataset (default: "y")

target2

Character specifying the column name of the class variable in the second dataset (default: "y")

method

Number in 1:3 specifying the method for calculating dataset similarity (default: 1). See Details.

tune

Should the decision tree parameters be tuned? (default: TRUE)

k

Number of folds used in cross-validation for parameter tuning (default: 5). Ignored if tune = FALSE.

n.eval

Number of evaluations for random search used for parameter tuning (default: 100). Ignored if tune = FALSE.

seed

Random seed (default: 42)

...

Further arguments passed to rpart. Ignored if tune = TRUE.

Details

Ntoutsi et al. (2008) define three measures of dataset similarity based on the intersection of the partitions of the sample space induced by two decision trees, one fit to each dataset. Denote by \hat{P}_X(\mathcal{X}) the proportions of observations in a dataset X that fall into each segment of the joint partition, by \hat{P}_X(Y, \mathcal{X}) the proportions of observations in X that fall into each segment of the joint partition and belong to each class, and by \hat{P}_X(Y|\mathcal{X}) the class proportions within each segment, arranged with one row per segment. The function

s(p, q) = \sum_{i} \sqrt{p_i \cdot q_i}

defines the similarity index for two vectors p and q. Then the measures of similarity are defined by

\text{NTO1} = s(\hat{P}_{X_1}(\mathcal{X}), \hat{P}_{X_2}(\mathcal{X})),

\text{NTO2} = s(\hat{P}_{X_1}(Y, \mathcal{X}), \hat{P}_{X_2}(Y, \mathcal{X})),

\text{NTO3} = S(Y|\mathcal{X})^{T} \hat{P}_{X_1 \cup X_2}(\mathcal{X}),

where S(Y|\mathcal{X}) is the similarity vector with elements

S(Y|\mathcal{X})_i = s(\hat{P}_{X_1}(Y|\mathcal{X})_{i \bullet}, \hat{P}_{X_2}(Y|\mathcal{X})_{i \bullet})

and index i \bullet denotes the i-th row.
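The definitions above can be sketched in base R. This is a minimal illustration and not the package's internal code: the function names sim_index and nto_measures, the count matrices n1 and n2 (rows are segments of the joint partition, columns are classes), and the choice to set the class distribution of an empty segment to zero are all assumptions made for this example.

```r
# Similarity index s(p, q) = sum_i sqrt(p_i * q_i)
sim_index <- function(p, q) sum(sqrt(p * q))

# Hypothetical helper: all three measures from two segment-by-class count matrices
nto_measures <- function(n1, n2) {
  # P_X(partition): share of each dataset's observations per segment
  px1 <- rowSums(n1) / sum(n1)
  px2 <- rowSums(n2) / sum(n2)
  # P_X(Y, partition): share of observations per (segment, class) cell
  pyx1 <- n1 / sum(n1)
  pyx2 <- n2 / sum(n2)
  # P_X(Y | partition): class distribution within each segment (row-wise);
  # assumption: rows of empty segments are set to 0
  cond <- function(n) {
    p <- n / rowSums(n)
    p[is.nan(p)] <- 0
    p
  }
  # Similarity vector S(Y | partition): one value per segment (row)
  s_vec <- rowSums(sqrt(cond(n1) * cond(n2)))
  # Segment proportions in the pooled data X1 and X2
  p_pool <- rowSums(n1 + n2) / sum(n1 + n2)
  c(NTO1 = sim_index(px1, px2),
    NTO2 = sim_index(pyx1, pyx2),
    NTO3 = sum(s_vec * p_pool))
}

# Made-up counts for three segments and two classes
n1 <- matrix(c(30, 10,
                5, 25,
               20, 10), ncol = 2, byrow = TRUE)
n2 <- matrix(c(10, 20,
               15, 15,
               25, 15), ncol = 2, byrow = TRUE)

res <- nto_measures(n1, n2)
print(round(res, 3))
```

Passing the same count matrix twice yields 1 for all three measures, which matches the interpretation that high values indicate similar datasets.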

The implementation uses rpart for fitting classification trees to each dataset.
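One way to obtain a joint partition from two rpart trees is sketched below. It is not taken from the package source: the toy data, the helper leaf_id and the trick of overwriting the yval column of the fitted tree's frame with node numbers (so that predict returns the leaf each row falls into) are assumptions chosen for illustration. The code only runs if rpart is installed.

```r
# Sketch only: joint partition induced by two classification trees
if (requireNamespace("rpart", quietly = TRUE)) {
  set.seed(1)
  # Illustrative toy data with a binary class variable y
  make_data <- function(n, shift) {
    x1 <- rnorm(n, mean = shift)
    x2 <- rnorm(n, mean = shift)
    y <- factor(rbinom(n, 1, plogis(x1 - x2)), levels = c(0, 1))
    data.frame(x1 = x1, x2 = x2, y = y)
  }
  d1 <- make_data(200, shift = 0)
  d2 <- make_data(200, shift = 0.5)

  # One classification tree per dataset
  fit1 <- rpart::rpart(y ~ ., data = d1, method = "class")
  fit2 <- rpart::rpart(y ~ ., data = d2, method = "class")

  # Hypothetical helper: node number of the leaf that each row of newdata reaches
  leaf_id <- function(fit, newdata) {
    fit$frame$yval <- as.numeric(rownames(fit$frame))
    unname(predict(fit, newdata = newdata, type = "vector"))
  }

  # A segment of the joint partition is a pair (leaf in tree 1, leaf in tree 2)
  pooled <- rbind(d1, d2)
  origin <- rep(1:2, times = c(nrow(d1), nrow(d2)))
  segment <- interaction(leaf_id(fit1, pooled), leaf_id(fit2, pooled), drop = TRUE)

  # Column j holds the proportion of dataset j's observations in each segment
  counts <- table(segment, origin)
  p_hat <- prop.table(counts, margin = 2)
  print(round(p_hat, 3))
}
```

Only segments that contain at least one observation of the pooled data appear here, because drop = TRUE removes empty leaf combinations.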

best.rpart is used for hyperparameter tuning if tune = TRUE. The parameters are tuned using cross-validation and random search. The parameter minsplit is tuned over 2^(1:7), minbucket is tuned over 2^(0:6) and cp is tuned over 10^seq(-4, -1, by = 0.001).
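To give a sense of the size of this search space, the following base R sketch builds the three grids and draws n.eval random configurations from them. It only illustrates random search over the stated grids; how best.rpart draws and evaluates its candidates internally is not shown here, and the names grid and candidates are made up for this example.

```r
set.seed(42)

# Parameter grids as stated in the text
grid <- list(
  minsplit  = 2^(1:7),
  minbucket = 2^(0:6),
  cp        = 10^seq(-4, -1, by = 0.001)
)

# Number of distinct configurations in the full grid
n_configs <- prod(lengths(grid))

# Random search: draw n.eval configurations, each parameter sampled independently
n.eval <- 100
candidates <- data.frame(
  minsplit  = sample(grid$minsplit, n.eval, replace = TRUE),
  minbucket = sample(grid$minbucket, n.eval, replace = TRUE),
  cp        = sample(grid$cp, n.eval, replace = TRUE)
)

print(n_configs)
print(head(candidates))
```

The full grid has 7 x 7 x 3001 = 147049 combinations, so the default n.eval = 100 evaluates only a small random subset of it.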

High values of each measure indicate similarity of the datasets. The measures are bounded between 0 and 1.

Value

An object of class htest with the following components:

statistic

Observed value of the test statistic

p.value

NA (no p value calculated)

method

Description of the test

data.name

The dataset names

alternative

The alternative hypothesis
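The components can be read from the returned object like list elements. The sketch below reuses the data generation from the Examples section and assumes that both the DataSimilarity and rpart packages are installed; the set.seed call and the variable name res are additions for this illustration.

```r
if (requireNamespace("DataSimilarity", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  set.seed(1)
  # Same data-generating steps as in the Examples section
  X1 <- matrix(rnorm(1000), ncol = 10)
  X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
  y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
  y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
  X1 <- data.frame(X = X1, y = y1)
  X2 <- data.frame(X = X2, y = y2)

  res <- DataSimilarity::NKT(X1, X2, "y", method = 1, tune = FALSE)

  # Observed similarity value (between 0 and 1)
  print(res$statistic)
  # No p-value is calculated, so this component is NA
  print(res$p.value)
  # Text description of the measure that was computed
  print(res$method)
}
```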

Applicability

Target variable?   Numeric?   Categorical?   K-sample?
Yes                Yes        No             No

References

Ntoutsi, I., Kalousis, A. and Theodoridis, Y. (2008). A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees. Proceedings of the 2008 SIAM International Conference on Data Mining, 810-821. doi:10.1137/1.9781611972788.7

Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statistics Surveys, 18, 163-298. doi:10.1214/24-SS149

See Also

GGRL

Examples

# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
if(requireNamespace("rpart", quietly = TRUE)) {
  # Calculate all three similarity measures (without tuning the trees due to runtime)
  NKT(X1, X2, "y", method = 1, tune = FALSE)
  NKT(X1, X2, "y", method = 2, tune = FALSE)
  NKT(X1, X2, "y", method = 3, tune = FALSE)
}

DataSimilarity documentation built on April 3, 2025, 9:39 p.m.