NKT | R Documentation |
Calculates the decision-tree-based measures of dataset similarity by Ntoutsi et al. (2008).
NKT(X1, X2, target1 = "y", target2 = "y", method = 1, tune = TRUE, k = 5,
n.eval = 100, seed = 42, ...)
X1 |
First dataset as matrix or data.frame |
X2 |
Second dataset as matrix or data.frame |
target1 |
Character specifying the column name of the class variable in the first dataset (default: "y") |
target2 |
Character specifying the column name of the class variable in the second dataset (default: "y") |
method |
Number in |
tune |
Should the decision tree parameters be tuned? (default: TRUE) |
k |
Number of folds used in cross-validation for parameter tuning (default: 5). Ignored if tune = FALSE |
n.eval |
Number of evaluations for random search used for parameter tuning (default: 100). Ignored if tune = FALSE |
seed |
Random seed (default: 42) |
... |
Further arguments passed to |
Ntoutsi et al. (2008) define three measures of dataset similarity based on the intersection of the partitions of the sample space defined by two decision trees, one fit to each dataset. Denote by \hat{P}_X(\mathcal{X}) the proportion of observations in a dataset X that fall into each segment of the joint partition and by \hat{P}_X(Y, \mathcal{X}) the proportion of observations in a dataset X that fall into each segment of the joint partition and belong to each class.
s(p, q) = \sum_{i} \sqrt{p_i \cdot q_i}
defines the similarity index for two vectors p and q. Then the measures of similarity are defined by
\text{NTO1} = s(\hat{P}_{X_1}(\mathcal{X}), \hat{P}_{X_2}(\mathcal{X})),
\text{NTO2} = s(\hat{P}_{X_1}(Y, \mathcal{X}), \hat{P}_{X_2}(Y, \mathcal{X})),
\text{NTO3} = S(Y|\mathcal{X})^{T} \hat{P}_{X_1 \cup X_2}(\mathcal{X}),
where S(Y|\mathcal{X}) is the similarity vector with elements
S(Y|\mathcal{X})_i = s(\hat{P}_{X_1}(Y|\mathcal{X})_{i \bullet}, \hat{P}_{X_2}(Y|\mathcal{X})_{i \bullet}),
\hat{P}_X(Y|\mathcal{X}) denotes the class proportions within each segment of the joint partition, and the index i \bullet denotes the i-th row.
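The three formulas above can be traced by hand on a small joint partition. The following is a minimal sketch in base R, not the package's internal code: the helper name `sim_index`, the count matrices `n1` and `n2`, and their values are made up for illustration, and the sketch assumes every segment contains at least one observation from each dataset.

```r
# Similarity index s(p, q) = sum_i sqrt(p_i * q_i) (hypothetical helper name)
sim_index <- function(p, q) sum(sqrt(p * q))

# Made-up class counts per segment of the joint partition
# (rows: segments, columns: classes), one matrix per dataset
n1 <- matrix(c(30, 10,
                5, 25,
               20, 10), ncol = 2, byrow = TRUE)
n2 <- matrix(c(20, 20,
               10, 30,
               15,  5), ncol = 2, byrow = TRUE)

# Proportion of observations per segment
p1_seg <- rowSums(n1) / sum(n1)
p2_seg <- rowSums(n2) / sum(n2)

# Proportion of observations per segment and class
p1_joint <- n1 / sum(n1)
p2_joint <- n2 / sum(n2)

# Class proportions within each segment (each row sums to 1)
p1_cond <- n1 / rowSums(n1)
p2_cond <- n2 / rowSums(n2)

# Segment proportions in the pooled data
p_pool <- rowSums(n1 + n2) / sum(n1 + n2)

nto1 <- sim_index(p1_seg, p2_seg)
nto2 <- sim_index(p1_joint, p2_joint)

# Row-wise similarity of the conditional class distributions,
# weighted by the pooled segment proportions
s_cond <- sapply(seq_len(nrow(n1)),
                 function(i) sim_index(p1_cond[i, ], p2_cond[i, ]))
nto3 <- sum(s_cond * p_pool)

c(NTO1 = nto1, NTO2 = nto2, NTO3 = nto3)
```

In practice the count matrices would come from overlaying the leaves of the two fitted trees; here they are supplied directly so that only the arithmetic is shown.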
The implementation uses rpart for fitting classification trees to each dataset. best.rpart is used for hyperparameter tuning if tune = TRUE. The parameters are tuned using cross-validation and random search. The parameter minsplit is tuned over 2^(1:7), minbucket is tuned over 2^(0:6) and cp is tuned over 10^seq(-4, -1, by = 0.001).
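To get a feel for the search space described above, the candidate values can be written out and sampled in base R. This is only an illustrative sketch: how the package draws its random configurations is not stated here, so independent uniform sampling from each set of values is an assumption, and the object names are hypothetical.

```r
set.seed(42)   # mirrors the default seed argument
n.eval <- 100  # mirrors the default number of random-search evaluations

# Candidate values as listed in the text above
minsplit_grid  <- 2^(1:7)
minbucket_grid <- 2^(0:6)
cp_grid        <- 10^seq(-4, -1, by = 0.001)

# Assumption: each configuration draws every parameter independently
# and uniformly from its candidate values
candidates <- data.frame(
  minsplit  = sample(minsplit_grid,  n.eval, replace = TRUE),
  minbucket = sample(minbucket_grid, n.eval, replace = TRUE),
  cp        = sample(cp_grid,        n.eval, replace = TRUE)
)

head(candidates)
```

Each sampled row would then be scored by k-fold cross-validation, and the best-scoring configuration would be used to fit the final tree.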
High values of each measure indicate similarity of the datasets. The measures are bounded between 0 and 1.
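The two extremes of the similarity index can be checked directly. The sketch below uses a hypothetical helper `sim_index` and made-up proportion vectors: identical vectors give the upper bound, and vectors that place their mass on different segments give the lower bound.

```r
# Similarity index s(p, q) = sum_i sqrt(p_i * q_i) (hypothetical helper name)
sim_index <- function(p, q) sum(sqrt(p * q))

# Made-up segment proportions over four segments
p <- c(0.5, 0.5, 0.0, 0.0)
q <- c(0.0, 0.0, 0.3, 0.7)

s_identical <- sim_index(p, p)  # same distribution in both datasets
s_disjoint  <- sim_index(p, q)  # no segment populated by both datasets

c(identical = s_identical, disjoint = s_disjoint)
```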
An object of class htest with the following components:
statistic |
Observed value of the test statistic |
p.value |
NA (no p value calculated) |
method |
Description of the test |
data.name |
The dataset names |
alternative |
The alternative hypothesis |
Target variable? | Numeric? | Categorical? | K-sample? |
Yes | Yes | No | No |
Ntoutsi, I., Kalousis, A. and Theodoridis, Y. (2008). A general framework for estimating similarity of datasets and decision trees: exploring semantic similarity of decision trees. Proceedings of the 2008 SIAM International Conference on Data Mining, 810-821. doi:10.1137/1.9781611972788.7
Stolte, M., Kappenberg, F., Rahnenführer, J. and Bommert, A. (2024). Methods for quantifying dataset similarity: a review, taxonomy and comparison. Statist. Surv. 18, 163-298. doi:10.1214/24-SS149
GGRL
# Draw some data
X1 <- matrix(rnorm(1000), ncol = 10)
X2 <- matrix(rnorm(1000, mean = 0.5), ncol = 10)
y1 <- rbinom(100, 1, 1 / (1 + exp(1 - X1 %*% rep(0.5, 10))))
y2 <- rbinom(100, 1, 1 / (1 + exp(1 - X2 %*% rep(0.7, 10))))
X1 <- data.frame(X = X1, y = y1)
X2 <- data.frame(X = X2, y = y2)
if(requireNamespace("rpart", quietly = TRUE)) {
# Calculate all three similarity measures (without tuning the trees due to runtime)
NKT(X1, X2, "y", method = 1, tune = FALSE)
NKT(X1, X2, "y", method = 2, tune = FALSE)
NKT(X1, X2, "y", method = 3, tune = FALSE)
}