dkss | R Documentation |
This function calculates the pairwise distances between mixed-type observations
consisting of continuous (numeric), nominal (factor), and ordinal (ordered)
variables using the method described in Ghashti (2024).
This kernel metric learning methodology calculates a kernel sum similarity
function, with a variety of options for kernel functions associated with each
variable type, and returns a distance matrix that can be used in any
distance-based algorithm.
dkss(df, bw = "mscv", cFUN = "c_gaussian", uFUN = "u_aitken",
oFUN = "o_wangvanryzin", stan = TRUE, verbose = FALSE)
df |
a |
bw |
numeric bandwidth vector of length |
cFUN |
character value specifying the continuous kernel function. Options include
|
uFUN |
character value specifying the nominal kernel function for unordered
factors. Options include |
oFUN |
character value specifying the ordinal kernel function for ordered factors.
Options include |
stan |
a logical value which specifies whether to scale the resulting distance
matrix between 0 and 1 using min-max normalization. If set to |
verbose |
a logical value which specifies whether to print procedural steps to the
console. If set to |
dkss implements the distance using kernel summation similarity (DKSS)
as described by Ghashti (2024). This approach uses a kernel for each
continuous, nominal and ordinal variable; the kernels are then summed over all
variable types to return the pairwise distance between mixed-type observations.
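The idea of summing per-variable kernels can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names (k_cont, k_nom, k_ord, sim), the kernel forms, and the conversion d(i, j) = s(i, i) + s(j, j) - 2 s(i, j) from similarity to distance are all assumptions made for the example; see Ghashti (2024) for the actual definitions.

```r
set.seed(123)
lev <- c("Low", "Medium", "High")
toy <- data.frame(
  x1 = rnorm(8, 10, 3),
  x2 = factor(sample(c("A", "B", "C"), 8, TRUE)),
  x3 = ordered(sample(lev, 8, TRUE), levels = lev)
)
bw_toy <- c(1.5, 0.4, 0.3)  # one illustrative bandwidth per column

# Hypothetical kernels, for illustration only (assumed textbook-style forms)
k_cont <- function(a, b, h) dnorm((a - b) / h) / h           # Gaussian
k_nom  <- function(a, b, lambda) if (a == b) 1 else lambda   # Aitken-type
k_ord  <- function(a, b, lambda) lambda^abs(a - b)           # decays with level gap

# Kernel sum similarity between rows i and j of a mixed-type data frame
sim <- function(df, i, j, bw) {
  total <- 0
  for (v in seq_along(df)) {
    x <- df[[v]]
    # test is.ordered first: ordered factors are also factors
    k <- if (is.ordered(x)) {
      k_ord(as.integer(x[i]), as.integer(x[j]), bw[v])
    } else if (is.factor(x)) {
      k_nom(as.character(x[i]), as.character(x[j]), bw[v])
    } else {
      k_cont(x[i], x[j], bw[v])
    }
    total <- total + k
  }
  total
}

# Assumed similarity-to-distance conversion: s(i,i) + s(j,j) - 2 * s(i,j)
n <- nrow(toy)
D_toy <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D_toy[i, j] <- sim(toy, i, i, bw_toy) + sim(toy, j, j, bw_toy) -
      2 * sim(toy, i, j, bw_toy)
  }
}
round(D_toy[1:3, 1:3], 3)
```

Under these assumptions the matrix is symmetric with a zero diagonal, and two rows are closer the more of their variables agree or lie near each other relative to the bandwidths.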
There are several kernels to select from. The continuous kernel functions may be found in Cameron and Trivedi (2005), Härdle et al. (2004) or Silverman (1986). Nominal kernels use a variation on Aitchison and Aitken's (1976) kernel, while ordinal kernels use a variation of the Wang and van Ryzin (1981) kernel. Both nominal and ordinal kernel functions can be found in Li and Racine (2007), Li and Racine (2003), Ouyang et al. (2006), and Titterington and Bowman (1985).
Each kernel requires a bandwidth specification. This can be either a
user-defined numeric vector of length p (one bandwidth per variable in df)
obtained from an alternative bandwidth selection methodology, or one of two
built-in bandwidth selection methods. The mscv bandwidth selection is based on
maximum similarity cross-validation by Ghashti and Thompson (2024), invoked by
the function mscv.dkss. The np bandwidth selection follows the maximum
likelihood cross-validation method described by Li and Racine (2007) and Li
and Racine (2003) for kernel density estimation of mixed-type data.
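To give a feel for what maximum likelihood cross-validation does, the following base R sketch selects a bandwidth for a single continuous variable by maximising the leave-one-out log-likelihood of a Gaussian kernel density estimate. It is a univariate illustration only: the function name loo_loglik and the search interval are assumptions, and neither the np routine nor mscv.dkss is reproduced here.

```r
set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)

# Leave-one-out log-likelihood of a Gaussian kernel density estimate
# at bandwidth h (hypothetical helper, for illustration only)
loo_loglik <- function(h, x) {
  n <- length(x)
  K <- dnorm(outer(x, x, "-") / h) / h  # n x n matrix of kernel weights
  diag(K) <- 0                          # exclude each point's own contribution
  f_loo <- rowSums(K) / (n - 1)         # leave-one-out density at each x[i]
  sum(log(f_loo))
}

# One-dimensional search over an assumed interval of candidate bandwidths
opt <- optimize(loo_loglik, interval = c(0.2, 10), x = x, maximum = TRUE)
h_cv <- opt$maximum
h_cv
```

For mixed-type data the same criterion is maximised jointly over all p bandwidths, which is why a numerical optimiser rather than a closed-form rule is needed.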
Data contained in the data frame df may constitute any combination of
continuous, nominal, or ordinal data. Nominal variables are to be specified in
df using factor, and ordinal variables using ordered. Variables can be in any
order and their types will be detected automatically. User-supplied bandwidth
vectors bw must be specified in the same order as the variables in the data
frame df, to ensure that each bandwidth is matched to the correct variable by
the routine.
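Before passing a user-supplied bandwidth vector, it can help to check the column types and bandwidth ranges by hand. The sketch below uses only base R; the object names (df_demo, var_type, bw_demo) are hypothetical, the names on bw_demo exist only for this check, and the ranges assume the default nominal kernel (continuous > 0, nominal and ordinal in [0, 1]) as noted in the example comments for this function.

```r
set.seed(42)
lev <- c("Low", "Medium", "High")
df_demo <- data.frame(
  x1 = runif(10),
  x2 = factor(sample(c("A", "B", "C"), 10, TRUE)),
  x3 = ordered(sample(lev, 10, TRUE), levels = lev)
)

# Classify each column; is.ordered must be tested before is.factor
# because ordered factors are also factors
var_type <- vapply(df_demo, function(x) {
  if (is.ordered(x)) "ordinal"
  else if (is.factor(x)) "nominal"
  else if (is.numeric(x)) "continuous"
  else "unsupported"
}, character(1))

# Bandwidths listed in the same order as names(df_demo)
bw_demo <- c(x1 = 0.2, x2 = 0.5, x3 = 0.3)
stopifnot(identical(names(bw_demo), names(df_demo)))

# Range check per variable type
ok <- ifelse(var_type == "continuous",
             bw_demo > 0,
             bw_demo >= 0 & bw_demo <= 1)
data.frame(type = var_type, bw = bw_demo, in_range = ok)
```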
dkss returns a list object, with the following components:
distances |
an |
bandwidths |
a |
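As a rough illustration of how a returned distance matrix might be used downstream, the sketch below applies min-max normalization (the scaling that stan = TRUE is described as performing) and then runs hierarchical clustering. The matrix D_raw is a stand-in built with dist() so that the example runs without the package; in practice one would substitute the distances component of the returned list.

```r
set.seed(7)

# Stand-in for a matrix of pairwise distances between 20 observations
D_raw <- as.matrix(dist(matrix(rnorm(40), ncol = 2)))

# Min-max normalization of the distances to [0, 1]
D_scaled <- (D_raw - min(D_raw)) / (max(D_raw) - min(D_raw))

# Any distance-based algorithm can take it from here, e.g. hierarchical
# clustering on the corresponding dist object
hc <- hclust(as.dist(D_scaled), method = "average")
groups <- cutree(hc, k = 3)
table(groups)
```

The choice of average linkage and three clusters is arbitrary here; the point is only that as.dist() bridges a square distance matrix to functions that expect a dist object.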
John R. J. Thompson john.thompson@ubc.ca, Jesse S. Ghashti jesse.ghashti@ubc.ca
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method”, Biometrika, 63, 413-420.
Cameron, A. and P. Trivedi (2005), “Microeconometrics: Methods and Applications”, Cambridge University Press.
Ghashti, J.S. (2024), “Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-type Data (T)”, University of British Columbia.
Härdle, W., and M. Müller and S. Sperlich and A. Werwatz (2004), “Nonparametric and Semiparametric Models”, (Vol. 1). Berlin: Springer.
Li, Q. and J.S. Racine (2007), “Nonparametric Econometrics: Theory and Practice”, Princeton University Press.
Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data”, Journal of Multivariate Analysis, 86, 266-292.
Ouyang, D. and Q. Li and J.S. Racine (2006), “Cross-validation and the estimation of probability distributions with categorical data”, Journal of Nonparametric Statistics, 18, 69-100.
Silverman, B.W. (1986), “Density Estimation”, London: Chapman and Hall.
Titterington, D.M. and A.W. Bowman (1985), “A comparative study of smoothing procedures for ordered categorical data”, Journal of Statistical Computation and Simulation, 21(3-4), 291-312.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions”, Biometrika, 68, 301-309.
mscv.dkps, dkps, mscv.dkss
# example data frame with mixed numeric, nominal, and ordinal data
set.seed(1)  # make the simulated data reproducible
levels <- c("Low", "Medium", "High")
df <- data.frame(
x1 = runif(100, 0, 100),
x2 = factor(sample(c("A", "B", "C"), 100, TRUE)),
x3 = factor(sample(c("A", "B", "C"), 100, TRUE)),
x4 = rnorm(100, 10, 3),
x5 = ordered(sample(c("Low", "Medium", "High"), 100, TRUE), levels = levels),
x6 = ordered(sample(c("Low", "Medium", "High"), 100, TRUE), levels = levels))
# minimal implementation requires just the data frame; it defaults to the
# mscv bandwidth selection technique and the default kernel functions
d1 <- dkss(df = df)
# d1$bandwidths to see the bandwidths obtained by mscv
# d1$distances to see the distance matrix
# bandwidth selection using the np package, which has fewer continuous and
# ordinal kernels to choose from; the default kernel functions are recommended
d2 <- dkss(df = df, bw = "np")
# precomputed bandwidth example
# note that continuous variables require bandwidths > 0
# ordinal variables require bandwidths in [0,1]
# for nominal variables, u_aitken requires bandwidths in [0,1]
# and u_aitchisonaitken requires bandwidths in [0,(c-1)/c],
# where c is the number of unique values in the i-th column of df
# any bandwidth outside these ranges will result in a warning message
# bandwidths are given in column order: x1, x2, x3, x4, x5, x6
bw_vec <- c(1.0, 0.5, 0.5, 5.0, 0.3, 0.3)
d3 <- dkss(df = df, bw = bw_vec)
# user-specified kernel functions example
d4 <- dkss(df = df, bw = "mscv", cFUN = "c_epanechnikov", uFUN = "u_aitken",
           oFUN = "o_habbema")