# KdCov: Kernel Distance Covariance Statistics In GiniDistance: A New Gini Correlation Between Quantitative and Qualitative Variables

## Description

Computes kernel distance covariance statistics for quantitative X and categorical Y, where sigma is the kernel standard deviation, and returns the resulting measure of dependence.

## Usage

    KdCov(x, y, sigma)

## Arguments

| Argument | Description |
| --- | --- |
| `x` | data |
| `y` | label of data or univariate response variable |
| `sigma` | kernel standard deviation |

## Details

KdCov computes the kernel distance covariance statistic. The sample size (number of rows) of the data must equal the length of the label vector, and the samples must not contain missing values. The arguments x and y are treated as data and labels, respectively.

Distance covariance was introduced in (Szekely07) as a dependence measure between random variables $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$. If X and Y are embedded into reproducing kernel Hilbert spaces (RKHSs) induced by kernels $\kappa_X$ and $\kappa_Y$, respectively, the generalized distance covariance of X and Y is (Sejdinovic13):

$$
\begin{aligned}
\mathrm{dCov}_{\kappa_X,\kappa_Y}(X,Y) ={}& \mathbb{E}\, d_{\kappa_X}(X,X^{\prime})\, d_{\kappa_Y}(Y,Y^{\prime}) + \mathbb{E}\, d_{\kappa_X}(X,X^{\prime})\, \mathbb{E}\, d_{\kappa_Y}(Y,Y^{\prime}) \\
& - 2\,\mathbb{E}\left[\mathbb{E}_{X^{\prime}} d_{\kappa_X}(X,X^{\prime})\, \mathbb{E}_{Y^{\prime}} d_{\kappa_Y}(Y,Y^{\prime})\right].
\end{aligned}
$$
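As a rough illustration of the formula above, replacing each expectation by a sample average gives a plug-in estimate. The sketch below assumes the pairwise distances have already been computed and stored as two n-by-n matrices. The function name `gen_dcov_sketch` and its arguments `A` and `B` are hypothetical and are not part of the GiniDistance package, which may compute its statistics differently.

```r
# Hypothetical helper (not from GiniDistance): plug-in estimate of the
# generalized distance covariance from two pairwise distance matrices.
#   A[i, j] = d_X(x_i, x_j),  B[i, j] = d_Y(y_i, y_j)
gen_dcov_sketch <- function(A, B) {
  A <- as.matrix(A)
  B <- as.matrix(B)
  stopifnot(nrow(A) == ncol(A), identical(dim(A), dim(B)))

  term1 <- mean(A * B)                      # E d_X(X, X') d_Y(Y, Y')
  term2 <- mean(A) * mean(B)                # E d_X(X, X') * E d_Y(Y, Y')
  term3 <- mean(rowMeans(A) * rowMeans(B))  # E[ E_X' d_X(X, X') * E_Y' d_Y(Y, Y') ]

  term1 + term2 - 2 * term3
}

# Toy usage: a variable compared with itself, using absolute differences
A <- as.matrix(dist(1:5))
gen_dcov_sketch(A, A)
```

Working from distance matrices keeps the sketch independent of any particular kernel: the same function applies whichever distances are supplied for X and Y.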

When Y is categorical, one may embed it using a set difference kernel $\kappa_Y$,

$$
\kappa_Y(y,y^{\prime}) =
\begin{cases}
\frac{1}{2} & \text{if } y = y^{\prime},\\
0 & \text{otherwise.}
\end{cases}
$$

This is equivalent to embedding Y as a simplex with edges of unit length (Lyons13), i.e., the k-th category $L_k$ is represented by a K-dimensional vector of all zeros except its k-th dimension, which has the value $\frac{\sqrt{2}}{2}$. The distance induced by $\kappa_Y$ is called the set distance, i.e., $d_{\kappa_Y}(y,y^{\prime})=0$ if $y=y^{\prime}$ and 1 otherwise. Using the set distance, we have the following result on the generalized distance covariance between a numerical and a categorical random variable.
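The equivalence between the simplex embedding and the set distance can be checked numerically. The short sketch below uses base R only; the variable names (`K`, `E`, `Kmat`, `D`) are chosen for illustration.

```r
# Embed K categories as K-dimensional vectors: all zeros except
# sqrt(2)/2 in the k-th coordinate (one row per category).
K <- 3
E <- diag(sqrt(2) / 2, K)

# Inner products of the embedded points reproduce the set difference
# kernel: 1/2 on the diagonal (y == y'), 0 elsewhere.
Kmat <- E %*% t(E)

# Euclidean distances between embedded categories give the set
# distance: 0 for identical categories, 1 for different ones.
D <- unname(as.matrix(dist(E)))

print(Kmat)
print(D)
```

Every pair of distinct categories sits at distance 1, which is why the embedding forms a simplex with unit-length edges.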

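$$
\mathrm{dCov}_{\kappa_X,\kappa_Y}(X,Y) := \mathrm{dCov}_{\kappa_X}(X,Y) = \sum_{k=1}^{K} p_k^2 \left[2\,\mathbb{E}\, d_{\kappa_X}(X_k,X) - \mathbb{E}\, d_{\kappa_X}(X_k,{X_k}^{\prime}) - \mathbb{E}\, d_{\kappa_X}(X,X^{\prime}) \right].
$$

Here $p_k = P(Y = L_k)$ is the probability of the k-th category, $X_k$ and ${X_k}^{\prime}$ are independent copies of X conditioned on $Y = L_k$, and $X^{\prime}$ is an independent copy of X.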
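The class-wise formula above can be turned into a plug-in estimate by replacing each expectation with a sample average and each class probability with the class proportion. The sketch below is a base R illustration only. It assumes a Gaussian-kernel-induced distance of the form sqrt(1 - exp(-||x - x'||^2 / sigma^2)), which is an assumption made for this example; the function name `kdcov_sketch` is hypothetical, and the kernel form, scaling, and estimator actually used by `KdCov` may differ.

```r
# Hypothetical helper (not from GiniDistance): plug-in estimate of
#   sum_k p_k^2 * [ 2 E d(X_k, X) - E d(X_k, X_k') - E d(X, X') ].
# ASSUMPTION: d(x, x') = sqrt(1 - exp(-||x - x'||^2 / sigma^2)).
kdcov_sketch <- function(x, y, sigma) {
  x <- as.matrix(x)
  y <- droplevels(as.factor(y))
  # Same requirements as stated in Details: matching sizes, no NAs
  stopifnot(nrow(x) == length(y), !anyNA(x), !anyNA(y), sigma > 0)

  # Pairwise kernel-induced distances between all rows of x
  sq <- as.matrix(dist(x))^2
  d  <- sqrt(1 - exp(-sq / sigma^2))

  e_xx  <- mean(d)  # sample version of E d(X, X')
  total <- 0
  for (k in levels(y)) {
    idx  <- which(y == k)
    p_k  <- length(idx) / length(y)          # class proportion
    e_kx <- mean(d[idx, , drop = FALSE])     # E d(X_k, X)
    e_kk <- mean(d[idx, idx, drop = FALSE])  # E d(X_k, X_k')
    total <- total + p_k^2 * (2 * e_kx - e_kk - e_xx)
  }
  total
}

# Toy usage on the iris data
kdcov_sketch(iris[, 1:4], iris[, 5], sigma = 1)
```

Under these assumptions the estimate is zero when all observations share one label and grows as the class-conditional samples separate from the pooled sample.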

## Value

KdCov returns the sample kernel distance covariance.

## References

Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of Distance-based and RKHS-based Statistics in Hypothesis Testing, The Annals of Statistics, 41 (5), 2263-2291.

Zhang, S., Dang, X., Nguyen, D. and Chen, Y. (2019). Estimating feature-label dependence using Gini distance statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence (submitted).

## See Also

KgCov, KgCor, dCov
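## Examples

    library(GiniDistance)

    x <- iris[, 1:4]           # four quantitative features
    y <- unclass(iris[, 5])    # species labels as integer codes
    KdCov(x, y, sigma = 1)     # kernel distance covariance with sigma = 1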