dataSimilarity: Evaluate statistical similarity of two data sets


View source: R/dataQuality.R

Description

Uses the mean, standard deviation, skewness, kurtosis, Hellinger distance, and Kolmogorov-Smirnov (KS) test to compare the similarity of two data sets.

Usage

dataSimilarity(data1, data2, dropDiscrete=NA)

Arguments

data1

A data.frame containing the reference data.

data2

A data.frame with the same number and names of columns as data1.

dropDiscrete

A vector of discrete attribute indices to skip in the comparison. Typically one might skip the class attribute, because its distribution was forced by the user.

Details

The function compares the data stored in data1 with data2 on a per-attribute basis by computing several statistics: mean, standard deviation, skewness, kurtosis, Hellinger distance, and the KS test.
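To make the per-attribute statistics concrete, here is a minimal base-R sketch of how they could be computed for one pair of numeric vectors. The helper names momentStats, hellinger and binnedHellinger are hypothetical and not part of semiArtificial. The moment-based skewness and kurtosis formulas and the equal-width binning used for the Hellinger distance are assumptions, so the package's own estimators may give different values.

```r
# Hypothetical helpers for illustration only; not the package's implementation.

# Mean, standard deviation, and moment-based skewness and kurtosis.
momentStats <- function(x) {
  z <- x - mean(x)
  m2 <- mean(z^2)                      # second central moment
  c(mean = mean(x),
    sd = sd(x),
    skewness = mean(z^3) / m2^1.5,     # assumed population (moment) estimator
    kurtosis = mean(z^4) / m2^2)       # non-excess kurtosis, about 3 for a normal
}

# Hellinger distance between two probability vectors over the same categories.
hellinger <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Hellinger distance for numeric samples after equal-width binning
# (the number of bins is an arbitrary choice made for this sketch).
binnedHellinger <- function(x, y, bins = 10) {
  breaks <- seq(min(x, y), max(x, y), length.out = bins + 1)
  p <- table(cut(x, breaks, include.lowest = TRUE)) / length(x)
  q <- table(cut(y, breaks, include.lowest = TRUE)) / length(y)
  hellinger(as.numeric(p), as.numeric(q))
}

set.seed(1)
x <- rnorm(200)                # stands in for an attribute of data1
y <- rnorm(200, mean = 0.5)    # stands in for the same attribute of data2

momentStats(x)
momentStats(y)
ks.test(x, y)$p.value          # two-sample Kolmogorov-Smirnov test from stats
binnedHellinger(x, y)
```

A Hellinger distance of 0 means the two binned distributions coincide and 1 means they do not overlap at all, while a small KS p-value suggests that the two samples come from different distributions.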

Value

The function returns a list of statistics computed on both data sets:

equalInstances

The number of instances in data2 equal to the instances in data1.

stats1num

A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data1.

stats2num

A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data2.

ksP

A vector with p-values of Kolmogorov-Smirnov two sample tests, performed on matching attributes from data1 and data2.

freq1

A list with value frequencies for discrete attributes in data1.

freq2

A list with value frequencies for discrete attributes in data2.

dfreq

A list with differences in frequencies of discrete attributes' values between data1 and data2.

dstatsNorm

A matrix with rows containing differences between statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data1 and data2 after normalization to [0,1].

hellingerDist

A vector with Hellinger distances between matching attributes from data1 and data2.
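The components listed above can be read from the returned list with the usual $ operator. The following is a sketch that assumes the semiArtificial package is installed. The variable name sim and the 0.05 threshold are illustrative choices, not part of the package.

```r
library(semiArtificial)

# Build a generator on half of iris and compare generated data with the rest.
set.seed(12345)
train <- sample(1:nrow(iris), size = nrow(iris) * 0.5)
irisGenerator <- rbfDataGen(Species ~ ., iris[train, ])
irisNew <- newdata(irisGenerator, size = 100)

sim <- dataSimilarity(iris[-train, ], irisNew)

names(sim)             # components of the returned list
sim$equalInstances     # instances of data2 equal to instances of data1
sim$ksP                # KS test p-values for matching attributes
sim$hellingerDist      # Hellinger distances for matching attributes
sim$dstatsNorm         # differences of statistics on normalized attributes

# Attributes whose KS p-value falls below an illustrative 0.05 threshold.
which(sim$ksP < 0.05)
```

Small values in dstatsNorm and hellingerDist, together with large p-values in ksP, suggest that the generated data resembles the reference data on those attributes.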

Author(s)

Marko Robnik-Sikonja

See Also

newdata.RBFgenerator.

Examples

# load the package that provides rbfDataGen, newdata and dataSimilarity
library(semiArtificial)

# use iris data set, split into training and testing data
set.seed(12345)
train <- sample(1:nrow(iris), size=nrow(iris)*0.5)
irisTrain <- iris[train,]
irisTest <- iris[-train,]

# create RBF generator
irisGenerator <- rbfDataGen(Species~., irisTrain)

# use the generator to create new data
irisNew <- newdata(irisGenerator, size=100)

# compare statistics of original and new data
dataSimilarity(irisTest, irisNew)

semiArtificial documentation built on Sept. 24, 2021, 1:07 a.m.