dataSimilarity: Evaluate statistical similarity of two data sets
In semiArtificial: Generator of Semi-Artificial Data



Uses the mean, standard deviation, skewness, kurtosis, Hellinger distance and the Kolmogorov-Smirnov (KS) test to evaluate the similarity of two data sets.

dataSimilarity(data1, data2, dropDiscrete=NA)

`data1`	A `data.frame` containing the reference data.
`data2`	A `data.frame` with the same number and names of columns as `data1`.
`dropDiscrete`	A vector of discrete attribute indices to skip in the comparison. Typically we might skip the class attribute, because its distribution was forced by the user.

The function compares the data stored in `data1` with `data2` on a per-attribute basis by computing several statistics: mean, standard deviation, skewness, kurtosis, Hellinger distance and the KS test.
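To make these statistics concrete, the following base R sketch computes them for a single numeric attribute taken from two samples. It is only an illustration under stated assumptions, not the package's internal code: the helpers `skewness`, `kurtosis` and `hellinger` are hypothetical, the moments use the population form, and the Hellinger distance is computed after binning both samples into 10 equal-width bins, a choice made here purely for illustration.

```r
# Illustrative sketch only: hypothetical helpers, not the code used by dataSimilarity.
x <- iris$Sepal.Length[1:75]     # stands in for one attribute of data1
y <- iris$Sepal.Length[76:150]   # stands in for the matching attribute of data2

# moment-based skewness and kurtosis (population form, an assumption)
skewness <- function(v) { m <- mean(v); mean((v - m)^3) / mean((v - m)^2)^1.5 }
kurtosis <- function(v) { m <- mean(v); mean((v - m)^4) / mean((v - m)^2)^2 }

# Hellinger distance between two samples, binned on a shared grid (assumed binning)
hellinger <- function(a, b, bins = 10) {
  breaks <- seq(min(a, b), max(a, b), length.out = bins + 1)
  p <- as.numeric(table(cut(a, breaks, include.lowest = TRUE))) / length(a)
  q <- as.numeric(table(cut(b, breaks, include.lowest = TRUE))) / length(b)
  sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))
}

# one row of summary statistics per sample
describe <- function(v) c(mean = mean(v), sd = sd(v),
                          skewness = skewness(v), kurtosis = kurtosis(v))
stats <- rbind(x = describe(x), y = describe(y))

# two-sample Kolmogorov-Smirnov test; ties in the data can trigger a warning
ksP <- suppressWarnings(ks.test(x, y)$p.value)
hd <- hellinger(x, y)

print(stats)
print(c(ksP = ksP, hellinger = hd))
```

A small KS p-value or a Hellinger distance close to 1 would suggest that the two samples differ on this attribute, while a distance of 0 means the binned distributions are identical.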

The method returns a list of statistics computed on both data sets:

`equalInstances`	The number of instances in `data2` equal to the instances in `data1`.
`stats1num`	A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of `data1`.
`stats2num`	A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of `data2`.
`ksP`	A vector with p-values of Kolmogorov-Smirnov two sample tests, performed on matching attributes from `data1` and `data2`.
`freq1`	A list with value frequencies for discrete attributes in `data1`.
`freq2`	A list with value frequencies for discrete attributes in `data2`.
`dfreq`	A list with differences in frequencies of discrete attributes' values between `data1` and `data2`.
`dstatsNorm`	A matrix with rows containing differences between statistics (mean, standard deviation, skewness, and kurtosis) computed on [0,1] normalized numeric attributes for `data1` and `data2`.
`hellingerDist`	A vector with Hellinger distances between matching attributes from `data1` and `data2`.
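For the discrete attributes, the idea behind `freq1`, `freq2` and `dfreq` can be sketched in base R as follows. This is a minimal illustration for one attribute, assuming relative frequencies; whether the package reports relative or absolute frequencies is not stated above, and the local variables simply reuse the component names for readability.

```r
# Illustrative sketch only: relative frequencies are an assumption.
set.seed(12345)
idx <- sample(seq_len(nrow(iris)), size = nrow(iris) / 2)
d1 <- iris[idx, ]    # stands in for data1
d2 <- iris[-idx, ]   # stands in for data2

# relative value frequencies of one discrete attribute in each data set
freq1 <- prop.table(table(d1$Species))
freq2 <- prop.table(table(d2$Species))

# per-value difference; both tables share the same factor levels, so they align
dfreq <- freq1 - freq2
print(round(dfreq, 3))
```

Differences near zero for every value indicate that the attribute is distributed similarly in the two data sets.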

Marko Robnik-Sikonja

newdata.RBFgenerator.

# use iris data set, split into training and testing data
set.seed(12345)
train <- sample(1:nrow(iris),size=nrow(iris)*0.5)
irisTrain <- iris[train,]
irisTest <- iris[-train,]

# create RBF generator
irisGenerator<- rbfDataGen(Species~.,irisTrain)

# use the generator to create new data
irisNew <- newdata(irisGenerator, size=100)

# compare statistics of original and new data
dataSimilarity(irisTest, irisNew)