splitData: Split Data

View source: R/genSamples.r

splitData    R Documentation

Split Data

Description

Splits a data set into two sets with desired proportions.

Usage

splitData(dataset, prop, keep.mprop = FALSE, num.non = 0, des.mprop = 0, 
use.pred = FALSE)

Arguments

dataset

Object of class RecLinkData. Data pairs to split.

prop

Real number between 0 and 1. Proportion of data pairs to form the training set.

keep.mprop

Logical. Whether the ratio of matches should be retained.

num.non

Positive integer. Desired number of non-matches in the training set.

des.mprop

Real number between 0 and 1. Desired proportion of matches to non-matches in the training set.

use.pred

Logical. Whether to apply match ratio to previous classification results instead of true matching status.
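
As an illustration of what retaining the match ratio (keep.mprop = TRUE) means, the following is a minimal base R sketch on a toy data frame. It is not the package's implementation: the data frame toy, the variable names and the use of round() are assumptions made for this example only.

```r
set.seed(1)

# Toy stand-in for a table of record pairs: 1000 pairs, 50 of them matches.
# (Hypothetical data, not produced by RecordLinkage.)
toy <- data.frame(id = 1:1000,
                  is_match = rep(c(1, 0), times = c(50, 950)))

prop <- 0.2  # share of pairs that goes into the training set

# Sample matches and non-matches separately so that both parts
# keep the original share of matches (stratified sampling).
idx_m <- which(toy$is_match == 1)
idx_n <- which(toy$is_match == 0)
train_idx <- c(sample(idx_m, round(prop * length(idx_m))),
               sample(idx_n, round(prop * length(idx_n))))

train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# Share of matches in each part; both equal the share in toy (0.05).
mean(train$is_match)
mean(valid$is_match)
```

Without stratification, a plain random sample of 20% of the rows would only match the original share of matches on average, not exactly.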

Value

A list of RecLinkData objects with the following components:

train

The sampled training data.

valid

All other record pairs.

The sampled data are stored in the pairs attributes of train and valid. If present, the attributes prediction and Wdata are split and the corresponding values saved. All other attributes are copied to both data sets.

If the desired number of matches or non-matches is higher than the number actually present in the data, the maximum possible number is chosen and a warning is issued.
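
The capping behaviour can be pictured with a short base R sketch. The helper cap_count is hypothetical and not part of RecordLinkage, and deriving the desired number of matches as round(num.non * des.mprop) is an assumption based on the example that asks for 100 non-matches and 10 matches; the package may compute it differently.

```r
# Hypothetical helper, not part of RecordLinkage: limit a requested
# count to what is available and warn if the request had to be reduced.
cap_count <- function(desired, available, what) {
  if (desired > available) {
    warning(sprintf("requested %d %s but only %d are available; using %d",
                    as.integer(desired), what,
                    as.integer(available), as.integer(available)))
    return(available)
  }
  desired
}

num.non   <- 100
des.mprop <- 0.1

# Assumed rule: matches = non-matches * desired ratio, rounded.
want_matches <- round(num.non * des.mprop)

# Suppose the data hold only 7 matches but 5000 non-matches.
n_matches <- cap_count(want_matches, available = 7, what = "matches")    # warns
n_non     <- cap_count(num.non, available = 5000, what = "non-matches")  # no warning
```

In this sketch the request for 10 matches is reduced to the 7 available, while the request for 100 non-matches is met in full.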

Author(s)

Andreas Borg, Murat Sariyar

See Also

genSamples for generating training data based on unsupervised classification.

Examples

data(RLdata500)
pairs=compare.dedup(RLdata500, identity=identity.RLdata500, 
  blockfld=list(1,3,5,6,7))

# split into halves, do not enforce match ratio
l=splitData(pairs, prop=0.5)
summary(l$train)
summary(l$valid)

# split into 1/3 and 2/3, retain match ratio
l=splitData(pairs, prop=1/3, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)

# generate a training set with 100 non-matches and 10 matches
l=splitData(pairs, num.non=100, des.mprop=0.1, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)


RecordLinkage documentation built on Nov. 10, 2022, 5:42 p.m.