splitData: Split Data
In RecordLinkage: Record Linkage Functions for Linking and Deduplicating Data Sets

Description

Splits a set of record pairs into a training set and a validation set with the desired proportions.

Usage

splitData(dataset, prop, keep.mprop = FALSE, num.non = 0, des.mprop = 0, 
use.pred = FALSE)

Arguments

`dataset`	Object of class `RecLinkData`. Data pairs to split.
`prop`	Real number between 0 and 1. Proportion of data pairs to form the training set.
`keep.mprop`	Logical. Whether the ratio of matches should be retained.
`num.non`	Positive integer. Desired number of non-matches in the training set.
`des.mprop`	Real number between 0 and 1. Desired proportion of matches to non-matches in the training set.
`use.pred`	Logical. Whether to apply the match ratio to previous classification results instead of the true matching status.
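
To illustrate what retaining the match ratio (`keep.mprop = TRUE`) means, here is a minimal base R sketch. It is not the package's implementation: `stratified_split` is a hypothetical helper, the rounding rule is an assumption, and the match flags are made-up toy data. The idea is to sample the fraction `prop` separately from the matches and from the non-matches, so that both resulting sets keep roughly the original share of matches.

```r
# Hypothetical helper, not part of RecordLinkage: split indices so that the
# training set keeps the same share of matches as the full data.
stratified_split <- function(is_match, prop) {
  match_idx <- which(is_match)
  non_idx <- which(!is_match)
  # Draw n elements of idx without replacement (safe for length-1 vectors).
  pick <- function(idx, n) idx[sample.int(length(idx), n)]
  train_idx <- c(
    pick(match_idx, round(prop * length(match_idx))),  # rounding is an assumption
    pick(non_idx, round(prop * length(non_idx)))
  )
  list(
    train = sort(train_idx),
    valid = setdiff(seq_along(is_match), train_idx)
  )
}

set.seed(1)
# Toy data: 30 matches and 270 non-matches (10% matches).
is_match <- rep(c(TRUE, FALSE), times = c(30, 270))
s <- stratified_split(is_match, prop = 1/3)

length(s$train)           # 100
mean(is_match[s$train])   # 0.1, the same as mean(is_match)
```

Sampling within each class separately, rather than from all pairs at once, is what keeps the ratio stable even when matches are rare.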

Value

A list of `RecLinkData` objects with the following components:

`train`	The sampled training data.
`valid`	All other record pairs.

The sampled data are stored in the `pairs` attributes of `train` and `valid`. If present, the attributes `prediction` and `Wdata` are split and the corresponding values saved. All other attributes are copied to both data sets.

If the number of desired matches or non-matches is higher than the number actually present in the data, the maximum possible number is chosen and a warning is issued.
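
The capping behaviour described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: `cap_request` is a hypothetical helper, the warning text is invented, and the available counts are made-up numbers. Deriving the desired number of matches as `des.mprop * num.non` follows the reading used in the Examples section (100 non-matches and a ratio of 0.1 give 10 matches); the rounding is an assumption.

```r
# Hypothetical helper, not part of RecordLinkage: cap a requested sample size
# at the number of pairs actually available and warn when capping occurs.
cap_request <- function(requested, available, label) {
  if (requested > available) {
    warning(sprintf("Only %d %s available, %d requested; using %d.",
                    available, label, requested, available))
    return(available)
  }
  requested
}

num.non <- 100
des.mprop <- 0.1
# Desired matches from the ratio of matches to non-matches (rounding assumed).
n_match_wanted <- round(des.mprop * num.non)

# Made-up availability: only 6 matches, but plenty of non-matches.
n_match <- cap_request(n_match_wanted, available = 6, label = "matches")
n_non <- cap_request(num.non, available = 5000, label = "non-matches")

c(matches = n_match, non_matches = n_non)
```

In this toy case the request for matches is reduced to the 6 that exist and a warning is raised, while the request for 100 non-matches is met unchanged.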

Author(s)

Andreas Borg, Murat Sariyar

Examples

data(RLdata500)
pairs=compare.dedup(RLdata500, identity=identity.RLdata500, 
  blockfld=list(1,3,5,6,7))

# split into halves, do not enforce match ratio
l=splitData(pairs, prop=0.5)
summary(l$train)
summary(l$valid)

# split into 1/3 and 2/3, retain match ratio
l=splitData(pairs, prop=1/3, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)

# generate a training set with 100 non-matches and 10 matches
l=splitData(pairs, num.non=100, des.mprop=0.1, keep.mprop=TRUE)
summary(l$train)
summary(l$valid)

RecordLinkage documentation built on Aug. 8, 2025, 6:05 p.m.