genSamples: Generate Training Set

genSamples R Documentation

Generate Training Set

Description

Generates training data by unsupervised classification.

Usage

genSamples(dataset, num.non, des.mprop = 0.1)

Arguments

dataset

Object of class RecLinkData. Data pairs from which to sample.

num.non

Positive integer. Number of desired non-links in the training set.

des.mprop

Real number in the range [0,1]. Ratio of the number of links to the number of non-links in the training set.

Details

The application of supervised classifiers (via classifySupv) requires a training set of record pairs with known matching status. Where no such data are available, genSamples can be used to generate training data. The matching status is determined by unsupervised clustering with bclust. Subsequently, the desired number of links and non-links are sampled.

If the requested number of matches or non-matches is not feasible, a warning is issued and the maximum possible number is used instead.
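The sampling arithmetic described above can be sketched in base R. The helper target_sizes and its arguments avail.links and avail.non are hypothetical and not part of RecordLinkage. Rounding the desired number of links to the nearest integer is an assumption, so the package's own implementation may differ in detail.

```r
# Hypothetical helper (not part of RecordLinkage): derives the number of
# links and non-links to sample from num.non and des.mprop, and clips both
# to what is available after clustering.
target_sizes <- function(num.non, des.mprop,
                         avail.links = Inf, avail.non = Inf) {
  stopifnot(num.non > 0, des.mprop >= 0, des.mprop <= 1)

  # des.mprop is the ratio links / non-links (rounding is an assumption)
  want.links <- round(des.mprop * num.non)

  # Use at most as many pairs as the clustering result provides
  n.links <- min(want.links, avail.links)
  n.non   <- min(num.non, avail.non)

  if (n.links < want.links || n.non < num.non)
    warning("requested sample size not feasible; using the maximum available")

  c(links = n.links, non.links = n.non)
}

# 200 non-links at a ratio of 0.1 asks for 20 links
target_sizes(num.non = 200, des.mprop = 0.1)

# Only 12 pairs were clustered as links: a warning is raised and 12 are used
target_sizes(num.non = 200, des.mprop = 0.1, avail.links = 12)
```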

Value

A list of "RecLinkResult" objects.

train

The sampled training data.

valid

All other record pairs.

Record pairs are split into the respective pairs components. The prediction components represent the clustering result. If weights are present in dataset, the corresponding values of Wdata are stored in train and valid. All other components are copied from dataset.
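A minimal usage sketch, assuming the RecordLinkage package and its bundled RLdata500 data set are installed. The blocking fields, sample sizes, seed and the choice of "rpart" as classifier are illustrative, not recommendations. Because the training pairs carry a clustering result rather than a known matching status, the sketch passes use.pred = TRUE to trainSupv so that the prediction component serves as the class label.

```r
library(RecordLinkage)

set.seed(1)  # clustering and sampling are random; fix the seed for repeatability

data(RLdata500)
# Build comparison patterns without supplying the true matching status
pairs <- compare.dedup(RLdata500, blockfld = list(1, 3, 5, 6, 7))

# Cluster the pairs and draw the training set
l <- genSamples(pairs, num.non = 200, des.mprop = 0.1)

# Inspect the two components of the returned list
class(l$train)
nrow(l$train$pairs)        # sampled training pairs
nrow(l$valid$pairs)        # all remaining pairs
table(l$train$prediction)  # clustering result used as matching status

# Train on the clustering result and classify the remaining pairs
model  <- trainSupv(l$train, method = "rpart", use.pred = TRUE)
result <- classifySupv(model, newdata = l$valid)
table(result$prediction)
```

Since the labels come from clustering, the resulting classification inherits any errors made in that step.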

Note

Unsupervised clustering may lead to poor classification quality, so all subsequent results should be evaluated critically.

Author(s)

Andreas Borg, Murat Sariyar

See Also

splitData for splitting data sets without clustering.


RecordLinkage documentation built on Nov. 10, 2022, 5:42 p.m.