split_train_and_test: Split train and test data

View source: R/split_train_and_test.R

Description

Splits a data matrix into train and test subsets and returns them along with basic summary statistics. Users may specify preferred sampling proportions and may provide an integer vector of seed values; each seed produces a candidate split, and the split that most closely approaches the desired proportions is selected. By default, split_train_and_test splits into 50% TRAIN and 50% TEST.

Usage

split_train_and_test(
  matrix,
  obs = NROW(matrix),
  train.prop = 0.5,
  test.prop = 0.5,
  seeds = 1
)

Arguments

matrix

An N x M matrix, containing N rows (observations) and M columns (data features)

obs

An integer value specifying how many rows from the original data set should be preserved. (See the size argument of sample() for details.)

train.prop

A real number between 0 and 1, describing the preferred proportion of training data

test.prop

A real number between 0 and 1, describing the preferred proportion of test data

seeds

An integer vector of seed values, for use in randomly sampling observations into train and test subsets
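The way a vector of seeds can be used to find the closest split may be easier to see in code. The following is a minimal sketch only, not the package's implementation: pick_best_seed is a hypothetical helper, and it assumes each observation is assigned independently to TRAIN or TEST, so that the realized proportions differ from seed to seed.

```r
# Hypothetical helper, NOT part of the package: tries each seed and keeps
# the split whose realized TRAIN proportion is closest to train.prop.
pick_best_seed <- function(n, train.prop = 0.5, seeds = 1) {
  best <- list(seed = NA, gap = Inf, labels = NULL)
  for (s in seeds) {
    set.seed(s)
    # Assumption: each row is labelled independently, so proportions vary by seed
    labels <- sample(c("TRAIN", "TEST"), size = n, replace = TRUE,
                     prob = c(train.prop, 1 - train.prop))
    # Distance between the realized and the requested TRAIN proportion
    gap <- abs(mean(labels == "TRAIN") - train.prop)
    if (gap < best$gap) {
      best <- list(seed = s, gap = gap, labels = labels)
    }
  }
  best
}

one.seed   <- pick_best_seed(nrow(iris), train.prop = 0.8, seeds = 1)
many.seeds <- pick_best_seed(nrow(iris), train.prop = 0.8, seeds = 1:20)
table(many.seeds$labels)
```

Because seed 1 is among the candidates in the second call, many.seeds$gap can never exceed one.seed$gap; trying more seeds can only match or improve the best split found.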

Value

A list containing the following results:

counts

a table containing raw counts of train and test observations

contingency.table

a contingency table, describing the proportion of observations assigned to train and test sets

test

a matrix holding the test data subset, with the same number of columns as the input matrix, and one row per sampled observation

train

a matrix holding the train data subset, with the same number of columns as the input matrix, and one row per sampled observation
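To illustrate the difference between counts and contingency.table, the sketch below builds comparable summaries in base R from a made-up vector of labels. It assumes the returned objects resemble the output of table() and prop.table(); the actual structure returned by split_train_and_test may differ.

```r
# Illustrative labels only: 80 TRAIN and 70 TEST observations
labels <- c(rep("TRAIN", 80), rep("TEST", 70))

# Raw counts per subset (analogous to the `counts` element)
counts <- table(labels)

# Proportions per subset (analogous to the `contingency.table` element)
contingency <- prop.table(counts)

counts
contingency
```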

Examples

# By default, split_train_and_test attempts to split into 50% TRAIN and 50% TEST, and iterates once.
default <- split_train_and_test(iris)
default$contingency.table

# Increasing the number of seeds increases the likelihood of an optimal split
many.seeds <- split_train_and_test(iris, seeds=1:113)
many.seeds$contingency.table

# Pass proportions if you prefer a different split
eighty.twenty <- split_train_and_test(iris, train.prop=0.8, test.prop=0.2)
eighty.twenty$contingency.table

# Use `obs` to subset your data while sampling
# (named `subsampled` to avoid masking base::subset)
subsampled <- split_train_and_test(iris, obs=100)  # produces ~50 TRAIN and ~50 TEST observations
subsampled$counts

ChrisKeefe/UnsupLP1 documentation built on Oct. 8, 2020, 5:37 a.m.