factory_data_load_and_split_r: Split dataset into training and test sets


View source: R/factory_data_load_and_split_r.R

Description

Splits the dataset with Scikit-learn and returns the train/test data and their row/position indices.

Usage

factory_data_load_and_split_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  reduce_criticality = FALSE,
  theme = NULL
)

Arguments

filename

A data frame with the data (class and text columns). Otherwise, the name of the dataset file (CSV), including the full path to the data folder (if the file is not in the project's working directory) and the file extension (".csv").

target

String. The name of the response variable.

predictor

String. The name of the predictor variable.

test_size

Numeric. Proportion of the data to allocate to the test dataset. Defaults to 0.33.

reduce_criticality

Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If TRUE, then all records with a criticality of "-5" (respectively, "5") are assigned a criticality of "-4" (respectively, "4"). This is to avoid situations where the pipeline breaks due to a lack of sufficient data for "-5" and/or "5". Defaults to FALSE.
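The recoding described for reduce_criticality can be pictured with a minimal base R sketch. This is not the package's internal code: the toy data frame, the column name "criticality", and the helper collapse_extremes() are all hypothetical and exist only to illustrate the mapping of "-5" to "-4" and "5" to "4".

```r
# Hypothetical toy data; the column name "criticality" is an assumption.
toy <- data.frame(
  feedback = c("a", "b", "c", "d", "e"),
  criticality = c("-5", "-4", "0", "4", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge the sparse extreme classes into their neighbours.
collapse_extremes <- function(x) {
  x[x == "-5"] <- "-4"
  x[x == "5"] <- "4"
  x
}

toy$criticality <- collapse_extremes(toy$criticality)
table(toy$criticality)
```

After the recoding, the classes "-5" and "5" no longer appear, so a split cannot end up with too few records for them.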

theme

String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL. If supplied, the theme variable is used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold. First, it helps the model improve predictions of criticality when the theme labels are readily available. Second, it forces the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. This is the only criticality value that "Couldn't be improved" can take, so forcing it to "3" both improves model performance and corrects erroneous assignments of other values that are attributable to human error.
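The second point, forcing the criticality of "Couldn't be improved" to "3", can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: the toy data, the column names "theme" and "criticality", and the helper force_fixed_criticality() are hypothetical.

```r
# Hypothetical toy data; column names are assumptions for illustration.
toy <- data.frame(
  theme = c("Access", "Couldn't be improved", "Couldn't be improved"),
  criticality = c("2", "3", "1"),  # the last value mimics a human labelling error
  stringsAsFactors = FALSE
)

# Hypothetical helper: overwrite criticality wherever the theme is
# "Couldn't be improved", leaving all other rows untouched.
force_fixed_criticality <- function(df, theme_col, crit_col) {
  is_fixed <- df[[theme_col]] == "Couldn't be improved"
  df[[crit_col]][is_fixed] <- "3"
  df
}

toy <- force_fixed_criticality(toy, "theme", "criticality")
toy
```

The same overwrite would apply equally to training labels, test labels, and model predictions.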

Value

A list of length 6: x_train (data frame), x_test (data frame), y_train (array), y_test (array), index_training_data (integer vector), and index_test_data (integer vector). The row names of x_train and x_test are index_training_data and index_test_data respectively, as are the names of y_train and y_test.

References

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830

Examples

data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.33)

# Let's take a look at the returned list
str(data_splits)

# Each record in the split data is tagged with the row index of the original dataset
head(rownames(data_splits$x_train))
head(names(data_splits$y_train))

# Note that, in Python, indices start from 0 and go up to number_of_records - 1
library(magrittr) # Provides the %>% pipe used below
all_indices <- data_splits$y_train %>%
  names() %>%
  c(names(data_splits$y_test)) %>%
  as.numeric() %>%
  sort()
head(all_indices) # Starts from zero
tail(all_indices) # Ends in nrow(pxtextmineR::text_data) - 1
length(all_indices) == nrow(pxtextmineR::text_data)
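
Because the returned indices are zero-based, recovering the matching rows of the original data in R needs a shift by one. The sketch below uses small mock objects in place of pxtextmineR::text_data and the returned index vector, so it runs without a Python installation; the mock values are made up for illustration.

```r
# Mock stand-in for the original dataset (values are invented).
original <- data.frame(
  label = c("A", "B", "C", "D"),
  feedback = c("one", "two", "three", "four"),
  stringsAsFactors = FALSE
)

# Mock stand-in for data_splits$index_test_data: zero-based positions.
index_test_data <- c(0L, 3L)

# Add 1 to convert Python's zero-based positions to R's one-based rows.
test_rows <- original[index_test_data + 1L, ]
test_rows$feedback
```

With a real split, the same shift applied to data_splits$index_test_data should select the rows of the original data that make up the test set.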

nhs-r-community/pxtextmineR documentation built on Dec. 22, 2021, 2:10 a.m.