factory_data_load_and_split_r: Split dataset into training and test sets
In nhs-r-community/pxtextmineR: An R Wrapper for Python's "pxtextmining" library


View source: R/factory_data_load_and_split_r.R

Splits the dataset with Scikit-learn and returns the train/test data and their row/position indices.

factory_data_load_and_split_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  reduce_criticality = FALSE,
  theme = NULL
)

`filename`	Either a data frame containing the data (class and text columns), or the name of a CSV file. A file name must include the full path to the data folder (if the file is not in the project's working directory) and the file extension (".csv").
`target`	String. The name of the response variable.
`predictor`	String. The name of the predictor variable.
`test_size`	Numeric. Proportion of the data (between 0 and 1) that will form the test dataset. Defaults to `0.33`.
`reduce_criticality`	Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If `TRUE`, then all records with a criticality of "-5" (respectively, "5") are assigned a criticality of "-4" (respectively, "4"). This is to avoid situations where the pipeline breaks due to a lack of sufficient data for "-5" and/or "5". Defaults to `FALSE`.
`theme`	String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to `NULL`. If supplied, the theme variable is used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold. First, it helps the model improve predictions of criticality when the theme labels are readily available. Second, it forces the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. This is the only criticality value that "Couldn't be improved" can take, so forcing it to "3" both improves model performance and corrects any values other than "3" that were assigned through human error.
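The two recoding rules described for `reduce_criticality` and `theme` can be illustrated with a minimal base-R sketch. This is not the package's implementation; the toy data frame and its column names (`criticality`, `theme`) are assumptions made purely for illustration.

```r
# Toy data: column names and values are hypothetical, for illustration only
toy <- data.frame(
  criticality = c("-5", "-4", "0", "5", "1"),
  theme = c("Access", "Access", "Environment/ facilities",
            "Access", "Couldn't be improved"),
  stringsAsFactors = FALSE
)

# Rule described for reduce_criticality = TRUE:
# fold the sparse extreme classes into their nearest neighbours
toy$criticality[toy$criticality == "-5"] <- "-4"
toy$criticality[toy$criticality == "5"] <- "4"

# Rule described for a supplied theme column:
# "Couldn't be improved" can only take a criticality of "3"
toy$criticality[toy$theme == "Couldn't be improved"] <- "3"

print(toy)
```

After both rules are applied, no record in the toy data carries a criticality of "-5" or "5", and the "Couldn't be improved" record has a criticality of "3" regardless of its original value.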

A list of length 6: `x_train` (data frame), `x_test` (data frame), `y_train` (array), `y_test` (array), `index_training_data` (integer vector), and `index_test_data` (integer vector). The row names of `x_train` and the names of `y_train` are the values in `index_training_data`; the row names of `x_test` and the names of `y_test` are the values in `index_test_data`.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.33)

# Let's take a look at the returned list
str(data_splits)

# Each record in the split data is tagged with the row index of the original dataset
head(rownames(data_splits$x_train))
head(names(data_splits$y_train))

# Note that, in Python, indices start from 0 and go up to number_of_records - 1
library(magrittr) # Provides the %>% pipe used below
all_indices <- data_splits$y_train %>%
  names() %>%
  c(names(data_splits$y_test)) %>%
  as.numeric() %>%
  sort()
head(all_indices) # Starts from zero
tail(all_indices) # Ends in nrow(pxtextmineR::text_data) - 1
length(all_indices) == nrow(pxtextmineR::text_data)
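
Because the returned indices are zero-based, they need to be shifted by one before they can be used to subset the original data in R. The sketch below shows the idea on toy data. The data frame `original` and the index vector are hypothetical stand-ins, and the sketch assumes the indices are zero-based row positions, as the comments in the example above indicate.

```r
# Toy stand-in for the original dataset (hypothetical values)
original <- data.frame(
  label = c("a", "b", "a", "c", "b"),
  feedback = c("f1", "f2", "f3", "f4", "f5"),
  stringsAsFactors = FALSE
)

# Hypothetical zero-based (Python-style) indices, as in index_test_data
index_test_data <- c(0L, 3L)

# R is one-based, so add 1 before subsetting the original data frame
test_rows <- original[index_test_data + 1L, ]
print(test_rows)
```

Under the same assumption, the equivalent lookup on the real output would be along the lines of `pxtextmineR::text_data[data_splits$index_test_data + 1, ]`.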