View source: R/factory_data_load_and_split_r.R
Splits the dataset with Scikit-learn
and returns the train/test data and their row/position indices.
factory_data_load_and_split_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  reduce_criticality = FALSE,
  theme = NULL
)
filename | Either a data frame holding the data (class and text columns) or the name of a CSV file, including the full path to the data folder (if it is not in the project's working directory) and the file extension (".csv").
target | String. The name of the response variable.
predictor | String. The name of the predictor variable.
test_size | Numeric. Proportion of the data that will form the test dataset.
reduce_criticality | Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If
theme | String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL.
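To make the test_size proportion concrete, the sketch below computes how many records would land in each split. It assumes the underlying split follows the rounding used by Scikit-learn's train_test_split (test count rounded up, remainder used for training); this page does not confirm which Scikit-learn routine is called, so treat that as an assumption. The helper name split_sizes is hypothetical and is not part of pxtextmineR.

```r
# Hypothetical helper (not part of pxtextmineR): expected split sizes,
# ASSUMING the rounding of Scikit-learn's train_test_split.
split_sizes <- function(n_records, test_size = 0.33) {
  stopifnot(n_records > 0, test_size > 0, test_size < 1)
  n_test <- ceiling(test_size * n_records)  # test count is rounded up
  n_train <- n_records - n_test             # the remainder forms the training set
  c(train = n_train, test = n_test)
}

sizes <- split_sizes(10, test_size = 0.33)
print(sizes)
```

Under this assumption the two counts always sum to the number of records, so no row is dropped by the split itself.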
A list of length 6: x_train (data frame), x_test (data frame),
y_train (array), y_test (array), index_training_data
(integer vector), and index_test_data (integer vector). The row names
of x_train and x_test, and the names of y_train and y_test, are
index_training_data and index_test_data respectively.
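Because the split is performed in Python, the returned indices are zero-based positions rather than R row numbers. The following minimal sketch shows one way to map them back to rows of the original data frame. The data frame and index vectors here are made-up stand-ins (not pxtextmineR::text_data or real output), used only so the snippet runs without a Python installation.

```r
# Made-up stand-in for the original dataset (illustration only).
original <- data.frame(
  label = c("a", "b", "a", "c", "b", "a"),
  feedback = c("t0", "t1", "t2", "t3", "t4", "t5"),
  stringsAsFactors = FALSE
)

# Made-up zero-based indices, shaped like index_training_data / index_test_data.
index_training_data <- c(4L, 0L, 3L, 5L)
index_test_data <- c(2L, 1L)

# Add 1 to convert Python positions into R row numbers.
train_rows <- original[index_training_data + 1L, ]
test_rows <- original[index_test_data + 1L, ]

# Sanity checks: the splits are disjoint and together cover every row.
stopifnot(
  length(intersect(index_training_data, index_test_data)) == 0,
  nrow(train_rows) + nrow(test_rows) == nrow(original)
)
```

With real output, the same `+ 1L` shift applied to data_splits$index_training_data and data_splits$index_test_data should recover the corresponding rows of the input data frame.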
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011), Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830
library(magrittr) # Provides the %>% pipe used below

data_splits <- pxtextmineR::factory_data_load_and_split_r(
  filename = pxtextmineR::text_data,
  target = "label",
  predictor = "feedback",
  test_size = 0.33)

# Let's take a look at the returned list
str(data_splits)

# Each record in the split data is tagged with the row index of the original dataset
head(rownames(data_splits$x_train))
head(names(data_splits$y_train))

# Note that, in Python, indices start from 0 and go up to number_of_records - 1
all_indices <- data_splits$y_train %>%
  names() %>%
  c(names(data_splits$y_test)) %>%
  as.numeric() %>%
  sort()
head(all_indices) # Starts from zero
tail(all_indices) # Ends in nrow(pxtextmineR::text_data) - 1
length(all_indices) == nrow(pxtextmineR::text_data)