text_classification_pipeline_r: Fit and evaluate the pipeline
In nhs-r-community/pxtextmineR: An R Wrapper for Python's "pxtextmining" library



Split the data, build and fit the pipeline, produce performance metrics.

text_classification_pipeline_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  ordinal = FALSE,
  tknz = "spacy",
  metric = "class_balance_accuracy_score",
  cv = 2,
  n_iter = 10,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier"),
  reduce_criticality = FALSE,
  theme = NULL
)

`filename`	A data frame with the data (class and text columns), otherwise the dataset name (CSV), including full path to the data folder (if not in the project's working directory), and the data type suffix (".csv").
`target`	String. The name of the response variable.
`predictor`	String. The name of the predictor variable.
`test_size`	Numeric. Proportion of data that will form the test dataset.
`ordinal`	Whether to fit an ordinal classification model. The ordinal model is the implementation of Frank and Hall (2001) that can use any standard classification model that calculates probabilities.
`tknz`	Tokenizer to use ("spacy" or "wordnet").
`metric`	String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score").
`cv`	Number of cross-validation folds.
`n_iter`	Number of parameter settings that are sampled (see `sklearn.model_selection.RandomizedSearchCV`).
`n_jobs`	Number of jobs to run in parallel (see `sklearn.model_selection.RandomizedSearchCV`). NOTE: If your machine does not have the number of cores specified in `n_jobs`, then an error will be returned.
`verbose`	Controls the verbosity (see `sklearn.model_selection.RandomizedSearchCV`).
`learners`	Vector. `Scikit-learn` names of the learners to tune. Must be one or more of "SGDClassifier", "RidgeClassifier", "Perceptron", "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB", "MultinomialNB", "KNeighborsClassifier", "NearestCentroid", "RandomForestClassifier". When a single model is used, it can be passed as a string.
`reduce_criticality`	Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If `TRUE`, then all records with a criticality of "-5" (respectively, "5") are assigned a criticality of "-4" (respectively, "4"). This is to avoid situations where the pipeline breaks due to a lack of sufficient data for "-5" and/or "5". Defaults to `FALSE`.
`theme`	String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to `NULL`. If supplied, the theme variable will be used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold. First, to help the model improve predictions on criticality when the theme labels are readily available. Second, to force the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. This is the only criticality value that "Couldn't be improved" can take, so by forcing it to always be "3", we are improving model performance, but are also correcting possible erroneous assignments of values other than "3" that are attributed to human error.
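
The recoding described for `reduce_criticality` and `theme` can be sketched in base R. This is an illustration only, not the package's internal code: the data frame `toy` and its column names `criticality` and `theme` are hypothetical.

```r
# Hypothetical toy data; the column names are assumptions for illustration.
toy <- data.frame(
  criticality = c("-5", "-4", "0", "3", "5", "1"),
  theme = c("Access", "Access", "Environment/ facilities",
            "Couldn't be improved", "Access", "Couldn't be improved"),
  stringsAsFactors = FALSE
)

# reduce_criticality = TRUE: fold the sparse extreme classes into their
# neighbours, so "-5" becomes "-4" and "5" becomes "4".
toy$criticality[toy$criticality == "-5"] <- "-4"
toy$criticality[toy$criticality == "5"] <- "4"

# theme supplied: "Couldn't be improved" can only take a criticality of "3".
toy$criticality[toy$theme == "Couldn't be improved"] <- "3"

toy$criticality
```

In this toy data, the last record was labelled "1" despite its theme being "Couldn't be improved", which is the kind of human error that the forced value of "3" corrects.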

This function brings together the three functions that run chunks of the process independently, namely splitting data into training and test sets (factory_data_load_and_split_r), building and fitting the pipeline (factory_pipeline_r) on the whole dataset (train and test), and assessing pipeline performance (factory_model_performance_r).

For details on what the pipeline does/how it works, see factory_pipeline_r's Details section.
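
As background to the `ordinal` argument, the Frank and Hall (2001) approach turns a problem with K ordered classes into K - 1 binary problems of the form "is the class greater than class k?", and recovers the class probabilities by differencing. The sketch below shows only the differencing step, with made-up probabilities. The function name `frank_hall_probs` is hypothetical and is not part of pxtextmineR.

```r
# p_greater: for one record, the K - 1 estimated probabilities
# P(y > class_1), ..., P(y > class_{K-1}) from the binary classifiers.
frank_hall_probs <- function(p_greater) {
  # P(class_1) = 1 - P(y > class_1)
  # P(class_k) = P(y > class_{k-1}) - P(y > class_k), for 1 < k < K
  # P(class_K) = P(y > class_{K-1})
  c(1, p_greater) - c(p_greater, 0)
}

# Made-up probabilities for a problem with 4 ordered classes
probs <- frank_hall_probs(c(0.9, 0.6, 0.2))
probs
which.max(probs) # position of the predicted class
```

Because the binary classifiers are fitted independently, the differences are not guaranteed to be non-negative on real data; the made-up inputs here are monotone, so the issue does not arise.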

A list of length 7:

A fitted Scikit-learn pipeline containing a number of objects that can be accessed with the $ sign (see Examples). For a partial list, see "Attributes" in sklearn.model_selection.RandomizedSearchCV. Do not be surprised if the pipeline contains more objects than those in the "Attributes" list. Python objects can contain several objects, from numeric results (e.g. the pipeline's accuracy) to methods (i.e. functions, in R lingo) and classes. In Python, these are normally accessed with object.<whatever>, but in R the command is object$<whatever>. For instance, one can access the predict() method to make predictions on unseen data. See Examples.
tuning_results Data frame. All (hyper)parameter values and models tried during fitting.
pred Vector. The predictions on the test set.
accuracy_per_class Data frame. Accuracies per class.
p_compare_models_bar A bar plot comparing the mean scores (of the user-supplied metric parameter) from the cross-validation on the training set, for the best (hyper)parameter values for each learner.
index_training_data The row names/indices of the training data. Note that, in Python, indices start from 0 and go up to number_of_records - 1. See Examples.
index_test_data The row names/indices of the test data. Note that, in Python, indices start from 0 and go up to number_of_records - 1. See Examples.
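
Because the indices follow Python's 0-based convention, they need shifting by one before they can be used as row positions in R. A small sketch with a hypothetical data frame and made-up index vectors:

```r
# Hypothetical data; the index vectors below are made up for illustration.
toy <- data.frame(feedback = c("a", "b", "c", "d", "e"),
                  stringsAsFactors = FALSE)

# 0-based indices, as Python would report them
index_training_data <- c(3, 0, 4)
index_test_data <- c(1, 2)

# Add 1 to convert the Python indices to R row positions
toy_train <- toy[index_training_data + 1, , drop = FALSE]
toy_test <- toy[index_test_data + 1, , drop = FALSE]

toy_train$feedback
toy_test$feedback
```

Using the raw 0-based values as positions would silently drop index 0 and shift every other row by one, since R ignores a subscript of 0.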

Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001 145–156.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

# We can prepare the data, build and fit the pipeline, and get performance
# metrics in two ways. One way is to run the factory_* functions independently.
# The commented-out script right below would do exactly that.

# Prepare training and test sets
# data_splits <- pxtextmineR::factory_data_load_and_split_r(
#   filename = pxtextmineR::text_data,
#   target = "label",
#   predictor = "feedback",
#   test_size = 0.90) # Make a small training set for a faster run in this example
#
# # Fit the pipeline
# pipe <- pxtextmineR::factory_pipeline_r(
#   x = data_splits$x_train,
#   y = data_splits$y_train,
#   tknz = "spacy",
#   ordinal = FALSE,
#   metric = "class_balance_accuracy_score",
#   cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
#   learners = c("SGDClassifier", "MultinomialNB")
# )
# (SGDClassifier represents both logistic regression and linear SVM. This
# depends on the value of the "loss" hyperparameter, which can be "log" or
# "hinge". This is set internally in factory_pipeline_r).
#
# # Assess model performance
# pipe_performance <- pxtextmineR::factory_model_performance_r(
#   pipe = pipe,
#   x_train = data_splits$x_train,
#   y_train = data_splits$y_train,
#   x_test = data_splits$x_test,
#   y_test = data_splits$y_test,
#   metric = "accuracy_score")

# Alternatively, we can use text_classification_pipeline_r() to do everything in
# one go.
text_pipe <- pxtextmineR::text_classification_pipeline_r(
  filename = pxtextmineR::text_data,
  target = 'label',
  predictor = 'feedback',
  test_size = 0.33,
  ordinal = FALSE,
  tknz = "spacy",
  metric = "class_balance_accuracy_score",
  cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
  learners = c("SGDClassifier", "MultinomialNB"),
  reduce_criticality = FALSE,
  theme = NULL
)

names(text_pipe)

# Let's compare pipeline performance for different tunings across a range of
# metrics, each averaged over the cross-validation folds.
library(magrittr) # provides the %>% pipe used below
text_pipe$tuning_results %>%
  dplyr::select(learner, dplyr::contains("mean_test"))

# A glance at the (hyper)parameters and their tuned values
text_pipe$tuning_results %>%
  dplyr::select(learner, dplyr::contains("param_")) %>%
  str()

# Learner performance barplot
text_pipe$p_compare_models_bar

# Predictions on test set
preds <- text_pipe$pred
head(preds)

# We can also get the row indices of the train and test data. Note that, in
# Python, indices start from 0. For example, the row indices of a data frame
# with 5 rows would be 0, 1, 2, 3 & 4.
head(sort(text_pipe$index_training_data))
head(sort(text_pipe$index_test_data))

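# Let's subset the original data set. Label the rows with the Python-style
# (0-based) indices, then add 1 to the indices to get R row positions
# (numeric subscripts in R are positional and a subscript of 0 is dropped).
text_dataset <- pxtextmineR::text_data
rownames(text_dataset) <- 0:(nrow(text_dataset) - 1)
data_train <- text_dataset[as.integer(text_pipe$index_training_data) + 1L, ]
data_test <- text_dataset[as.integer(text_pipe$index_test_data) + 1L, ]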
str(data_train)
str(data_test)