text_classification_pipeline_r: Fit and evaluate the pipeline

View source: R/text_classification_pipeline_r.R

Description

Split the data, build and fit the pipeline, produce performance metrics.

Usage

text_classification_pipeline_r(
  filename,
  target,
  predictor,
  test_size = 0.33,
  ordinal = FALSE,
  tknz = "spacy",
  metric = "class_balance_accuracy_score",
  cv = 2,
  n_iter = 10,
  n_jobs = 1,
  verbose = 3,
  learners = c("SGDClassifier"),
  reduce_criticality = FALSE,
  theme = NULL
)

Arguments

filename

Either a data frame containing the data (class and text columns) or the name of a CSV dataset. A dataset name must include the data type suffix (".csv") and, if the file is not in the project's working directory, the full path to the data folder.

target

String. The name of the response variable.

predictor

String. The name of the predictor variable.

test_size

Numeric. Proportion of data that will form the test dataset.

ordinal

Logical. Whether to fit an ordinal classification model. The ordinal model is an implementation of Frank and Hall (2001), which can use any standard classification model that calculates probabilities.

tknz

String. Tokenizer to use ("spacy" or "wordnet").

metric

String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score").

cv

Number of cross-validation folds.

n_iter

Number of parameter settings that are sampled (see sklearn.model_selection.RandomizedSearchCV).

n_jobs

Number of jobs to run in parallel (see sklearn.model_selection.RandomizedSearchCV). NOTE: An error is returned if n_jobs exceeds the number of cores available on your machine.

verbose

Controls the verbosity (see sklearn.model_selection.RandomizedSearchCV).

learners

Vector. Scikit-learn names of the learners to tune. Must be one or more of "SGDClassifier", "RidgeClassifier", "Perceptron", "PassiveAggressiveClassifier", "BernoulliNB", "ComplementNB", "MultinomialNB", "KNeighborsClassifier", "NearestCentroid", "RandomForestClassifier". When a single model is used, it can be passed as a string.

reduce_criticality

Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. If TRUE, all records with a criticality of "-5" are assigned a criticality of "-4", and all records with a criticality of "5" are assigned a criticality of "4". This avoids situations where the pipeline breaks because there is too little data for "-5" and/or "5". Defaults to FALSE.

theme

String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL. If supplied, the theme variable is used as a predictor (along with the text predictor) in the model that is fitted with criticality as the response variable. The rationale is two-fold. First, theme labels help the model predict criticality when they are readily available. Second, it forces the criticality for "Couldn't be improved" to always be "3" in the training and test data, as well as in the predictions. "3" is the only criticality value that "Couldn't be improved" can take, so forcing it both improves model performance and corrects values other than "3" that were assigned through human error.
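
The two criticality adjustments described for reduce_criticality and theme can be pictured with a few lines of base R. This is only a sketch on hypothetical toy data: the column names "criticality" and "theme" are assumptions, and the package may apply the recoding differently internally.

```r
# Hypothetical toy data; the column names are assumptions for illustration.
toy <- data.frame(
  criticality = c("-5", "-4", "0", "3", "5", "1"),
  theme = c("Access", "Access", "Environment/ facilities",
            "Couldn't be improved", "Access", "Couldn't be improved"),
  stringsAsFactors = FALSE
)

# Effect of reduce_criticality = TRUE: fold the extreme classes
# into their neighbouring classes.
toy$criticality[toy$criticality == "-5"] <- "-4"
toy$criticality[toy$criticality == "5"] <- "4"

# Effect of supplying a theme column: "Couldn't be improved"
# is always given a criticality of "3".
toy$criticality[toy$theme == "Couldn't be improved"] <- "3"

print(toy)
```

After both steps no "-5" or "5" remains, and every "Couldn't be improved" row carries a criticality of "3".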

Details

This function brings together the three functions that run chunks of the process independently, namely splitting data into training and test sets (factory_data_load_and_split_r), building and fitting the pipeline (factory_pipeline_r) on the whole dataset (train and test), and assessing pipeline performance (factory_model_performance_r).
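
To give a feel for the default metric, the sketch below computes a class balance accuracy score in base R. It assumes the usual definition: for each class, the number of correct predictions divided by the larger of the class's true count and its predicted count, averaged over classes. The function name class_balance_accuracy is hypothetical, and this page does not confirm that the package's scorer matches this definition exactly.

```r
# Hypothetical helper; not part of pxtextmineR.
class_balance_accuracy <- function(y_true, y_pred) {
  classes <- sort(unique(c(y_true, y_pred)))
  # Confusion matrix with true classes in rows and predictions in columns
  cm <- table(factor(y_true, levels = classes),
              factor(y_pred, levels = classes))
  # Per class: correct predictions over the larger of row and column totals
  per_class <- diag(cm) / pmax(rowSums(cm), colSums(cm))
  mean(per_class)
}

y_true <- c("a", "a", "a", "a", "b", "b")
y_pred <- c("a", "a", "a", "b", "b", "a")

class_balance_accuracy(y_true, y_pred)  # 0.625
mean(y_true == y_pred)                  # plain accuracy, about 0.667
```

On this imbalanced toy example the score is lower than plain accuracy, because the poorly predicted minority class weighs as much as the majority class.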

For details on what the pipeline does/how it works, see factory_pipeline_r's Details section.
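
The ordinal option refers to Frank and Hall (2001). As a rough illustration of that idea, and not of the package's own code, the sketch below fits one binary model per class threshold and combines the threshold probabilities into class probabilities. Logistic regression (glm) stands in for the scikit-learn learners, and the simulated data and all variable names are assumptions.

```r
set.seed(1)
n <- 300
x <- rnorm(n)
# Ordered three-level response generated from a noisy latent score
latent <- x + rnorm(n, sd = 0.7)
y <- cut(latent, breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = c("low", "mid", "high"), ordered_result = TRUE)
dat <- data.frame(x = x, y = y)
lev <- levels(dat$y)
k <- length(lev)

# Step 1: one binary model per threshold, each estimating P(y > level i)
p_greater <- sapply(seq_len(k - 1), function(i) {
  d <- data.frame(x = dat$x, above = as.integer(as.integer(dat$y) > i))
  fit <- glm(above ~ x, family = binomial(), data = d)
  predict(fit, newdata = dat, type = "response")
})

# Step 2: convert threshold probabilities into class probabilities
probs <- cbind(
  1 - p_greater[, 1],                                  # P(y = lowest)
  p_greater[, -(k - 1), drop = FALSE] -
    p_greater[, -1, drop = FALSE],                     # middle classes
  p_greater[, k - 1]                                   # P(y = highest)
)
colnames(probs) <- lev

# Step 3: predict the class with the largest estimated probability
pred <- factor(lev[max.col(probs, ties.method = "first")],
               levels = lev, ordered = TRUE)
mean(pred == dat$y)
```

Because the binary models are fitted separately, a middle-class probability can occasionally come out slightly negative. The rows still sum to one, and the class with the largest value is taken as the prediction.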

Value

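A list of length 7. The Examples section accesses the elements tuning_results, p_compare_models_bar, pred, index_training_data and index_test_data.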

References

Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001 145–156.

Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

Examples

# We can prepare the data, build and fit the pipeline, and get performance
# metrics in two ways. One way is to run the factory_* functions independently.
# The commented-out script right below would do exactly that.

# Prepare training and test sets
# data_splits <- pxtextmineR::factory_data_load_and_split_r(
#   filename = pxtextmineR::text_data,
#   target = "label",
#   predictor = "feedback",
#   test_size = 0.90) # Make a small training set for a faster run in this example
#
# # Fit the pipeline
# pipe <- pxtextmineR::factory_pipeline_r(
#   x = data_splits$x_train,
#   y = data_splits$y_train,
#   tknz = "spacy",
#   ordinal = FALSE,
#   metric = "class_balance_accuracy_score",
#   cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
#   learners = c("SGDClassifier", "MultinomialNB")
# )
# (SGDClassifier represents both logistic regression and linear SVM. This
# depends on the value of the "loss" hyperparameter, which can be "log" or
# "hinge". This is set internally in factory_pipeline_r).
#
# # Assess model performance
# pipe_performance <- pxtextmineR::factory_model_performance_r(
#   pipe = pipe,
#   x_train = data_splits$x_train,
#   y_train = data_splits$y_train,
#   x_test = data_splits$x_test,
#   y_test = data_splits$y_test,
#   metric = "accuracy_score")

# Alternatively, we can use text_classification_pipeline_r() to do everything in
# one go.
text_pipe <- pxtextmineR::text_classification_pipeline_r(
  filename = pxtextmineR::text_data,
  target = 'label',
  predictor = 'feedback',
  test_size = 0.33,
  ordinal = FALSE,
  tknz = "spacy",
  metric = "class_balance_accuracy_score",
  cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
  learners = c("SGDClassifier", "MultinomialNB"),
  reduce_criticality = FALSE,
  theme = NULL
)

names(text_pipe)

# The pipe operator %>% comes from magrittr (dplyr re-exports it too)
library(magrittr)

# Let's compare pipeline performance for different tunings with a range of
# metrics, averaging the cross-validation metrics over the folds.
text_pipe$tuning_results %>%
  dplyr::select(learner, dplyr::contains("mean_test"))

# A glance at the (hyper)parameters and their tuned values
text_pipe$tuning_results %>%
  dplyr::select(learner, dplyr::contains("param_")) %>%
  str()

# Learner performance barplot
text_pipe$p_compare_models_bar

# Predictions on test set
preds <- text_pipe$pred
head(preds)

# We can also get the row indices of the train and test data. Note that, in
# Python, indices start from 0. For example, the row indices of a data frame
# with 5 rows would be 0, 1, 2, 3 & 4.
head(sort(text_pipe$index_training_data))
head(sort(text_pipe$index_test_data))

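# Let's subset the original data set. R selects rows by position, starting
# from 1, so the 0-based Python indices are shifted by one before subsetting.
text_dataset <- pxtextmineR::text_data
rownames(text_dataset) <- 0:(nrow(text_dataset) - 1)
data_train <- text_dataset[as.integer(text_pipe$index_training_data) + 1L, ]
data_test <- text_dataset[as.integer(text_pipe$index_test_data) + 1L, ]
str(data_train)
str(data_test)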
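
R indexes rows from 1 and silently drops a 0 in a numeric index, so 0-based Python indices used as they are select the wrong rows. The sketch below shows the effect on hypothetical toy data, assuming the indices arrive in R as a numeric vector.

```r
# Hypothetical toy data and indices, for illustration only.
toy <- data.frame(id = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
py_index <- c(0, 2, 4)  # 0-based positions, as Python would report them

# Used as-is: the 0 is dropped and 2 and 4 are read as R positions.
wrong <- toy[py_index, , drop = FALSE]

# Shifted by one: the rows that Python meant.
right <- toy[py_index + 1, , drop = FALSE]

wrong$id  # "b" "d"
right$id  # "a" "c" "e"
```

Setting row names to 0, 1, 2, ... does not change this: a numeric index always selects by position, and only a character index is matched against row names.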

nhs-r-community/pxtextmineR documentation built on Dec. 22, 2021, 2:10 a.m.