Description Usage Arguments Details Value References Examples
View source: R/text_classification_pipeline_r.R
Split the data, build and fit the pipeline, produce performance metrics.
text_classification_pipeline_r(
filename,
target,
predictor,
test_size = 0.33,
ordinal = FALSE,
tknz = "spacy",
metric = "class_balance_accuracy_score",
cv = 2,
n_iter = 10,
n_jobs = 1,
verbose = 3,
learners = c("SGDClassifier"),
reduce_criticality = FALSE,
theme = NULL
)
filename |
A data frame with the data (class and text columns); otherwise, the name of the dataset (CSV), including the full path to the data folder (if not in the project's working directory) and the data type suffix (".csv").
target |
String. The name of the response variable. |
predictor |
String. The name of the predictor variable. |
test_size |
Numeric. Proportion of data that will form the test dataset. |
ordinal |
Logical. Whether to fit an ordinal classification model. The ordinal model is the implementation of Frank and Hall (2001), which can use any standard classification model that calculates probabilities.
tknz |
Tokenizer to use ("spacy" or "wordnet"). |
metric |
String. Scorer to use during pipeline tuning ("accuracy_score", "balanced_accuracy_score", "matthews_corrcoef", "class_balance_accuracy_score"). |
cv |
Number of cross-validation folds. |
n_iter
Number of parameter settings that are sampled (see sklearn.model_selection.RandomizedSearchCV).
n_jobs
Number of jobs to run in parallel (see sklearn.model_selection.RandomizedSearchCV).
verbose
Controls the verbosity (see sklearn.model_selection.RandomizedSearchCV).
learners
Vector. The names of the Scikit-learn learners to try during tuning (e.g. "SGDClassifier", "MultinomialNB").
reduce_criticality
Logical. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that hold data on criticality. Defaults to FALSE.
theme
String. For internal use by Nottinghamshire Healthcare NHS Foundation Trust or other trusts that use theme labels ("Access", "Environment/ facilities" etc.). The column name of the theme variable. Defaults to NULL.
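The ordinal option implements the Frank and Hall (2001) decomposition referenced above. As a rough base-R illustration (a toy sketch, not pxtextmineR code), per-class probabilities for K ordered classes can be recovered by differencing the K - 1 binary "greater-than" probabilities:

```r
# Frank & Hall (2001): fit K - 1 binary models estimating P(y > class_k),
# then recover per-class probabilities by differencing. Toy example, K = 3:
p_gt <- c(0.9, 0.4)  # P(y > class_1), P(y > class_2) from two binary models

p_class <- c(
  1 - p_gt[1],        # P(y = class_1)
  p_gt[1] - p_gt[2],  # P(y = class_2)
  p_gt[2]             # P(y = class_3)
)

p_class       # 0.1 0.5 0.4
sum(p_class)  # the recovered probabilities sum to 1
```

Any probabilistic classifier can supply the binary "greater-than" models, which is why the ordinal wrapper works with the same learners as the standard pipeline.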
This function brings together the three functions that run chunks of
the process independently, namely splitting data into training and test
sets (factory_data_load_and_split_r), building and fitting
the pipeline (factory_pipeline_r) on the whole dataset
(train and test), and assessing pipeline performance
(factory_model_performance_r).
For details on what the pipeline does and how it works, see the Details section of factory_pipeline_r.
A list of length 7:
A fitted Scikit-learn pipeline containing a number of objects
that can be accessed with the $ sign (see Examples). For a
partial list see "Attributes" in
sklearn.model_selection.RandomizedSearchCV.
Do not be surprised if the pipeline contains more objects than
those in the aforementioned "Attributes" list. Python objects can
contain several objects, from numeric results (e.g. the
pipeline's accuracy), to methods (i.e. functions in the R
lingo) and classes. In Python, these are normally accessed with
object.<whatever>, but in R the command is object$<whatever>.
For instance, one can access the predict() method to make
predictions on unseen data. See Examples.
tuning_results Data frame. All (hyper)parameter values
and models tried during fitting.
pred Vector. The predictions on the test set.
accuracy_per_class Data frame. Accuracies per class.
p_compare_models_bar A bar plot comparing the mean scores (of
the user-supplied metric parameter) from the cross-validation
on the training set, for the best (hyper)parameter values for
each learner.
index_training_data The row names/indices of the training
data. Note that, in Python, indices start from 0 and go up to
number_of_records - 1. See Examples.
index_test_data The row names/indices of the test data. Note
that, in Python, indices start from 0 and go up to
number_of_records - 1. See Examples.
Frank E. & Hall M. (2001). A Simple Approach to Ordinal Classification. Machine Learning: ECML 2001, 145–156.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.
# We can prepare the data, build and fit the pipeline, and get performance
# metrics, in two ways. One way is to run the factory_* functions independently.
# The commented-out script below does exactly that.
# Prepare training and test sets
# data_splits <- pxtextmineR::factory_data_load_and_split_r(
# filename = pxtextmineR::text_data,
# target = "label",
# predictor = "feedback",
# test_size = 0.90) # Make a small training set for a faster run in this example
#
# # Fit the pipeline
# pipe <- pxtextmineR::factory_pipeline_r(
# x = data_splits$x_train,
# y = data_splits$y_train,
# tknz = "spacy",
# ordinal = FALSE,
# metric = "class_balance_accuracy_score",
# cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
# learners = c("SGDClassifier", "MultinomialNB")
# )
# (SGDClassifier represents both logistic regression and linear SVM. This
# depends on the value of the "loss" hyperparameter, which can be "log" or
# "hinge". This is set internally in factory_pipeline_r).
#
# # Assess model performance
# pipe_performance <- pxtextmineR::factory_model_performance_r(
# pipe = pipe,
# x_train = data_splits$x_train,
# y_train = data_splits$y_train,
# x_test = data_splits$x_test,
# y_test = data_splits$y_test,
# metric = "accuracy_score")
# Alternatively, we can use text_classification_pipeline_r() to do everything in
# one go.
text_pipe <- pxtextmineR::text_classification_pipeline_r(
filename = pxtextmineR::text_data,
target = 'label',
predictor = 'feedback',
test_size = 0.33,
ordinal = FALSE,
tknz = "spacy",
metric = "class_balance_accuracy_score",
cv = 2, n_iter = 10, n_jobs = 1, verbose = 3,
learners = c("SGDClassifier", "MultinomialNB"),
reduce_criticality = FALSE,
theme = NULL
)
names(text_pipe)
# Let's compare pipeline performance for different tunings with a range of
# metrics, averaging the cross-validation scores across folds.
text_pipe$
tuning_results %>%
dplyr::select(learner, dplyr::contains("mean_test"))
# A glance at the (hyper)parameters and their tuned values
text_pipe$
tuning_results %>%
dplyr::select(learner, dplyr::contains("param_")) %>%
str()
# Learner performance barplot
text_pipe$p_compare_models_bar
# Predictions on test set
preds <- text_pipe$pred
head(preds)
# We can also get the row indices of the train and test data. Note that, in
# Python, indices start from 0. For example, the row indices of a data frame
# with 5 rows would be 0, 1, 2, 3 & 4.
head(sort(text_pipe$index_training_data))
head(sort(text_pipe$index_test_data))
# Let's subset the original data set
text_dataset <- pxtextmineR::text_data
rownames(text_dataset) <- 0:(nrow(text_dataset) - 1)
data_train <- text_dataset[text_pipe$index_training_data, ]
data_test <- text_dataset[text_pipe$index_test_data, ]
str(data_train)
str(data_test)
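# As noted in the Value section, the fitted Scikit-learn pipeline itself can
# score new text. A sketch, assuming the pipeline is the first element of the
# returned list (inspect names(text_pipe) to confirm the element name) and
# that, as with reticulate-wrapped Python objects, predict() is exposed via $:
fitted_pipe <- text_pipe[[1]]
new_preds <- fitted_pipe$predict(data_test$feedback)
head(new_preds)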