classify_tweets: Classify tweets into political and non-political
In haukelicht/politicaltweets: Classify political tweets

Description

Takes a data frame of tweet features as input and obtains a prediction for each sample using an ensemble classifier.

Usage

classify_tweets(
  x,
  model = ensemble.model,
  na.rm = TRUE,
  threshold = 0.5,
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE
)

## Default S3 method:
classify_tweets(
  x,
  model,
  na.rm = TRUE,
  threshold = 0.5,
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE
)

## S3 method for class 'caretEnsemble'
classify_tweets(
  x,
  model = ensemble.model,
  na.rm = TRUE,
  threshold = 0.5,
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE
)

## S3 method for class 'caretList'
classify_tweets(
  x,
  model = constituent.models,
  na.rm = TRUE,
  threshold = 0.5,
  blend.by = "PR-AUC",
  .train.ctrl = trainControl(
    method = "repeatedcv", number = 10, repeats = 10, search = "grid",
    returnData = FALSE, returnResamp = "none", savePredictions = "none",
    classProbs = TRUE, summaryFunction = superSumFun, allowParallel = TRUE
  ),
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE,
  .cache.model = FALSE,
  .cache.path = getOption("politicaltweets.cache.path")
)

Arguments

`x`	a data frame object of tweet features/predictor variables
`model`	Either a 'caretEnsemble' object (created with `caretEnsemble`) or a 'caretList' object (i.e., a list of `train` objects obtained with `caretList`). Defaults to the 'caretEnsemble' object `ensemble.model`. See section "Using `classify_tweets` when `model` is a 'caretList' object" for details.
`na.rm`	logical. Remove rows with missing values list-wise? If `TRUE` (default), rows with any missing value (`NA`, `NaN`, or `Inf`) on a predictor variable (see `?get_model_predvars`) are dropped. Information on removed rows is recorded in the attributes "removed.rows" (row indexes) and "removed.rows.nas" (list of lists recording the names of the predictor variables/features with missing values).
`threshold`	a unit-length double vector in (0, 1), specifying the (predicted) probability threshold used to classify samples as positive (i.e., "political") instances.
`...`	Additional arguments passed to specific method and `predict`.
`.predict.type`	a unit-length character string, either "prob" (obtain predicted probabilities, the default) or "raw" (obtain predicted classes).
`.add`	logical: Column-bind (add) predictions to `x` before returning? Default is `FALSE`.
`.verbose`	logical. Print messages to console informing about what the function is doing.
`.debug`	logical. Defaults to `FALSE`. If `TRUE` and predicting the classes of new samples fails, a message reporting the source of the error is printed to the console.
`blend.by`	a unit-length character string determining the evaluation metric based on which constituent models (base learners) should be blended into the ensemble classifier (see section "Ensemble classifier")
`.train.ctrl`	a list object created by calling `trainControl`. Make sure that the summary function you specify when setting up training controls (argument `summaryFunction` of `trainControl`) returns the evaluation metric used to blend models, i.e., `blend.by`.
`.cache.model`	logical. Cache ensemble classifiers obtained from `model` using `blend.by`. Default is `FALSE`.
`.cache.path`	a unit-length character string specifying where to write cached ensemble classifiers if `.cache.model = TRUE`. Defaults to the "cache" directory inside the package directory (see `getOption("politicaltweets.cache.path")`).

Details

classify_tweets can handle two types of model input:

Lists of pre-trained base learner models: This is the default behavior if the input to argument model is a 'caretList' object (i.e., a list of pre-trained base learners). In this case, the base learners are first "blended" into a greedy ensemble classifier, and the resulting ensemble model is then used to classify samples in x.
Pre-trained ensemble classifiers: If the input to argument model is a 'caretEnsemble' object, this ensemble model is directly used to classify samples in x.

Value

A data frame of predictions. Check attribute "removed.rows" for indexes of removed rows and "removed.rows.nas" for corresponding missing value information if na.rm = TRUE.
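
The row-removal bookkeeping described above can be illustrated with a minimal base R sketch. It does not use the package's own code: the helper `drop_incomplete_rows` and the feature names are hypothetical, all predictors are assumed to be numeric, and the "removed.rows.nas" attribute is simplified to a list of character vectors.

```r
# Hypothetical helper (not part of politicaltweets): drop rows with NA, NaN,
# or Inf in the predictor columns and record what was removed as attributes.
drop_incomplete_rows <- function(x, predvars) {
  # Logical matrix: TRUE where a predictor cell is NA, NaN, or +/-Inf
  bad <- do.call(cbind, lapply(x[predvars], function(col) !is.finite(col)))
  idx <- which(rowSums(bad) > 0)
  out <- if (length(idx) > 0) x[-idx, , drop = FALSE] else x
  # Indexes of removed rows (relative to the input)
  attr(out, "removed.rows") <- idx
  # For each removed row, the names of the predictors with missing values
  attr(out, "removed.rows.nas") <- lapply(idx, function(i) predvars[bad[i, ]])
  out
}

# Made-up feature data for illustration only
feats <- data.frame(
  n_chars    = c(120, NA, 80, 95),
  n_hashtags = c(1, 2, Inf, 0),
  n_mentions = c(0, NaN, 3, 1)
)

clean <- drop_incomplete_rows(feats, names(feats))
attr(clean, "removed.rows")      # rows 2 and 3 were dropped
attr(clean, "removed.rows.nas")  # which predictors were missing in each
```

Keeping the removed indexes as an attribute makes it possible to align the returned predictions with the rows of the original input afterwards.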

Methods (by class)

default: Default method (when model is neither a 'caretList' nor a 'caretEnsemble' object)
caretEnsemble: Method when model is a 'caretEnsemble' object (i.e., a pre-trained ensemble model)
caretList: Method when model is a 'caretList' object (i.e., a list of pre-trained base learner models)
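
The method list above can be pictured with a small S3 sketch. It assumes that dispatch happens on the class of `model` rather than `x`; the generic `classify_sketch`, the placeholder return strings, and the error raised by the default method are all hypothetical and only illustrate the control flow.

```r
# Hypothetical generic mirroring the three documented methods
classify_sketch <- function(x, model, ...) {
  UseMethod("classify_sketch", model)  # dispatch on `model`, not on `x`
}

# Assumption: any other kind of model input is rejected
classify_sketch.default <- function(x, model, ...) {
  stop("`model` must be a 'caretEnsemble' or a 'caretList' object")
}

classify_sketch.caretEnsemble <- function(x, model, ...) {
  "use the pre-trained ensemble directly"
}

classify_sketch.caretList <- function(x, model, ...) {
  "blend the base learners into an ensemble first, then predict"
}

# Empty stand-in objects that only carry the relevant class attribute
ens <- structure(list(), class = "caretEnsemble")
lst <- structure(list(), class = "caretList")

classify_sketch(data.frame(), ens)
classify_sketch(data.frame(), lst)
```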

Using `classify_tweets` when `model` is a 'caretList' object

By default, four constituent models are used to create the ensemble classifier (see ?constituent.models):

glmnet: a generalized linear model (GLM) with Elastic-Net regularization (glmnet)
svmRadial: a Support Vector Machine (SVM) with a radial kernel (ksvm with kernel = "rbfdot")
ranger: a Random Forest (ranger)
xgbTree: an eXtreme Gradient Boosting (XGBoost) machine (xgboost with learner = "tree")

The ensemble classifier is obtained by "blending" constituent models using a generalized linear model (GLM). This is done by a call to the caretEnsemble function (see vignette("caretEnsemble-intro", package = "caretEnsemble")).

The blend.by argument determines which evaluation metric is used to "blend" constituent models. It is passed to the metric argument when calling caretEnsemble, which, in turn, forwards metric to train when training the GLM with method = "glm".
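
The idea behind GLM blending can be sketched in base R without calling caretEnsemble. The example below is an illustration under stated assumptions, not the package's implementation: the base learners' predicted probabilities are simulated rather than produced by trained models, and the helper `noisy_prob` is hypothetical.

```r
set.seed(42)
n <- 500
y <- rbinom(n, 1, 0.3)  # 1 = political, 0 = non-political (simulated labels)

# Hypothetical helper: simulate a base learner's predicted probabilities
# as a noisy signal of the true label (larger sd = weaker learner)
noisy_prob <- function(y, sd) plogis(2 * (y - 0.5) + rnorm(length(y), sd = sd))

# Stand-ins for the held-out predictions of the four constituent models
preds <- data.frame(
  glmnet    = noisy_prob(y, 1.0),
  svmRadial = noisy_prob(y, 1.2),
  ranger    = noisy_prob(y, 0.8),
  xgbTree   = noisy_prob(y, 0.9)
)

# Blend: logistic regression of the label on the base learners' predictions
blender <- glm(y ~ ., data = cbind(y = y, preds), family = binomial())
coef(blender)  # one weight per base learner, plus an intercept

# Blended probability of the positive class for each sample
blended <- predict(blender, newdata = preds, type = "response")
head(blended)
```

In this sketch the GLM coefficients play the role of blending weights: a learner whose predictions track the label more closely tends to receive a larger coefficient.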

Classifying samples in `x`

To classify samples in x, the ensemble model is passed to the object argument when calling caretEnsemble's predict method.

By default (.predict.type = "prob"), predicted probabilities for the "yes" (political) class are obtained, and a classification into "yes" and "no" is induced based on the threshold (default is .5). That is, all samples with a predicted probability ≥ threshold are classified as "yes" instances.

Alternatively, you can directly obtain an assignment into classes by setting .predict.type = "raw". CAUTION: In the latter case, threshold will have no effect, and the default threshold of .5 is always used.
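
The thresholding rule for .predict.type = "prob" can be written out in a few lines of base R. This is a minimal sketch: the helper `classify_by_threshold` and the probabilities are hypothetical, and the class labels "yes" and "no" follow the description above.

```r
# Hypothetical helper: classify as "yes" when the predicted probability of
# the positive class is greater than or equal to the threshold
classify_by_threshold <- function(prob_yes, threshold = 0.5) {
  stopifnot(length(threshold) == 1, threshold > 0, threshold < 1)
  factor(ifelse(prob_yes >= threshold, "yes", "no"), levels = c("no", "yes"))
}

# Made-up predicted probabilities for four samples
prob_yes <- c(0.12, 0.50, 0.73, 0.49)

classify_by_threshold(prob_yes)                   # no yes yes no
classify_by_threshold(prob_yes, threshold = 0.7)  # no no yes no
```

Note that a sample whose probability equals the threshold exactly is classified as "yes". Raising the threshold trades recall for precision on the "political" class.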

haukelicht/politicaltweets documentation built on July 3, 2023, 4:11 a.m.