classify_tweets: Classify tweets into political and non-political

View source: R/classify_tweets.R

classify_tweets R Documentation

Classify tweets into political and non-political

Description

Takes a data frame of tweet features as input and obtains a prediction for each sample using an ensemble classifier.

Usage

classify_tweets(
  x,
  model = ensemble.model,
  na.rm = TRUE,
  threshold = 0.5,
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE
)

## Default S3 method:
classify_tweets(
  x,
  model,
  na.rm = TRUE,
  threshold = 0.5,
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE
)

## S3 method for class 'caretEnsemble'
classify_tweets(
  x,
  model = ensemble.model,
  na.rm = TRUE,
  threshold = 0.5,
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE
)

## S3 method for class 'caretList'
classify_tweets(
  x,
  model = constituent.models,
  na.rm = TRUE,
  threshold = 0.5,
  blend.by = "PR-AUC",
  .train.ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
    search = "grid", returnData = FALSE, returnResamp = "none",
    savePredictions = "none", classProbs = TRUE, summaryFunction = superSumFun,
    allowParallel = TRUE),
  ...,
  .predict.type = "prob",
  .add = FALSE,
  .verbose = TRUE,
  .debug = FALSE,
  .cache.model = FALSE,
  .cache.path = getOption("politicaltweets.cache.path")
)

Arguments

x

a data frame object of tweet features/predictor variables

model

Either

  • a 'caretEnsemble' object (created with caretEnsemble), or

  • a 'caretList' object (i.e., a list of train objects obtained with caretList).

Defaults to the 'caretEnsemble' object ensemble.model. See section "Using classify_tweets when model is a 'caretList' object" for details.

na.rm

logical. Remove rows with missing values listwise? If TRUE (default), rows with any missing value (NA, NaN, or Inf) on a predictor variable (see ?get_model_predvars) are dropped. Information on removed rows is recorded in the attributes "removed.rows" (row indexes) and "removed.rows.nas" (a list of lists recording the names of the predictor variables/features with missing values).

threshold

a unit-length double vector in (0, 1), specifying the (predicted) probability threshold used to classify samples as positive (i.e., "political") instances.

...

Additional arguments passed to the specific method and to predict.

.predict.type

a unit-length character string, either "prob" (obtain predicted probabilities, the default) or "raw" (obtain predicted classes).

.add

logical: Column-bind (add) predictions to x before returning? Default is FALSE.

.verbose

logical. Print messages to the console about what the function is doing? Default is TRUE.

.debug

logical. Defaults to FALSE. If TRUE and predicting the classes of new samples fails, a message about the source of the error is printed to the console.

blend.by

a unit-length character string naming the evaluation metric by which constituent models (base learners) are blended into the ensemble classifier (see section "Using classify_tweets when model is a 'caretList' object"). Default is "PR-AUC".

.train.ctrl

a list object created by calling trainControl. Make sure that the summary function you specify when setting up training controls (argument summaryFunction of trainControl) returns the evaluation metric used to blend models, i.e., blend.by.

.cache.model

logical. Cache ensemble classifiers obtained from model using blend.by. Default is FALSE.

.cache.path

a unit-length character string specifying where to write cached ensemble classifiers if .cache.model = TRUE. Default is the "cache" directory inside the package directory on the file system (see getOption("politicaltweets.cache.path")).

Details

classify_tweets can handle two types of model input:

  1. Lists of pre-trained base learner models: If the input to argument model is a 'caretList' object (i.e., a list of pre-trained base learners), the base learners are first "blended" into a greedy ensemble classifier, and the resulting ensemble model is then used to classify samples in x.

  2. Pre-trained ensemble classifiers: If the input to argument model is a 'caretEnsemble' object, this ensemble model is directly used to classify samples in x.

Value

A data frame of predictions. Check attribute "removed.rows" for indexes of removed rows and "removed.rows.nas" for corresponding missing value information if na.rm = TRUE.
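The bookkeeping behind na.rm = TRUE can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper drop_incomplete_rows and the feature columns f1 and f2 are hypothetical, and only the attribute names "removed.rows" and "removed.rows.nas" are taken from this page.

```r
# Toy feature data; the column names are made up for illustration.
x <- data.frame(
  f1 = c(0.2, NA, 1.5, 0.7),
  f2 = c(1.0, 2.0, Inf, 0.3)
)

# Hypothetical helper (an assumption, not part of politicaltweets):
# drop rows with NA, NaN, or Inf on any predictor and record what was removed.
drop_incomplete_rows <- function(x, predvars = names(x)) {
  # is.finite() is FALSE for NA, NaN, Inf, and -Inf
  bad <- !is.finite(as.matrix(x[predvars]))
  removed <- unname(which(rowSums(bad) > 0))
  out <- if (length(removed) > 0) x[-removed, , drop = FALSE] else x
  # indexes of the dropped rows
  attr(out, "removed.rows") <- removed
  # for each dropped row, the predictors that had missing values
  attr(out, "removed.rows.nas") <- lapply(removed, function(i) predvars[bad[i, ]])
  out
}

clean <- drop_incomplete_rows(x)
attr(clean, "removed.rows")
attr(clean, "removed.rows.nas")
```

On output returned by classify_tweets, the same two attr() calls should retrieve the recorded information.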

Methods (by class)

  • default: Default method (when model is neither a 'caretList' nor a 'caretEnsemble' object)

  • caretEnsemble: Method when model is a 'caretEnsemble' object (i.e., a pre-trained ensemble model)

  • caretList: Method when model is a 'caretList' object (i.e., a list of pre-trained base learner models)
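The method selection listed above follows standard S3 dispatch on the class of model (the second argument). The sketch below mimics that pattern with a hypothetical generic, classify_sketch, and empty stand-in objects. What the package's default method actually does is not documented here, so the error raised below is an assumption.

```r
# Hypothetical generic that dispatches on `model` rather than on `x`.
classify_sketch <- function(x, model, ...) UseMethod("classify_sketch", model)

# Fallback for unsupported inputs (raising an error is an assumption).
classify_sketch.default <- function(x, model, ...) {
  stop("`model` must be a 'caretList' or a 'caretEnsemble' object")
}

# A pre-trained ensemble would be used for prediction directly.
classify_sketch.caretEnsemble <- function(x, model, ...) {
  "predict with the ensemble directly"
}

# A list of base learners would be blended first.
classify_sketch.caretList <- function(x, model, ...) {
  "blend base learners, then predict with the resulting ensemble"
}

# Empty stand-ins that only carry the class attribute.
ens <- structure(list(), class = "caretEnsemble")
lst <- structure(list(), class = "caretList")

classify_sketch(NULL, ens)
classify_sketch(NULL, lst)
```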

Using classify_tweets when model is a 'caretList' object

By default, four constituent models are used to create the ensemble classifier (see ?constituent.models):

  • glmnet: a generalized linear model (GLM) with Elastic-Net regularization (glmnet)

  • svmRadial: a Support Vector Machine (SVM) with a radial kernel (ksvm with kernel = "rbfdot")

  • ranger: a Random Forest (ranger)

  • xgbTree: an eXtreme Gradient Boosting (XGBoost) machine (xgboost with learner = "tree")

The ensemble classifier is obtained by "blending" constituent models using a generalized linear model (GLM). This is done by a call to the caretEnsemble function (see vignette("caretEnsemble-intro", package = "caretEnsemble")).

The blend.by argument determines which evaluation metric is used to "blend" constituent models. It is passed to the metric argument when calling caretEnsemble, which, in turn, forwards metric to train when training the GLM with method = "glm".
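The idea of blending with a GLM can be sketched in base R: a logistic regression takes the base learners' predicted probabilities as inputs and learns how to weight them. The example uses simulated labels and two hypothetical base learners (p1, p2). It only illustrates the concept and does not reproduce what caretEnsemble does internally. A real stacking workflow would typically fit the blender on held-out predictions rather than on in-sample ones.

```r
set.seed(1)
n <- 200

# Simulated labels: 1 = "yes" (political), 0 = "no"
y <- rbinom(n, 1, 0.4)

# Simulated predicted probabilities from two hypothetical base learners;
# p1 is constructed to be somewhat more informative than p2.
p1 <- plogis(1.5 * (y - 0.5) + rnorm(n))
p2 <- plogis(1.0 * (y - 0.5) + rnorm(n))
preds <- data.frame(y = y, p1 = p1, p2 = p2)

# The blender: a logistic regression of the label on the base learners' predictions
blender <- glm(y ~ p1 + p2, data = preds, family = binomial())

# The coefficients act as blending weights
coef(blender)

# Blended probability of the positive class
blended <- predict(blender, newdata = preds, type = "response")
head(blended)
```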

Classifying samples in x

To classify samples in x, the ensemble model is passed to the object argument when calling caretEnsemble's predict method.

By default (.predict.type = "prob"), predicted probabilities for the "yes" (political) class are obtained, and a classification into "yes" and "no" is induced based on the threshold (default is .5). That is, all samples with a predicted probability ≥ threshold are classified as "yes" instances.
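The thresholding rule can be written out in a few lines of base R. The helper classify_by_threshold and the probability values are hypothetical and serve only to illustrate the rule "predicted probability ≥ threshold means yes".

```r
# Made-up predicted probabilities for the "yes" (political) class
probs <- c(0.10, 0.50, 0.49, 0.93)

# Hypothetical helper: turn probabilities into "yes"/"no" labels
classify_by_threshold <- function(p, threshold = 0.5) {
  # threshold must be a single value strictly between 0 and 1
  stopifnot(length(threshold) == 1, threshold > 0, threshold < 1)
  factor(ifelse(p >= threshold, "yes", "no"), levels = c("no", "yes"))
}

default_labels <- classify_by_threshold(probs)
strict_labels <- classify_by_threshold(probs, threshold = 0.9)

default_labels
strict_labels
```

Raising the threshold makes the classification more conservative: fewer samples are labeled "yes", which usually trades recall for precision.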

Alternatively, you can directly obtain an assignment into classes by setting .predict.type = "raw". CAUTION: In the latter case, threshold will have no effect, and the default threshold of .5 is always used.


haukelicht/politicaltweets documentation built on July 3, 2023, 4:11 a.m.