The politicaltweets R package provides functions to preprocess and classify tweets according to whether or not they are political, based on a pre-trained ensemble classifier.
remotes::install_github("haukelicht/politicaltweets")
Note that all but one of the package's dependencies are distributed via CRAN.
The one exception is the laserize package, which can be installed from GitHub.
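To classify a tweet, five steps are required:
1. Obtain tweets data using rtweet's lookup_statuses() (with parse = TRUE).
2. Pass the tweets data frame to argument x of create_tweet_features() to create a data frame of tweet features.
3. Pass the tweets data frame to argument x of create_tweet_text_representations() with .compute.pcs = FALSE to obtain tweet text embedding representations[^embedding].
4. Combine the tweet features and the text representations into one data frame, matching rows on status_id.
5. Pass the combined data frame to argument x of classify_tweets().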
[^embedding]: This step obtains LASER embedding representations of the tweet texts using the laserize package and projects them onto a pre-defined independent component space.
A minimal working example:
library(dplyr)
library(politicaltweets)
# instead of querying data from the Twitter API (step 1),
# we use a prototypical tweets data frame below
glimpse(tweets.df.prototype)
# step 2
tfeats <- create_tweet_features(tweets.df.prototype, .as.data.table = FALSE)
# step 3
ttreps <- create_tweet_text_representations(tweets.df.prototype, .compute.pcs = FALSE)
# step 4
temp <- as_tibble(tfeats) %>%
  left_join(mutate(as_tibble(ttreps$ics), status_id = rownames(ttreps$ics)))
# step 5
preds <- classify_tweets(temp, .debug = TRUE)
# inspect the result
cbind(temp[, c("text", "lang")], preds)
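Step 4 relies on the row names of ttreps$ics holding the status IDs, which are moved into a status_id column before joining. The following is a minimal base-R sketch of that alignment on toy data; the objects feats and ics and their values are hypothetical stand-ins, not output of the package.

```r
# Toy stand-in for the tweet features (hypothetical values)
feats <- data.frame(
  status_id = c("a1", "b2", "c3"),
  n_chars = c(10L, 25L, 7L),
  stringsAsFactors = FALSE
)

# Toy stand-in for the component matrix; row names hold the status IDs,
# deliberately in a different order than in feats
ics <- matrix(
  c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  nrow = 3,
  dimnames = list(c("c3", "a1", "b2"), c("ic1", "ic2"))
)

# Move the row names into an explicit key column
ics_df <- data.frame(status_id = rownames(ics), ics,
                     row.names = NULL, stringsAsFactors = FALSE)

# Align the rows of ics_df to feats by key, then bind the columns
idx <- match(feats$status_id, ics_df$status_id)
combined <- cbind(feats, ics_df[idx, c("ic1", "ic2")])
rownames(combined) <- NULL

print(combined)
```

Matching on an explicit key rather than on row position keeps the features and the text representations aligned even if the two objects are ordered differently.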
All functions exported by politicaltweets expect that data passed to their argument x conforms to the naming and typing conventions of tweets data frames set by the rtweet package.
A prototypical tweets data frame is distributed with the politicaltweets package; see ?tweets.df.prototype.
(Moreover, politicaltweets::required.tweets.df.cols maps required columns to the accepted classes.)
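To illustrate what conforming to naming and typing conventions means in practice, here is a minimal base-R sketch that checks a data frame against a mapping of column names to accepted classes. The mapping required_cols and the helper check_tweets_df() are hypothetical and written only for this illustration; the structure of the real politicaltweets::required.tweets.df.cols object may differ, so inspect it before adapting the sketch.

```r
# Hypothetical mapping of required columns to accepted classes
# (illustrative only; not the package's actual mapping)
required_cols <- list(
  status_id = "character",
  text = "character",
  lang = "character",
  created_at = c("POSIXct", "POSIXt")
)

# Hypothetical helper: report missing columns and columns of the wrong class
check_tweets_df <- function(x, required) {
  stopifnot(is.data.frame(x), is.list(required), !is.null(names(required)))
  missing_cols <- setdiff(names(required), names(x))
  present <- intersect(names(required), names(x))
  class_ok <- vapply(
    present,
    function(col) inherits(x[[col]], required[[col]]),
    logical(1)
  )
  wrong_class <- present[!class_ok]
  list(
    ok = length(missing_cols) == 0L && length(wrong_class) == 0L,
    missing = missing_cols,
    wrong_class = wrong_class
  )
}

# A conforming toy data frame
good <- data.frame(
  status_id = "1", text = "hello", lang = "en",
  created_at = as.POSIXct("2020-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)

# A non-conforming one: numeric status_id, two columns missing
bad <- data.frame(status_id = 1, text = "hello", stringsAsFactors = FALSE)

res_good <- check_tweets_df(good, required_cols)
res_bad <- check_tweets_df(bad, required_cols)
str(res_bad)
```

Running such a check before calling the package functions can surface a mistyped or missing column early, instead of as an error deeper in the pipeline.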
classify_tweets() with a pre-trained ensemble classifier
By default, classify_tweets() uses a list of four pre-trained models (see ?constituent.models for details) and "blends" them into an ensemble classifier using blend.by = "PR-AUC" (i.e., the blend maximizes the area under the precision-recall curve).
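For readers unfamiliar with the blending criterion, the sketch below computes the area under the precision-recall curve in base R using the step-wise "average precision" approximation, and scores two toy models and their simple average. The function pr_auc() and all data are hypothetical; how the package computes PR-AUC and how it weights the base learners is not shown here, and the equal-weight average is only a stand-in for a blend.

```r
# Hypothetical helper: step-wise area under the precision-recall curve
# (average precision); assumes binary 0/1 labels and no tied scores
pr_auc <- function(scores, labels) {
  stopifnot(
    length(scores) == length(labels),
    all(labels %in% c(0, 1)),
    sum(labels) > 0
  )
  ord <- order(scores, decreasing = TRUE)
  labels <- labels[ord]
  tp <- cumsum(labels)                  # true positives at each cutoff
  precision <- tp / seq_along(labels)
  recall <- tp / sum(labels)
  # Weight each precision value by the increase in recall at that cutoff
  sum(precision * diff(c(0, recall)))
}

# Toy labels and predicted probabilities from two hypothetical base learners
labels <- c(1, 1, 0, 1, 0)
scores_a <- c(0.9, 0.4, 0.6, 0.7, 0.2)
scores_b <- c(0.7, 0.8, 0.1, 0.3, 0.6)

# Equal-weight average as a simple stand-in for a blended ensemble
scores_blend <- (scores_a + scores_b) / 2

auc_a <- pr_auc(scores_a, labels)
auc_b <- pr_auc(scores_b, labels)
auc_blend <- pr_auc(scores_blend, labels)
print(c(a = auc_a, b = auc_b, blend = auc_blend))
```

In this toy case each base learner misranks one positive sample, while the average ranks all positives first, which is the kind of gain that blending by PR-AUC aims for.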
More generally, classify_tweets() can handle two types of model inputs:
1. model is a 'caretList' object (i.e., a list of pre-trained base learners). In this case, the base learners are first "blended" into a greedy ensemble classifier, and the resulting ensemble model is then used to classify samples in x.
2. model is a 'caretEnsemble' object. In this case, the ensemble model is used directly to classify samples in x.
Thus you can train o