create_tweet_text_representations: Create tweet text representations


create_tweet_text_representations    R Documentation

Create tweet text representations

Description

Obtains the LASER embeddings of tweet texts, as well as their principal and independent component representations.

Usage

create_tweet_text_representations(
  x,
  .keep.embeddings = FALSE,
  .compute.pcs = TRUE,
  .compute.ics = TRUE,
  .req.columns.mapping = tibble::tribble(
    ~colname, ~accepted_types,
    "status_id", c("character", "integer"),
    "text", c("character"),
    "lang", c("character")
  ),
  ...
)

Arguments

x

a data.frame, data.table, or tibble recording tweets. For required columns (naming and typing conventions), refer to ?required.tweets.df.cols. For an example, see ?tweets.df.prototype.

.keep.embeddings

logical. Include tweet text LASER embedding representations in the returned object? Default is FALSE because a double vector with 1024 elements is returned for each tweet, which may consume substantial amounts of working memory.

.compute.pcs

logical. Project tweet text LASER embeddings onto the pre-defined principal component space? Default is TRUE.

.compute.ics

logical. Project tweet text LASER embeddings onto the pre-defined independent component space? Default is TRUE.

.req.columns.mapping

a two-column data.frame mapping column names to (character vectors specifying) expected column classes. The first column must be named colname and have type character. The second column must be named accepted_types and be a list-column of character vectors. The default maps column name "status_id" to classes "character" or "integer", and "text" and "lang" to "character".

...

Additional arguments passed to laserize.
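
As an illustration of the .req.columns.mapping format described above, the following base R sketch builds a mapping equivalent to the documented default and checks a small data.frame against it. The helper check_required_columns is hypothetical and only meant to show how such a mapping can be read; it is not claimed to be how the package validates its input.

```r
# Build a mapping like the documented default, using base R only.
mapping <- data.frame(
  colname = c("status_id", "text", "lang"),
  stringsAsFactors = FALSE
)
# Assigning a list of length nrow(mapping) creates a list-column.
mapping$accepted_types <- list(
  c("character", "integer"),
  "character",
  "character"
)

# Hypothetical helper (not part of the package): returns the names of
# required columns that are missing from `x` or have an unaccepted class.
check_required_columns <- function(x, mapping) {
  ok <- mapply(
    function(col, types) col %in% names(x) && any(class(x[[col]]) %in% types),
    mapping$colname,
    mapping$accepted_types
  )
  mapping$colname[!ok]
}

tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("hello world", NA),
  lang = c("en", "en"),
  stringsAsFactors = FALSE
)

# An empty character vector means all required columns look fine.
check_required_columns(tweets, mapping)
```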

Details

Tweet text LASER embedding representations are obtained using laserize (see https://github.com/haukelicht/laserize). Each text is represented by a numeric vector with 1024 elements (the dimensionality of the LASER embedding space).
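
To get a feeling for the memory cost of keeping the full embeddings, the arithmetic can be sketched as follows. The function name embedding_size_gib is an assumption made for this example; the estimate only counts 8 bytes per double and ignores the small fixed overhead of the R matrix object.

```r
# Approximate size of an n-by-1024 matrix of doubles (8 bytes each),
# ignoring the fixed overhead of the R object itself.
embedding_size_gib <- function(n_tweets, n_dims = 1024L, bytes_per_double = 8) {
  n_tweets * n_dims * bytes_per_double / 1024^3
}

# One million tweets need roughly 7.6 GiB for the embedding matrix alone.
embedding_size_gib(1e6)

# Cross-check on a small matrix: the payload is 100 * 1024 * 8 bytes.
small <- matrix(0, nrow = 100, ncol = 1024)
as.numeric(object.size(small))
```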

To further reduce the dimensionality of tweet text representations, principal component analysis (PCA) and independent component analysis (ICA) can be applied by setting .compute.pcs and .compute.ics to TRUE, respectively.

These lower-dimensional representations are obtained by applying the rotation (PCA) and de-noising (ICA) matrices that were estimated by applying PCA and ICA to the model training and validation data before training classifiers. (The resulting objects can be accessed via laser.embedding.prcomp and laser.embedding.icomp.)
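
The idea of reusing a rotation estimated on training data can be sketched with stats::prcomp on toy data. This is a minimal illustration under assumptions: the matrices are random stand-ins, the dimensions are much smaller than 1024 and 300, and it is assumed (not confirmed here) that laser.embedding.prcomp behaves like a regular prcomp object.

```r
set.seed(1)

# Toy stand-ins: 50 "training" and 5 "new" texts in a 10-dimensional space
# (the real embeddings have 1024 dimensions).
train <- matrix(rnorm(50 * 10), nrow = 50)
new <- matrix(rnorm(5 * 10), nrow = 5,
              dimnames = list(paste0("id", 1:5), NULL))

# Fit PCA once on the training data and keep the first 3 components.
pca <- stats::prcomp(train, center = TRUE, scale. = FALSE, rank. = 3)

# Project new rows: centre with the training means, then rotate.
pcs_manual <- sweep(new, 2, pca$center) %*% pca$rotation

# predict() performs the same centring and rotation.
pcs_predict <- predict(pca, newdata = new)

all.equal(pcs_manual, pcs_predict)
```

The point of the sketch is that new observations are never used to re-estimate the components; they are only centred and rotated with quantities stored from the original fit.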

Specifically, when training the constituent models of the ensemble classifier, the full 1024-element text representations were not used as features. Instead, the dimensionality of the embedding space was reduced to 300 using independent component analysis (ICA) via fastICA.

To project new tweet texts' LASER embeddings onto the independent component space used during model training, we first obtain LASER embedding representations of the new tweets' texts. We then apply the projection matrices that were estimated from the original model training and validation data to obtain representations of the new tweets in the independent component space.

ICA estimates W such that XKW = S, where X is an n-by-1024 matrix of the n tweet text embedding representations, K is a 1024-by-300 pre-whitening matrix that projects X onto the first 300 principal components and is estimated from the original data, W is the 300-by-300 un-mixing matrix estimated from the original data, and S is the n-by-300 source matrix of independent components.

If e is the LASER embedding matrix of tweet texts in x, we thus compute the matrix product of e, laser.embedding.icomp$K, and laser.embedding.icomp$W to obtain ics, the independent component representation of e.
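
The product described above can be sketched with matrices of the documented shapes. The values below are random stand-ins, not the package's actual objects; in practice K and W would be taken from laser.embedding.icomp$K and laser.embedding.icomp$W, and any centring of e that the package may apply beforehand is not shown.

```r
set.seed(42)

n <- 4      # number of tweets in this toy example
d <- 1024   # LASER embedding dimensionality
k <- 300    # number of independent components

# Random stand-ins with the documented shapes.
e <- matrix(rnorm(n * d), nrow = n,
            dimnames = list(paste0("status", seq_len(n)), NULL))
K <- matrix(rnorm(d * k), nrow = d)  # 1024 x 300 pre-whitening matrix
W <- matrix(rnorm(k * k), nrow = k)  # 300 x 300 un-mixing matrix

# S = X K W, with X = e
ics <- e %*% K %*% W
colnames(ics) <- paste0("ic", seq_len(k))

dim(ics)  # one row per tweet, one column per independent component
```

Row names carry over from e, so each row of ics can be traced back to its tweet.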

Value

A list of four elements:

embeddings

If .keep.embeddings = FALSE (the default), NULL. Otherwise, a matrix recording LASER embedding representations of tweet texts. Row names correspond to x$status_id (but see Note below). Column names refer to the 1024 embedding dimensions and are named "e0001", ..., "e1024".

pcs

If .compute.pcs = TRUE (the default), a matrix recording principal component representation of tweet text LASER embeddings. Row names correspond to x$status_id (but see Note below). Column names refer to the 300 principal components and are named "pc1", ..., "pc300". Otherwise NULL.

ics

If .compute.ics = TRUE (the default), a matrix recording independent component representation of tweet text LASER embeddings. Row names correspond to x$status_id (but see Note below). Column names refer to the 300 independent components and are named "ic1", ..., "ic300". Otherwise NULL.

removed

integer vector reporting the indexes of rows removed from x because is.na(x$text) is TRUE.

Note

Note that, strictly speaking, matrix row names do not always correspond to x$status_id. When is.na(x$text) is TRUE for some rows, these rows are not contained in the returned matrices, and their status IDs are hence not present among the row names.
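
A short sketch of how the removed element can be used to realign the input with the returned matrices. The list res below is a hand-built mock of the documented return structure with placeholder values; it is not produced by calling the function.

```r
# Toy input: the second tweet has no text.
x <- data.frame(
  status_id = c("101", "102", "103"),
  text = c("first tweet", NA, "third tweet"),
  stringsAsFactors = FALSE
)

# Mock of the documented return structure (values are placeholders).
removed <- which(is.na(x$text))
kept <- setdiff(seq_len(nrow(x)), removed)
res <- list(
  embeddings = NULL,
  pcs = NULL,
  ics = matrix(0, nrow = length(kept), ncol = 300,
               dimnames = list(x$status_id[kept], paste0("ic", 1:300))),
  removed = removed
)

# Drop the removed rows from the input so that it lines up with the output.
# The length check avoids x[-integer(0), ], which would drop every row.
x_kept <- if (length(res$removed) > 0) x[-res$removed, , drop = FALSE] else x

identical(rownames(res$ics), x_kept$status_id)
```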


haukelicht/politicaltweets documentation built on July 3, 2023, 4:11 a.m.