create_tweet_text_representations: Create tweet text representations
In haukelicht/politicaltweets: Classify political tweets

View source: R/create_tweet_text_representations.R

Description

Function obtains the LASER embedding of tweet texts, as well as their principal and independent component representations.

Usage

create_tweet_text_representations(
  x,
  .keep.embeddings = FALSE,
  .compute.pcs = TRUE,
  .compute.ics = TRUE,
  .req.columns.mapping = tibble::tribble(~colname, ~accepted_types, "status_id",
    c("character", "integer"), "text", c("character"), "lang", c("character")),
  ...
)

Arguments

`x`	a `data.frame`, `data.table`, or `tibble` recording tweets. For the required columns (naming and typing conventions), refer to `?required.tweets.df.cols`. For an example, see `?tweets.df.prototype`.
`.keep.embeddings`	logical. Include the tweet text LASER embedding representations in the returned object? Default is `FALSE` because a 1024-element double vector is returned for each tweet, which may consume a substantial amount of working memory.
`.compute.pcs`	logical. Project the tweet text LASER embeddings onto the pre-defined principal component space? Default is `TRUE`.
`.compute.ics`	logical. Project the tweet text LASER embeddings onto the pre-defined independent component space? Default is `TRUE`.
`.req.columns.mapping`	a two-column `data.frame` mapping column names to (character vectors specifying) expected column classes. The first column must be named `colname` and have type character. The second column must be named `accepted_types` and be a list-column of character vectors. The default maps column name "status_id" to classes "character" or "integer", and "text" and "lang" to "character".
`...`	Additional arguments passed to `laserize`.
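
As an illustration of the shape `.req.columns.mapping` is described as having, the following sketch rebuilds the default mapping with base R instead of `tibble::tribble`. The `tweets` data frame and the class check at the end are hypothetical and only show how such a mapping could be read; they are not the package's own validation code.

```r
# Base-R equivalent of the default mapping shown under Usage
mapping <- data.frame(
  colname = c("status_id", "text", "lang"),
  stringsAsFactors = FALSE
)
# accepted_types must be a list-column of character vectors
mapping$accepted_types <- list(
  c("character", "integer"),
  "character",
  "character"
)

# Hypothetical tweets data frame used only for this illustration
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("Hello world", NA),
  lang = c("en", "en"),
  stringsAsFactors = FALSE
)

# Illustrative check: does each required column have an accepted class?
ok <- mapply(
  function(col, types) any(class(tweets[[col]]) %in% types),
  mapping$colname, mapping$accepted_types
)
```

A custom mapping would presumably follow the same two-column layout, with one row per required column.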

Details

Tweet text LASER embedding representations are obtained using laserize (see https://github.com/haukelicht/laserize). Each text is represented by a numeric vector with 1024 elements (the dimensionality of the LASER embedding space).

To further reduce the dimensionality of tweet text representations, principal component analysis (PCA) and independent component analysis (ICA) can be applied by setting .compute.pcs and .compute.ics to TRUE, respectively.

These lower-dimensional representations are obtained by applying the rotation (PCA) and de-noising (ICA) matrices that were estimated by applying PCA and ICA to the model training and validation data before training the classifiers. (The resulting objects can be accessed via laser.embedding.prcomp and laser.embedding.icomp.)
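
The idea of reusing a rotation matrix estimated on training data can be sketched with base R's `prcomp`. The data below are random toy matrices with 8 columns instead of 1024, and `k = 4` stands in for the 300 components; whether the package centers new embeddings in exactly this way is not stated here, so the centering step follows `prcomp`'s own convention and should be read as an assumption.

```r
set.seed(1)
# Toy stand-ins: 50 "training" embeddings and 3 "new" embeddings, 8 dimensions
train <- matrix(rnorm(50 * 8), nrow = 50)
new   <- matrix(rnorm(3 * 8), nrow = 3)

# Estimate the rotation once, on the training data only
pca <- prcomp(train, center = TRUE, scale. = FALSE)

k <- 4  # number of components kept in this toy example

# Center the new data with the training means, then apply the rotation
new_centered <- sweep(new, 2, pca$center)
pcs <- new_centered %*% pca$rotation[, seq_len(k)]
colnames(pcs) <- paste0("pc", seq_len(k))

# predict() performs the same projection
pcs_ref <- predict(pca, new)[, seq_len(k)]
```

The key point is that nothing is re-estimated on the new data: only the stored center and rotation from the training fit are used.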

Specifically, when training the constituent models of the ensemble classifier, the full 1024-element text representations were not used as features. Instead, the dimensionality of the embedding space was reduced to 300 using independent component analysis (ICA) via fastICA.

To project new tweet texts' LASER embeddings onto the independent component space used during model training, we first need to obtain LASER embedding representations of the new tweets' texts. We then apply the projection matrices estimated on the original model training and validation data to obtain representations of the new tweets in the independent component space.

ICA estimates W such that XKW = S, where X is an n-times-1024 matrix of the n tweet text embedding representations, K is a 1024-times-300 pre-whitening matrix that projects X onto the first 300 principal components and is estimated from the original data, W is the 300-times-300 un-mixing matrix estimated from the original data, and S is the n-times-300 source matrix of independent components.

If e is the LASER embedding matrix of the tweet texts in x, we thus compute the matrix product of e, laser.embedding.icomp$K, and laser.embedding.icomp$W to obtain ics, the independent component representation of e.
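
The matrix product described above can be sketched as follows. The matrices `e`, `K`, and `W` are random stand-ins with the dimensions given in the text, not the objects stored in laser.embedding.icomp, and any centering of `e` that may happen beforehand is left out of this sketch.

```r
set.seed(42)
n <- 5  # number of tweets in this toy example

# Random stand-ins with the dimensions described above
e <- matrix(rnorm(n * 1024), nrow = n)        # n x 1024 embeddings
K <- matrix(rnorm(1024 * 300), nrow = 1024)   # 1024 x 300 pre-whitening matrix
W <- matrix(rnorm(300 * 300), nrow = 300)     # 300 x 300 un-mixing matrix
rownames(e) <- paste0("id", seq_len(n))       # hypothetical status IDs

# S = X K W: matrix product, evaluated left to right
ics <- e %*% K %*% W
colnames(ics) <- paste0("ic", seq_len(ncol(ics)))

dim(ics)
```

Because `%*%` carries over the row names of its left operand, the status IDs attached to `e` remain attached to the rows of `ics`.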

Value

A list of four elements:

embeddings: If .keep.embeddings = FALSE (the default), NULL. Otherwise, a matrix recording LASER embedding representations of tweet texts. Row names correspond to x$status_id (but see Note below). Column names refer to the 1024 embedding vectors and are named "e0001", ..., "e1024".
pcs: If .compute.pcs = TRUE (the default), a matrix recording principal component representation of tweet text LASER embeddings. Row names correspond to x$status_id (but see Note below). Column names refer to the 300 principal components and are named "pc1", ..., "pc300". Otherwise NULL.
ics: If .compute.ics = TRUE (the default), a matrix recording independent component representation of tweet text LASER embeddings. Row names correspond to x$status_id (but see Note below). Column names refer to the 300 independent components and are named "ic1", ..., "ic300". Otherwise NULL.
removed: An integer vector reporting the indices of the rows removed from x because is.na(x$text) is TRUE.

Note

Note that, strictly speaking, the matrices' row names do not always correspond to x$status_id: when is.na(x$text) is TRUE for some rows, these rows are not contained in the returned matrices, and their status IDs are hence not present among the row names.
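
One way to line the returned matrices up with x again is to match on status IDs rather than on row position. The sketch below uses a hand-built mock of the returned list (`res`, with 2 instead of 300 components and all-zero values) so that it runs without the package; the names `res`, `kept`, and `x_kept` are illustrative only.

```r
x <- data.frame(
  status_id = c("101", "102", "103", "104"),
  text = c("first tweet", NA, "third tweet", "fourth tweet"),
  stringsAsFactors = FALSE
)

# Mock of the returned list, built by hand for illustration
removed <- which(is.na(x$text))
kept <- setdiff(seq_len(nrow(x)), removed)
res <- list(
  embeddings = NULL,
  pcs = NULL,
  ics = matrix(0, nrow = length(kept), ncol = 2,
               dimnames = list(x$status_id[kept], c("ic1", "ic2"))),
  removed = removed
)

# Align rows of x with rows of the matrix by status ID, not by position
idx <- match(rownames(res$ics), x$status_id)
x_kept <- x[idx, ]
```

Matching on IDs avoids the pitfall of `x[-res$removed, ]`, which returns zero rows when `removed` is empty.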

haukelicht/politicaltweets documentation built on July 3, 2023, 4:11 a.m.