View source: R/create_tweet_text_representations.R
create_tweet_text_representations | R Documentation |
Function obtains the LASER embedding of tweet texts, as well as their principal and independent component representations.
create_tweet_text_representations(
x,
.keep.embeddings = FALSE,
.compute.pcs = TRUE,
.compute.ics = TRUE,
.req.columns.mapping = tibble::tribble(~colname, ~accepted_types, "status_id",
c("character", "integer"), "text", c("character"), "lang", c("character")),
...
)
x |
a |
.keep.embeddings |
logical. Keep tweet text LASER embedding representations
in the return object?
Default is FALSE.
.compute.pcs |
logical. Project Tweet text LASER embedding onto pre-defined principal component space?
Default is TRUE.
.compute.ics |
logical. Project Tweet text LASER embedding onto pre-defined independent component space?
Default is TRUE.
.req.columns.mapping |
a two-column |
... |
Additional arguments passed to |
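As a rough illustration of what the default .req.columns.mapping asks for, the sketch below builds a small input with the three required columns and checks their types in base R. The helper check_required_columns() and the example data are hypothetical and not part of the package; whether the function itself compares types via typeof() or class() is an assumption.

```r
# Hypothetical input: three tweets with the columns named in the default
# .req.columns.mapping (status_id, text, lang).
x <- data.frame(
  status_id = c("1001", "1002", "1003"),
  text = c("Good morning!", "Guten Morgen!", NA),
  lang = c("en", "de", "en"),
  stringsAsFactors = FALSE
)

# Base-R stand-in for the default mapping: column name -> accepted types.
req_columns <- list(
  status_id = c("character", "integer"),
  text = "character",
  lang = "character"
)

# Illustrative helper (not part of the package): stops if a required column
# is missing or has a type outside the accepted set.
check_required_columns <- function(x, mapping) {
  missing_cols <- setdiff(names(mapping), names(x))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  ok <- vapply(
    names(mapping),
    function(col) typeof(x[[col]]) %in% mapping[[col]],
    logical(1)
  )
  if (!all(ok)) {
    stop("wrong type in columns: ", paste(names(ok)[!ok], collapse = ", "))
  }
  invisible(TRUE)
}

columns_ok <- check_required_columns(x, req_columns)
```

A data frame that passes this check has the shape the function expects for x under the default mapping.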
Tweet text LASER embedding representations are obtained using laserize
(see https://github.com/haukelicht/laserize).
Each text is represented by a numeric vector with 1024 elements
(the dimensionality of the LASER embedding space).
To further reduce the dimensionality of tweet text representations,
principal component analysis (PCA) and independent component analysis (ICA)
can be applied by setting .compute.pcs and .compute.ics to TRUE, respectively.
These lower-dimensional representations are obtained by applying
the rotation (PCA) and de-noising (ICA) matrices that were estimated by applying
PCA and ICA to the model training and validation data before training classifiers.
(The resulting objects can be accessed via laser.embedding.prcomp and laser.embedding.icomp.)
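The idea of re-using a rotation estimated on training data can be sketched with stats::prcomp on simulated data. This is only an illustration: the matrices below are random stand-ins with 8 instead of 1024 dimensions, and treating laser.embedding.prcomp as a prcomp-like object with center and rotation elements is an assumption.

```r
set.seed(1)

# Simulated stand-ins: 50 "training" embeddings and 4 "new" embeddings,
# with 8 dimensions instead of 1024 to keep the example small.
dims <- sprintf("e%04d", 1:8)
train <- matrix(rnorm(50 * 8), nrow = 50, dimnames = list(NULL, dims))
new_e <- matrix(
  rnorm(4 * 8), nrow = 4,
  dimnames = list(c("t1", "t2", "t3", "t4"), dims)
)

# Fit PCA once on the training data, keeping the first 3 components.
fit <- stats::prcomp(train, center = TRUE, scale. = FALSE, rank. = 3)

# Project new embeddings using the stored centering vector and rotation matrix.
pcs_manual <- sweep(new_e, 2, fit$center, "-") %*% fit$rotation
colnames(pcs_manual) <- paste0("pc", seq_len(ncol(pcs_manual)))

# predict() applies the same centering and rotation to new data.
pcs_predict <- predict(fit, newdata = new_e)
```

Both routes give the same coordinates; the point is that nothing is re-estimated on the new data, so new tweets land in the same component space as the training data.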
Specifically, when training the constituent models of the ensemble classifier,
the full 1024-element text representations were not used as features.
Instead, the dimensionality of the embedding space was reduced to 300
using independent component analysis (ICA) via fastICA.
To project the LASER embeddings of new tweet texts onto the independent component space used during model training, we first obtain LASER embedding representations of the new tweets' texts. We then apply the projection matrices estimated from the original model training and validation data to obtain representations of the new tweets in the independent component space.
ICA estimates W such that XKW = S, where X is an n-times-1024 matrix of the n tweet text embedding representations, K is a 1024-times-300 pre-whitening matrix that projects X onto the first 300 principal components and is estimated from the original data, W is the 300-times-300 un-mixing matrix estimated from the original data, and S is the n-times-300 source matrix of independent components.
If e is the LASER embeddings matrix of tweet texts in x,
we thus compute the matrix product of e, laser.embedding.icomp$K, and laser.embedding.icomp$W
to obtain ics, the independent component representation of e.
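The product described above can be sketched in base R as follows. The matrices are random stand-ins with the stated dimensions, not the package's actual laser.embedding.icomp$K and laser.embedding.icomp$W, and the sketch follows the formula XKW = S as written, without any centering step.

```r
set.seed(1)
n <- 5  # number of tweets in this toy example

# Random stand-in for e, the n-times-1024 LASER embedding matrix.
e <- matrix(
  rnorm(n * 1024), nrow = n,
  dimnames = list(paste0("id", seq_len(n)), sprintf("e%04d", 1:1024))
)

# Random stand-ins for the pre-whitening and un-mixing matrices
# (assumed shapes: K is 1024-times-300, W is 300-times-300).
K <- matrix(rnorm(1024 * 300), nrow = 1024)
W <- matrix(rnorm(300 * 300), nrow = 300)

# Independent component representation: S = X K W.
ics <- e %*% K %*% W
colnames(ics) <- paste0("ic", seq_len(ncol(ics)))
```

The result has one row per tweet and 300 columns, with row names carried over from e.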
A list of four elements:
If .keep.embeddings = FALSE (the default), NULL.
Otherwise, a matrix recording LASER embedding representations of tweet texts.
Row names correspond to x$status_id (but see Note below).
Column names refer to the 1024 embedding dimensions and are named "e0001", ..., "e1024".
If .compute.pcs = TRUE (the default),
a matrix recording principal component representations of tweet text LASER embeddings.
Row names correspond to x$status_id (but see Note below).
Column names refer to the 300 principal components and are named "pc1", ..., "pc300".
Otherwise NULL.
If .compute.ics = TRUE (the default),
a matrix recording independent component representations of tweet text LASER embeddings.
Row names correspond to x$status_id (but see Note below).
Column names refer to the 300 independent components and are named "ic1", ..., "ic300".
Otherwise NULL.
An integer vector reporting the indexes of rows removed from x because is.na(x$text) is TRUE.
Note that, strictly speaking, matrix row names do not always correspond to x$status_id:
when is.na(x$text) is TRUE for some rows, these rows are not contained
in the returned matrices, and their status IDs are hence not present among the row names.
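The consequence of dropping rows with missing text can be illustrated with a small base-R sketch. The data, the two-column matrix m, and the object names are made up for illustration; they only mimic how a returned matrix and the vector of removed row indexes would relate to x.

```r
# Hypothetical input in which two of four tweets have no text.
x <- data.frame(
  status_id = c("1001", "1002", "1003", "1004"),
  text = c("first tweet", NA, "third tweet", NA),
  stringsAsFactors = FALSE
)

# Indexes of rows that would be removed, and the rows that remain.
removed <- which(is.na(x$text))
kept <- x[!is.na(x$text), ]

# Stand-in for a returned representation matrix (two made-up columns).
m <- matrix(
  0, nrow = nrow(kept), ncol = 2,
  dimnames = list(kept$status_id, c("ic1", "ic2"))
)

# Status IDs that are absent from the row names of m.
missing_ids <- x$status_id[removed]

# Re-align m with x: rows without text are filled with NA.
aligned <- m[match(x$status_id, rownames(m)), , drop = FALSE]
rownames(aligned) <- x$status_id
```

Matching on status IDs rather than on row position avoids misaligned joins when some texts are missing.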