textmodel_lss: Fit a Latent Semantic Scaling model

View source: R/textmodel_lss.R

textmodel_lss    R Documentation

Fit a Latent Semantic Scaling model

Description

Latent Semantic Scaling (LSS) is a semi-supervised algorithm for document scaling based on word embedding.

Usage

textmodel_lss(x, ...)

## S3 method for class 'dfm'
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  k = 300,
  slice = NULL,
  weight = "count",
  cache = FALSE,
  simil_method = "cosine",
  engine = c("RSpectra", "irlba", "rsvd"),
  auto_weight = FALSE,
  include_data = FALSE,
  group_data = FALSE,
  verbose = FALSE,
  ...
)

## S3 method for class 'fcm'
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  k = 50,
  max_count = 10,
  weight = "count",
  cache = FALSE,
  simil_method = "cosine",
  engine = "rsparse",
  auto_weight = FALSE,
  verbose = FALSE,
  ...
)

## S3 method for class 'tokens'
textmodel_lss(
  x,
  seeds,
  terms = NULL,
  k = 200,
  min_count = 5,
  engine = "wordvector",
  tolower = TRUE,
  include_data = FALSE,
  group_data = FALSE,
  spatial = TRUE,
  verbose = FALSE,
  ...
)

Arguments

x

a dfm or fcm created by quanteda::dfm() or quanteda::fcm(), or a quanteda::tokens or quanteda::tokens_xptr object.

...

additional arguments passed to the underlying engine.

seeds

a character vector or named numeric vector that contains seed words. If seed words contain "*", they are interpreted as glob patterns. See quanteda::valuetype.

terms

a character vector or named numeric vector that specifies the words for which polarity scores will be computed; if a numeric vector, the words' polarity scores will be weighted accordingly; if NULL, all the features in x except those less frequent than min_count will be used.

k

the number of singular values requested from the SVD engine. Only used when x is a dfm.

slice

a number or a vector of indices of the components of word vectors used to compute similarity; set slice < k to further truncate word vectors; useful for diagnosis and simulation.

weight

weighting scheme passed to quanteda::dfm_weight(). Ignored when engine = "rsparse".

cache

if TRUE, save the result of the SVD for the next execution with identical x and settings. Use base::options(lss_cache_dir) to change the location where cache files are saved.

simil_method

specifies the method used to compute similarity between features. The value is passed to quanteda.textstats::textstat_simil(); "cosine" is used otherwise.

engine

selects the engine used to factorize x and generate word vectors: if x is a dfm, RSpectra::svds(), irlba::irlba() or rsvd::rsvd(); if x is a fcm, rsparse::GloVe(); if x is a tokens (or tokens_xptr) object, wordvector::textmodel_word2vec().

auto_weight

if TRUE, automatically determine weights to approximate the polarity of terms to seed words. Deprecated.

include_data

if TRUE, the fitted model includes the dfm supplied as x.

group_data

if TRUE, apply dfm_group(x) before saving the dfm.

verbose

show messages if TRUE.

max_count

passed to x_max in rsparse::GloVe$new(), where co-occurrence counts are capped at this threshold. It should be adjusted according to the size of the corpus. Used only when x is a fcm.

min_count

the minimum frequency of words. Words less frequent than this in x are removed before training.

tolower

if TRUE, lower-case all the words in the model.

spatial

[experimental] if FALSE, return a probabilistic model. See the details.

Details

Latent Semantic Scaling (LSS) is a semi-supervised document scaling method. textmodel_lss() constructs word vectors from user-provided documents (x) and weights words (terms) based on their semantic proximity to seed words (seeds). Seed words are any known polarity words (e.g. sentiment words) that users should manually choose. The required number of seed words is usually 5 to 10 for each end of the scale.
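The workflow above can be sketched as follows. This is a minimal illustration assuming the quanteda and LSX packages are installed; quanteda's built-in data_corpus_inaugural stands in for user-provided documents, and LSX's generic sentiment dictionary supplies the seed words:

```r
library(quanteda)
library(LSX)

# user-provided documents: here the built-in inaugural corpus as a stand-in
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
dfmt <- dfm(toks)

# bipolar seed words from the generic sentiment dictionary shipped with LSX
seed <- as.seedwords(data_dictionary_sentiment)

# fit the model; k is kept small here because this corpus has few documents
lss <- textmodel_lss(dfmt, seeds = seed, k = 50, include_data = TRUE)

head(coef(lss))       # polarity scores of words
pred <- predict(lss)  # document scores (requires include_data = TRUE)
```

In a real application, x should be a much larger corpus segmented into short documents (e.g. sentences) from the target domain.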

If seeds is a named numeric vector with positive and negative values, a bipolar model is constructed; if seeds is a character vector, a unipolar model is constructed. Bipolar models usually perform better in document scaling because both ends of the scale are defined by the user.
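For illustration, the two forms of seeds might look like this (the word choices are hypothetical):

```r
# bipolar: a named numeric vector with positive and negative values
seed_bipolar <- c("good" = 1, "nice" = 1, "bad" = -1, "awful" = -1)

# unipolar: a plain character vector
seed_unipolar <- c("good", "nice", "excellent")

# "*" in seed words is treated as a glob pattern (see quanteda::valuetype)
seed_glob <- c("happ*" = 1, "sad*" = -1)
```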

A seed word's polarity score computed by textmodel_lss() tends to diverge from the original score given by the user, because its computed score is affected not only by its own original score but also by the original scores of all the other seed words. If auto_weight = TRUE, the original scores are weighted automatically using stats::optim() to minimize the squared difference between the seed words' computed and original scores. The weighted scores are saved in seed_weighted in the object.

When x is a tokens or tokens_xptr object, wordvector::textmodel_word2vec() is called internally with type = "skip-gram" and other arguments passed via .... If spatial = TRUE, it returns a spatial model; otherwise, a probabilistic model. In spatial models, the polarity scores of words are their cosine similarity to the seed words; in probabilistic models, they are the predicted probabilities that the seed words occur in the words' contexts. Probabilistic models are still experimental, so use them with caution.
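A sketch of the tokens-based workflow, assuming the wordvector package is installed (the built-in corpus is again only a stand-in; word2vec needs a much larger corpus to produce reliable vectors):

```r
library(quanteda)
library(LSX)

toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# word2vec engine; spatial = TRUE yields cosine-similarity polarity scores
lss_w2v <- textmodel_lss(toks,
                         seeds = as.seedwords(data_dictionary_sentiment),
                         k = 100, min_count = 5, spatial = TRUE)
```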

Please visit the package website for examples.

References

Watanabe, Kohei. 2020. "Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages", Communication Methods and Measures. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/19312458.2020.1832976")}.

Watanabe, Kohei. 2017. "Measuring News Bias: Russia's Official News Agency ITAR-TASS' Coverage of the Ukraine Crisis", European Journal of Communication. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1177/0267323117695735")}.


LSX documentation built on Sept. 13, 2025, 1:10 a.m.
