In koheiw/LSS: Semi-Supervised Algorithm for Document Scaling

knitr::opts_chunk$set(collapse = TRUE, comment = "#>", 
                      fig.width = 8, fig.height = 4, dpi = 150, out.width = 760)

To apply LSS, we must define the measurement with seed words. If seed words are not readily available, we must create a list of candidates using thesauri and glossaries. However, some of these words may be used in many different contexts, which makes them unsuitable as seed words. Good seed words are words that appear only in the context of the target concept.

We can evaluate the suitability of seed words by checking their synonyms identified in the corpus: we fit an LSS model with one seed word at a time and inspect the words with the highest polarity scores. This repetitive process can be automated using bootstrap_lss(). The function extracts the seed words from a fitted LSS model, internally fits an LSS model with each seed word, and returns their synonyms or similarity scores. In this example, we use the LSS model fitted in the introduction.

library(LSX)
lss <- readRDS("lss.rds")
print(lss)

The model is fitted with the generic sentiment seed words. Their original polarity scores are weighted by the inverse of the number of seed words (1/7 ≈ 0.143) so that the opposing ends can have unequal numbers of seed words.

print(lss$seeds)
print(lss$seeds_weighted)
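
The weighting can be illustrated with a minimal base R sketch. It assumes that the count is taken separately for each end, which is what the 1/7 weight suggests if there are seven seed words per end. The seed words and their counts below are hypothetical and chosen to be unequal; this is not the package's internal code.

```r
# Hypothetical seed words with unequal counts: 3 positive, 2 negative
seeds <- c(good = 1, nice = 1, excellent = 1, bad = -1, nasty = -1)

# Count the seed words on each end (assumption: weights are computed per end)
n_pos <- sum(seeds > 0)
n_neg <- sum(seeds < 0)

# Divide each polarity score by the number of seed words on its own end
seeds_weighted <- seeds / ifelse(seeds > 0, n_pos, n_neg)
print(round(seeds_weighted, 3))

# Each end now carries the same total weight despite the unequal counts
print(sum(seeds_weighted[seeds_weighted > 0]))
print(sum(seeds_weighted[seeds_weighted < 0]))
```

With this weighting, adding more seed words to one end does not make that end dominate the scale.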

Evaluation with synonyms

By default, bootstrap_lss() returns lists of synonyms for the seed words. Each column is a list of words sorted by their similarity to the seed word shown at the top. There are many proper names in the example, but we can see that many words are positive for positive seed words and negative for negative seed words. If a list is a mixture of positive and negative words, the seed word is probably too ambiguous.

bs_term <- bootstrap_lss(lss, mode = "terms")
knitr::kable(head(bs_term, 10))
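
The idea behind these lists can be sketched with toy data in base R. The example assumes that similarity is measured as cosine similarity between word vectors; the two-dimensional vectors and the word list are invented for illustration, and the code is not what bootstrap_lss() runs internally.

```r
# Hypothetical 2-dimensional word vectors (rows are words); values are invented
vec <- rbind(
  good     = c(0.9, 0.1),
  great    = c(0.8, 0.2),
  bad      = c(0.1, 0.9),
  terrible = c(0.2, 0.8),
  report   = c(0.5, 0.5)
)

# Cosine similarity between every pair of words
unit <- vec / sqrt(rowSums(vec^2))
sim <- unit %*% t(unit)

# For each seed word, rank all words by their similarity to that seed word alone;
# the seed word itself comes first, as in the table of synonyms
seeds <- c("good", "bad")
top <- sapply(seeds, function(s) names(sort(sim[, s], decreasing = TRUE))[1:3])
print(top)
```

Each column of `top` starts with the seed word and continues with its nearest neighbours, so a column that mixes words from both ends would point to an ambiguous seed word.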

Evaluation with similarity scores

If mode = "coef", the function returns the similarity scores of words for each seed word (the words in the lists above were sorted by these scores). We can use this matrix to evaluate the seed words more systematically when the lists of synonyms are not useful or sufficient.

bs_coef <- bootstrap_lss(lss, mode = "coef")
knitr::kable(head(bs_coef, 10), digits = 3)

We can use words with known polarity, such as "russia" and "ukraine", as anchor words when evaluating seed words. We know that "russia" is more positive than "ukraine" because the corpus is a collection of articles published by Russian state media.

We can confirm that the differences in similarity scores between the anchor words largely agree with the polarity scores of the seed words. However, "fortunate" and "negative" disagree with the expected differences, suggesting that they are more ambiguous than the other seed words.

dat_seed <- data.frame(seed = lss$seeds, diff = bs_coef["russia",] - bs_coef["ukraine",])
print(dat_seed)
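
The comparison can also be made explicit by checking whether the sign of each difference matches the sign of the seed word's polarity. The sketch below uses a small hypothetical matrix in place of bs_coef; all numbers are invented and only mimic the pattern described above.

```r
# Hypothetical polarity of four seed words (not the values from the fitted model)
seeds <- c(good = 1, fortunate = 1, bad = -1, negative = -1)

# Hypothetical similarity scores: rows are anchor words, columns are seed words
coef_toy <- rbind(
  russia  = c(0.30, 0.05, 0.10, 0.20),
  ukraine = c(0.10, 0.12, 0.25, 0.15)
)
colnames(coef_toy) <- names(seeds)

# Difference between the anchor words for each seed word
anchor_diff <- coef_toy["russia", ] - coef_toy["ukraine", ]

# A seed word agrees with the anchors when both signs are the same
dat_toy <- data.frame(seed = seeds, diff = anchor_diff,
                      agree = sign(anchor_diff) == sign(seeds))
print(dat_toy)

# Seed words whose sign disagrees are candidates for closer inspection
suspect <- rownames(dat_toy)[!dat_toy$agree]
print(suspect)
```

A disagreement flags a seed word for review; it does not by itself show that the word should be removed.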

Conclusions

To create an accurate measurement, we should make a list of seed words and evaluate them one by one using bootstrap_lss(). However, seed words become much less ambiguous when they are used as a set, so we should not be too concerned about the results for individual words. Seed word selection should be motivated primarily by the theoretical framework.


koheiw/LSS documentation built on March 9, 2024, 4:41 a.m.
