Please cite this package as:
Chan, C.H., Zeng, J., Wessler, H., Jungblut, M., Welbers, K., Bajjalieh, J., van Atteveldt, W., & Althaus, S. (2020) Reproducible Extraction of Cross-lingual Topics. Communication Methods & Measures. DOI: 10.1080/19312458.2020.1812555
The rectr package contains an example dataset, "wiki", with English and German Wikipedia articles about programming languages and locations in Germany. The package uses the corpus data structure from the quanteda package.
require(rectr)
require(tibble)
require(dplyr)
wiki
Currently, this package supports aligned fastText from Facebook Research and Multilingual BERT (MBERT) from Google Research. For easier integration, the PyTorch version of MBERT from Transformers is used.
## setup a conda environment, default name: rectr_condaenv
mbert_env_setup()
## default to your current directory
download_mbert(noise = TRUE)
Create a multilingual corpus
wiki_corpus <- create_corpus(wiki$content, wiki$lang)
Create a multilingual dfm
## default
wiki_dfm <- transform_dfm_boe(wiki_corpus, noise = TRUE)
wiki_dfm
wiki_dfm <- readRDS("man/figures/wiki_dfm.RDS")
Filter the dfm for language differences
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered
Estimate a Gaussian Mixture Model
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm
The document-topic matrix is available in wiki_gmm$theta.
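To see how a document-topic matrix of this kind can be used, the base R sketch below assigns each document to its most probable topic. It runs on a made-up matrix, `toy_theta`, rather than on real package output, and it assumes that rows are documents, columns are topics, and each row sums to 1. All values are invented for illustration.

```r
# Invented document-topic matrix: 4 documents (rows) x 2 topics (columns).
toy_theta <- matrix(
  c(0.95, 0.05,
    0.10, 0.90,
    0.60, 0.40,
    0.02, 0.98),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("theta1", "theta2"))
)

# Assumption: each row is a probability distribution over topics.
stopifnot(all(abs(rowSums(toy_theta) - 1) < 1e-8))

# Index of the most probable topic for each document.
top_topic <- max.col(toy_theta, ties.method = "first")
names(top_topic) <- rownames(toy_theta)
top_topic

# Number of documents assigned to each topic.
table(top_topic)
```

With real output, the same two calls could be applied to wiki_gmm$theta, provided its layout matches the assumption above.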
Rank the articles by theta1, the first column of the document-topic matrix.
wiki %>%
  mutate(theta1 = wiki_gmm$theta[, 1]) %>%
  arrange(theta1) %>%
  select(title, lang, theta1) %>%
  print(n = 400)
Download and preprocess fastText word embeddings from Facebook. Make sure you have at least 5 GB of disk space and a reasonable amount of RAM. It took around 20 minutes on my machine.
get_ft("en")
get_ft("de")
emb <- read_ft(c("en", "de"))
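For orientation, fastText distributes its word vectors as plain-text .vec files: a header line with the vocabulary size and the dimension, followed by one line per word with its vector. The sketch below parses a tiny in-memory example of that layout. The helper `parse_vec()` and the numbers are hypothetical and only illustrate the format; how get_ft() and read_ft() store and read the vectors internally is not shown here and may differ.

```r
# A tiny in-memory stand-in for a .vec file: header line, then one word per line.
vec_lines <- c(
  "3 4",
  "the 0.1 0.2 0.3 0.4",
  "der 0.1 0.2 0.3 0.5",
  "city -0.2 0.0 0.7 0.1"
)

# Hypothetical helper: parse .vec-style lines into a word x dimension matrix.
parse_vec <- function(lines) {
  # Header holds the number of words and the vector dimension.
  header <- as.integer(strsplit(lines[1], " ", fixed = TRUE)[[1]])
  fields <- strsplit(lines[-1], " ", fixed = TRUE)
  # First field of each line is the word, the rest are the vector values.
  words <- vapply(fields, `[`, character(1), 1)
  values <- t(vapply(fields, function(x) as.numeric(x[-1]), numeric(header[2])))
  rownames(values) <- words
  stopifnot(nrow(values) == header[1])
  values
}

toy_vectors <- parse_vec(vec_lines)
dim(toy_vectors)
toy_vectors["city", ]
```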
Create a multilingual corpus
wiki_corpus <- create_corpus(wiki$content, wiki$lang)
Create a multilingual dfm
require(future)
plan(multisession)
wiki_dfm <- transform_dfm_boe(wiki_corpus, emb, .progress = TRUE)
wiki_dfm
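As a rough picture of what a bag-of-embeddings document representation looks like, the base R sketch below averages toy word vectors per document. It assumes that "boe" stands for bag of embeddings and that a document vector is the plain mean of its token vectors. The vocabulary, the vector values, and the helper `doc_vector()` are invented for illustration and are not part of rectr; the actual computation inside transform_dfm_boe() may differ.

```r
# Toy 3-dimensional "aligned" embeddings; values are invented for illustration.
toy_emb <- rbind(
  code    = c(0.9, 0.1, 0.0),
  sprache = c(0.8, 0.2, 0.1),
  city    = c(0.1, 0.9, 0.2),
  stadt   = c(0.0, 1.0, 0.1)
)

# Hypothetical helper: average the vectors of all known tokens in a document.
doc_vector <- function(tokens, emb) {
  known <- tokens[tokens %in% rownames(emb)]
  if (length(known) == 0) {
    return(rep(NA_real_, ncol(emb)))
  }
  colMeans(emb[known, , drop = FALSE])
}

docs <- list(
  en_doc = c("code", "city", "unknown"),
  de_doc = c("sprache", "stadt")
)

# One row per document, one column per embedding dimension.
toy_dfm <- t(vapply(docs, doc_vector, numeric(ncol(toy_emb)), emb = toy_emb))
toy_dfm
```

Because the toy vectors for the English and German words sit close together, the two documents end up with similar rows even though they share no tokens, which is the property an aligned embedding space is meant to provide.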
Filter the dfm for language differences
wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered
Estimate a Gaussian Mixture Model
wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm
The document-topic matrix is available in wiki_gmm$theta.
Rank the articles by theta1, the first column of the document-topic matrix.
wiki %>%
  mutate(theta1 = wiki_gmm$theta[, 1]) %>%
  arrange(theta1) %>%
  select(title, lang, theta1) %>%
  print(n = 400)
SessionInfo
sessionInfo()