rectr: Reproducible Extraction of Cross-lingual Topics

Please cite this package as:

Chan, C.H., Zeng, J., Wessler, H., Jungblut, M., Welbers, K., Bajjalieh, J., van Atteveldt, W., & Althaus, S. (2020). Reproducible Extraction of Cross-lingual Topics. Communication Methods and Measures. DOI: 10.1080/19312458.2020.1812555

The rectr package contains an example dataset "wiki" with English and German articles from Wikipedia about programming languages and locations in Germany. This package uses the corpus data structure from the quanteda package.

require(rectr)
require(tibble)
require(dplyr)
wiki
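
The wiki object is a tibble with the columns title, content and lang, which are used throughout this README. For a quick orientation, you can tabulate the articles per language (an illustrative check, not part of the original walkthrough):

## articles per language
wiki %>% count(lang)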

Currently, this package supports aligned fastText from Facebook Research and Multilingual BERT (MBERT) from Google Research. For easier integration, the PyTorch version of MBERT from the Hugging Face Transformers library is used.

Multilingual BERT

Step 1: Set up your conda environment

## setup a conda environment, default name: rectr_condaenv
mbert_env_setup()
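
If the setup fails, check whether R can see your conda installation at all. The reticulate package (used here only as a diagnostic; it is not part of rectr's API) can list the available environments:

## optional: list the conda environments visible to R
reticulate::conda_list()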

Step 2: Download the MBERT model

## default to your current directory
download_mbert(noise = TRUE)

Step 3: Create corpus

Create a multilingual corpus

wiki_corpus <- create_corpus(wiki$content, wiki$lang)
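
As noted above, the package uses quanteda's corpus data structure, so standard quanteda accessors should work on the result. A hedged example, assuming the object behaves like an ordinary quanteda corpus:

## number of documents in the multilingual corpus
quanteda::ndoc(wiki_corpus)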

Step 4: Create bag-of-embeddings dfm

Create a multilingual dfm

## default
wiki_dfm <- transform_dfm_boe(wiki_corpus, noise = TRUE)
wiki_dfm

Step 5: Filter dfm

Filter the dfm to remove embedding dimensions that capture language differences rather than shared content; k is the number of topics to be extracted in the GMM step below.

wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered

Step 6: Estimate GMM

Estimate a Gaussian Mixture Model; the seed argument makes the estimation reproducible.

wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm

The document-topic matrix is available in wiki_gmm$theta.
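
Since theta holds the posterior topic memberships from the mixture model, each row should sum to one. A quick sanity check (illustrative; not in the original README):

## each document's topic probabilities should sum to 1
summary(rowSums(wiki_gmm$theta))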

Rank the articles by theta1, i.e. the probability of the first topic.

wiki %>% mutate(theta1 = wiki_gmm$theta[,1]) %>% arrange(theta1) %>% select(title, lang, theta1) %>% print(n = 400)
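
Because k = 2, theta has a second column as well. The same pattern, sorted in descending order, surfaces the articles most strongly associated with the second topic (an illustrative variant, not in the original README):

wiki %>% mutate(theta2 = wiki_gmm$theta[,2]) %>% arrange(desc(theta2)) %>% select(title, lang, theta2)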

Aligned fastText

Step 1: Download word embeddings

Download and preprocess the aligned fastText word embeddings from Facebook. Make sure you have at least 5 GB of disk space and a reasonable amount of RAM. This took around 20 minutes on my machine.

get_ft("en")
get_ft("de")

Step 2: Read the downloaded word embeddings

emb <- read_ft(c("en", "de"))

Step 3: Create corpus

Create a multilingual corpus

wiki_corpus <- create_corpus(wiki$content, wiki$lang)

Step 4: Create bag-of-embeddings dfm

Create a multilingual dfm

require(future)
plan(multisession)
wiki_dfm <- transform_dfm_boe(wiki_corpus, emb, .progress = TRUE)
wiki_dfm
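
plan(multisession) uses all available cores by default. If you want to cap the parallelism, plan() also accepts a workers argument (standard future usage; optional here):

## optional: limit the computation to 2 parallel R sessions
plan(multisession, workers = 2)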

Step 5: Filter dfm

Filter the dfm for language differences, as in the MBERT workflow above.

wiki_dfm_filtered <- filter_dfm(wiki_dfm, k = 2)
wiki_dfm_filtered

Step 6: Estimate GMM

Estimate a Gaussian Mixture Model, again with a fixed seed for reproducibility.

wiki_gmm <- calculate_gmm(wiki_dfm_filtered, seed = 46709394)
wiki_gmm

The document-topic matrix is available in wiki_gmm$theta.

Rank the articles by theta1, i.e. the probability of the first topic.

wiki %>% mutate(theta1 = wiki_gmm$theta[,1]) %>% arrange(theta1) %>% select(title, lang, theta1) %>% print(n = 400)

SessionInfo

sessionInfo()

