This is a version of WEAT written entirely in R.
require(purrr) require(lsa) take <- function(word, w) { return(as.vector(w[word, , drop = FALSE])) } get_x <- function(w, words) { purrr::map(words, take, w = w) } g <- function(c, A, B, w) { A_emb <- get_x(w, A) B_emb <- get_x(w, B) c_emb <- get_x(w, c)[[1]] a_cos_diff <- mean(purrr::map_dbl(A_emb, ~cosine(., c_emb))) b_cos_diff <- mean(purrr::map_dbl(B_emb, ~cosine(., c_emb))) return(a_cos_diff - b_cos_diff) } .clean <- function(x, w_lab, verbose = FALSE) { new_x <- intersect(x, w_lab) if (length(new_x) < length(x) & verbose) { print("Some word(s) are not available in w.") } return(new_x) } r_weat <- function(w, S, T, A, B, verbose = FALSE) { w_lab <- rownames(w) A <- .clean(A, w_lab, verbose = verbose) B <- .clean(B, w_lab, verbose = verbose) S <- .clean(S, w_lab, verbose = verbose) T <- .clean(T, w_lab, verbose = verbose) S_diff <- purrr::map_dbl(S, g, A, B, w) T_diff <- purrr::map_dbl(T, g, A, B, w) ## union_diff <- purrr::map_dbl(union(S, T), g, A, B, w) return((mean(S_diff) - mean(T_diff)) / sd(c(S_diff, T_diff))) } require(compiler) r_weat_c <- cmpfun(r_weat)
require(sweater) S2 <- c("math", "algebra", "geometry", "calculus", "equations", "computation", "numbers", "addition") T2 <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama", "sculpture") A2 <- c("male", "man", "boy", "brother", "he", "him", "his", "son") B2 <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter") r_weat(glove_math, S2, T2, A2, B2) r_weat_c(glove_math, S2, T2, A2, B2)
The same implementation in C++ from sweater
calculate_es(query(glove_math, S2, T2, A2, B2)) cpp_weat <- function(w, S, T, A, B) { calculate_es(query(w, S, T, A, B)) }
The C++ implementation in sweater
is >10x faster. Byte-code compilation (r_weat_c
) can bring about almost no little improvement.
require(bench) benchmark_res <- bench::mark( r_weat(glove_math, S2, T2, A2, B2), r_weat_c(glove_math, S2, T2, A2, B2), cpp_weat(glove_math, S2, T2, A2, B2), relative = TRUE) benchmark_res
In this benchmark, we test how the lengths of S/T/A/B affect the performance. sweater
is at least 7x faster.
set.seed(12121) stab_length <- seq(10, 100, 10) r_bench <- function(stab_n) { w_lab <- rownames(googlenews) S <- sample(w_lab, stab_n) T <- sample(w_lab, stab_n) A <- sample(w_lab, stab_n) B <- sample(w_lab, stab_n) bench::mark(r_weat(googlenews, S, T, A, B), r_weat_c(googlenews, S, T, A, B), cpp_weat(googlenews, S, T, A, B), relative = TRUE) } res <- map(stab_length, r_bench) res %>% map_dfr(~.[1,3]) %>% dplyr::mutate(stab_length = stab_length)
The hidden gem of this package is the function read_word2vec
. It is so flexible and yet speedy. In the following example, we are going to compare the typical task of the bias detection pipeline:
read_word2vec
is based on the C++ based function data.table::fread
and it can read almost all formats (word2vec, glove, fastText, etc). The entire workflow can be finished in less than a minute.
```{bash, engine.opts='-l'} time Rscript bench.R
The Python workflow, however, needs to use `gensim` to read the pretained word embedding file and it can't read GLoVE format directly and the file needs to first convert to word2vec [^1]. The `KeyedVectors.load_word2vec_format` is not written in a low-level language and thus is many times slower than `read_word2vec`. And the result reported is not the same as the numbers reported in Caliskan et al. ```{bash, engine.opts='-l'} time python3 bench.py
The original Java code by Caliskan et al. is extremely fast because the code is highly optimized. If all you need to do is WEAT and you know how to write Java, it is recommended using the Java code. See this footnote [^2] on how to run the code.
```{bash, engine.opts='-l'}
time java -classpath .:./lib/commons-lang3-3.3.2.jar:./lib/commons-math3-3.6.1.jar WeatBenchmark
## Testing environment ```r sessionInfo()
neofetch --stdout
[^1]: Actually the only difference between the two format is the GLoVE format doesn't record the dimensionality of the matrix in the first line.
[^2]: Please download the original Java code and keep the Utils.java
in the same directory. Also, put the 2 jar files provided inside the lib
directory. The commented-out line of command compiles the WeatBenchmark.java
to JVM Byecode. It should be finished in 1s.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.