In chainsawriot/sweater: Speedy Word Embedding Association Test and Extras Using R

Benchmark

This is a version of WEAT written entirely in R.

require(purrr)
require(lsa)

take <- function(word, w) {
    return(as.vector(w[word, , drop = FALSE]))
}

get_x <- function(w, words) {
    purrr::map(words, take, w = w)
}

g <- function(c, A, B, w) {
    A_emb <- get_x(w, A)
    B_emb <- get_x(w, B)
    c_emb <- get_x(w, c)[[1]]
    a_cos_diff <- mean(purrr::map_dbl(A_emb, ~cosine(., c_emb)))
    b_cos_diff <- mean(purrr::map_dbl(B_emb, ~cosine(., c_emb)))
    return(a_cos_diff - b_cos_diff)
}

.clean <- function(x, w_lab, verbose = FALSE) {
    new_x <- intersect(x, w_lab)
    if (length(new_x) < length(x) & verbose) {
        print("Some word(s) are not available in w.")
    }
    return(new_x)
}


r_weat <- function(w, S, T, A, B, verbose = FALSE) {
    w_lab <- rownames(w)
    A <- .clean(A, w_lab, verbose = verbose)
    B <- .clean(B, w_lab, verbose = verbose)
    S <- .clean(S, w_lab, verbose = verbose)
    T <- .clean(T, w_lab, verbose = verbose)
    S_diff <- purrr::map_dbl(S, g, A, B, w)
    T_diff <- purrr::map_dbl(T, g, A, B, w)
    ## union_diff <- purrr::map_dbl(union(S, T), g, A, B, w)
    return((mean(S_diff) - mean(T_diff)) / sd(c(S_diff, T_diff)))
}
require(compiler)

r_weat_c <- cmpfun(r_weat)
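
To make the quantity returned by r_weat concrete, here is a minimal base R sketch that computes the same kind of effect size on a made-up 2-dimensional embedding matrix. The matrix `toy_w`, its words and values, and the helper names `cos_sim`, `assoc` and `toy_weat` are all hypothetical and exist only for this illustration. The sketch uses no packages and is not the code benchmarked in this document.

```r
# Hypothetical toy embeddings: rows are words, columns are dimensions.
toy_w <- rbind(
    math    = c(1.0, 0.1),
    algebra = c(0.9, 0.2),
    poetry  = c(0.1, 1.0),
    art     = c(0.2, 0.9),
    he      = c(1.0, 0.0),
    she     = c(0.0, 1.0)
)

# Cosine similarity between two numeric vectors.
cos_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Association of one target word with attribute sets A and B:
# mean cosine to A minus mean cosine to B (the role played by g() above).
assoc <- function(word, A, B, w) {
    mean(sapply(A, function(a) cos_sim(w[word, ], w[a, ]))) -
        mean(sapply(B, function(b) cos_sim(w[word, ], w[b, ])))
}

# Effect size: difference between the mean associations of S and T,
# divided by the standard deviation over all target words.
toy_weat <- function(w, S, T, A, B) {
    s_diff <- sapply(S, assoc, A = A, B = B, w = w)
    t_diff <- sapply(T, assoc, A = A, B = B, w = w)
    (mean(s_diff) - mean(t_diff)) / sd(c(s_diff, t_diff))
}

es <- toy_weat(toy_w, S = c("math", "algebra"), T = c("poetry", "art"),
               A = "he", B = "she")
es
```

With these toy vectors the effect size is positive, because the S words point in the same direction as "he" and the T words in the same direction as "she". Swapping S and T flips the sign.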

The Caliskan et al. example.

require(sweater)
S2 <- c("math", "algebra", "geometry", "calculus", "equations",
        "computation", "numbers", "addition")
T2 <- c("poetry", "art", "dance", "literature", "novel", "symphony",
        "drama", "sculpture")
A2 <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
B2 <- c("female", "woman", "girl", "sister", "she", "her", "hers",
        "daughter")
r_weat(glove_math, S2, T2, A2, B2)
r_weat_c(glove_math, S2, T2, A2, B2)

The same implementation in C++ from sweater

calculate_es(query(glove_math, S2, T2, A2, B2))
cpp_weat <- function(w, S, T, A, B) {
     calculate_es(query(w, S, T, A, B))
}

The C++ implementation in sweater is >10x faster. Byte-code compilation (r_weat_c) brings almost no improvement.
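
One plausible reason for the small gain from cmpfun (an assumption on our part, not something measured in this document) is that R has enabled its just-in-time compiler by default since version 3.4.0, so closures such as r_weat are byte-compiled automatically after their first few calls. The sketch below only shows how to inspect the JIT level. The toy function `f` is hypothetical.

```r
library(compiler)

# A negative argument only reports the current JIT level; it changes nothing.
jit_level <- enableJIT(-1)
jit_level

# Hypothetical toy function with a loop, compiled explicitly.
f <- function(x) {
    s <- 0
    for (i in x) s <- s + i
    s
}
f_c <- cmpfun(f)

# Explicit compilation changes speed at most, never the result.
identical(f(1:10), f_c(1:10))
```

If the reported level is 0, the JIT is off and an explicit cmpfun call would be expected to matter more.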

require(bench)
benchmark_res <- bench::mark(
                            r_weat(glove_math, S2, T2, A2, B2),
                            r_weat_c(glove_math, S2, T2, A2, B2),
                            cpp_weat(glove_math, S2, T2, A2, B2),
                            relative = TRUE)
benchmark_res

Random benchmark

In this benchmark, we test how the lengths of S/T/A/B affect the performance. sweater is at least 7x faster.

set.seed(12121)
stab_length <- seq(10, 100, 10)
r_bench <- function(stab_n) {
    w_lab <- rownames(googlenews)
    S <- sample(w_lab, stab_n)
    T <- sample(w_lab, stab_n)
    A <- sample(w_lab, stab_n)
    B <- sample(w_lab, stab_n)
    bench::mark(r_weat(googlenews, S, T, A, B),
                r_weat_c(googlenews, S, T, A, B),
                cpp_weat(googlenews, S, T, A, B),
                relative = TRUE)
}

res <- map(stab_length, r_bench)
res %>% map_dfr(~.[1,3]) %>% dplyr::mutate(stab_length = stab_length)
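
The last line above picks a cell by position: row 1 is the first expression passed to bench::mark (r_weat), and column 3 of a bench::mark result is `median`. With relative = TRUE that cell is the median time of r_weat relative to the fastest expression. A small sketch with two toy expressions, where the names `slow` and `fast` are hypothetical, shows the same selection by name:

```r
library(bench)

# Two toy expressions that return the same value, so the default
# result check in bench::mark() passes.
toy_res <- bench::mark(
    slow = Reduce(`+`, 1:1000),
    fast = sum(1:1000),
    relative = TRUE
)

names(toy_res)[3]      # the column that `.[1, 3]` picks by position
toy_res[1, "median"]   # the same cell, selected by name
```

Selecting by name is a little more robust should the column order of the benchmark table ever change.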

versus WEFE

The hidden gem of this package is the function read_word2vec, which is both flexible and fast. The following example times the two steps of a typical bias detection pipeline:

1. read a pretrained word embedding file (in this case GloVe, 5.3 GB)
2. do WEAT

read_word2vec is built on data.table::fread, which is implemented in C, and it can read almost all formats (word2vec, GloVe, fastText, etc.). The entire workflow finishes in less than a minute.
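
As a rough illustration of why a reader built on fread can cover several formats, here is a minimal sketch. It is not sweater's actual implementation: the helper name `read_embedding_sketch` and the two tiny temporary files are hypothetical. It assumes space-separated text files in which a word2vec file starts with a line giving the number of words and dimensions, and a GloVe file does not.

```r
library(data.table)

# Hypothetical helper: read a text embedding file into a numeric matrix
# with the words as row names.
read_embedding_sketch <- function(path) {
    first <- strsplit(readLines(path, n = 1), " ", fixed = TRUE)[[1]]
    # word2vec text files begin with "<n_words> <n_dims>"; GloVe files do not.
    has_header <- length(first) == 2 &&
        !anyNA(suppressWarnings(as.integer(first)))
    dt <- fread(path, sep = " ", header = FALSE, quote = "",
                skip = if (has_header) 1 else 0)
    m <- as.matrix(dt[, -1, with = FALSE])
    rownames(m) <- dt[[1]]
    colnames(m) <- NULL
    m
}

# Two toy files with the same vectors in GloVe and word2vec layout.
glove_file <- tempfile(fileext = ".txt")
w2v_file <- tempfile(fileext = ".txt")
vec_lines <- c("the 0.1 0.2 0.3", "cat 0.4 0.5 0.6")
writeLines(vec_lines, glove_file)
writeLines(c("2 3", vec_lines), w2v_file)

m_glove <- read_embedding_sketch(glove_file)
m_w2v <- read_embedding_sketch(w2v_file)
identical(m_glove, m_w2v)
```

The only format-specific step in this sketch is deciding whether to skip the first line. The parsing itself is left to fread.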

```{bash, engine.opts='-l'}
time Rscript bench.R
```

The Python workflow, however, needs `gensim` to read the pretrained word embedding file. `gensim` can't read the GloVe format directly, so the file must first be converted to the word2vec format [^1]. `KeyedVectors.load_word2vec_format` is not written in a low-level language and is thus many times slower than `read_word2vec`. In addition, the result reported differs from the numbers reported in Caliskan et al.

```{bash, engine.opts='-l'}
time python3 bench.py
```

versus the original Java code

The original Java code by Caliskan et al. is extremely fast because the code is highly optimized. If all you need to do is WEAT and you know how to write Java, we recommend using the Java code. See this footnote [^2] on how to run the code.

```{bash, engine.opts='-l'}
# javac -cp ./lib/commons-lang3-3.3.2.jar:./lib/commons-math3-3.6.1.jar WeatBenchmark.java Utils.java
time java -classpath .:./lib/commons-lang3-3.3.2.jar:./lib/commons-math3-3.6.1.jar WeatBenchmark
```

## Testing environment

```r
sessionInfo()
```

```{bash, engine.opts='-l'}
neofetch --stdout
```

[^1]: The only difference between the two formats is that the GloVe format doesn't record the dimensionality of the matrix in the first line.

[^2]: Please download the original Java code and keep Utils.java in the same directory. Also, put the two jar files provided inside the lib directory. The commented-out command compiles WeatBenchmark.java to JVM bytecode. It should finish in 1s.

chainsawriot/sweater documentation built on June 29, 2024, 8:17 p.m.
