knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  progress = FALSE,
  error = FALSE, 
  message = FALSE,
  warning = FALSE,
  rownames.print = FALSE
)
options(width = 100, digits = 2)

What is rapidraker?

rapidraker provides an implementation of the same keyword extraction algorithm (RAKE) that slowraker does, but it's written in Java instead of R. This makes it a bit faster than slowraker.

Installation

You can get the stable version from CRAN:

install.packages("rapidraker")

The development version of the package requires you to compile the latest Java source code in rapidrake-java, so installing it is not as simple as making a call to devtools::install_github().

Basic usage

library(slowraker)
library(rapidraker)

data("dog_pubs")
rakelist <- rapidrake(txt = dog_pubs$abstract[1:5])
head(rbind_rakelist(rakelist))
# Note, we have to split the vignette up like this so that it doesn't print 
# the progress bar.
library(slowraker)
library(rapidraker)

data("dog_pubs")
rakelist <- rapidrake(txt = dog_pubs$abstract[1:5])
head(rbind_rakelist(rakelist))

Performance comparison

txt <- rep(dog_pubs$abstract, 20)

sr_time <- system.time(slowrake(txt))[["elapsed"]]
rr_time <- system.time(rapidrake(txt))[["elapsed"]]

In this example, rapidrake() took r rr_time seconds to execute while slowrake() took r sr_time, making the Java version about about r round(sr_time / rr_time, 0) times faster.

Making rapidrake() even faster

We can parallelize extraction across documents like so:

# The following code was run on aarch64-apple-darwin20, 12 cores
library(parallel)
library(doParallel)
library(foreach)

cores <- detectCores()
# Make txt vector larger so we can more easily see the speed improvement of parallelization
txt2 <- rep(txt, cores * 3) 
by <- floor(length(txt2) / cores)

cl <- makeCluster(cores)
registerDoParallel(cl)

rr_par_time <- system.time(
  foreach(i = 1:cores) %dopar% {
    start <- (i - 1) * by + 1
    finish <- start + by - 1
    rapidraker::rapidrake(txt2[start:finish])
  }
)[["elapsed"]]

stopCluster(cl)

The sequential version of rapidrake() took r rr_time seconds to extract keywords for r length(txt) documents, while the parallel version took r rr_par_time seconds for r length(txt2) documents. This suggests that the parallel version was about r round(rr_time * cores * 3 / rr_par_time, 0) times faster than the regular version.



crew102/slowraker documentation built on Sept. 5, 2024, 11:22 a.m.