knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  progress = FALSE,
  error = FALSE,
  message = FALSE,
  warning = FALSE,
  rownames.print = FALSE
)
options(width = 100, digits = 2)
rapidraker provides an implementation of the same keyword extraction algorithm (RAKE) that slowraker does, but it is written in Java instead of R. This makes it a bit faster than slowraker.
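To make the idea behind RAKE concrete, here is a minimal base R sketch of its core scoring step. It is a simplified illustration only: `rake_sketch()` and its tiny default stop word list are hypothetical, and it omits steps that a full implementation may add (such as stemming or part-of-speech filtering). It is not the code used by rapidraker or slowraker.

```r
# Simplified RAKE sketch (hypothetical helper, not the rapidraker/slowraker API).
rake_sketch <- function(txt,
                        stop_words = c("a", "an", "and", "of", "the",
                                       "is", "in", "for", "to")) {
  # Lowercase the text and break it into fragments at punctuation
  fragments <- unlist(strsplit(tolower(txt), "[[:punct:]]+"))

  phrases <- list()
  for (frag in fragments) {
    words <- unlist(strsplit(trimws(frag), "[[:space:]]+"))
    words <- words[nzchar(words)]
    if (length(words) == 0) next
    is_stop <- words %in% stop_words
    # Runs of consecutive non-stop words form the candidate phrases
    run_id <- cumsum(is_stop)
    runs <- split(words[!is_stop], run_id[!is_stop])
    phrases <- c(phrases, unname(runs))
  }
  if (length(phrases) == 0) {
    return(data.frame(keyword = character(0), score = numeric(0)))
  }

  all_words <- unlist(phrases)
  # Frequency: how often each word occurs across candidate phrases
  freq <- table(all_words)
  # Degree: summed length of the phrases in which each word occurs
  deg <- tapply(rep(lengths(phrases), lengths(phrases)), all_words, sum)
  word_score <- setNames(as.numeric(deg) / as.numeric(freq[names(deg)]),
                         names(deg))

  # A phrase's score is the sum of its member word scores
  out <- data.frame(
    keyword = vapply(phrases, paste, character(1), collapse = " "),
    score = vapply(phrases, function(p) sum(word_score[p]), numeric(1)),
    stringsAsFactors = FALSE
  )
  out <- out[!duplicated(out$keyword), ]
  out[order(-out$score), ]
}

res <- rake_sketch("Compatibility of systems of linear constraints")
res
```

In this toy input, "linear constraints" ranks first with a score of 4, because each of its two words has degree 2 and frequency 1, while the single-word phrases score 1.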
You can get the stable version from CRAN:
install.packages("rapidraker")
The development version of the package requires you to compile the latest Java source code in rapidrake-java, so installing it is not as simple as making a call to devtools::install_github().
library(slowraker)
library(rapidraker)

data("dog_pubs")
rakelist <- rapidrake(txt = dog_pubs$abstract[1:5])
head(rbind_rakelist(rakelist))
txt <- rep(dog_pubs$abstract, 20)
sr_time <- system.time(slowrake(txt))[["elapsed"]]
rr_time <- system.time(rapidrake(txt))[["elapsed"]]
In this example, rapidrake() took `r rr_time` seconds to execute while slowrake() took `r sr_time` seconds, making the Java version about `r round(sr_time / rr_time, 0)` times faster.
Making rapidrake() even faster

We can parallelize extraction across documents like so:
# The following code was run on aarch64-apple-darwin20, 12 cores
library(parallel)
library(doParallel)
library(foreach)

cores <- detectCores()

# Make txt vector larger so we can more easily see the speed improvement of parallelization
txt2 <- rep(txt, cores * 3)

# Number of documents handled by each worker
by <- floor(length(txt2) / cores)

cl <- makeCluster(cores)
registerDoParallel(cl)

rr_par_time <- system.time(
  foreach(i = 1:cores) %dopar% {
    start <- (i - 1) * by + 1
    finish <- start + by - 1
    rapidraker::rapidrake(txt2[start:finish])
  }
)[["elapsed"]]

stopCluster(cl)
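The start/finish arithmetic in the loop above splits the documents into equal, contiguous chunks. The sketch below shows that indexing on its own, using small made-up values for `n` and `n_workers`. It also shows parallel::splitIndices() as one possible alternative when the number of documents is not an exact multiple of the number of workers, a case in which the floor-based approach would leave the trailing documents unprocessed. This is an illustration, not part of the original benchmark.

```r
library(parallel)

# Illustrative values only (not the vignette's actual sizes)
n <- 24        # number of documents
n_workers <- 4 # number of workers

# Same index arithmetic as the foreach loop
by <- floor(n / n_workers)
starts <- (seq_len(n_workers) - 1) * by + 1
finishes <- starts + by - 1
cbind(worker = seq_len(n_workers), start = starts, finish = finishes)

# When n is not divisible by n_workers, floor() drops the remainder:
# with n = 10 and 3 workers, by = 3 and document 10 is never assigned.
# splitIndices() assigns every index to exactly one chunk instead.
chunks <- splitIndices(10, 3)
chunks
```

Each element of `chunks` is a vector of document indices that could be passed to a worker, for example as `txt2[chunks[[i]]]`.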
The sequential version of rapidrake() took `r rr_time` seconds to extract keywords for `r length(txt)` documents, while the parallel version took `r rr_par_time` seconds for `r length(txt2)` documents. This suggests that the parallel version was about `r round(rr_time * cores * 3 / rr_par_time, 0)` times faster than the regular version.
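The speedup estimate above compares runs on corpora of different sizes, so the sequential time is first scaled up by `cores * 3` (the factor by which txt2 is larger than txt). The sketch below walks through that arithmetic with hypothetical timings. The numbers are made up for illustration and are not measured results, and the calculation assumes that sequential run time grows linearly with the number of documents.

```r
# Hypothetical timings, for illustration only (not measured results)
rr_time <- 2.0      # seconds, sequential run on length(txt) documents
rr_par_time <- 8.0  # seconds, parallel run on length(txt2) documents
cores <- 12

# txt2 is txt repeated cores * 3 times
scale <- cores * 3

# Estimated sequential time for the larger corpus, assuming linear scaling
est_seq_time <- rr_time * scale

# Ratio of estimated sequential time to observed parallel time
speedup <- est_seq_time / rr_par_time
speedup
```

With these made-up values the estimated speedup is 9, which is below the 12 workers used, as one would expect once the overhead of starting workers and moving data between processes is included.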