An experimental R package to parallelise some functions from the excellent {sf} and {rmapshaper} packages using the also brilliant {furrr} package. Right now, it's just parallel versions of st_join and st_filter from {sf} and ms_simplify from {rmapshaper} (although the latter now seems essentially redundant after some major performance improvements in {rmapshaper}). They won't always help and may even be slower, but sometimes they might be useful. This is just messing about right now, to be honest. The code is deliberately copied from {sf}, {rmapshaper} and {furrr} so that the functions can be used as drop-in replacements.
I've added {geoarrow} as a dependency to play with using it to pass data between cores. It seems to be fractionally faster and solves the issue of passing objects larger than the limit allowed by {furrr}. Make sure to install {arrow} if you want to try this out.
Make sure to experiment with the number of cores: it will often be much more efficient to use a small number of processes than all of your machine's available cores, because of the start-up time of those processes.
# install.packages("remotes")
remotes::install_github('h-a-graham/sfurrr')
Here is some data included in the package: the English cycle network from OpenStreetMap (downloaded with {osmextract}) and British counties from Ordnance Survey.
library(sfurrr)

# built-in functions to load the data
cwe <- cycleways_england()
gbc <- gb_counties()

basetheme::basetheme("dark") # makes it pretty
plot(gbc['geometry'], axes = TRUE)
plot(cwe['geometry'], add = TRUE, col = '#39C17360')

summary(cwe)
summary(gbc)
Now, let's say we want to do a spatial join between the cycleways and the counties so we attach the county data to the cycleway network. This might allow us to do some summarised stats on the cycle network of different counties, for example.
So let's do this with {sf}, which is loaded by default with {sfurrr}. Let's also get some timings with {tictoc}.
library(tictoc)

# ------------ `st_join` ----------------
tic()
join.sf <- st_join(cwe, gbc)
toc()

# ------------ `future_st_join` ----------------
plan(multisession, workers = 4)
tic()
join.sfurr <- future_st_join(cwe, gbc)
toc()
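As a rough illustration of the per-county summary mentioned above, here is a minimal sketch that totals cycleway length by county after a join. It assumes the county name ends up in a column called `Name` (that column is plotted later in this README, but its meaning is an assumption here), and the helper names `joined`, `lengths_df` and `county_totals` are made up for the example.

```r
library(sfurrr) # per the text above, this also loads {sf}
library(sf)

cwe <- cycleways_england()
gbc <- gb_counties()

# Attach county attributes to each cycleway segment.
# A segment that touches several counties appears once per county.
joined <- st_join(cwe, gbc)

# Segment lengths converted to kilometres, stored without geometry.
lengths_df <- st_drop_geometry(joined)
lengths_df$length_km <- as.numeric(
  units::set_units(st_length(joined), "km", mode = "standard")
)

# Total length per county (assumes `Name` is the county name column).
# Rows with a missing `Name` are dropped by aggregate().
county_totals <- aggregate(length_km ~ Name, data = lengths_df, FUN = sum)
county_totals <- county_totals[order(-county_totals$length_km), ]
head(county_totals)
```

Because segments crossing a border are counted in each county they touch, treat the totals as approximate.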
Okay, so {sf} is actually pretty fast! By using a small number of cores (here 4), we can get a slight speed-up; with any more cores it would get increasingly slow. But what about more costly spatial operations? Let's now try the option largest=TRUE, which joins based on the largest amount of intersection.
# ------------ `st_join` ----------------
tic()
joinL.sf <- st_join(cwe, gbc, largest = TRUE)
toc()

# ------------ `future_st_join` ----------------
plan(multisession, workers = 8)
tic()
joinL.sfurr <- future_st_join(cwe, gbc, largest = TRUE)
toc()
Okay, so now we see that going parallel does indeed offer some potential when using a costly spatial function. Here we use 8 processes, and it pays off more because the computation is expensive.
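If largest=TRUE is unfamiliar, the toy sketch below shows what it changes, using plain {sf} and made-up square polygons rather than the package data (the `sq()` helper and the `zones`/`parcel` objects are hypothetical). A default join returns one row per match, while largest=TRUE keeps only the match with the biggest overlap.

```r
library(sf)

# Hypothetical helper: build a rectangle from its corner coordinates.
sq <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Two neighbouring zones and one parcel straddling their shared border.
zones <- st_sf(
  zone = c("west", "east"),
  geometry = st_sfc(sq(0, 0, 5, 10), sq(5, 0, 10, 10))
)
parcel <- st_sf(id = 1, geometry = st_sfc(sq(4, 4, 8, 6)))

# Default join: the parcel intersects both zones, so it is duplicated.
join_all <- st_join(parcel, zones)

# largest = TRUE: only the zone with the greatest overlap is kept.
# The overlap with "west" has area 2 and the overlap with "east" has area 6.
join_largest <- suppressWarnings(st_join(parcel, zones, largest = TRUE))

nrow(join_all)
join_largest$zone
```

The extra intersection and area computation for every candidate match is what makes this option so much more expensive than a plain join.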
Once again, here is a comparison of the simplest approach, this time for st_filter with the default st_intersects spatial predicate.
# ------------ `st_filter` ----------------
tic()
filt_t1 <- st_filter(cwe['highway'], gbc[1:50,])
toc()

# ----------- `future_st_filter` -----------------
plan(multisession, workers = 4)
tic()
filt_t2 <- future_st_filter(cwe['highway'], gbc[1:50,])
toc()
Again, with a limited number of cores there is a small speed-up, but not that much. Let's use the st_within spatial predicate to filter out cycleways that are not located entirely within the county areas. This is kind of pointless for this use case and is really just illustrative.
# ------------ `st_filter` ----------------
tic()
within_filt_t1 <- st_filter(joinL.sfurr, gbc[1:50,], .predicate = st_within)
toc()

# ----------- `future_st_filter` -----------------
plan(multisession, workers = 6)
tic()
within_filt_t2 <- future_st_filter(joinL.sfurr, gbc[1:50,], .predicate = st_within)
toc()
Cool, so in this case it is faster!
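To make the difference between the two predicates used above concrete, here is a small sketch with plain {sf} and made-up geometries (the `county` square and the three `paths` are hypothetical, not package data). The default st_intersects keeps anything that touches the polygon, whereas st_within keeps only features lying entirely inside it.

```r
library(sf)

# A single square "county" from (0, 0) to (10, 10).
county <- st_sf(geometry = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)
)))))

# Three paths: fully inside, crossing the border, and fully outside.
paths <- st_sf(
  id = c("inside", "crossing", "outside"),
  geometry = st_sfc(
    st_linestring(rbind(c(1, 1), c(5, 5))),
    st_linestring(rbind(c(8, 8), c(12, 12))),
    st_linestring(rbind(c(11, 1), c(15, 1)))
  )
)

# Default predicate (st_intersects): keeps "inside" and "crossing".
hits_intersects <- st_filter(paths, county)$id

# st_within: keeps only "inside".
hits_within <- st_filter(paths, county, .predicate = st_within)$id

hits_intersects
hits_within
```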
This is not a globally useful idea, but in some cases, when using very large spatial datasets, you may get a speed-up by running spatial filters/joins in parallel. Any speed-up will depend on the number of processes you can run; remember that it is probably not wise to use all the cores at your disposal. Sometimes less is more!
plot(gbc['geometry'], axes = TRUE)
plot(within_filt_t2['Name'], add = TRUE)
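Since the best number of workers depends on your machine and data, one way to follow the advice above is to time the same join under a few different plans. This is only a sketch: the candidate worker counts are arbitrary, `timings_df` is a made-up name, and base `system.time()` is used here instead of {tictoc} so the results can be collected in a table.

```r
library(sfurrr)
library(future)

cwe <- cycleways_england()
gbc <- gb_counties()

# Arbitrary candidate worker counts, capped at what this machine offers.
worker_counts <- c(1, 2, 4, 8)
worker_counts <- worker_counts[worker_counts <= availableCores()]

# Time the same parallel join once per worker count.
elapsed <- vapply(worker_counts, function(n) {
  plan(multisession, workers = n)
  system.time(future_st_join(cwe, gbc))[["elapsed"]]
}, numeric(1))

# Return to sequential processing when done.
plan(sequential)

timings_df <- data.frame(workers = worker_counts, elapsed_s = elapsed)
timings_df
```

A single run per setting is noisy, so repeat it a few times before settling on a worker count.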