A visual introduction to rollply
In rollply: Moving-Window Add-on for 'plyr'

A visual introduction to rollply

knitr::opts_chunk$set(cache=TRUE)

library(proj4) 
library(ggplot2) 
library(rgdal)
library(tidyr) 
library(rollply) 
library(plyr)

Rollply is a small function built upon plyr's ddply function to facilitate moving-window-based computations. If you have a data.frame, give the dimensions over which the window should move, and rollply will make the subsets, apply the function on them and then combine the results into a data.frame.

In short, rollply extends the split-apply-combine strategy to moving-window computations, using a similar syntax. This tutorial thus assumes some basic familiarity with plyr, and in particular the function ddply. Let's start with a simple example.

Examples

A time-series example

A simple use of moving-windows is adding a trendline to a time series plot. We will use the CO2 data from the Mauna Loa NOAA Observatory as a environmentally-conscious example.

# Download and format data
url <- "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt"
hawaii <- read.table(url)[ ,c(3,4)]
names(hawaii) <- c('date','CO2')
hawaii[hawaii$CO2 < 0, "CO2"] <- NA # mark NAs as such

# Display original trend
CO2.plot <- ggplot(hawaii) + geom_line(aes(date, CO2)) + ylab("CO2 (ppm)")
print(CO2.plot)

There is a clear trend here! Let's smooth out the season effect (the wiggles in the black curve). We'll use a window with a size of one year to compute a yearly average.

# with smoothed trend
hawaii.smoothed <- rollply(hawaii, ~ date, wdw.size = 1,
                           summarize, CO2.mean = mean(CO2, na.rm = TRUE), )
CO2.plot + geom_line(aes(date, CO2.mean), data = hawaii.smoothed, color = 'red')

And voilà! A rather nice, although a bit depressing trend line for our data. When working on time-series, this represents an alternative to specialized packages such as zoo that also provide tools to apply functions over rolling windows.

Let's take a more complex example that works on two-dimensional data.

Exploring french town names

If you open a map of France, you'll notice that towns and villages tend to have names that follow patterns. For example, Brittany's towns are famous for having names starting with a "ker-". Many towns in Lorraine end in "-ange" (a legacy from the german ending "-ingen").

Can we visually explore the distribution of french towns names ? rollply can help here.

A moving-window approach essentially boils down to the following steps:

Build a grid over the whole country
For each point of the grid, take all the data poitns (towns) that are less than xx kilometers from it.
Check the names of the towns and count the ones matching a pattern. Return the number of matching towns.
Combine the results in a data.frame

Like ddply (in package plyr), rollply takes care of points 1,2 and 4. We just need to define a function that does number 3.

Let's download a dataset of town names with their geographical coordinates:

# Download and prepare dataset
# Source and decription: 
# https://publicdata.eu/dataset/listes-des-communes-par-rgions-dpartements-circonscriptions

tmpfile <- tempfile()
url <- paste0('http://www.nosdonnees.fr/wiki/images/b/b5/',
         'EUCircos_Regions_departements_circonscriptions_communes_gps.csv.gz')
download.file(url, destfile = tmpfile)
dat <- read.csv2(tmpfile, stringsAsFactors = FALSE)
file.remove(tmpfile)
dat <- dat[with(dat, latitude != "" | ! grepl(",", latitude)), 
           c('nom_commune', 'latitude', 'longitude')]
colnames(dat) <- c('name', 'latitude', 'longitude')

dat[ ,'name']      <- as.factor(tolower(dat[ ,'name']))
dat[ ,'latitude']  <- as.numeric(dat[ ,'latitude'])
dat[ ,'longitude'] <- as.numeric(dat[ ,'longitude'])

# We use an equirectangular projection to work on true distances
dat <- na.omit(dat)
dat <- data.frame(dat, proj4::project(dat[ ,c('longitude','latitude')],
                                      '+proj=eqc'))
dat <- dat[ ,c('name','x','y')]

# Visualise distribution of towns
str(dat)
ggplot(dat) + geom_point(aes(x, y), alpha=.1)

Nice, let's see whether "ker"-named towns mainly occur in Brittany.

# This is our custom function : it accepts a data frame and a regular 
# expression and returns the number of matches in the column "name", formatted
# withing a data.frame (this plays well with ddply).
how_many_with_name <- function(df, regexp) { 
  data.frame(ker = sum(grepl(regexp, df[ ,'name'])))
}

dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4, grid_npts = 10e3,
                    how_many_with_name, regexp = "^ker")

ggplot(dat_roll) +
  geom_raster(aes(x, y, fill = ker)) +
  scale_fill_distiller(palette = 'Greys')

It seems there are indeed many towns with a name starting with "ker" in Brittany (and a couple in Alsace/Lorraine, too). However, our plot is pretty ugly: we cannot see the actual country shape! When nothing is specified, rollply computes its values over a rectangular grid than spans the maximum width and height of the original dataset.

Here, the spatial distribution of towns reflects pretty nicely the overall shape of the country, so instead of a building a rectangular grid, we can choose to build it only within the alpha-hull of the set of towns.

dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4,
                    grid_npts = 10e3, # number of grid points 
                    grid_type = "ahull_fill", # grid type: fills an alpha-hull with the given number of points
                    grid_opts = list(alpha = .05, # shape parameter of the hull
                                     verbose = TRUE), 
                    how_many_with_name, regexp = "^ker")

Note that building an alpha hull-based grid can be quite computationally expensive, as it requires iterating to find the suitable number of points. However, one can pregenerate grids using the build_grid_* functions family and supply them directly to rollply using the grid argument, so this computation can be only done once (see below).

So, are there really more town named ker-something in Brittany than elsewhere? I'll let you judge (spoiler: yes!):

ggplot(dat_roll) +
  geom_raster(aes(x, y, fill = ker)) +
  scale_fill_distiller(palette = 'Greys')

Building grids: helper functions

As seen in the french towns example, rollply uses internally a grid of coordinates. For each points of this grid it selects the observations within the window, then applies the function on this subset. The user can either provide a grid as a data.frame or rollply will take care of building one automatically.

Several helper functions are provided to build nice grids, they all start with build_grid_.

build_grid_identical builds a grid with an identical number of points on each dimension
build_grid_squaretile (2D only) builds a grid of points with square tiles (same distance in X and Y between each points)
build_grid_ahull_crop (2D only) builds a grid of points with square tiles, then discard all the points that do not fall in the alpha-hull of the actual data.
build_grid_ahull_fill (2D only) same as above, but iteratively tries to build a grid with a final number of points approximately equal to the requested number of points (parameter grid_npts).

data(meadow)

# We project lon/lat to UTM zone 11 (southern california)
meadow[ ,c('x','y')] <- proj4::project(as.matrix(meadow[ ,c('lon','lat')]),
                                       "+proj=utm +zone=11 +ellps=WGS84")

ggplot(meadow, aes(x,y)) +
  geom_point(shape='+') +
  ggtitle('Sample points') +
  xlab('UTM X') +
  ylab('UTM Y')

For this example, we will use samples from a vegetation survey in a meadow in Yosemite National Park, California.

# We request a grid with approximately this number of points:
npts <- 500
base.plot <- ggplot(NULL, aes(x,y)) +
               geom_point(data = meadow, shape='+') +
               xlab('UTM X') +
               ylab('UTM Y') 

grids <- list(identical  = build_grid_identical(meadow[ ,c('x','y')], npts),
              squaretile = build_grid_squaretile(meadow[ ,c('x','y')], npts),
              ahull_crop = build_grid_ahull_crop(meadow[ ,c('x','y')], npts),
              ahull_fill = build_grid_ahull_fill(meadow[ ,c('x','y')], npts))

plot_grid <- function(grid_type) {
  base.plot +
    geom_point(data=grids[[grid_type]]) +
    annotate('text', x = min(meadow$x), y = min(meadow$y),
            label = paste(nrow(grids[[grid_type]]), "points"),
            hjust = 0, vjust = 1)
}

build_grid_identical:

plot_grid('identical')

build_grid_squaretile (mind the change of the figure aspect ratio: tiles are squares):

plot_grid('squaretile')

build_grid_ahull_crop:

plot_grid('ahull_crop')

build_grid_ahull_fill:

plot_grid('ahull_fill')

Note that the outline as given by the alpha-hull of a set of points depends on a parameter, alpha that determines its intrincateness. A reasonable default is provided, but it is a good idea to try different values to get an idea of the best fit. For more information on the alpha hull, please refer to the package documentation.

Notes

Performance considerations

rollply inherits plyr's pros and cons. As in the latter's functions, parallelism or progress report is just one argument away (set .parallel = TRUE or .progress = "time"). However, rollply does a lot of data.frame subsetting which remains an expensive operation in R.