knitr::opts_chunk$set(cache=TRUE)
library(proj4) library(ggplot2) library(rgdal) library(tidyr) library(rollply) library(plyr)
Rollply is a small function built upon plyr's ddply
function to facilitate moving-window-based computations. If you have a
data.frame
, give the dimensions over which the window should
move, and rollply will make the subsets, apply the function on them and then
combine the results into a data.frame.
In short, rollply extends the
split-apply-combine strategy to moving-window
computations, using a similar syntax. This tutorial thus assumes some basic
familiarity with plyr
, and in particular the function ddply
.
Let's start with a simple example.
A simple use of moving-windows is adding a trendline to a time series plot. We will use the CO2 data from the Mauna Loa NOAA Observatory as a environmentally-conscious example.
# Download and format data url <- "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt" hawaii <- read.table(url)[ ,c(3,4)] names(hawaii) <- c('date','CO2') hawaii[hawaii$CO2 < 0, "CO2"] <- NA # mark NAs as such # Display original trend CO2.plot <- ggplot(hawaii) + geom_line(aes(date, CO2)) + ylab("CO2 (ppm)") print(CO2.plot)
There is a clear trend here! Let's smooth out the season effect (the wiggles in the black curve). We'll use a window with a size of one year to compute a yearly average.
# with smoothed trend hawaii.smoothed <- rollply(hawaii, ~ date, wdw.size = 1, summarize, CO2.mean = mean(CO2, na.rm = TRUE), ) CO2.plot + geom_line(aes(date, CO2.mean), data = hawaii.smoothed, color = 'red')
And voilĂ ! A rather nice, although a bit depressing trend line for our data.
When working on time-series, this represents an alternative to specialized
packages such as zoo
that also provide tools to apply functions over rolling
windows.
Let's take a more complex example that works on two-dimensional data.
If you open a map of France, you'll notice that towns and villages tend to have names that follow patterns. For example, Brittany's towns are famous for having names starting with a "ker-". Many towns in Lorraine end in "-ange" (a legacy from the german ending "-ingen").
Can we visually explore the distribution of french towns names ? rollply can help here.
A moving-window approach essentially boils down to the following steps:
data.frame
Like ddply
(in package plyr
), rollply
takes care of points 1,2 and 4. We
just need to define a function that does number 3.
Let's download a dataset of town names with their geographical coordinates:
# Download and prepare dataset # Source and decription: # https://publicdata.eu/dataset/listes-des-communes-par-rgions-dpartements-circonscriptions tmpfile <- tempfile() url <- paste0('http://www.nosdonnees.fr/wiki/images/b/b5/', 'EUCircos_Regions_departements_circonscriptions_communes_gps.csv.gz') download.file(url, destfile = tmpfile) dat <- read.csv2(tmpfile, stringsAsFactors = FALSE) file.remove(tmpfile) dat <- dat[with(dat, latitude != "" | ! grepl(",", latitude)), c('nom_commune', 'latitude', 'longitude')] colnames(dat) <- c('name', 'latitude', 'longitude') dat[ ,'name'] <- as.factor(tolower(dat[ ,'name'])) dat[ ,'latitude'] <- as.numeric(dat[ ,'latitude']) dat[ ,'longitude'] <- as.numeric(dat[ ,'longitude']) # We use an equirectangular projection to work on true distances dat <- na.omit(dat) dat <- data.frame(dat, proj4::project(dat[ ,c('longitude','latitude')], '+proj=eqc')) dat <- dat[ ,c('name','x','y')]
# Visualise distribution of towns str(dat) ggplot(dat) + geom_point(aes(x, y), alpha=.1)
Nice, let's see whether "ker"-named towns mainly occur in Brittany.
# This is our custom function : it accepts a data frame and a regular # expression and returns the number of matches in the column "name", formatted # withing a data.frame (this plays well with ddply). how_many_with_name <- function(df, regexp) { data.frame(ker = sum(grepl(regexp, df[ ,'name']))) } dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4, grid_npts = 10e3, how_many_with_name, regexp = "^ker") ggplot(dat_roll) + geom_raster(aes(x, y, fill = ker)) + scale_fill_distiller(palette = 'Greys')
It seems there are indeed many towns with a name starting with "ker" in Brittany (and a couple in Alsace/Lorraine, too). However, our plot is pretty ugly: we cannot see the actual country shape! When nothing is specified, rollply computes its values over a rectangular grid than spans the maximum width and height of the original dataset.
Here, the spatial distribution of towns reflects pretty nicely the overall shape of the country, so instead of a building a rectangular grid, we can choose to build it only within the alpha-hull of the set of towns.
dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4, grid_npts = 10e3, # number of grid points grid_type = "ahull_fill", # grid type: fills an alpha-hull with the given number of points grid_opts = list(alpha = .05, # shape parameter of the hull verbose = TRUE), how_many_with_name, regexp = "^ker")
Note that building an alpha hull-based grid can be quite computationally
expensive, as it requires iterating to find the suitable number of points.
However, one can pregenerate grids using the build_grid_*
functions family and
supply them directly to rollply using the grid
argument, so this computation
can be only done once (see below).
So, are there really more town named ker-something in Brittany than elsewhere? I'll let you judge (spoiler: yes!):
ggplot(dat_roll) + geom_raster(aes(x, y, fill = ker)) + scale_fill_distiller(palette = 'Greys')
As seen in the french towns example, rollply uses internally a grid of
coordinates. For each points of this grid it selects the observations within the
window, then applies the function on this subset. The user can either provide a
grid as a data.frame
or rollply will take care of building one automatically.
Several helper functions are provided to build nice grids, they all start with build_grid_.
build_grid_identical builds a grid with an identical number of points on each dimension
build_grid_squaretile (2D only) builds a grid of points with square tiles (same distance in X and Y between each points)
build_grid_ahull_crop (2D only) builds a grid of points with square tiles, then discard all the points that do not fall in the alpha-hull of the actual data.
build_grid_ahull_fill (2D only) same as above, but iteratively tries to
build a grid with a final number of points approximately equal to
the requested number of points (parameter grid_npts
).
data(meadow) # We project lon/lat to UTM zone 11 (southern california) meadow[ ,c('x','y')] <- proj4::project(as.matrix(meadow[ ,c('lon','lat')]), "+proj=utm +zone=11 +ellps=WGS84") ggplot(meadow, aes(x,y)) + geom_point(shape='+') + ggtitle('Sample points') + xlab('UTM X') + ylab('UTM Y')
For this example, we will use samples from a vegetation survey in a meadow in Yosemite National Park, California.
# We request a grid with approximately this number of points: npts <- 500 base.plot <- ggplot(NULL, aes(x,y)) + geom_point(data = meadow, shape='+') + xlab('UTM X') + ylab('UTM Y') grids <- list(identical = build_grid_identical(meadow[ ,c('x','y')], npts), squaretile = build_grid_squaretile(meadow[ ,c('x','y')], npts), ahull_crop = build_grid_ahull_crop(meadow[ ,c('x','y')], npts), ahull_fill = build_grid_ahull_fill(meadow[ ,c('x','y')], npts)) plot_grid <- function(grid_type) { base.plot + geom_point(data=grids[[grid_type]]) + annotate('text', x = min(meadow$x), y = min(meadow$y), label = paste(nrow(grids[[grid_type]]), "points"), hjust = 0, vjust = 1) }
plot_grid('identical')
plot_grid('squaretile')
plot_grid('ahull_crop')
plot_grid('ahull_fill')
Note that the outline as given by the alpha-hull of a set of points depends on
a parameter, alpha
that determines its intrincateness. A reasonable default
is provided, but it is a good idea to try different values to get an idea of
the best fit. For more information on the alpha hull, please refer to the
package
documentation.
rollply
inherits plyr's pros and cons. As in the latter's functions,
parallelism or progress report is just one argument away (set .parallel = TRUE
or .progress = "time"
). However, rollply does a lot of data.frame
subsetting
which remains an expensive operation in R.
rollply has well-known bugs and is still in active development! Development happens at github. Do not hesitate to post issues or pull requests.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.