stopifnot(requireNamespace("proj4", quietly = TRUE) &
          requireNamespace("ggplot2", quietly = TRUE) &
          requireNamespace("rgdal", quietly = TRUE) &
          requireNamespace("tidyr", quietly = TRUE) &
          requireNamespace("rollply", quietly = TRUE) &
          requireNamespace("plyr", quietly = TRUE))

Rollply is a small function built upon plyr’s ddply function to facilitate moving-window-based computations. If you have a data.frame, give the dimensions over which the function the window should move, and rollply will make the subsets, apply the function on them and then combine the results.

In short, rollply extends the split-apply-combine strategy to moving-window computations, using a similar syntax. Let’s start with a simple example.

Examples

A time-series example

A simple use of moving-windows is adding a trendline to a time series plot. We will use the CO2 data from the Mauna Loa NOAA Observatory as a environmentally-conscious example.

# Download and format data
url <- "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt"
hawaii <- read.table(url)[ ,c(3,4)]
names(hawaii) <- c('date','CO2')
hawaii[hawaii$CO2 < 0, "CO2"] <- NA # mark NAs as such

# Display original trend
CO2.plot <- ggplot(hawaii) + geom_line(aes(date, CO2)) + ylab("CO2 (ppm)")
print(CO2.plot)

There is a clear trend here! Let’s smooth out the season effect (the wiggles in the black curve). We’ll use a window with a size of one year to compute a yearly average.

# with smoothed trend
hawaii.smoothed <- rollply(hawaii, ~ date, wdw.size = 1,
                           summarize, CO2.mean = mean(CO2, na.rm = TRUE), )
CO2.plot + geom_line(aes(date, CO2.mean), data = hawaii.smoothed, color = 'red')

And voilà! A rather nice, although a bit depressing, trend line for our data.

Now, this example is a bit silly: if you are working on regularly-spaced time series, you might be better off using one of the specialized packages (e.g. TTR). However, for a quick-and-dirty approach here Rollply is perfect as it only uses standard data.frames.

Let’s take a more complex example.

Exploring french town names

If you open a map of France, you’ll notice that towns and villages tend to have names that follow patterns. For instance, Brittany’s towns are famous for having names starting with a “ker-”. Many towns in Lorraine end in “-ange” (a legacy from the german ending “-ingen”).

Can we visually explore the distribution of town names ?

A moving-window approach essentially boils down to the following steps:

  1. Build a grid over the whole country
  2. For each point of the grid, take all the towns that are less than xx kilometers from it.
  3. Check the names of the towns and count the ones matching a pattern. Return the number of matching towns.
  4. Combine the results in a data.frame

Like ddply (in package plyr), rollply takes care of points 1,2 and 4. We just need to define a function that does number 3!

Let’s download a dataset of town names with their geographical coordinates:

# Download and prepare dataset
tmpfile <- tempfile()
url <- paste0('http://www.nosdonnees.fr/wiki/images/b/b5/',
         'EUCircos_Regions_departements_circonscriptions_communes_gps.csv.gz')
download.file(url, destfile = tmpfile)
dat <- read.csv2(tmpfile, stringsAsFactors = FALSE)
file.remove(tmpfile)
dat <- dat[ ,c('nom_commune', 'latitude', 'longitude')]
colnames(dat) <- c('name', 'latitude', 'longitude')

dat[ ,'name']      <- as.factor(tolower(dat[ ,'name']))
dat[ ,'latitude']  <- as.numeric(dat[ ,'latitude'])
dat[ ,'longitude'] <- as.numeric(dat[ ,'longitude'])

# We use an equirectangular projection to work on true distances
dat <- data.frame(dat, proj4::project(dat[ ,c('longitude','latitude')],
                                      '+proj=eqc'))
dat <- na.omit(dat)
dat <- dat[ ,c('name','x','y')]
# Visualise distribution of towns
str(dat)
## 'data.frame':    33814 obs. of  3 variables:
##  $ name: Factor w/ 34141 levels "aast","abainville",..: 1264 2487 2798 2821 3515 4005 4708 5689 5730 6645 ...
##  $ x   : num  574507 585626 587480 561534 600452 ...
##  $ y   : num  5146469 5159442 5152029 5155736 5129790 ...
ggplot(dat) + geom_point(aes(x, y), alpha=.1)

Nice, let’s see whether “ker”-named towns are predominantly in Brittany.

# This is our custom function.
how_many_with_name <- function(regxp, X) sum(grepl(regxp, X))


dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4, grid_npts = 10e3,
                    summarize, ker = how_many_with_name("^ker", name))

ggplot(dat_roll) +
  geom_raster(aes(x, y, fill = ker)) +
  scale_fill_distiller(palette = 'Greys')

It seems there are indeed many towns with a name starting with “ker” in Brittany (and a couple in Alsace/Lorraine, too). However, our plot is pretty ugly: we can’t see the actual country shape! When nothing is specified, rollply computes its values over a rectangular grid than spans the maximum width and height of the original dataset.

Here our spatial dataset of towns reflect pretty well the overall shape of the country, so instead of a rectangular grid we can build a grid inside the overall outline (the alpha-hull) of the dataset.

dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4,
                    grid_npts = 10e3,
                    grid_type = "ahull_fill",
                    grid_opts = list(alpha = .02, # shape parameter for hull
                                     verbose = TRUE),
                    summarize, ker = how_many_with_name("^ker", name))
## Building grid in alphahull...
## run 1, error=-0.54% (4605 points)
## run 2, error=-0.29% (7082 points)
## run 3, error=-0.08% (9162 points)
## run 4, error=0% (9950 points)

Note that building an alpha hull-based grid is very slow (and not bug-free), as it is built on the pure-R alphahull package. However, one can pregenerate grids using the build_grid_* functions family and supply them directly to rollply using the grid argument, so this computation can be only done once (see below).

So, are there really more town named ker-something in Brittany than elsewhere? I’ll let you judge (spoiler: yes!):

ggplot(dat_roll) +
  geom_raster(aes(x, y, fill = ker)) +
  scale_fill_distiller(palette = 'Greys')

Building grids: helper functions

As seen in the french towns example, rollply uses internally a grid of coordinates. For each points of this grid it selects the observations within the window, then applies the function on this subset. The user can either provide a grid as a data.frame or rollply will take care of building one automatically.

Several helper functions are provided to build nice grids, they all start with build_grid_.

For this example, we will use samples from a vegetation survey in a meadow in Yosemite National Park, California.

# We request a grid with approximately this number of points:
npts <- 500
base.plot <- ggplot(NULL, aes(x,y)) +
               geom_point(data=meadow, shape='+') +
               xlab('UTM X') +
               ylab('UTM Y')

grids <- list(identical  = build_grid_identical(meadow[ ,c('x','y')], npts),
              squaretile = build_grid_squaretile(meadow[ ,c('x','y')], npts),
              ahull_crop = build_grid_ahull_crop(meadow[ ,c('x','y')], npts),
              ahull_fill = build_grid_ahull_fill(meadow[ ,c('x','y')], npts))

plot_grid <- function(grid_type) {
  base.plot +
    geom_point(data=grids[[grid_type]]) +
    annotate('text', x = min(meadow$x), y = min(meadow$y),
            label = paste(nrow(grids[[grid_type]]), "points"),
            hjust = 0, vjust = 1)
}
plot_grid('identical')

plot_grid('squaretile')

plot_grid('ahull_crop')

plot_grid('ahull_fill')

Let’s produce a map of the soil moisture in the meadow:

# Build a high-resolution grid
hires_grid <- build_grid_ahull_fill(meadow[ ,c('x','y')], 1e4,
                                    grid_opts = list(verbose = TRUE))
## Building grid in alphahull...
## run 1, error=-0.58% (4191 points)
## run 2, error=-0.33% (6668 points)
## run 3, error=-0.12% (8835 points)
## run 4, error=-0.02% (9836 points)
# Compute the local average( moisture of the soil (to 40m)
meadow_roll <- rollply(meadow, ~ x + y, wdw.size = 40,
                    grid = hires_grid,
                    summarize, meanvwc.mean = mean(meanvwc, na.rm = TRUE))
# Plot the map
ggplot() +
  geom_raster(aes(x, y, fill = meanvwc.mean), data = meadow_roll) +
  scale_fill_distiller('RdYlGn', type = "div") +
  geom_point(aes(x, y), data = meadow, pch = "x", size = 1)

Notes

Performance considerations

rollply inherits plyr’s pros and cons. As in the latter’s functions, parallelism is just one argument away (set .parallel = TRUE to use the foreach backend). However, rollply does a lot of data.frame subsetting and this is still an expensive operation in R. You can however add a .progress='time' and have an estimate of how long your coffee break should be.

Bugs & comments

Rollply is my first package, has well-known bugs and is still in active development!

Development happens at github. Do not hesitate to post issues or pull requests.