stopifnot(requireNamespace("proj4", quietly = TRUE) &&
requireNamespace("ggplot2", quietly = TRUE) &&
requireNamespace("rgdal", quietly = TRUE) &&
requireNamespace("tidyr", quietly = TRUE) &&
requireNamespace("rollply", quietly = TRUE) &&
requireNamespace("plyr", quietly = TRUE))
Rollply is a small function built upon plyr’s ddply function to facilitate moving-window-based computations. If you have a data.frame, give the dimensions over which the window should move, and rollply will make the subsets, apply the function to them and then combine the results.
In short, rollply extends the split-apply-combine strategy to moving-window computations, using a similar syntax. Let’s start with a simple example.
A simple use of moving windows is adding a trendline to a time series plot. We will use the CO2 data from the Mauna Loa NOAA Observatory as an environmentally conscious example.
# Download and format data
url <- "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt"
hawaii <- read.table(url)[ ,c(3,4)]
names(hawaii) <- c('date','CO2')
hawaii[hawaii$CO2 < 0, "CO2"] <- NA # mark NAs as such
# Display original trend
CO2.plot <- ggplot(hawaii) + geom_line(aes(date, CO2)) + ylab("CO2 (ppm)")
print(CO2.plot)
There is a clear trend here! Let’s smooth out the seasonal effect (the wiggles in the black curve). We’ll use a window with a size of one year to compute a yearly average.
# with smoothed trend
hawaii.smoothed <- rollply(hawaii, ~ date, wdw.size = 1,
summarize, CO2.mean = mean(CO2, na.rm = TRUE))
CO2.plot + geom_line(aes(date, CO2.mean), data = hawaii.smoothed, color = 'red')
And voilà! A rather nice, although a bit depressing, trend line for our data.
Now, this example is a bit silly: if you are working on regularly-spaced time series, you might be better off using one of the specialized packages (e.g. TTR). However, for a quick-and-dirty approach, Rollply is perfect here as it only uses standard data.frames.
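To make explicit what a one-year window does, here is a minimal base R sketch of the same idea on made-up data. It is not rollply's implementation. The `moving_mean` helper and the `toy` series are hypothetical, and the sketch assumes a window centred on each date, which may differ from rollply's own convention.

```r
# Toy monthly series: a linear trend plus a seasonal cycle (made-up data)
toy <- data.frame(date = seq(2000, 2005, by = 1/12))
toy$value <- 2 * (toy$date - 2000) + sin(2 * pi * toy$date)

# Hypothetical helper. Assumption: the window is centred on each date and
# spans `wdw_size` in total (rollply's own convention may differ).
moving_mean <- function(df, wdw_size) {
  centres <- df$date
  means <- vapply(centres, function(ctr) {
    in_window <- abs(df$date - ctr) <= wdw_size / 2  # split: rows in the window
    mean(df$value[in_window], na.rm = TRUE)          # apply: summarise them
  }, numeric(1))
  data.frame(date = centres, value.mean = means)     # combine
}

smoothed <- moving_mean(toy, wdw_size = 1)
head(smoothed)
```

Averaging over a full year cancels most of the seasonal cycle, so `value.mean` follows the underlying trend much more closely than `value` does.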
Let’s take a more complex example.
If you open a map of France, you’ll notice that towns and villages tend to have names that follow patterns. For instance, Brittany’s towns are famous for having names starting with “ker-”. Many towns in Lorraine end in “-ange” (a legacy of the German ending “-ingen”).
Can we visually explore the distribution of town names?
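A moving-window approach essentially boils down to the following steps:
1. Build a grid of points covering the area of interest
2. For each point of the grid, select the observations that fall within the window
3. Apply a function to this subset
4. Combine the results in a data.frame
Like ddply (in package plyr), rollply takes care of points 1, 2 and 4. We just need to define a function that does number 3!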
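The moving-window recipe can be sketched by hand in base R. This is only an illustration on made-up points, not how rollply is implemented. The names `pts`, `count_ker` and `wdw_size` are hypothetical, and treating the window size as the radius of a circular window is an assumption.

```r
set.seed(42)
# Made-up points with a made-up name column
pts <- data.frame(x = runif(200), y = runif(200),
                  name = sample(c("kerfoo", "bar", "baz"), 200, replace = TRUE),
                  stringsAsFactors = FALSE)

wdw_size <- 0.2  # assumption: used here as the radius of a circular window

# Build a regular grid spanning the data
grid <- expand.grid(x = seq(min(pts$x), max(pts$x), length.out = 10),
                    y = seq(min(pts$y), max(pts$y), length.out = 10))

# For one grid point: select the observations in the window, then summarise
count_ker <- function(i) {
  d <- sqrt((pts$x - grid$x[i])^2 + (pts$y - grid$y[i])^2)
  names_in_window <- pts$name[d <= wdw_size]
  sum(grepl("^ker", names_in_window))
}
ker <- vapply(seq_len(nrow(grid)), count_ker, integer(1))

# Combine the results in a data.frame, one row per grid point
result <- data.frame(grid, ker = ker)
head(result)
```

Only `count_ker`'s last line is specific to the question being asked; everything else is the bookkeeping that a moving-window helper can take over.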
Let’s download a dataset of town names with their geographical coordinates:
# Download and prepare dataset
tmpfile <- tempfile()
url <- paste0('http://www.nosdonnees.fr/wiki/images/b/b5/',
'EUCircos_Regions_departements_circonscriptions_communes_gps.csv.gz')
download.file(url, destfile = tmpfile)
dat <- read.csv2(tmpfile, stringsAsFactors = FALSE)
file.remove(tmpfile)
dat <- dat[ ,c('nom_commune', 'latitude', 'longitude')]
colnames(dat) <- c('name', 'latitude', 'longitude')
dat[ ,'name'] <- as.factor(tolower(dat[ ,'name']))
dat[ ,'latitude'] <- as.numeric(dat[ ,'latitude'])
dat[ ,'longitude'] <- as.numeric(dat[ ,'longitude'])
# We use an equirectangular projection to work on true distances
dat <- data.frame(dat, proj4::project(dat[ ,c('longitude','latitude')],
'+proj=eqc'))
dat <- na.omit(dat)
dat <- dat[ ,c('name','x','y')]
# Visualise distribution of towns
str(dat)
## 'data.frame': 33814 obs. of 3 variables:
## $ name: Factor w/ 34141 levels "aast","abainville",..: 1264 2487 2798 2821 3515 4005 4708 5689 5730 6645 ...
## $ x : num 574507 585626 587480 561534 600452 ...
## $ y : num 5146469 5159442 5152029 5155736 5129790 ...
ggplot(dat) + geom_point(aes(x, y), alpha=.1)
Nice, let’s see whether “ker”-named towns are predominantly in Brittany.
# This is our custom function.
how_many_with_name <- function(regxp, X) sum(grepl(regxp, X))
dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4, grid_npts = 10e3,
summarize, ker = how_many_with_name("^ker", name))
ggplot(dat_roll) +
geom_raster(aes(x, y, fill = ker)) +
scale_fill_distiller(palette = 'Greys')
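As a quick sanity check of the counting function, it can be run on a handful of names. The `towns` vector below is a made-up list for illustration, not taken from the dataset.

```r
# Same definition as above, repeated so that the snippet runs on its own
how_many_with_name <- function(regxp, X) sum(grepl(regxp, X))

towns <- c("kerlouan", "kermaria", "hayange", "paris", "bunker")

# "^" anchors the pattern at the start of the name, so "bunker" is not counted
how_many_with_name("^ker", towns)   # 2
# "$" anchors the pattern at the end of the name
how_many_with_name("ange$", towns)  # 1
```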
It seems there are indeed many towns with a name starting with “ker” in Brittany (and a couple in Alsace/Lorraine, too). However, our plot is pretty ugly: we can’t see the actual country shape! When nothing is specified, rollply computes its values over a rectangular grid that spans the maximum width and height of the original dataset.
Here our spatial dataset of towns reflects the overall shape of the country pretty well, so instead of a rectangular grid we can build a grid inside the overall outline (the alpha-hull) of the dataset.
dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4,
grid_npts = 10e3,
grid_type = "ahull_fill",
grid_opts = list(alpha = .02, # shape parameter for hull
verbose = TRUE),
summarize, ker = how_many_with_name("^ker", name))
## Building grid in alphahull...
## run 1, error=-0.54% (4605 points)
## run 2, error=-0.29% (7082 points)
## run 3, error=-0.08% (9162 points)
## run 4, error=0% (9950 points)
Note that building an alpha hull-based grid is very slow (and not bug-free), as it is built on the pure-R alphahull package. However, one can pregenerate grids using the build_grid_* family of functions and supply them directly to rollply using the grid argument, so this computation only needs to be done once (see below).
So, are there really more towns named ker-something in Brittany than elsewhere? I’ll let you judge (spoiler: yes!):
ggplot(dat_roll) +
geom_raster(aes(x, y, fill = ker)) +
scale_fill_distiller(palette = 'Greys')
As seen in the French towns example, rollply internally uses a grid of coordinates. For each point of this grid it selects the observations within the window, then applies the function to this subset. The user can either provide a grid as a data.frame or let rollply take care of building one automatically.
Several helper functions are provided to build nice grids; they all start with build_grid_.
build_grid_identical builds a grid with the same number of points on each dimension
build_grid_squaretile (2D only) builds a 2D grid of points, with a number of points on each dimension that depends on the length of that dimension
build_grid_ahull_crop (2D only) builds a 2D grid of points, then discards all the points that do not fall within the alpha-hull of the actual data
build_grid_ahull_fill (2D only) same as above, but tries to build a grid with a final number of points approximately equal to the one asked for (parameter grid_npts)
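The difference between the first two grid types can be illustrated in base R on a made-up, elongated extent. This sketch does not call the build_grid_* functions and is not their implementation; it only mimics the two layouts, and all variable names are hypothetical.

```r
# Made-up elongated extent: 4 units wide, 1 unit high
xr <- c(0, 4)
yr <- c(0, 1)
npts <- 400

# "identical"-style layout: same number of points on each dimension
n_each <- round(sqrt(npts))
grid_identical <- expand.grid(x = seq(xr[1], xr[2], length.out = n_each),
                              y = seq(yr[1], yr[2], length.out = n_each))

# "squaretile"-style layout: points per dimension proportional to its length,
# so that the spacing is roughly the same along x and y
ratio <- diff(xr) / diff(yr)
ny <- round(sqrt(npts / ratio))
nx <- round(npts / ny)
grid_square <- expand.grid(x = seq(xr[1], xr[2], length.out = nx),
                           y = seq(yr[1], yr[2], length.out = ny))

# Spacing along x and y for both layouts
c(identical_dx = diff(xr) / (n_each - 1), identical_dy = diff(yr) / (n_each - 1))
c(square_dx = diff(xr) / (nx - 1), square_dy = diff(yr) / (ny - 1))
```

Both grids hold 400 points, but the first one stretches its cells along the long axis while the second keeps them close to square.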
For this example, we will use samples from a vegetation survey in a meadow in Yosemite National Park, California.
# We request a grid with approximately this number of points:
npts <- 500
base.plot <- ggplot(NULL, aes(x,y)) +
geom_point(data=meadow, shape='+') +
xlab('UTM X') +
ylab('UTM Y')
grids <- list(identical = build_grid_identical(meadow[ ,c('x','y')], npts),
squaretile = build_grid_squaretile(meadow[ ,c('x','y')], npts),
ahull_crop = build_grid_ahull_crop(meadow[ ,c('x','y')], npts),
ahull_fill = build_grid_ahull_fill(meadow[ ,c('x','y')], npts))
plot_grid <- function(grid_type) {
base.plot +
geom_point(data=grids[[grid_type]]) +
annotate('text', x = min(meadow$x), y = min(meadow$y),
label = paste(nrow(grids[[grid_type]]), "points"),
hjust = 0, vjust = 1)
}
plot_grid('identical')
plot_grid('squaretile')
plot_grid('ahull_crop')
plot_grid('ahull_fill')
Let’s produce a map of the soil moisture in the meadow:
# Build a high-resolution grid
hires_grid <- build_grid_ahull_fill(meadow[ ,c('x','y')], 1e4,
grid_opts = list(verbose = TRUE))
## Building grid in alphahull...
## run 1, error=-0.58% (4191 points)
## run 2, error=-0.33% (6668 points)
## run 3, error=-0.12% (8835 points)
## run 4, error=-0.02% (9836 points)
# Compute the local average moisture of the soil (to 40m)
meadow_roll <- rollply(meadow, ~ x + y, wdw.size = 40,
grid = hires_grid,
summarize, meanvwc.mean = mean(meanvwc, na.rm = TRUE))
# Plot the map
ggplot() +
geom_raster(aes(x, y, fill = meanvwc.mean), data = meadow_roll) +
scale_fill_distiller(palette = 'RdYlGn', type = "div") +
geom_point(aes(x, y), data = meadow, pch = "x", size = 1)
rollply inherits plyr’s pros and cons. As in the latter’s functions, parallelism is just one argument away (set .parallel = TRUE to use the foreach backend). However, rollply does a lot of data.frame subsetting and this is still an expensive operation in R. You can, however, add .progress = 'time' to get an estimate of how long your coffee break should be.
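Setting .parallel = TRUE only helps if a foreach backend has been registered first. The sketch below shows this with plyr's ddply on made-up data. The choice of doParallel as the backend and of two cores is an assumption, and the `toy` data and `group_mean` helper are hypothetical. The same registration step should apply before calling rollply with .parallel = TRUE.

```r
set.seed(1)
# Made-up data: 4 groups of 50 observations each
toy <- data.frame(g = rep(letters[1:4], each = 50), v = rnorm(200))

# Hypothetical summary function applied to each piece
group_mean <- function(piece) data.frame(v.mean = mean(piece$v))

if (requireNamespace("plyr", quietly = TRUE) &&
    requireNamespace("doParallel", quietly = TRUE)) {
  # Register a foreach backend (assumption: doParallel with 2 workers)
  doParallel::registerDoParallel(cores = 2)
  res <- plyr::ddply(toy, ~ g, group_mean, .parallel = TRUE)
  doParallel::stopImplicitCluster()
} else {
  message("plyr or doParallel is not installed; skipping the parallel run")
  res <- NULL
}
res
```

For small inputs such as this one, the overhead of starting workers usually outweighs the gain; parallelism pays off when each window is expensive to process.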
Rollply is my first package, has well-known bugs and is still in active development!
Development happens on GitHub. Do not hesitate to post issues or pull requests.