spind is a package dedicated to removing the spectre of spatial autocorrelation from your species distribution models (hereafter SDMs). It contains many of the tools you need to calculate probabilities of occurrence, assess model performance, and conduct multimodel inference for 2-D gridded data sets using methods that are robust to spatial autocorrelation.
The theory underlying the use of GEEs, WRMs, and many of the other tools in this package is covered elsewhere in the literature, and for the purposes of this vignette, we assume that you have already read those papers. We also assume that you have a working knowledge of R. This vignette therefore focuses on demonstrating how to use this package to create an SDM and assess its accuracy. Along the way, we will use a couple of different data sets to examine how these functions work and to investigate how one might use them to create a robust SDM.
Let’s start with a fairly simple GEE using the simulated musdata data set included in the package.
data(musdata)
data(carlinadata)
# Examine the structure to familiarize yourself with the data
?musdata
head(musdata)
?carlinadata
head(carlinadata)
# Next, fit a simple GEE and view the output
coords<-musdata[ ,4:5]
mgee<-GEE(musculus ~ pollution + exposure, family="poisson", data=musdata,
coord=coords, corstr="fixed", plot=TRUE, scale.fix=FALSE)
summary(mgee,printAutoCorPars=TRUE)
##
## Call:
## GEE(formula = musculus ~ pollution + exposure, family = "poisson",
## data = musdata, coord = coords, corstr = "fixed", plot = TRUE,
## scale.fix = FALSE)
## ---
## Coefficients:
## Estimate Std.Err z value Pr(>|z|)
## (Intercept) -1.90475 1.31091 -1.4530 0.1462252
## pollution 3.36216 0.91416 3.6779 0.0002352 ***
## exposure -1.46348 0.88010 -1.6629 0.0963410 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ---
## QIC: 1139.159
## ---
## Autocorrelation of GLM residuals
## [1] 0.685338504 0.509680590 0.363021118 0.247398654 0.144726020
## [6] 0.084220961 0.050228656 0.022369044 -0.001985639 -0.027296083
##
## Autocorrelation of GEE residuals
## [1] -0.001277974 -0.004261554 0.045280260 0.022738750 0.005821352
## [6] 0.004289166 0.008311357 0.003437398 0.001030847 -0.010359040
## ---
## Autocorrelation parameters from fixed model
## [1] "a=alpha^(d^v) , alpha=0.685 , v=1.093"
predictions<-predict(mgee,newdata=musdata)
As you can see, this package includes S3 methods for summary and predict. These are useful for evaluating model fit and for comparing the autocorrelation of residuals against a non-spatial model (in this case, a GLM with the same family as the GEE). Additionally, the plot argument in GEE can be used to visually inspect the autocorrelation of the residuals from each regression. Note that a QIC (quasi-information criterion) score is reported rather than an AIC score. It is calculated following the method described in Hardin & Hilbe (2003) and is implemented in the function qic.calc. Please see the references in the documentation of qic.calc for more details on how it is calculated.
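The last line of the summary reports the fitted correlation function a = alpha^(d^v). As a rough sketch, the base R snippet below evaluates that function with the rounded parameters printed above. It assumes that d is the distance between observations in grid-cell units, and the helper name fixed_corr is hypothetical rather than part of spind.

```r
# Correlation function reported by summary(mgee): a = alpha^(d^v).
# fixed_corr is an illustrative helper, not a spind function.
fixed_corr <- function(d, alpha, v) alpha^(d^v)

d <- 1:10                                    # assumed: distance in grid-cell units
a <- fixed_corr(d, alpha = 0.685, v = 1.093) # rounded values from the summary
round(a, 3)
```

The implied correlation decays steadily with distance, which you can compare against the GLM residual autocorrelation printed in the summary.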
Note that trying to fit GEEs with corstr="fixed" to large data sets will result in errors, as the resulting matrices will be too large for R to handle. This is where fitting clustered models comes in handy. These can be specified by changing corstr to either "quadratic" or "exchangeable". See Carl & Kuehn (2007) for more details on how these work.
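To get a feel for why corstr="fixed" does not scale, the sketch below estimates the memory needed for a single dense n x n matrix of doubles (8 bytes per entry). This is a back-of-the-envelope assumption about storage only: GEE may well allocate more than one such matrix while fitting, and corr_matrix_gib is a hypothetical helper.

```r
# Size in GiB of one dense n x n double-precision matrix.
# Illustrative helper, not part of spind.
corr_matrix_gib <- function(n) n^2 * 8 / 1024^3

n <- c(400, 1e4, 1e5)   # 400 is the number of observations in musdata
data.frame(n = n, GiB = signif(corr_matrix_gib(n), 3))
```

Because the cost grows with the square of the number of observations, a data set that is 10 times larger needs 100 times the memory.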
Next, we’ll examine the other main model introduced in this package: the Wavelet Revised Model (WRM). Let’s start with a fairly simple WRM using the same musdata data set as above.
mwrm<-WRM(musculus ~ pollution + exposure, "poisson", musdata,
coord=coords, level=1, plot=TRUE)
summary(mwrm)
##
## Call:
## WRM(formula = musculus ~ pollution + exposure, family = "poisson",
## data = musdata, coord = coords, level = 1, plot = TRUE)
##
## Pearson Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6140000 -0.3057000 0.0047510 -0.0003873 0.3039000 3.0620000
## ---
## Coefficients:
## Estimate Std.Err z value Pr(>|z|)
## (Intercept) -1.9360 1.9177 -1.0095 0.312717
## pollution 3.1841 1.2251 2.5991 0.009348 **
## exposure -1.2286 1.5063 -0.8156 0.414723
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ---
## Number of observations n: 400 , n.eff: 300 , AIC: 1110.845
##
## Number of iterations: 7
## ---
## Autocorrelation of glm.residuals
## [1] 0.685338504 0.509680590 0.363021118 0.247398654 0.144726020
## [6] 0.084220961 0.050228656 0.022369044 -0.001985639 -0.027296083
## Autocorrelation of wavelet.residuals
## [1] 0.024855393 -0.086311686 0.007820356 0.024501828 -0.016578686
## [6] 0.002798656 -0.002977017 -0.004611334 0.018150352 -0.008727321
predictions<-predict(mwrm, newdata=musdata)
Let’s try padding with mean values.
# Padding with mean values
padded.mwrm<-WRM(musculus ~ pollution + exposure, "poisson", musdata,
coord=coords, level=1, pad=list(padform=1), plot=TRUE)
summary(padded.mwrm)
##
## Call:
## WRM(formula = musculus ~ pollution + exposure, family = "poisson",
## data = musdata, coord = coords, level = 1, pad = list(padform = 1),
## plot = TRUE)
##
## Pearson Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.6140000 -0.3057000 0.0047510 -0.0003873 0.3039000 3.0620000
## ---
## Coefficients:
## Estimate Std.Err z value Pr(>|z|)
## (Intercept) -1.9360 1.9177 -1.0095 0.312717
## pollution 3.1841 1.2251 2.5991 0.009348 **
## exposure -1.2286 1.5063 -0.8156 0.414723
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ---
## Number of observations n: 400 , n.eff: 300 , AIC: 1110.845
##
## Number of iterations: 7
## ---
## Autocorrelation of glm.residuals
## [1] 0.685338504 0.509680590 0.363021118 0.247398654 0.144726020
## [6] 0.084220961 0.050228656 0.022369044 -0.001985639 -0.027296083
## Autocorrelation of wavelet.residuals
## [1] 0.024855393 -0.086311686 0.007820356 0.024501828 -0.016578686
## [6] 0.002798656 -0.002977017 -0.004611334 0.018150352 -0.008727321
padded.predictions<-predict(padded.mwrm, newdata=musdata)
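To illustrate what padding with mean values means in general, here is a minimal base R sketch that embeds a 20 x 20 grid in the next power-of-two square and fills the border with the grid mean. This is not spind's internal code. The power-of-two target size, the centred placement, and the helper name pad_grid are all assumptions made for illustration.

```r
# Embed a matrix in a square whose side is the next power of two,
# filling the border with a constant (the grid mean by default).
# Illustrative helper only; WRM's own padding may differ in detail.
pad_grid <- function(m, fill = mean(m)) {
  side <- 2^ceiling(log2(max(dim(m))))
  out  <- matrix(fill, side, side)
  r0   <- floor((side - nrow(m)) / 2)   # row offset that centres the data
  c0   <- floor((side - ncol(m)) / 2)   # column offset that centres the data
  out[r0 + seq_len(nrow(m)), c0 + seq_len(ncol(m))] <- m
  out
}

set.seed(1)
grid   <- matrix(rpois(400, 2), 20, 20)  # made-up counts on a 20 x 20 grid
padded <- pad_grid(grid)
dim(padded)
```

Filling with the mean keeps the overall mean of the padded grid equal to that of the original data, whereas zero padding would pull it down.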
WRM has many of the same features as GEE. Setting plot=TRUE allows you to examine the autocorrelation of residuals from a GLM of the same family as your WRM. S3 methods for predict and summary allow you to examine outputs from the model using the same code you might use for a GLM. Also note that WRM reports an AIC score, rather than the QIC score reported by GEE.
WRM has a number of other model-specific functions that you may find useful in diagnosing model fit and understanding your results. For example, you might want to plot the variance or covariance of each of your wavelets as a function of level. The covar.plot function allows you to visually examine the wavelet relationships from your model. For this example, we switch to the carlinadata data set.
coords<-carlinadata[ ,4:5]
covar.plot(carlina.horrida ~ aridity + land.use,
data=carlinadata,coord=coords,wavelet="d4",
wtrafo='modwt',plot='covar')
## $result
## [,1] [,2] [,3] [,4] [,5]
## carlina.horrida-(Intercept) NA NA NA NA NA
## carlina.horrida-aridity 0.0368 0.0450 0.0623 0.0780 0.0466
## carlina.horrida-land.use 0.4782 0.1191 0.0332 0.0126 0.0055
covar.plot(carlina.horrida ~ aridity + land.use,
carlinadata,coord=coords,wavelet="d4",
wtrafo='modwt',plot='var')
## $result
## [,1] [,2] [,3] [,4] [,5]
## carlina.horrida 0.7235 0.1792 0.0628 0.0242 0.0093
## (Intercept) NA NA NA NA NA
## aridity 0.0691 0.1025 0.2028 0.3588 0.2657
## land.use 0.7556 0.1851 0.0420 0.0119 0.0044
spind provides a couple of frameworks for conducting multi-model inference analyses and some helper functions that we hope will make your life easier when examining the results. The first that we’ll examine here is the step.spind function, which implements stepwise model selection. The process is loosely based on MASS::stepAIC and stats::step, but is specific to classes GEE and WRM. Currently, the function only supports backward model selection. In other words, you have to start with all of the variables in your model formula and remove them in a stepwise fashion. We hope to add forward model selection methods shortly. Additionally, step.spind is designed to always respect the hierarchy of variables in the model, and the user cannot currently override this. For example, step.spind would not remove age while retaining I(age^2). However, we are happy to update that feature if a lot of you find it frustrating. We hope you won’t :)
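To make the hierarchy rule concrete, here is a minimal base R sketch that lists which terms could be dropped without orphaning a higher-order term. It is not step.spind's internal logic: droppable_terms is a hypothetical helper, and the rule it applies (a term is protected if its name appears as a variable inside another term) is our simplifying assumption.

```r
# Return the term labels that can be removed without leaving behind
# a higher-order term built on the same variable.
# Illustrative helper only, not part of spind.
droppable_terms <- function(f) {
  labels  <- attr(terms(f), "term.labels")
  # Variables used inside each term, e.g. "I(age^2)" -> "age"
  vars_in <- lapply(labels, function(l) all.vars(parse(text = l)))
  keep <- vapply(seq_along(labels), function(i) {
    others <- unlist(vars_in[-i])
    !(labels[i] %in% others)   # protected if another term uses this variable
  }, logical(1))
  labels[keep]
}

droppable_terms(low ~ age + I(age^2) + lwt)
```

Here age is protected because I(age^2) depends on it, so only I(age^2) and lwt are candidates for removal at this step.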
We’ll go through an example of step.spind using a GEE on the birthwt data set from the MASS package below.
library(MASS)
data(birthwt)
x<-rep(1:14,14)
y<-as.integer(gl(14,14))
coords<-cbind(x[-(190:196)],y[-(190:196)])
formula<-formula(low ~ age + lwt + race + smoke + ftv + bwt)
mgee<-GEE(formula, family = "gaussian", data = birthwt,
coord=coords, corstr="fixed",scale.fix=TRUE)
ss<-step.spind(mgee,birthwt)
## Iteration: 1
## Single term deletions
## Deleted Term: age
## --------------------
## Deleted.Vars QIC Quasi.Lik
## 1 <none> 112.1329 -52.69176
## 2 age 111.6567 -52.67607
## 3 lwt 112.0104 -52.75028
## 4 race 112.0894 -52.79396
## 5 smoke 111.8553 -52.69791
## 6 ftv 111.8808 -52.71354
## 7 bwt 297.7540 -121.88867
##
## Iteration: 2
## Single term deletions
## Deleted Term: smoke
## --------------------
## Deleted.Vars QIC Quasi.Lik
## 1 <none> 111.6567 -52.67607
## 2 lwt 111.5298 -52.72915
## 3 race 111.6271 -52.77580
## 4 smoke 111.3877 -52.67890
## 5 ftv 111.4679 -52.73120
## 6 bwt 297.7171 -121.83567
##
## Iteration: 3
## Single term deletions
## Deleted Term: ftv
## --------------------
## Deleted.Vars QIC Quasi.Lik
## 1 <none> 111.3877 -52.67890
## 2 lwt 111.2494 -52.72805
## 3 race 111.3122 -52.76311
## 4 ftv 111.2088 -52.73630
## 5 bwt 298.6147 -123.33017
##
## Iteration: 4
## Single term deletions
## Deleted Term: lwt
## --------------------
## Deleted.Vars QIC Quasi.Lik
## 1 <none> 111.2088 -52.73630
## 2 lwt 111.0717 -52.78793
## 3 race 111.1415 -52.82335
## 4 bwt 298.6351 -123.35038
##
## Iteration: 5
## Single term deletions
## Deleted Term: race
## --------------------
## Deleted.Vars QIC Quasi.Lik
## 1 <none> 111.0717 -52.78793
## 2 race 110.9656 -52.86072
## 3 bwt 295.8817 -123.12477
##
## Iteration: 6
## Single term deletions
## Deleted Term: <none>
## --------------------
## Deleted.Vars QIC Quasi.Lik
## 1 <none> 110.9656 -52.86072
## 2 bwt 296.2879 -123.98743
##
##
## ---------------
## Best model found:
## low ~ bwt
best.mgee<-GEE(ss$model, family = "gaussian", data = birthwt,
coord=coords, corstr="fixed",scale.fix=TRUE)
summary(best.mgee,printAutoCorPars=FALSE)
##
## Call:
## GEE(formula = ss$model, family = "gaussian", data = birthwt,
## coord = coords, corstr = "fixed", scale.fix = TRUE)
## ---
## Coefficients:
## Estimate Std.Err t value Pr(>|t|)
## (Intercept) 1.2492e+00 4.9121e-01 2.5430 0.01099 *
## bwt -3.0919e-04 6.5913e-05 -4.6909 2.72e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ---
## QIC: 110.9656
## ---
## Autocorrelation of GLM residuals
## [1] 0.837748633 0.724407532 0.602588671 0.500754270 0.387294592
## [6] 0.275433941 0.147728669 0.008716423 -0.130798183 -0.268641655
##
## Autocorrelation of GEE residuals
## [1] 0.43453709 0.35186795 0.27457621 0.21231229 0.10255028
## [6] 0.08028419 0.07174312 0.04070057 0.02919975 -0.06904364
Additionally, we offer a couple of model selection procedures specific to WRMs. The first is implemented in mmiWMRR. This performs a series of scale-specific Wavelet Multi-Resolution Regressions (which are implemented using scaleWMRR). It allows you to examine the effect that the scale parameter has on the results of your regressions, and then select the appropriate level for subsequent analyses. If you are not already familiar with the scale parameter, please see Carl et al. (2016) (citation in the scaleWMRR documentation) before using this function.
data(carlinadata)
coords<- carlinadata[,4:5]
mmi<- mmiWMRR(carlina.horrida ~ aridity + land.use,"poisson",
carlinadata,coords,scale=3,detail=TRUE,wavelet="d4")
## ---
## Level = 3
## (Int) aridity land.use df logLik AIC delta weight
## 4 2.93438 0.49042 -3.50797 3 -1178.072 2362.1 0.00 1
## 3 3.11997 -3.40857 2 -1198.412 2400.8 38.68 0
## 2 -0.55307 0.46073 2 -1217.184 2438.4 76.22 0
## 1 -0.28444 1 -1238.104 2478.2 116.06 0
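The AIC, delta, and weight columns can be reproduced by hand, which may help when reading these tables. The sketch below starts from the logLik and df values printed above and uses the standard definitions AIC = -2 logLik + 2 df and Akaike weight w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2). We assume, but have not confirmed in the source, that mmiWMRR uses these same definitions.

```r
# logLik and df copied from the Level = 3 table above
tab <- data.frame(
  model  = c(4, 3, 2, 1),
  df     = c(3, 2, 2, 1),
  logLik = c(-1178.072, -1198.412, -1217.184, -1238.104)
)

tab$AIC   <- -2 * tab$logLik + 2 * tab$df      # standard AIC
delta     <- tab$AIC - min(tab$AIC)            # difference from the best model
w         <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights
tab$delta  <- round(delta, 2)
tab$weight <- round(w, 3)
tab
```

With a delta of nearly 39 between the two best models, essentially all of the weight falls on the full model, which is why the weight column shows 1 and 0.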
You can then take this a step further and visualize these results as a function of scale using rvi.plot. rvi.plot uses mmiWMRR and repeats the analysis for each level of scale, then plots the relative importance of each explanatory variable as a function of scale. It will also print the resulting model selection tables to the console.
rvi.plot(carlina.horrida ~ aridity + land.use,"poisson",
carlinadata,coords,maxlevel=4,detail=TRUE,wavelet="d4")
##
## Model selection tables:
##
## ---
## Level = 1
## (Int) aridity land.use df logLik AIC delta weight
## 3 3.63243 -3.78957 2 -1185.487 2375.0 0.00 0.995
## 4 1.94780 2.11872 -3.42550 3 -1189.834 2385.7 10.69 0.005
## 2 -0.82469 1.13710 2 -1200.512 2405.0 30.05 0.000
## 1 -0.13692 1 -1221.456 2444.9 69.94 0.000
## ---
## Level = 2
## (Int) aridity land.use df logLik AIC delta weight
## 4 2.23516 0.59909 -2.93096 3 -1184.169 2374.3 0.00 1
## 2 -0.73426 0.75117 2 -1209.233 2422.5 48.13 0
## 3 2.72922 -3.16193 2 -1228.572 2461.1 86.81 0
## 1 -0.40235 1 -1262.854 2527.7 153.37 0
## ---
## Level = 3
## (Int) aridity land.use df logLik AIC delta weight
## 4 2.93438 0.49042 -3.50797 3 -1178.072 2362.1 0.00 1
## 3 3.11997 -3.40857 2 -1198.412 2400.8 38.68 0
## 2 -0.55307 0.46073 2 -1217.184 2438.4 76.22 0
## 1 -0.28444 1 -1238.104 2478.2 116.06 0
## ---
## Level = 4
## (Int) aridity land.use df logLik AIC delta weight
## 2 -1.36217 1.87641 2 -1205.696 2415.4 0.00 1
## 1 -0.02245 1 -1220.984 2444.0 28.58 0
## 3 8.29497 -8.35099 2 -1272.480 2549.0 133.57 0
## 4 7.65184 1.81274 -9.01193 3 -1292.419 2590.8 175.45 0
##
## ---
## Relative variable importance:
##
## level=1 level=2 level=3 level=4
## aridity 0.005 1 1 1
## land.use 1.000 1 1 0
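Relative variable importance is conventionally the sum of the Akaike weights of all models that contain a given variable. The sketch below applies that definition to the Level = 1 table above. The assignment of coefficients to models (model 3 contains only land.use, model 2 only aridity) is our reading of the printed table, so treat it as an assumption.

```r
# Akaike weights from the Level = 1 table, in printed order (models 3, 4, 2, 1)
weights <- c(0.995, 0.005, 0, 0)

# 1 if the model contains the variable, 0 otherwise (assumed from the table)
contains <- rbind(
  aridity  = c(0, 1, 1, 0),
  land.use = c(1, 1, 0, 0)
)

# Sum the weights of the models that include each variable
rvi <- drop(contains %*% weights)
rvi
```

This reproduces the level=1 column of the relative variable importance table: 0.005 for aridity and 1.000 for land.use.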
Once you have your model, whether it be a GEE, WRM, or some other spatial model, you will probably want to look at some other goodness-of-fit statistics. In this package, these are categorized according to whether or not their values depend on the chosen threshold. th.dep and th.indep are designed to work with any number of model types; all you need is a set of actual values, predictions, and their associated coordinates. We’ll use the hook data set to see how these work.
data(hook)
# Familiarize yourself with the data
?hook
head(hook)
df<-hook[,1:2]
coords<-hook[,3:4]
# Threshold dependent metrics
th.dep.indices<-th.dep(data=df,coord=coords,spatial=TRUE)
# Confusion Matrix
th.dep.indices$cm
#> [,1] [,2] [,3] [,4]
#> [1,] 5 2 0 0
#> [2,] 3 1 1 3
#> [3,] 2 0 0 8
#> [4,] 2 3 0 70
# Kappa statistic
th.dep.indices$kappa
#> [1] 0.628529
# Threshold independent metrics
th.indep.indices<-th.indep(data=df,coord=coords,spatial=TRUE,plot.ROC=TRUE)
# AUC
th.indep.indices$AUC
#> [1] 0.9424119
# TSS
th.indep.indices$TSS
#> [1] 0.7425474
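For reference, the conventional non-spatial versions of these indices can be computed in a few lines of base R. The sketch below uses a small made-up vector of observations and predictions (not the hook data) and a 0.5 threshold. It does not apply the spatial correction that th.dep and th.indep perform with spatial=TRUE, so its numbers are not comparable to the output above.

```r
# Hypothetical toy data: 1 = presence, 0 = absence
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred   <- c(0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1)

# Threshold-dependent indices: classify at 0.5 and count the four outcomes
cls <- as.integer(pred >= 0.5)
TP <- sum(cls == 1 & actual == 1)
FN <- sum(cls == 0 & actual == 1)
FP <- sum(cls == 1 & actual == 0)
TN <- sum(cls == 0 & actual == 0)
n  <- length(actual)

sens <- TP / (TP + FN)          # sensitivity
spec <- TN / (TN + FP)          # specificity
tss  <- sens + spec - 1         # true skill statistic at this threshold

# Cohen's kappa: observed agreement corrected for chance agreement
po <- (TP + TN) / n
pe <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2
kappa_hat <- (po - pe) / (1 - pe)

# Threshold-independent index: AUC via the Mann-Whitney rank formula
r   <- rank(pred)               # ties receive average ranks
n1  <- sum(actual == 1)
n0  <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(kappa = kappa_hat, TSS = tss, AUC = auc)
```

Kappa and the confusion counts change whenever the threshold changes, whereas AUC depends only on how the predictions are ranked.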