The goal of MarineSPEED is to provide a benchmark data set for presence-only species distribution modeling (SDM) in order to facilitate reproducible and comparable SDM research. It contains species occurrences (coordinates) from a wide diversity of marine species and associated environmental data from Bio-ORACLE and MARSPEC. Some additional information about MarineSPEED can be found in the R Shiny viewer at http://marinespeed.org.
Three functions help with exploring
library(marinespeed)
# set a data directory, preferably something different from tempdir to avoid
# unnecessary downloads for every R session
options(marinespeed_datadir = tempdir())
# list all species
species <- list_species()
The first 5 species and their aphia_id (WoRMS species id) are:
species | aphia_id |
---|---|
Laternula elliptica | 197217 |
Pseudosagitta gazellae | 266258 |
Parasagitta elegans | 105440 |
Parasagitta setosa | 105443 |
Branchiostoma lanceolatum | 104906 |
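As a minimal sketch of working with this table, the snippet below looks up the aphia_id for a species name. It assumes that list_species() returns a data frame with the columns species and aphia_id, as the table suggests. To keep the example runnable without the package or a download, the five rows are rebuilt by hand from the table above, and get_aphia_id is a hypothetical helper, not a marinespeed function.

```r
# Rebuild the five rows shown above by hand (values copied from the table),
# so this sketch runs without the marinespeed package or a download.
first_species <- data.frame(
  species = c("Laternula elliptica", "Pseudosagitta gazellae",
              "Parasagitta elegans", "Parasagitta setosa",
              "Branchiostoma lanceolatum"),
  aphia_id = c(197217, 266258, 105440, 105443, 104906),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of marinespeed): return the aphia_id
# of the row whose species name matches `name`.
get_aphia_id <- function(species_df, name) {
  species_df$aphia_id[species_df$species == name]
}

get_aphia_id(first_species, "Parasagitta elegans")
## [1] 105440
```

If the assumption about the column names holds, the same helper should work on the full result of list_species().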
The species information consists of species identifiers, taxonomic information from the World Register of Marine Species (WoRMS), a visual assessment score for the amount of sampling bias and the covered latitudinal zones.
# all species information
info <- species_info()
colnames(info)
## [1] "species" "aphia_id" "kingdom" "phylum"
## [5] "class" "order" "family" "genus"
## [9] "sampling_bias" "eco_polar" "eco_temperate" "eco_tropical"
## [13] "eco_open_ocean"
To loop over the occurrence data of all species, call the lapply_species function. For instance, to count the total number of records in MarineSPEED you would use the following code. The function passed to lapply_species expects two parameters: one for the species name and one for the actual occurrences.
get_occ_count <- function(speciesname, occ) {
  nrow(occ)
}
record_counts <- lapply_species(get_occ_count)
sum(unlist(record_counts))
## [1] 868151
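The same two-parameter callback pattern can return richer results than a count. The sketch below computes the latitudinal extent of a species. It assumes the occurrence data frame has a numeric latitude column (the plotting code in this document uses longitude and latitude). The toy occurrences and the name "Toy species" are made up purely to exercise the callback without a download.

```r
# Callback with the (speciesname, occ) signature expected by lapply_species.
# Assumes occ is a data frame with a numeric "latitude" column.
get_lat_range <- function(speciesname, occ) {
  data.frame(species = speciesname,
             lat_min = min(occ$latitude),
             lat_max = max(occ$latitude),
             stringsAsFactors = FALSE)
}

# Made-up occurrences for a placeholder species (not MarineSPEED data),
# used only to check the callback locally.
toy_occ <- data.frame(longitude = c(2.5, 3.1, -4.0),
                      latitude = c(51.2, 53.7, 48.4))
toy_range <- get_lat_range("Toy species", toy_occ)
toy_range

# With the package loaded and the data available, the call would look like:
# lat_ranges <- lapply_species(get_lat_range)
# do.call(rbind, lat_ranges)  # one row per species
```

Returning a one-row data frame per species makes it straightforward to bind the list returned by lapply_species into a single table.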
To enable the use of the same k-fold cross-validation datasets, I split the species occurrence data upfront into 5 folds (or 4 and 9 for grid) in 3 different ways:
The code below plots the training (blue) and test (red) occurrences for the first two disc folds of the first two species.
## plot first 2 disc folds for the first 2 species (blue=training, red=test)
plot_occurrences <- function(speciesname, data, fold) {
  title <- paste0(speciesname, " (fold = ", fold, ")")
  plot(data$occurrence_train[,c("longitude", "latitude")], pch=20, col="blue",
       main = title)
  points(data$occurrence_test[,c("longitude", "latitude")], pch=20, col="red")
}
lapply_kfold_species(plot_occurrences, species=species[1:2,],
                     fold_type = "disc", k = 1:2)
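A callback for lapply_kfold_species does not have to plot. As a hedged sketch, the function below counts training and test records per fold, relying only on the data$occurrence_train and data$occurrence_test elements used in the plotting code above. The toy_fold list is a made-up stand-in for one fold of one species, so the snippet runs without the package.

```r
# Callback with the (speciesname, data, fold) signature used above.
# Assumes data$occurrence_train and data$occurrence_test are data frames.
count_fold_records <- function(speciesname, data, fold) {
  c(fold = fold,
    n_train = nrow(data$occurrence_train),
    n_test = nrow(data$occurrence_test))
}

# Made-up stand-in for one fold of one species (not MarineSPEED data).
toy_fold <- list(
  occurrence_train = data.frame(longitude = c(1, 2, 3, 4),
                                latitude = c(50, 51, 52, 53)),
  occurrence_test = data.frame(longitude = 5, latitude = 54)
)
fold_counts <- count_fold_records("Toy species", toy_fold, fold = 1)
fold_counts

# With the package loaded, the same arguments as in the plotting example
# could be reused:
# counts <- lapply_kfold_species(count_fold_records, species = species[1:2,],
#                                fold_type = "disc", k = 1:2)
```

Checking these counts before fitting models is a quick way to spot folds with very few test records.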