gatoRs (Geographic and Taxonomic Occurrence R-Based Scrubbing) provides users with tools to streamline the downloading and processing of biodiversity data.
Data Downloading
Identifying Synonyms
Historically, many names may have been used to refer to your taxa of interest. For example, specimens representing Galax urceolata (Diapensiaceae) can be found under the scientific name Galax aphylla, despite the latter being invalidated over 50 years ago (see more here). Because synonyms are common, we designed gatoRs to retrieve biodiversity records based on a list of names; however, the user must supply the synonym list.
There are many databases available to compile synonym lists for plant species including:
- World Flora Online - WFO Plant List
- TROPICOS
- World Checklist of Vascular Plants
- USDA PLANTS Database
- International Plant Names Index
- World Plants
Many R packages have been developed to access these databases including:
Download with gatoRs
With gators_download() you can obtain biodiversity records for your species of interest from both GBIF and iDigBio. This function is innovative in how it searches iDigBio. Unlike spocc::occ(), we do not query the iDigBio API using the scientific name field, as this will only return exact matches. Instead, we designed a “pseudo-fuzzy match” to search all fields for partial matches to the supplied scientific names. Additionally, the columns returned have been handpicked to aid in processing records for investigations of species distributions (see gators_download() for details).
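The difference between an exact match on one name field and a partial match across all fields can be sketched in base R. This is only an illustration on a made-up data frame, not the query gatoRs sends to iDigBio; the column names, the records, and the helper matches_any_field() are all hypothetical.

```r
# Toy records; column names and values are invented for illustration.
records <- data.frame(
  scientificName = c("Galax urceolata (Poir.) Brummitt", "galax aphylla",
                     "Shortia galacifolia", NA),
  remarks = c(NA, NA, "near Galax urceolata population",
              "det. as Galax aphylla L."),
  stringsAsFactors = FALSE
)
synonyms <- c("Galax urceolata", "Galax aphylla")

# Exact matching on a single name field misses authorship and case variants.
exact_hits <- records$scientificName %in% synonyms

# Hypothetical helper: case-insensitive partial match in any column.
matches_any_field <- function(df, names_list) {
  hit <- rep(FALSE, nrow(df))
  for (col in names(df)) {
    values <- tolower(as.character(df[[col]]))
    for (nm in names_list) {
      # grepl() returns FALSE for NA values, so missing fields never match.
      hit <- hit | grepl(tolower(nm), values, fixed = TRUE)
    }
  }
  hit
}
partial_hits <- matches_any_field(records, synonyms)
```

In this toy example the exact match finds nothing, while the partial match finds every row, including the Shortia record that only mentions Galax in its remarks. Hits of that kind are why a taxonomic filtering step follows the search.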
After you identify synonyms, create a list of all possible names for your species of interest, with the accepted name first (e.g., c("Galax urceolata", "Galax aphylla")). Note that the first name in your list will be used to identify the GBIF species code when gbif.match = "code".
Example:
library(gatoRs)
galaxdf <- gators_download(synonyms.list = c("Galax urceolata", "Galax aphylla"),
write.file = TRUE,
filename = "base_folder/my_file.csv", # Location to save file - must end in .csv
gbif.match = "fuzzy",
idigbio.filter = TRUE)
Optional parameters include gbif.match and idigbio.filter. gbif.match allows you to search by fuzzy matching records to the scientific name (default, gbif.match = "fuzzy") or to search for the associated species key using GBIF’s backbone taxonomy system (gbif.match = "code").
If set to TRUE (default, recommended), idigbio.filter fuzzy matches taxonomic columns to the provided taxon names. This filters the data set for relevant data.
The function also generates scientificName, genus, specificEpithet, and infraspecificEpithet columns based on available data and parsing and then fixes incorrect capitalization of species names to give the data a cleaner look.
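As a rough picture of what name parsing and capitalization repair involve, here is a minimal base R sketch. It assumes simple two-word binomials with no authorship or infraspecific rank, and the helper fix_name_case() is hypothetical; the parsing inside gators_download() may differ.

```r
# Hypothetical helper: lower-case every word, then capitalize the genus.
fix_name_case <- function(x) {
  parts <- strsplit(trimws(x), "\\s+")
  vapply(parts, function(p) {
    if (length(p) == 0 || all(is.na(p))) return(NA_character_)
    p <- tolower(p)
    p[1] <- paste0(toupper(substring(p[1], 1, 1)), substring(p[1], 2))
    paste(p, collapse = " ")
  }, character(1))
}

raw_names <- c("GALAX URCEOLATA", "galax aphylla", "Galax Urceolata")
clean_names <- fix_name_case(raw_names)

# Split the cleaned binomials into genus and specific epithet columns.
name_parts <- strsplit(clean_names, " ")
parsed <- data.frame(
  scientificName  = clean_names,
  genus           = vapply(name_parts, `[`, character(1), 1),
  specificEpithet = vapply(name_parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
```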
Data Processing
We downloaded 6885 observations for Galax urceolata in the example above. Of these observations, only those with locality information will be helpful when investigating this species' distribution.
Identify Records Missing Locality Information
Locality information can be redacted or skewed to protect threatened taxa. Often, locality information will be provided upon request or can be identified through georeferencing. We created functions to aid in this process.
Redacted Records
Locality information can be redacted or skewed to protect threatened taxa; often, locality information will be provided upon request to aid research.
To find records that must be requested manually from an institution via a permit (or removed from the data set), use needed_records(). After receiving the data from herbaria, manually merge the obtained records with your original data set.
Example:
redacted_info <- needed_records(galaxdf)
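Conceptually, this step amounts to selecting the rows that carry a withholding note. The base R sketch below assumes redaction is recorded in a Darwin Core style informationWithheld column; the data frame is invented, and needed_records() may apply different criteria.

```r
# Toy records; IDs and notes are invented for illustration.
records <- data.frame(
  ID = c("a1", "a2", "a3"),
  informationWithheld = c(NA, "Coordinates withheld: sensitive taxon", ""),
  stringsAsFactors = FALSE
)

# A record needs a data request when the note is neither NA nor blank.
has_note <- !is.na(records$informationWithheld) &
  nzchar(trimws(records$informationWithheld))
to_request <- records[has_note, ]
```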
Records to Georeference
Some records may be missing latitude and longitude values; however, locality information can be used to assign coordinates to these records through georeferencing.
To find data lacking coordinates but containing locality information, use need_to_georeference(). You should georeference these records and then manually merge the obtained records with your original data set.
Example:
to_georeference <- need_to_georeference(galaxdf)
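The selection rule and the manual merge can be sketched in base R as follows. The column names, localities, and the coordinates assigned after "georeferencing" are all made up for illustration, and need_to_georeference() may use different columns.

```r
# Toy records; all values are invented for illustration.
records <- data.frame(
  ID        = c("r1", "r2", "r3", "r4"),
  latitude  = c(36.10, NA, NA, 35.20),
  longitude = c(-81.70, NA, NA, -82.90),
  locality  = c("Grandfather Mtn", "2 mi N of Boone, NC", NA, "Pisgah"),
  stringsAsFactors = FALSE
)

# Candidates: coordinates missing, but a locality description is present.
missing_coords <- is.na(records$latitude) | is.na(records$longitude)
has_locality <- !is.na(records$locality) & nzchar(records$locality)
to_georef <- records[missing_coords & has_locality, ]

# After georeferencing by hand, fill in coordinates (placeholder values).
to_georef$latitude <- 36.25
to_georef$longitude <- -81.67

# Merge: drop the old versions of those rows, then append the updated ones.
merged <- rbind(records[!(records$ID %in% to_georef$ID), ], to_georef)
```

Record r3 has neither coordinates nor a locality, so it is not a georeferencing candidate and stays in the merged data with missing coordinates until the locality cleaning step.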
Occurrence Data Cleaning
Here we walk through each cleaning function; however, we also created a simple one-step option, full_clean().
Example:
galaxdf <- full_clean(df = galaxdf, synonyms.list = c("Galax urceolata", "Galax aphylla"),
taxa.filter = "fuzzy",
accepted.name = "Galax urceolata", remove.zero = TRUE,
precision = TRUE, digits = 2, remove.skewed = TRUE,
basis.list = c("FossilSpecimen", "FOSSIL_SPECIMEN"), cluster = TRUE)
Remove Duplicate Records
Records in GBIF and iDigBio can overlap, so duplicate records may be downloaded. To find and remove these, use remove_duplicates().
Example:
galaxdf <- remove_duplicates(galaxdf)
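The general idea of deduplicating across aggregators can be shown with base R's duplicated(). The key columns chosen here are an assumption made for this sketch; remove_duplicates() may compare different fields.

```r
# Toy records: the same specimen served by both aggregators, plus one other.
records <- data.frame(
  aggregator   = c("GBIF", "iDigBio", "GBIF"),
  occurrenceID = c("urn:x:1", "urn:x:1", "urn:x:2"),
  latitude     = c(36.10, 36.10, 35.20),
  longitude    = c(-81.70, -81.70, -82.90),
  eventDate    = c("1998-05-14", "1998-05-14", "2003-06-02"),
  stringsAsFactors = FALSE
)

# Assumed identity key: rows that agree on all of these are duplicates.
key_cols <- c("occurrenceID", "latitude", "longitude", "eventDate")

# duplicated() flags the second and later copies; the first one is kept.
deduped <- records[!duplicated(records[, key_cols]), ]
```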
Resolve Taxon Names
To find data containing scientific names corresponding to your desired species, use taxa_clean(). Use your downloaded data from the first step as input, as well as a synonyms list, the accepted name, and the filter option (exact, fuzzy, or interactive).
Example:
galaxdf <- taxa_clean(df = galaxdf,
synonyms.list = c("Galax urceolata", "Galax aphylla"),
taxa.filter = "fuzzy",
accepted.name = "Galax urceolata") # creates a new column with accepted name for easy comparison
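To see how exact and fuzzy filtering differ, the sketch below uses base R's agrepl() for approximate matching with an edit distance of at most 2. Both the choice of agrepl() and that threshold are assumptions made for illustration, not a description of how taxa_clean() matches names; the records are invented.

```r
# Toy records; the second name contains a spelling error.
records <- data.frame(
  scientificName = c("Galax urceolata", "Galax urceolatta",
                     "Shortia galacifolia"),
  stringsAsFactors = FALSE
)
synonyms <- c("Galax urceolata", "Galax aphylla")
accepted <- synonyms[1]

# Exact filter: the name must equal one of the supplied names.
exact_match <- records$scientificName %in% synonyms

# Fuzzy filter: the name may differ from a supplied name by up to 2 edits.
fuzzy_match <- rep(FALSE, nrow(records))
for (nm in synonyms) {
  fuzzy_match <- fuzzy_match |
    agrepl(nm, records$scientificName, max.distance = 2, ignore.case = TRUE)
}

# Keep the fuzzy matches and record the accepted name in a new column.
cleaned <- records[fuzzy_match, , drop = FALSE]
cleaned$accepted_name <- accepted
```

The misspelled record is dropped by the exact filter but retained by the fuzzy one, which is the trade-off to weigh when choosing between the two options.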
Remove Particular Record Bases
Sometimes you may want to remove records with a particular basis of record from the data set. To do this, we provide basis_clean(). This function can be used interactively (interactive = TRUE) to view and inspect the types of basis of record associated with the records. Otherwise, this function can automate the process of removing particular record bases.
Example:
galaxdf <- basis_clean(galaxdf, basis.list = c("FossilSpecimen", "FOSSIL_SPECIMEN"))
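The non-interactive path boils down to tabulating the basis values and dropping the unwanted ones. A minimal base R sketch, assuming a basisOfRecord column and invented records:

```r
# Toy records; GBIF and iDigBio spell basis values differently.
records <- data.frame(
  ID = c("b1", "b2", "b3", "b4"),
  basisOfRecord = c("PRESERVED_SPECIMEN", "FossilSpecimen",
                    "HUMAN_OBSERVATION", "FOSSIL_SPECIMEN"),
  stringsAsFactors = FALSE
)

# Inspect which bases are present before deciding what to drop.
basis_counts <- table(records$basisOfRecord)

# Drop both spellings of the fossil basis.
basis_to_drop <- c("FossilSpecimen", "FOSSIL_SPECIMEN")
kept <- records[!(records$basisOfRecord %in% basis_to_drop), ]
```

Listing both spellings matters because the two aggregators do not format this field the same way, as the basis.list in the example above reflects.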
Clean Locality
Basic Locality Clean
Here we remove any records with missing coordinates, impossible coordinates, coordinates at (0,0), and any that are flagged as skewed. Skewed records are identified with the remove_skewed() function, based on each row's value in the ‘informationWithheld’ column. We also provide the option to round the provided latitude and longitude values to a specified number of decimal places.
galaxdf <- basic_locality_clean(df = galaxdf,
remove.zero = TRUE, # Records at (0,0) are removed
precision = TRUE, # latitude and longitude are rounded
digits = 2, # round to 2 decimal places
remove.skewed = TRUE)
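The coordinate checks described above can be written out in base R to make each rule explicit. This is a sketch on invented coordinates with assumed column names; it leaves out the skewed-record check and is not the code inside basic_locality_clean().

```r
# Toy coordinates covering each failure case.
records <- data.frame(
  latitude  = c(36.12345, NA, 0, 95.0, 35.98765),
  longitude = c(-81.56789, -80.1, 0, -82.0, -82.43210)
)
lat <- records$latitude
lon <- records$longitude

# Rule 1: both coordinates must be present.
has_coords <- !is.na(lat) & !is.na(lon)
# Rule 2: coordinates must be possible on Earth.
possible <- has_coords & abs(lat) <= 90 & abs(lon) <= 180
# Rule 3: drop points at exactly (0,0), a common placeholder.
not_zero <- !(lat == 0 & lon == 0)

cleaned <- records[possible & not_zero, ]

# Optional: round to 2 decimal places.
cleaned$latitude <- round(cleaned$latitude, 2)
cleaned$longitude <- round(cleaned$longitude, 2)
```

Rounding to two decimal places gives a resolution of roughly one kilometer in latitude, so the digits value should be chosen with the scale of the downstream analysis in mind.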
Find and Remove Flagged Points
To find records that may have problematic coordinates, use process_flagged(). This function can either automate the process of finding and removing problematic points (interactive = FALSE) or allow for manual inspection. The latter will let you manually remove points deemed improper by viewing the points on a graph.
Example:
galaxdf <- process_flagged(galaxdf, interactive = TRUE)