Large and heterogeneous datasets may contain thousands of records missing spatial or taxonomic information (partially or entirely), as well as several records outside a region of interest or from doubtful sources. Such lower-quality data are not fit for use in many research applications without prior amendments. The ‘Pre-filter’ module contains a series of tests to detect, remove, and, whenever possible, correct such erroneous or suspect records.
Check here how to install the bdc package.
Read the merged database created in the module Standardization and integration of different datasets of the bdc package. It is also possible to read any datasets containing the required fields to run the functions (more details here).
database <-
  readr::read_csv(here::here("Output/Intermediate/00_merged_database.csv"))
⚠️IMPORTANT:
The results of the VALIDATION tests used to flag data quality are appended in separate fields in this database and retrieved as TRUE (✅ ok) or FALSE (❌ check carefully).
VALIDATION. Flag records missing species names.
check_pf <-
  bdc_scientificName_empty(
    data = database,
    sci_name = "scientificName")
#> bdc_scientificName_empty:
#> Flagged 5 records.
#> One column was added to the database.
VALIDATION. Flag records missing partial or complete information on geographic coordinates.
check_pf <- bdc_coordinates_empty(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude")
#> bdc_coordinates_empty:
#> Flagged 44 records.
#> One column was added to the database.
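The idea behind this test can be illustrated with a minimal base R sketch. It is not the package's implementation; the toy data and the column name coords_present are hypothetical and used only for illustration.

```r
# Toy data; the coordinate column names follow the ones used above.
toy <- data.frame(
  decimalLatitude  = c(-10.5, NA, 12.3, NA),
  decimalLongitude = c(-48.2, -50.1, NA, NA)
)

# TRUE = both coordinates present; FALSE = latitude and/or longitude missing.
# "coords_present" is a hypothetical column name for this sketch.
toy$coords_present <- !is.na(toy$decimalLatitude) & !is.na(toy$decimalLongitude)

# Number of records that would be flagged for checking
n_flagged <- sum(!toy$coords_present)
n_flagged
```

Records with only one coordinate missing are flagged in the same way as records missing both, which matches the "partial or complete" wording of the test description.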
VALIDATION. This test flags records with out-of-range coordinates: latitude > 90 or < -90; longitude > 180 or < -180.
check_pf <- bdc_coordinates_outOfRange(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude")
#> bdc_coordinates_outOfRange:
#> Flagged 1 records.
#> One column was added to the database.
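The range rule above can be written as a short base R sketch. This is an illustration under stated assumptions, not the package's code: the helper in_range() and the column coords_in_range are hypothetical, and leaving records with missing coordinates unflagged is a choice made here because a separate test already covers them.

```r
# Toy data with one valid record, three out-of-range records,
# and one record with a missing latitude.
toy <- data.frame(
  decimalLatitude  = c(-10.5, 95, -91, 45, NA),
  decimalLongitude = c(-48.2, 10, 20, 181, 30)
)

# Hypothetical helper: TRUE when |lat| <= 90 and |lon| <= 180.
in_range <- function(lat, lon) {
  ok <- abs(lat) <= 90 & abs(lon) <= 180
  # Assumption of this sketch: comparisons that cannot be decided because of
  # a missing value are left unflagged (TRUE).
  ok[is.na(ok)] <- TRUE
  ok
}

toy$coords_in_range <- in_range(toy$decimalLatitude, toy$decimalLongitude)
toy$coords_in_range
```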
VALIDATION. This test flags records from doubtful sources, for example, records from drawings, photographs, multimedia objects, or fossil records, among others.
# Check record sources of your dataset using:
# check_pf %>%
# dplyr::group_by(basisOfRecord) %>%
# dplyr::summarise(n = dplyr::n())
check_pf <- bdc_basisOfRecords_notStandard(
  data = check_pf,
  basisOfRecord = "basisOfRecord",
  names_to_keep = "all")
#> bdc_basisOfRecords_notStandard:
#> Flagged 0 of the following specific nature:
#> character(0)
#> One column was added to the database.
ENRICHMENT. Deriving country names for records missing country names.
check_pf <- bdc_country_from_coordinates(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country")
#> bdc_country_from_coordinates:
#> Country names were added to 27 records.
ENRICHMENT. Country names are standardized against a list of country names in several languages retrieved from Wikipedia.
check_pf <- bdc_country_standardized(
  data = check_pf,
  country = "country"
)
#> Standardizing country names
#>
#> country found: Bolivia
#> country found: Brazil
#> |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
#> |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
#> |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=00s
#>
#> bdc_country_standardized:
#> The country names of 65 records were standardized.
#> Two columns ('country_suggested' and 'countryCode') were added to the database.
AMENDMENT. A mismatch between the informed country and the coordinates can be the result of negative or transposed coordinates. Once a mismatch is detected, different coordinate transformations are applied to try to resolve it. Verbatim coordinates are then replaced by the rectified ones in the returned database (a database containing both verbatim and corrected coordinates can be created in the “Output” folder if save_outputs = TRUE). Records near a country's coastline are not tested, to avoid false positives.
check_pf <-
  bdc_coordinates_transposed(
    data = check_pf,
    id = "database_id",
    sci_names = "scientificName",
    lat = "decimalLatitude",
    lon = "decimalLongitude",
    country = "country_suggested",
    countryCode = "countryCode",
    border_buffer = 0.2, # in decimal degrees (~22 km at the equator)
    save_outputs = FALSE
  )
#> Correcting latitude and longitude transposed
#>
#> 15 occurrences will be tested
#> Processing occurrences from: BR (15)
#> No latitude and longitude were transposed
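To make the idea of "coordinate transformations" more concrete, here is a minimal base R sketch that enumerates sign flips and latitude/longitude swaps for a single record and keeps the candidates that fall inside a reference area. It is only an illustration: the exact set of transformations used by the package is not reproduced here, the helper names are hypothetical, and the rectangle is a rough stand-in for a country polygon.

```r
# Hypothetical helper: all sign-flip and swap combinations of one coordinate pair.
candidate_transformations <- function(lat, lon) {
  data.frame(
    transformation = c("original", "flip_lat", "flip_lon", "flip_both",
                       "swap", "swap_flip_lat", "swap_flip_lon", "swap_flip_both"),
    lat = c(lat, -lat,  lat, -lat, lon, -lon,  lon, -lon),
    lon = c(lon,  lon, -lon, -lon, lat,  lat, -lat, -lat)
  )
}

# Rough rectangle around Brazil, for illustration only (an assumption of this
# sketch; a real check would use the country polygon).
in_box <- function(lat, lon) {
  lat >= -35 & lat <= 6 & lon >= -75 & lon <= -34
}

# A record whose latitude and longitude were entered in the wrong order
cand <- candidate_transformations(lat = -47.9, lon = -15.8)
hits <- cand$transformation[in_box(cand$lat, cand$lon)]
hits
```

When exactly one candidate other than "original" lands inside the reference area, as in this toy record, it is a natural suggestion for the corrected coordinates; ambiguous cases would need manual inspection.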
VALIDATION. This test flags records outside one or multiple reference countries, i.e., records in other countries or beyond a specified distance from the coast (e.g., in the ocean). The distance buffer avoids flagging as invalid those records that are close to country limits (e.g., records of coastal or marshland species).
check_pf <-
  bdc_coordinates_country_inconsistent(
    data = check_pf,
    country_name = "Brazil",
    country = "country_suggested",
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    dist = 0.1 # in decimal degrees (~11 km at the equator)
  )
#> dist is assumed to be in decimal degrees (arc_degrees).
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#>
#> bdc_coordinates_country_inconsistent:
#> Flagged 15 records.
#> One column was added to the database.
ENRICHMENT. Coordinates can be derived from a detailed description of the locality associated with records in a process called retrospective geo-referencing.
xyFromLocality <- bdc_coordinates_from_locality(
  data = check_pf,
  locality = "locality",
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  save_outputs = FALSE
)
#> bdc_coordinates_from_locality
#> Found 38 records missing or with invalid coordinates but with potentially useful information on locality
Here we create a column named “.summary” summing up the results of all VALIDATION tests (those starting with “.”). This column is FALSE when a record is flagged as FALSE in any data quality test (i.e. potentially invalid or suspect record).
check_pf <- bdc_summary_col(data = check_pf)
#> bdc_summary_col:
#> Flagged 65 records.
#> One column was added to the database.
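The logic of the summary column can be sketched in base R as a row-wise logical AND over all flag columns. This is an illustration of the rule described above, not the package's implementation; the toy data and the column name summary_sketch are hypothetical.

```r
# Toy data with two flag columns (names starting with ".").
toy <- data.frame(
  scientificName        = c("Species a", "Species b", "Species c"),
  .scientificName_empty = c(TRUE, TRUE, FALSE),
  .coordinates_empty    = c(TRUE, FALSE, TRUE)
)

# Select every column whose name starts with "."
flag_cols <- grep("^\\.", names(toy), value = TRUE)

# A record passes only if it is TRUE in every flag column.
# "summary_sketch" is a hypothetical name used to avoid confusion with .summary.
toy$summary_sketch <- Reduce(`&`, toy[flag_cols])
toy$summary_sketch
```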
Creating a report summarizing the results of all tests. The report
can be automatically saved if save_report = TRUE.
report <-
  bdc_create_report(data = check_pf,
                    database_id = "database_id",
                    workflow_step = "prefilter",
                    save_report = FALSE)
report
Here we create figures (bar plots and maps) to make the
interpretation of the results of data quality tests easier. See some
examples below. Figures can be automatically saved if
save_figures = TRUE.
figures <-
  bdc_create_figures(data = check_pf,
                     database_id = "database_id",
                     workflow_step = "prefilter",
                     save_figures = FALSE)
# Check figures using
figures$.coordinates_country_inconsistent
We can remove records flagged as erroneous or suspect to obtain a “clean” database. Records missing names or coordinates, outside a region of interest, or from doubtful sources are rarely suitable for biodiversity analyses. We can use the column .summary to keep only valid records, that is, those passing all tests (flagged as “TRUE”). Next, we use the bdc_filter_out_flags function to remove all test columns (those starting with “.”).
output <-
  check_pf %>%
  dplyr::filter(.summary == TRUE) %>%
  bdc_filter_out_flags(data = ., col_to_remove = "all")
#> bdc_fiter_out_flags:
#> The following columns were removed from the database:
#> .scientificName_empty, .coordinates_empty, .coordinates_outOfRange, .basisOfRecords_notStandard, .coordinates_country_inconsistent, .summary
You can use qs::qsave() instead of write_csv to save a large database in a compressed format.
# use qs::qsave() to save the database in a compressed format and then qs::qread() to load the database
output %>%
  readr::write_csv(.,
                   here::here("Output", "Intermediate", "01_prefilter_database.csv"))