rredlist benchmarks

Introduction

rredlist provides two APIs: a higher-level one that takes slightly more time but returns the data in a more user-friendly format (a list), and a lower-level one (i.e., functions whose names end with “_”) that takes less time but does no processing of the data, returning the raw JSON string. Both APIs return exactly the same information, so it is up to the user to decide whether the processing is worth the extra time, especially when performing bulk operations. To help inform that decision, this document benchmarks the two APIs. First, we’ll measure the total difference in computation time between them, then we’ll dig into which components cause that difference. We’ll use microbenchmark::microbenchmark(), which has very little computational overhead. Note that the time units vary from comparison to comparison, and that the speed of these functions may be highly hardware- and network-dependent.
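To make the difference in return formats concrete, here is a minimal sketch of the kind of post-processing the low-level API leaves to the user. The JSON string is a made-up stand-in for a response (its field names are assumptions, not the actual IUCN schema), and jsonlite::fromJSON() is one reasonable choice of parser, not necessarily what rredlist uses internally.

```r
library(jsonlite)

# Hypothetical stand-in for the raw JSON string returned by a low-level ("_")
# function. The field names are illustrative only, not the real API schema.
raw_json <- '{"count": 3, "taxa": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]}'

# The low-level result is a single character string
class(raw_json)

# Parsing it yields a list; the array of objects is simplified to a data frame
parsed <- fromJSON(raw_json)
class(parsed)
parsed$count
parsed$taxa$name
```

The structure that comes out of a generic parser may not match what the high-level functions return, so treat this only as an illustration of the extra step involved.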

library(rredlist)
library(microbenchmark)

Head-to-head benchmarks

We’ll start by benchmarking the two APIs head-to-head. We’ll test several use cases, in rough order of increasing complexity.

1. Get species count

microbenchmark(
  rl_sp_count(),
  rl_sp_count_(),
  times = 10
)
#> Unit: milliseconds
#>            expr      min       lq     mean   median       uq      max neval cld
#>   rl_sp_count() 116.7317 119.2230 120.2650 119.9624 120.2204 127.7930    10   a
#>  rl_sp_count_() 116.9935 117.4514 142.3724 118.6451 120.1328 313.6149    10   a

2. Lookup individual assessment

microbenchmark(
  rl_assessment(136250858),
  rl_assessment_(136250858),
  times = 10
)
#> Unit: milliseconds
#>                       expr      min       lq     mean   median      uq      max neval cld
#>   rl_assessment(136250858) 240.1841 243.4012 250.3891 248.4046 255.671 265.0165    10   a
#>  rl_assessment_(136250858) 239.4693 242.0733 260.7446 247.0122 249.948 388.4490    10   a

3. Taxonomic lookup with defaults

microbenchmark(
  rl_family(),
  rl_family_(),
  times = 10
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>   rl_family() 126.6563 126.9070 131.7954 127.7793 128.5488 158.2391    10   a
#>  rl_family_() 126.2004 129.5163 132.9154 131.4624 136.4402 141.6659    10   a

4. Taxonomic lookup with query (one page of results)

microbenchmark(
  rl_family("Rheidae"),
  rl_family_("Rheidae"),
  times = 10
)
#> Unit: milliseconds
#>                   expr      min       lq     mean   median       uq      max neval cld
#>   rl_family("Rheidae") 614.5653 628.8536 672.6828 669.7144 712.9704 739.9620    10   a
#>  rl_family_("Rheidae") 606.4974 642.0678 693.4926 676.5411 756.3828 822.1573    10   a

5. Taxonomic lookup with query (~10 pages of results)

microbenchmark(
  rl_family("Corvidae", quiet = TRUE),
  rl_family_("Corvidae", quiet = TRUE),
  times = 10
)
#> Unit: seconds
#>                                  expr      min       lq     mean   median       uq      max neval cld
#>   rl_family("Corvidae", quiet = TRUE) 10.89607 11.02283 11.18194 11.17454 11.32452 11.48761    10   a
#>  rl_family_("Corvidae", quiet = TRUE) 10.83396 10.92033 11.09111 11.04066 11.16850 11.63398    10   a

6. Taxonomic lookup with query (~40 pages of results)

microbenchmark(
  rl_family("Tyrannidae", quiet = TRUE),
  rl_family_("Tyrannidae", quiet = TRUE),
  times = 10
)
#> Unit: seconds
#>                                    expr      min       lq     mean   median       uq      max neval cld
#>   rl_family("Tyrannidae", quiet = TRUE) 38.17690 38.26825 38.75878 38.58874 38.93320 39.98190    10   a
#>  rl_family_("Tyrannidae", quiet = TRUE) 37.74677 37.93254 38.28780 38.00963 38.72292 39.07203    10   a

7. Taxonomic lookup with query (~900 pages of results)

microbenchmark(
  rl_class("Aves", quiet = TRUE),
  rl_class_("Aves", quiet = TRUE),
  times = 10
)
#> Unit: seconds
#>                             expr      min       lq     mean   median       uq      max neval cld
#>   rl_class("Aves", quiet = TRUE) 1424.662 1451.755 1467.593 1468.110 1483.732 1513.186    10   a
#>  rl_class_("Aves", quiet = TRUE) 1428.810 1435.013 1448.085 1444.788 1465.589 1472.253    10   a

And the winner is…

As you can see above, the two APIs take roughly the same amount of time for most use cases. The introduction described the low-level API as the faster of the two. Most of these comparisons agree with that statement (five of the seven, judging by the medians), but the time reduction is usually only a few milliseconds per function call, and in two cases (rl_family() with defaults and rl_family("Rheidae")) the low-level call was actually slightly slower. When we get into more complex queries that return multiple pages of API results, the time reduction grows with the number of pages: roughly 0.1 seconds for ~10 pages, 0.6 seconds for ~40 pages, and 20 seconds for ~900 pages. Even then, the saving is under 2% of the total run time.
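The size of these differences can be checked directly from the tables. The sketch below copies the median column from each benchmark above (converted to seconds) and computes the absolute and relative saving of the low-level API; the data frame and its column names are introduced here purely for illustration.

```r
# Median run times copied from the benchmark tables above, in seconds
medians <- data.frame(
  query = c("rl_sp_count", "rl_assessment", "rl_family (defaults)",
            "Rheidae", "Corvidae", "Tyrannidae", "Aves"),
  high  = c(0.1199624, 0.2484046, 0.1277793, 0.6697144,
            11.17454, 38.58874, 1468.110),
  low   = c(0.1186451, 0.2470122, 0.1314624, 0.6765411,
            11.04066, 38.00963, 1444.788)
)

# Absolute saving (seconds) and relative saving (%) from the low-level API;
# negative values mean the low-level call was slower in that comparison
medians$saved_s   <- medians$high - medians$low
medians$saved_pct <- 100 * medians$saved_s / medians$high

print(medians, digits = 3)
```

With only 10 runs per call and a network request in every run, differences of this size are hard to separate from run-to-run variation.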

Query breakdown

Based on the above, it doesn’t seem to matter much, time-wise, whether we parse the data or not. So then what takes up all of the query time? Let’s break down the process of querying the API and downloading a single page of assessments using some of the internal functions of rredlist:

microbenchmark(
  res <- rredlist:::rr_GET_raw("taxa/family/Rheidae"), # get the raw data for the first page
  x <- res$parse("UTF-8"), # parse the raw response data to JSON
  rredlist:::rl_parse(x, parse = TRUE), # parse the JSON to a list of dataframes
  rredlist:::rl_parse(x, parse = FALSE), # parse the JSON to a list of lists
  times = 10
)
#> Unit: microseconds
#>                                                 expr      min       lq      mean    median       uq      max neval cld
#>  res <- rredlist:::rr_GET_raw("taxa/family/Rheidae") 588296.5 595734.3 607564.29 603497.10 620369.4 634292.7    10  a 
#>                              x <- res$parse("UTF-8")    807.9    829.0   1971.02   1028.45   2543.7   7116.8    10   b
#>                 rredlist:::rl_parse(x, parse = TRUE)   1021.6   1159.5   1440.46   1212.80   1302.7   3237.3    10   b
#>                rredlist:::rl_parse(x, parse = FALSE)     76.4     97.2    167.72    114.65    292.4    317.2    10   b

The above benchmarking shows us that the vast majority of time is spent downloading data from the IUCN API. For a single page of results, even the highest level of parsing takes only about 0.2% of the time it takes to download the data (a median of roughly 1.2 milliseconds versus 603 milliseconds). Further, while parsing to a list of dataframes (parse = TRUE) takes about 10 times as long as just parsing to a list of lists (parse = FALSE), both methods remain very quick compared to the process of downloading the data.
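As a worked check, the following recomputes those proportions from the median column of the single-page benchmark. The variable names are introduced here for illustration only.

```r
# Median timings (microseconds) copied from the single-page benchmark above
download_us    <- 603497.10  # rr_GET_raw()
parse_true_us  <- 1212.80    # rl_parse(x, parse = TRUE)
parse_false_us <- 114.65     # rl_parse(x, parse = FALSE)

# Parsing cost as a percentage of the download time
pct_true  <- 100 * parse_true_us / download_us
pct_false <- 100 * parse_false_us / download_us

# How many times longer parsing to data frames takes than parsing to lists
ratio <- parse_true_us / parse_false_us

round(c(pct_true = pct_true, pct_false = pct_false, ratio = ratio), 3)
```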

Now let’s break down a multi-page query:

microbenchmark(
  lst <- rredlist:::page_assessments("taxa/family/Tyrannidae",
                                     key = rredlist:::check_key(NULL),
                                     quiet = TRUE), # get the data for all of the pages
  rredlist:::combine_assessments(lst, parse = TRUE), # parse the JSON to a list of dataframes
  rredlist:::combine_assessments(lst, parse = FALSE), # parse the JSON to a list of lists
  times = 10
)
#> Unit: milliseconds
#>                                                                                                               expr        min         lq
#>  lst <- rredlist:::page_assessments("taxa/family/Tyrannidae",      key = rredlist:::check_key(NULL), quiet = TRUE) 36737.2111 36941.2153
#>                                                                  rredlist:::combine_assessments(lst, parse = TRUE)   249.1083   266.0197
#>                                                                 rredlist:::combine_assessments(lst, parse = FALSE)    13.7986    15.4881
#>         mean      median         uq        max neval cld
#>  37558.35359 37456.89045 38066.7995 39157.4033    10  a 
#>    272.17091   271.24645   279.7633   291.9884    10   b
#>     16.63622    15.62825    16.6812    23.0368    10   b

Again, even with about 40 pages of data to parse, the download takes the vast majority of the time. The highest-level parsing has increased to about 0.7% of the time it takes to download the data, but this is still only about 0.27 seconds compared to the ~37 second download.
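Dividing the medians by the page count gives a rough per-page view of the same comparison. The figure of 40 pages is the approximate count quoted in the text, so the per-page numbers below are estimates only.

```r
# Median timings (milliseconds) copied from the multi-page benchmark above
download_ms      <- 37456.89045  # page_assessments()
combine_true_ms  <- 271.24645    # combine_assessments(lst, parse = TRUE)
combine_false_ms <- 15.62825     # combine_assessments(lst, parse = FALSE)
n_pages <- 40                    # approximate, as stated in the text

# Rough cost per page of results, in milliseconds
per_page <- c(download    = download_ms,
              parse_true  = combine_true_ms,
              parse_false = combine_false_ms) / n_pages
round(per_page, 2)

# Parsing cost as a percentage of the download time
share_pct <- 100 * c(parse_true  = combine_true_ms,
                     parse_false = combine_false_ms) / download_ms
round(share_pct, 2)
```

Under these assumptions each page costs on the order of a second to download and only a few milliseconds to parse.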

Conclusion

Both APIs take about the same amount of time because the majority of time is spent downloading the data from the IUCN database and reading it into R. For larger downloads, the parsing done by the high-level API may take an appreciable amount of time (tenths of seconds to seconds). Users who call these functions many times (e.g., thousands) might appreciate this time reduction, but for most users it probably won’t matter. Furthermore, keep in mind that if you use the low-level API you will likely need to do your own processing after the fact before any downstream analysis. Ultimately, the choice is up to you.
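For bulk use, one possible pattern is to download everything with the low-level API first and parse afterwards. The sketch below is only an illustration: fetch_raw(), parse_raw(), and stub_fetch() are hypothetical helpers, not part of rredlist, and jsonlite::fromJSON() is an assumed parser whose output may be structured differently from what the high-level API returns. The stub stands in for the real API call so the example runs offline.

```r
# Download step: `fetch` is any function that returns a raw JSON string.
# The default assumes rredlist's low-level rl_assessment_() and a configured
# API key; it is only evaluated if no other `fetch` is supplied.
fetch_raw <- function(ids, fetch = rredlist::rl_assessment_) {
  stats::setNames(lapply(ids, fetch), ids)
}

# Parsing step, deferred until all downloads have finished
parse_raw <- function(raw_list) {
  lapply(raw_list, jsonlite::fromJSON)
}

# Offline demonstration with a stub in place of the real API call;
# the "assessment_id" field is made up for this example
stub_fetch <- function(id) sprintf('{"assessment_id": %d}', id)

raw    <- fetch_raw(c(1, 2, 3), fetch = stub_fetch)
parsed <- parse_raw(raw)
parsed[["2"]]$assessment_id
```

Keeping the raw strings around also means they can be cached to disk and reparsed later without repeating the slow download step.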