noctua
now supports caching. This was originally
inspired by pyathena
to reduce the cost of using AWS
Athena
. noctua
however has a different caching
method and utilities local caching in R
environments
instead of using AWS list_query_executions
. This is down to
dbClearResult
clearing AWS S3
’s AWS
Athena
output when caching is disabled.
By caching queries the performance of repeat queries is significantly
improved. This is because the query is no longer sent to AWS
Athena
. Instead the query ID, of the repeating query, is taken
from the R environment and the result is returned from AWS
S3
.
library(DBI)
library(noctua)
= dbConnect(athena())
con
# Start caching queries
noctua_options(cache_size = 10)
# Upload Data to AWS Athena
dbWriteTable(con, "iris", iris, partition = c("Partition" = "01"))
# initial query to AWS Athena
system.time(df1 = dbGetQuery(con, "select * from iris"))
# Info: (Data scanned: 3.63 KB)
# user system elapsed
# 0.105 0.004 3.397
# repeat query to AWS Athena
system.time(df2 = dbGetQuery(con, "select * from iris"))
# Info: (Data scanned: 3.63 KB)
# user system elapsed
# 0.072 0.000 0.348
Here we can see a performance increase of x10 with repeat query execution.
The weakness in caching occurs when the underlying data is updated. The cache will still only retrieve the previous query ID. This means that the new updated data won’t be return when the caching is enabled:
# Updating iris table
dbWriteTable(con, "iris", iris, append = T, partition = c("Partition" = "02"))
= dbGetQuery(con, "select * from iris")
dt5
# Stop using cache data
noctua_options()
= dbGetQuery(con, "select * from iris")
dt6
nrow(dt5)
# 150
nrow(dt6)
# 300
Sadly the cached query didn’t pick up the new data from
iris
.
The caching method in noctua
will remember previous
query ids within each R session, even if you stop and start caching in
noctua_options
.
# Start caching
noctua_options(cache_size = 10)
= dbExecute(con, "select * from iris")
res1
# Stop caching
noctua_options()
= dbExecute(con, "select * from iris")
res2
# Start caching
noctua_options(cache_size = 10)
= dbExecute(con, "select * from iris")
res3
# Compare Query ID's
@info$QueryExecutionId
res1# 9a9272f5-0632-4774-9aa9-d07f151dabc5
@info$QueryExecutionId
res2# be12fe0-3ec0-4595-b3e6-b3bf67efa266
@info$QueryExecutionId
res3# 9a9272f5-0632-4774-9aa9-d07f151dabc5
We can see that res1
and res3
utilise the
same QueryID, even tho caching was stopped and started.
To clear down the cache, just set the parameter:
clear_cache
within noctua_options
to
TRUE
noctua_options(clear_cache = T)