---
title: "BiocFileCache: Managing File Resources Across Sessions"
author: Lori Shepherd
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteIndexEntry{BiocFileCache: Managing File Resources Across Sessions}
    %\VignetteEncoding{UTF-8}
    %\VignetteDepends{rtracklayer}
---

```{r setup, echo=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

# Overview

Organizing files on a local machine can be cumbersome. This is especially
true for local copies of remote resources that must periodically be downloaded
again to stay current. [BiocFileCache][] is designed to help manage local files
and local copies of remote resources. It provides a convenient location to
organize files and, once a file is added to the cache, functions to determine
whether a remote resource is out of date and requires a new download.

## Installation and Loading

`BiocFileCache` is a _Bioconductor_ package and can be installed through
`BiocManager` (the older `biocLite` installer is no longer supported).

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocFileCache")
```

After the package is installed, it can be loaded into the _R_ session with

```{r, results='hide', warning=FALSE, message=FALSE}
library(BiocFileCache)
```

## Creating / Loading the Cache

The initial step to utilizing [BiocFileCache][] in managing files is to create a
cache object specifying a location. We will create a temporary directory for use
with examples in this vignette. If a path is not specified upon creation, the
default location is a directory `~/.BiocFileCache`.

```{r}
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)
```

If the path location exists and has been utilized to store files previously, the
previous object will be loaded with any files saved to the cache.
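
As a minimal sketch of this reloading behaviour, the unevaluated chunk below
creates a cache in a hypothetical temporary directory (`reloadDemo`), registers
one entry with `bfcnew()` (introduced later in this vignette), and then
constructs a second object on the same path. The directory and resource names
are illustrative assumptions, not part of the package.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical location used only for this illustration
demo_path <- file.path(tempdir(), "reloadDemo")

## first object: creates the directory and its tracking database
first <- BiocFileCache(demo_path, ask = FALSE)
saveRDS(1:3, bfcnew(first, "demo-object", ext = ".rds"))

## second object on the same path: loads the existing cache,
## including the entry registered above
second <- BiocFileCache(demo_path, ask = FALSE)
length(second)
```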

Some utility functions to examine the cache are:

 * `bfccache(bfc)`
 * `length(bfc)`
 * `show(bfc)`
 * `bfcinfo(bfc)`

`bfccache()` will show the cache path. **NOTE**: Because we are using temporary
directories, your path location will be different than shown.

```{r}
bfccache(bfc)
length(bfc)
```

`length()` on a `BiocFileCache` shows the number of files currently being
tracked by the cache. For more detailed information on what is stored
in the `BiocFileCache` object, there is a show method which will display the
object, object class, cache path, and number of items currently being tracked.

```{r}
bfc
```

`bfcinfo()` will list a table of `BiocFileCache` resource files being tracked in
the cache. It returns a [dplyr][] object of class `tbl_sqlite`.

```{r}
bfcinfo(bfc)
```

The table of resource files includes the following information:

 * `rid`: resource id. Autogenerated. This is a unique identifier automatically
   generated when a resource is added to the cache.
 * `rname`: resource name. This is given by the user when a resource is added to
   the cache. It does not have to be unique and can be updated at any time. We
   recommend descriptive key words and identifiers.
 * `create_time`: The date and time a resource is added to the cache.
 * `access_time`: The date and time a resource is utilized within the cache. The
   access time is updated when the resource is updated or accessed.
 * `rpath`: resource path. This is the path to the local file.
 * `rtype`: resource type. Either "local" or "web", indicating if the resource
   has a remote origin.
 * `fpath`: If rtype is "web", this is the link to the remote resource. It will
   be utilized to download the remote data.
 * `last_modified_time`: For a remote resource, the last_modified (if available)
   information for the local copy of the data. This information is checked
   against the remote resource to determine if the local copy is stale and needs
   to be updated.
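
To make the column layout concrete, here is a small unevaluated sketch that
registers two placeholder entries in a hypothetical cache (`infoDemo`) and pulls
a few of the columns described above into a plain `data.frame`. The cache
location and resource names are assumptions made for illustration.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical cache used only for this illustration
info_cache <- BiocFileCache(file.path(tempdir(), "infoDemo"), ask = FALSE)
bfcnew(info_cache, "first-entry")
bfcnew(info_cache, "second-entry")

## convert the table to a data.frame for ordinary base R subsetting
info <- as.data.frame(bfcinfo(info_cache))
info[, c("rid", "rname", "rtype", "rpath")]
```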

Now that we have created the cache object and location, let's explore adding
files that the cache will manage!

## Adding / Tracking Resources

Now that a `BiocFileCache` object and cache location have been created, files
can be added to the cache for tracking. There are two functions to add a
resource to the cache:

 * `bfcnew()`
 * `bfcadd()`

The difference between the options: `bfcnew()` creates an entry for a resource
and returns a filepath to save to. As there are many types of data that can be
saved in many different ways, `bfcnew()` allows you to save any _R_ data object
in the appropriate manner and still be able to track the saved file. `bfcadd()`
should be utilized when a file already exists or a remote resource is being
accessed.

`bfcnew()` takes the `BiocFileCache` object and a user-specified `rname` and
returns a path location to save data to. Optionally, you can add the file
extension if you know the type of file that will be saved:

```{r}
savepath <- bfcnew(bfc, "NewResource", ext=".RData")
savepath

## now we can use that path in any save function
m = matrix(1:12, nrow=3)
save(m, file=savepath)

## and that file will be tracked in the cache
bfcinfo(bfc)
```
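
The same pattern works with any writer that accepts a file path. As a hedged
sketch (the cache location `roundTripDemo` and the resource name are
hypothetical), the unevaluated chunk below saves a small table with `saveRDS()`
and later reads it back by looking up its `rid`:

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical cache used only for this illustration
rt_cache <- BiocFileCache(file.path(tempdir(), "roundTripDemo"), ask = FALSE)

original <- data.frame(x = 1:3, y = c("a", "b", "c"),
                       stringsAsFactors = FALSE)

## reserve a path in the cache, then write to it
slot <- bfcnew(rt_cache, "small-table", ext = ".rds")
saveRDS(original, slot)

## later: the name of the returned vector is the rid of the entry
restored <- readRDS(bfcrpath(rt_cache, rids = names(slot)))
identical(original, restored)
```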

`bfcadd()` is for existing files or remote resources. The user still specifies
an `rname` of their choosing, but must also specify the path to a local file or
web resource as `fpath`. If no `fpath` is given, the `rname` is assumed to also
be the path location. If the `fpath` is a local file, the `action` argument
determines how it is handled: `copy` copies the existing file into the cache
directory, `move` moves it into the cache directory, and `asis` leaves the file
wherever it is on the local system while still tracking it through the cache
object. `copy` and `move` rename the file to the generated cache file path. If
the `fpath` is a remote source, `bfcadd()` attempts to download it; if the
download succeeds, the file is saved in the cache location and tracked in the
cache object, and the original source is recorded in the cache information as
`fpath`. Relative path locations may also be used, specified with
`rtype = "relative"`. This stores a relative location for the file within the
cache; only the actions `copy` and `move` are available for relative paths.

First let's use local files:

```{r}
fl1 <- tempfile(); file.create(fl1)
add2 <- bfcadd(bfc, "Test_addCopy", fl1)                 # copy
# returns filepath being tracked in cache
add2
# the name is the unique rid in the cache
rid2 <- names(add2)

fl2 <- tempfile(); file.create(fl2)
add3 <- bfcadd(bfc, "Test2_addMove", fl2, action="move") # move
rid3 <- names(add3)

fl3 <- tempfile(); file.create(fl3)
add4 <- bfcadd(bfc, "Test3_addAsis", fl3, rtype="local",
               action="asis") # reference
rid4 <- names(add4)

file.exists(fl1)    # TRUE - copied from original location
file.exists(fl2)    # FALSE - moved from original location
file.exists(fl3)    # TRUE - left asis, original location tracked
```

Now let's add some examples with remote sources:

```{r}
url <- "http://httpbin.org/get"
add5 <- bfcadd(bfc, "TestWeb", fpath=url)
rid5 <- names(add5)

url2 <- "https://en.wikipedia.org/wiki/Bioconductor"
add6 <- bfcadd(bfc, "TestWeb", fpath=url2)
rid6 <- names(add6)

# let's look at our BiocFileCache object now
bfc
bfcinfo(bfc)
```

Now that we are tracking resources, let's explore accessing their information!

## Investigating / Accessing Resources

Before exploring individual resources, we introduce a helper function. Most of
the functions provided require the unique `rid`(s) assigned to a resource.
`bfcadd()` and `bfcnew()` return the path as a named character vector; the name
of each element is the `rid`. However, you may want to access a resource that
you added some time ago, and no longer have that return value at hand.

 * `bfcquery()`

`bfcquery()` will take in a key word and search across the `rname`, `rpath`, and
`fpath` for any matching entries.

```{r}
bfcquery(bfc, "Web")

bfcquery(bfc, "copy")

q1 <- bfcquery(bfc, "wiki")
q1
class(q1)
```

As you can see above, `bfcquery()` returns an object of class `tbl_sql` that can
be investigated further using methods for this class, such as those from the
`dplyr` package. The `rid` can be seen in the first column of the table and
used in other functions. To get a quick count of how many objects in the cache
matched the query, use `bfccount()`.

```{r}
bfccount(q1)
```
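
A common next step is to turn a query result into a vector of `rid`s for use in
other functions. The unevaluated sketch below shows one way to do this with
`dplyr`, using a hypothetical cache (`queryDemo`) and made-up resource names.

```{r, eval=FALSE}
library(BiocFileCache)
library(dplyr)

## hypothetical cache and entries used only for this illustration
q_cache <- BiocFileCache(file.path(tempdir(), "queryDemo"), ask = FALSE)
bfcnew(q_cache, "counts-batchA")
bfcnew(q_cache, "counts-batchB")
bfcnew(q_cache, "annotation")

## find every entry matching "counts"
hits <- bfcquery(q_cache, "counts")
bfccount(hits)

## extract the matching rids and use them to subset the cache information
rids <- hits %>% collect(Inf) %>% pull(rid)
bfcinfo(q_cache, rids)
```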


 * `[`

`[` allows for subsetting of the BiocFileCache object. The output will be a
BiocFileSubCache object. Users will still be able to query, remove (from the
subset object only), and access resources of the subset; however, the resources
cannot be updated.

```{r}
bfcsubWeb = bfc[paste0("BFC", 5:6)]
bfcsubWeb
bfcinfo(bfcsubWeb)
```

There are three methods for retrieving the `BiocFileCache` resource path
location.

 * `[[`
 * `bfcpath()`
 * `bfcrpath()`

The `[[` will access the `rpath` saved in the `BiocFileCache`. Retrieving this
location will return the path to the local version of the resource; allowing the
user to then use this path in any load/read methods most appropriate for the
resource. The `bfcpath()` returns a named character vector also displaying the
local file that can be used for retrieval. If the resource is a remote resource,
`bfcpath()` will also return the path to the original source saved as
`fpath`. The `bfcrpath()` returns a named character vector only displaying the
local file. `bfcrpath()` can also be used to add a resource to the
cache. `bfcrpath()` can take an argument `rnames`; if an element of `rnames`
is not found, it will try to add it to the cache with `bfcadd()`.

```{r}
bfc[["BFC2"]]
bfcpath(bfc, "BFC2")
bfcpath(bfc, "BFC5")
bfcrpath(bfc, rids="BFC5")
bfcrpath(bfc)
bfcrpath(bfc, c("http://httpbin.org/get","Test3_addAsis"))
```
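
The "return the path, adding the resource if needed" behaviour of `bfcrpath()`
can be sketched without a network connection by using a local file as the
`rname`. The cache location `getOrAddDemo` and the source file are hypothetical,
and the chunk is not evaluated here.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical cache and source file used only for this illustration
ga_cache <- BiocFileCache(file.path(tempdir(), "getOrAddDemo"), ask = FALSE)
src <- tempfile()
writeLines("example", src)

## first call: 'src' is not a known rname, so it is added via bfcadd()
p1 <- bfcrpath(ga_cache, src)

## second call: the rname is now found, so the tracked path is returned
p2 <- bfcrpath(ga_cache, src)

## only one entry is tracked after both calls
length(ga_cache)
```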

Managing remote resources locally involves knowing when to update the local copy
of the data.

 * `bfcneedsupdate()`

`bfcneedsupdate()` is a method that compares the last_modified tag of the local
copy of the data with the last_modified tag of the remote source. The cache
saves this information when the web resource is initially added. If the
resource does not have a last_modified tag, the result is undetermined (`NA`).

**Note:** This function does not automatically download the remote source if it
  is out of date.  Please see `bfcdownload()`.

```{r}
bfcneedsupdate(bfc, "BFC5")
bfcneedsupdate(bfc, "BFC6")
bfcneedsupdate(bfc)
```

## Updating Resource Entries or Local Copy of Remote Data

Just as you could access the `rpath`, the local resource path can be set with

 * `[[<-`

The file must exist in order to be replaced in the `BiocFileCache`. If the user
wishes to rename a file, they must first make a copy of (or touch) the file
under the new name.

```{r}
fileBeingReplaced <- bfc[[rid3]]
fileBeingReplaced

# fl3 was created when we were adding resources
fl3

bfc[[rid3]]<-fl3
bfc[[rid3]]
```

The user may also wish to change the `rname` or `fpath` associated with a
resource in addition to the `rpath`. This can be done with

 * `bfcupdate()`

Again, if changing the `rpath`, the file must exist. If an `fpath` is being
updated, the data will be downloaded and will overwrite the current file
specified in `rpath`.

```{r}
bfcinfo(bfc, "BFC1")
bfcupdate(bfc, "BFC1", rname="FirstEntry")
bfcinfo(bfc, "BFC1")
```

Now let's update a web resource:

```{r}
library(dplyr)
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
bfcupdate(bfc, "BFC6", fpath=url, rname="Duplicate")
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
```

Lastly, remote resources may require an update if the data is out of date (see
`bfcneedsupdate()`). The `bfcdownload()` function will attempt to download from
the original resource saved in the cache as `fpath` and overwrite the
out-of-date file at `rpath`.

 * `bfcdownload()`

The following confirms that a resource needs updating, and then performs the
update:

```{r}
rid <- "BFC5"
test <- !identical(bfcneedsupdate(bfc, rid), FALSE) # 'TRUE' or 'NA'
if (test)
    bfcdownload(bfc, rid)
```

## Adding MetaData

The following functions are provided for metadata:

 * `bfcmeta()<-`
 * `bfcmeta()`
 * `bfcmetalist()`
 * `bfcmetaremove()`

Additional metadata can be added as `data.frames` that become tables in the sql 
database. The `data.frame` must contain a column `rid` that matches the `rid`
column in the cache. Any metadata added will then be displayed when accessing 
the cache. Metadata is added with `bfcmeta()<-`. A table `name` must be provided
as an argument. Users can add multiple metadata tables as long as the names are 
unique. Tables may be appended or overwritten using additional arguments 
`append=TRUE` or `overwrite=TRUE`.  

```{r}
names(bfcinfo(bfc))
meta <- as.data.frame(list(rid=bfcrid(bfc)[1:3], idx=1:3))
bfcmeta(bfc, name="resourceData") <- meta
names(bfcinfo(bfc))
```
The metadata tables that exist can be listed with `bfcmetalist()` and can be 
retrieved with `bfcmeta()`. 

```{r}
bfcmetalist(bfc)
bfcmeta(bfc, name="resourceData")
```

Lastly, metadata can be removed with `bfcmetaremove()`.

```{r}
bfcmetaremove(bfc, name="resourceData")
```
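
As a hedged illustration of the `overwrite=TRUE` argument mentioned above, the
unevaluated chunk below replaces an existing metadata table in a hypothetical
cache (`metaDemo`). The table name `notes` and its `note` column are made up
for this sketch.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical cache and entries used only for this illustration
meta_cache <- BiocFileCache(file.path(tempdir(), "metaDemo"), ask = FALSE)
bfcnew(meta_cache, "sample-a")
bfcnew(meta_cache, "sample-b")
rids <- bfcrid(meta_cache)

## initial metadata table; the 'rid' column is required
bfcmeta(meta_cache, name = "notes") <-
    data.frame(rid = rids, note = c("draft", "draft"),
               stringsAsFactors = FALSE)

## replace the table of the same name with revised values
bfcmeta(meta_cache, name = "notes", overwrite = TRUE) <-
    data.frame(rid = rids, note = c("reviewed", "needs rerun"),
               stringsAsFactors = FALSE)

bfcmeta(meta_cache, name = "notes")
```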

**Note:** 

Most functions fall back to the default cache, `BiocFileCache()`, when no
BiocFileCache object is supplied. This shortcut is not available for
`bfcmeta()<-`: the cache must first be assigned to a variable, and that
variable must then be passed to the function.

Example of ERROR:
```{r eval=FALSE}
bfcmeta(name="resourceData") <- meta
## Error in bfcmeta(name = "resourceData") <- meta :
##   target of assignment expands to non-language object
```
Correct implementation:
```{r eval=FALSE}
bfc <- BiocFileCache()
bfcmeta(bfc, name="resourceData") <- meta
```
All other functions have a default: if the BiocFileCache object is missing, they
operate on the default cache `BiocFileCache()`.

## Removing Resources

Now that we have added resources, it is also possible to remove a resource.

 * `bfcremove()`

When you remove a resource from the cache, the local file is also deleted,
but only if it is stored in the cache directory as given by `bfccache(bfc)`. If
the resource is a path to a file elsewhere on the user's system, the entry is
removed from the `BiocFileCache` object but the file is not deleted.

```{r}
# let's remind ourselves of our object
bfc

bfcremove(bfc, "BFC6")
bfcremove(bfc, "BFC1")

# let's look at our BiocFileCache object now
bfc
```
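
The difference between files inside and outside the cache directory can be
sketched as follows. This unevaluated example uses a hypothetical cache
(`removeDemo`) and two throwaway files, one tracked `asis` and one copied into
the cache.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical cache and files used only for this illustration
rm_cache <- BiocFileCache(file.path(tempdir(), "removeDemo"), ask = FALSE)
outside <- tempfile(); file.create(outside)
inside_src <- tempfile(); file.create(inside_src)

## tracked in place, outside the cache directory
kept <- bfcadd(rm_cache, "tracked-asis", outside, rtype = "local",
               action = "asis")
## copied into the cache directory
copied <- bfcadd(rm_cache, "tracked-copy", inside_src, action = "copy")

bfcremove(rm_cache, c(names(kept), names(copied)))

file.exists(outside)   # the external file is left on disk
file.exists(copied)    # the copy inside the cache directory is deleted
```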

There is another helper function that may be of use:

 * `bfcsync()`

This function checks two things:

 1. Whether any `rpath` cannot be found (this would occur if `bfcnew()` is used
    and the path was not used to save an object)
 2. Whether there are files in the cache directory (`bfccache(bfc)`) that are
    not being tracked by the `BiocFileCache` object

```{r}
# create a new entry that hasn't been used
path <- bfcnew(bfc, "UseMe")
rmMe <- names(path)
# We also have a file not being tracked because we updated rpath

bfcsync(bfc)

# you can suppress the messages and just have a TRUE/FALSE
bfcsync(bfc, FALSE)

#
# Let's do some cleaning to have a synced object
#
bfcremove(bfc, rmMe)
unlink(fileBeingReplaced)

bfcsync(bfc)
```

## Cleaning or Removing Cache

Finally, there are two functions involved with cleaning or deleting the cache:

 * `cleanbfc()`
 * `removebfc()`

`cleanbfc()` evaluates the resources in the `BiocFileCache` object and
determines which, if any, have not been accessed in a specified number of
days. If `ask=TRUE`, the user is asked, for each entry beyond that threshold,
whether it should be removed from the cache object and its file deleted (files
are only deleted if they are in the `bfccache(bfc)` location). If `ask=FALSE`,
such entries are removed and their files deleted without prompting. The default
number of days is 120.

```{r eval=FALSE}
cleanbfc(bfc)
```
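
A shorter threshold can be set with the `days` argument. The unevaluated sketch
below applies a 30-day threshold without prompting to a hypothetical cache
(`cleanDemo`); the threshold and names are chosen only for illustration.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical cache with one freshly created entry
clean_cache <- BiocFileCache(file.path(tempdir(), "cleanDemo"), ask = FALSE)
saveRDS(1, bfcnew(clean_cache, "recent-entry", ext = ".rds"))

## remove, without prompting, anything not accessed in the last 30 days;
## the entry above was just created, so it is not past the threshold
cleanbfc(clean_cache, days = 30, ask = FALSE)
length(clean_cache)
```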

`removebfc()` will remove the `BiocFileCache` completely from the system. Any
files saved in the `bfccache(bfc)` directory will also be deleted.

```{r eval=FALSE}
removebfc(bfc)
```

**Note:** Use with caution!

# Use Cases

## Local cache of an internet resource

One use for [BiocFileCache][] is to save local copies of remote
resources. The benefits of this approach include reproducibility,
faster access, and access (once cached) without need for an internet
connection. An example is an Ensembl GTF file (also available via
[AnnotationHub][]).

```{r}
## paste to avoid long line in vignette
url <- paste(
    "ftp://ftp.ensembl.org/pub/release-71/gtf",
    "homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
    sep="/")
```

For a system-wide cache, simply load the [BiocFileCache][] package and
ask for the local resource path (`rpath`) of the resource.

```{r, eval=FALSE}
library(BiocFileCache)
bfc <- BiocFileCache()
path <- bfcrpath(bfc, url)
```

Use the path returned by `bfcrpath()` as usual, e.g.,

```{r, eval=FALSE}
gtf <- rtracklayer::import.gff(path)
```

A more compact use, the first or any time, is

```{r, eval=FALSE}
gtf <- rtracklayer::import.gff(bfcrpath(BiocFileCache(), url))
```

Ensembl releases do not change with time, so there is no need to check
whether the cached resource needs to be updated.

## Cache of experimental computations

One might use [BiocFileCache][] to cache results from experimental
analysis. The `rname` field provides an opportunity to provide
descriptive metadata to help manage collections of resources, without
relying on cryptic file naming conventions.

Here we create or use a local file cache in the directory in which we are
doing our analysis.

```{r, eval=FALSE}
library(BiocFileCache)
bfc <- BiocFileCache("~/my-experiment/results")
```

We perform our analysis...

```{r, eval=FALSE}
library(DESeq2)
library(airway)
data(airway)
dds <- DESeqDataSet(airway, design = ~ cell + dex)
result <- DESeq(dds)
```

...and then save our result in a location provided by
[BiocFileCache][].

```{r, eval=FALSE}
saveRDS(result, bfcnew(bfc, "airway / DESeq standard analysis"))
```

Retrieve the result at a later date:

```{r, eval=FALSE}
result <- readRDS(bfcrpath(bfc, "airway / DESeq standard analysis"))
```
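
The save-then-retrieve steps can be folded into a small "compute once, reuse
afterwards" helper. The function `cached_result()` below is a hypothetical
convenience wrapper written for this vignette, not part of `BiocFileCache`; it
assumes the `rname` identifies at most one entry, and the chunk is not
evaluated here.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical helper: return the cached value for 'rname', computing and
## saving it with 'compute()' only when no matching entry exists yet
cached_result <- function(cache, rname, compute) {
    hits <- bfcquery(cache, rname)
    if (bfccount(hits) == 0L) {
        path <- bfcnew(cache, rname, ext = ".rds")
        saveRDS(compute(), path)
    } else {
        path <- bfcrpath(cache, rids = dplyr::pull(hits, rid)[1])
    }
    readRDS(path)
}

## hypothetical cache; the counter shows how often the work is really done
results_cache <- BiocFileCache(file.path(tempdir(), "resultsDemo"),
                               ask = FALSE)
n_computed <- 0
slow_sum <- function() { n_computed <<- n_computed + 1; sum(1:10) }

first_call  <- cached_result(results_cache, "slow-sum", slow_sum)
second_call <- cached_result(results_cache, "slow-sum", slow_sum)
n_computed
```

The second call reads the stored `.rds` file instead of calling `slow_sum()`
again, which is the point of caching the result.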

One might imagine the following workflow:

```{r eval=FALSE}
library(BiocFileCache)
library(rtracklayer)
library(dplyr)   # for %>%, filter() and collect() used below

# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)

# the web resource of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"

# check if url is being tracked
res <- bfcquery(bfc, url)

if (bfccount(res) == 0L) {

    # if it is not in cache, add
    ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url)

} else {

  # if it is in cache, get path to load
  rid = res %>% filter(fpath == url) %>% collect(Inf) %>% `[[`("rid")
  ans <- bfcrpath(bfc, rid)

  # check to see if the resource needs to be updated
  check <- bfcneedsupdate(bfc, rid)
  # check can be NA if it cannot be determined, choose how to handle
  if (is.na(check)) check <- TRUE
  if (check){
    ans <- bfcdownload(bfc, rid)
  }
}


# ans is the path of the file to load
ans


# we know, because we searched for the url, that the file is a .gtf.gz;
# if we had searched on other terms we could use 'bfcpath' to see the
# original fpath and choose the appropriate load/read/import method
bfcpath(bfc, names(ans))

temp = GTFFile(ans)
info = import(temp)
```

```{r eval=TRUE}

#
# A simpler way to test whether something is in the cache,
# and to start tracking it if it is not, is to use `bfcrpath`
#


library(BiocFileCache)
library(rtracklayer)

# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)

# the web resources of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"

url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz"

# if not in cache will download and create new entry
pathsToLoad <- bfcrpath(bfc, c(url, url2))

pathsToLoad

# now load files as you see fit
info = import(GTFFile(pathsToLoad[1]))
class(info)
summary(info)
```

```{r eval=FALSE}
#
# One could also imagine the following:
#

library(BiocFileCache)

# load the cache
bfc <- BiocFileCache()

#
# Do some work!
#

# add a location in the cache
filepath <- bfcnew(bfc, "R workspace")

save(list = ls(), file=filepath)

# now the R workspace is being tracked in the cache
```

# Summary

It is our hope that this package allows for easier management of local and
remote resources.

[BiocFileCache]: https://bioconductor.org/packages/BiocFileCache
[dplyr]: https://cran.r-project.org/package=dplyr
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub