---
title: "h5client -- Bioconductor and remote HDF5"
author: "Samuela Pollack, Shweta Gopaulakrishnan, Vincent J. Carey"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{h5client -- notes on Bioconductor and remote HDF5}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

```{r setup,echo=FALSE,results="hide"}
suppressPackageStartupMessages({
suppressMessages({
library(rhdf5client)
library(BiocStyle)
library(DT)
library(DelayedArray)
})
})
```

# Overall introduction

This document summarizes work to March 2018 in demonstrating
the concept of remote HDF5.  There are two main components
to this document

- A section on Bioconductor-oriented object designs and methods for
using HDF5 data in server or object store, culminating in
`DelayedArray` interfaces to remote _HDF5 datasets_ that
are 2-d arrays of numbers

- A section on a more general R interface to the h5py/h5pyd
APIs for working with remote or local HDF5

N.B.  All python modules that are imported in this document
are imported with `convert=FALSE`, so that there are no
unintended translations of python data into R data.  You
will see `py_to_r` used below to accomplish such transitions
when desired.

# Introduction to the Bioconductor/R-centric interface work

The `r Biocpkg("rhdf5client")` package is a basis for
using [HDF Server](https://support.hdfgroup.org/projects/hdfserver/)
and [HDF Object store](https://www.hdfgroup.org/solutions/hdf-cloud/)
with R/Bioconductor.

## HDF Server interface

As of March 2018, we can use HDF Server with R in several ways.
With support from an NCI grant, we maintain a server in AWS EC2
that employs the RESTful API defined for the HDF Server.
```{r lkserv}
library(rhdf5client)
bigec2 = H5S_source(URL_h5serv())
class(bigec2)
bigec2
```

### The internal structure of an HDF Server

The server defines a hierarchical structure for all server
contents.  There are groups, linksets, and datasets.
```{r lkhi}
groups(bigec2)
links(bigec2,1)
datatable(data.frame(targets(links(bigec2,1))))
```

### Presenting a specific dataset to the R user

We use the double-bracket operator to derive
a reference to an HDF5 dataset from an `H5S_source`
instance.  We installed an image of the 10x genomics
1.3 million neuron dataset, that we can refer to
as:

```{r lkdb}
tenx_remote = bigec2[["tenx_full"]]
tenx_remote
```

This is sufficient to do arithmetic using familiar
R programming steps.  Note that the data
image here has 'neurons' as 'rows'.

```{r getcs}
apply(tenx_remote[1:4,1:27998],1,sum)
```

## Using the DelayedArray infrastructure

(At the moment the DelayedArray code is in restfulSE.)
The `H5S_Array` constructor takes care of source and
dataset navigation given the URL of the server and
the name of the 'host', in HDF server parlance.
```{r lkdela}
del10x = H5S_Array(URL_h5serv(), "tenx_full")
del10x
```

Here we have defined the R image of the data to be the
transpose of the image in HDF5.  So neurons are columns.
```{r lkdela2}
apply(del10x[,1:4],2,sum)
```

## Interface to HSDS (HDF Object Store)

The interface here is not mature.  The URL used
here is a server maintained by John Readey of the HDF Group.
The string `/home/stvjc` used below reflects a specific
setup of data on the server, it is not a "user folder" on any system.
```{r doos}
con = H5S_source(URL_hsds())   
con = setPath(con, "/home/reshg/tenx_full.h5") # this is as defined in store
ds2 = H5S_dataset2(con)
ds2
```

Again we have DelayedArray capabilities.
```{r lkdel3}
library(DelayedArray)
del10x_hsds = DelayedArray(new("H5S_ArraySeed", filepath = "", domain = "", 
        host = "", H5S_dataset = ds2))
del10x_hsds
apply(del10x_hsds[,1:4], 2, sum)
```

# Interfacing R to remote or local HDF5 via h5py/h5pyd

The `r CRANpkg("reticulate")` package 
makes it easy to convey python infrastructure
directly to R users.  However, we want to shape
aspects of the interaction to simplify
statistical computing.  We'll start by considering
how to use local HDF5 with the h5py python
modules, and then transition to remote HDF5.

Some of the basic strategies are adumbrated in
the `r Biocpkg("BiocSklearn")`, a proof of concept
of use of [scikit](http://scikit-learn.org/stable/#) modules in R.

A note on documentation.  For many python concepts
imported into an R session via `reticulate::import`,
`py_help` may be used to obtain documentation as
recorded in python docstrings.  Thus after the import
defining `np` below, `py_help(np)` will return
a paged high-level document on numpy to the session.

## Some basic tools for accessing local HDF5

We'll start with imports of key R and python packages.
```{r initcomp}
library(reticulate)
np = import("numpy", convert=FALSE)
h5py = import("h5py", convert=FALSE)
```

The `_hl` modules are fundamental infrastructure.

```{r dorh}
Rh5py = h5py[["_hl"]]
names(Rh5py)
```

### Handling numerical data via numpy

The following codes demonstrate ways of interfacing
to HDF5 via python.  `h5file` simply returns a python
reference to a `File` instance.  

`h5dsref` builds python commands to
facilitate manipulation of an HDF5 _dataset_ in R via numpy.

```{r domatr}
h5file = function( file )
  Rh5py$files$File( file )

fn = system.file("hdf5/numiris.h5", package="rhdf5client")
m1 = h5file(fn)
m1
class(m1)
```

The `File` instance can be regarded as a python dictionary.
We can learn the names of the datasets in the file:
```{r lkkkk}
m1$keys()
```

The `h5dsref` function was devised to give convenient
access to a dataset representing a matrix.
```{r domr}
h5dsref = function(filename, dsname="numiris") {
  py_run_string("import h5py", convert=FALSE)
  py_run_string(paste0("f = h5py.File('", filename, "')"))
  mref = py_run_string(paste0("g = f['", dsname, "']"))
  mref$g
}
```

We'll focus on the `h5dsref` approach for now.
We can get slices of the target array using
numpy's `take`.
```{r mkref}
numir = h5dsref(fn)
ta = np$take  # can't use `$` on the fly
numirsli = ta(numir, 0:2, 1L)
class(numirsli)
numirsli
```

So `numirsli` is a submatrix of the iris data
in `r fn`, with class `numpy.ndarray`.  We
can learn about available methods using `names`,
and try some out.
```{r reflec}
names(numirsli)
numirsli$ndim
numirsli$shape
numirsli$T$shape
```

Furthermore, we can create an R matrix with the
HDF5 numerical content as sliced via `take` using
`py_to_r` from reticulate:

```{r tx}
dim(py_to_r(numirsli)) # all in R
```

Thus, given an HDF5 dataset that can
be regarded as a numpy array, we can interrogate its
attributes and retrieve slices from R using `h5dsref`.

### Creating HDF5 datasets from R

```{r lkcr}
if (.Platform$OS.type != "windows")  {
tf = tempfile()
nf = h5py$File(tf, "w")
irmat = data.matrix(iris[,1:4])
nf$create_dataset('irisH5', data=r_to_py(irmat))
chk = h5dsref(tf, "irisH5")
ta(chk, 0:4, 0L)
nf$file$close() # no more reading, but
try(ta(chk, 0:4, 0L)) # is the close operation working?
}
```

Details on the `File` interface
are provided in [h5py docs](http://docs.h5py.org/en/latest/high/file.html#).

### Interim conclusions

The `Rh5py` interface defined here would appear to be an
adequate approach to interfacing between R and HDF5, but
we already have plenty of mileage in `r Biocpkg("rhdf5")`.
Our real interest is in providing a comprehensive interface
to the HDF Server and Object Store APIs, and we turn to 
this now.

## Working with HDF Object Store via h5pyd

The `File` API for the object store is a little different from the one
for local HDF5.  For the following to succeed you would
need credentials for the HDF Object Store instance noted in the endpoint
used below.
```{r getpd}
if (.Platform$OS.type != "windows")  {
Rh5pyd = import("h5pyd", as="h5py", convert=FALSE)
tenx_remote = Rh5pyd$File("/home/reshg/tenx_full.h5", "r",
     endpoint=URL_hsds())
tenx_remote
tenx_remote$keys() # only python
py_to_r(tenx_remote$keys()) # the strings of interest
}
```

The following function obtains a slice from a dataset
in the object store.  The index expression must be
appropriate for the dataset and follows the
convention for h5pyd: `start:end:stride` for each dimension,
with `[:end]` and `[:stride]` optional.

```{r lkgs}
if (.Platform$OS.type != "windows")  {
getslice = function(endpoint, mode, domain, dsname, indexstring="[0,0]") {
   py_run_string("import h5pyd", convert=FALSE)
   py_run_string(paste0("f = h5pyd.File(domain=", sQuote(domain), 
         ", mode = ", sQuote(mode), ", endpoint=", sQuote(endpoint), ")"))
   py_run_string(paste0("g = f['", dsname, "']", indexstring))$g
}
mr = getslice(URL_hsds(), "r", 
   "/home/reshg/tenx_full.h5", "newassay001", "[0:4, 0:27998]")
apply(mr,1,sum)
}
```

## Working with HDF Server

The `getslice` function will work with references to an HDF Server.
However, in the context of the vignette compilation, I see
an authentication error triggered.  It is not clear why; if
the two getslice commands are isolated and run in a single
R session, no problem arises.

```{r dosl2, eval=FALSE}
if (.Platform$OS.type != "windows")  {
hh = import("h5pyd", as="h5py") # avoid auth problem?
mr = getslice(URL_h5serv(), "r",
   "tenx_full.h5s.channingremotedata.org", "newassay001", "[0:4, 0:27998]")
apply(mr,1,sum)
}
```

## Towards a comprehensive interface

We'll focus on the object store.  After importing
`h5pyd` using reticulate, we can learn about available
infrastructure.

```{r lkrr}
if (.Platform$OS.type != "windows")  {
names(Rh5pyd)
}
```
With `py_help(Rh5pyd$Dataset)`, we obtain extensive
documentation in our R session.
```
Help on class Dataset in module h5pyd._hl.dataset:

class Dataset(h5pyd._hl.base.HLObject)
 |  Represents an HDF5 dataset
 |  
 |  Method resolution order:
 |      Dataset
 |      h5pyd._hl.base.HLObject
 |      h5pyd._hl.base.CommonStateObject
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __array__(self, dtype=None)
 |      Create a Numpy array containing the whole dataset.  DON'T THINK
 |      THIS MEANS DATASETS ARE INTERCHANGABLE WITH ARRAYS.  For one thing,
 |      you have to read the whole dataset everytime this method is called.
 |  
 |  __getitem__(self, args)
 |      Read a slice from the HDF5 dataset.
 |      
 |      Takes slices and recarray-style field names (more than one is
 |      allowed!) in any order.  Obeys basic NumPy rules, including
 |      broadcasting.
...
```

In what follows, we show the code that creates a new dataset
in the object store.  With `py_help(Rh5pyd$File)`, we find:
```
 |  create_dataset(self, name, shape=None, dtype=None, data=None, **kwds)
 |      Create a new HDF5 dataset
 |      
 |      name
 |          Name of the dataset (absolute or relative).  Provide None to make
 |          an anonymous dataset.
 |      shape
 |          Dataset shape.  Use "()" for scalar datasets.  Required if "data"
 |          isn't provided.
 |      dtype
 |          Numpy dtype or string.  If omitted, dtype('f') will be used.
 |          Required if "data" isn't provided; otherwise, overrides data
 |          array's dtype.
 |      data
 |          Provide data to initialize the dataset.  If used, you can omit
 |          shape and dtype arguments.
 |      
 |      Keyword-only arguments:
 |      
 |      chunks
 |          (Tuple) Chunk shape, or True to enable auto-chunking.
 |      maxshape
 |          (Tuple) Make the dataset resizable up to this shape.  Use None for
 |          axes you want to be unlimited.
```
and we make use of the `create_dataset` method.  (Following code
is unevaluated, just for illustration, as it was tested and created the
persistent content.)
```{r lkcr2,eval=FALSE}
if (.Platform$OS.type != "windows")  {
nf = Rh5pyd$File(endpoint=URL_hsds(), mode="w", 
   domain="/home/stvjc/iris_demo.h5")
nf$create_dataset('irisH5', data=r_to_py(irmat))
}
```
We can read back with:
```{r lkrrrr}
if (.Platform$OS.type != "windows")  {
getslice(URL_hsds(), mode="r",
   domain="/home/stvjc/iris_demo.h5", "irisH5", "[0:3, 0:3]")
}
```

We can run `create_group` as well.  See
```{r lknnnaaa}
if (.Platform$OS.type != "windows")  {
sort(names(Rh5pyd$File))
}
```