---
title: "ssrch: selectize-based search engine for corpora of modest size"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ssrch: small search engine}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

```{r setup,echo=FALSE,results="hide"}
suppressPackageStartupMessages({
library(ssrch)
library(DT)
})
```

# Introduction

Metadata are crucial to all forms of statistical analysis.
Metadata are used to define formal steps of
upstream data preprocessing, to annotate outcomes and covariates,
and to interpret inferential results.

The ssrch package was developed to provide a lightweight
approach to searching a metadata corpus with R.  There are three
basic steps:

- tokenize corpus elements syntactically,
- filter tokens into hashed environments,
- use selectize.js to achieve a predictive, autocompleted search interface.
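
The first two steps can be sketched in base R. This is a minimal illustration under stated assumptions, not the code used inside `ssrch`: the toy corpus, the study identifiers, and the helper `tokenize` are all invented for the example, and the tokenization rule (split on non-alphanumeric characters, drop one-character tokens) is just one plausible choice.

```r
# Toy corpus: two tiny metadata tables keyed by made-up study identifiers.
corpus <- list(
  STUDY_A = data.frame(tissue = c("lung tumor", "adjacent normal"),
                       stage = c("II", NA), stringsAsFactors = FALSE),
  STUDY_B = data.frame(tissue = c("breast tumor", "breast tumor"),
                       sex = c("female", "female"), stringsAsFactors = FALSE)
)

# Step 1 (hypothetical helper): split field names and cell values into
# lowercase word tokens, dropping missing values and one-character tokens.
tokenize <- function(df) {
  txt <- c(names(df), unlist(lapply(df, as.character), use.names = FALSE))
  toks <- unlist(strsplit(tolower(txt), "[^a-z0-9]+"))
  unique(toks[!is.na(toks) & nchar(toks) > 1])
}

# Step 2: file each token in a hashed environment mapping token -> study ids.
kw2docs <- new.env(hash = TRUE)
for (id in names(corpus)) {
  for (tok in tokenize(corpus[[id]])) {
    prev <- if (exists(tok, envir = kw2docs, inherits = FALSE)) {
      get(tok, envir = kw2docs)
    } else {
      character(0)
    }
    assign(tok, union(prev, id), envir = kw2docs)
  }
}

# The sorted keys are what an autocompleting widget would offer as choices.
sort(ls(kw2docs))
get("tumor", envir = kw2docs)
```

Lookup in a hashed environment does not slow down appreciably as the number of tokens grows, which is why environments are a natural container for the token index.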

We illustrate the system using a collection of tables of metadata
about cancer transcriptomics.

# Illustration

To provide a sense of what is at stake, we work with 68 CSV files
derived from the NCBI Sequence Read Archive metadata.
The files consist of the contents of the `sample.attribute` subtable of study metadata
retrieved using the SRAdbV2 package (github.com/seandavi/SRAdbV2).
The CSV files are distributed with the `ssrch` package
in the zip file `canc68.zip`,
located in `inst/cancer_corpus`.

The CSV files have been indexed in an S4 object of class
`DocSet`.

```{r lkds}
data(docset_cancer68)
docset_cancer68
```

A single document can be retrieved with the `retrieve_doc`
function, given the study accession number.
```{r lklkaa}
retrieve_doc("ERP010142", docset_cancer68)[1:3,1:5]
```

## Diversity of field names

There is partial standardization of field names in this corpus,
but there is considerable variation among studies in the number
and types of fields used.
```{r lkrec}
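# document identifiers are the keys of the documents-to-keywords environment
docids = ls(envir=docs2kw(docset_cancer68))
allc = lapply(docids, 
   function(x) try(retrieve_doc(x, docset_cancer68)))
# distribution of the number of fields per study
summary(sapply(allc,ncol))
# field names used by two of the studies
lapply(allc[c(11,55)], names)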
```

The `ssrch` package provides
tools for identifying studies in which a given
field is used, and in which given values are recorded.
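
The question of which studies use a given field can be sketched in base R on a list of tables. This is only an illustration: the tables, study identifiers, and the helper `has_field` are invented here and are not part of `ssrch`.

```r
# Toy stand-ins for parsed metadata tables (identifiers and fields are invented).
tables <- list(
  STUDY_A = data.frame(tissue = "lung", stage = "II", stringsAsFactors = FALSE),
  STUDY_B = data.frame(tissue = "breast", sex = "female", stringsAsFactors = FALSE),
  STUDY_C = data.frame(cell_line = "A549", stringsAsFactors = FALSE)
)

# Tabulate how many studies use each field name.
field_use <- sort(table(unlist(lapply(tables, names))), decreasing = TRUE)
field_use

# Hypothetical helper: identifiers of the studies whose table has `field`.
has_field <- function(field, tabs) {
  names(tabs)[vapply(tabs, function(d) field %in% names(d), logical(1))]
}
has_field("tissue", tables)
```

Scanning every table like this is adequate for a few dozen documents; an index from tokens to documents avoids the repeated scan.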

A motivating example is determining what studies involve
measurements on normal tissue adjacent to tumor.  In
our collection, this can be assessed via:
```{r lklk}
library(ssrch)
searchDocs("Adjacent", docset_cancer68, ignore.case=TRUE)
```


## Managing tokenized metadata

Before getting into the details of search engine
construction, we review some high-level concepts and methods.

First, we have the container for managing our search resource.
```{r lkd1}
getClass(class(docset_cancer68))
```
`docset_cancer68` is the result of using the `ssrch` function `parseDoc`
in conjunction with the CSV files zipped in `ssrch/inst/cancer_corpus`. 
It is an S4 class instance that manages environments mapping
from keywords to documents, documents to records, and documents to keywords.

Furthermore, `DocSet` instances can have a component that
retrieves parsed documents from the corpus.  
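
To make the idea of an S4 container around environments concrete, here is a small sketch. The class `ToyDocSet`, its slot layout, and the toy tables are assumptions made for illustration; this is not the `DocSet` class definition, which is displayed by the `getClass` call above.

```r
library(methods)

# Hypothetical container, loosely modeled on the description in the text.
setClass("ToyDocSet", representation(
  kw2docs = "environment",     # keyword  -> document ids
  docs2kw = "environment",     # document -> keywords
  doc_retriever = "function"   # document id -> parsed table
))

# Invented tables standing in for parsed CSV files.
toy_tables <- list(
  STUDY_A = data.frame(tissue = c("tumor", "normal"), stringsAsFactors = FALSE),
  STUDY_B = data.frame(tissue = "tumor", stringsAsFactors = FALSE)
)

kw2docs <- new.env(hash = TRUE)
docs2kw <- new.env(hash = TRUE)
for (id in names(toy_tables)) {
  kws <- unique(toy_tables[[id]]$tissue)
  assign(id, kws, envir = docs2kw)
  for (k in kws) {
    # mget() does not search enclosing environments by default
    prev <- mget(k, envir = kw2docs, ifnotfound = list(character(0)))[[1]]
    assign(k, c(prev, id), envir = kw2docs)
  }
}

toy <- new("ToyDocSet", kw2docs = kw2docs, docs2kw = docs2kw,
           doc_retriever = function(id) toy_tables[[id]])

get("tumor", envir = toy@kw2docs)     # documents mentioning "tumor"
get("STUDY_A", envir = toy@docs2kw)   # keywords found in STUDY_A
toy@doc_retriever("STUDY_B")          # the parsed table itself
```

Storing the retriever as a function lets the container stay small: tables are parsed on request instead of being held in memory.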

## Querying the corpus

We'll search for the phrase `Non Small Cell`, allowing any single character
(such as a space or a hyphen) between `Non` and `Small`,
ignoring the character case of possible hits, and also matching
the abbreviation NSCLC for non-small cell lung cancer at the end of a string.
```{r try}
NSChits = searchDocs("Non.Small Cell|NSCLC$", docset_cancer68, ignore.case=TRUE)
NSChits
```
Three studies, three different patterns matched by the query string.
We can use `retrieve_doc` to look at the contents
of the metadata tables for these studies.

```{r lkdocs}
# retrieve each matching study once, even if several tokens hit it
NSCdocs = lapply(unique(NSChits$docs), 
   function(x) retrieve_doc(x, docset_cancer68))
names(NSCdocs) = unique(NSChits$docs)
datatable(NSCdocs[[1]], options=list(lengthMenu=c(3,5,10)))
datatable(NSCdocs[[2]], options=list(lengthMenu=c(3,5,10)))
datatable(NSCdocs[[3]], options=list(lengthMenu=c(3,5,10)))
```
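
The query pattern used with `searchDocs` in this section can be checked against a few strings with base R's `grepl`. The candidate strings below are made up, and the sketch assumes the pattern is interpreted as an ordinary R regular expression.

```r
pat <- "Non.Small Cell|NSCLC$"

# Invented test strings, not taken from the corpus.
candidates <- c(
  "Non-Small Cell Lung Cancer",   # hyphen after "Non"
  "non small cell carcinoma",     # space after "Non", lower case
  "NonSmall Cell",                # nothing between "Non" and "Small"
  "primary NSCLC",                # abbreviation at the end of the string
  "NSCLC cell line"               # abbreviation not at the end
)

hits <- grepl(pat, candidates, ignore.case = TRUE)
setNames(hits, candidates)
```

Because `.` stands for exactly one character, `NonSmall Cell` is not matched; writing `Non.?Small Cell` would make the separator optional. The `$` anchor restricts the abbreviation to strings that end with `NSCLC`.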

# A prototypical app

The `ctxsearch` function starts a shiny app that
provides access to full attribute data via a selectize
input populated with the (filtered) tokens of the corpus.

![ctxsearch illustration](ssrchapp.png)
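
The general pattern of a selectize input backed by a token index can be sketched as follows. This is not the implementation of `ctxsearch`: the function `make_token_app`, the input and output ids, and the toy index are invented for illustration, and the sketch assumes the `shiny` package is installed.

```r
# Hypothetical builder: wrap a keyword -> documents environment in a shiny app.
make_token_app <- function(kw2docs) {
  tokens <- sort(ls(kw2docs))
  ui <- shiny::fluidPage(
    # choices are filled in from the server side, see below
    shiny::selectizeInput("tok", "Search term", choices = NULL, multiple = TRUE),
    shiny::verbatimTextOutput("docs")
  )
  server <- function(input, output, session) {
    # Server-side selectize keeps the browser responsive with many tokens.
    shiny::updateSelectizeInput(session, "tok", choices = tokens, server = TRUE)
    output$docs <- shiny::renderPrint({
      shiny::req(input$tok)
      mget(input$tok, envir = kw2docs)   # documents for each selected token
    })
  }
  shiny::shinyApp(ui, server)
}

# Toy index with made-up study identifiers.
toy_index <- new.env(hash = TRUE)
assign("adjacent", "STUDY_A", envir = toy_index)
assign("tumor", c("STUDY_A", "STUDY_B"), envir = toy_index)

# Launch only in an interactive session with shiny available.
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  shiny::runApp(make_token_app(toy_index))
}
```

Typing in the input narrows the offered tokens as selectize.js autocompletes; selecting one prints the identifiers of the documents in which it occurs.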

# Further work

The tab set should be prunable.

This will be the basis of an interactive interface to the
human transcriptome compendium.