---
title: "ivygapSE -- Bioconductor container for Ivy-GAP expression and metadata"
author: "Vincent J. Carey, stvjc at channing.harvard.edu"
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{ivygapSE -- SummarizedExperiment for Ivy-GAP}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    highlight: pygments
    number_sections: yes
    theme: united
    toc: yes
---

# Introduction

CSV files from the Ivy-GAP project have been assembled into a SummarizedExperiment instance.
```{r loadup, echo=FALSE}
suppressPackageStartupMessages({
library(SummarizedExperiment)
library(ivygapSE)
library(DT)
library(grid)
library(png)
library(ggplot2)
library(limma)
library(randomForest)
})
```
```{r chk}
library(ivygapSE)
data(ivySE)
ivySE
```
There are several types of metadata collected with the object, including the
README.txt (use `cat(metadata(ivySE)$README, sep="\n")` to see this in R),
the URL where data were retrieved, 
a character vector (builder) with the R code for creating (much of) the SummarizedExperiment, 
and two tables of tumor-specific and
block-specific information.

# Background on the ivyGlimpse app

The ivyGlimpse app is a rapid prototype of a browser-based
interface to salient features of the data.  The
most current code is maintained in the Bioconductor
ivygapSE package, but a public version of the app
may be visited at [shinyapps.io](https://vjcitn.shinyapps.io/ivyglimpse/).

The ivygapSE
package will evolve, based in part on associations
observed through the use of this app.  Briefly, the main
visualization of the app is a scatterplot of user-selected
tumor image features.  All contributions, based
on tumor sub-blocks (that have varying multiplicities per
tumor block and donor) are assembled
together without regard for source; interactive aspects of
the display allow the user to see which donor contributes
each point.  

Strata can be formed interactively by brushing
over the scatterplot; after the brushing event, the survival
times of donors contributing selected points are compared
to donors all of whose contributions lie outside the selection.
Expression data are also stratified in this way and gene-specific
boxplot sets (for user-specified gene sets) are produced for each stratum.

# Summary information on the underlying data

The number of RNA-seq samples is `r ncol(ivySE)`.  The FPKM matrix has dimensions
```{r lkd}
dim(ivySE)
```

There are 42 different tumor donors.
```{r lkse}
length(unique(metadata(ivySE)$tumorDetails$donor_id))
```
However, only 37 donors contributed tumor RNA that was sequenced:
```{r lkcon}
sum(metadata(ivySE)$tumorDetails$tumor_name %in% ivySE$tumor_name)
```
```{r getsbd,echo=FALSE}
subd = metadata(ivySE)$subBlockDetails
dsub = dim(subd)
```

Features of images from sub-blocks were quantified according to the following terminology for
anatomical characteristics.  Not all images provided information
on all attributes.

```{r lkonto,image=TRUE,echo=FALSE}
nomenclat()
```

# Additional details

We have used information in the
[IvyGAP technical white paper](http://help.brain-map.org/display/glioblastoma/Documentation?preview=/8028197/8454231/IvyOverview.pdf) to
spell out additional background on the data underlying the app and
SummarizedExperiment.

## Basic experimental design layout

There are six substudies contributing data in a partly
sequential design.

```{r lkk,echo=FALSE,fig=TRUE}
designOverview()
```

## Tumor-level details

The following table has one record per tumor (N=`r nrow(tumorDetails(ivySE))`).

```{r lkdttum,echo=FALSE}
datatable(tumorDetails(ivySE), options=list(lengthMenu=c(3,5,10,50,100)))
```

## Sub-block-level details

The following table has one record per sub-block (N=`r nrow(subBlockDetails(ivySE))`).
```{r lkdts,echo=FALSE}
datatable(subBlockDetails(ivySE), options=list(lengthMenu=c(3,5,10,50,    100)))
```

## Details on RNA-seq samples

The complete annotation on RNA-seq samples is provided in `colData(ivySE)`.  The
table follows here:

```{r lkcdd,echo=FALSE}
datatable(as.data.frame(colData(ivySE)), options=list(lengthMenu=c(3,5,10,50,100)))
```

### Key RNA-seq subsets

#### Subsets of design origin

The sub-blocks arose from a number of measurement objectives.

```{r lksbbbb}
sb = subBlockDetails(ivySE)
table(sb$study_name)
```

#### Subsets based on structure

We use the `structure_acronym` variable to assess the composition of
sources in the RNA-seq collection.

```{r lksa,fig=TRUE}
struc = as.character(colData(ivySE)$structure_acronym)
spls = strsplit(struc, "-")
basis = vapply(spls, function(x) x[1], character(1))
spec = vapply(spls, function(x) x[2], character(1))
table(basis, exclude=NULL)
barplot(table(basis))
```

Each of the major structural types contributes multiple samples from
specific objectives.  

```{r lktab}
lapply(split(spec,basis), function(x)sort(table(x),decreasing=TRUE))
```

### A simple differential expression study

We have used `r Biocpkg("limma")` to test for differential
expression among samples identified as `reference histology` in
classes `CT`, `CT-mvp`, `CT-pan`, `IT`, and `LE`.  The
resulting mean expression estimates (FPKM scale) 
and moderated test statistics are obtained
as follows:
```{r lklim, cache=TRUE}
library(limma)
ebout = getRefLimma()
```
The ten genes that are most significantly differentially
expressed between conditions CT and CT-mvp are found as follows:
```{r lknnn}
odig = options()$digits
options(digits=3)
limma::topTable(ebout, 2)
options(digits=odig) # revert
```

### Differential expression by molecular subtype

We can bind the molecular subtype information from the tumor
details to the expression sample annotation as follows:
```{r bindmol}
moltype = tumorDetails(ivySE)$molecular_subtype
names(moltype) = tumorDetails(ivySE)$tumor_name
moltype[nchar(moltype)==0] = "missing"
ivySE$moltype = factor(moltype[ivySE$tumor_name])
```

We will confine attention to samples annotated as "reference histology"
and compute the duplicate correlation for modeling the
effect of molecular subtype in the available samples.

```{r setdup, cache=TRUE}
library(limma)
refex = ivySE[, grep("reference", ivySE$structure_acronym)]
refmat = assay(refex)
tydes = model.matrix(~moltype, data=as.data.frame(colData(refex)))
ok = which(apply(tydes,2,sum)>0)  # some subtypes don't have ref histo samples
tydes = tydes[,ok]
block = factor(refex$tumor_id)
dd = duplicateCorrelation(refmat, tydes, block=block)
f2 = lmFit(refmat, tydes, correlation=dd$consensus)
ef2 = eBayes(f2)
colnames(tydes)
topTable(ef2,2)
```

### Classification of structural character

We assess the capacity of the expression measures to
discriminate the structural type (CT, CT-mvp, CT-pan, LE, IT)
using the random forests algorithm.  Features used have
interquartile range (IQR) over all relevant samples exceeding the
median IQR over all genes.

```{r lkrf,fig=TRUE}
refex = ivySE[, grep("reference", ivySE$structure_acronym)]
refex$struc = factor(refex$structure_acronym)
iqrs = rowIQRs(assay(refex))
inds = which(iqrs>quantile(iqrs,.5))
set.seed(1234)
rf1 = randomForest(x=t(assay(refex[inds,])), 
        y=refex$struc, mtry=30, importance=TRUE) 
rf1
varImpPlot(rf1)
```

# Next steps

Patel et al. Science 2014 (344(6190): 1396–1401) present single cell RNA-seq
for 430 cells from 5 tumors of different molecular subtypes.  It would
be interesting to use signature of structural origin to see whether
intra-tumor variation can be resolved into components coherent with
the five-element typology.  It would also be of interest to assess whether
structural type signatures are associated with any signatures of drug
sensitivity in relevant cell lines.