---
title: "_**flowGraph**_: Identifying differential cell populations in flow cytometry data accounting for marker frequency"
shorttitle: "flowGraph"
author: Alice Yue
email: aya43@sfu.ca
package: flowGraph
date: September 2020
output: 
    BiocStyle::html_document
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{flowGraph}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```


[![DOI](https://zenodo.org/badge/DOI/10.1101/837765.svg)](https://doi.org/10.1101/837765)

[Summary blog post](https://aya49.github.io/2020/09/30/flowGraph/)

flowGraph is an R package used to identify candidate biomarkers for disease diagnosis in flow cytometry data. It does so by identifying driver cell populations whose abundance changes significantly and independently given a disease.

flowGraph takes cell counts as input and outputs SpecEnr values for each cell population within a flow cytometry sample, based on their expected proportion. SpecEnr accounts for dependencies between cell populations such that we can use it to flag only cell populations whose abundance change is incurred wholly or in part because of its association with a sample class (e.g. healthy vs sick).

# Installing the package

**flowGraph** can be installed via Bioconductor.

You can also install the development version directly from Github using BiocManager:

```{r, eval=FALSE, label=library1, warning=FALSE, message=FALSE, echo=TRUE, fig=FALSE, results='hide'}
if (!require("BiocManager")) install.packages('BiocManager') 
BiocManager::install("aya49/flowGraph")
```

# Citation

The theory, proof, and algorithm behind the SpecEnr statistic used in the 
flowGraph package can be found in the following [paper](https://www.biorxiv.org/content/10.1101/837765v3.abstract). 
Please consider citing if you found it helpful.

bibtex:
```
@article{yue2019identifying,
  title={Identifying differential cell populations in flow cytometry data accounting for marker frequency},
  author={Yue, Alice and Chauve, Cedric and Libbrecht, Maxwell and Brinkman, Ryan},
  journal={BioRxiv},
  pages={837765},
  year={2019},
  publisher={Cold Spring Harbor Laboratory}
}
```

The scripts and data from the paper can be downloaded ons [Zenodo](https://zenodo.org/record/3991166).

# Introduction

The main goal of _**flowGraph**_ is to help users interpret which cell populations 
are differential between samples of different etiologies. To understand how 
flowGraph defines cell populations, we introduce the cell hierarchy model below.
This model will serve as the basis for the analysis done in the 
_**flowGraph**_ package.

```{r, label=library2, warning=FALSE, message=FALSE, echo=TRUE, fig=FALSE, message=FALSE}
library(flowGraph)
```

## Cell population naming convention

A **cell hierarchy** is a directed acyclic graph that maps out all possible 
cell populations. In this graph, each cell population is a node while relations 
between cell populations are represented by edges. Each node is labelled by 
whether or not its corresponding cell population contains cells that does/not 
contain a subset of markers. For example, if the markers used in an experiment 
is $A$, $B$, $C$, $D$, then $A^-$ and $A^+$ would represent the cell population 
with cells that does/not contain marker $A$ (e.g. Figure \ref{fig:ch}).

If you are using threshold gates, $A^+$/$A^-$ is the count of cells with a 
fluorescent intensity (FI) value higher/lower than the threshold. If you have 
multiple thresholds, for example 3 thresholds, for $A$, then you would have
$A^-$, $A^+$, $A^++$, $A^+++$ with thresholds in between.

If you are using polygon gates, then $A$ would be the name of your polygon gate
on any two dimension and $A^-$ would be the cells inside the gate and $A^+$ 
would be the cells outside the gate.

We also accept cell population names where marker conditions are separated by an
underscore e.g. $A^+\_B^+$.


```{r, message=FALSE}
no_cores <- 1
data(fg_data_pos2)

meta_cell <- get_phen_meta(colnames(fg_data_pos2$count))
suppressWarnings({ pccell <- flowGraph:::get_phen_list(meta_cell, no_cores) })
gr <- set_layout_graph(list(e=pccell$edf, v=meta_cell)) # layout cell hierarchy

gr <- ggdf(gr)
gr$v$colour <- ifelse(!grepl("[-]",gr$v$phenotype), 1,0)
                     # "nodes with only marker conditions", "other nodes")
gr$v$label <- gr$v$phenotype
gr$v$v_ind <- gr$v$label_ind <- TRUE
gr$e$e_ind <- !grepl("[-]",gr$e$from) & !grepl("[-]",gr$e$to)
```
```{r, fig.wide=TRUE}
knitr::opts_template$set(figure1=list(fig.height=9, fig.width=9))
plot_gr(gr, main="Example cell hierarchy with markers A, B, C, D")
```


Traditionally, cell populations are quantified by their proportion, or their 
cell count over the total cell count. A downside to this analysis is that if 
one cell population is differential, then all cell populations that contain 
cells that are also in that differential cell population would also be flagged 
as significantly differential. By incorporating information on relations 
between cell populations, _**flowGraph**_ uses the notion of 
expected proportions and **SpecEnr** or specific enrichment as a replacement 
for proportions to isolate only differential cell populations.


# Workflow: a simple example

This section will run through a simple example of how _**flowGraph**_ can be 
used to analyze a set of flow cytometry samples.

Typically, one would input a sample x cell population matrix and 
a directory where one wants to save all of the flowGraph resuts and plots:

```{r, eval=FALSE}
no_cores <- 1 # number of cores to parallelize on.
data(fg_data_pos2)
fg <- flowGraph(
    fg_data_pos2$count, # sample x cell population matrix
    meta=fg_data_pos2$meta, # a data frame with each sample's meta data
    path="flowGraph_example", # a directory for flowGraph to output results to
    no_cores=no_cores) # number of cores to use, typically 1 for small data
```

The flowGraph output can be loaded into R from the specified directory as 
a flowGraph object `fg`. 

```{r, eval=FALSE}
fg <- fg_load("flowGraph_example")
```

This flowGraph object can be further analyzed and
modified by using methods described in the following sections.


## Data sets contained in the package

The package contains two data sets:
- `fg_data_fca` [@aghaeepour2013critical]: a real data set comparing healthy 
and AML (acute myeloid leukemia) positive subjects; it is known that there is 
an outlier sample and that cell population node $CD34^+$ increases in production 
in the AML positive subjects' samples.
- `fg_data_pos2`: a positive control data set where cell population node 
   $A^{+}B^{+}C^+$ is artificially increased by 50\%.
- `fg_data_pos30`: a positive control data set where cell population node 
   $A^{+...}B^{+...}C^+$ is artificially increased by 50\%. Note this data set 
   contains multiple thresholds for markers $A$ and $B$.


Both of these are lists containing elements:
- `count`: a sample x cell population numeric matrix containing cell count data.
- `meta`: a data frame containing meta data on the samples in `count`. 
   The sample names in the `id` corresponds with the row names in `count`.

```{r,label=Load_Data, fig=FALSE, message=FALSE}
# data(fg_data_fca)
data(fg_data_pos2)
# data(fg_data_pos30)

# ?fg_data_fca
# ?fg_data_pos2
# ?fg_data_pos30
```


## Initializing a flowGraph object

To contain information regarding cell population quantification and the 
cell hierarchy structure in one place, we use a **flowGraph** object to conduct 
analysis. To initialize a flowGraph object, the user can give as input, 
a numeric vector or matrix. 
For examples on how all of these options 
can be used, see `?flowGraph`. 


For our example, we will directly use a numeric matrix as provided by our 
`fg_data_pos2` data set. By default, `flowGraph` will calculate all of the 
proportion `prop`, and SpecEnr `specenr`.


```{r, message=FALSE}
# no_cores <- 1 # number of cores to parallelize on.
data(fg_data_pos2)
fg <- flowGraph(fg_data_pos2$count, meta=fg_data_pos2$meta, no_cores=no_cores)

```

By default, `calculate_summary` is set to `TRUE` so that a default set of
summary statistics will be calculated for the SpecEnr
node feature and `prop` edge feature. Note that if the user decides to do this,
the `class` in `summary_pars`  must be the column name in data frame `meta` 
with the class labels. This is set to `"class"` by default.
A class in this context is, for example, an experiment or control sample.
If the user does not wish for this to be calculated during construction
of the flowGraph object, the user can set `summary_pars` to `NULL` in the
`flowGraph` function. Note that`summary_pars` must be specified if the fast
version of flowGraph is used.

`meta` can be given/modified at a later time. 
Just make sure the meta data is a data frame
containing a `id` column where its values correspond to the row names in 
the flowGraph objects' feature matrices.

```{r}
meta <- fg_get_meta(fg)
head(meta)

mcount <- fg_get_feature(fg, "node", "count")
head(rownames(mcount))
```


### Input format

The input to flowGraph is a sample x cell population phenotype matrix containing
the cell count of each cell population for each sample.

The row names of the input matrix should be sample ID, otherwise, 
flowGraph will create sample IDs for you. 

The column names of the input matrix should be cell population phenotype names
that follow cell population naming conventions.
Markers/measurements must not contain underscores, dashes, or pluses 
(`_`, `+`, `-`). If underscores were
used to separate marker/measurement conditions e.g. ssc+_cd45-, the underscores
will be removed from the column names of the features matrices but will be
saved in `fg_get_meta(fg)$phenotype_`.

To calculate the SpecEnr of a cell population (e.g. A+B+C+), 
flowGraph requires that all of its parent (A+B+, B+C+, A+C+) 
and grandparent cell populations (A+, B+, C+) are available.


## Retrieving results from a flowGraph object

The default summary statistics is calculated using the Wilcoxan signed-rank test and adjusted
using the `byLayer` adjustment method. This adjustment method is a family-wise method that 
multiplies the p-value for each cell population by the number of nodes in 
its layer and the total number of layers in the cell hierarchy on which there 
exists a cell population node.

Below, we retrieve this summary statistic and list out the cell
populations with the most significant p-values as per below.

```{r}
# get feature descriptions
fg_get_summary_desc(fg)
# get a summary statistic
fg_sum <- fg_get_summary(fg, type="node", summary_meta=list(
    node_feature="SpecEnr", 
    test_name="t_diminish", 
    class="class", 
    labels=c("exp","control"))
)
# fg_sum <- fg_get_summary(fg, type="node", index=1) # same as above

# list most significant cell populations
p <- fg_sum$values # p values
head(sort(p),30)
```

To make changes to the flowGraph object, see functions that start with `fg_`.

Once we have made all the changes necessary, we can save the flowGraph object to
a folder. Inside the folder, all the feature values and summary statistics
are saved as csv files. Plots for each summary statistic can also optionally 
be saved to this folder.

The same folder directory is used to load the flowGraph object when needed again.

```{r, eval=FALSE}
fg_save(fg, "path/to/user/specified/folder/directory") # save flowGraph object
fg <- fg_load("path/to/user/specified/folder/directory") # load flowGraph object
```

See other flowGraph object initialization options in the Appendix 2.

The following sections will go over additional options for summary statistics,
and result interpretation.


# Accessing and modifying data in a flowGraph object


## Flow cytometry sample meta data

The flowGraph object initially contains meta data on the samples and 
cell population nodes (phenotypes). The most basic way of understanding what 
is inside a flowGraph object is by using `show`. This shows a description of 
the flowGraph object and returns a list of data frames containing information 
on the node and edge features and the summary statitics performed on them 
shortly in this vignette.

```{r}
show(fg)
```

One can obtain meta data on samples and cell populations as follows. 
Note that information on cell populations is given to the user in the form of a 
`graph` or a list contianing data frames `v` and `e`. The former represents the 
nodes or the cell populatins, and the latter represent edges or the relation 
between cell population --- note that edges always point from parent to 
child cell populations indicative of whether or not a cell population is a 
sub-population of another.

```{r}
# get sample meta data
head(fg_get_meta(fg))

# modify sample meta data
meta_new <- fg_get_meta(fg)
meta_new$id[1] <- "new_sample_id1"
fg <- fg_replace_meta(fg, meta_new)
```
```{r}
# get cell population meta data
gr <- fg_get_graph(fg)
head(gr$v)
head(gr$e)
```


## Feature values

The user can also extract or modify the features inside a flowGraph object. 
For adding new features, unless needed, we recommend users stick with the 
default feature generation methods that starts with `fg_feat_`.

```{r}
# get feature descriptions
fg_get_feature_desc(fg)
# get count node feature
mc <- fg_get_feature(fg, type="node", feature="count")
dim(mc)
```
```{r}
# add a new feature; 
# input matrix must contain the same row and column names as existing features;
# we recommend users stick with default feature generation methods 
# that start with fg_feat_
fg <- fg_add_feature(fg, type="node", feature="count_copy", m=mc)
fg_get_feature_desc(fg)
```
```{r}
# remove a feature; note, the count node feature cannot be removed
fg <- fg_rm_feature(fg, type="node", feature="count_copy")
fg_get_feature_desc(fg)
```


## Feature summary statistics

Once a flowGraph object is created, the user can calculate summary statistics 
for any of the features it contains.

We recommend the user use the `fg_summary` function. Its default 
summary statistic is the significance T-test along with a `byLayer` 
p-value adjustment method. The user can specify other summary statistics or 
adjustment methods by providing the name of the method or a function to 
parameters `test_custom` or `adjust_custom`.

```{r}
fg_get_summary_desc(fg)

# calculate summary statistic
fg <- fg_summary(fg, no_cores=no_cores, class="class", label1="control",
                 node_features="count", edge_features="NONE",
                 overwrite=FALSE, test_name="t", diminish=FALSE)
fg_get_summary_desc(fg)
```
```{r}
# get a summary statistic
fg_sum1 <- fg_get_summary(fg, type="node",  summary_meta=list(
    feature="count", test_name="t", 
    class="class", label1="control", label2="exp"))
names(fg_sum1)
```
```{r}
# remove a summary statistic
fg <- fg_rm_summary(fg, type="node", summary_meta=list(
    feature="count", test_name="t", 
    class="class", label1="control", label2="exp"))
fg_get_summary_desc(fg)
```
```{r}
# add a new feature summary; 
# input list must contain a 'values', 'id1', and 'id2' containing summary 
# statistic values and the sample id's compared;
# we recommend users stick with default feature generation method fg_summary
fg <- fg_add_summary(fg, type="node",  summary_meta=list(
    feature="SpecEnr", test_name="t_copy",
    class="class", label1="control", label2="exp"), p=fg_sum1)
fg_get_summary_desc(fg)
```


A summary static statistic, once obtained, is a list containing:
- `values`: a vector of p-values for each node or edge.
- `id1` and `id2`: a vector of sample id's that were compared.
- `test_fun` and `adjust_fun`: the functions used to test and adjust the 
   summary statistic.
- `m1` and `m2`: a vector that summarizes one of the sets of samples compared. 
   These are not contained inside a flowGraph object but can be calculated on 
   the spot when retreiving a summary using `fg_get_summary` by setting 
   parameter `summary_fun` to a matrix function (default: `colSums`) and not 
   `NULL`. This usually does not need to be adjusted.


# Plotting and visualizing results

All of the calculated summaries can be visualized in the form of a 
cell hierarchy plot using function `fg_plot`. The plot can also be saved as a 
PNG file if the path to this PNG file is provided as a string for its 
`path` parameter. Here, we do not save the plot, but we plot the returned 
`graph` list `gr` given by `fg_plot` that contains all the plotting columns 
using the `plot_gr` function.

```{r}
# plotting functions default to plotting node feature SpecEnr 
# labelled with mean expected/proportion (maximum 30 labels are shown for clarity)
# and only significant nodes based on the wilcox_byLayer_diminish summary statistic
# are shown.
# gr <- fg_plot(fg, p_thres=.01, show_bgedges=TRUE, # show background edges
#               node_feature="SpecEnr", edge_feature="prop",
#               test_name="t_diminish", label_max=30)
gr <- fg_plot(fg, index=1, p_thres=.01, show_bgedges=TRUE)
# plot_gr(gr)
```

While through `plot_gr`, `fg_plot` uses the `ggplot2` package to create static
plot, the user can 
also choose to plot `gr` as an interactive plot by setting the `interactive` 
parameter to `TRUE` using the `ggiraph` package.

```{r}
# interactive version in beta
plot_gr(gr, interactive=TRUE)
```


# Appendix

## Appendix 1: OTher useful plots

Summary statistics can also be analyzed using other plots.

### QQ plot
For example, the user can plot a static/interactive 
`ggiraph` QQ plot of a chosen summary statistic. This plots the p-values against
a uniform distribution.

```{r, message=FALSE}
data(fg_data_pos2)
fg1 <- flowGraph(fg_data_pos2$count, class=fg_data_pos2$meta$class, 
                 no_cores=no_cores)
```
```{r}
fg_get_summary_desc(fg)

fg_plot_qq(fg, type="node", index=1)
fg_plot_qq(fg, type="node", index=1, logged=TRUE)

# interactive version
fg_plot_qq(fg, type="node", index=1, interactive=TRUE)
```

### Boxplot
To understand how each p-value was obtained, the user can also plot the 
distribution of values as boxplots for a specific feature between features of different
class labels.

```{r}
fg_plot_box(fg, type="node", summary_meta=NULL, index=1, node_edge="A+")
```

### Logged p-value vs feature difference
Another useful plot is to compare the p-value and the difference between the
mean of a feature value between samples of different classes. This should
look like a volcano plot.

```{r}
fg_plot_pVSdiff(fg, type="node", summary_meta=NULL, index=1)

# interactive version
fg_plot_pVSdiff(fg, type="node", summary_meta=NULL, index=1, interactive=TRUE)
```

### Customizing the cell hierarchy plot

The user can also manually specify how the cell hierarchy plot should look. The columns 
needed for plotting in `plot_gr` can be attached onto the `graph` slot of the 
`fg` the `flowGraph` object using the `ggdf` function. For more information on 
these columns, see `?ggdf`.

```{r}
gr <- fg_get_graph(fg)
gr <- ggdf(gr)

gr$v$colour <- ifelse(!grepl("[-]",gr$v$phenotype), 1, 0)
                     # "nodes with only marker conditions", "other nodes")
gr$v$label <- gr$v$phenotype
gr$v$v_ind <- gr$v$label_ind <- TRUE
gr$e$e_ind <- !grepl("[-]",gr$e$from) & !grepl("[-]",gr$e$to)

plot_gr(gr, main="Example cell hierarchy with markers A, B, C, D")
```


## Appendix 2: flowGraphSubset, a fast version of the flowGraph constructor

If the user is only interested in one set of class labels for a set of samples,
they can choose to use `flowGraphSubset`, a faster version of the default constructor
`flowGraph`. It is fast because the edge list, proportion, expected proportion, 
and SpecEnr features are only calculated for cell populations who are in the
0'th and 1st layer, or have a significant parent population. The assumption here
is that cell populations who are significantly differentially abundant 
must also have at least one significantly differentially abundant parent 
population, which is true for almost all cases.

However, if the user wants to test different sets of sample class labels on the 
same set of samples, we recommend using the default `flowGraph` constructor
as it calculates SpecEnr for all cell populations. Since SpecEnr only has to be
calculated once, the user can apply multiple statistically significance tests
and ask questions about different class sets on the same SpecEnr values.

So in summary, ONLY USE THIS OVER flowGraph IF: 1) your data set has more than
10,000 cell populations and you want to speed up your calculation time AND
2) you only have one set of classes you want to test on the 
SAME SET OF SAMPLES (e.g. control vs experiment).

The parameters for `flowGraphSubset` is a bit different than those in `flowGraph`.
It is currently in beta, so we recommend reading the manual for it carefully.

```{r, evaluate=FALSE}
data(fg_data_pos2)
fg <- flowGraphSubset(fg_data_pos2$count, meta=fg_data_pos2$meta, no_cores=no_cores,
                 summary_pars=flowGraphSubset_summary_pars(),
                 summary_adjust=flowGraphSubset_summary_adjust())
```


# System information

The following is an output of `sessionInfo()` on the system on which this 
document was compiled.

```{r}
sessionInfo()
```


# References