---
title: "Analyzing differential co-expression with csdR"
bibliography: ../inst/REFERENCES.bib
output: BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{csdR}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# When and why to use this package
The purpose of this package is to compare the gene expression of the genes
in *two* different conditions.
The most typical case is when comparing gene expression in patients
with a disease with gene expression in study participants without the disease.
Hence, we may construct a network containing genes which are relevant
for the development of the disease.
The input data may come from different measurements of expression such as
microarray, proteomics or RNA-seq as long as:

- There are no missing values in the data.
Consider filling in with average values
or pseudo-values if this is not the case.
- The expression values are coded as continuous numerical values which
are comparable between samples.
Note that only the ranks of each gene across the samples does matter
as CSD uses the rank-based Spearman correlation.
- The gene labels for the two conditions must match.

For differential gene-expression involving more than two separate conditions,
consider `CoDiNA` [@MorselliGysi2020] instead.

# Installation

This package is hosted on Bioconductor. To install it, type:

```{r install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("csdR")
```

Then,

```{r}
library(csdR)
```

should load the package into the current **R** session.
For this vignette, we further load some auxiliary packages
and set the random seed

```{r setup}
suppressPackageStartupMessages({
    library(magrittr)
    library(igraph)
    library(glue)
    library(dplyr)
})
set.seed(45394534)
```



# Some theoretical considerations

This is a re-implementation and slight modification of the CSD algorithm
presented by Voigt _et al._ 2017[@Voigt2017].
In the first phase, the algorithm finds the co-expression between
all genes within each condition using the Spearman correlation.
For each pair of genes, we apply bootstrapping across the samples
and compute the mean Spearman correlation
$\rho_1$ and $\rho_2$ for the two conditions and the associated
standard errors $\sigma_1$ and $\sigma_2$.
In the second stage, the values for the two conditions are compared and gives us
the following *differential* co-expression scores:

* Conserved score, 
$C=\frac{\left|\rho_1+\rho_2\right|}{\sqrt{\sigma_1^2+\sigma_2^2}}$,
a high value indicating the same strong co-expression in both conditions.
* Specific score,
$S=\frac{\left|\left|\rho_1\right|-\left|\rho_2\right|\right|}{\sqrt{\sigma_1^2+\sigma_2^2}}$,
a high value indicating a strong co-expression in one condition,
but not in the other.
* Differentiated score,
$D=\frac{\left|\left|\rho_1\right|+\left|\rho_2\right|-\left|\rho_1+\rho_2\right|\right|}{\sqrt{\sigma_1^2+\sigma_2^2}}$,
a high value indicating a strong co-expression in both conditions,
but with the opposite sign.




# Workflow outline

In this example, we are provided by two expression expression matrices
from thyroid glands,
`sick_expression` for patients with thyroid cancer
and `normal_expression` for healthy controls.
To run the CSD analysis for these two conditions, we simply do the following:
```{r}
data("sick_expression")
data("normal_expression")
csd_results <- run_csd(
    x_1 = sick_expression, x_2 = normal_expression,
    n_it = 10, nThreads = 2L, verbose = FALSE
)
```

After obtaining these results, we may write them to disk.
However, for datasets with thousands of genes,
we will get millions upon millions of gene pairs.
Writing the results to disk is likely to fill up gigabytes
of valuable storage space
while the disk IO itself might take a considerable amount of time.
Furthermore, we must reduce the information load to create meaningful results,
so we better to that while the data is still in memory.
We decide to select the 100 edges with highest C, S, and D-score.

```{r}
pairs_to_pick <- 100L
c_filter <- partial_argsort(csd_results$cVal, pairs_to_pick)
c_frame <- csd_results[c_filter, ]
s_filter <- partial_argsort(csd_results$sVal, pairs_to_pick)
s_frame <- csd_results[s_filter, ]
d_filter <- partial_argsort(csd_results$dVal, pairs_to_pick)
d_frame <- csd_results[d_filter, ]
```

Why does the `csdR` package provide a general `partial_argsort` function
which takes in a numeric vector 
and spits out the indecies of the largest elements
instead of a more specialized function
directly extracting the top results from the dataframe?
The answer is flexibility.
Writing an additional line of code and a dollar sign
is not that much work after all
and we may want more flexible approaches such
as displaying the union of the C, S- and D-edges:

```{r}
csd_filter <- c_filter %>%
    union(s_filter) %>%
    union(d_filter)
csd_frame <- csd_results[csd_filter, ]
```

## How to we approach from here?

The next logical step is to construct a network and do some analysis.
This is outside the scope of this package,
but we will provide some pointers for completeness.
One viable approach is to use the ordinary `write.table` function
to write the results of a file
and then use an external tools such as Cytoscape to further make conclusions.
Often, you may want to make an ontology enrichment of the genes.

The other option is of course to continue using R.
Here, we provide an example of combining the C-, S- and D-networks
and coloring the edges blue, green and red,
respectively depending of where they come from.
```{r}
c_network <- graph_from_data_frame(c_frame, directed = FALSE)
s_network <- graph_from_data_frame(s_frame, directed = FALSE)
d_network <- graph_from_data_frame(d_frame, directed = FALSE)
E(c_network)$edge_type <- "C"
E(s_network)$edge_type <- "S"
E(d_network)$edge_type <- "D"
combined_network <- igraph::union(c_network, s_network, d_network)
# Auxillary function for combining
# the attributes of the three networks in a proper way
join_attributes <- function(graph, attribute) {
    ifelse(
        test = is.na(edge_attr(graph, glue("{attribute}_1"))),
        yes = ifelse(
            test = is.na(edge_attr(graph, glue("{attribute}_2"))),
            yes = edge_attr(graph, glue("{attribute}_3")),
            no = edge_attr(graph, glue("{attribute}_2"))
        ),
        no = edge_attr(graph, glue("{attribute}_1"))
    )
}
E(combined_network)$edge_type <- join_attributes(combined_network, "edge_type")
layout <- layout_nicely(combined_network)
E(combined_network)$color <- recode(E(combined_network)$edge_type,
    C = "darkblue", S = "green", D = "darkred"
)
plot(combined_network, layout = layout,
    vertex.size = 3, edge.width = 2, vertex.label.cex = 0.001)
```



# Considerations to note

## Number of bootstrap iterations
As with any bootstrap procedure the number of iterations represented by
the argument `n_it` needs to be *sufficiently large* 
in order to get reproducible results.
What this means is a matter of trial and error.
In general this means that you should re-run the computations
with a different random seed to see whether
the number of bootstrap iterations are sufficient.
Experience has shown that ~ 100 iterations might
be sufficient to reproduce almost the same results in some cases,
whereas in other cases,
especially when the values are close,
you may choose to run several thousand iterations.

## Number of threads
The parallelization of `csdR` occurs within each individual iteration. While parallelizing across the iterations would in theory yield better CPU utilization, this approach is unfeasible due the fact that memory consumption use would almost be proportional to the number of threads. Instead, parallelization is used in computing the co-expression matrix and updating the mean and variance for each iteration. As shown in the `csdR` paper, parallelism can provide 2x-3x speedup, but there are diminishing returns in the performance gained, so running more than 10 threads is most likely a waste a resources. Limited memory bandwidth is the most likely culprit behind the failure to scale, so systems with higher memory bandwidth are more likely to utilize multiple threads better.

## Memory consumption

For datasets with 20 000 to 30 000 genes,
a considerable amount of memory is consumed during the computations.
It it therefore not recommended in such cases to run CSD on your laptop
or even a workstation,
but rather on a compute server with several hundreds GB of RAM.

## Number of top gene pairs to pick

How many gene pairs to select depends on the specific needs
and how big a network you want to handle.
A 10 000 edge network may not be easy to visualize,
but quantitative network metrics can still be extracted.
Also, generating more edges than necessary usually does not make
any major harm as superfluous edges can quickly be filter out afterwards.
However, if you select fewer edges than you actually need,
you have to re-do all calculations to increase the number.

# Session info for this vignette
```{r}
sessionInfo()
```



# References