---
title: "Sharing analyses with ISAnalytics"
author: 
  - name: Giulia Pais
    affiliation: | 
     San Raffaele Telethon Institute for Gene Therapy - SR-Tiget, 
     Via Olgettina 60, 20132 Milano - Italia
    email: giuliapais1@gmail.com, calabria.andrea@hsr.it
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
date: "`r doc_date()`"
package: "`r pkg_ver('ISAnalytics')`"
vignette: >
  %\VignetteIndexEntry{sharing_analyses}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---
```{r include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
```

```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE}
## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1],
    eulerr = citation("eulerr")[1]
)
```

# Introduction

In this vignette we explain in more detail how to perform sharing analyses 
with `r Biocpkg("ISAnalytics")` and its dedicated sharing functions.

```{r echo=FALSE}
inst_chunk_path <- system.file("rmd", "install_and_options.Rmd", package = "ISAnalytics")
```

```{r child=inst_chunk_path}

```

# Shared integration sites

An integration site is always characterized by a triple of values:
`(chr, integration_locus, strand)`, hence these attributes are always present
in integration matrices.

```{r}
library(ISAnalytics)
data("integration_matrices")
data("association_file")
```

```{r echo=FALSE}
print(head(integration_matrices))
```

We can aggregate our data in different ways according to our needs (to
know more about this topic take a look at the vignette 
`vignette("workflow_start", package = "ISAnalytics")`), obtaining
therefore different groups. Each group has an associated set of integration
sites.

```{r}
## Aggregation by standard key
agg <- aggregate_values_by_key(integration_matrices,
                               association_file,
                               value_cols = c("seqCount", "fragmentEstimate"))
agg <- agg %>% dplyr::filter(TimePoint %in% c("0030", "0060"))
```

```{r echo=FALSE}
print(agg, width = Inf)
```

An integration site is *shared* between two or more groups if the same triple
is observed in all the groups considered.

# Automated sharing counts

ISAnalytics provides the function `is_sharing()` for computing automated 
sharing counts. The function has several arguments that can be tuned according
to user needs.

## SCENARIO 1: single input data frame and single grouping key

```{r}
sharing_1 <- is_sharing(agg, 
                        group_key = c("SubjectID", "CellMarker", 
                                      "Tissue", "TimePoint"),
                        n_comp = 2,
                        is_count = TRUE,
                        relative_is_sharing = TRUE,
                        minimal = TRUE,
                        include_self_comp = FALSE, 
                        keep_genomic_coord = TRUE)
sharing_1
```

In this configuration we set:

* A single input data frame: `agg`
* A single grouping key by setting the argument `grouping_key`. In this
specific case, our groups will be identified by a unique combination of
`SubjectID`, `CellMarker`, `Tissue` and `TimePoint`
* `n_comp` represents the number of comparisons to compute: 2 means we're 
interested in knowing the sharing for PAIRS of distinct groups
* We want to keep the counts of distinct integration sites for each group
by setting `is_count` to `TRUE`
* `relative_is_sharing` if set to `TRUE` adds sharing expressed as a percentage,
more precisely it adds a column `on_g1` that is calculated as the 
absolute number of shared integrations divided by the cardinality of the
first group, `on_g2` is analogous but is computed on the cardinality of the
second group and finally `on_union` is computed on the cardinality 
of the union of the two groups.
* By setting the argument `minimal` to `TRUE` we tell the function to avoid
redundant comparisons: in this way only **combinations** and not permutations
are included in the output table
* `include_self_comp` adds rows in the table that are labelled with the same
group: these rows always have a 100% sharing with all other groups. There are
few scenarios where this is useful, but for now we set it to `FALSE` since
we don't need it
* `keep_genomic_coord` allows us to keep the genomic coordinates of the 
shared integration sites as a separate table

### Changing the number of comparisons

```{r}
sharing_1_a <- is_sharing(agg, 
                        group_key = c("SubjectID", "CellMarker", 
                                      "Tissue", "TimePoint"),
                        n_comp = 3,
                        is_count = TRUE,
                        relative_is_sharing = TRUE,
                        minimal = TRUE,
                        include_self_comp = FALSE, 
                        keep_genomic_coord = TRUE)
sharing_1_a
```

Changing the `n_comp` to 3 means that we want to calculate the sharing between
3 different groups. Note that the `shared` column contains the counts of 
integrations that are shared by **ALL** groups, which is equivalent to 
a set intersection.

Beware of the fact that the more comparisons are requested the more time
the computation requires.

### A case when it is useful to set `minimal = FALSE` 

Setting `minimal = FALSE` produces all possible permutations of the groups
and the corresponding values. In combination with `include_self_comp = TRUE`,
this is useful when we want to know the sharing between pairs of groups and
plot results as a heatmap.

```{r}
sharing_1_b <- is_sharing(agg,
                          group_key = c("SubjectID", "CellMarker", 
                                      "Tissue", "TimePoint"),
                          n_comp = 2,
                          is_count = TRUE,
                          relative_is_sharing = TRUE,
                          minimal = FALSE,
                          include_self_comp = TRUE)
sharing_1_b
heatmaps <- sharing_heatmap(sharing_1_b)
```

The function `sharing_heatmap()` automatically plots sharing between 2 groups.
There are several arguments to this function that allow us to obtain heatmaps
for the absolute sharing values or the relative (percentage) values.

```{r}
heatmaps$absolute
heatmaps$on_g1
heatmaps$on_union
```

Beware of the fact that calculating all permutations takes longer than 
calculating only distinct combinations, therefore if you don't need your
results plotted or you have more than 2 groups to compare you should stick 
with `minimal = TRUE` and `include_self_comp = FALSE`.

## SCENARIO 2: single input data frame and multiple grouping keys

In the first scenario, groups were homogeneous, that is, they were grouped all
with the same key. In this other scenario we want to start with data contained
in just one data frame but we want to compare sets of integrations that are
grouped differently. To do this we give as input a **list of keys** through
the argument `group_keys`.

```{r}
sharing_2 <- is_sharing(agg,
                        group_keys = list(
                          g1 = c("SubjectID", "CellMarker", 
                                 "Tissue", "TimePoint"),
                          g2 = c("SubjectID", "CellMarker"),
                          g3 = c("CellMarker", "Tissue")
                          ))
sharing_2
```

There are a few things to keep in mind in this case:

* The arguments `group_key` (notice the absence of plural), 
`n_comp` and `include_self_comp` are ignored: the number of comparisons is 
automatically detected by counting the provided keys and a self comparison
doesn't make sense since group labels are different
* If you provide a list of identical keys or just one key
you fall back to scenario 1

## SCENARIO 3: multiple input data frame and single grouping key

Providing multiple input data frames and the same grouping key is an effective
way to reduce the number of comparisons performed.
Let's make an example: suppose we're interested in comparing groups labelled
by a unique combination of `SubjectID`, `CellMarker`, `Tissue` and `TimePoint`,
but this time we want the first group to contain only integrations relative to
`PT001_MNC_BM_0030` and the second group to contain only integrations 
relative to `PT001_MNC_BM_0060`.

We're going to filter the original data frame in order to obtain only 
relevant data in 2 separated tables and then proceed by calling the function.

```{r}
first_sample <- agg %>%
  dplyr::filter(SubjectID == "PT001", CellMarker == "MNC", Tissue == "BM", 
         TimePoint == "0030")
second_sample <- agg %>%
  dplyr::filter(SubjectID == "PT001", CellMarker == "MNC", Tissue == "BM", 
         TimePoint == "0060")
sharing_3 <- is_sharing(first_sample, second_sample,
                        group_key = c("SubjectID", "CellMarker", 
                                      "Tissue", "TimePoint"),
                        is_count = TRUE,
                        relative_is_sharing = TRUE,
                        minimal = TRUE)
sharing_3
```

Once again the arguments `n_comp` and `include_self_comp` are ignored:
the number of comparisons is equal to the number of data frames in input.

## SCENARIO 4: multiple input data frame and multiple grouping keys

Finally, the most general scenario is when we have multiple data frames
with multiple keys. In this case the number of data frames must be equal to
the number of provided keys and grouping keys are applied in order (
data frame 1 is grouped with key 1, data frame 2 is grouped with key 2, and
so on).

```{r}
df1 <- agg %>%
  dplyr::filter(TimePoint == "0030")
df2 <- agg %>%
  dplyr::filter(TimePoint == "0060")
df3 <- agg %>%
  dplyr::filter(Tissue == "BM")

keys <- list(g1 = c("SubjectID", "CellMarker", "Tissue"),
             g2 = c("SubjectID", "Tissue"),
             g3 = c("SubjectID", "CellMarker", "Tissue"))

sharing_4 <- is_sharing(df1, df2, df3, group_keys = keys)
sharing_4
```

Notice that in this example the keys for g1 and g3 are the same, meaning the
labels of the groups are actually the same, however you should remember that
the groups contain a **different set of integration sites** since they come
from different data frames.

# Plotting sharing results

When we have more than 2 comparisons it is convenient to plot them as venn or
euler diagrams. ISAnalytics has a fast way to do that through the functions
`is_sharing()` and `sharing_venn()`.

```{r}
sharing_5 <- is_sharing(agg,
                        group_keys = list(
                          g1 = c("SubjectID", "CellMarker", 
                                 "Tissue", "TimePoint"),
                          g2 = c("SubjectID", "CellMarker"),
                          g3 = c("CellMarker", "Tissue")
                          ), table_for_venn = TRUE)
sharing_5
```

The argument `table_for_venn = TRUE` will add a new column `truth_tbl_venn`
that contains corresponding truth tables for each row.

```{r}
sharing_plots1 <- sharing_venn(sharing_5, row_range = 1, euler = TRUE)
sharing_plots2 <- sharing_venn(sharing_5, row_range = 1, euler = FALSE)
```

Say that we're interested in plotting just the first row of our sharing data 
frame. Then we can call the function `sharing_venn` and specify in the 
`row_range` argument the index 1. Note that this function requires the package
`r CRANpkg("eulerr")` to work. The argument `euler` indicates if the function
should produce euler or venn diagrams instead. 

Once obtained the lists of euler/venn objects we can plot them by simply 
calling the function `plot()`:

```{r}
plot(sharing_plots1[[1]])
plot(sharing_plots2[[1]])
```

There are several options that can be set, for this please refer to 
[eulerr docs](https://jolars.github.io/eulerr/reference/plot.euler.html).

# Reproducibility

`R` session information.

```{r reproduce3, echo=FALSE}
## Session info
library("sessioninfo")
options(width = 120)
session_info()
```

# Bibliography

This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])`
with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes.

Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`.

```{r vignetteBiblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE}
## Print bibliography
PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html"))
```