---
title: "rifiComparative"

author: "Loubna Youssar"

date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    number_sections: false
vignette: >
  %\VignetteIndexEntry{rifiComparative}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include=FALSE}
devtools::load_all(".")
```

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      message = TRUE,
                      warning = TRUE)
```


``` {r, echo = FALSE, message = FALSE}
require(rifiComparative)
suppressPackageStartupMessages(library(SummarizedExperiment))
```

## 0. Installation

Required dependencies are cited on README, please make sure they are properly
installed (README).
All functions should be located on the same folder and add them to your 
path directory.

## I. Introduction

rifiComparative is a successor framework of `r BiocStyle::Biocpkg("rifi")`
*(https://github.com/CyanolabFreiburg/rifi)*. Generated outputs from the same
organism with different treatments could be compared. Trying to combine segments
of the same gene from different conditions is not straight forward and makes the
data analysis nearly impossible. Therefore we developed a new workflow, 
rifiComparative, with an easy strategy to make 2 conditions comparable. 
The principle of rifiComparative consists on segmenting the half-life 
(difference between half-life (condition1) and half-life (condition2) 
at probe/bin level) and segmenting intensity using the log2FC(mRNA at time 0). 
The workflow does not apply any hierarchy. Half-life (in some cases, HL) and 
intensity segmentation are independent.  
The fragments result of clustering from half-life and intensity are compared using
log2FC(log2FC(half-life)/log2FC(intensity)). These values are a pre-analysis for
transcription and post-transcription regulation. Events for each treatment are 
depicted with the position on the genome (For more detail, refer to section visualization).
P-values from statistical tests are estimated. rifiComparative generates data frame summary, genome plot and several figures (refer to section Plots for more details).


## II. Workflow

### 1. Joining data

The first step is combining the data from two conditions. The data are combined
by row on one hand and combined by column on the other hand. Both objects are 
saved and used as input for the next analysis.

The functions used are:

`loading_fun`: you need to load either `rifi_fit` or `rifi_stats` outputs from
each condition and place all in one directory. `rifi_fit` is sufficient to run the 
workflow unless if you want to select more column from `rifi_stats` for more 
analysis or plot. The "cdt" is added referring to the sample condition. 

<span style="color:red"> Very important: </span> you will need to run the 
differential expression at probe/bin level. This is the log2FC(intensity) or 
log2FC(mRNA at time 0). Pick-up the logFC, the p_value adjusted, probe position 
and strand columns. Save the first two as `logFC_int` and `P.Value`. You can use either `left_join` or `right_join` from the `r BiocStyle::CRANpkg("dplyr")` 
package to join both data by strand and position. 
<br/><br/>

``` {r loading_fun, eval = TRUE}
data(stats_se_cdt1)
data(stats_se_cdt2)
data(differential_expression)
inp_s <-  
    loading_fun(stats_se_cdt1, stats_se_cdt2, differential_expression)[[1]]
head(inp_s, 5)
inp_f <- 
    loading_fun(stats_se_cdt1, stats_se_cdt2, differential_expression)[[2]]
head(inp_f, 5)
```
<br/><br/>

`joining_data_row`: contains `joining_data_row` function. It gathers data frame 
from both conditions in one by rows. The object is called data_combined_se.rda
<br/><br/>

``` {r joining_data_row, eval = TRUE}
data(inp_s)
data(inp_f)
data_combined_minimal <- 
joining_data_row(input1 = inp_s, input2 = inp_f)
head(data_combined_minimal, 5)
```
<br/><br/>

`joining_data_column`: contains `joining_data_column` function. It gathers 
data frame from both conditions in one by columns. The object is called 
df_comb_se.rda
<br/><br/>

``` {r joinging_data_column, eval = TRUE}
data(data_combined_minimal)
df_comb_minimal <- joining_data_column(data = data_combined_minimal)
head(df_comb_minimal, 5)
```

### 2. Penalties

Same as `r BiocStyle::Biocpkg("rifi")` workflow, to get the best segmentation we
need the optimal penalties. 
To calculate half-life penalty, the difference between half-life from both conditions is calculated and added as `distance_HL` variable to `df_comb_minimal`
data frame.
On other hand the `logFC_int` is used to assign penalties for intensity values 
and added as distance_int variable. `df_comb_minimal` with the additional variables
is named `penalties_df`.

The functions needed for penalty are:

`make_pen` calls one of two available penalty functions to automatically assign
penalties for the dynamic programming. Four functions are called:

 * `make_pen`
 * `fragment_HL_pen`
 * `fragment_inty_pen`
 * `score_fun_ave`

#### 1. `make_pen`

`make_pen` calls one of two available penalty functions to automatically
assign penalties for the dynamic programming. the function iterates over many
penalty pairs and picks the most suitable pair based on the difference between
wrong and correct splits. The sample size, penalty range and resolution as well
as the number of cycles can be customized. The primary start parameters create 
a matrix with n = `rez_pen` rows and n = `rez_pen_out` columns with values between
sta_pen/sta_pen_out and end_pen/end_pen_out. The best penalty pair is
picked. If dept is bigger than 1 the same process is repeated with a new matrix
of the same size based on the result of the previous cycle. Only position
segments with length within the sample size range are considered for the
penalties to increase run time. Also, outlier penalties cannot be smaller
than 40% of the respective penalty. For more detail check vignette from `r BiocStyle::Biocpkg("rifi")` package.
<br/><br/>

#### 2. `fragment_HL_pen`

`fragment_HL_pen` is called by `make_pen` function to automatically assign
penalties for the dynamic programming of half-life fragments. The function used 
for `fragment_HL_pen` is `score_fun_ave`. `score_fun_ave` scores the values of 
y on how close they are to the mean. for more details, see below.
<br/><br/>

``` {r make_pen HL, eval = FALSE}

df_comb_minimal[,"distance_HL"] <-
    df_comb_minimal[, "half_life.cdt1"] - df_comb_minimal[, "half_life.cdt2"]

pen_HL <- make_pen(
    probe = df_comb_minimal,
    FUN = rifiComparative:::fragment_HL_pen,
    cores = 2,
    logs = as.numeric(rep(NA, 8))
)
```
<br/><br/>

#### 3. `fragment_inty_pen`

`fragment_inty_pen` is called by `make_pen` function to automatically assign
penalties for the dynamic programming of intensity fragments. 
The function used is `score_fun_ave`.

``` {r make_pen int, eval = FALSE}

df_comb_minimal[,"distance_int"] <- df_comb_minimal[,"logFC_int"]

pen_int <- make_pen(
    probe = df_comb_minimal,
    FUN = rifiComparative:::fragment_inty_pen,
    cores = 2,
    logs = as.numeric(rep(NA, 8))
)
```
<br/><br/>

``` {r penalties, eval = TRUE}
data(df_comb_minimal) 
penalties_df <- penalties(df_comb_minimal)[[1]]
pen_HL <- penalties(df_comb_minimal)[[2]]
pen_int <- penalties(df_comb_minimal)[[3]]
head(penalties_df, 5)
```
<br/><br/>

#### 4. `score_fun_ave`

`score_fun_ave` scores the values of y on how close they are to the mean. 
for more details, see below.
<br/><br/>

### 3. Fragmentation

After finding the optimal set of penalties, fragmentation process could be applied.
The functions used are:
 
 `fragment_HL`
 `fragment_inty`
 `score_fun_ave`

#### 1. `fragment_HL`

`fragment_HL` performs the half_life fragmentation and assigns all gathered 
information to the probe based data frame. The columns `HL_comb_fragment` and 
`HL_mean_comb_fragment` are added to data frame. `fragment_HL` makes 
half-life_fragments and assigns the mean of each fragment.

``` {r fragment_HL, eval = TRUE}
penalties_df <-
    fragment_HL(
    probe = penalties_df,
    cores = 2,
    pen = pen_HL[[1]][[9]],
    pen_out = pen_HL[[1]][[10]]
)
```
<br/>

#### 2. `fragment_inty`

`fragment_inty` performs the intensity fragmentation and assigns all gathered
information to the probe based data frame. The columns `intensity_comb_fragment`
and `intensity_mean_comb_fragment` are added to the data frame. `fragment_inty` 
makes `intensity_fragments` and assigns the mean of each fragment.
The hierarchy is not followed, fragments from different size could be generated 
independently of half-life fragments.

``` {r fragment_inty, eval = TRUE}
fragment_int <-
    fragment_inty(
        probe = penalties_df,
        cores = 2,
        pen = pen_int[[1]][[9]],
        pen_out = pen_int[[1]][[10]]
    )
head(fragment_int, 5)
```
<br/>

#### 3. `score_fun_ave`

`score_fun_ave` is the score function used by dynamic programming for intensity
fragmentation, for more details, see below.

### 4. Statistics

To check segment significance, t-test with two.sided was used. Each fragment was
tested for the number of probes involved in each condition.

``` {r t_test_function_HL, eval = TRUE}
data(fragment_int)
stats_df_comb_minimal <- statistics(data= fragment_int)[[1]]
df_comb_uniq_minimal <- statistics(data= fragment_int)[[2]]
```

### 5. Visualization

The visualization depicts half-life and intensity slots of the fragments. Since 
hierarchy is not applied, the fragments from half-life and intensity are 
independent.

``` {r rifi_visualization_comparison, eval = FALSE}
data(data_combined_minimal)
data(stats_df_comb_minimal)
data(annot_g)
rifi_visualization_comparison(
     data = data_combined_minimal,
     data_c = stats_df_comb_minimal,
     genomeLength = annot_g[[2]],
     annot = annot_g[[1]]
     )
```

Three objects are required:

 `data_combined_minimal` : data frame from joined data by row.
 `df_comb_minimal` : data frame from joined data by column
 `annot` : ggf3 preprocessed (for more information, see below)
 
The plot is located on vignette "genome_fragments_comparison.pdf" and shows 3 sections: `annotation`, half-life difference and log2FC (mRNA=time0 or intensity).
Either half_life difference or log2FC(intensity), the line 0 indicates no changes between both conditions. Conditions 1 and 2 are indicated by blue and lilac color respectively. Fragments result of dynamic programming are indicated by different colors.
The annotation englobes genome annotation preprocessed by gff3_preprocessing function included on the package and a superposed TU annotation of both conditions from
`r BiocStyle::Biocpkg("rifi")` output.

<br/><br/>
``` {r visualization, echo = FALSE, fig.cap = "**genome fragments visualization of both conditions**", out.width = '100%'}
knitr::include_graphics("visualization_cdts.png")
```
<br/>

## III. Outputs

### 1. `adjusting_HLToInt`

`adjusting_HLToInt` function combines half-life and intensity fragments generated 
without hierarchy on one hand and the genome annotation on other hand. The first
step is adjusting the fragments from half-life to intensity and vise-versa and join 
them to the genome annotation. To make half-life and intensity segments comparable, 
`log2FC(HL)` is used instead of `distance_HL`. At least one fragment should have 
a significant p_value from t-test, either half-life or intensity. 

To generate the data frame, two objects are required:

 `df_comb_minimal` : data frame from joined data by column.
 `annot` : ggf3 preprocessed (for more information, see below).

The functions used are:

`p_value_function` extracts and return the p_values of half-life and intensity segments respectively.

`eliminate_outlier_hl` eliminates outliers from half-life fragments.

`eliminate_outlier_int` eliminates outliers from intensity fragments.

`mean_length_int` extracts the mean of the log2FC(intensity) fragments adapted 
to HL_fragments and their lengths.

`mean_length_hl` extracts the mean of log2FC(HL) fragments adapted to the 
intensity fragments and their lengths.

`calculating_rate` calculates decay rate and log2FC(intensity). Both are used to
calculate synthesis rate.

The output data frame contains the corresponding columns:

 **position**: position of the first fragment
 **region**:  region annotation covering the fragments
 **gene**: gene annotation covering the fragments
 **locus_tag**: locus_tag annotation covering the fragments
 **strand**: The bin/probe specific strand (+/-)
 **fragment_HL**: Half-life fragments
 **fragment_int**: intensity fragments
 **position_frg_int**: position of the first fragment and the last position of 
 the last fragment.
 **mean_HL_fragment**: mean of the HL of the fragments involved.
 **mean_int_fragment**: mean of the intensity of the fragments involved.
 **log2FC(decay_rate)**: log2FC(decay(condition1)/decay(condition2)).
 **log2FC(synthesis_rate)**: sum of log2FC(decay_rate) and log2FC(intensity).
 **Log2FC(HL)-Log2FC(int)**: sum of log2FC(decay_rate) and log2FC(intensity).
 **intensity_FC**: log2FC(mean(intensity(condition1))/mean(intensity(condition2))).
 **Log2FC(HL)-Log2FC(int)**: sum of log2FC(decay_rate) and log2FC(intensity).
 **p_value**: indicated by "*" means at least one fragment either half-life fragment or intensity fragment has a significant p_value.

``` {r adjusting_HLToInt, eval = TRUE}
data(stats_df_comb_minimal) 
data(annot_g)
df_adjusting_HLToInt <- adjusting_HLToInt(data = stats_df_comb_minimal, 
                                          annotation = annot_g[[1]])
head(df_adjusting_HLToInt, 5)
```

## IV. Plots

A serie of plots could be generated using the `figures_fun`. The functions
included are:

`plot_decay_synt`

`plot_heatscatter`

`plot_density`

`plot_histogram`

`plot_scatter`

`plot_volcano`

<br/>
``` {r figures_fun, eval = FALSE, echo = TRUE, out.width = '100%'}
 data(data_combined_minimal)
 data(df_comb_minimal)
 data(differential_expression) 
 data(df_mean_minimal)
 figures_fun(data.1 = df_mean_minimal, data.2 = data_combined_minimal, 
 input.1 = df_comb_minimal, input.2 = differential_expression, cdt1 = "sc", 
 cdt2 = "fe") 
```
<br/>

### 1. `plot_decay_synt`

The generated data frame `df_mean_minimal` could be used to plot changes in RNA
decay rates (log fold, x-axis) versus the changes in RNA synthesis rates (log fold, y-axis) in the condition 1 versus condition 2. The black lines horizontal, vertical and diagonal are the median of synthesis_rate, decay_rate and mRNA at time 0 respectively. Dashed gray lines indicate 0.5-fold changes from 0 (gray lines) referring to unchanged fold.
Coloration could be adjusted upon the parameter selected. In this case decay rate 
above 0.5 and synthesis rate below -0.5 are highlighted in yellow.
Segments could be labeled using `geom_text_repel` function, they are commented on
the function.

<br/><br/>
``` {r, echo = FALSE, fig.cap = "**\\label{fig:figs}Decay rate vs. Synthesis rate**", out.width = '100%'}
knitr::include_graphics("Decay_rate_vs_Synthesis_rate.png")
```
<br/>


### 2. `plot_heatscatter`

Heatscatter plot could be generated using "df_mean_minimal" data frame. It plots
the changes in RNA decay rates (log fold, x-axis) versus the changes in RNA 
synthesis rates (log fold, y-axis) in the condition 1 versus condition 2. 
The coloring indicates the local point density.

<br/><br/>
``` {r, echo = FALSE, fig.cap = "**\\label{fig:figs}Heatscatter Decay rate vs. Synthesis rate**", out.width = '100%'}
knitr::include_graphics("heatscatter_Decay_rate_vs_Synthesis_rate.png")
```
<br/>

### 3. `plot_density`

The function uses the `data_combined_minimal` to plot the probe/bin half-life 
density in both conditions. Condition 1 and 2 could be indicated.

<br/><br/>
``` {r, echo = FALSE, fig.cap = "**\\label{fig:figs}Half-life Density**", out.width = '100%'}
knitr::include_graphics("density_HL.png")
```
<br/>

### 4. `plot_histogram`

The function uses `df_comb_minimal` to plot a histogram of probe/bin half-life 
categories from 2 to 20 minutes in both conditions. Condition 1 and 2 could be
indicated.

<br/><br/>
``` {r, echo = FALSE, fig.cap = "**\\label{fig:figs}Half-life Classification**", out.width = '100%'}
knitr::include_graphics("histogram_count.png")
```
<br/>

### 5. `plot_scatter`

A scatter plot of the bin/probe half-life in condition 1 vs. condition 2. 

<br/><br/>
``` {r, echo = FALSE, fig.cap = "**\\label{fig:figs}Half-life Scatter Plot **", out.width = '100%'}
knitr::include_graphics("scatter_plot_HL.png")
```
<br/>


### 6. `plot_volcano`

A volcano plot of statistical significance (P value) versus magnitude of change 
(fold change).
    
<br/><br/>
``` {r, echo = FALSE, fig.cap = "**\\label{fig:figs}Volcano Plot**", out.width = '100%'}
knitr::include_graphics("volcano_plot.png")
```
<br/>

## III. Additional functions

### 1. `score_fun_ave`

`score_fun_ave` scores the difference of the values from their mean.
`score_fun_ave` calculates the mean of a minimum 2 values **y** and substrates
the difference from their mean. The IDs **z** and the sum of differences from
the mean are stored. A new value y is added, the mean is calculated and the new
IDs and sum of differences are stored. After several rounds, the minimum score
and the corresponding IDs is selected and stored as the best fragment.
`score_fun_ave` selects simultaneously for outliers, the maximum number is fixed
previously. Outliers are those values with high difference from the mean, they
are stored but excluded from the next calculation. The output of the function is
a vector of IDs separated by ",", a vector of mean separated by "_" and a
vector of outliers separated by ",".

### 2. `gff3_preprocess`

`gff3_preprocess` processes gff3 file from database, extracting gene names and
locus_tag from all coding regions (CDS). Other features like UTRs, ncRNA, asRNA
ect.. if available and the genome length are extracted. The output is a list of 
2 elements.

The output data frame from `gff3_preprocess` function contains the following
columns:

a. *region*: CDS or any other available like UTRs, ncRNA, asRNA
b. *start*: start position of the gene
c. *end*: end position of the gene
d. *strand*: +/-
e. *gene*: gene annotation if available otherwise locus_tag annotation replaces
it
f. *locus_tag*: locus_tag annotation
<br/><br/>
``` {r gff3_preprocess}
gff3_preprocess(path = gzfile(
   system.file("extdata", "gff_synechocystis_6803.gff.gz", package = "rifiComparative")
 ))
```
<br/><br/>

``` {r}
sessionInfo()
```