---
title: "The Netboost users guide"
author: "Pascal Schlosser, Jochen Knaus"
package: "`r pkg_ver('netboost')`"
output: 
  BiocStyle::html_document:
    md_extensions: "-autolink_bare_uris"
vignette: >
  %\VignetteIndexEntry{The Netboost users guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r setup, cache = F, echo = FALSE}
knitr::opts_chunk$set(error = TRUE)
```

# Introduction

The `r Biocpkg("netboost")` package, implements a three-step dimension reduction
technique. First, a boosting-based filter is combined with the topological 
overlap measure to identify the essential edges of the network. Second, sparse 
hierarchical clustering is applied on the selected edges to identify modules
and finally module information is aggregated by the first principal 
components. The primary analysis is then carried out on these summary measures
instead of the original data.

# Loading an example dataset 

The package comes with an example dataset included. We import the acute myeloid 
leukemia patient data from The Cancer Genome Atlas public domain database. 
The dataset consists of one thousand DNA methylation sites and gene expression levels on 
chromosome 18 for 80 patients.

```{r load_data}
require("netboost")
data("tcga_aml_meth_rna_chr18", package = "netboost")
dim(tcga_aml_meth_rna_chr18)
```

The `netboost()` function integrates all major analysis steps and generates 
multiple plots. In this step we also set analysis parameters:

 `stepno` defines the number of boosting steps taken

 `soft_power` (if null, automatically chosen) the exponent in the transformation
of the correlation

 `min_cluster_size` the minimal size of clusters, `n_pc` the number of maximally 
computed principal components

 `scale` if data should be scaled and centered prior to analysis

 `ME_diss_thres` defines the merging threshold for identified clusters. 

 For details on the options please see `?netboost` and the corresponding paper
Schlosser et al. 2020.

```{r netboost_run}
results <- netboost(datan = tcga_aml_meth_rna_chr18, stepno = 20L, 
soft_power = 3L, min_cluster_size = 10L, n_pc = 2, scale = TRUE, ME_diss_thres = 0.25) 
```

For each detected 
independent tree in the dataset (here one) the first graph shows a dendrogram of 
initial modules and at which level they are merged, the second graph a module 
dendrogram after merging and the third the dendrogram of features including the 
module-color-code.

`results` contains the dendrograms (dendros), feature identifier (names) matched 
to module assignment (colors), the aggregated dataset (MEs), the rotation matrix 
to compute the aggregated dataset (rotation) and the proportion of variance 
explained by the aggregate measures (var_explained).
Dependent on the minimum proportion of variance explained set in the 
`netboost()` call (default 0.5) up to `n_pc` principal components are exported.

```{r results}
names(results)
colnames(results$MEs)
```

As you see for most modules the first principal component already explained more 
than 50% of the variance in the original features of this module. 
ME0_*X*_pc*Y* denotes the background module (unclustered features) of the 
independent tree *X*.

Explained variance is reported by a matrix for the first `n_pc` principal 
components. Here we list the first 5 modules:

```{r variance}
results$var_explained[,1:5]
```

`results$colors` use a numeric coding for the modules which matches their module 
name. To list features of module ME8 we can extract them by:

```{r module_members}
results$names[results$colors==8]
```

The final dendrogram including all trees can be plotted including labels (`results$names`) for 
individual features. `colorsrandom` controls if module-color matching should be 
randomized to get a clearly differentiable pattern of the potentially many 
modules. Labels are only suitable in applications with few features or with a 
appropriately large pdf device.

```{r plot}
set.seed(123)
nb_plot_dendro(nb_summary = results, labels = FALSE, colorsrandom = TRUE)
```

Next the primary analysis on the aggregated dataset (`results$MEs`) can be 
computed.
We also implemented a convenience function to transfer a clustering to a new 
dataset. Here, we transfer the clustering to the same dataset resulting in 
identical aggregate measures.

```{r transfer}
    ME_transfer <- nb_transfer(nb_summary = results, 
    new_data = tcga_aml_meth_rna_chr18, scale = TRUE)
    all(round(results$MEs, 10) == round(ME_transfer, 10))
```

Netboost now also has a fully non-parametric implementation. Code is not run here to showcase the multicore option (Bioconductor vignette builder does not allow for multicore execution). Adjust `cores` to your machine:
```{r robust_netboost_run}
#results <- netboost(datan = tcga_aml_meth_rna_chr18,cores=10L,
#soft_power = 3L, min_cluster_size = 10L, n_pc = 2, qc_plot = FALSE,
#filter_method = "spearman", robust_PCs = TRUE, method = "spearman")
```

# Session Info
```{r sessionInfo}
sessionInfo()
warnings()
```