---
title: "metabinR"
author: 
-   name: "Anestis Gkanogiannis"
    email: anestis@gkanogiannis.com
package: metabinR
output: 
    BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{metabinR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# About metabinR

The goal of metabinR is to provide functions for performing abundance and 
composition based binning on metagenomic samples, directly from FASTA or 
FASTQ files.

Abundance based binning is performed by analyzing sequences with long kmers 
(k>8), whereas composition based binning is performed by utilizing short kmers 
(k<8).

# Installation

To install `metabinR` package:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("metabinR")
```

# Preparation

## Allocate RAM and load required libraries

In order to allocate RAM, a special parameter needs to be passed while JVM 
initializes. JVM parameters can be passed by setting `java.parameters` option.
The `-Xmx` parameter, followed (without space) by an integer value and a 
letter, is used to tell JVM what is the maximum amount of heap RAM that it can
use. The letter in the parameter (uppercase or lowercase), indicates RAM units.
For example, parameters `-Xmx1024m` or `-Xmx1024M` or `-Xmx1g` or `-Xmx1G`, all
allocate 1 Gigabyte or 1024 Megabytes of maximum RAM for JVM.

```{r, eval=TRUE, message=FALSE}
options(java.parameters="-Xmx1500M")
unloadNamespace("metabinR")
library(metabinR)
library(data.table)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(cvms)
library(sabre)
```

# Abundance based binning example

In this example we use the simulated metagenome sample (see sample data) to 
perform abundance based binning.
The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) 
that have been sampled from 10 bacterial genomes in such a way 
(log-norm abundances) that each read is belongs to one of two abundance classes
(class 1 of high abundant taxa and class 2 of low abundant taxa).

We first get the abundance information for the simulated metagenome :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
abundances <- read.table(
    system.file("extdata", "distribution_0.txt",package = "metabinR"),
    col.names = c("genome_id", "abundance" ,"AB_id"))
```
In `abundances` data.frame, column `genome_id` is the bacterial genome id, 
column `abundance` is the abundance ratio 
and column `AB_id` is the original abundance class (in this example 1 or 2).

Then we get the read mapping information (from which bacterial genome 
each read is originating from and in which abundance class belongs) :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
reads.mapping <- fread(system.file("extdata", "reads_mapping.tsv.gz",
                                   package = "metabinR")) %>%
    merge(abundances[, c("genome_id","AB_id")], by = "genome_id") %>%
    arrange(anonymous_read_id)
```
In `reads.mapping` data.frame, column `anonymous_read_id` is the read id, 
column `genome_id` is the original bacterial genome id and 
column `AB_id` is the original abundance class id.

We perform Abundance based Binning on the simulated reads, 
for 2 abundance classes and analyzing data with 10-mers.
The call returns a dataframe of the assigned abundance cluster 
and distances to all clusters for each read :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
assignments.AB <- abundance_based_binning(
        system.file("extdata","reads.metagenome.fasta.gz", package="metabinR"),
        numOfClustersAB = 2, 
        kMerSizeAB = 10, 
        dryRun = FALSE, 
        outputAB = "vignette") %>%
    arrange(read_id)
```
Note that read id of fasta header 
matches `anonymous_read_id` of `reads.mapping`.

Call to \link[metabinR]{abundance_based_binning} will produce 2 fasta file, 
one for each of the abundance classes, containing fasta reads assigned to each 
class.
It will also produce a file containing histogram information of kmers counted.
We can plot this histogram as :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
histogram.AB <- read.table("vignette__AB.histogram.tsv", header = TRUE)
ggplot(histogram.AB, aes(x=counts, y=frequency)) + 
    geom_area() +
    labs(title = "kmer counts histogram") + 
    theme_bw()
```

We get the assigned abundance class for each read in `assignments.AB$AB`

Then we evaluate predicted abundance class and plot confusion matrix :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
eval.AB.cvms <- cvms::evaluate(data = data.frame(
                                    prediction=as.character(assignments.AB$AB),
                                    target=as.character(reads.mapping$AB_id),
                                    stringsAsFactors = FALSE),
                                target_col = "target",
                                prediction_cols = "prediction",
                                type = "binomial"
)
eval.AB.sabre <- sabre::vmeasure(as.character(assignments.AB$AB),
                                as.character(reads.mapping$AB_id))

p <- cvms::plot_confusion_matrix(eval.AB.cvms) +
        labs(title = "Confusion Matrix", 
                x = "Target Abundance Class", 
                y = "Predicted Abundance Class")
tab <- as.data.frame(
    c(
        Accuracy =  round(eval.AB.cvms$Accuracy,4),
        Specificity =  round(eval.AB.cvms$Specificity,4),
        Sensitivity =  round(eval.AB.cvms$Sensitivity,4),
        Fscore =  round(eval.AB.cvms$F1,4),
        Kappa =  round(eval.AB.cvms$Kappa,4),
        Vmeasure = round(eval.AB.sabre$v_measure,4)
    )
)
grid.arrange(p, ncol = 1)
knitr::kable(tab, caption = "AB binning evaluation", col.names = NULL)
```

# Composition based binning example

In a similar way, we analyze the simulated metagenome sample 
with the Composition based Binning module.

The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) 
that have been sampled from 10 bacterial genomes.
The originating bacteria genome is therefore the true class information of
each read in this example.

We first get the read mapping information (from which bacterial genome each 
read is originating from) :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
reads.mapping <- fread(
        system.file("extdata", "reads_mapping.tsv.gz",package = "metabinR")) %>%
    arrange(anonymous_read_id)
```
In `reads.mapping` data.frame, column `anonymous_read_id` is the read id and 
column `genome_id` is the original bacterial genome id.

We perform Composition based Binning on the simulated reads, 
for 10 composition classes (one for each bacterial genome) and analyzing data 
with 6-mers.
The call returns a dataframe of the assigned composition cluster 
and distances to all clusters for each read :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
assignments.CB <- composition_based_binning(
        system.file("extdata","reads.metagenome.fasta.gz",package ="metabinR"),
        numOfClustersCB = 10, 
        kMerSizeCB = 4, 
        dryRun = TRUE, 
        outputCB = "vignette") %>%
    arrange(read_id)
```
Note that read id of fasta header 
matches `anonymous_read_id` of `reads.mapping`.

Since this is a clustering problem, it only makes sense to calculate `Vmeasure` 
and other an extrinsic measures like `Homogeneity` and `completeness`.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
eval.CB.sabre <- sabre::vmeasure(as.character(assignments.CB$CB),
                                as.character(reads.mapping$genome_id))
tab <- as.data.frame(
    c(
        Vmeasure = round(eval.AB.sabre$v_measure,4),
        Homogeneity = round(eval.AB.sabre$homogeneity,4),
        Completeness = round(eval.AB.sabre$completeness,4)
    )
)
knitr::kable(tab, caption = "CB binning evaluation", col.names = NULL)
```

# Hierarchical (2step ABxCB) binning example

Finally, we analyze the simulated metagenome sample 
with the Hierarchical Binning module.

The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) 
that have been sampled from 10 bacterial genomes.
The originating bacteria genome is therefore the true class information of
each read in this example.

We first get the read mapping information (from which bacterial genome each 
read is originating from) :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
reads.mapping <- fread(
        system.file("extdata", "reads_mapping.tsv.gz",package = "metabinR")) %>%
    arrange(anonymous_read_id)
```
In `reads.mapping` data.frame, column `anonymous_read_id` is the read id and 
column `genome_id` is the original bacterial genome id.

We perform Hierarchical Binning on the simulated reads, 
for initially 2 abundance classes.
Data is analyzed with 10-mers for the AB part 
and with 4-mers for the following CB part.
The call returns a dataframe of the assigned final hierarchical cluster (ABxCB) 
and distances to all clusters for each read :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
assignments.ABxCB <- hierarchical_binning(
        system.file("extdata","reads.metagenome.fasta.gz",package ="metabinR"),
        numOfClustersAB = 2,
        kMerSizeAB = 10,
        kMerSizeCB = 4, 
        dryRun = TRUE, 
        outputC = "vignette") %>%
    arrange(read_id)
```
Note that read id of fasta header 
matches `anonymous_read_id` of `reads.mapping`.

Calculate `Vmeasure` 
and other an extrinsic measures like `Homogeneity` and `completeness`.
```{r, eval=TRUE, message=FALSE, warning=FALSE}
eval.ABxCB.sabre <- sabre::vmeasure(as.character(assignments.ABxCB$ABxCB),
                                    as.character(reads.mapping$genome_id))
tab <- as.data.frame(
    c(
        Vmeasure = round(eval.ABxCB.sabre$v_measure,4),
        Homogeneity = round(eval.ABxCB.sabre$homogeneity,4),
        Completeness = round(eval.ABxCB.sabre$completeness,4)
    )
)
knitr::kable(tab, caption = "ABxCB binning evaluation", col.names = NULL)
```

Clean files :
```{r, eval=TRUE, message=FALSE, warning=FALSE}
unlink("vignette__*")
```

# Session Info
```{r setup}
utils::sessionInfo()
```