--- title: "metabinR" author: - name: "Anestis Gkanogiannis" email: anestis@gkanogiannis.com package: metabinR output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{metabinR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # About metabinR The goal of metabinR is to provide functions for performing abundance and composition based binning on metagenomic samples, directly from FASTA or FASTQ files. Abundance based binning is performed by analyzing sequences with long kmers (k>8), whereas composition based binning is performed by utilizing short kmers (k<8). # Installation To install `metabinR` package: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("metabinR") ``` # Preparation ## Allocate RAM and load required libraries In order to allocate RAM, a special parameter needs to be passed while JVM initializes. JVM parameters can be passed by setting `java.parameters` option. The `-Xmx` parameter, followed (without space) by an integer value and a letter, is used to tell JVM what is the maximum amount of heap RAM that it can use. The letter in the parameter (uppercase or lowercase), indicates RAM units. For example, parameters `-Xmx1024m` or `-Xmx1024M` or `-Xmx1g` or `-Xmx1G`, all allocate 1 Gigabyte or 1024 Megabytes of maximum RAM for JVM. ```{r, eval=TRUE, message=FALSE} options(java.parameters="-Xmx1500M") unloadNamespace("metabinR") library(metabinR) library(data.table) library(dplyr) library(ggplot2) library(gridExtra) library(cvms) library(sabre) ``` # Abundance based binning example In this example we use the simulated metagenome sample (see sample data) to perform abundance based binning. The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) that have been sampled from 10 bacterial genomes in such a way (log-norm abundances) that each read is belongs to one of two abundance classes (class 1 of high abundant taxa and class 2 of low abundant taxa). We first get the abundance information for the simulated metagenome : ```{r, eval=TRUE, message=FALSE, warning=FALSE} abundances <- read.table( system.file("extdata", "distribution_0.txt",package = "metabinR"), col.names = c("genome_id", "abundance" ,"AB_id")) ``` In `abundances` data.frame, column `genome_id` is the bacterial genome id, column `abundance` is the abundance ratio and column `AB_id` is the original abundance class (in this example 1 or 2). Then we get the read mapping information (from which bacterial genome each read is originating from and in which abundance class belongs) : ```{r, eval=TRUE, message=FALSE, warning=FALSE} reads.mapping <- fread(system.file("extdata", "reads_mapping.tsv.gz", package = "metabinR")) %>% merge(abundances[, c("genome_id","AB_id")], by = "genome_id") %>% arrange(anonymous_read_id) ``` In `reads.mapping` data.frame, column `anonymous_read_id` is the read id, column `genome_id` is the original bacterial genome id and column `AB_id` is the original abundance class id. We perform Abundance based Binning on the simulated reads, for 2 abundance classes and analyzing data with 10-mers. The call returns a dataframe of the assigned abundance cluster and distances to all clusters for each read : ```{r, eval=TRUE, message=FALSE, warning=FALSE} assignments.AB <- abundance_based_binning( system.file("extdata","reads.metagenome.fasta.gz", package="metabinR"), numOfClustersAB = 2, kMerSizeAB = 10, dryRun = FALSE, outputAB = "vignette") %>% arrange(read_id) ``` Note that read id of fasta header matches `anonymous_read_id` of `reads.mapping`. Call to \link[metabinR]{abundance_based_binning} will produce 2 fasta file, one for each of the abundance classes, containing fasta reads assigned to each class. It will also produce a file containing histogram information of kmers counted. We can plot this histogram as : ```{r, eval=TRUE, message=FALSE, warning=FALSE} histogram.AB <- read.table("vignette__AB.histogram.tsv", header = TRUE) ggplot(histogram.AB, aes(x=counts, y=frequency)) + geom_area() + labs(title = "kmer counts histogram") + theme_bw() ``` We get the assigned abundance class for each read in `assignments.AB$AB` Then we evaluate predicted abundance class and plot confusion matrix : ```{r, eval=TRUE, message=FALSE, warning=FALSE} eval.AB.cvms <- cvms::evaluate(data = data.frame( prediction=as.character(assignments.AB$AB), target=as.character(reads.mapping$AB_id), stringsAsFactors = FALSE), target_col = "target", prediction_cols = "prediction", type = "binomial" ) eval.AB.sabre <- sabre::vmeasure(as.character(assignments.AB$AB), as.character(reads.mapping$AB_id)) p <- cvms::plot_confusion_matrix(eval.AB.cvms) + labs(title = "Confusion Matrix", x = "Target Abundance Class", y = "Predicted Abundance Class") tab <- as.data.frame( c( Accuracy = round(eval.AB.cvms$Accuracy,4), Specificity = round(eval.AB.cvms$Specificity,4), Sensitivity = round(eval.AB.cvms$Sensitivity,4), Fscore = round(eval.AB.cvms$F1,4), Kappa = round(eval.AB.cvms$Kappa,4), Vmeasure = round(eval.AB.sabre$v_measure,4) ) ) grid.arrange(p, ncol = 1) knitr::kable(tab, caption = "AB binning evaluation", col.names = NULL) ``` # Composition based binning example In a similar way, we analyze the simulated metagenome sample with the Composition based Binning module. The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) that have been sampled from 10 bacterial genomes. The originating bacteria genome is therefore the true class information of each read in this example. We first get the read mapping information (from which bacterial genome each read is originating from) : ```{r, eval=TRUE, message=FALSE, warning=FALSE} reads.mapping <- fread( system.file("extdata", "reads_mapping.tsv.gz",package = "metabinR")) %>% arrange(anonymous_read_id) ``` In `reads.mapping` data.frame, column `anonymous_read_id` is the read id and column `genome_id` is the original bacterial genome id. We perform Composition based Binning on the simulated reads, for 10 composition classes (one for each bacterial genome) and analyzing data with 6-mers. The call returns a dataframe of the assigned composition cluster and distances to all clusters for each read : ```{r, eval=TRUE, message=FALSE, warning=FALSE} assignments.CB <- composition_based_binning( system.file("extdata","reads.metagenome.fasta.gz",package ="metabinR"), numOfClustersCB = 10, kMerSizeCB = 4, dryRun = TRUE, outputCB = "vignette") %>% arrange(read_id) ``` Note that read id of fasta header matches `anonymous_read_id` of `reads.mapping`. Since this is a clustering problem, it only makes sense to calculate `Vmeasure` and other an extrinsic measures like `Homogeneity` and `completeness`. ```{r, eval=TRUE, message=FALSE, warning=FALSE} eval.CB.sabre <- sabre::vmeasure(as.character(assignments.CB$CB), as.character(reads.mapping$genome_id)) tab <- as.data.frame( c( Vmeasure = round(eval.AB.sabre$v_measure,4), Homogeneity = round(eval.AB.sabre$homogeneity,4), Completeness = round(eval.AB.sabre$completeness,4) ) ) knitr::kable(tab, caption = "CB binning evaluation", col.names = NULL) ``` # Hierarchical (2step ABxCB) binning example Finally, we analyze the simulated metagenome sample with the Hierarchical Binning module. The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) that have been sampled from 10 bacterial genomes. The originating bacteria genome is therefore the true class information of each read in this example. We first get the read mapping information (from which bacterial genome each read is originating from) : ```{r, eval=TRUE, message=FALSE, warning=FALSE} reads.mapping <- fread( system.file("extdata", "reads_mapping.tsv.gz",package = "metabinR")) %>% arrange(anonymous_read_id) ``` In `reads.mapping` data.frame, column `anonymous_read_id` is the read id and column `genome_id` is the original bacterial genome id. We perform Hierarchical Binning on the simulated reads, for initially 2 abundance classes. Data is analyzed with 10-mers for the AB part and with 4-mers for the following CB part. The call returns a dataframe of the assigned final hierarchical cluster (ABxCB) and distances to all clusters for each read : ```{r, eval=TRUE, message=FALSE, warning=FALSE} assignments.ABxCB <- hierarchical_binning( system.file("extdata","reads.metagenome.fasta.gz",package ="metabinR"), numOfClustersAB = 2, kMerSizeAB = 10, kMerSizeCB = 4, dryRun = TRUE, outputC = "vignette") %>% arrange(read_id) ``` Note that read id of fasta header matches `anonymous_read_id` of `reads.mapping`. Calculate `Vmeasure` and other an extrinsic measures like `Homogeneity` and `completeness`. ```{r, eval=TRUE, message=FALSE, warning=FALSE} eval.ABxCB.sabre <- sabre::vmeasure(as.character(assignments.ABxCB$ABxCB), as.character(reads.mapping$genome_id)) tab <- as.data.frame( c( Vmeasure = round(eval.ABxCB.sabre$v_measure,4), Homogeneity = round(eval.ABxCB.sabre$homogeneity,4), Completeness = round(eval.ABxCB.sabre$completeness,4) ) ) knitr::kable(tab, caption = "ABxCB binning evaluation", col.names = NULL) ``` Clean files : ```{r, eval=TRUE, message=FALSE, warning=FALSE} unlink("vignette__*") ``` # Session Info ```{r setup} utils::sessionInfo() ```