metabinR 1.6.0
The goal of metabinR is to provide functions for performing abundance and composition based binning on metagenomic samples, directly from FASTA or FASTQ files.
Abundance based binning is performed by analyzing sequences with long kmers (k>8), whereas composition based binning is performed by utilizing short kmers (k<8).
To install metabinR package:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("metabinR")In order to allocate RAM, a special parameter needs to be passed while JVM
initializes. JVM parameters can be passed by setting java.parameters option.
The -Xmx parameter, followed (without space) by an integer value and a
letter, is used to tell JVM what is the maximum amount of heap RAM that it can
use. The letter in the parameter (uppercase or lowercase), indicates RAM units.
For example, parameters -Xmx1024m or -Xmx1024M or -Xmx1g or -Xmx1G, all
allocate 1 Gigabyte or 1024 Megabytes of maximum RAM for JVM.
options(java.parameters="-Xmx1500M")
unloadNamespace("metabinR")
library(metabinR)
library(data.table)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(cvms)
library(sabre)In this example we use the simulated metagenome sample (see sample data) to perform abundance based binning. The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) that have been sampled from 10 bacterial genomes in such a way (log-norm abundances) that each read is belongs to one of two abundance classes (class 1 of high abundant taxa and class 2 of low abundant taxa).
We first get the abundance information for the simulated metagenome :
abundances <- read.table(
    system.file("extdata", "distribution_0.txt",package = "metabinR"),
    col.names = c("genome_id", "abundance" ,"AB_id"))In abundances data.frame, column genome_id is the bacterial genome id,
column abundance is the abundance ratio
and column AB_id is the original abundance class (in this example 1 or 2).
Then we get the read mapping information (from which bacterial genome each read is originating from and in which abundance class belongs) :
reads.mapping <- fread(system.file("extdata", "reads_mapping.tsv.gz",
                                   package = "metabinR")) %>%
    merge(abundances[, c("genome_id","AB_id")], by = "genome_id") %>%
    arrange(anonymous_read_id)In reads.mapping data.frame, column anonymous_read_id is the read id,
column genome_id is the original bacterial genome id and
column AB_id is the original abundance class id.
We perform Abundance based Binning on the simulated reads, for 2 abundance classes and analyzing data with 10-mers. The call returns a dataframe of the assigned abundance cluster and distances to all clusters for each read :
assignments.AB <- abundance_based_binning(
        system.file("extdata","reads.metagenome.fasta.gz", package="metabinR"),
        numOfClustersAB = 2, 
        kMerSizeAB = 10, 
        dryRun = FALSE, 
        outputAB = "vignette") %>%
    arrange(read_id)Note that read id of fasta header
matches anonymous_read_id of reads.mapping.
Call to will produce 2 fasta file, one for each of the abundance classes, containing fasta reads assigned to each class. It will also produce a file containing histogram information of kmers counted. We can plot this histogram as :
histogram.AB <- read.table("vignette__AB.histogram.tsv", header = TRUE)
ggplot(histogram.AB, aes(x=counts, y=frequency)) + 
    geom_area() +
    labs(title = "kmer counts histogram") + 
    theme_bw()We get the assigned abundance class for each read in assignments.AB$AB
Then we evaluate predicted abundance class and plot confusion matrix :
eval.AB.cvms <- cvms::evaluate(data = data.frame(
                                    prediction=as.character(assignments.AB$AB),
                                    target=as.character(reads.mapping$AB_id),
                                    stringsAsFactors = FALSE),
                                target_col = "target",
                                prediction_cols = "prediction",
                                type = "binomial"
)
eval.AB.sabre <- sabre::vmeasure(as.character(assignments.AB$AB),
                                as.character(reads.mapping$AB_id))
p <- cvms::plot_confusion_matrix(eval.AB.cvms) +
        labs(title = "Confusion Matrix", 
                x = "Target Abundance Class", 
                y = "Predicted Abundance Class")
tab <- as.data.frame(
    c(
        Accuracy =  round(eval.AB.cvms$Accuracy,4),
        Specificity =  round(eval.AB.cvms$Specificity,4),
        Sensitivity =  round(eval.AB.cvms$Sensitivity,4),
        Fscore =  round(eval.AB.cvms$F1,4),
        Kappa =  round(eval.AB.cvms$Kappa,4),
        Vmeasure = round(eval.AB.sabre$v_measure,4)
    )
)
grid.arrange(p, ncol = 1)knitr::kable(tab, caption = "AB binning evaluation", col.names = NULL)| Accuracy | 0.8700 | 
| Specificity | 0.9058 | 
| Sensitivity | 0.7608 | 
| Fscore | 0.7430 | 
| Kappa | 0.6560 | 
| Vmeasure | 0.3553 | 
In a similar way, we analyze the simulated metagenome sample with the Composition based Binning module.
The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) that have been sampled from 10 bacterial genomes. The originating bacteria genome is therefore the true class information of each read in this example.
We first get the read mapping information (from which bacterial genome each read is originating from) :
reads.mapping <- fread(
        system.file("extdata", "reads_mapping.tsv.gz",package = "metabinR")) %>%
    arrange(anonymous_read_id)In reads.mapping data.frame, column anonymous_read_id is the read id and
column genome_id is the original bacterial genome id.
We perform Composition based Binning on the simulated reads, for 10 composition classes (one for each bacterial genome) and analyzing data with 6-mers. The call returns a dataframe of the assigned composition cluster and distances to all clusters for each read :
assignments.CB <- composition_based_binning(
        system.file("extdata","reads.metagenome.fasta.gz",package ="metabinR"),
        numOfClustersCB = 10, 
        kMerSizeCB = 4, 
        dryRun = TRUE, 
        outputCB = "vignette") %>%
    arrange(read_id)Note that read id of fasta header
matches anonymous_read_id of reads.mapping.
Since this is a clustering problem, it only makes sense to calculate Vmeasure
and other an extrinsic measures like Homogeneity and completeness.
eval.CB.sabre <- sabre::vmeasure(as.character(assignments.CB$CB),
                                as.character(reads.mapping$genome_id))
tab <- as.data.frame(
    c(
        Vmeasure = round(eval.AB.sabre$v_measure,4),
        Homogeneity = round(eval.AB.sabre$homogeneity,4),
        Completeness = round(eval.AB.sabre$completeness,4)
    )
)
knitr::kable(tab, caption = "CB binning evaluation", col.names = NULL)| Vmeasure | 0.3553 | 
| Homogeneity | 0.3514 | 
| Completeness | 0.3594 | 
Finally, we analyze the simulated metagenome sample with the Hierarchical Binning module.
The simulated metagenome contains 26664 Illumina reads (13332 pairs of 2x150bp) that have been sampled from 10 bacterial genomes. The originating bacteria genome is therefore the true class information of each read in this example.
We first get the read mapping information (from which bacterial genome each read is originating from) :
reads.mapping <- fread(
        system.file("extdata", "reads_mapping.tsv.gz",package = "metabinR")) %>%
    arrange(anonymous_read_id)In reads.mapping data.frame, column anonymous_read_id is the read id and
column genome_id is the original bacterial genome id.
We perform Hierarchical Binning on the simulated reads, for initially 2 abundance classes. Data is analyzed with 10-mers for the AB part and with 4-mers for the following CB part. The call returns a dataframe of the assigned final hierarchical cluster (ABxCB) and distances to all clusters for each read :
assignments.ABxCB <- hierarchical_binning(
        system.file("extdata","reads.metagenome.fasta.gz",package ="metabinR"),
        numOfClustersAB = 2,
        kMerSizeAB = 10,
        kMerSizeCB = 4, 
        dryRun = TRUE, 
        outputC = "vignette") %>%
    arrange(read_id)Note that read id of fasta header
matches anonymous_read_id of reads.mapping.
Calculate Vmeasure
and other an extrinsic measures like Homogeneity and completeness.
eval.ABxCB.sabre <- sabre::vmeasure(as.character(assignments.ABxCB$ABxCB),
                                    as.character(reads.mapping$genome_id))
tab <- as.data.frame(
    c(
        Vmeasure = round(eval.ABxCB.sabre$v_measure,4),
        Homogeneity = round(eval.ABxCB.sabre$homogeneity,4),
        Completeness = round(eval.ABxCB.sabre$completeness,4)
    )
)
knitr::kable(tab, caption = "ABxCB binning evaluation", col.names = NULL)| Vmeasure | 0.2830 | 
| Homogeneity | 0.4722 | 
| Completeness | 0.2021 | 
Clean files :
unlink("vignette__*")utils::sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
#>  [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
#>  [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
#>  [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
#>  [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
#> [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] sabre_0.4.3       cvms_1.6.1        gridExtra_2.3     ggplot2_3.5.1    
#> [5] dplyr_1.1.4       data.table_1.15.4 metabinR_1.6.0    BiocStyle_2.32.0 
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5        xfun_0.43           bslib_0.7.0        
#>  [4] raster_3.6-26       rJava_1.0-11        lattice_0.22-6     
#>  [7] yulab.utils_0.1.4   vctrs_0.6.5         tools_4.4.0        
#> [10] generics_0.1.3      tibble_3.2.1        proxy_0.4-27       
#> [13] fansi_1.0.6         highr_0.10          pkgconfig_2.0.3    
#> [16] R.oo_1.26.0         KernSmooth_2.23-22  ggnewscale_0.4.10  
#> [19] ggplotify_0.1.2     checkmate_2.3.1     RColorBrewer_1.1-3 
#> [22] lifecycle_1.0.4     compiler_4.4.0      farver_2.1.1       
#> [25] munsell_0.5.1       tinytex_0.50        terra_1.7-71       
#> [28] codetools_0.2-20    ggimage_0.3.3       ggfun_0.1.4        
#> [31] htmltools_0.5.8.1   class_7.3-22        sass_0.4.9         
#> [34] yaml_2.3.8          pillar_1.9.0        jquerylib_0.1.4    
#> [37] tidyr_1.3.1         R.utils_2.12.3      classInt_0.4-10    
#> [40] cachem_1.0.8        entropy_1.3.1       magick_2.8.3       
#> [43] tidyselect_1.2.1    digest_0.6.35       sf_1.0-16          
#> [46] purrr_1.0.2         bookdown_0.39       labeling_0.4.3     
#> [49] rsvg_2.6.0          fastmap_1.1.1       grid_4.4.0         
#> [52] colorspace_2.1-0    cli_3.6.2           magrittr_2.0.3     
#> [55] utf8_1.2.4          e1071_1.7-14        withr_3.0.0        
#> [58] scales_1.3.0        backports_1.4.1     sp_2.1-4           
#> [61] rmarkdown_2.26      R.methodsS3_1.8.2   memoise_2.0.1      
#> [64] evaluate_0.23       knitr_1.46          gridGraphics_0.5-1 
#> [67] rlang_1.1.3         Rcpp_1.0.12         glue_1.7.0         
#> [70] DBI_1.2.2           BiocManager_1.30.22 pROC_1.18.5        
#> [73] jsonlite_1.8.8      R6_2.5.1            plyr_1.8.9         
#> [76] fs_1.6.4            units_0.8-5