--- title: "Storing Mass Spectrometry Data in SQL Databases" output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{Storing Mass Spectrometry Data in SQL Databases} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignettePackage{MsBackendSql} %\VignetteDepends{MsBackendSql,BiocStyle,RSQLite,msdata,microbenchmark,mzR} --- ```{r style, echo = FALSE, results = 'asis', message=FALSE} BiocStyle::markdown() ``` **Package**: `r Biocpkg("MsBackendSql")`
**Authors**: `r packageDescription("MsBackendSql")[["Author"]] `
**Compiled**: `r date()` ```{r, echo = FALSE, message = FALSE} library(MsBackendSql) knitr::opts_chunk$set(echo = TRUE, message = FALSE) library(BiocStyle) ``` # Introduction The `r Biocpkg("Spectra")` Bioconductor package provides a flexible and expandable infrastructure for Mass Spectrometry (MS) data. The package supports interchangeable use of different *backends* that provide additional file support or different ways to store and represent MS data. The `r Biocpkg("MsBackendSql")` package provides backends to store data from whole MS experiments in SQL databases. The data in such databases can be easily (and efficiently) accessed using `Spectra` objects that use the `MsBackendSql` class as an interface to the data in the database. Such `Spectra` objects have a minimal memory footprint and hence allow analysis of very large data sets even on computers with limited hardware capabilities. For certain operations, the performance of this data representation is superior to that of other low-memory (*on-disk*) data representations such as `Spectra`'s `MsBackendMzR` backend. Finally, the `MsBackendSql` supports also remote data access to e.g. a central database server hosting several large MS data sets. # Installation The package can be installed with the `BiocManager` package. To install `BiocManager` use `install.packages("BiocManager")` and, after that, `BiocManager::install("MsBackendSql")` to install this package. # Creating and using `MsBackendSql` SQL databases `MsBackendSql` databases can be created either by importing (raw) MS data from MS data files using the `createMsBackendSqlDatabase` or using the `backendInitialize` function by providing in addition to the database connection also the full MS data to import as a `DataFrame`. In the first example we use the `createMsBackendSqlDatabase` function which takes a connection to an (empty) database and the names of the files from which the data should be imported as input parameters creates all necessary database tables and stores the full data into the database. Below we create an empty SQLite database (in a temporary file) and fill that with MS data from two mzML files (from the `r Biocpkg("msdata")` package). ```{r, message = FALSE, results = "hide"} library(RSQLite) dbfile <- tempfile() con <- dbConnect(SQLite(), dbfile) library(MsBackendSql) fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE) createMsBackendSqlDatabase(con, fls) ``` By default the m/z and intensity values are stored as *BLOB* data types in the database. This has advantages on the performance to extract peaks data from the database but would for example not allow to filter peaks by m/z values directly in the database. As an alternative it is also possible to the individual m/z and intensity values in separate rows of the database table. This *long table format* results however in considerably larger databases (with potentially poorer performance). Note also that the code and backend is optimized for MySQL/MariaDB databases by taking advantage of table partitioning and specialized table storage options. Any other SQL database server is however also supported (also portable, self-contained SQLite databases). The *MsBackendSql* package provides two backends to interact with such databases: the (default) `MsBackendSql` class and the `MsBackendOfflineSql`, that inherits all properties and functions from the former, but which does not store the connection to the database within the object but connects (and disconnects) to (and from) the database in each function call. This allows to use the latter also for parallel processing setups. To access the data in the database we create below a `Spectra` object providing the connection to the database in the constructor call and specifying to use the `MsBackendSql` as *backend* using the `source` parameter. ```{r} sps <- Spectra(con, source = MsBackendSql()) sps ``` `Spectra` objects allow also to change the backend to any other backend (extending `MsBackend`) using the `setBackend` function. Below we use this function to first load all data into memory by changing from the `MsBackendSql` to a `MsBackendMemory`. ```{r} sps_mem <- setBackend(sps, MsBackendMemory()) sps_mem ``` With this function it is also possible to change from any backend to a `MsBackendSql` in which case a new database is created and all data from the originating backend is stored in this database. We thus have to provide in addition also a connection to an (empty) database using the `dbcon` parameter. Below we create a new empty SQLite database and store all data from the `Spectra` object into this database using the `setBackend` method. ```{r, warnings = FALSE} tmpcon <- dbConnect(SQLite(), tempfile()) sps2 <- setBackend(sps_mem, MsBackendSql(), dbcon = tmpcon) sps2 ``` Similar to any other `Spectra` object we can retrieve the available *spectra variables* using the `spectraVariables` function. ```{r} spectraVariables(sps) ``` The MS peak data can be accessed using either the `mz`, `intensity` or `peaksData` functions. Below we extract the peaks matrix of the 5th spectrum and display the first 6 rows. ```{r} peaksData(sps)[[5]] |> head() ``` All data (peaks data or spectra variables) are **always** retrieved on the fly from the database resulting thus in a minimal memory footprint for the `Spectra` object. ```{r} print(object.size(sps), units = "KB") ``` The backend supports also adding additional spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum. ```{r} sps$rtime <- sps$rtime + 10 ``` Such operations do however **not** change the data in the database (which is always considered read-only) but are cached locally within the backend object (in memory). The size in memory of the object is thus higher after changing that spectra variable. ```{r} print(object.size(sps), units = "KB") ``` Such `$<-` operations can also be used to *cache* spectra variables (temporarily) in memory which can eventually improve performance. Below we test the time it takes to extract the MS level from each spectrum from the database, then cache the MS levels in memory using `$msLevel <-` and test the timing to extract these cached variable. ```{r} system.time(msLevel(sps)) sps$msLevel <- msLevel(sps) system.time(msLevel(sps)) ``` We can also use the `reset` function to *reset* the data to its original state (this will cause any local spectra variables to be deleted and the backend to be initialized with the original data in the database). ```{r} sps <- reset(sps) ``` To use the `MsBackendOfflineSql` backend we need to provide all information required to connect to the database along with the *database driver* to the `Spectra` function. Which parameters are required to connect to the database depends on the SQL database and the used driver. In our example the data is stored in a SQLite database, hence we use the `SQLite()` database driver and only need to provide the database name with the `dbname` parameter. For a MySQL/MariaDB database we would use the `MariaDB()` driver and would have to provide the database name, user name, password as well as the host name and port through which the database is accessible. ```{r} sps_off <- Spectra(SQLite(), dbname = dbfile, source = MsBackendOfflineSql()) sps_off ``` This backend provides the exact same functionality than `MsBackendSql` with the difference that the connection to the database is opened and closed for each function call. While this leads to a slightly lower performance, it allows to use the backend (and hence the `Spectra` object) also in a parallel processing setup. In contrast, for the `MsBackendSql` parallel processing is disabled since it is not possible to share the active backend connection within the object across different parallel processes. Below we compare the performance of the two backends. The performance difference is the result from opening and closing the database connection for each call. Note that this will also depend on the SQL server that is being used. For SQLite databases there is almost no overhead. ```{r} library(microbenchmark) microbenchmark(msLevel(sps), msLevel(sps_off)) ``` # Performance comparison with other backends The need to retrieve any spectra data on-the-fly from the database will have an impact on the performance of data access function of `Spectra` objects using the `MsBackendSql` backends. To evaluate its impact we next compare the performance of the `MsBackendSql` to other `Spectra` backends, specifically, the `MsBackendMzR` which is the default backend to read and represent raw MS data, and the `MsBackendMemory` backend that keeps all MS data in memory (and is thus not suggested for larger MS experiments). Similar to the `MsBackendMzR`, also the `MsBackendSql` keeps only a limited amount of data in memory. These *on-disk* backends need thus to retrieve spectra and MS peaks data on-the-fly from either the original raw data files (in the case of the `MsBackendMzR`) or from the SQL database (in the case of the `MsBackendSql`). The in-memory backend `MsBackendMemory` is supposed to provide the fastest data access since all data is kept in memory. Below we thus create `Spectra` objects from the same data but using the different backends. ```{r} sps <- Spectra(con, source = MsBackendSql()) sps_mzr <- Spectra(fls, source = MsBackendMzR()) sps_im <- setBackend(sps_mzr, backend = MsBackendMemory()) ``` At first we compare the memory footprint of the 3 backends. ```{r} print(object.size(sps), units = "KB") print(object.size(sps_mzr), units = "KB") print(object.size(sps_im), units = "KB") ``` The `MsBackendSql` has the lowest memory footprint of all 3 backends because it does not keep any data in memory. The `MsBackendMzR` keeps all spectra variables, except the MS peaks data, in memory and has thus a larger size. The `MsBackendMemory` keeps all data (including the MS peaks data) in memory and has thus the largest size in memory. Next we compare the performance to extract the MS level for each spectrum from the 4 different `Spectra` objects. ```{r} library(microbenchmark) microbenchmark(msLevel(sps), msLevel(sps_mzr), msLevel(sps_im)) ``` Extracting MS levels is thus slowest for the `MsBackendSql`, which is not surprising because both other backends keep this data in memory while the `MsBackendSql` needs to retrieve it from the database. We next compare the performance to access the full peaks data from each `Spectra` object. ```{r} microbenchmark(peaksData(sps, BPPARAM = SerialParam()), peaksData(sps_mzr, BPPARAM = SerialParam()), peaksData(sps_im, BPPARAM = SerialParam()), times = 10) ``` As expected, the `MsBackendMemory` has the fasted access to the full peaks data. The `MsBackendSql` outperforms however the `MsBackendMzR` providing faster access to the m/z and intensity values. Performance can be improved for the `MsBackendMzR` using parallel processing. Note that the `MsBackendSql` does **not support** parallel processing and thus parallel processing is (silently) disabled in functions such as `peaksData`. ```{r} m2 <- MulticoreParam(2) microbenchmark(peaksData(sps, BPPARAM = m2), peaksData(sps_mzr, BPPARAM = m2), peaksData(sps_im, BPPARAM = m2), times = 10) ``` We next compare the performance of subsetting operations. ```{r} microbenchmark(filterRt(sps, rt = c(50, 100)), filterRt(sps_mzr, rt = c(50, 100)), filterRt(sps_im, rt = c(50, 100))) ``` The two *on-disk* backends `MsBackendSql` and `MsBackendMzR` show a comparable performance for this operation. This filtering does involves access to a spectra variables (the retention time in this case) which, for the `MsBackendSql` needs first to be retrieved from the backend. The `MsBackendSql` backend allows however also to *cache* spectra variables (i.e. they are stored within the `MsBackendSql` object). Any access to such cached spectra variables can eventually be faster because no dedicated SQL query is needed. To evaluate the performance of a *pure* subsetting operation we first define the indices of 10 random spectra and subset the `Spectra` objects to these. ```{r} idx <- sample(seq_along(sps), 10) microbenchmark(sps[idx], sps_mzr[idx], sps_im[idx]) ``` Here the `MsBackendSql` outperforms the other backends because it does not keep any data in memory and hence does not need to subset these. The two other backends need to subset the data they keep in memory which is in both cases a data frame with either a reduced set of spectra variables or the full MS data. At last we compare also the extraction of the peaks data from the such subset `Spectra` objects. ```{r} sps_10 <- sps[idx] sps_mzr_10 <- sps_mzr[idx] sps_im_10 <- sps_im[idx] microbenchmark(peaksData(sps_10), peaksData(sps_mzr_10), peaksData(sps_im_10), times = 10) ``` The `MsBackendSql` outperforms the `MsBackendMzR` while, not unexpectedly, the `MsBackendMemory` provides fasted access. # Other properties of the `MsBackendSql` The `MsBackendSql` backend does not support parallel processing since the database connection can not be shared across the different (parallel) processes. Thus, all methods on `Spectra` objects that use a `MsBackendSql` will automatically (and silently) disable parallel processing even if a dedicated parallel processing setup was passed along with the `BPPARAM` method. Some functions on `Spectra` objects require to load the MS peak data (i.e., m/z and intensity values) into memory. For very large data sets (or computers with limited hardware resources) such function calls can cause out-of-memory errors. One example is the `lengths` function that determines the number of peaks per spectrum by loading the peak matrix first into memory. Such functions should ideally be called using the `peaksapply` function with parameter `chunkSize` (e.g., `peaksapply(sps, lengths, chunkSize = 5000L)`). Instead of processing the full data set, the data will be first split into chunks of size `chunkSize` that are stepwise processed. Hence, only data from `chunkSize` spectra is loaded into memory in one iteration. # Summary The `MsBackendSql` provides an MS data representations and storage mode with a minimal memory footprint (in R) that is still comparably efficient for standard processing and subsetting operations. This backend is specifically useful for very large MS data sets, that could even be hosted on remote (MySQL/MariaDB) servers. A potential use case for this backend could thus be to set up a central storage place for MS experiments with data analysts connecting remotely to this server to perform initial data exploration and filtering. After subsetting to a smaller data set of interest, users could then retrieve/download this data by changing the backend to e.g. a `MsBackendMemory`, which would result in a *download* of the full data to the user computer's memory. # Session information ```{r} sessionInfo() ```