---
title: "MSnbase IO capabilities"
author: 
- name: Laurent Gatto
  affiliation: Computational Proteomics Unit, Cambridge, UK.
package: MSnbase
abstract: >
  This vignette describes *MSnbase*'s input and output capabilities.
bibliography: MSnbase.bib
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{MSnbase IO capabilities}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Mass Spectrometry, Proteomics, Infrastructure }
  %\VignetteEncoding{UTF-8}
---

```{r env, echo=FALSE}
suppressPackageStartupMessages(library("BiocStyle"))
suppressPackageStartupMessages(library("MSnbase"))
suppressPackageStartupMessages(library("pRolocdata"))
```

```{r include_forword, echo=FALSE, results="asis"}
cat(readLines("./Foreword.md"), sep = "\n")
```

```{r include_bugs, echo=FALSE, results="asis"}
cat(readLines("./Bugs.md"), sep = "\n")
```


# Overview

`r Biocpkg("MSnbase")` aims to facilitate the reproducible
analysis of mass spectrometry data within the R environment, from raw
data import and processing to feature quantification and
statistical analysis of the results [@Gatto2012].  Data import
functions for several formats are provided, and intermediate or final
results can also be saved or exported.  These capabilities are
presented below.

# Data input

#### Raw data {-}

Data stored in one of the published `XML`-based formats, i.e. `mzXML`
[@Pedrioli2004], `mzData` [@Orchard2007] or `mzML` [@Martens2010], can
be imported with the `readMSData` method, which makes use of the 
`r Biocpkg("mzR")` package to create `MSnExp` objects.  The files can be
in profile or centroided mode.  See `?readMSData` for details.
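
As a minimal sketch, the small `mzXML` example file distributed in the
`extdata` directory of *MSnbase* (used here only for illustration) can
be imported as shown below. The `mode` argument chooses between loading
all spectra into memory and keeping the peak data on disk; the object
names are arbitrary.

```{r readRawSketch}
library("MSnbase")
## Example mzXML file distributed with MSnbase (illustration only)
rawFile <- dir(system.file("extdata", package = "MSnbase"),
               full.names = TRUE, pattern = "mzXML$")
## Load the spectra into memory
rawMem <- readMSData(rawFile, mode = "inMemory")
## Keep the peak data on disk and only access it when needed
rawDisk <- readMSData(rawFile, mode = "onDisk")
class(rawMem)
class(rawDisk)
```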
 
#### Peak lists {-}

Peak lists in the `mgf`
format^[http://www.matrixscience.com/help/data_file_help.html]
can be imported using the `readMgfData` function.  In this case, the peak data
has generally been pre-processed by other software.  See
`?readMgfData` for details.

#### Quantitation data {-}

Third party software can be used to generate quantitative data and
export it as a spreadsheet (generally in comma- or tab-separated format).
These data, as well as any additional meta-data, can be imported with the
`readMSnSet` function. See `?readMSnSet` for details.

`r Biocpkg("MSnbase")` also supports the `mzTab`
format^[https://github.com/HUPO-PSI/mzTab], a light-weight,
tab-delimited file format for proteomics data developed within the
Proteomics Standards Initiative (PSI).  `mzTab` files can be read into
R with `readMzTabData` to create an `MSnSet` instance.

![*MSnbase* input capabilities.  The white and red boxes represent R functions/methods and objects respectively.  The blue boxes represent different disk storage formats.](./Figures/MSnbase-io-in.png)

# Data output

#### RData files {-}

R objects can most easily be stored on disk with the `save` function.
It creates compressed binary images of the data representation that
can later be read back from the file with the `load` function.
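
The round trip can be sketched with a small stand-in matrix; the
object and file names below are made up for illustration, and the same
calls apply to `MSnExp` or `MSnSet` instances.

```{r saveLoadSketch}
## A small stand-in object; any R object is handled in the same way
quant <- matrix(1:6, nrow = 3,
                dimnames = list(paste0("prot", 1:3), c("s1", "s2")))
rdaFile <- tempfile(fileext = ".rda")
save(quant, file = rdaFile)
rm(quant)
## load() restores the object under its original name
load(rdaFile)
quant
```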

#### mzML/mzXML files {-}

`MSnExp` and `OnDiskMSnExp` objects can be written to MS data files in `mzML` or
`mzXML` format with the `writeMSData` method. See `?writeMSData` for details.
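
A minimal sketch of such an export, assuming the `mzXML` example file
distributed with *MSnbase* as input and a temporary file as output.
The argument values follow our reading of `?writeMSData`, which should
be consulted for the authoritative description.

```{r writeMSDataSketch}
library("MSnbase")
## Example mzXML file distributed with MSnbase (illustration only)
rawIn <- dir(system.file("extdata", package = "MSnbase"),
             full.names = TRUE, pattern = "mzXML$")
onDisk <- readMSData(rawIn, mode = "onDisk")
## Export to mzML; one output file name is needed per input file
mzmlOut <- tempfile(fileext = ".mzML")
writeMSData(onDisk, file = mzmlOut, outformat = "mzml")
file.exists(mzmlOut)
```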

#### Peak lists {-}

`MSnExp` instances as well as individual spectra can be written as
`mgf` files with the `writeMgfData` method. Note that the meta-data in
the original R object cannot be included in the file. See
`?writeMgfData` for details.
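
As an illustration, the `itraqdata` example object distributed with
*MSnbase* can be written to a temporary `mgf` file and read back with
`readMgfData`. This is only a sketch; the file name and comment string
are arbitrary.

```{r mgfRoundTrip}
library("MSnbase")
## Example MSnExp object distributed with MSnbase
data(itraqdata)
mgfFile <- tempfile(fileext = ".mgf")
## COM adds a free-text comment to the file
writeMgfData(itraqdata, con = mgfFile, COM = "MSnbase itraqdata")
## Read the peak list back in as a new MSnExp
itraqMgf <- readMgfData(mgfFile)
length(itraqdata) == length(itraqMgf)
```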

#### Quantitation data {-}

Quantitation data can be exported to spreadsheet files with the
`write.exprs` method. Feature meta-data can be appended to the feature
intensity values. See `?write.exprs` for details.
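
A short sketch using the `msnset` example data distributed with
*MSnbase*: the first two feature variables are appended to the
intensities and the result is written to a temporary file. The choice
of columns and the file name are arbitrary.

```{r writeExprsSketch}
library("MSnbase")
## Example MSnSet distributed with MSnbase
data(msnset)
tsvFile <- tempfile(fileext = ".tsv")
## Append the first two feature variables to the right of the intensities
write.exprs(msnset, fDataCols = fvarLabels(msnset)[1:2], file = tsvFile)
## The output is tab-separated by default
head(read.delim(tsvFile, row.names = 1), n = 3)
```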

**Deprecated** `MSnSet` instances can also be exported to `mzTab`
files using the `writeMzTabData` function.

![*MSnbase* output capabilities. The white and red boxes represent R functions/methods and objects respectively. The blue boxes represent different disk storage formats.](./Figures/MSnbase-io-out.png)
	

# Creating `MSnSet` from text spreadsheets

This section describes the generation of `MSnSet` objects using data
available in a text-based spreadsheet. This entry point into R and
`r Biocpkg("MSnbase")` allows users to import data processed by any of the
third party mass-spectrometry processing software available and
to proceed with data exploration, normalisation and statistical analysis
using functions available in R and the numerous Bioconductor
packages.

## A complete work flow

The following section describes a work flow that uses three input
files to create the `MSnSet`. These files respectively describe the
quantitative expression data, the sample meta-data and the feature
meta-data.  It is taken from the `r Biocpkg("pRoloc")` tutorial and
uses example files from the `r Biocexptpkg("pRolocdata")` package.

We start by reading the original `csv` file, from which the input
files are derived, using the `read.csv` function.

```{r readCsvData0}
## The original data for replicate 1, available
## from the pRolocdata package
f0 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, 
          pattern = "pr800866n_si_004-rep1.csv")
csv <- read.csv(f0)
```

The first three lines of the original spreadsheet, containing the data
for replicate one, are illustrated below (using the function
`head`). It contains `r nrow(csv)` rows (proteins) and `r ncol(csv)`
columns, including protein identifiers, database accession numbers,
gene symbols, reporter ion quantitation values and information related to
protein identification.

```{r showOrgCsv}
head(csv, n=3)
```

Below, we read in turn the spreadsheets that contain the quantitation
data (`exprsFile.csv`), feature meta-data (`fdataFile.csv`) and sample
meta-data (`pdataFile.csv`).

```{r readCsvData1}
## The quantitation data, from the original data
f1 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "exprsFile.csv")
exprsCsv <- read.csv(f1)
## Feature meta-data, from the original data
f2 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "fdataFile.csv")
fdataCsv <- read.csv(f2)
## Sample meta-data, a new file
f3 <- dir(system.file("extdata", package = "pRolocdata"), 
          full.names = TRUE, pattern = "pdataFile.csv")
pdataCsv <- read.csv(f3)
```



`exprsFile.csv` contains the quantitation (expression) data for the
`r nrow(exprsCsv)` proteins and 4 reporter tags.
  
```{r showExprsFile}
head(exprsCsv, n = 3)
```

`fdataFile.csv` contains meta-data for the `r nrow(fdataCsv)`
features (here proteins).

```{r showFdFile}
head(fdataCsv, n = 3)
```


`pdataFile.csv` contains sample (here fractions) meta-data. This
simple file has been created manually.
  
  
```{r showPdFile}
pdataCsv
```


The self-contained `MSnSet` can now easily be generated using the
`readMSnSet` constructor, providing the respective `csv` file names
shown above and specifying that the data is comma-separated (with `sep
= ","`). Below, we call that object `res` and display its content.

```{r makeMSnSet}
library("MSnbase")
res <- readMSnSet(exprsFile = f1,
                  featureDataFile = f2,
                  phenoDataFile = f3,
                  sep = ",")
res
```

### The `MSnSet` class

Although there are additional specific sub-containers for additional
meta-data (for instance to make the object MIAPE compliant), the
feature meta-data (the `featureData` slot) and sample meta-data (the
`phenoData` slot) are the most important ones. They need to meet the
following validity requirements (see figure below):

- the number of rows in the expression/quantitation data and feature
  data must be equal and the row names must match exactly, and

- the number of columns in the expression/quantitation data and number
  of rows in the sample meta-data must be equal and the column/row
  names must match exactly.

A detailed description of the `MSnSet` class is available by typing
`?MSnSet` in the R console.


![Dimension requirements for the respective expression, feature and sample meta-data slots.](./Figures/msnset.png)


The individual parts of this data object can be accessed with their respective accessor methods: 

- the quantitation data can be retrieved with `exprs(res)`,
- the feature meta-data with `fData(res)` and 
- the sample meta-data with `pData(res)`. 
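
The accessors and the validity requirements can be illustrated on a
toy object. The sketch below assumes that the `MSnSet()` constructor
accepts a matrix and plain data frames, as described in `?MSnSet`; all
names and values are made up.

```{r accessorsSketch}
library("MSnbase")
## Toy data: 3 features (rows) quantified in 2 samples (columns)
toyExprs <- matrix(c(10, 20, 30, 11, 21, 31), nrow = 3,
                   dimnames = list(paste0("prot", 1:3), c("s1", "s2")))
toyFdata <- data.frame(gene = c("A", "B", "C"),
                       row.names = rownames(toyExprs))
toyPdata <- data.frame(fraction = c(1, 2),
                       row.names = colnames(toyExprs))
toy <- MSnSet(exprs = toyExprs, fData = toyFdata, pData = toyPdata)
## Feature names are shared by the expression and feature data ...
identical(rownames(exprs(toy)), rownames(fData(toy)))
## ... and sample names by the expression and sample data
identical(colnames(exprs(toy)), rownames(pData(toy)))
dim(toy)
```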


## A shorter work flow

The `readMSnSet2` function provides a simplified import workflow.  It
takes a single spreadsheet as input (default is `csv`) and extracts the
columns identified by `ecol` to create the expression data, while the
others are used as feature meta-data. `ecol` can be a `character` with
the respective column labels or a `numeric` with their indices. In the
former case, it is important to make sure that the names match
exactly: special characters like `'-'` or `'('` will be transformed by
R into `'.'` when the `csv` file is read in.  Optionally, one can also
specify a column to be used as feature names.  Note that these must be
unique to guarantee the final object validity.

```{r readMSnSet2}
ecol <- paste("area", 114:117, sep = ".")
fname <- "Protein.ID"
eset <- readMSnSet2(f0, ecol, fname)
eset
```
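
The transformation of special characters in column names can be
previewed with base R's `make.names`, which `read.csv` applies to the
header by default (`check.names = TRUE`). The headers below are
hypothetical and only serve as an illustration.

```{r makeNamesSketch}
## Hypothetical column headers as they might appear in a spreadsheet
hdr <- c("Protein ID", "area-114", "area(115)")
## Names as R would store them after read.csv()
make.names(hdr)
## Reading a small in-memory csv shows the same effect
toyCsv <- read.csv(text = "Protein ID,area-114\nP1,0.5")
names(toyCsv)
```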
 

The `ecol` columns can also be queried interactively from R using the
`getEcols` and `grepEcols` functions. The former returns a character
vector with all column names, given a splitting character, i.e. the
separation value of the spreadsheet (typically `","` for `csv`, `"\t"`
for `tsv`, ...). The latter can be used to grep a pattern of interest
to obtain the relevant column indices.

```{r ecols}
getEcols(f0, ",")
grepEcols(f0, "area", ",")
e <- grepEcols(f0, "area", ",")
readMSnSet2(f0, e)
```

The `phenoData` slot can now be updated accordingly using the
replacement functions `phenoData<-` or `pData<-` (see `?MSnSet` for
details).
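
A minimal sketch of such an update on a toy object, again assuming
that `MSnSet()` accepts plain data frames (see `?MSnSet`). The
`reporter` and `fraction` variables are invented for the example.

```{r pDataSketch}
library("MSnbase")
## Toy MSnSet with 2 features and 2 samples
toyQuant <- matrix(c(1.5, 2.5, 3.5, 4.5), nrow = 2,
                   dimnames = list(c("prot1", "prot2"),
                                   c("area.114", "area.115")))
toySet <- MSnSet(exprs = toyQuant,
                 fData = data.frame(id = rownames(toyQuant),
                                    row.names = rownames(toyQuant)),
                 pData = data.frame(tag = colnames(toyQuant),
                                    row.names = colnames(toyQuant)))
## Add a single sample variable ...
pData(toySet)$reporter <- c(114, 115)
## ... or replace the whole table; row names must match the sample names
pData(toySet) <- data.frame(reporter = c(114, 115),
                            fraction = c("F1", "F2"),
                            row.names = sampleNames(toySet))
pData(toySet)
```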


# Session information

```{r}
sessionInfo()
```

# References {-}