---
title: "MSstatsPTM LabelFree Workflow"
author: "Devon Kohler (<kohler.d@northeastern.edu>)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MSstatsPTM LabelFree Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=8, 
  fig.height=8
)
```

This vignette provides an example workflow for how to use the package 
`MSstatsPTM` for a mass spectrometry based proteomics experiment acquired with 
a labelfree acquisition methods, such as DDA/DIA/SRM/PRM. It also provides 
examples and an analysis of how adjusting for global protein levels allows for 
better interpretations of PTM modeling results.

## Installation

To install this package, start R (version "4.0") and enter:

``` {r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstatsPTM")
```

```{r, message=FALSE, warning=FALSE}
library(MSstatsPTM)
library(MSstats)
```

## 1. Workflow

### 1.1 Raw Data Format

**Note: We are actively developing dedicated converters for `MSstatsPTM`. If you 
have data from a processing tool that does not have a dedicated converter in 
MSstatsPTM please add a github issue 
`https://github.com/Vitek-Lab/MSstatsPTM/issues` and we will add the 
converter.**

The first step is to load in the raw dataset for both the PTM and Protein 
datasets. Each dataset can formatted using dedicated converters in `MSstatsPTM`,
such as `FragePipetoMSstatsPTMFormat`. If there is not a dedicated converter 
for your tool in `MSstatsPTM`, you can alternatively leverage converters from 
base `MSstats`. If using converters from `MSstats` note they will need to be 
run both on the global protein and PTM datasets.

You might notice a FASTA file is also needed for some converters. This
FASTA file can be obtained by querying
[Uniprot](https://www.uniprot.org/id-mapping) with all of the protein
IDs present in your PTM dataset. The FASTA file is a necessary input
because some tools (e.g. MaxQuant) do not report the specific amino acid
that is modified relative to the whole *protein* sequence. Rather, they
report the specific amino acid relative to the reported *peptide*. This
distinction is important because modifications like phosphorylation,
methylation, or acetylation often have specific roles depending on where
they occur within the full-length protein. With the help of a FASTA
file, MSstatsPTM can determine the specific amino acid that is modified
in the context of the whole protein sequence.

Please note for the PTM dataset, both the protein and modification site (or 
peptide), must be added into the `ProteinName` column. This allows for the 
package to summarize to the peptide level, and avoid the off chance there are 
matching peptides between proteins. For an example of how this can be done 
please see the code below.

The output of the converter is a list with two formatted data.tables. One each 
for the PTM and Protein datasets. If a global profiling run was not performed, 
the Protein data.table will just be `NULL`

#### 1.1.1 MaxQuant - `MaxQtoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for MaxQuant output. Experiments can
be acquired with label-free or TMT labeling methods. No matter the experiment, 
we recommend using the `evidence.txt` file, and not a PTM specific file such as 
the `Phospho (STY).txt` file. The `MaxQtoMSstatsPTMFormat` converter includes a
variety of parameters. Examples for processing a TMT and LF dataset can be seen 
below.

``` {r maxq, eval=TRUE}
# TMT experiment
head(maxq_tmt_evidence)
head(maxq_tmt_annotation)
 
msstats_format_tmt = MaxQtoMSstatsPTMFormat(evidence=maxq_tmt_evidence,
                                    annotation=maxq_tmt_annotation,
                                    fasta=system.file("extdata", 
                                                      "maxq_tmt_fasta.fasta", 
                                                      package="MSstatsPTM"),
                                    fasta_protein_name="uniprot_ac",
                                    mod_id="\\(Phospho \\(STY\\)\\)",
                                    use_unmod_peptides=TRUE,
                                    labeling_type = "TMT",
                                    which_proteinid_ptm = "Proteins")

head(msstats_format_tmt$PTM)
head(msstats_format_tmt$PROTEIN)

# LF experiment
head(maxq_lf_evidence)
head(maxq_lf_annotation)

msstats_format_lf = MaxQtoMSstatsPTMFormat(evidence=maxq_lf_evidence,
                                     annotation=maxq_lf_annotation,
                                     fasta=system.file("extdata", 
                                                       "maxq_lf_fasta.fasta", 
                                                       package="MSstatsPTM"),
                                     fasta_protein_name="uniprot_ac",
                                     mod_id="\\(Phospho \\(STY\\)\\)",
                                     use_unmod_peptides=TRUE,
                                     labeling_type = "LF",
                                     which_proteinid_ptm = "Proteins")

head(msstats_format_lf$PTM)
head(msstats_format_lf$PROTEIN)

```

#### 1.1.2 FragPipe - `FragPipetoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for FragPipe output. Experiments 
must be acquired with TMT labeling methods (a label-free converter is currently 
in development).The input is the `msstats.csv` and annotation files 
automatically generated by FragPipe. FragPipe provides additional info on the 
localization of modification sites, and the `FragPipetoMSstatsPTMFormat` 
converter includes localization options that are not present in other 
converters.

``` {r fragpipe, eval=TRUE}
head(fragpipe_input)
head(fragpipe_annotation)
head(fragpipe_input_protein)
head(fragpipe_annotation_protein)

msstats_data = FragPipetoMSstatsPTMFormat(fragpipe_input,
                                          fragpipe_annotation,
                                          fragpipe_input_protein, 
                                          fragpipe_annotation_protein,
                                          mod_id_col = "STY",
                                          localization_cutoff=.75,
                                          remove_unlocalized_peptides=TRUE)
head(msstats_data$PTM)
head(msstats_data$PROTEIN)
```





#### 1.1.3 Proteome Discoverer - `PDtoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for Proteome Discoverer output. 
Experiments can be acquired with label-free or TMT labeling methods. The input 
is the `psm` file and a custom built annotation file. Proteome Discoverer 
provides additional info on the localization of modification sites, and the 
`PDtoMSstatsPTMFormat` converter includes localization options that are not 
present in other converters.

``` {r pd, eval=TRUE}
head(pd_psm_input)
head(pd_annotation)

msstats_format = PDtoMSstatsPTMFormat(pd_psm_input, 
                                 pd_annotation, 
                                 system.file("extdata", "pd_fasta.fasta", 
                                             package="MSstatsPTM"),
                                 use_unmod_peptides=TRUE, 
                                 which_proteinid = "Master.Protein.Accessions")

head(msstats_format$PTM)
head(msstats_format$PROTEIN)
```


#### 1.1.4 Spectronaut - `SpectronauttoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for Spectronaut output. 
Experiments can be acquired with label-free labeling methods.

``` {r Spectronaut, eval=TRUE}
head(spectronaut_input)
head(spectronaut_annotation)

msstats_input = SpectronauttoMSstatsPTMFormat(spectronaut_input, 
                  annotation=spectronaut_annotation, 
                  fasta_path=system.file("extdata", "spectronaut_fasta.fasta", 
                                         package="MSstatsPTM"),
                  use_unmod_peptides=TRUE,
                  mod_id = "\\[Phospho \\(STY\\)\\]",
                  fasta_protein_name = "uniprot_iso"
                  )
 
head(msstats_input$PTM)
head(msstats_input$PROTEIN)

```


#### 1.1.5 Skyline - `SkylinetoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for Skyline output. 
Experiments can be acquired with label-free labeling methods.


#### 1.1.6 Peak Studio - `PStoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for Peak Studio output. 
Experiments can be acquired with label-free labeling methods.

#### 1.1.6 Progenesis - `ProgenesistoMSstatsPTMFormat`

`MSstatsPTM` includes a dedicated converter for Progenesis output. 
Experiments can be acquired with label-free labeling methods.

#### 1.1.7 Additional tools

If there is not a dedicated `MSstatsPTM` converter for a processing tool, 
the existing converters in `MSstats` and `MSstatsTMT` converters can be used as 
described below. 

Note, in order to do this, it is critical that the ProteinName column be a 
combination of the Protein Name and modification site.

As an example, if you would like to analyze an experiment processed with DIANN, 
you can leverage the `DIANNtoMSstatsFormat` in `MSstats`. Given two datasets, 
named `raw_ptm_df` and `raw_protein_df`, and an annotation file, we can process 
the data as follows.

```{r raw_data, eval = FALSE}
# Add site into ProteinName column
raw_ptm_df$ProteinName = paste(raw_ptm_df$ProteinName,
                                raw_ptm_df$Site, sep = "_")

# Run MSstats Converters
PTM_data = MSstats::DIANNtoMSstatsFormat(raw_ptm_df, annotation)
PROTEIN_data = MSstats::DIANNtoMSstatsFormat(raw_protein_df, annotation)

# Combine into one list
msstatsptm_input_data = list(PTM = PTM_data, PROTEIN = PROTEIN_data)
```

The variable `msstatsptm_input_data` can now be used as the input to the 
remainder of the `MSstatsPTM` processing pipeline.

### 1.2 Summarization - `dataSummarizationPTM`

After loading in the input data, the next step is to use the 
`dataSummarizationPTM` function. This provides the summarized dataset needed to 
model the protein/PTM abundance. The function will summarize the 
Protein dataset up to the protein level and will summarize the PTM dataset up to
the peptide level. There are multiple options for normalization and missing 
value imputation. These options should be reviewed in the package documentation.

```{r summarize, message=FALSE, warning=FALSE}

MSstatsPTM.summary = dataSummarizationPTM(raw.input, verbose = FALSE, 
                                          use_log_file = FALSE, append = FALSE)

head(MSstatsPTM.summary$PTM$ProteinLevelData)
head(MSstatsPTM.summary$PROTEIN$ProteinLevelData)
```

The summarize function returns a list with PTM and Protein summarization 
information. Each PTM and Protein include a list of data.tables: `FeatureLevelData`
is a data.table of reformatted input of dataSummarizationPTM, `ProteinLevelData` is 
the run level summarization data.

### 1.2.1 QCPlot

Once summarized, MSstatsPTM provides multiple plots to analyze the experiment. 
Here we show the quality control boxplot. The first plot shows the modified data
and the second plot shows the global protein dataset.

```{r qcplot, message=FALSE, warning=FALSE}
dataProcessPlotsPTM(MSstatsPTM.summary,
                    type = 'QCPLOT',
                    which.PTM = "allonly",
                    address = FALSE)
```

### 1.2.2 Profile Plot

Here we show a profile plot. Again the top plot shows the modified peptide, and 
the bottom shows the overall protein.

```{r profileplot, message=FALSE, warning=FALSE}

dataProcessPlotsPTM(MSstatsPTM.summary,
                    type = 'ProfilePlot',
                    which.Protein = "Q9Y6C9",
                    address = FALSE)
```

### 1.3 Modeling - `groupComparisonPTM`

After summarization, the summarized datasets can be modeled using the 
`groupComparisonPTM` function. This function will model the PTM and Protein 
summarized datasets, and then adjust the PTM model for changes in overall 
protein abundance. The output of the function is a list containing these three 
models named: `PTM.Model`, `PROTEIN.Model`, `ADJUSTED.Model`.

```{r model, message=FALSE, warning=FALSE}

# Specify contrast matrix
comparison = matrix(c(-1,0,1,0),nrow=1)
row.names(comparison) = "CCCP-Ctrl"
colnames(comparison) = c("CCCP", "Combo", "Ctrl", "USP30_OE")

MSstatsPTM.model = groupComparisonPTM(MSstatsPTM.summary, 
                                      data.type = "LabelFree",
                                      contrast.matrix = comparison,
                                      use_log_file = FALSE, append = FALSE,
                                      verbose = FALSE)
head(MSstatsPTM.model$PTM.Model)
head(MSstatsPTM.model$PROTEIN.Model)
head(MSstatsPTM.model$ADJUSTED.Model)
```

### 1.3.1 Volcano Plot

The models from the `groupComparisonPTM` function can be used in the model 
visualization function, `groupComparisonPlotsPTM`. Here we show Volcano Plots 
for the models.

``` {r volcano, message=FALSE, warning=FALSE}
groupComparisonPlotsPTM(data = MSstatsPTM.model,
                        type = "VolcanoPlot",
                        FCcutoff= 2,
                        logBase.pvalue = 2,
                        address=FALSE)
```


### 1.3.2 Heatmap Plot

Here we show a Heatmap for the models.

``` {r heatmap, message=FALSE, warning=FALSE}
groupComparisonPlotsPTM(data = MSstatsPTM.model,
                        type = "Heatmap",
                        which.PTM = 1:30,
                        address=FALSE)
```

### 1.4 Sample Size Calculation - `designSampleSizePTM`

Finally, sample size calculation can be performed using the output of the model 
and the `designSampleSizePTM`

```{r sample_size, message=FALSE, warning=FALSE}

# Specify contrast matrix
sample_size = designSampleSizePTM(MSstatsPTM.model, c(2.0, 2.75), FDR = 0.05, 
                                  numSample = TRUE, power = 0.8)

head(sample_size)
```

### 1.4.1 Sample Size Plot

The output of the sample size function can be plotted using the `MSstats` 
`designSampleSizePlots` function.

```{r sample_size_plot, message=FALSE, warning=FALSE}

MSstats::designSampleSizePlots(sample_size)

```