---
title: "Using rgoslin to parse and normalize lipid nomenclature"
shorttitle: "Using rgoslin"
author: 
- name: Nils Hoffmann
  affiliation: Forschungszentrum JÃ¼lich, Institute for Bio- and Geosciences, IBG-5, Bielefeld, Germany
date: 24 March 2022
output:
    BiocStyle::html_document:
package: rgoslin
vignette: >
    %\VignetteIndexEntry{Using R Goslin to parse and normalize lipid nomenclature}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
---
# Introduction

This project is a parser, validator and normalizer implementation for shorthand lipid nomenclatures, base on the Grammar of Succinct Lipid Nomenclatures project.

[Goslin](https://github.com/lifs-tools/goslin) defines multiple grammars for different sources of shorthand lipid nomenclature. This allows to generate parsers based on the defined grammars, which provide immediate feedback whether a processed lipid shorthand notation string is compliant with a particular grammar, or not.

> **_NOTE:_**  Please report any issues you might find to help improve it!

Here, rgoslin 2.0 uses the Goslin grammars and the cppgoslin parser to support the following general tasks:

1. Facilitate the parsing of shorthand lipid names dialects.
2. Provide a structural representation of the shorthand lipid after parsing.
3. Use the structural representation to generate normalized names, following the latest shorthand nomenclature.

## Related Projects

- [Goslin grammars and reference test files](http://github.com/lifs-tools/goslin)
- [C++ implementation](https://github.com/lifs-tools/cppgoslin)
- [C# implementation](https://github.com/lifs-tools/csgoslin)
- [Java implementation](https://github.com/lifs-tools/jgoslin)
- [Python implementation](https://github.com/lifs-tools/pygoslin)
- [Webapplication and REST API](https://github.com/lifs-tools/goslin-webapp)

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r include=FALSE, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
library(rgoslin)
library(dplyr)
library(knitr)
library(kableExtra)
})
scrollBoxWidth <- "650px"
```
## Supported nomenclatures

- [LIPID MAPS](https://www.lipidmaps.org)
- [SwissLipids](https://www.swisslipids.org/)
- Updated Shorthand nomenclature [2020 Update](https://pubmed.ncbi.nlm.nih.gov/33037133/),[2013 Version](https://pubmed.ncbi.nlm.nih.gov/23549332/)
- [HMDB](https://hmdb.ca/)
- [FattyAcids](https://iupac.qmul.ac.uk/lipid/appABC.html#appA)

## Changes from Version 1

- The fragment grammar parser has been removed but may be re-added in the future. Please use our [GitHub Issues](https://github.com/lifs-tools/goslin/issues) if you need this parser.
- The column names of the R implementation have been changed to contain "." instead of " " (space). This makes handling and access easier, since quoting is no longer necessary.

# Installation

The package can be installed with the `BiocManager` package as follows:

```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("rgoslin")
```

# Example use cases

In order to use the provided translation functions of rgoslin, you first need to load the library. If rgoslin is not yet available, please follow the instructions in the previous section on "Installation".
```{r, eval = FALSE}
library(rgoslin)
```

If you want to check, which grammars are supported, use the following command:
```{r}
listAvailableGrammars()
```

To check, whether a given lipid name can be parsed by any of the parsers, you can use the `isValidLipidName` method. It will return `TRUE` if the given name can be parsed by any of the available parsers and `FALSE` if the name was not parseable.
```{r}
isValidLipidName("PC 32:1")
```

## Parsing a single lipid name

Using `parseLipidNames` with a lipid name returns a data frame of properties of the parsed lipid name as columns.
```{r}
df <- parseLipidNames("PC 32:1")
```


```{r echo = FALSE, results = 'asis'}
kable(df %>% select(-starts_with("FA"),-starts_with("LCB")), caption = "Lipid name parsing results for PC 32:1, FA and LCB columns omitted, since they are unpopulated (`NA`) on the lipid species level.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "150px")
```

If you want to set the grammar to parse against manually, this is also possible as the second argument:

```{r}
originalName <- "TG(16:1(5E)/18:0/20:2(3Z,6Z))"
tagDf <- parseLipidNames(originalName, grammar = "LipidMaps")
```

```{r echo = FALSE, results = 'asis'}
kable(tagDf %>% select(-starts_with("FA"),-starts_with("LCB")), caption = "Lipid name parsing results for TG isomeric subspecies, FA and LCB columns omitted for brevity.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "200px")
```

```{r echo = FALSE, results = 'asis'}
kable(tagDf %>% select(Normalized.Name, starts_with("FA"),starts_with("LCB")), caption = "Lipid name parsing results for TG isomeric subspecies with FA and LCB columns.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "200px")
```

## Parsing multiple lipid names

If you want to parse multiple lipid names, use the `parseLipidNames` method with a vector of lipid names. This returns a data frame of properties of the parsed lipid names with one row per lipid.

> **_NOTE:_** Omitting the grammar argument will test all available parsers, the first one to successfully parse the name  wins. This will consequently take longer than explicitly setting the grammar to select the parser for.

```{r}
multipleLipidNamesDf <- parseLipidNames(c("PC 32:1","LPC 34:1","TG(18:1_18:0_16:1)"))
```

```{r echo = FALSE, results = 'asis'}
kable(multipleLipidNamesDf %>% select(-starts_with("FA"),-starts_with("LCB")), caption = "Lipid name parsing results for PC 32:1, LPC 34:1, TG(18:1_18:0_16:1), FA and LCB columns omitted for brevity.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```
```{r echo = FALSE, results = 'asis'}
kable(multipleLipidNamesDf %>% select(Normalized.Name, starts_with("FA"), starts_with("LCB")), caption = "Lipid name parsing results for PC 32:1, LPC 34:1, TG(18:1_18:0_16:1) with FA and LCB columns.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```

Finally, if you want to parse multiple lipid names and want to use one particular grammar, simply add its name as the "grammar" argument.

```{r}
originalNames <- c("PC 32:1","LPC 34:1","TAG 18:1_18:0_16:1")
multipleLipidNamesWithGrammar <- parseLipidNames(originalNames, grammar = "Goslin")
```

```{r echo = FALSE, results = 'asis'}
kable(multipleLipidNamesWithGrammar %>% select(-starts_with("FA"),-starts_with("LCB")), caption = "Lipid name parsing results for Goslin grammar and lipids PC 32:1, LPC 34:1, TG(18:1_18:0_16:1), FA and LCB columns omitted for brevity.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```

```{r echo = FALSE, results = 'asis'}
kable(multipleLipidNamesWithGrammar %>% select(Normalized.Name, starts_with("FA"), starts_with("LCB")), caption = "Lipid name parsing results for Goslin grammar and lipids PC 32:1, LPC 34:1, TG(18:1_18:0_16:1) with FA and LCB columns.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```

## Parsing IUPAC-compliant Fatty Acid Names

LIPID MAPS has a number of fatty acids that use names following the IUPAC-IUB Fatty Acids nomenclature. These can now also be parsed and converted to the updated lipid shorthand nomenclature. We are using [LMFA01020216](https://www.lipidmaps.org/databases/lmsd/LMFA01020216) and [LMFA08040030](https://www.lipidmaps.org/databases/lmsd/LMFA08040030) as examples here:

```{r}
originalNames <- c("LMFA01020216"="5-methyl-octadecanoic acid", "LMFA08040030"="N-((+/-)-8,9-dihydroxy-5Z,11Z,14Z-eicosatrienoyl)-ethanolamine")
normalizedFattyAcidsNames <- parseLipidNames(originalNames, "FattyAcids")
```

```{r echo = FALSE, results = 'asis'}
kable(normalizedFattyAcidsNames %>% select(-starts_with(c("LCB","FA2","FA3","FA4","Adduct", "Adduct.Charge"))), caption = "Lipid name parsing results for Goslin grammar and fatty acids 5-methyl-octadecanoic acid N-((+/-)-8,9-dihydroxy-5Z,11Z,14Z-eicosatrienoyl)-ethanolamine, some columns omitted for brevity.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```

## Parsing Adducts

The Goslin parser also support reading of lipid shorthand names with adducts:

```{r}
originalNames <- c("PC 32:1[M+H]1+", "PC 32:1 [M+H]+","PC 32:1")
lipidNamesWithAdduct <- parseLipidNames(originalNames, "Goslin")
```

This will populate the columns "Adduct" and "AdductCharge" with the respective values. Please note that we recommend to use the adduct and its charge in full IUPAC recommended nomenclature: 

```{r echo = FALSE, results = 'asis'}
kable(lipidNamesWithAdduct %>% select(-starts_with("FA"),-starts_with("LCB")), caption = "Lipid name parsing results for Goslin grammar and lipids PC 32:1[M+H]1+, PC 32:1 [M+H]+ and PC 32:1, FA and LCB columns omitted for brevity.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```

## Using rgoslin with lipidr

[lipidr](https://bioconductor.org/packages/release/bioc/html/lipidr.html) is a Bioconductor package with specific support for QC checking, uni- and multivariate analysis and visualization of lipidomics data acquired with Skyline or from metabolomics workbench. It uses a custom implementation for lipid name handling that does not yet support the updated shorthand nomenclature.

```{r, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("lipidr")
```

After installation of lipidr, we need to load the following libraries for this example:

```{r, eval = FALSE}
library(rgoslin)
library(lipidr)
library(stringr)
library(ggplot2)
```
```{r include=FALSE, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
library(lipidr)
library(stringr)
library(ggplot2)
})
```
We will use [lipidr's workflow example](https://bioconductor.org/packages/release/bioc/vignettes/lipidr/inst/doc/workflow.html) to illustrate how to apply rgoslin in such a use-case. We read in an example dataset exported from Skyline based on a targeted lipidomics experiment:

```{r}
datadir = system.file("extdata", package="lipidr")
filelist = list.files(datadir, "data.csv", full.names = TRUE) # all csv files
d = read_skyline(filelist)
clinical_file = system.file("extdata", "clin.csv", package="lipidr")
d = add_sample_annotation(d, clinical_file)
```

This dataset contains the lipid names in the `Molecule` column and the name preprocessed with `lipidr` in the `clean_name` column. In this excerpt, you can see `PE 34:1 NEG` in row 5, which indicates measurement of this lipid in negative mode. Please note that this is specific to this example and not a generally applied naming convention. 

```{r echo = FALSE, results = 'asis'}
kable(rowData(d[1:10, 1:10]), caption = "Subset of first ten rows of row data.") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", font_size = 7)) %>% scroll_box(width = scrollBoxWidth, height = "300px")
```
Now, let's try to parse the clean lipid names to enrich the data table.

> **_Note_**: In this case, we expect to see error messages, since some lipid names use a) unsupported head group names or b) unsupported suffixes to indicate isotopically labeled lipids.

```{r}
lipidNames <- parseLipidNames(rowData(d)$clean_name)
```

We see that lipidr's example dataset uses the `(d9)` suffix to indicate isotopically labeled lipids (d=Deuterium) that are used as internal standards for quantification. 
We need to convert `Sa1P`, sphinganine-1-phosphate, to `SPBP`, as well as `So1P`, sphingosine-1-phosphate to align with recent nomenclature updates. 

```{r}
# lipidr stores original lipid names in the Molecule column
old_names <- rowData(d)$Molecule
# split lipid prefix from potential (d7) suffix for labeled internal standards
new_names <- rowData(d)$clean_name %>% str_match(pattern="([\\w :-]+)(\\(\\w+\\))?")
# extract the first match group (the original word is at column index 1)
normalized_new_names <- new_names[,2] %>% str_replace_all(c("Sa1P"="SPBP","So1P"="SPBP")) %>% parseLipidNames(.)
```

We will receive a number of warnings in the next step, since lipidr currently checks lipid class names against a predefined, internal list that does not contain the updated class names according to the new shorthand nomenclature. The updated Molecule names will however appear in downstream visualizations.

```{r}
updated <- update_molecule_names(d, old_names, normalized_new_names$Normalized.Name)
```

We can augment the class column using rgoslin's LipidMaps main class, since row order in rgoslin's output is the same as in its input. 
Additionally, further information may be interesting to include, such as the mass and sum formula of the uncharged lipids. 
We will select the same lipid classes as in the lipidr targeted lipidomics vignette: Ceramides (Cer), Lyso-Phosphatidylcholines (LPC) and Phosphatidylcholines (PC).

```{r}
rowData(updated)$Class <- normalized_new_names$Lipid.Maps.Main.Class
rowData(updated)$Category <- normalized_new_names$Lipid.Maps.Category
rowData(updated)$Molecule <- normalized_new_names$Normalized.Name
rowData(updated)$LipidSpecies <- normalized_new_names$Species.Name
rowData(updated)$Mass <- normalized_new_names$Mass
rowData(updated)$SumFormula <- normalized_new_names$Sum.Formula
# select Ceramides, Lyso-Phosphatidylcholines and Phosphatidylcholines (includes plasmanyls and plasmenyls)
lipid_classes <- rowData(updated)$Class %in% c("Cer","LPC", "PC")
d <- updated[lipid_classes,]
```

In the next step, we use a non-exported method provided by lipidr to convert the row data into a format more suitable for plotting with ggplot. We will plot the area distribution of lipid species as boxplots, colored by lipid class and facetted by filename, similar to some plot examples in lipidr's vignette.

```{r}
ddf <- lipidr:::to_long_format(d)
ggplot(data=ddf, mapping=aes(x=Molecule, y=Area, fill=Class)) + geom_boxplot() + facet_wrap(~filename, scales = "free_y") + scale_y_log10() + coord_flip()
```

In the same manner, other columns of lipidr's `rowData` can be updated with columns from rgoslin's output data frame. Further downstream processing is then possible with lipidr's own functions.

# Getting help & support
If you find any issues with the library, would like to have other functionality included, require help or would like to contribute, please contact us via our [GitHub Project](https://github.com/lifs-tools/rgoslin) or via the [LIFS support page](https://lifs-tools.org/support.html).

# Session information {.unnumbered}

```{r sessioninfo, echo=FALSE}
sessionInfo()
```