---
title: "A transfer learning algorithm for spatial proteomics"
author:
- name: Lisa M. Breckels
  affiliation: Cambridge Centre for Proteomics, University of Cambridge, UK
- name: Laurent Gatto
  affiliation: de Duve Institute, UCLouvain, Belgium
package: pRoloc
abstract: >
  This vignette illustrates the application of a *transfer learning*
  algorithm to assign proteins to sub-cellular localisations. The
  *knntlClassification* algorithm combines *primary* experimental
  spatial proteomics data (LOPIT, PCP, etc.)  and an *auxiliary* data
  set (for example binary data based on Gene Ontology terms) to
  improve the sub-cellular assignment given an optimal combination of
  these data sources.
output:
  BiocStyle::html_document:
   toc_float: true
bibliography: pRoloc.bib
vignette: >
  %\VignetteIndexEntry{A transfer learning algorithm for spatial proteomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteKeywords{Bioinformatics, Machine learning, Organelle, Spatial Proteomics}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r env, include=FALSE, echo=FALSE, cache=FALSE}
library("knitr")
opts_chunk$set(stop_on_error = 1L)
suppressPackageStartupMessages(library("MSnbase"))
suppressWarnings(suppressPackageStartupMessages(library("pRoloc")))
suppressPackageStartupMessages(library("pRolocdata"))
suppressPackageStartupMessages(library("class"))
set.seed(1)
setStockcol(NULL)
```

# Introduction {#sec:intro}

Our main data source to study protein sub-cellular localisation are
high-throughput mass spectrometry-based experiments such as LOPIT, PCP
and similar designs (see [@Gatto2010] for an general
introduction). Recent optimised experiments result in high quality
data enabling the identification of over 6000 proteins and
discriminate numerous sub-cellular and sub-organellar niches
[@Christoforou:2016]. Supervised and semi-supervised machine learning
algorithms can be applied to assign thousands of proteins to annotated
sub-cellular niches [@Breckels2013,Gatto:2014] (see also the
*pRoloc-tutorial* vignette). These data constitute our main
source for protein localisation and are termed thereafter
*primary* data.

There are other sources of data about sub-cellular localisation of
proteins, such as the Gene Ontology [@Ashburner:2000] (in
particular the cellular compartment name space), quantitative features
derived from protein sequences (such as pseudo amino acid composition)
or the Human Protein Atlas [@Uhlen:2010] to cite a few. These
data, while not optimised to a specific system at hand and, in the
case of annotation features, whilst not as reliable as our experimental
data, constitute an invaluable, often plentiful source of *auxiliary*
information.

The aim of a *transfer learning* algorithm is to combine
different sources of data to improve overall classification. In
particular, the goal is to support/complement the primary target
domain (experimental data) with auxiliary data (annotation) features
without compromising the integrity of our primary data. In this
vignette, we describe the application of transfer learning algorithms
for the localisation of proteins from the `r Biocpkg("pRoloc")` package, as
described in

> Breckels LM, Holden S, Wonjar D, Mulvey CM, Christoforou A, Groen A,
> Trotter MW, Kohlbacker O, Lilley KS and Gatto L (2016). *Learning
> from heterogeneous data sources: an application in spatial
> proteomics*. PLoS Comput Biol 13;12(5):e1004920. doi:
> [10.1371/journal.pcbi.1004920](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004920).

Two algorithms were developed: a transfer learning algorithm based on
the $k$-nearest neighbour classifier, coined kNN-TL hereafter,
described in section \@ref(sec:knntl), and one based on the support
vector machine algorithm, termed SVM-TL, described in section
\@ref(sec:svmtl).

```{r loadpkg}
library("pRoloc")
```

# Preparing the data

To run the family of transfer learning algorithms in `pRoloc` 
two datasets are required;

(1) A primary dataset, constructed as an `MSnSet`,  
(2) an auxiliary dataset, constructed as an `MSnSet`, and

also a set of common protein markers.

Example/test datasets are readily available in the 
`r Biocexptpkg("pRolocdata")` package. We examine the structure of 
these datasets in the subsequent sections.

# Primary data {#sec:prim}

We will not go into detail on how to construct a `MSnSet` from protein
correlation profiling data. We refer users to the main *proloc-tutorial*
vignette. We load a precomputed `MSnSet` called `andy2011`. This dataset was a
proof-of-concept dataset from 2011, for the use of LOPIT with an adherent
mammalian cell culture. Human embryonic kidney fibroblast cells (HEK293T) were
used and LOPIT was employed with 8-plex iTRAQ reagents, thus returning eight
values per protein profile within a single labelling experiment. Nuclei were
discarded at an early stage in the fractionation scheme as previously described,
and membranes were not carbonate washed in order to retain peripheral membrane
and lumenal proteins for analysis.

```{r}
library("pRolocdata")
data("andy2011")
```

If we first look at the LOPIT data we see we have 1371 proteins and 8 columns
of quantitation data.

```{r}
andy2011
```

The quantitation information is stored in the `exprs` slot.

```{r}
head(exprs(andy2011))
```

The protein markers are found in the `"markers"` column of the `fData`.

```{r}
getMarkers(andy2011, fcol = "markers.tl")
```

# Auxiliary data {#sec:aux}

Auxiliary data can be generated from a number of data sources and we give a few
examples of some common sources in the proceeding sections. Whatever source of
information is used the data must be constructed as an `MSnSet` to use with the
`knntl` algorithms.

## The Gene Ontology {#sec:goaux}

Gene Ontology (GO) terms can be used as a auxiliary source of information
(providing they were not used to inform the proteins markers annotation). In
Breckels et al 2016 the auxiliary data was prepared from the primary data's
features i.e. the proteins in the dataset. All the GO terms associated to these
proteins are retrieved and used to create a binary matrix where a one (zero) at
position $(i,j)$ indicates that term $j$ has (not) been used to annotate feature
$i$.

GO terms can be retrieved from an appropriate repository, for example, using the `r
Biocpkg("biomaRt")` package. The specific Biomart repository and query will
depend on the species under study and the type of features. See, the 
[vignettes/documentation](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) 
section of the R Bioconductor `biomaRt` package.

The `r Biocexptpkg("pRolocdata")` package contains five example Gene Ontology
auxiliary datasets that were used in [@Breckels:2016]. In the below code chunk
the associated GO dataset to the `andy2011` dataset, this is called
`andy2011goCC`.

```{r loaddata}
data("andy2011goCC")
andy2011goCC
```

We see for the this GO set we have 1371 proteins and 569 columns. We pull just
the first 10 rows and 10 columns below to show the binary structure of the data.

```{r}
dim(andy2011goCC)
exprs(andy2011goCC)[1:10, 1:10]
```

We note the we have the same proteins (and in the same order) in both datasets.

```{r}
all(featureNames(andy2011) == featureNames(andy2011goCC))

head(featureNames(andy2011))
head(featureNames(andy2011goCC))
```

### A note on reproducibility {#sec:annotrepro}

The pulling of auxiliary data such as GO utilises online servers, which undergo
regular updates, does not guarantee reproducibility of feature/term association
over time. It is always recommended to save and store the data noting software
version numbers and dates. In addition to the `r Biocpkg("biomaRt")` package
other Bioconductor infrastructure is available, such as specific organism
annotations and the `r Biocannopkg("GO.db")` package to use specific versioned
(and thus traceable) annotations.

## The Human Protein Atlas {#sec:hpaaux}

The Human Protein Atlas (HPA) provides another source of auxiliary data for the
prediction of a proteins location from correlation data. The datasets
`andy2011hpa` was constructed (with HPA, version 13, released on 11/06/2014) and
we again load this directly from the `pRolocdata` package. 

```{r}
data("andy2011hpa")
andy2011
```

Similar to the structure of the GO `MSnSet`, this auxiliary data, was encoded as
a binary matrix describing the localisation of 670 proteins in 18 sub-cellular
localisations. Information for 192 of the 381 labelled marker proteins were
available.

## Protein-protein interactions {#sec:ppi}

Protein-protein interaction data can also be used as auxiliary data
input to the transfer learning algorithm. Several sources can be used
to do so directly from R:

* The `r Biocpkg("PSICQUIC")` package provides an R interfaces to the
  HUPO Proteomics Standard Initiative (HUPO-PSI) project, which
  standardises programmatic access to molecular interaction
  databases. This approach enables to query great many resources in
  one go but, as noted in the vignettes, for bulk interactions, it is
  recommended to directly download databases from individual PSICQUIC
  providers.

* The `r Biocpkg("STRINGdb")` package provides a direct interface to
  the STRING protein-protein interactions database. This package can
  be used to generate a table as the one used below. The exact
  procedure is described in the `STRINGdb` vignette and involves
  mapping the protein identifiers with the *map* and retrieve the
  interaction partners with the *get_neighbors* method.

* Finally, it is possible to use any third-party PPI inference results
  and adequately prepare these results for transfer learning. Below,
  we will described this case with PPI data in a tab-delimited format,
  as retrieved directly from the STRING database.


Below, we access the PPI spreadsheet file for our test data, that is
distributed with the `r Biocexptpkg("pRolocdata")` package.

```{r tabdelim}
ppif <- system.file("extdata/tabdelimited._gHentss2F9k.txt.gz", package = "pRolocdata")
ppidf <- read.delim(ppif, header = TRUE, stringsAsFactors = FALSE)
head(ppidf)
```

The file contains `r nrow(ppidf)` pairwise interactions and the STRING
combined interaction score. Below, we create a contingency matrix that
uses these scores to encode and weight interactions.

```{r ppiset}
uid <- unique(c(ppidf$X.node1, ppidf$node2))
ppim <- diag(length(uid))
colnames(ppim) <- rownames(ppim) <- uid

for (k in 1:nrow(ppidf)) {
    i <- ppidf[[k, "X.node1"]]
    j <- ppidf[[k, "node2"]]
    ppim[i, j] <- ppidf[[k, "combined_score"]]
}

ppim[1:5, 1:8]
```

We now have a contingency matrix reflecting a total of
`r sum(ppim != 0)` interactions between `r nrow(ppim)`
proteins. Below, we only keep proteins that are also available in our
spatial proteomics data (renamed to `andyppi`), subset the PPI and
LOPIT data, create the appropriate `MSnSet` object, and filter out
proteins without any interaction scores.

```{r ppiset2}
andyppi <- andy2011
featureNames(andyppi) <- sub("_HUMAN", "", fData(andyppi)$UniProtKB.entry.name)
cmn <- intersect(featureNames(andyppi), rownames(ppim))
ppim <- ppim[cmn, ]
andyppi <- andyppi[cmn, ]

ppi <- MSnSet(ppim, fData = fData(andyppi),
              pData = data.frame(row.names = colnames(ppim)))
ppi <- filterZeroCols(ppi)
```

We now have two `MSnSet` objects containing respectively
`r nrow(andyppi)` primary experimental protein profiles along a
sub-cellular density gradient (`andyppi`) and `r nrow(ppi)` auxiliary
interaction profiles (`ppi`).

```{r}
andyppi
```

# Support vector machine transfer learning {#sec:svmtl}

The SVM-TL method descibed in [@Breckels:2016] has not yet been
incorporated in the `r Biocpkg("pRoloc")` package. The code
implementing the method is currently available in its own repository:

https://github.com/ComputationalProteomicsUnit/lpsvm-tl-code

# Nearest neighbour transfer learning {#sec:knntl}

## Optimal weights {#sec:theopt}

```{r mclasses, echo=FALSE}
data(andy2011) ## load clean LOPIT data

## marker classes for andy2011
m <- unique(fData(andy2011)$markers.tl)
m <- m[m != "unknown"]
```

The weighted nearest neighbours transfer learning algorithm estimates
optimal weights for the different data sources and the spatial niches
described for the data at hand with the *knntlOptimisation*
function. For instance, for the human data modelled by the `andy2011`
and `andygoset`
objects^[We will use the sub-cellular markers defined in the `markers.tl` feature variable, instead of the default `markers`.]
and the `r length(m)` annotated sub-cellular localisations
(`r paste(m[-1], collapse = ", ")` and `r m[1]`), we want to know how
to optimally combine primary and auxiliary data. If we look at figure
\@ref(fig:andypca), that illustrates the experimental separation of
the `r length(m)` spatial classes on a principal component plot, we
see that some organelles such as the mitochondrion or the cytosol and
cytosol/nucleus are well resolved, while others such as the Golgi or
the ER are less so. In this experiment, the former classes are not
expected to benefit from another data source, while the latter should
benefit from additional information.


```{r andypca, fig.width=6, fig.height=6, echo=FALSE, fig.cap = "PCA plot of `andy2011`. The multivariate protein profiles are summarised along the two first principal components. Proteins of unknown localisation are represented by empty grey points. Protein markers, which are well-known residents of specific sub-cellular niches are colour-coded and form clusters on the figure."}
setStockcol(paste0(getStockcol(), "80"))
plot2D(andy2011, fcol = "markers.tl")
setStockcol(NULL)
addLegend(andy2011, fcol = "markers.tl",
          where = "topright", bty = "n", cex = .7)
```

Let's define a set of three possible weights: 0, 0.5 and 1. A weight
of 1 indicates that the final results rely exclusively on the
experimental data and ignore completely the auxiliary data. A weight
of 0 represents the opposite situation, where the primary data is
ignored and only the auxiliary data is considered. A weight of 0.5
indicates that each data source will contribute equally to the final
results. It is the algorithm's optimisation step task to identify the
optimal combination of class-specific weights for a given primary and
auxiliary data pair. The optimisation process can be quite time
consuming for many weights and many sub-cellular classes, as all
combinations (there are $number~of~classes^{number~of~weights}$
possibilities; see below). One would generally defined more weights
(for example `r seq(0, 1, by = 0.25)` or `r round(seq(0, 1, length.out = 4), 2)`)
to explore more fine-grained integration opportunities. The possible
weight combinations can be calculated with the *thetas* function:

* 3 classes, 3 weights

```{r thetas0, echo=TRUE}
head(thetas(3, by = 0.5))
dim(thetas(3, by = 0.5))
```

* 5 classes, 4 weights

```{r thetas1, echo=TRUE}
dim(thetas(5, length.out = 4))
```

* for the human `andy2011` data, considering 4 weights, there are very
  many combinations:

```{r thetaandy}
## marker classes for andy2011
m <- unique(fData(andy2011)$markers.tl)
m <- m[m != "unknown"]
th <- thetas(length(m), length.out=4)
dim(th)
```


The actual combination of weights to be tested can be defined in
multiple ways: by passing a weights matrix explicitly (as those
generated with *thetas* above) through the `th` argument; or by
defining the increment between weights using `by`; or by specifying
the number of weights to be used through the `length.out` argument.

Considering the sub-cellular resolution for this experiment, we would
anticipate that the mitochondrion, the cytosol and the cytosol/nucleus
classes would get high weights, while the ER and Golgi would be
assigned lower weights.

As we use a nearest neighbour classifier, we also need to know how
many neighbours to consider when classifying a protein of unknown
localisation. The *knnOptimisation* function (see the
*pRoloc-tutorial* vignette and the functions manual page) can be run
on the primary and auxiliary data sources independently to estimate
the best $k_P$ and $k_A$ values. Here, based on *knnOptimisation*, we
use 3 and 3, for $k_P$ and $k_A$ respectively.

Finally, to assess the validity of the weight selection, it should be
repeated a certain number of times (default value is 50). As the
weight optimisation can become very time consuming for a wide range of
weights and many target classes, we would recommend to start with a
lower number of iterations, pre-analyse the results, proceed with
further iterations and eventually combine the optimisation results
data with the *combineThetaRegRes* function before
proceeding with the selection of best weights.

```{r thetaopt0, eval=FALSE}
topt <- knntlOptimisation(andy2011, andy2011goCC,
                          th = th,
                          k = c(3, 3),
                          fcol = "markers.tl",
                          times = 50)
```

The above code chunk would take too much time to be executed in the
frame of this vignette. Below, we pass a very small subset of theta
matrix to minimise the computation time. The *knntlOptimisation*
function supports parallelised execution using various backends thanks
to the `r Biocpkg("BiocParallel")` package; an appropriate backend
will be defined automatically according to the underlying architecture
and user-defined backends can be defined through the `BPPARAM`
argument^[Large scale applications of this algorithms were run on a cluster using an MPI backend defined with `SnowParams(256, type="MPI")`.].
Also, in the interest of time, the weights optimisation is repeated
only 5 times below.

```{r thetaopt, eval=TRUE}
set.seed(1)
i <- sample(nrow(th), 12)
topt <- knntlOptimisation(andy2011, andy2011goCC,
                          th = th[i, ],
                          k = c(3, 3),
                          fcol = "markers.tl",
                          times = 5)
topt
```

The optimisation is performed on the labelled marker examples
only. When removing unlabelled non-marker proteins (the `unknowns`),
some auxiliary GO columns end up containing only 0 (the GO-protein
association was only observed in non-marker proteins), which are
temporarily removed.

The `topt` result stores all the result from the optimisation step,
and in particular the observed theta weights, which can be directly
plotted as shown on the [bubble plot](#fig:bubble) below. These bubble
plots give the proportion of best weights for each marker class that
was observed during the optimisation phase. We see that the
mitochondrion, the cytosol and cytosol/nucleus classes predominantly
are scored with height weights (2/3 and 1), consistent with high
reliability of the primary data. The Golgi and the ribosomal clusters
(and to a lesser extend the ER) favour smaller scores, indicating a
substantial benefit of the auxiliary data.

![Results obtained from an extensive optimisation on the primary `andy2011` and auxiliary `andygoset` data sets, as produced by `plot(topt)`. This figure is not the result for the previous code chunk, where only a random subset of 10 candidate weights have been tested.](./Figures/bubble-andy.png){#fig:bubble}

## Choosing weights {#sec:choosep}

A set of best weights must be chosen and applied to the classification
of the unlabelled proteins (formally annotated as `unknown`). These
can be defined manually, based on the pattern observed in the weights
[bubble plot](#fig:bubble), or automatically extracted with the
*getParams*
method^[Note that the scores extracted here are based on the random subsest of weights.]. See
*?getParams* for details and the *favourPrimary* function, if it is
desirable to systematically favour the primary data (i.e. high
weights) when different weight combinations perform equally well.


```{r getParam}
getParams(topt)
```

We provide the best parameters for the extensive parameter
optimisation search, as provided by *getParams*:

```{r besttheta}
(bw <- experimentData(andy2011)@other$knntl$thetas)
```

## Applying best *theta* weights {#sec:thclass}

To apply our best weights and learn from the auxiliary data
accordingly when classifying the unlabelled proteins to one of the
sub-cellular niches considered in `markers.tl` (as displayed on figure
\@ref(fig:andypca)), we pass the primary and auxiliary data sets, best
weights, best k's (and, on our case the marker's feature variable we
want to use, default would be `markers`) to the *knntlClassification*
function.

```{r tlclass}
andy2011 <- knntlClassification(andy2011, andy2011goCC,
                                bestTheta = bw,
                                k = c(3, 3),
                                fcol = "markers.tl")
```

This will generate a new instance of class *MSnSet*, identical to the
primary data, including the classification results and classifications
scores of the transfer learning classification algorithm (as `knntl`
and `knntl.scores` feature variables respectively). Below, we extract
the former with the *getPrediction* function and plot the results of
the classification.

```{r tlpreds}
andy2011 <- getPredictions(andy2011, fcol = "knntl")
```

```{r andypca2, fig.width=6, fig.height=6, fig.cap = "PCA plot of `andy2011` after transfer learning classification. The size of the points is proportional to the classification scores."}
setStockcol(paste0(getStockcol(), "80"))
ptsze <- exp(fData(andy2011)$knntl.scores) - 1
plot2D(andy2011, fcol = "knntl", cex = ptsze)
setStockcol(NULL)
addLegend(andy2011, where = "topright",
          fcol = "markers.tl",
          bty = "n", cex = .7)
```


Please read the *pRoloc-tutorial* vignette, and in particular the
classification section, for more details on how to proceed with
exploration the classification results and classification scores.

# Conclusions {#sec:ccl}

This vignette describes the application of a weighted $k$-nearest
neighbour transfer learning algorithm and its application to the
sub-cellular localisation prediction of proteins using quantitative
proteomics data as primary data and Gene Ontology-derived binary data
as auxiliary data source. The algorithm can be used with various data
sources (we show how to compile binary data from the Human Protein
Atlas in section \@ref(sec:hpaaux)) and have successfully applied the
algorithm [@Breckels:2016] on third-party quantitative auxiliary data.

# Session information {-}

All software and respective versions used to produce this document are
listed below.

```{r sessioninfo, echo=FALSE}
sessionInfo()
```

# References {-}