---
title: "SubCellBarCode: Integrated workflow for robust classification and
visualization of spatial proteome"
author:
- name: Taner Arslan
affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Lukas Orre
affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Mattias Vesterlund
affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Yanbo Pan
affiliation: "Karolinska Institute, Stockholm, Sweden"
- name: Janne Lehtiö
affiliation: "Karolinska Institute, Stockholm, Sweden"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc_fload: true
package: SubCellBarCode
abstract: In the SubCellBarCode project, mass spectrometry (MS)
based proteomics was used for proteome-wide mapping of protein
subcellular localization (Orre et al. 2019, Molecular Cell).
Subcellular fractionation and MS-based quantification can be
reproduced following the methods described in Orre et al.
Here, the SubCellBarCode R package offers tools to reproduce
the underlying bioinformatics pipeline for robust classification
and visualization of proteins into corresponding subcellular
localizations. Moreover, the R-package can be used for differential
localization analysis to investigate e.g. treatment induced protein
relocalization, condition dependent localization or cell type
specific localization. Last but not least, it provides peptide/
exon/transcript centric classification as well as PTM (Post
translation modifications) modified peptide classification.
vignette: >
%\VignetteIndexEntry{SubCellBarCode R Markdown vignettes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Installation of the package
`SubCellBarCode` can be installed through `BiocManager` package as follows:
``` {r install,eval = FALSE}
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("SubCellBarCode")
```
# Load the package
```{r Loadpackage}
library(SubCellBarCode)
```
# Data preparation and classification
## Example Data
As example data we here provide the publicly available HCC827 (human lung
adenocarcinoma cell line) TMT10plex labelled proteomics dataset
(*Orre et al.* 2019, Molecular Cell). The `data.frame` consists of
10480 proteins as rows (rownames are gene -centric protein ids)
and 5 fractions with duplicates as columns
(replicates must be named ".A." and ".B.", repectively).
```{r exampleData}
head(hcc827Ctrl)
```
## Marker Proteins
The classification of protein localisation using the SubCellBarCode
method is dependent on 3365 marker proteins as defined in Orre et al.
The markerProteins `data.frame` contain protein names (gene symbol),
associated subcellular localization (compartment), color code for
the compartment and the median normalized fractionation
profile (log2) based on five different human cell lines
(NCI-H322, HCC827, MCF7, A431and U251) here called the
“5CL marker profileâ€.
```{r markerdata}
head(markerProteins)
```
## Load and normalize data
Input `data.frame` is checked with *"NA"* values and for the correct
format. If there is any *"NA"* value, corresponding row is deleted.
Then, data frame is `log2` transformmed.
```{r loadData}
df <- loadData(protein.data = hcc827Ctrl)
```
```{r printDimData}
cat(dim(df))
head(df)
```
**Additional step:**
We use gene symbols for the protein identification. Therefore, we require
gene symbols for the identifiaction. However, if the input data has
other identifier e.g. UNIPROT, IPI, Entrez ID, you can convert it to gene
symbol by our defined function.
*Please be aware of possible (most likely few) id loss during the
conversion to one another.*
```{r convertID}
##Run if you have another identifier than gene symbols.
##Function will convert UNIPROT identifier to gene symbols.
##Deafult id is "UNIPROT", make sure you change it if you use another.
#df <- convert2symbol(df = df, id = "UNIPROT")
```
For the downstream analysis, we used the randomly selected subset data.
```{r subset data}
set.seed(2)
df <- df[sample(nrow(df), 6000),]
```
## Calculate covered marker proteins
The overlap between *marker proteins (3365)* and
*input data.frame* is calculated and visualized for each
compartment by a bar plot.
*Note that we recommend at least 20% coverage of marker proteins
for each compartment. If certain compartments are underrrepresented
we recommend you to perform the cell fractionation again.
If all compartments are low in coverage we recommend increasing
the analytical depth of the MS-analysis.*
```{r coverageMarkers, fig = TRUE}
c.prots <- calculateCoveredProtein(proteinIDs = rownames(df),
markerproteins = markerProteins[,1])
```
## Quality control of the marker proteins
To avoid reduced classification accuracy, marker proteins with
noisy quantification and marker proteins that are not representative
of their associated compartment (e.g.due to cell type specific
localization) are filtered out by a two-step quality control.
1. Marker proteins with pearson correlations
less than 0.8 between *A* and *B* duplicates for each cell line were
filtered out (Figure A).
2. Pairwise correlations between 5CL marker profile and input data for
each protein (A and B replicate experiments separately) were calculated
using both Pearson and Spearman correlation. The lowest value for each
method were then used for filtering with cut-offs set to 0.8 and 0.6
respectively, to exclude non-representative marker proteins (Figure B).
```{r markerQC}
r.markers <- markerQualityControl(coveredProteins = c.prots,protein.data = df)
r.markers[1:5]
```
**Optional step:** After removing non-marker proteins, you can re-calculate
and visualize the final coverage of the marker proteins.
```{r finalCoverage, fig = TRUE}
## uncomment the function when running
# f.prots <- calculateCoveredProtein(r.markers, markerProteins[,1])
```
## Visualization of marker proteins in t-SNE map
The spatial distribution of the marker proteins is visualized in
t-SNE map. This plot will be informative for the quality control
of the generated data as it offers evaluation of the spatial
distribution and separation of marker proteins.
```{r tsneparameter}
#Default parameters
#Run the broad-range parameters to optimize well !!!
#Output dimensionality
#dims = 3
#Speed/accuracy trade-off (increase for less accuracy)
#theta = c(0.1, 0.2, 0.3, 0.4, 0.5)
#Perplexity parameter
#perplexity = c(5, 10, 20, 30, 40, 50, 60)
```
Information about the different t-SNE parameters that can be
modified by the user is available by typing `?Rtsne` in the console.
Although the applications of t-SNE is widespread in the field
of machine learning, it can be misleading if it is not well
optimized. Therefore, we optimize t-SNE map by grid search,
a process that can take some time
```{r tsnedim3, fig = TRUE, fig.width = 6.5, fig.height = 6.5}
set.seed(6)
tsne.map <- tsneVisualization(protein.data = df,
markerProteins = r.markers,
dims = 3,
theta = c(0.1),
perplexity = c(60))
```
We recommend 3D vizualisation by setting `dims = 3`,
for optimal evaluation of marker protein cluster separation
and data modularity. You can also visualize the marker
proteins in 2 dimensional space by setting `dims = 2`, although
reducing the dimensionality results in loss of information and
underestimation of data resolution.
```{r tsnedim2, fig = TRUE}
set.seed(9)
tsne.map2 <- tsneVisualization(protein.data = df,
markerProteins = r.markers,
dims = 2,
theta = c(0.5),
perplexity = c(60))
```
## Build model and classify proteins
For replicate *A* and *B* separately, marker proteins are
used for training a Support Vector Machine (SVM) classifier
with a Gaussian radial basis function kernel algorithm.
After tuning the parameters, the SVM model predicts (classifies)
the subcellular localization for all proteins in the input
data with corresponding probabilities for *A* and *B* replicate
classification.
```{r buildSVM}
set.seed(4)
cls <- svmClassification(markerProteins = r.markers,
protein.data = df,
markerprot.df = markerProteins)
```
```{r testdata}
# testing data predictions for replicate A and B
test.A <- cls[[1]]$svm.test.prob.out
test.B <- cls[[2]]$svm.test.prob.out
head(test.A)
```
```{r allPred}
# all predictions for replicate A and B
all.A <- cls[[1]]$all.prot.pred
all.B <- cls[[2]]$all.prot.pred
```
### Estimate classification thresholds for compartment level
Classification probabilities close to 1 indicate high confidence
predictions, whereas probabilities close to 0 indicate low
confidence predictions. To increase the overal prediction accuracy
and to filter out poor predictions, one criterion and
two cut-offs are defined.
1. The criterion is the consensus of preliminary predictions
between biological duplicates. Proteins are kept in the analysis,
if there is an agreement between biological duplicates.
Subsequently, prediction probabilities from the two
duplicates are averaged for each protein.
2. Cut-off is (precision - based) set when precision reach 0.9
in the test data.
3. Cut-off is (recall - based) set as the probability of the
lowest true positive in the test data.
```{r compartmentThreshold}
t.c.df <- computeThresholdCompartment(test.repA = test.A, test.repB = test.B)
```
```{r headcompartmentThreshold}
head(t.c.df)
```
### Apply threshold to compartment level classifications
The determined thresholds for the compartment levels are applied
to all classifications.
```{r applycompartmentThreshold}
c.cls.df <- applyThresholdCompartment(all.repA = all.A, all.repB = all.B,
threshold.df = t.c.df)
```
```{r headcompartmentCls}
head(c.cls.df)
```
### Estimate classification thresholds for neighborhood level
Compartment level classification probabilities are summed to
neighborhood probabilities and thresholds for neighborhood
analysis are estimated as described above for compartment
level analysis except precision based cut-off is set to 0.95.
```{r neighborhoodThreshold}
t.n.df <- computeThresholdNeighborhood(test.repA = test.A, test.repB = test.B)
```
```{r headneighborhoodThreshold}
head(t.n.df)
```
### Apply threshold to neighborhood level classifications
The determined thresholds for the neighborhood levels are applied to all
classifications.
```{r applyNeighborhoodThreshold}
n.cls.df <- applyThresholdNeighborhood(all.repA = all.A, all.repB = all.B,
threshold.df = t.n.df)
```
```{r headNeighborhoodCls}
head(n.cls.df)
```
### Merge compartment and neighborhood classification
Individual classifications (compartment and neighborhood) are merged
into one data frame.
```{r mergecls}
cls.df <- mergeCls(compartmentCls = c.cls.df, neighborhoodCls = n.cls.df)
```
```{r headmerge}
head(cls.df)
```
# Visualization of the protein subcellular localization
## SubCellBarCode plot
You can query one protein at a time to plot barcode of the protein
of the interest.
`PSM` (Peptide-spectra-matching) count table is required for the plotting
SubCellBarCode. It is in `data.frame` format;
```{r hcc827psmcount}
head(hcc827CtrlPSMCount)
```
```{r plotbarcode, fig = TRUE, fig.width = 6, fig.height = 6}
plotBarcode(sampleClassification = cls.df, protein = "NLRP4",
s1PSM = hcc827CtrlPSMCount)
```
## Co-localization plot
To evaluate localization and of multiple proteins at
the same time, a vector of proteins (identified by gene symbols)
can be prepared and used to create a barplot showing the
distribution of classifications across compartments and
neighborhoods. This analysis could be helpful when
evaluating co-localization of proteins, protein complex
formation and compartmentalized protein level regulation.
```{r multipleprots, fig = TRUE, fig.width= 10, fig.height = 8}
# 26S proteasome complex (26s proteasome regulatory complex)
proteasome26s <- c("PSMA7", "PSMC3","PSMA4", "PSMB4",
"PSMB6", "PSMB5", "PSMC2","PSMC4",
"PSMB3", "PSMA6","PSMC5","PSMC6")
plotMultipleProtein(sampleClassification = cls.df, proteinList = proteasome26s)
```
# Differential localization analysis
Regulation of protein localization is a the key
process in cellular signalling. The `SubCellBarCode`
method can be used for differential localization analysis
given two conditions such as control vs treatment, cancer
cell *vs* normal cell, cell state A *vs* cell state B, *etc*.
As example, we compared untreated and gefitinib (EGFR inhibitor)
treated HCC827 cells (for details, see Orre et al.).
## Plot differentially localizing proteins
Neighborhood classifications for condition 1 (untreated) and
condition 2 (gefitinib) is first done separately, and
classifications for overlapping proteins are then vizualized by a
sankey plot.
*The HCC827 gefitinib cell lines classification was embedded
into the package for example analysis.*
```{r headHCC827GEFCls}
head(hcc827GEFClass)
```
```{r sankey, fig.width = 6, fig.height = 3}
sankeyPlot(sampleCls1 = cls.df, sampleCls2 = hcc827GEFClass)
```
## Filter Candidates
As the differential localization analysis is an outlier analysis,
it will include analytical noise. To filter out such noise,
`PSM` (Peptide-spectra-matching) counts and fractionation profile
correlation analysis (Pearson) was done to identify strong
candidates. The PSM count format for the input have to be the
same between the compared conditions;
```{r headPSMCount}
head(hcc827CtrlPSMCount)
```
For each protein, the minimum PSM count between the two conditions
is plotted against the fractionation profile (median) correlation
between the two conditions. For proteins with different localizations
between conditions, the fractionation profile differs and therefore
we are expecting a low fractionation profile correlation. As a standard
setting for filtering of analytical noise in the differential
localization analysis we suggest to demand a fractionation profile
correlation below 0.8, and a minimum PSM count of at least 3.
However, for the exploratory analysis, you can adjust
the `min.psm` and `pearson.cor` parameters.
```{r relocation parameters}
##parameters
#sampleCls1 = sample 1 classification output
#s1PSM = sample 2 PSM count
#s1Quant = Sample 1 Quantification data
#sampleCls2 = sample 2 classification output
#s2PSM = sample 2 classification output
#sample2Quant = Sample 2 Quantification data
#min.psm = minumum psm count
#pearson.cor = perason correlation coefficient
```
```{r strongCandidates, fig = TRUE}
candidate.df <- candidateRelocatedProteins(sampleCls1 = cls.df,
s1PSM = hcc827CtrlPSMCount,
s1Quant = hcc827Ctrl,
sampleCls2 = hcc827GEFClass,
s2PSM = hcc827GefPSMCount,
s2Quant = hcc827GEF,
min.psm = 2,
pearson.cor = 0.8)
```
```{r printdim}
cat(dim(candidate.df))
```
```{r printhead}
head(candidate.df)
```
Candidate subset of differentially localizing proteins can
be annotated with names by setting `annotation = TRUE`, `min.psm`
and `pearson.cor`
```{r strongCandidatesLabel, fig = TRUE}
candidate2.df <- candidateRelocatedProteins(sampleCls1 = cls.df,
s1PSM = hcc827CtrlPSMCount,
s1Quant = hcc827Ctrl,
sampleCls2 = hcc827GEFClass,
s2PSM = hcc827GefPSMCount,
s2Quant = hcc827GEF,
annotation = TRUE,
min.psm = 9,
pearson.cor = 0.05)
```
# Peptide/Exon/Transcript centric or PTM regulated localization
Our main analysis is based on gene-centric. However, based on the analytical
depth of the data, peptide/exon/transcript centric classifications can
be performed. Moreover, this approach is also applicable for
post translation modification (PTM) enriched data.
Fundamentally, we will use the same classifiers, compartment
and neighborhood levels thresholds and apply
them to peptide/exon/transcript data.
## Exon-centric classification
Here, we have provided subset of the HCC827 exon-centric data.
```{r head exon data}
head(hcc827exon)
```
For the classification, we will use the gene-centric model that we built in
section 3.7.
```{r recall model}
##recall the models
modelA <- cls[[1]]$model
modelB <- cls[[2]]$model
```
*`svmExternalData` function greps replicates A and B, repectively. So
the input data can include other features like, here, peptide counts.*
```{r exon centric cls}
exon.cls <- svmExternalData(df = hcc827exon, modelA = modelA, modelB= modelB)
```
```{r exon cls}
exon.A <- exon.cls[[1]]
exon.B <- exon.cls[[2]]
```
```{r exon cls rep A}
head(exon.A)
```
Next steps are exactly same as what we did in section 3.7. We will
use same thresholdsthat we have estimated in 3.7.1 and 3.7.3.
Finally, we will merge two classifications same function used
in 3.7.5.
```{r apply thresholds}
exon.comp.cls <- applyThresholdCompartment(all.repA = exon.A[,2:17],
all.repB = exon.B[,2:17],
threshold.df = t.c.df)
exon.neigh.cls <- applyThresholdNeighborhood(all.repA = exon.A[,2:17],
all.repB = exon.B[,2:17],
threshold.df = t.n.df)
exon.cls.df <- mergeCls(compartmentCls = exon.comp.cls,
neighborhoodCls = exon.neigh.cls)
#same order
exon.cls.df <- exon.cls.df[rownames(exon.A),]
# we will add gene symbols as well as peptide count
# (PSM count is also accepted) in case for comparing with
# gene-centric classifications
exon.cls.df$GeneSymbol <- exon.A$Gene_Symbol
exon.cls.df$PeptideCount <- hcc827exon$PeptideCount
```
```{r head exon cls}
head(exon.cls.df)
```
## Comparison between gene and exon centric classification
The insteresting part of the exon classification is to detect deviated
classification between gene centric classification. However, we enrich
more noise than gene centric classification due to analtyical dept of the exon
centric data. Therefore, we, additioanlly, internally calculate correlation
between gene-centric data `hcc827Ctrl` and exon-centric data `hcc827exon`,
and report
number of peptides (PSM count works, too) that match to corresponding exon
as an indication of the confidence of the results.
```{r compareCls}
comp.df <- compareCls(geneCls = cls.df, exonCls = exon.cls.df)
```
```{r head comp.df}
head(comp.df)
```
You can easily vizualize/filter the results using correlations and number of peptides.
# References
* Orre., et al. "SubCellBarCode: Proteome-wide Mapping of Protein
Localization and Relocalization." Molecular Cell (2019): 73(1):166-182.e7.
# Session Information
```{r}
sessionInfo()
```