---
title: "Disordered Matrices Vignette"
author: "William McFadden"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Disordered Matrices Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# Substitution Matrices for Intrinsically Disordered Proteins


## Quick Intro
Intrinsically Disordered Proteins (IDPs) 
have many distinct properties compared to structured proteins, such as 
amino acid composition and rate of evolution. The most commonly used amino
acid substitution matrices (PAM and BLOSUM) are derived from or favor structured
proteins and therefore are not the most appropriate to use when analyzing IDPs.
The following amino acid substitution matrices are derived from sets of IDPs and
thus, are more appropriate.

A detailed comparison is available at 
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8).  
Additionally, see the documentation for each matrix for quick information
about the derivation and use of each matrix.

The following matrices are available within **idpr**:

* EDSS Matrices ([Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8))
   + EDSSMat50
   + EDSSMat60
   + **EDSSMat62**
   + EDSSMat70
   + EDSSMat75
   + EDSSMat80
   + EDSSMat90
* "Disorder" Matrices ([Brown et al., 2009](https://doi.org/10.1093/molbev/msp277))
   + **Disorder40**
   + Disorder60
   + **Disorder85**
* DUNMat ([Radivojac et al., 2001](https://doi.org/10.1142/9789812799623_0055))
   + DUNMat

*Bolded Matrices are the best-preforming*

***


## Installation  

The idpr package can be installed from Bioconductor with the following line of 
code. It requires the BiocManager package to be installed
```{r}
#BiocManager::install("idpr") 
```

The most recent version of the package can be installed with the following line 
of code. This requires the devtools package to be installed.

```{r}
#devtools::install_github("wmm27/idpr")
```

To access the matrices, the **idpr** package needs to be loaded
```{r}
library(idpr)
```



## Background and Motivation
It has been shown that solvent-exposed amino acids experience higher levels of
evolution compared to the internal residues of a protein (
[Franzosa & Xia, 2009](https://doi.org/10.1093/molbev/msp146)). Intrinsically
Disordered Proteins (IDPs) and Intrinsically Disordered Regions (IDRs) are, 
by definition, more solvent exposed. One can predict that these proteins have
accelerated evolution, and this has been observed (
[Brown et al., 2011](https://doi.org/10.1016/j.sbi.2011.02.005)).


Additionally, a study that compared proteins with both ordered and disordered 
regions showed the IDRs, on average, evolved at faster rates than the ordered
regions (
[Brown et al., 2002](https://doi.org/10.1016/j.sbi.2011.02.005)).


The most commonly used amino acid substitution matrices are 
BLOSUM and PAM. These are currently the only accepted matrices for NCBI BLAST (
[Johnson et al., 2008](https://doi.org/10.1093/nar/gkn201)), as well as 
EMBOSS Needle and EMBOSS Water web serveries (
[Madeira et al., 2019](http://doi.org/10.1093/nar/gkz268)).

The sequences used to calculate the values for both of these matrices are from 
highly conserved protein families (Dayhoff et al. 1978;
[Henikoff & Henikoff, 1992](https://doi.org/10.1073/pnas.89.22.10915)).
While they do not explicitly exclude IDPs,
the sequences used for analysis are enriched in highly-ordered proteins. 
Structured proteins have served as the basis for sequence-level comparisons due 
to their residue and position conservation.
IDPs do not experience the same level of residue-identity conservation that has 
been observed in proteins with a specific, well-folded structure. 
IDPs often conserve the overall chemistry and the disordered state of the 
protein, but are less conserved at the residue and position level. 
Therefore using BLOSUM or PAM for studies comparing IDP evolution is 
not the most appropriate method,
since these matrices account for the evolutionary observations of ordered
proteins (
[Brown et al., 2011](https://doi.org/10.1016/j.sbi.2011.02.005)).


Additionally, the amino acid composition and accepted point mutations for IDPs 
differ compared to structured proteins. IDPs and IDRs tend to be enriched in 
hydrophillic and charged residues (
[Uversky, 2019](https://doi.org/10.3389/fphy.2019.00010)).
In addition to a distinct composition, IDPs 
tend to accept indels at higher rates (
[Brown et al., 2011](https://doi.org/10.1016/j.sbi.2011.02.005)).


Therefore, there are multiple reasons as to why BLOSUM and PAM matrices do not
accurately assess the evolutionary history of IDPs. The works of 
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8),
[Brown et al., 2009](https://doi.org/10.1093/molbev/msp277), and
[Radivojac et al., 2001](https://doi.org/10.1142/9789812799623_0055) can serve
as references for IDP-based substitution matrices. Each use experimentally
confirmed or computationally predicted IDPs to create the data set of protein
sequences and protein families to calculate the accepted point mutations 
specific to disordered sequences. These are the EDSSMat set of matrices, the
"Disorder" set of matrices, and the DUNMat matrix.


Unlike most commonly used web-based tools, many sequence comparison functions
within other R packages allow for custom substitution matrices. **idpr** has 
added 11 disorder-based substitution matrices for IDP sequence conservation 
studies for user convenience. 

[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8) 
provides many details for comparing the EDSSMat matrices to the other disorder
matrices and other matrices like BLOSUM and PAM within sequence alignments.
The paper also calculated the optimal gap parameters for sequence alignments 
depending on how disordered the aligned sequences were predicted to be.
Protein families were sorted 
into 3 categories: Proteins containing Less Disorder (LD), defined as
[0-20%] disorder, Moderate Disorder (MD), defined as (20-40%] disorder, and
High Disorder (HD), defined as (40-100%] disorder. 

Additionally, EDSSMat62 was shown to identify both close and distant homologs of 
a specific IDP while other matrices could only identify some close homologs.
Please see 
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8)  
for additional information and for comparisons to other matrices. 


## Matrices

### EDSS Matrices
EDSSMat62, like the other EDSS matrices, is derived from alignment blocks of
computationally predicted IDPs. 
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8)  
go into detail of methods and comparisons. EDSSMat62 is the best preforming 
matrix in their study and was shown to identify both close and distant homologs 
of Secretogranin-1, an IDP, while other matrices could only identify some 
close homologs. On average, EDSSMat62 attained smaller E-values when aligning
Highly Disordered (HD) proteins compared to BLOSUM and PAM matrices, and against
some other disorder-based substitution matrices.

EDSS Matrices are symmetric, 24x24 matrices. They contain  the 20 standard amino
acids and 4 ambiguous residues: B, Asparagine or Aspartic Acid (Asx);
Z, Glutamine or Glutamic Acid (Glx); X, Unspecified or unknown amino acid; and 
*, Stop.
```{r}
EDSSMat62
```

### "Disorder" Matrices
In their study,
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8) 
show that the Disorder40 and Disorder85 matrices, on average, score lower 
E-values for the HD proteins. EDSSMat62 on average attains lower E-values for
the Low and Medium disordered proteins. 
The "Disorder" matrices are from 
[Brown et al., 2009](https://doi.org/10.1093/molbev/msp277), which identified
protein families based on homology to experimentally confirmed IDPs.
The "Disorder" matrices were not designed with the goal of improving sequence 
alignments. The authors note that the purpose of these matrices were to compare 
evolutionary characteristics of disordered and ordered proteins.
Therefore information for using these in sequence alignments, like gap scores
and entropy, are not provided by the original authors.  
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8)
determined the optimal gap parameters for these matrices to compare against 
EDSSMat62. 

Disorder40, Disorder60, and Disorder85 are symmetric, 24x24 matrices.
They contain the 20 standard amino acids
and 4 ambiguous residues: B, Asparagine or Aspartic Acid (Asx);
Z, Glutamine or Glutamic Acid (Glx); X, Unspecified or unknown amino acid; and 
*, Stop.
```{r}
Disorder40
```

```{r}
Disorder85
```


### DUNMat
DUNMat was described in 
[Radivojac et al., 2001](https://doi.org/10.1142/9789812799623_0055). 
This was an early attempt at improving alignments for IDPs. It is included in 
**idpr** due to its significance in developing substitution matrices for IDPs.
It is a symmetric 20x20 matrix. 
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8) 
show that EDSSMat62, on average, attains lower E-values for HD proteins. 
```{r}
DUNMat
```


## Examples
### Pairwise Sequence Alignments (PSAs)

P53 is a well-known IDP that has been experimentally determined. 
Information was accessed by DisProt, a database for IDPs (
[Hatos et al., 2019](https://doi.org/10.1093/nar/gkz975)).
(https://www.disprot.org/)

These examples are to showcase the differences of BLOSUM62, a commonly used
matrix that can be loaded into R via the **Biostrings** package, and 
EDSSMat62, a matrix comes loaded with **idpr** specific for IDPs. 
EDSSMat62 can be used with any function that accepts custom substitution 
matrices, and these examples show some uses of it.

**Cellular tumor antigen p53**


* TP53 from *Homo sapiens*
    + UniProt ID: P04637
    + DisProt ID: DP00086
    + Disorder content 48.1%
* Tp53 from *Mus musculus*
    + UniProt ID: P02340
    + DisProt ID: NA
* TP53 from *Gorilla gorilla gorilla*
    + UniProt ID: A0A2I2Y7Z8
    + DisProt ID: NA  

The amino acid sequences were acquired from UniProt (UniProt Consortium, 2019)
and stored within the **idpr** package for examples.

```{r}
P53_MOUSE <- TP53Sequences[1]
print(P53_MOUSE)
P53_HUMAN <- TP53Sequences[2]
print(P53_HUMAN)
P53_GORILLA <- GorillaTP53
print(P53_GORILLA)
```


PSA of Human TP53 to Gorilla and Mouse P53 BLOSUM62 (
[Henikoff & Henikoff, 1992](https://doi.org/10.1073/pnas.89.22.10915))
with default gap opening/extension from EMBOSS Needle (
[Madeira et al., 2019](http://doi.org/10.1093/nar/gkz268)).
pairwiseAlignment function and BLOSUM62 matrix are from **Biostrings**
```{r}
library(Biostrings)
data("BLOSUM62") #loads the matrix from the Biostrings package
HUMAN_MOUSE_BLOSUM_PSA <- pairwiseAlignment(P53_MOUSE, P53_HUMAN,
                                            substitutionMatrix = BLOSUM62,
                                            gapOpening = 10,
                                            gapExtension = 0.5)
print(HUMAN_MOUSE_BLOSUM_PSA)

HUMAN_GORILLA_BLOSUM_PSA <- pairwiseAlignment(P53_GORILLA, P53_HUMAN,
                                             substitutionMatrix = BLOSUM62,
                                             gapOpening = 10,
                                             gapExtension = 0.5)
print(HUMAN_GORILLA_BLOSUM_PSA)

```

PSA of Human TP53 to Gorilla and Mouse P53 
using EDSSMat62 with gap costs recommended for IDPs in (
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8)).
```{r}
HUMAN_MOUSE_EDSS_PSA <- pairwiseAlignment(P53_MOUSE, P53_HUMAN,
                                            substitutionMatrix = EDSSMat62,
                                            gapOpening = 19,
                                            gapExtension = 2)
print(HUMAN_MOUSE_EDSS_PSA)

HUMAN_GORILLA_EDSS_PSA <- pairwiseAlignment(P53_GORILLA, P53_HUMAN,
                                             substitutionMatrix = EDSSMat62,
                                             gapOpening = 19,
                                             gapExtension = 2)
print(HUMAN_GORILLA_EDSS_PSA)
```

Comparing the printed alignment for each matrix comparing Human and Mouse P53,
we can see that the alignments differ depending on the matrix used to align. 
This is also seen in the second alignments for each matrix that compares 
Human and Gorilla P53. 


Using BLOSUM, the scores between each PSA differ greatly by over 100 points.
Using EDSSMat62 we can see almost identical scores between the alignment 
comparing Human p53 to Mouse p53 and in the alignment comparing Human p53 
to Gorilla p53. Scores are not directly comparable between matrices, but this
shows relative to each other, the scores with EDSSMat62 are similar for this
specific case. There will be some instances where BLOSUM62 may appear to have
better scores, but as previously stated, 
[Trivedi & Nagarajaram, 2019](https://doi.org/10.1038/s41598-019-52532-8) 
show that on average EDSSMat62 attains lower E-values for alignments of IDPs. 


### Multiple Sequence Alignments (MSAs)

In addition to PSAs, Multiple Sequence Alignments (MSAs) offer additional 
analysis for protein families.

First I will load several P53 sequences stored within **idpr**. 
These sequences were selected due to their highly similar identity on UniProt(
[The UniProt Consortium, 2019](https://doi.org/10.1093/nar/gky1049)).
```{r}
TP53_Sequences <- TP53Sequences
print(TP53_Sequences)
```


MSA of P53 proteins using the **msa** package
```{r}
library(msa)
BLOSUM_MSA <- msa(TP53_Sequences,
                 type = "protein",
                 substitutionMatrix = BLOSUM62,
                 gapOpening = 10,
                 gapExtension = 0.5)

print(BLOSUM_MSA, show="complete")
```


```{r}
EDSS_MSA <- msa(TP53_Sequences,
                type = "protein",
                substitutionMatrix = EDSSMat62,
                gapOpening = 19,
                gapExtension = 2)

print(EDSS_MSA, show="complete")
```



### Distance Trees

The user guide to **msa** shows an example of converting the sequence alignment
(like the one generated above) into a phylogenetic tree of sequence distances.
Therefore, the IDP-specific matrices can be used for this type of analysis.
The conversion uses both the **ape** and **seqinr** packages. 
```{r fig1, fig.height = 4, fig.width = 6}
EDSS_MSA_Tree <- msa::msaConvert(EDSS_MSA, type="seqinr::alignment")
d <- seqinr::dist.alignment(EDSS_MSA_Tree, "identity")
p53Tree <- ape::nj(d)
plot(p53Tree, main="Phylogenetic Tree of p53 Sequences\nAligned with EDSSMat62")
```


## References

### Packages
```{r, results="asis"}
citation("Biostrings")
citation("msa")
citation("ape")
citation("seqinr")
```



### Citations

Brown, C. J., Johnson, A. K., & Daughdrill, G. W. (2009). 
Comparing Models of Evolution for Ordered and Disordered Proteins.
Molecular Biology and Evolution, 27(3), 609-621. doi:10.1093/molbev/msp277


Brown, C. J., Johnson, A. K., Dunker, A. K., & Daughdrill, G. W. (2011).
Evolution and disorder. Current Opinion in Structural Biology, 21(3), 441-446.
doi:https://doi.org/10.1016/j.sbi.2011.02.005


Brown, C. J., Takayama, S., Campen, A. M., Vise, P., Marshall, T. W., 
Oldfield, C. J., . . . Dunker, A. K. (2002). Evolutionary rate heterogeneity 
in proteins with long disordered regions. Journal of molecular evolution, 55(1), 
104. 


Dayhoff, M., Schwartz, R., & Orcutt, B. (1978). 
22 a model of evolutionary change in proteins. In Atlas of protein sequence and 
structure (Vol. 5, pp. 345-352): 
National Biomedical Research Foundation Silver Spring MD.


Franzosa, E. A., & Xia, Y. (2009). 
Structural Determinants of Protein Evolution Are Context-Sensitive 
at the Residue Level. Molecular Biology and Evolution, 26(10), 2387-2395. 
doi:10.1093/molbev/msp146


Hatos, A., Hajdu-SoltÃ©sz, B., Monzon, A. M., Palopoli, N.,
Ãlvarez, L., Aykac-Fas, B., . . . Piovesan, D. (2019). 
DisProt: intrinsic protein disorder annotation in 2020.
Nucleic acids research, 48(D1), D269-D276. doi:10.1093/nar/gkz975


Henikoff, S., & Henikoff, J. G. (1992).
Amino acid substitution matrices from protein blocks. 
Proceedings of the National Academy of Sciences of the United States of America,
89(22), 10915-10919. doi:10.1073/pnas.89.22.10915


Johnson, M., Zaretskaya, I., Raytselis, Y., Merezhuk, Y., McGinnis, S.,
& Madden, T. L. (2008). NCBI BLAST: a better web interface. 
Nucleic acids research, 36(suppl_2), W5-W9. doi:10.1093/nar/gkn201


Madeira, F., Park, Y. M., Lee, J., Buso, N., Gur, T., Madhusoodanan, N., 
. . . Finn, R. D. (2019). The EMBL-EBI search and sequence analysis tools APIs 
in 2019. Nucleic acids research, 47(W1), W636-W641. 


Radivojac, P., Obradovic, Z., Brown, C. J., & Dunker, A. K. (2001). 
Improving sequence alignments for intrinsically disordered proteins.
In Biocomputing 2002 (pp. 589-600): World Scientific.


Trivedi, R., & Nagarajaram, H. A. (2019). Amino acid substitution scoring
matrices specific to intrinsically disordered regions in proteins. 
Scientific Reports, 9(1), 16380. doi:10.1038/s41598-019-52532-8


UniProt Consortium. (2019). 
UniProt: a worldwide hub of protein knowledge.
Nucleic acids research, 47(D1), D506-D515. 


Uversky, V. N. (2019). Intrinsically Disordered Proteins and Their â€œMysteriousâ€
(Meta)Physics. Frontiers in Physics, 7(10). doi:10.3389/fphy.2019.00010


### Additional Information
R Version
```{r}
R.version.string
```

System Infomation
```{r}
as.data.frame(Sys.info())
```

```{r, results="asis"}
citation()
```