Abstract
Plot multiple sequence alignments using ggplot2, with support for multiple color schemes.
Citation
To do
Introduction
ggmsa supports visualizing multiple sequence alignments of DNA and protein sequences using ggplot2. It supports a number of colour schemes, including Chemistry, Clustal, Shapely, Taylor and Zappo. Multiple sequence alignments can easily be combined with other ‘ggplot2’ plots, for example by aligning a phylogenetic tree produced by ‘ggtree’ with the alignment.
Installation
This R package (ggmsa, current version: 0.0.2) is available via CRAN. Install it as follows:
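A minimal installation sketch, assuming a CRAN mirror is already configured. The `requireNamespace()` guard is an addition for illustration and only avoids reinstalling a package that is already present.

```r
# Install ggmsa from CRAN only if it is not already available
if (!requireNamespace("ggmsa", quietly = TRUE)) {
  install.packages("ggmsa")
}

# Attach the package for the current session
library(ggmsa)
```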
Load sample data
The file sample.fasta is shipped with the ggmsa package. To determine where the file is located, enter the following command in your R session:
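One way to look up the path is `system.file()`. This sketch assumes the file sits in the package's `extdata` directory, as `LeaderRepeat_All.fa` does; the variable name `sequences` matches the one used in the tree example later in this document.

```r
# Locate the example alignment bundled with ggmsa.
# Assumption: sample.fasta lives under "extdata" in the installed package.
sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# system.file() returns an empty string when the file is not found
print(sequences)
```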
Printing Multiple Sequence Alignments
We offer 5 color schemes for multiple sequence alignments.
Clustal X Colour Scheme
This is an emulation of the default colour scheme used for alignments in Clustal X, a graphical interface for the ClustalW multiple sequence alignment program. Each residue in the alignment is assigned a colour if the amino acid profile of the alignment at that position meets some minimum criteria specific to the residue type.
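As a sketch of how this scheme might be applied, the call below plots a window of the sample alignment. It assumes that `ggmsa()` accepts a FASTA file path directly, that the window arguments are named `start` and `end`, and that the scheme is selected with `color = "Clustal"`; the window 164 to 213 is borrowed from the tree example in this document.

```r
library(ggmsa)

# Assumption: sample.fasta is stored under "extdata" in the package
sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Plot alignment columns 164 to 213 with the Clustal scheme.
# font = NULL draws only the coloured tiles, without residue letters.
p_clustal <- ggmsa(sequences, start = 164, end = 213,
                   font = NULL, color = "Clustal")
p_clustal
```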
Color by Chemistry
Amino acids are colored according to their side chain chemistry:
Color by Shapely
This color scheme matches the RasMol amino acid and RasMol nucleotide color schemes, which are, in turn, based on Robert Fletterick’s “Shapely models”.
Color by Taylor
This color scheme is taken from Taylor (Taylor 1997) and is also used in JalView (Waterhouse et al. 2009).
Color by Zappo
This scheme colors residues according to their physico-chemical properties, and is also used in JalView (Waterhouse et al. 2009).
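To compare the five protein schemes on the same region, one option is to build one plot per scheme in a loop. This is a sketch only: `"Chemistry_AA"` appears elsewhere in this document, while `"Clustal"`, `"Shapely_AA"`, `"Taylor_AA"` and `"Zappo_AA"` are assumed to follow the same naming pattern, and the `start`/`end` argument names are likewise assumptions.

```r
library(ggmsa)
library(ggplot2)

# Assumption: sample.fasta is stored under "extdata" in the package
sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Scheme names assumed to follow the "<Scheme>_AA" pattern for proteins
schemes <- c("Clustal", "Chemistry_AA", "Shapely_AA", "Taylor_AA", "Zappo_AA")

# Build one alignment plot per scheme, titled with the scheme name
plots <- lapply(schemes, function(scheme) {
  ggmsa(sequences, start = 164, end = 213, font = NULL, color = scheme) +
    ggtitle(scheme)
})
names(plots) <- schemes

# Display a single scheme by name
plots[["Zappo_AA"]]
```

Keeping the plots in a named list makes it easy to print them one at a time or pass them to a layout function.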
Combining plots
ggmsa is based on ggplot2, so combining a multiple sequence alignment with other plots generated by ggplot2 is simple.
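Because the result is a ggplot object, ordinary ggplot2 layers and themes can be added with `+`. The following is a minimal sketch; the title text, the `start`/`end` argument names and the `extdata` location of the sample file are assumptions made for illustration.

```r
library(ggmsa)
library(ggplot2)

# Assumption: sample.fasta is stored under "extdata" in the package
sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Base alignment plot (tiles only)
p <- ggmsa(sequences, start = 164, end = 213,
           font = NULL, color = "Chemistry_AA")

# Add standard ggplot2 components on top of the alignment
p_titled <- p +
  labs(title = "sample.fasta, columns 164-213") +
  theme(plot.title = element_text(face = "bold"))
p_titled
```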
Combining with tree
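Cross-link with the ggtree (Yu et al. 2017) package:
library(Biostrings)
library(ape)
library(ggtree)
library(ggmsa)
# Path to the bundled alignment (assumes it is stored under extdata)
sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
# Read the protein alignment
x <- readAAStringSet(sequences)
# Hamming distance between aligned sequences, scaled by alignment width
d <- as.dist(stringDist(x, method = "hamming") / width(x)[1])
# Build a BIONJ tree from the distance matrix
tree <- bionj(d)
p <- ggtree(tree) + geom_tiplab()
# Extract alignment columns 164 to 213 in tidy form
data <- tidy_msa(x, 164, 213)
# Draw the alignment in a facet panel next to the tree
p + geom_facet(geom = geom_msa, data = data, panel = 'msa',
               font = NULL, color = "Chemistry_AA") +
    xlim_tree(1)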
Combining with sequence logo
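Cross-link with the ggseqlogo (Wagih 2017) package. The sequence logo can show the bits of each sequence letter; the code below uses the 'probability' method instead.
library(Biostrings)
library(ggmsa)
library(ggseqlogo)
library(cowplot)
# Read the bundled DNA alignment
f <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")
s <- readDNAStringSet(f)
strings <- as.character(s)
# Alignment tiles only (font = NULL), coloured by nucleotide chemistry
p1 <- ggmsa(s, font = NULL, color = 'Chemistry_NT')
# Sequence logo drawn on a canvas that shares the x axis of p1
p2 <- axis_canvas(p1, axis = 'x') + geom_logo(strings, 'probability')
# Attach the logo above the alignment and draw the combined plot
pp <- insert_xaxis_grob(p1, p2, position = "top", grid::unit(.05, "null"))
ggdraw(pp)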
References
Taylor, W. R. 1997. “Residual Colours: A Proposal for Aminochromography.” Protein Eng 10 (7): 743–46.
Wagih, Omar. 2017. “Ggseqlogo: A Versatile R Package for Drawing Sequence Logos.” Bioinformatics 33 (22).
Waterhouse, A. M., J. B. Procter, D. M. Martin, M. Clamp, and G. J. Barton. 2009. “Jalview Version 2–a Multiple Sequence Alignment Editor and Analysis Workbench.” Bioinformatics 25 (9): 1189.
Yu, Guangchuang, David K Smith, Huachen Zhu, Yi Guan, and Tommy Tsanyuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1): 28–36.