ggmsa: Plot multiple sequence alignment using ggplot2

2020-01-08

Abstract

Plot multiple sequence alignment using ggplot2 with multiple color schemes supported.

Citation

To do

Introduction

ggmsa supports visualizing multiple sequence alignments of DNA and protein sequences using ggplot2. It supports a number of colour schemes, including Chemistry, Clustal, Shapely, Taylor and Zappo. A multiple sequence alignment can easily be combined with other ‘ggplot2’ plots, for example by aligning it with a phylogenetic tree produced by ‘ggtree’.

Installation

This R package (ggmsa, current version: 0.0.2) is available via CRAN. Install and load it as follows:

## installing the package
install.packages("ggmsa")
## loading the package
library("ggmsa")

Load sample data

The file sample.fasta is shipped with the ggmsa package. To determine where the file is located, enter the following command in your R session:

sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
print(sequences)
#> [1] "/tmp/RtmpWSdJck/Rinst870267da3f6e7/ggmsa/extdata/sample.fasta"
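Note that system.file() returns an empty string rather than raising an error when the file or package cannot be found, so a missing file would only surface later, at plotting time. A minimal sketch of an early check follows; find_extdata is a hypothetical helper written for this illustration and is not part of ggmsa.

```r
## Hypothetical helper (not part of ggmsa): stop early if a bundled
## file cannot be located instead of passing "" on to later calls.
find_extdata <- function(file, package) {
  path <- system.file("extdata", file, package = package)
  if (!nzchar(path)) {
    stop("'", file, "' not found in extdata of package '", package, "'")
  }
  path
}

## Only look up the sample alignment when ggmsa is actually installed.
if (requireNamespace("ggmsa", quietly = TRUE)) {
  sequences <- find_extdata("sample.fasta", "ggmsa")
}
```

The path printed above points into a temporary build directory, so it will differ on your machine.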

Printing Multiple Sequence Alignments

We offer 5 color schemes for multiple sequence alignments.

Clustal X Colour Scheme

This is an emulation of the default colour scheme used for alignments in Clustal X, a graphical interface for the ClustalW multiple sequence alignment program. Each residue in the alignment is assigned a colour if the amino acid profile of the alignment at that position meets some minimum criteria specific for the residue type.

ggmsa(sequences, 320, 360, color = "Clustal")

Color by Chemistry

Amino acids are colored according to their side chain chemistry:

ggmsa(sequences, 320, 360, color = "Chemistry_AA")

Color by Shapely

This color scheme matches the RasMol amino acid and RasMol nucleotide color schemes, which are, in turn, based on Robert Fletterick’s “Shapely models”.

ggmsa(sequences, 320, 360, color = "Shapely_AA")

Color by Taylor

This color scheme is taken from Taylor (1997) and is also used in JalView (Waterhouse et al. 2009).

ggmsa(sequences, 320, 360, color = "Taylor_AA")

Color by Zappo

This scheme colors residues according to their physico-chemical properties, and is also used in JalView (Waterhouse et al. 2009).

ggmsa(sequences, 320, 360, color = "Zappo_AA")

Do not print letters

If you specify font = NULL, the residue letters are omitted and only the colored background boxes are drawn.

ggmsa(sequences, 320, 360, font = NULL, color = "Chemistry_AA")

Combining plots

ggmsa is based on ggplot2, so combining a multiple sequence alignment with other plots generated by ggplot2 is simple.
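As a small illustration of what building on ggplot2 means in practice, the sketch below adds an ordinary ggplot2 layer to an alignment plot. It assumes that ggmsa() returns a regular ggplot object, as the text implies; the title string is only an example.

```r
library(ggmsa)
library(ggplot2)

## locate the sample alignment shipped with ggmsa
sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

## assumption: the value returned by ggmsa() is a ggplot object,
## so standard ggplot2 components can be added with `+`
p <- ggmsa(sequences, 320, 360, color = "Clustal") +
    ggtitle("Positions 320-360, Clustal colours")
p
```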

Combining with tree

The alignment can be displayed next to a phylogenetic tree drawn with the ggtree (Yu et al. 2017) package:

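library(Biostrings)
library(ape)
library(ggtree)

## read the alignment and compute pairwise Hamming distances,
## normalised by the alignment length
x <- readAAStringSet(sequences)
d <- as.dist(stringDist(x, method = "hamming") / width(x)[1])

## build a BIONJ tree and draw it with tip labels
tree <- bionj(d)
p <- ggtree(tree) + geom_tiplab()

## extract alignment positions 164-213 and plot them beside the tree
data <- tidy_msa(x, 164, 213)
p + geom_facet(geom = geom_msa, data = data, panel = 'msa',
               font = NULL, color = "Chemistry_AA") +
    xlim_tree(1)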
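The distance used for the tree above is the Hamming distance divided by the alignment length, that is, the proportion of aligned positions at which two sequences differ. The base R sketch below shows the same quantity for a single pair of toy sequences; p_distance and the two strings are made up for this illustration and are not part of any of the packages used here.

```r
## Hypothetical helper: proportion of positions at which two
## equal-length aligned sequences differ (gaps count as characters).
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))
  mean(a != b)
}

## toy example: 2 of the 6 aligned positions differ
p_distance("MKV-LA", "MKI-LS")
```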

References

Taylor, W R. 1997. “Residual Colours: A Proposal for Aminochromography.” Protein Eng 10 (7): 743–46.

Waterhouse, A. M., J. B. Procter, D. M. Martin, M Clamp, and G. J. Barton. 2009. “Jalview Version 2–a Multiple Sequence Alignment Editor and Analysis Workbench.” Bioinformatics 25 (9): 1189.

Yu, Guangchuang, David K Smith, Huachen Zhu, Yi Guan, and Tommy Tsanyuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1): 28–36.