Introduction
ggmsa is a package designed to plot multiple sequence alignments using ggplot2.
This package implements functions that make visualizing publication-quality multiple sequence alignments (protein/DNA/RNA) in R simple and powerful. It uses a modular design to annotate sequence alignments and accepts other datasets for combined diagrams. ggmsa (current version: 1.0.2) is available via CRAN.
In this tutorial, we’ll work through the basics of using ggmsa.
library("ggmsa")
Importing MSA data
We’ll start by importing some example data to use throughout this tutorial. Besides FASTA files, several kinds of R objects are also accepted as input. available_msa() can be used to list the MSA objects currently supported.
available_msa()
#> files currently available:
#> .fasta
#> XStringSet objects from 'Biostrings' package:
#> DNAStringSet RNAStringSet AAStringSet BStringSet DNAMultipleAlignment RNAMultipleAlignment AAMultipleAlignment
#> bin objects:
#> DNAbin AAbin
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")
miRNA_sequences <- system.file("extdata", "seedSample.fa", package = "ggmsa")
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")
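Since available_msa() lists AAStringSet among the accepted inputs, the alignment can also be read into an R object first and passed to ggmsa() in place of the file path. The following is a minimal sketch under that assumption; the names aa_set and p_obj are illustrative only.

```r
library(ggmsa)
library(Biostrings)

# Path to the protein example alignment shipped with ggmsa
protein_sequences <- system.file("extdata", "sample.fasta", package = "ggmsa")

# Read the FASTA file into an AAStringSet object
aa_set <- readAAStringSet(protein_sequences)

# Pass the object instead of the file path; font = NULL draws colored tiles only
p_obj <- ggmsa(aa_set, start = 300, end = 350, color = "Clustal", font = NULL)
p_obj
```

Working from an object is convenient when the sequences have already been loaded for another step, such as computing distances for a tree.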
Basic use: MSA Visualization
The simplest way to use ggmsa:
ggmsa(protein_sequences, 300, 350, color = "Clustal", font = "DroidSansMono", char_width = 0.5, seq_name = TRUE)
Color Schemes
ggmsa ships with several predefined color schemes for rendering MSAs. In the same way as above, use available_colors() to list the color schemes currently available. Note that schemes for amino acids (protein) and nucleotides (DNA/RNA) have different names.
available_colors()
#> color schemes for nucleotide sequences currently available:
#> Chemistry_NT Shapely_NT Taylor_NT Zappo_NT
#> color schemes for AA sequences currently available:
#> Clustal Chemistry_AA Shapely_AA Zappo_AA Taylor_AA LETTER CN6 Hydrophobicity
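Because the scheme names differ by sequence type, a nucleotide alignment needs one of the *_NT schemes. A short sketch using the nucleotide example file from the import step (the name p_nt is illustrative only):

```r
library(ggmsa)

# Path to the nucleotide example alignment shipped with ggmsa
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")

# Use a nucleotide scheme from available_colors(); font = NULL draws tiles without letters
p_nt <- ggmsa(nt_sequences, font = NULL, color = "Chemistry_NT")
p_nt
```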
Font
Several predefined fonts are shipped with ggmsa. Users can call available_fonts() to list the fonts currently available.
available_fonts()
#> font families currently available:
#> helvetical mono TimesNewRoman DroidSansMono
MSA Annotation
ggmsa supports annotations for MSAs. As in ggplot2, annotations are implemented as geom layers, and users add them with +, like this: ggmsa() + geom_*(). Automatically generated annotations containing colored labels and symbols are overlaid on the MSA to indicate potentially conserved or divergent regions.
For example, visualizing multiple sequence alignment with sequence logo and bar chart:
ggmsa(protein_sequences, 221, 280, seq_name = TRUE, char_width = 0.5) + geom_seqlogo(color = "Chemistry_AA") + geom_msaBar()
The following table lists the annotation layers supported by ggmsa:
Annotation modules | Type | Description |
---|---|---|
geom_seqlogo() | geometric layer | automatically generated sequence logos for an MSA |
geom_GC() | annotation module | shows GC content with a bubble chart |
geom_seed() | annotation module | highlights the seed region on miRNA sequences |
geom_msaBar() | annotation module | shows sequence conservation with a bar chart |
geom_helix() | annotation module | depicts RNA secondary structure as arc diagrams (needs extra data) |
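As one more illustration of the ggmsa() + geom_*() pattern, the GC content module from the table can be added to a nucleotide alignment. This is a sketch that assumes geom_GC() works with its default arguments; the name p_gc is illustrative only.

```r
library(ggmsa)

# Nucleotide example alignment shipped with ggmsa
nt_sequences <- system.file("extdata", "LeaderRepeat_All.fa", package = "ggmsa")

# Base MSA plot plus the GC content annotation module
p_gc <- ggmsa(nt_sequences, font = NULL, color = "Chemistry_NT") + geom_GC()
p_gc
```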
Combination Plot
ggmsa can accept and render other molecular datasets (e.g. gene structure diagrams, phylogenetic tree diagrams, gene arrow maps), and allows users to align such associated graphs with the MSA plot.
MSA + Tree
The most important part of phylogenetic analysis is the visual inspection, annotation and exploration of phylogenetic trees, so it is natural to align a phylogenetic tree with the MSA plot. Here, ggtree is used to plot the tree.
library(Biostrings)
x <- readAAStringSet(protein_sequences)
# pairwise Hamming distances, scaled by the alignment width
d <- as.dist(stringDist(x, method = "hamming")/width(x)[1])
library(ape)
tree <- bionj(d)
library(ggtree)
p <- ggtree(tree) + geom_tiplab()

# extract alignment positions 164-213 as tidy data for the facet panel
data = tidy_msa(x, 164, 213)
p + geom_facet(geom = geom_msa, data = data, panel = 'msa',
               font = NULL, color = "Chemistry_AA") +
    xlim_tree(1)
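The distance used for the tree above is the Hamming distance divided by the alignment width, that is, the proportion of alignment columns at which two sequences differ. The same quantity can be sketched in base R on a toy alignment; the three sequences below are made up for illustration and are not part of the ggmsa data.

```r
# Toy alignment: three made-up sequences of equal width
aln <- c(s1 = "MKTAYIAK", s2 = "MKTAHIAK", s3 = "MRTAHLAK")

# Split each sequence into single characters, one matrix row per sequence
chars <- do.call(rbind, strsplit(aln, ""))

# Proportion of alignment columns at which rows a and b differ
p_dist <- function(a, b) mean(chars[a, ] != chars[b, ])

# Fill a symmetric matrix of pairwise proportions
n <- nrow(chars)
m <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))
for (i in seq_len(n)) for (j in seq_len(n)) m[i, j] <- p_dist(i, j)

# Convert to a 'dist' object, the input type expected by ape::bionj()
d_toy <- as.dist(m)
d_toy
```

Here s1 and s2 differ at one of eight columns, so their distance is 0.125.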
Multiple graphs: MSA + Tree + gene map
A notable example integrates three associated panels: the MSA plot, the phylogenetic tree and the gene map, produced by ggmsa, ggtree and gggenes respectively.
#import data
tp53_sequences <- system.file("extdata", "tp53.fa", package = "ggmsa")
tp53_genes <- system.file("extdata", "TP53_genes.xlsx", package = "ggmsa")

#tree
tp53 <- readAAStringSet(tp53_sequences)
d <- as.dist(stringDist(tp53, method = "hamming")/width(tp53)[1])
tree <- bionj(d)
p_tp53 <- ggtree(tree, branch.length = 'none') + geom_tiplab()

#msa
data_53 <- tidy_msa(tp53)

#gene maps
TP53_arrow <- readxl::read_xlsx(tp53_genes)
TP53_arrow$direction <- 1
TP53_arrow[TP53_arrow$strand == "reverse", "direction"] <- -1

#color
library(RColorBrewer)
mapping = aes(xmin = start, xmax = end, fill = gene, forward = direction)
my_pal <- colorRampPalette(rev(brewer.pal(n = 10, name = "Set3")))

#tree + gene maps + msa
library(ggnewscale)
p_tp53 + geom_facet(geom = geom_msa, data = data_53,
                    panel = 'msa', font = NULL,
                    border = NA) + xlim_tree(3.5) +
    new_scale_fill() +
    scale_fill_manual(values = my_pal(10)) +
    geom_facet(geom = geom_motif,
               mapping = mapping, data = TP53_arrow,
               panel = 'genes', on = 'TP53',
               arrowhead_height = unit(3, "mm"),
               arrowhead_width = unit(1, "mm"))
Layout
Different layouts allow users to display more data in a given limited space. We offer two layouts to change MSA visualization styles.
Broken down layout
A long alignment can be broken down and displayed across several lines with facet_msa().
# 4 fields
ggmsa(protein_sequences, start = 0, end = 400, font = NULL, color = "Chemistry_AA") + facet_msa(field = 100)
Circular layout
The alignment can also be displayed in other tree layouts by linking to ggtreeExtra. geom_fruit will automatically align the MSA graph to the tree in layouts such as circular, fan and radial.
library(ggtree)
library(ggtreeExtra)
sequences <- system.file("extdata", "sequence-link-tree.fasta", package = "ggmsa")

x <- readAAStringSet(sequences)
d <- as.dist(stringDist(x, method = "hamming")/width(x)[1])
tree <- bionj(d)
data <- tidy_msa(x, 120, 200)

p1 <- ggtree(tree, layout = 'circular') +
    geom_tiplab(align = TRUE, offset = 0.545, size = 2) +
    xlim(NA, 1.3)
p1 + geom_fruit(data = data, geom = geom_msa, offset = 0,
                pwidth = 1.2, font = NULL, border = NA)