---
title: "Importing trees with data"
author: "Guangchuang Yu\\

        School of Public Health, The University of Hong Kong"
date: "`r Sys.Date()`"
bibliography: treeio.bib
biblio-style: apalike
output:
  prettydoc::html_pretty:
    toc: true
    theme: cayman
    highlight: github
  pdf_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{01 Importing trees with data}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
  %\VignetteEncoding{UTF-8}
---

```{r style, echo=FALSE, results="asis", message=FALSE}
knitr::opts_chunk$set(tidy = FALSE,
		   message = FALSE)
```


```{r echo=FALSE, results="hide", message=FALSE}
library(tidyr)
library(dplyr)
library(tidytree)
library(ggplot2)

library("treeio")

CRANpkg <- function (pkg) {
    cran <- "https://CRAN.R-project.org/package"
    fmt <- "[%s](%s=%s)"
    sprintf(fmt, pkg, cran, pkg)
}

Biocpkg <- function (pkg) {
    sprintf("[%s](http://bioconductor.org/packages/%s)", pkg, pkg)
}
```

# Introduction

Phylogenetic trees are commonly used to present evolutionary relationships of
species. Information associated with taxon species/strains may be further
analyzed in the context of the evolutionary history depicted by the phylogenetic
tree. For example, host information of the influenza virus strains in the tree
could be studied to understand host range of a virus linage. Moreover, such
meta-data (*e.g.*, isolation host, time, location, *etc.*) directly associated
with taxon strains are also often subjected to further evolutionary or
comparative phylogenetic models and analyses, to infer their dynamics associated
with the evolutionary or transmission processes of the virus. All these
meta-data or other phenotypic or experimental data are stored either as the
annotation data associated with the nodes or branches, and are often produced in
inconsistent format by different analysis programs.


Getting trees in to R is still limited. Newick and Nexus can be imported by
several packages, including `r CRANpkg("ape")`, `r CRANpkg("phylobase")`. NeXML
format can be parsed by `r CRANpkg("RNeXML")`. However, analysis results from
widely used software packages in this field are not well
supported. SIMMAP output can be parsed by `r CRANpkg("phyext2")` and `r CRANpkg("phytools")`.
Although [PHYLOCH](http://www.christophheibl.de/Rpackages.html) can
import BEAST and MrBayes output, only internal node attributes were parsed and
tip attributes were ignore. Many other software outputs are mainly required
programming expertise to import the tree with associated data. Linking external
data, including experimental and clinical data, to phylogeny is another obstacle
for evolution biologists.


The [treeio](https://bioconductor.org/packages/treeio/) package defines base
classes and functions for phylogenetic tree input and output. It is an
infrastructure that enables evolutionary evidences that inferred by commonly
used software packages to be used in `R`. For instance, *d~N~/d~S~* values or
ancestral sequences inferred
by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) [@yang_paml_2007],
clade support values (posterior) inferred
by [BEAST](http://beast2.org/) [@bouckaert_beast_2014] and short read placement
by
[EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html) [@berger_EPA_2011]
and [pplacer](http://matsen.fhcrc.org/pplacer/) [@matsen_pplacer_2010]. These
evolutionary evidences can be further analyzed in `R` and used to annotate
phylogenetic tree using [ggtree](https://bioconductor.org/packages/ggtree/)
[@yu_ggtree:_2017]. The growth of analysis tools and models available introduces
a challenge to integrate different varieties of data and analysis results from
different sources for an integral analysis on the the same phylogenetic tree
background. The [treeio](https://bioconductor.org/packages/treeio/) package
provides a `merge_tree` function to allow combining tree data obtained from
different sources. In addition, [treeio](https://bioconductor.org/packages/treeio/)
also enables external data to be linked to phylogenetic tree structure.


# Getting tree data from evolutionary analysis result

To fill the gap that most of the tree formats or software outputs cannot be
easily parsed with the same
software/platform, [treeio](https://bioconductor.org/packages/treeio/)
implemented several functions for parsing various tree file formats and outputs
of common evolutionary analysis software. Not only the tree structure can be
parsed but also the associated data and evolutionary inferences, including NHX
annotation, clock rate inferences (from [BEAST](http://beast2.org/)
or [r8s](http://ginger.ucdavis.edu/r8s) [@sanderson_r8s:_2003] programs),
snynonymous and non-synonymous substitutions (from CodeML), and ancestral
sequence construction (from
[HyPhy](https://veg.github.io/hyphy-site/),
[BaseML](http://abacus.gene.ucl.ac.uk/software/paml.html)
or [CodeML](http://abacus.gene.ucl.ac.uk/software/paml.html)), *etc.*.

Currently, [treeio](https://bioconductor.org/packages/treeio/) is able to read
the following file formats: Newick, Nexus, New Hampshire eXtended format (NHX),
jplace and Phylip as well as the data outputs from the following analysis programs:
[BEAST](http://beast2.org/),
[EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html),
[HyPhy](https://veg.github.io/hyphy-site/),
[MrBayes](http://nbisweden.github.io/MrBayes/),
[PAML](http://abacus.gene.ucl.ac.uk/software/paml.html),
[PHYLDOG](https://pbil.univ-lyon1.fr/software/phyldog/),
[pplacer](http://matsen.fhcrc.org/pplacer/),
[r8s](http://ginger.ucdavis.edu/r8s),
[RAxML](http://evomics.org/learning/phylogenetics/raxml/) and
[RevBayes](https://revbayes.github.io/intro.html).


```{r treeio-function, echo=F, message=FALSE}
ff <- matrix(c(
  "read.beast"      , "parsing output of BEAST",
  "read.codeml"     , "parsing output of CodeML (rst and mlc files)",
  "read.codeml_mlc" , "parsing mlc file (output of CodeML)",
  "read.hyphy"      , "parsing output of HYPHY",
  "read.jplace"     , "parsing jplace file including output of EPA and pplacer",
  "read.mrbayes"    , "parsing output of MrBayes",
  "read.newick"     , "parsing newick string, with ability to parse node label as support values",
  "read.nhx"        , "parsing NHX file including output of PHYLDOG and RevBayes",
  "read.paml_rst"   , "parsing rst file (output of BaseML or CodeML)",
  "read.phylip"     , "parsing phylip file (phylip alignment + newick string)",
  "read.r8s"        , "parsing output of r8s",
  "read.raxml"      , "parsing output of RAxML"
  ), ncol=2, byrow=TRUE)
ff <- as.data.frame(ff)
colnames(ff) <- c("Parser function", "Description")
knitr::kable(ff, caption = "Parser functions defined in treeio", booktabs = T)
```

After parsing, storage of the tree structure with associated data is made
through a S4 class, treedata, defined in the [treeio](https://bioconductor.org/packages/treeio/) package. These parsed data
are mapped to the tree branches and nodes inside `treedata` object, so that they
can be efficiently used to visually annotate the tree
using [ggtree](https://bioconductor.org/packages/ggtree/) package [@yu_ggtree:_2017].
[treeio](https://bioconductor.org/packages/treeio/) provides functions to merge these phylogeny-associated data for
comparison and further analysis.


## Parsing BEAST output

```{r}
file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
beast
```

Since _`%`_ is not a valid character in _`names`_, all the feature names that contain _`x%`_ will convert to _`0.x`_. For example, _`length_95%_HPD`_ will be changed to _`length_0.95_HPD`_.

The _`get.fields`_ method return all available features that can be used for
annotation.

```{r}
get.fields(beast)
```


## Parsing MrBayes output

```{r}
file <- system.file("extdata/MrBayes", "Gq_nxs.tre", package="treeio")
read.mrbayes(file)
```

## Parsing PAML output

The `read.paml_rst` function can parse *rst* file
from [BASEML](http://abacus.gene.ucl.ac.uk/software/paml.html)
and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html). The only
difference is the space in the sequences.
For [BASEML](http://abacus.gene.ucl.ac.uk/software/paml.html), each ten bases
are separated by one space, while
for [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html), each three bases
(triplet) are separated by one space.

```{r fig.width=12, fig.height=10, warning=FALSE, fig.align="center"}
brstfile <- system.file("extdata/PAML_Baseml", "rst", package="treeio")
brst <- read.paml_rst(brstfile)
brst
```

Similarly, we can parse the *rst* file from [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html).

```{r}
crstfile <- system.file("extdata/PAML_Codeml", "rst", package="treeio")
## type can be one of "Marginal" or "Joint"
crst <- read.paml_rst(crstfile, type = "Joint")
crst
```

Ancestral sequences inferred by [BASEML](http://abacus.gene.ucl.ac.uk/software/paml.html)
or [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) via marginal or
joint ML reconstruction methods will be stored in the S4 object and mapped to
tree nodes. [treeio](https://bioconductor.org/packages/treeio/) will automatically determine the substitutions between the
sequences at the both ends of each branch. Amino acid substitution will also be
determined by translating nucleotide sequences to amino acid sequences. These
computed substitutions will also be stored in the S4 object.


[CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) infers selection
pressure and estimated *d~N~/d~S~*, *d~N~* and *d~S~*. These information are
stored in output file *mlc*, which can be parsed by `read.codeml_mlc` function.


```{r}
mlcfile <- system.file("extdata/PAML_Codeml", "mlc", package="treeio")
mlc <- read.codeml_mlc(mlcfile)
mlc
```


In previous session, we separately parsed *rst* and *mlc* files. However, they
can also be parsed together using `read.codeml` function.


```{r}
## tree can be one of "rst" or "mlc" to specify
## using tree from which file as base tree in the object
ml <- read.codeml(crstfile, mlcfile, tree = "mlc")
ml
```


All the features in both *rst* and *mlc* files were imported into a single S4
object and hence are available for further annotation and visualization. For
example, we can annotate and display both *d~N~/d~S~* (from *mlc* file) and
amino acid substitutions (derived from *rst* file) on the same phylogenetic tree.


## Parsing HyPhy output


Ancestral sequences inferred by [HyPhy](https://veg.github.io/hyphy-site/) are
stored in the Nexus output file, which contains the tree topology and ancestral
sequences. To parse this data file, users can use the `read.hyphy.seq` function.


```{r warning=FALSE}
ancseq <- system.file("extdata/HYPHY", "ancseq.nex", package="treeio")
read.hyphy.seq(ancseq)
```

To map the sequences on the tree, user shall also provide an
internal-node-labelled tree. If users want to determine substitution, they need
also provide tip sequences.

```{r warning=FALSE}
nwk <- system.file("extdata/HYPHY", "labelledtree.tree", package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
hy
```




## Parsing r8s output

[r8s](http://loco.biosci.arizona.edu/r8s/) uses parametric, semiparametric and
nonparametric methods to relax molecular clock to allow better estimations of
divergence times and evolution rates [@@sanderson_r8s:_2003]. It outputs three
trees in log file, namely *TREE*, *RATO* and *PHYLO* for time tree, rate tree
and absolute substitution tree respectively.


Time tree is scaled by divergence time, rate tree is scaled by substitution rate
and absolute substitution tree is scaled by absolute number of substitution.
After parsing the file, all these three trees are stored in a *multiPhylo* object.


```{r fig.width=4, fig.height=6, width=60, warning=FALSE, fig.align="center"}
r8s <- read.r8s(system.file("extdata/r8s", "H3_r8s_output.log", package="treeio"))
r8s
```


## Parsing output of RAxML bootstraping analysis


[RAxML](http://evomics.org/learning/phylogenetics/raxml/) bootstraping analysis
output a Newick tree text that is not standard as it stores bootstrap values
inside square brackets after branch lengths. This file usually cannot be parsed
by traditional Newick parser, such as `ape::read.tree`. The function
`read.raxml` can read such file and stored the bootstrap as an additional
features, which can be used to display on the tree or used to color tree
branches, *etc.*.

```{r fig.width=12, fig.height=10, width=60, warning=FALSE, fig.align="center"}
raxml_file <- system.file("extdata/RAxML", "RAxML_bipartitionsBranchLabels.H3", package="treeio")
raxml <- read.raxml(raxml_file)
raxml
```


## Parsing NHX tree

NHX (New Hampshire eXtended) format is an extension of Newick by introducing NHX
tags. NHX is commonly used in phylogenetics software
(including
[PHYLDOG](http://pbil.univ-lyon1.fr/software/phyldog/) [@boussau_genome-scale_2013],
[RevBayes](http://revbayes.github.io/intro.html) [@hohna_probabilistic_2014])
for storing statistical inferences. The following codes imported a NHX tree with
associated data inferred by PHYLDOG.

```{r}
nhxfile <- system.file("extdata/NHX", "phyldog.nhx", package="treeio")
nhx <- read.nhx(nhxfile)
nhx
```

## Parsing Phylip tree

Phylip format contains multiple sequence alignment of taxa in Phylip sequence
format with corresponding Newick tree text that was built from taxon sequences.
Sequence alignment can be sorted based on the tree structure and displayed at
the right hand side of the tree using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017].

```{r}
phyfile <- system.file("extdata", "sample.phy", package="treeio")
phylip <- read.phylip(phyfile)
phylip
```

## Parsing EPA and pplacer output

[EPA](http://sco.h-its.org/exelixis/web/software/epa/index.html)
[@berger_EPA_2011] and [PPLACER](http://matsen.fhcrc.org/pplacer/)
[@matsen_pplacer_2010] have common output file format, `jplace`, which can be
parsed by `read.jplace()` function.

```{r}
jpf <- system.file("extdata/EPA.jplace",  package="treeio")
jp <- read.jplace(jpf)
print(jp)
```

The number of evolutionary placement on each branch will be calculated and
stored as the *nplace* feature, which can be mapped to line size and/or color
using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017].


## Parsing jtree format{#jtree}

The *jtree* is a JSON based format that was defined in
this [treeio](https://bioconductor.org/packages/treeio/) package to support [tree
data inter change](Exporter.html#jtree).
Phylogenetic tree with associated data can be exported to a single *jtree*
file using `write.jtree` function. The *jtree* can be easily parsed using any
JSON parser. The *jtree* format contains three keys: tree, data and metadata.
The tree value contains tree text extended from Newick tree format by putting
the edge number in curly braces after branch length. The data value contains
node/branch-specific data, while metadata value contains additional meta information.


```{r}
jtree_file <- tempfile(fileext = '.jtree')
write.jtree(beast, file = jtree_file)
read.jtree(file = jtree_file)
```


# Linking external data to phylogeny

In addition to analysis findings that are associated with the tree as we showed
above, there is a wide range of heterogeneous data, including phenotypic data,
experimental data and clinical data *etc.*, that need to be integrated and
linked to phylogeny. For example, in the study of viral evolution, tree nodes may
associated with epidemiological information, such as location, age and subtype.
Functional annotations may need to be mapped on gene trees for comparative
genomics studies. To facilitate data
integration, [treeio](https://bioconductor.org/packages/treeio) provides
`full_join` method to link external data to phylogeny and stored in `treedata` object.


Here are examples of linking external data to a phylogenetic tree. After that,
we can use [exporter](Exporter.html) to combine the tree and the data to a
single tree file. The data that mapped on the phylogenetic tree can also be used to visualize or
annotate the tree using [ggtree](https://bioconductor.org/packages/ggtree/)
[@yu_ggtree:_2017].


```{r}
x <- data_frame(label = as.phylo(beast)$tip.label, trait = rnorm(Ntip(beast)))
full_join(beast, x, by="label")

N <- Nnode2(beast)
y <- data_frame(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N))
full_join(beast, y, by="node")
```


# Combining tree data

The [treeio](https://bioconductor.org/packages/treeio/) package serves as an
infrastructure that enables various types of phylogenetic data inferred from
common analysis programs to be imported and used in R. For instance *d~N~/d~S~*
or ancestral sequences estimated
by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html),
and clade support values (posterior) inferred
by [BEAST](http://beast2.org/)/[MrBayes](http://nbisweden.github.io/MrBayes/).
In addition, [treeio](https://bioconductor.org/packages/treeio/) package
supports linking external data to phylogeny. It brings these external
phylogenetic data (either from software output or exteranl sources) to the R
community and make it available for further analysis in R.
Furthermore, [treeio](https://bioconductor.org/packages/treeio/) can combine
multiple phylogenetic trees together into one with their node/branch-specific
attribute data. Essentially, as a result, one such attribute (*e.g.*,
substitution rate) can be mapped to another attribute (*e.g.*, *d~N~/d~S~*) of
the same node/branch for comparison and further computations.


A previously published data set, seventy-six H3 hemagglutinin gene sequences of
a lineage containing swine and human influenza A viruses
[@liang_expansion_2014], was here to demonstrate the utilities of comparing
evolutionary statistics inferred by different software. The dataset was
re-analyzed by [BEAST](http://beast2.org/) for timescale estimation
and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) for synonymous and
non-synonymous substitution estimation. In this example, we first parsed the
outputs from [BEAST](http://beast2.org/) using `read.beast` and
from [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) using
`read.codeml` into two `treedata` objects. Then the two objects containing
separate sets of node/branch-specific data were merged via the `merge_tree` function.



```{r}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
rst_file <- system.file("examples/rst", package="ggtree")
mlc_file <- system.file("examples/mlc", package="ggtree")
beast_tree <- read.beast(beast_file)
codeml_tree <- read.codeml(rst_file, mlc_file)

merged_tree <- merge_tree(beast_tree, codeml_tree)
merged_tree
```

After merging the `beast_tree` and `codeml_tree` objects, all
node/branch-specific data imported from [BEAST](http://beast2.org/)
and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) output files are
all available in the `merged_tree` object. The tree object was converted to
tidy data frame using [tidytree](https://cran.r-project.org/package=tidytree)
package and visualized as hexbin scatterplot of *d~N~/d~S~*, *d~N~* and *d~S~* inferred
by [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html) versus *rate*
(substitution rate in unit of substitutions/site/year) inferred
by [BEAST](http://beast2.org/) on the same branches.

```{r warning=FALSE, fig.width=9, fig.height=3}
library(tidytree)
library(ggplot2)

as_data_frame(merged_tree) %>%
    dplyr::select(dN_vs_dS, dN, dS, rate) %>%
    subset(dN_vs_dS >=0 & dN_vs_dS <= 1.5) %>%
    tidyr::gather(type, value, dN_vs_dS:dS) %>%
    ggplot(aes(rate, value)) + geom_hex() +
    facet_wrap(~factor(type, levels = c('dN_vs_dS', 'dN', 'dS')),
               scale='free_y') +
    ylab(NULL)
```

Using `merge_tree`, we are able to compare analysis results using identical
model from different software packages or different models using different or
identical software. It also allows users to integrate different analysis finding
from different software packages. Merging tree data is not restricted to
software findings, associating external data to analysis findings is also
granted. The `merge_tree` function is chainable and allows several tree objects
to be merged into one.


```{r}
phylo <- as.phylo(beast_tree)
N <- Nnode2(phylo)
d <- data_frame(node = 1:N, fake_trait = rnorm(N), another_trait = runif(N))
fake_tree <- treedata(phylo = phylo, data = d)
triple_tree <- merge_tree(merged_tree, fake_tree)
triple_tree
```

The `triple_tree` object showed above contains analysis results obtained from [BEAST](http://beast2.org/)
and [CODEML](http://abacus.gene.ucl.ac.uk/software/paml.html), and evolutionary
trait from external sources. All these information can be used to annotate the
tree using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017].

# Getting information from *treedata* object

After the tree was imported, users may want to extract information that stored
in the `treedata` object. `r Biocpkg("treeio")` provides several accessor
methods to extract tree structure, features/attributes that stored in the object
and their corresponding values.

The `get.tree` or `as.phylo` methods can convert the `treedata` object to
`phylo` object which is the fundamental tree object in the R community and
many packages work with `phylo` object.

```{r}
# or get.tree
as.phylo(beast_tree)
```

The `get.fields` method return a vector of features/attributes that stored in
the object and associated with the phylogeny.

```{r}
get.fields(beast_tree)
```

The `get.data` method return a tibble of all the associated data.


```{r}
get.data(beast_tree)
```

If users are only interesting a subset of the features/attributes return by
`get.fields`, they can extract the information from the output of `get.data` or
directly subset the data by `[` or `[[`.

```{r}
beast_tree[, c("node", "height")]
head(beast_tree[["height_median"]])
```


# Manipulating tree data using *tidytree*

All the tree data parsed/merged
by [treeio](https://bioconductor.org/packages/treeio/) can be converted to tidy
data frame using the [tidytree](https://cran.r-project.org/package=tidytree)
package. The [tidytree](https://cran.r-project.org/package=tidytree) package
provides tidy interfaces to manipulate tree with associated data. For instances,
external data can be linked to phylogeny or evolutionary data obtained from
different sources can be merged using tidyverse verbs. After the tree data was
manipulated, it can be converted back to `treedata` object and [exported to a
single tree file](Exporter.html), further analyzed in R or visualized using [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017].


For more details, please refer to the [tidytree package vignette](https://cran.r-project.org/web/packages/tidytree/vignettes/tidytree.html).


# Visualizing tree data with *ggtree*

[treeio](https://bioconductor.org/packages/treeio/) is seamlessly integrated
into the [ggtree](https://bioconductor.org/packages/ggtree/) [@yu_ggtree:_2017]
package and all the information either directly imported or linking from
external sources can be used to visualize and annotate the tree.

See the `r Biocpkg("ggtree")` package vignettes for more details:

+ [Tree Visualization](treeVisualization.html)
+ [Tree Manipulation](treeManipulation.html)
+ [Tree Annotation](treeAnnotation.html)
+ [Phylomoji](https://cran.r-project.org/web/packages/emojifont/vignettes/phylomoji.html)
+ [Annotating phylogenetic tree with images](https://guangchuangyu.github.io/software/ggtree/vignettes/ggtree-ggimage.html)
+ [Annotate a phylogenetic tree with insets](https://guangchuangyu.github.io/software/ggtree/vignettes/ggtree-inset.html)


# Need helps?


If you have questions/issues, please visit
[treeio homepage](https://guangchuangyu.github.io/software/treeio/) first.
Your problems are mostly documented. If you think you found a bug, please follow
[the guide](https://guangchuangyu.github.io/2016/07/how-to-bug-author/) and
provide a reproducible example to be posted on
[github issue tracker](https://github.com/GuangchuangYu/treeio/issues).
For questions, please post
to [Bioconductor support site](https://support.bioconductor.org/) and tag your
post with *treeio*.


For Chinese user, you can follow me on [WeChat (å¾®ä¿¡)](https://guangchuangyu.github.io/blog_images/biobabble.jpg).


# References