1 Introduction
2 TreeSummarizedExperiment
3 Aggregation
4 Aggregation on the taxonomic table
5 Additional
- 5.1 More about functions working on the phylo object.
- 5.2 Customize functions for the TreeSummarizedExperiment class
6 Session Info
Reference

1 Introduction

The TreeSummarizedExperiment class is an extension of the SingleCellExperiment class (Lun and Risso 2019). It’s used to store rectangular data of experimental results as in a SingleCellExperiment, and also supports the storage of a hierarchical structure and its link information to the rectangular data.

2 TreeSummarizedExperiment

2.1 Anatomy of TreeSummarizedExperiment

Figure 1: The structure of the TreeSummarizedExperiment class

Compared with the SingleCellExperiment class, TreeSummarizedExperiment has four more slots.

rowTree: the hierarchical structure on the rows of the assays tables.
rowLinks: the link between rows of the assays tables and the rowTree.
colTree: the hierarchical structure on the columns of the assays tables.
colLinks: the link information between columns of assays tables and the colTree.

The rowTree and colTree could be empty (NULL) if no trees are available. Correspondingly, the rowLinks and colLinks would be NULL. All the other slots inTreeSummarizedExperimentare inherited fromSingleCellExperiment`.

The slots rowTree and colTree only accept the tree data as the phylo class. If the tree is available in other formats, one would need to convert it to phylo with other R packages. For example, the package treeio provides 12 functions to import different tree formats and output phylo object in the slot phylo.

2.2 Toy data

suppressPackageStartupMessages({
    library(TreeSummarizedExperiment)
    library(S4Vectors)
    library(ggtree)
    library(ape)})

We generate a assay_data with observations of 5 entities collected from 4 samples.

# assays data
assay_data <- rbind(rep(0, 4), matrix(1:16, nrow = 4))
colnames(assay_data) <- paste(rep(LETTERS[1:2], each = 2), 
                            rep(1:2, 2), sep = "_")
rownames(assay_data) <- paste("entity", seq_len(5), sep = "")
assay_data

##         A_1 A_2 B_1 B_2
## entity1   0   0   0   0
## entity2   1   5   9  13
## entity3   2   6  10  14
## entity4   3   7  11  15
## entity5   4   8  12  16

The descriptions of the 5 entities and 4 samples are given in the row_data and col_data, respectively.

# row data
row_data <- DataFrame(var1 = sample(letters[1:2], 5, replace = TRUE),
                    var2 = sample(c(TRUE, FALSE), 5, replace = TRUE),
                    row.names = rownames(assay_data))
row_data

## DataFrame with 5 rows and 2 columns
##                var1      var2
##         <character> <logical>
## entity1           b      TRUE
## entity2           b      TRUE
## entity3           b     FALSE
## entity4           a      TRUE
## entity5           a     FALSE

# column data
col_data <- DataFrame(gg = c(1, 2, 3, 3),
                    group = rep(LETTERS[1:2], each = 2), 
                    row.names = colnames(assay_data))
col_data

## DataFrame with 4 rows and 2 columns
##            gg       group
##     <numeric> <character>
## A_1         1           A
## A_2         2           A
## B_1         3           B
## B_2         3           B

The hierarchical structure of the 5 entities is denoted as row_tree. The hierarchical structure of the 4 samples is denoted as col_tree. We create them by using the function rtree from the package ape.

# Toy tree 1
set.seed(1)
row_tree <- rtree(5)
class(row_tree)

## [1] "phylo"

# Toy tree 2
set.seed(4)
col_tree <- rtree(4)
col_tree$tip.label <- colnames(assay_data)
col_tree$node.label <- c("All", "GroupA", "GroupB")

The created trees are phylo objects. The phylo object is actually a list with at least four elements: edge, tip.label, edge.length, and Nnode.

class(row_tree)

## [1] "phylo"

str(row_tree)

## List of 4
##  $ edge       : int [1:8, 1:2] 6 6 7 8 8 9 9 7 1 7 ...
##  $ tip.label  : chr [1:5] "t2" "t1" "t3" "t4" ...
##  $ edge.length: num [1:8] 0.0618 0.206 0.1766 0.687 0.3841 ...
##  $ Nnode      : int 4
##  - attr(*, "class")= chr "phylo"
##  - attr(*, "order")= chr "cladewise"

The package ggtree (Yu et al. 2017) has been used to visualize the tree. The node labels and node numbers are in blue and orange texts, respectively. The row_tree has no labels for internal nodes.

# Visualize the row tree
ggtree(row_tree, size = 2) +
geom_text2(aes(label = node), color = "darkblue",
                hjust = -0.5, vjust = 0.7, size = 6) +
geom_text2(aes(label = label), color = "darkorange",
            hjust = -0.1, vjust = -0.7, size = 6)

$\label{rtree} The structure of the row tree$

Figure 2: The structure of the row tree

The col_tree has labels for internal nodes.

# Visualize the column tree
ggtree(col_tree, size = 2) +
geom_text2(aes(label = node), color = "darkblue",
                hjust = -0.5, vjust = 0.7, size = 6) +
geom_text2(aes(label = label), color = "darkorange",
            hjust = -0.1, vjust = -0.7, size = 6)

$\label{ctree} The structure of the column tree$

Figure 3: The structure of the column tree

2.3 The construction of `TreeSummarizedExperiment`

The TreeSummarizedExperiment class is used to store the toy data: assay_data, row_data, col_data, col_tree and row_tree, To correctly store data, the link information between the rows (or columns) of assay_data and the nodes of the row_tree (or col_tree) is requried to provide via a charactor vector rowNodeLab (or colNodeLab). Those columns or rows that don’t match with any node of the tree structure are removed with warnings. The link data between the assays tables and the tree data is automatically generated in the construction.

Below shows an example to construct TreeSummarizedExperiment without the column tree.

# provide the node labels in rowNodeLab
node_lab <- row_tree$tip.label
row_tse <- TreeSummarizedExperiment(assays = list(assay_data),
                                rowData = row_data,
                                colData = col_data,
                                rowTree = row_tree,
                                rowNodeLab = node_lab)

When printing out row_tse, we see a similar message as SingleCellExperiment with four additional lines about rowLinks, rowTree, colLinks and colTree. Here, row_tse stores a row tree (phylo object), and the rowLinks has 5 rows that is exactly the same as the number of rows in the assays tables. More details about the link data could be found in Section 2.4.2.

row_tse

## class: TreeSummarizedExperiment 
## dim: 5 4 
## metadata(0):
## assays(1): ''
## rownames(5): entity1 entity2 entity3 entity4 entity5
## rowData names(2): var1 var2
## colnames(4): A_1 A_2 B_1 B_2
## colData names(2): gg group
## reducedDimNames(0):
## spikeNames(0):
## rowLinks: a LinkDataFrame (5 rows)
## rowTree: a phylo (5 leaves)
## colLinks: NULL
## colTree: NULL

If the row tree and the column tree are both available, the TreeSummarizedExperiment could be constructed similarly as below. Here, the column names of the assays table match with the node labels used in the column tree. So, we could omit the step of providing colNodeLab.

all(colnames(assay_data) %in% c(col_tree$tip.label, col_tree$node.label))

## [1] TRUE

both_tse <- TreeSummarizedExperiment(assays = list(assay_data),
                                rowData = row_data,
                                colData = col_data,
                                rowTree = row_tree,
                                rowNodeLab = node_lab,
                                colTree = col_tree)

Compared to row_tse, both_tse includes also a column tree. The column link data (colLinks) with 4 rows is automatically generated. The number of rows in the link data is decided by the column dimension of the assays tables.

both_tse

## class: TreeSummarizedExperiment 
## dim: 5 4 
## metadata(0):
## assays(1): ''
## rownames(5): entity1 entity2 entity3 entity4 entity5
## rowData names(2): var1 var2
## colnames(4): A_1 A_2 B_1 B_2
## colData names(2): gg group
## reducedDimNames(0):
## spikeNames(0):
## rowLinks: a LinkDataFrame (5 rows)
## rowTree: a phylo (5 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: a phylo (4 leaves)

2.4 The accessor functions

2.4.1 Assays, rowData, colData, and metadata

For slots inherited from the SingleCellExperiment class, the accessors are exactly the same as shown in SingleCellExperiment.

# to get the first table in the assays
(count <- assays(both_tse)[[1]])

##         A_1 A_2 B_1 B_2
## entity1   0   0   0   0
## entity2   1   5   9  13
## entity3   2   6  10  14
## entity4   3   7  11  15
## entity5   4   8  12  16

# to get row data
rowData(both_tse)

## DataFrame with 5 rows and 2 columns
##                var1      var2
##         <character> <logical>
## entity1           b      TRUE
## entity2           b      TRUE
## entity3           b     FALSE
## entity4           a      TRUE
## entity5           a     FALSE

# to get column data
colData(both_tse)

## DataFrame with 4 rows and 2 columns
##            gg       group
##     <numeric> <character>
## A_1         1           A
## A_2         2           A
## B_1         3           B
## B_2         3           B

# to get metadata: it's empty here
metadata(both_tse)

## list()

2.4.2 rowLinks, colLinks

The row link and column link could be accessed via rowLinks and colLinks, respectively. The output would be a LinkDataFrame object. The LinkDataFrame class is extended from the DataFrame class with the restriction that it has at least four columns: nodeLab, nodeLab_alias, nodeNum, and isLeaf. More details about the DataFrame class could be found in the S4Vectors package.

nodeLab: the labels of nodes on the tree
nodeLab_alias: the alias labels of nodes on the tree
nodeNum: the numbers of nodes on the tree
isLeaf: it’s to indicate whether the node is a leaf node

When a phylo tree is available in the rowTree, we could see a LinkDataFrame object in the rowLinks. The number of rows of rowLinks data matches with the number of rows of assays tables.

(rLink <- rowLinks(both_tse))

## LinkDataFrame with 5 rows and 4 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf
##   <character>   <character> <integer> <logical>
## 1          t2       alias_1         1      TRUE
## 2          t1       alias_2         2      TRUE
## 3          t3       alias_3         3      TRUE
## 4          t4       alias_4         4      TRUE
## 5          t5       alias_5         5      TRUE

class(rLink)

## [1] "LinkDataFrame"
## attr(,"package")
## [1] "TreeSummarizedExperiment"

showClass("LinkDataFrame")

## Class "LinkDataFrame" [package "TreeSummarizedExperiment"]
## 
## Slots:
##                                                             
## Name:           rownames             nrows          listData
## Class: character_OR_NULL           integer              list
##                                                             
## Name:        elementType   elementMetadata          metadata
## Class:         character DataTable_OR_NULL              list
## 
## Extends: 
## Class "DataFrame", directly
## Class "LinkDataFrame_Or_NULL", directly
## Class "DataTable", by class "DataFrame", distance 2
## Class "SimpleList", by class "DataFrame", distance 2
## Class "DataTable_OR_NULL", by class "DataFrame", distance 3
## Class "List", by class "DataFrame", distance 3
## Class "Vector", by class "DataFrame", distance 4
## Class "list_OR_List", by class "DataFrame", distance 4
## Class "Annotated", by class "DataFrame", distance 5

nrow(rLink) == nrow(both_tse)

## [1] TRUE

Similarly, the number of rows of colLinks data matches with the number of columns of assays table.

(cLink <- colLinks(both_tse))

## LinkDataFrame with 4 rows and 4 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf
##   <character>   <character> <integer> <logical>
## 1         A_1       alias_1         1      TRUE
## 2         A_2       alias_2         2      TRUE
## 3         B_1       alias_3         3      TRUE
## 4         B_2       alias_4         4      TRUE

nrow(cLink) == ncol(both_tse)

## [1] TRUE

If the tree is not available, the corresponding link data is NULL.

colTree(row_tse)

## NULL

colLinks(row_tse)

## NULL

The link data is automatically generated when constructing the TreeSummarizedExperiment object. We highly recommend users not to modify it manually; otherwise the link might be broken. For R packages developers, we show in the Section 5.2 about how to update the link.

2.5 The subseting function

We could use [ to subset the TreeSummarizedExperiment. To keep track of the original data, the rowTree and colTree stay the same in the subsetting.

sub_tse <- both_tse[1:2, 1]
sub_tse

## class: TreeSummarizedExperiment 
## dim: 2 1 
## metadata(0):
## assays(1): ''
## rownames(2): entity1 entity2
## rowData names(2): var1 var2
## colnames(1): A_1
## colData names(2): gg group
## reducedDimNames(0):
## spikeNames(0):
## rowLinks: a LinkDataFrame (2 rows)
## rowTree: a phylo (5 leaves)
## colLinks: a LinkDataFrame (1 rows)
## colTree: a phylo (4 leaves)

The annotation data on the row and column dimension is changed accordingly.

# The first four columns are from rowLinks data and the others from rowData
cbind(rowLinks(sub_tse), rowData(sub_tse))

## DataFrame with 2 rows and 6 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf        var1      var2
##   <character>   <character> <integer> <logical> <character> <logical>
## 1          t2       alias_1         1      TRUE           b      TRUE
## 2          t1       alias_2         2      TRUE           b      TRUE

# The first four columns are from colLinks data and the others from colData
cbind(colLinks(sub_tse), colData(sub_tse))

## DataFrame with 1 row and 6 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf        gg       group
##   <character>   <character> <integer> <logical> <numeric> <character>
## 1         A_1       alias_1         1      TRUE         1           A

3 Aggregation

The aggregation is allowed on the row and the column dimension.

3.1 The column dimension

Here, we show the aggregation on the column dimension. The TreeSummarizedExperiment object is assigned to the argument x. The desired aggregation level is given in colLevel. The level could be specified via the node label (the orange texts in Figure 3) or the node number (the blue texts in Figure 3). We could further decide how to aggregate via the argument FUN.

# use node labels to specify colLevel
aggCol <- aggValue(x = both_tse, 
                   colLevel = c("GroupA", "GroupB"),
                   FUN = sum)
# or use node numbers to specify colLevel
aggCol <- aggValue(x = both_tse, colLevel = c(6, 7), FUN = sum)

assays(aggCol)[[1]]

##         alias_6 alias_7
## entity1       0       0
## entity2      15      14
## entity3      18      16
## entity4      21      18
## entity5      24      20

The rowData doesn’t change, but the colData adjusts with the change of the table. For example, the column group has the A value for GroupA because the descendant nodes of GroupA all have the value A; the column gg has the NA value for GroupA because the descendant nodes of GroupA have different values, (1 and 2).

# before aggregation
colData(both_tse)

## DataFrame with 4 rows and 2 columns
##            gg       group
##     <numeric> <character>
## A_1         1           A
## A_2         2           A
## B_1         3           B
## B_2         3           B

# after aggregation
colData(aggCol)

## DataFrame with 2 rows and 2 columns
##                gg     group
##         <logical> <logical>
## alias_6        NA        NA
## alias_7        NA        NA

The colLinks is updated to link the new rows of assays tables and the column tree.

# the link data is updated
colLinks(aggCol)

## LinkDataFrame with 2 rows and 4 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf
##   <character>   <character> <integer> <logical>
## 1      GroupA       alias_6         6     FALSE
## 2      GroupB       alias_7         7     FALSE

From the Figure 2, we could see that the nodes 6 and 7 are labelled with GroupA and GroupB, respectively. This agrees with the column link data.

3.2 The row dimension

It’s similar to the aggregation on the row dimension, except that the level should be specified via rowLevel.

agg_row <- aggValue(x = both_tse, rowLevel = 7:9, FUN = sum)

Now, the output assays table has 3 rows.

assays(agg_row)[[1]]

##         A_1 A_2 B_1 B_2
## alias_7  10  26  42  58
## alias_8   6  18  30  42
## alias_9   5  13  21  29

We could see which row corresponds to which nodes via the rowLinks data.

rowLinks(agg_row)

## LinkDataFrame with 3 rows and 4 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf
##   <character>   <character> <integer> <logical>
## 1          NA       alias_7         7     FALSE
## 2          NA       alias_8         8     FALSE
## 3          NA       alias_9         9     FALSE

The Figure 2 shows that the nodes 7, 8 and 9 have no labels. Therefore, the nodeLab column in LinkData of the row data has missing value. They are all internal nodes and hence the column isLeaf has only FALSE value.

3.3 Both dimensions

The aggregation on both row and column dimensions could be performed in one step using the same function specified via FUN. If different functions are required for different dimension, it’s suggested to do it in two steps as described in Section 3.2 and Section 3.1 because the order of aggregation might matter.

agg_both <- aggValue(x = both_tse, colLevel = c(6, 7), 
                    rowLevel = 7:9, FUN = sum)

As expected, we obtain a table with 3 rows (rowLevel = 7:9) and 2 columns (colLevel = c(6, 7)).

assays(agg_both)[[1]]

##         alias_6 alias_7
## alias_7      78      68
## alias_8      54      48
## alias_9      39      34

4 Aggregation on the taxonomic table

In some case, the information of the hierarchical structure is available as a data.frame instead of the phylo object mentioned above. To do the work listed above, we could convert the data.frame to the phylo class.

The function toTree outputs the hierarchical information into a phylo object. If the data set is large, we suggest to allow cache = TRUE to speed up the aggregation step.

# The toy taxonomic table
taxa <- data.frame(Kindom = rep("A", 5),
                     Phylum = c("B1", rep("B2", 4)),
                     Class = c("C1", "C2", "C3", "C3", NA),
                     OTU = c("D1", "D2", "D3", "D4", NA))
# convert to a phylo tree
taxa_tree <- toTree(data = taxa, cache = FALSE)
ggtree(taxa_tree)+
geom_text2(aes(label = node), color = "darkblue",
                hjust = -0.5, vjust = 0.7, size = 6) +
geom_text2(aes(label = label), color = "darkorange",
            hjust = -0.1, vjust = -0.7, size = 6) +
    geom_point2()

# construct a TreeSummarizedExperiment object
taxa_tse <- TreeSummarizedExperiment(assays = list(assay_data),
                                   rowData = row_data,
                                   rowTree = taxa_tree,
                                   rowNodeLab = taxa_tree$tip.label)

Here is about how to aggregate to the phylum level.

# specify the level
taxa_lab <- c(taxa_tree$tip.label, taxa_tree$node.label)
ii <- startsWith(taxa_lab, "Phylum:") 
(l1 <- taxa_lab[ii])

## [1] "Phylum:B1" "Phylum:B2"

# aggregate
agg_taxa <- aggValue(x = taxa_tse, rowLevel = l1, FUN = sum)

assays(agg_taxa)[[1]]

##         A_1 A_2 B_1 B_2
## alias_7   0   0   0   0
## alias_9  10  26  42  58

rowData(agg_taxa)

## DataFrame with 2 rows and 2 columns
##                var1      var2
##         <character> <logical>
## alias_7           b      TRUE
## alias_9          NA        NA

The aggregation could be on any freely combined level.

# specify the level
l2 <- c("Class:C3", "Phylum:B1")
# aggregate
agg_any <- aggValue(x = taxa_tse, rowLevel = l2, FUN = sum)

assays(agg_any)[[1]]

##          A_1 A_2 B_1 B_2
## alias_11   5  13  21  29
## alias_7    0   0   0   0

rowData(agg_any)

## DataFrame with 2 rows and 2 columns
##                 var1      var2
##          <character> <logical>
## alias_11          NA        NA
## alias_7            b      TRUE

5 Additional

5.1 More about functions working on the `phylo` object.

Here, we show some functions as examples to manipulate or to extract information from the phylo object. More functions could be found in other packages, such as ape (Paradis and Schliep 2018), tidytree. These functions might be useful when R package developers want to create their own functions to work on the TreeSummarizedExperiment class.

Below shows the node label (black texts) and node number (blue texts) of each node on an example tree.

ggtree(tinyTree, branch.length = "none") +
    geom_text2(aes(label = label), hjust = -0.3) +
    geom_text2(aes(label = node), vjust = -0.8,
               hjust = -0.3, color = 'blue')

5.1.1 print out nodes of the tree

We could specify to print out all nodes (type = "all"), the leaves (type = "leaf") or the internal nodes (type = "internal").

printNode(tree = tinyTree, type = "all")

##    nodeLab nodeLab_alias nodeNum isLeaf
## 1       t2        alias1       1   TRUE
## 2       t7        alias2       2   TRUE
## 3       t6        alias3       3   TRUE
## 4       t9        alias4       4   TRUE
## 5       t4        alias5       5   TRUE
## 6       t8        alias6       6   TRUE
## 7      t10        alias7       7   TRUE
## 8       t1        alias8       8   TRUE
## 9       t5        alias9       9   TRUE
## 10      t3       alias10      10   TRUE
## 11 Node_11      alias_11      11  FALSE
## 12 Node_12      alias_12      12  FALSE
## 13 Node_13      alias_13      13  FALSE
## 14 Node_14      alias_14      14  FALSE
## 15 Node_15      alias_15      15  FALSE
## 16 Node_16      alias_16      16  FALSE
## 17 Node_17      alias_17      17  FALSE
## 18 Node_18      alias_18      18  FALSE
## 19 Node_19      alias_19      19  FALSE

5.1.2 Count the number of nodes

# The number of leaves
countLeaf(tree = tinyTree)

## [1] 10

# The number of nodes (leaf nodes and internal nodes)
countNode(tree = tinyTree)

## [1] 19

5.1.3 Translation between the node label and the node number

The translation between the labels and the numbers of nodes could be achieved by the function transNode.

transNode(tree = tinyTree, node = c(12, 1, 4))

## [1] "Node_12" "t2"      "t9"

transNode(tree = tinyTree, node = c("t4", "Node_18"))

##      t4 Node_18 
##       5      18

5.1.4 find the descendants

To get descendants that are on the leaf level, we could set the argument only.leaf = TRUE.

# only the leaf nodes
findOS(tree = tinyTree, node = 17, only.leaf = TRUE)

## $Node_17
## [1] 6 4 5

The argument only.leaf = FALSE is set to get all descendants

# all descendant nodes
findOS(tree = tinyTree, node = 17, only.leaf = FALSE)

## $Node_17
## [1]  6  4 18  5

5.1.5 find the sibling node

The input node could be either the node label or the node number.

# node = 5, node = "t4" are the same node
findSibling(tree = tinyTree, node = 5)

## t9 
##  4

findSibling(tree = tinyTree, node = "t4")

## t9 
##  4

5.1.6 find the share node

This would find the first node that joined by the specified nodes (node) in the path to the root.

shareNode(tree = tinyTree, node = c(5, 6))

## Node_17 
##      17

5.1.7 identify a leaf node

isLeaf(tree = tinyTree, node = 5)

## [1] TRUE

isLeaf(tree = tinyTree, node = 17)

## [1] FALSE

5.1.8 calculate the distance between two nodes

The distance between any two nodes on the tree could be calculated by distNode.

distNode(tree = tinyTree, node = c(1, 5))

## [1] 2.699212

5.1.9 prune tree via leaves

We could specify the leaf nodes rmLeaf to remove parts of a tree. If mergeSingle = TRUE, the internal node that is connected to the removed leaf nodes is removed too; otherwise, it is kept.

NT1 <- pruneTree(tree = tinyTree, rmLeaf = c(4, 5),
                mergeSingle = TRUE)
ggtree(NT1, branch.length = "none") +
    geom_text2(aes(label = label), color = "darkorange",
               hjust = -0.1, vjust = -0.7) +
    geom_point2()

NT2 <- pruneTree(tree = tinyTree, rmLeaf = c(4, 5),
                mergeSingle = FALSE)
ggtree(NT2, branch.length = "none") +
    geom_text2(aes(label = label), color = "darkorange",
               hjust = -0.1, vjust = -0.7) +
    geom_point2()

5.1.10 convert a `phylo` object to a matrix

Each row gives a path that connects a leaf and the root.

matTree(tree = tinyTree)

##       L1 L2 L3 L4 L5 L6 L7
##  [1,]  1 13 12 11 NA NA NA
##  [2,]  2 14 13 12 11 NA NA
##  [3,]  3 14 13 12 11 NA NA
##  [4,]  4 18 17 16 15 12 11
##  [5,]  5 18 17 16 15 12 11
##  [6,]  6 17 16 15 12 11 NA
##  [7,]  7 19 16 15 12 11 NA
##  [8,]  8 19 16 15 12 11 NA
##  [9,]  9 15 12 11 NA NA NA
## [10,] 10 11 NA NA NA NA NA

5.2 Customize functions for the `TreeSummarizedExperiment` class

We show examples about how to create functions for the TreeSummarizedExperiment. R package developers could customize their functions based on the functions provided above on the phylo object or develop their own ones.

Here, a function rmRows is created to remove entities (on rows) that have zero in all samples (on columns) in the first assays table.

# dat: a TreeSummarizedExperiment
rmRows <- function(dat) {
    # calculate the total counts of each row
    count <- assays(dat)[[1]]
    tot <- apply(count, 1, sum)
    
    # find the row with zero in all columns
    ind <- which(tot == 0)
    
    # remove those rows
    out <- dat[-ind, ]
    return(out)
    
}
(rte <- rmRows(dat = both_tse))

## class: TreeSummarizedExperiment 
## dim: 4 4 
## metadata(0):
## assays(1): ''
## rownames(4): entity2 entity3 entity4 entity5
## rowData names(2): var1 var2
## colnames(4): A_1 A_2 B_1 B_2
## colData names(2): gg group
## reducedDimNames(0):
## spikeNames(0):
## rowLinks: a LinkDataFrame (4 rows)
## rowTree: a phylo (5 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: a phylo (4 leaves)

rowLinks(rte)

## LinkDataFrame with 4 rows and 4 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf
##   <character>   <character> <integer> <logical>
## 1          t1       alias_2         2      TRUE
## 2          t3       alias_3         3      TRUE
## 3          t4       alias_4         4      TRUE
## 4          t5       alias_5         5      TRUE

The function rmRows doesn’t update the tree data. To update the tree, we could do it as below with the help of ape::drop.tip.

updateRowTree <- function(tse, dropLeaf) {
    ## -------------- new tree: drop leaves ----------
    oldTree <- rowTree(tse)
    newTree <- ape::drop.tip(phy = oldTree, tip = dropLeaf)
    
    ## -------------- update the row link ----------
    # track the tree
    track <- trackNode(oldTree)
    track <- ape::drop.tip(phy = track, tip = dropLeaf)
    
    # row links
    rowL <- rowLinks(tse)
    rowL <- DataFrame(rowL)
    
    # update the row links: 
    #   1. use the alias label to track and updates the nodeNum
    #   2. the nodeLab should be updated based on the new tree using the new
    #      nodeNum
    #   3. lastly, update the nodeLab_alias
    rowL$nodeNum <- transNode(tree = track, node = rowL$nodeLab_alias,
                              message = FALSE)
    rowL$nodeLab <- transNode(tree = newTree, node = rowL$nodeNum, 
                              use.alias = FALSE, message = FALSE)
    rowL$nodeLab_alias <- transNode(tree = newTree, node = rowL$nodeNum, 
                                    use.alias = TRUE, message = FALSE)
    rowL$isLeaf <- isLeaf(tree = newTree, node = rowL$nodeNum)

    rowNL <- as(rowL, "LinkDataFrame")
    
    ## update the row tree and links
    newDat <- BiocGenerics:::replaceSlots(tse,
                                          rowLinks = rowNL,
                                          rowTree = list(phylo = newTree))
    return(newDat)
    
}

Now the row tree has four leaves.

# find the mismatch between the rows of the 'assays' table and the leaves of the
# tree
row_tree <- rowTree(rte)
row_link <- rowLinks(rte)
leaf_tree <- printNode(tree = row_tree,type = "leaf")$nodeNum
leaf_data <- row_link$nodeNum[row_link$isLeaf]
leaf_rm <- setdiff(leaf_tree, leaf_data)
ntse <- updateRowTree(tse = rte, dropLeaf = leaf_rm)

ntse

## class: TreeSummarizedExperiment 
## dim: 4 4 
## metadata(0):
## assays(1): ''
## rownames(4): entity2 entity3 entity4 entity5
## rowData names(2): var1 var2
## colnames(4): A_1 A_2 B_1 B_2
## colData names(2): gg group
## reducedDimNames(0):
## spikeNames(0):
## rowLinks: a LinkDataFrame (4 rows)
## rowTree: a phylo (4 leaves)
## colLinks: a LinkDataFrame (4 rows)
## colTree: a phylo (4 leaves)

rowLinks(ntse)

## LinkDataFrame with 4 rows and 4 columns
##       nodeLab nodeLab_alias   nodeNum    isLeaf
##   <character>   <character> <integer> <logical>
## 1          t1       alias_1         1      TRUE
## 2          t3       alias_2         2      TRUE
## 3          t4       alias_3         3      TRUE
## 4          t5       alias_4         4      TRUE

6 Session Info

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ape_5.3                        ggtree_1.16.1                 
##  [3] TreeSummarizedExperiment_1.0.3 SingleCellExperiment_1.6.0    
##  [5] SummarizedExperiment_1.14.0    DelayedArray_0.10.0           
##  [7] BiocParallel_1.18.0            matrixStats_0.54.0            
##  [9] Biobase_2.44.0                 GenomicRanges_1.36.0          
## [11] GenomeInfoDb_1.20.0            IRanges_2.18.1                
## [13] S4Vectors_0.22.0               BiocGenerics_0.30.0           
## [15] BiocStyle_2.12.0              
## 
## loaded via a namespace (and not attached):
##  [1] treeio_1.8.1           tidyselect_0.2.5       xfun_0.7              
##  [4] purrr_0.3.2            lattice_0.20-38        colorspace_1.4-1      
##  [7] htmltools_0.3.6        yaml_2.2.0             rlang_0.3.4           
## [10] pillar_1.4.1           glue_1.3.1             GenomeInfoDbData_1.2.1
## [13] rvcheck_0.1.3          plyr_1.8.4             stringr_1.4.0         
## [16] zlibbioc_1.30.0        munsell_0.5.0          gtable_0.3.0          
## [19] evaluate_0.14          labeling_0.3           knitr_1.23            
## [22] highr_0.8              Rcpp_1.0.1             scales_1.0.0          
## [25] BiocManager_1.30.4     jsonlite_1.6           XVector_0.24.0        
## [28] ggplot2_3.1.1          digest_0.6.19          stringi_1.4.3         
## [31] bookdown_0.11          dplyr_0.8.1            grid_3.6.0            
## [34] tools_3.6.0            bitops_1.0-6           magrittr_1.5          
## [37] RCurl_1.95-4.12        lazyeval_0.2.2         tibble_2.1.2          
## [40] tidyr_0.8.3            crayon_1.3.4           pkgconfig_2.0.2       
## [43] tidytree_0.2.4         Matrix_1.2-17          assertthat_0.2.1      
## [46] rmarkdown_1.13         R6_2.4.0               nlme_3.1-140          
## [49] compiler_3.6.0

Reference

Lun, Aaron, and Davide Risso. 2019. SingleCellExperiment: S4 Classes for Single Cell Data.

Paradis, E., and K. Schliep. 2018. “Ape 5.0: An Environment for Modern Phylogenetics and Evolutionary Analyses in R.” Bioinformatics 35:526–28.

Yu, Guangchuang, David Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1):28–36. https://doi.org/10.1111/2041-210X.12628.

Introduction to TreeSummarizedExperiment

3 June 2019

Package

Contents

1 Introduction

2 TreeSummarizedExperiment

2.1 Anatomy of TreeSummarizedExperiment

2.2 Toy data

2.3 The construction of `TreeSummarizedExperiment`

2.4 The accessor functions

2.4.1 Assays, rowData, colData, and metadata

2.4.2 rowLinks, colLinks

2.5 The subseting function

3 Aggregation

3.1 The column dimension

3.2 The row dimension

3.3 Both dimensions

4 Aggregation on the taxonomic table

5 Additional

5.1 More about functions working on the `phylo` object.

5.1.1 print out nodes of the tree

5.1.2 Count the number of nodes

5.1.3 Translation between the node label and the node number

5.1.4 find the descendants

5.1.5 find the sibling node

5.1.7 identify a leaf node

5.1.8 calculate the distance between two nodes

5.1.9 prune tree via leaves

5.1.10 convert a `phylo` object to a matrix

5.2 Customize functions for the `TreeSummarizedExperiment` class

6 Session Info

Reference

Introduction to TreeSummarizedExperiment

3 June 2019

Package

Contents

1 Introduction

2 TreeSummarizedExperiment

2.1 Anatomy of TreeSummarizedExperiment

2.2 Toy data

2.3 The construction of TreeSummarizedExperiment

2.4 The accessor functions

2.4.1 Assays, rowData, colData, and metadata

2.4.2 rowLinks, colLinks

2.5 The subseting function

3 Aggregation

3.1 The column dimension

3.2 The row dimension

3.3 Both dimensions

4 Aggregation on the taxonomic table

5 Additional

5.1 More about functions working on the phylo object.

5.1.1 print out nodes of the tree

5.1.2 Count the number of nodes

5.1.3 Translation between the node label and the node number

5.1.4 find the descendants

5.1.5 find the sibling node

5.1.6 find the share node

5.1.7 identify a leaf node

5.1.8 calculate the distance between two nodes

5.1.9 prune tree via leaves

5.1.10 convert a phylo object to a matrix

5.2 Customize functions for the TreeSummarizedExperiment class

6 Session Info

Reference

2.3 The construction of `TreeSummarizedExperiment`

5.1 More about functions working on the `phylo` object.

5.1.10 convert a `phylo` object to a matrix

5.2 Customize functions for the `TreeSummarizedExperiment` class