library(TreeDimensionTest)
The package provides a tool to statistically assess the presence of a trajectory in data. The function “test.trajectory” implements the Tree Dimension Test (TDT). It takes as input a matrix with rows as observations and columns as features. The function returns a list containing the tree dimension test measure (tdt_measure), the tree dimension test effect (tdt_effect), the test statistic, the p.value for TDT, the number of leaves, and the diameter of the tree.
The example below applies TDT to simulated single-cell RNA-seq gene expression data that contains a trajectory. The dataset is stored in the matrix “input”, where rows are cells and columns are genes. TDT recognizes the presence of the trajectory, as indicated by the significant TDT p.value.
data_path = system.file('extdata', package = 'TreeDimensionTest')
input = readRDS(paste0(data_path,"/bifurcating_7.rds"))
# 'input' is a matrix of single-cell RNA-seq gene expression data with rows as cells and columns as genes
input = input$expression
# PCA and plot to visualize data
pc = prcomp(input)
plot(pc$x[,c(1:2)], main="PCA plot", xlab="PC1", ylab="PC2")
# Run the test.trajectory function on the matrix "input".
# dim.reduction is set to "pca", meaning dimensionality reduction is performed first
# using principal component analysis. The number of principal components is selected
# using the scree test. Set dim.reduction to "none" to skip dimensionality reduction.
# MST is set to "exact", so the exact MST is used. The alternative is the approximate
# and fast DualTreeBoruvka MST. Set MST to "boruvka" to use the approximate MST.
res = test.trajectory(input, dim.reduction = "pca", MST="exact")
# List containing the Tree Dimension Test measure (tdt_measure), Tree Dimension Test
# effect (tdt_effect), statistic, p.value, number of vertices that are leaves, and diameter of the tree.
# p.value is significant and tdt_effect is strong, indicating the presence of a trajectory.
res
#> $tdt_measure
#> [1] 5.591667
#>
#> $statistic
#> [1] 400.1192
#>
#> $tdt_effect
#> [1] 0.6958596
#>
#> $leaves
#> [1] 193
#>
#> $diameter
#> [1] 119
#>
#> $p.value
#> [1] 8.557247e-08
The data below are random and contain no trajectory. The p.value for the test is not significant.
mat = cbind(rnorm(1000), rnorm(1000))
res = test.trajectory(mat, dim.reduction = "none")
plot(mat)
res
#> $tdt_measure
#> [1] 7.201299
#>
#> $statistic
#> [1] 682.2852
#>
#> $tdt_effect
#> [1] 0.6822852
#>
#> $leaves
#> [1] 219
#>
#> $diameter
#> [1] 153
#>
#> $p.value
#> [1] 0.3879682
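The two results above can be read against a significance level. The sketch below is only an illustration: the helper `is_significant` and the 0.05 threshold are assumptions introduced here, not part of the package, and the p-values are copied from the printed outputs rather than recomputed.

```r
# Hypothetical helper (not part of TreeDimensionTest): flags a result list
# whose p.value falls below a chosen significance level (0.05 is an assumption).
is_significant = function(res, alpha = 0.05) {
  stopifnot(is.list(res), !is.null(res$p.value))
  res$p.value < alpha
}

# p-values copied from the two outputs printed above
res_trajectory = list(p.value = 8.557247e-08)  # simulated bifurcating data
res_random     = list(p.value = 0.3879682)     # random Gaussian data

is_significant(res_trajectory)
is_significant(res_random)
```

The same helper could be applied directly to the list returned by test.trajectory, since it only reads the p.value element.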
Function ‘compute.stats’ computes the tree dimension measure, the tree dimension effect, and the number of leaves and the diameter of the MST for a given input dataset.
mat = cbind(rnorm(1000), rnorm(1000))
res = compute.stats(mat, MST="boruvka", dim.reduction = "none")
res
#> $tdt_measure
#> [1] 7.16129
#>
#> $tdt_effect
#> [1] 0.6831817
#>
#> $leaves
#> [1] 222
#>
#> $diameter
#> [1] 154
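To make the leaves and diameter entries concrete, here is a base-R sketch that builds a Euclidean minimum spanning tree and counts both quantities. It is not the package's implementation: `prim_mst`, `bfs_hops`, and `tree_stats` are hypothetical names, the MST is built with a simple Prim's algorithm, and the diameter is assumed to be the number of edges on the longest path in the tree. The tree dimension measure and effect are not reproduced here.

```r
# Prim's algorithm on a dense Euclidean distance matrix (base R only).
# Returns an (n - 1) x 2 integer matrix of MST edges as vertex index pairs.
prim_mst = function(x) {
  d = as.matrix(dist(x))
  n = nrow(d)
  in_tree = c(TRUE, rep(FALSE, n - 1))
  best_dist = d[1, ]        # cheapest known link from the tree to each vertex
  best_from = rep(1L, n)    # tree endpoint of that link
  edges = matrix(0L, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    cand = which(!in_tree)
    v = cand[which.min(best_dist[cand])]
    edges[k, ] = c(best_from[v], v)
    in_tree[v] = TRUE
    closer = (!in_tree) & (d[v, ] < best_dist)
    best_dist[closer] = d[v, closer]
    best_from[closer] = v
  }
  edges
}

# Hop counts from `start` to every vertex of the tree, by breadth-first search.
bfs_hops = function(edges, n, start) {
  from = c(edges[, 1], edges[, 2])
  to   = c(edges[, 2], edges[, 1])
  adj  = split(to, factor(from, levels = seq_len(n)))
  hops = rep(NA_integer_, n)
  hops[start] = 0L
  queue = start
  while (length(queue) > 0) {
    v = queue[1]
    queue = queue[-1]
    nb = adj[[v]]
    nb = nb[is.na(hops[nb])]   # keep only unvisited neighbours
    hops[nb] = hops[v] + 1L
    queue = c(queue, nb)
  }
  hops
}

# Leaves are vertices of degree 1; the diameter is found with two BFS passes.
tree_stats = function(x) {
  n = nrow(x)
  edges = prim_mst(x)
  degree = tabulate(as.vector(edges), nbins = n)
  far = which.max(bfs_hops(edges, n, 1L))   # farthest vertex from an arbitrary start
  diameter = max(bfs_hops(edges, n, far))   # longest path, counted in edges
  list(leaves = sum(degree == 1), diameter = diameter)
}

# Five collinear points: the MST is a path with 2 leaves and diameter 4.
line_pts = cbind(1:5, 0)
tree_stats(line_pts)
```

Intuitively, a tree with few leaves and a long diameter looks path-like, while a tree with many leaves and a short diameter looks bushy.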
Function ‘separability’ measures how well observations of the same type are separated from observations of other types in a given dataset. Observations of the same type share the same label. The function takes as input a matrix x, with rows as observations and columns as features, and a vector of labels for the observations. It returns a separability value for each label type and an overall separability value. Separability values range from 0 to 1, with 1 being the highest separability. In the examples below, there are 3 types of observations, labeled L1, L2 and L3. A real application is single-cell data, where the labels could be cell types.
# Random data
mat = cbind(rnorm(200), rnorm(200))
# Labels for the samples in the data
labels = c(rep("L1", 93), rep("L2",78), rep("L3",29))
# Color vector of samples; each unique color corresponds to a unique label
cols = c(rep("blue",93), rep("green",78), rep("red",29))
# Plot an MST of the data, with samples of the same label highlighted by the same color
plotTree(mat,labels, node.size = 12, node.col = cols,main = "Low separability", legend.cord=c(-2.1,0.9))
# Compute separability of samples in mat
res = separability(mat, labels)
# List containing separability values for each label and the overall separability of the data.
# Overall separability is relatively low, implying samples with the same labels are mixed.
res
#> $label_separability
#>        L1        L2        L3
#> 0.5636364 0.5165563 0.2320000
#>
#> $overall_separability
#> [1] 0.4373976
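In the output printed above, the overall separability coincides with the unweighted mean of the per-label values. The sketch below checks this arithmetic on the printed numbers; it is an observation about this output, not a statement of how the package defines the overall value.

```r
# Per-label values copied from the output printed above
label_sep = c(L1 = 0.5636364, L2 = 0.5165563, L3 = 0.2320000)
overall_printed = 0.4373976

# Unweighted mean across the three labels
mean(label_sep)

# Agrees with the printed overall value up to rounding of the printed digits
all.equal(mean(label_sep), overall_printed, tolerance = 1e-6)
```

Under this reading, a small group such as L3 (29 samples) pulls the overall value down as much as a large group would.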
The next example uses data in which samples of the same type are close together, resulting in a high separability value.
mat = rbind(cbind(rnorm(93,mean=20), rnorm(93, mean=20)), cbind(rnorm(78,mean=5),rnorm(78,mean=5)), cbind(rnorm(29, mean=50), rnorm(29, mean=50)))
labels = c(rep("L1", 93), rep("L2",78), rep("L3",29))
# Color vector of samples; each unique color corresponds to a unique label
cols = c(rep("blue",93), rep("green",78), rep("red",29))
plotTree(mat,labels, node.size=12, node.col = cols, main = "High separability", legend.cord=c(-1.9,0.9))
res = separability(mat, labels)
# Overall separability is 1, implying samples of different labels are perfectly separated.
res
#> $label_separability
#> L1 L2 L3
#>  1  1  1
#>
#> $overall_separability
#> [1] 1
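The color vectors in the two examples above are written by hand with rep() so that they line up with the labels. As a small sketch of an alternative, the colors can be derived from the labels through a named lookup vector, which keeps the two in sync if the label order changes. The name `palette_by_label` is an assumption introduced here.

```r
labels = c(rep("L1", 93), rep("L2", 78), rep("L3", 29))

# Hypothetical lookup table: one color per unique label
palette_by_label = c(L1 = "blue", L2 = "green", L3 = "red")

# Index the lookup by label; unname() drops the label names from the result
cols = unname(palette_by_label[labels])

# Same vector as the hand-written rep() version used above
identical(cols, c(rep("blue", 93), rep("green", 78), rep("red", 29)))
```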
We now illustrate the use of separability to compute the tissue specificity of the calcium signaling and ribosome pathways on mouse development RNA-seq data with samples from different tissue types.
# Load calcium signaling pathway data from mouse.
# This is mouse development RNA-seq data restricted to the gene set of the calcium signaling pathway.
# Rows are genes and columns are samples.
data_path = system.file('extdata', package = 'TreeDimensionTest')
load(file=paste0(data_path,"/calcium_pathway_data.rdata"))
# Load the color vector of samples by label type: mouse_cols
load(file=paste0(data_path,"/mouse_cols.rdata"))
# Labels of the samples are the column names of the data, which are names of tissue types.
labels = colnames(calcium_pathway_data)
plotTree(t(calcium_pathway_data), labels, node.col=mouse_cols,node.size=12, main = "Calcium Signaling pathway", legend.cord=c(-1.9,-1.3))
res = separability(t(calcium_pathway_data), labels)
# Separability for each tissue type as well as the overall separability. High separability indicates high tissue specificity.
res
#> $label_separability
#>      Brain Cerebellum      Heart     Kidney      Liver      Ovary     Testis
#>  0.7971014  1.0000000  1.0000000  1.0000000  0.9047619  0.9032258  0.8709677
#>
#> $overall_separability
#> [1] 0.925151
The calcium signaling pathway has high tissue specificity, as indicated by the high separability value. Samples from the same tissue lie close together in the plot.
# Load ribosome pathway data from mouse.
data_path = system.file('extdata', package = 'TreeDimensionTest')
load(file=paste0(data_path,"/ribosome_pathway_data.rdata"))
# Load the color vector of samples by label type: mouse_cols
load(file=paste0(data_path,"/mouse_cols.rdata"))
# ribosome_pathway_data is RNA-seq data with rows as genes and columns as samples
labels = colnames(ribosome_pathway_data)
plotTree(t(ribosome_pathway_data), labels, node.col= mouse_cols, node.size=12, main = "Ribosome pathway", legend.cord=c(-1.9,-1.3))
res = separability(t(ribosome_pathway_data), labels)
res
#> $label_separability
#>      Brain Cerebellum      Heart     Kidney      Liver      Ovary     Testis
#>  0.5140187  0.7818182  0.6179775  0.6219512  0.5428571  0.4057971  0.3913043
#>
#> $overall_separability
#> [1] 0.5536749
The ribosome pathway has relatively lower tissue specificity, as indicated by the lower separability value. Samples from the same tissue are mixed with samples from other tissues in the plot.
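The per-tissue values for the two pathways can also be compared side by side. The sketch below copies the numbers from the two printed outputs into named vectors and computes the per-tissue difference; the variable names are assumptions introduced here, and nothing is recomputed from the data.

```r
tissues = c("Brain", "Cerebellum", "Heart", "Kidney", "Liver", "Ovary", "Testis")

# Per-tissue separability copied from the two outputs printed above
calcium  = setNames(c(0.7971014, 1.0000000, 1.0000000, 1.0000000,
                      0.9047619, 0.9032258, 0.8709677), tissues)
ribosome = setNames(c(0.5140187, 0.7818182, 0.6179775, 0.6219512,
                      0.5428571, 0.4057971, 0.3913043), tissues)

# Drop in separability from the calcium signaling pathway to the ribosome pathway
sep_drop = calcium - ribosome
sort(sep_drop, decreasing = TRUE)

# Tissues with the largest and smallest drop
names(which.max(sep_drop))   # "Ovary"
names(which.min(sep_drop))   # "Cerebellum"
```

Every tissue has lower separability under the ribosome gene set, which is consistent with the lower overall value reported for that pathway.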