Topological data analysis is a relatively new area of data science which can find unique, non-linear global structure in whole datasets. The main tool of topological data analysis is called persistent homology, which produces a “shape descriptor” (called a persistence diagram) of its input dataset. Two main R packages exist for computing persistence diagrams; the more flexible of the two, called TDA, stores persistence diagrams in a non-standard data structure, and the faster of the two, called TDAstats, does not provide functionality for analyzing diagrams with published techniques. Methods exist for computing distances and kernels (a special type of similarity function) between pairs of persistence diagrams, but only distance calculations are currently available in the R package TDA. Several papers have used distance and kernel computations in order to perform machine learning or inference tasks with groups of persistence diagrams, but to date no publicly available software in either python or R provides the functionality for these types of analyses. In order to make the power of topological data analysis more readily available to data and machine learning practitioners, it would be very helpful to have a software package that can (1) convert persistence diagrams into a standard data structure, (2) quickly compute distances and kernel values between pairs of persistence diagrams, and (3) carry out machine learning and statistical inference with persistence diagrams.
It is with these goals that the package TDApplied was created.
Topological data analysis has gained popularity over the past two decades since the original paper on persistent homology was published (Edelsbrunner, Letscher, and Zomorodian 2000), and two main R packages have been created for persistent homology calculations: TDA and TDAstats.
The paper (Somasundaram et al. 2021) compared the runtimes of the homology calculations in the R packages TDA and TDAstats, and concluded that certain persistent homology calculations are faster with TDAstats. The choice of which package to use may therefore vary by application, depending on whether speed is desired or whether knowledge of which data points represent which topological feature is needed. For this reason, TDApplied can accept as input persistence diagrams computed with either package, in order to cover a wide variety of potential use-cases.
It is worthwhile to note that there are two other R packages for carrying out topological data analysis - TDAvec (Islambekov and Luchinsky 2022) (methods for vectorizing persistence diagrams) and TDAkit (You and Yu 2021) (clustering and dimension reduction methods for functional summaries of persistence diagrams, called persistence landscapes/silhouettes), as well as several python packages dedicated to topological data analysis calculations, including scikit-TDA (Saul and Tralie 2019) (which computes persistence diagrams, and distances/kernels between them, using several python libraries), and giotto-tda (Tauzin et al. 2020) (which computes persistence diagrams and uses persistence diagrams to analyze time series data).
There are a number of shortcomings of available topological data analysis software. Firstly, in both python and R there is currently no package which allows for machine learning and inference of persistence diagrams, a limitation which greatly constrains the types of analyses that can be carried out. In R this is partially because there is no package for kernel calculations of persistence diagrams, but the very slow computation of distances between persistence diagrams in the TDA package also inhibits the practicality of distance-based inference procedures. Additionally, in R, the output of persistent homology calculations from the package TDA is a list with an element called “diagram” of class “diagram,” which is not compatible with data frame methods that form the basis for data analysis in R. On the other hand, TDAstats does not compute a published distance metric for persistence diagrams, making inferences drawn from its permutation_test
function unclear. Overall, the non-standard data type returned by persistent homology calculations in TDA, the slow distance calculations in TDA and the non-published distance metric in TDAstats may be limiting the development of TDA applications in R.
The package TDApplied aims to solve the three goals outlined in the introductory paragraph. Firstly, the function diagram_to_df
allows the conversion of the output of TDA persistent homology calculations to a data frame. Secondly, the functions diagram_distance
and diagram_kernel
allow for fast distance and kernel calculations respectively, and their counterparts distance_matrix
and gram_matrix
compute in parallel, for scalability, the (cross) distance and (cross) Gram matrices respectively. Thirdly, these distance and kernel calculations are used to perform machine learning and inference on persistence diagrams. Methods include multidimensional scaling with diagram_mds, a permutation test for group differences with permutation_test, a kernel-based independence test with independence_test, kernel k-means clustering with diagram_kkmeans and predict_diagram_kkmeans, kernel principal components analysis with diagram_kpca and predict_diagram_kpca, and kernel support vector machines with diagram_ksvm and predict_diagram_ksvm.
The kernel machine learning methods implemented in TDApplied are wrappers of the flexible R package for kernel calculations kernlab (Karatzoglou et al. 2004), with some additional processing steps specific to persistence diagrams. In the subsequent sections we will describe these applications in more detail.
The main tool of topological data analysis is called persistent homology (see (Edelsbrunner, Letscher, and Zomorodian 2000) for the introductory paper, and (Zomorodian and Carlsson 2005) for further computational details). Persistent homology has been applied in a variety of areas, including (but not limited to) economics (largely for the analysis of time series, for example see (Yen and Cheong 2021)) and neuroscience (see (al 2021) for a number of functional MRI applications).
Persistent homology starts with data points and a distance function. It assumes that these points were sampled from some kind of shape. This shape has certain features that exist at various scales, but sampling induces noise in these features. Persistent homology aims to describe certain mathematical features of this underlying shape, by forming approximations to the shape at various distance scales. The mathematical features which are tracked are clusters (connected components), loops (ellipses), voids (spheres), etc., and the “significance” of each feature is calculated (i.e. whether each feature is “real” or not). The homological dimensions of these features are 0, 1 and 2 respectively (higher-dimensional features can also be calculated). What is particularly interesting about these mathematical features is that they can tell us where our data is not, which is important information that other data analysis methods cannot provide.
The persistent homology algorithm proceeds in the following manner: first, if the input is a dataset and distance metric, then the distance matrix, storing the distance metric value of each pair of points in the dataset, is computed. Next, a parameter \(\epsilon \geq 0\) is grown starting at 0, and at each \(\epsilon\) value we compute a shape approximation of the dataset \(C_{\epsilon}\), called a simplicial complex (see (Edelsbrunner, Letscher, and Zomorodian 2000) or (Zomorodian and Carlsson 2005) for more details). We construct \(C_{\epsilon}\) by connecting all pairs of points whose distance is at most \(\epsilon\). To encode higher-dimensional structure in these approximations, we also add a triangle between any triple of points which are all connected, a tetrahedron between any quadruple of points which are all connected, etc. Note that this process of forming a sequence of skeletal approximations is called a filtration, and other methods exist for forming the approximations (the one described here, called the Vietoris-Rips complex, is the most commonly used).
At any given \(\epsilon\) value, some topological features will exist in \(C_{\epsilon}\). As \(\epsilon\) grows, the \(C_{\epsilon}\)’s will contain each other, i.e. if \(\epsilon_{1} < \epsilon_{2}\) then every edge (triangle, tetrahedron, etc.) in \(C_{\epsilon_1}\) will also be present in \(C_{\epsilon_2}\). Therefore, each topological feature will be “born” at some \(\epsilon_{birth}\) value, and “die” at some \(\epsilon_{death}\) value. Consider the example of a loop – a loop will be “born” when the last connection around the circumference of the loop is added (at the \(\epsilon\) value which is the largest distance between consecutive points around the loop), and the loop will “die” when the last connection across the loop’s diameter is added, thereby filling in its hole.
Therefore, the output of persistent homology, a persistence diagram, in each dimension has one 2D point for each topological feature found in the filtration process, where the \(x\)-value of the point is the birth \(\epsilon\) value and the \(y\)-value is the death \(\epsilon\) value. This is why every point lies above the diagonal – features die after they are born! The difference of a point’s \(y\) and \(x\) values, \(y-x\), is called the “persistence” of the corresponding topological feature. Points which have high (large) persistence likely represent real topological features of the dataset, whereas points with low persistence likely represent topological noise.
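As a small illustration, here is a minimal sketch (it assumes the TDA package is installed, and that diagram_to_df, described earlier, accepts the output of ripsDiag directly):

# sample 50 points uniformly from the unit circle
circ <- TDA::circleUnif(n = 50,r = 1)

# compute the persistence diagram of the sample up to dimension 1
diag_circ <- TDA::ripsDiag(X = circ,maxdimension = 1,maxscale = 2)

# convert to a data frame; the dimension 1 rows should contain one point whose
# persistence (death - birth) is large, corresponding to the sampled loop
diagram_to_df(diag_circ)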
A persistence diagram containing \(n\) topological features can be represented as a vector of length \(2n\). However, persistence diagrams can contain different numbers of features, and the ordering of the features is arbitrary. Therefore, there is no obvious vector representation of all persistence diagrams that can be used as the input of machine learning or statistical inference. Nevertheless, we can apply a number of these techniques to persistence diagrams provided we can quantify how near (similar) or far (distant) they are from each other, and describing suitable distance and similarity measures, with their accompanying analysis methods, will be the content of the following section.
In this section we will describe the various computational tools implemented in the package TDApplied to analyze persistence diagrams, both explaining the mathematics and providing functional examples. To run our examples we must start by loading the TDApplied package:
library("TDApplied")
Since TDApplied uses the output of TDA/TDAstats calculations as inputs to its functions, at least one of TDA or TDAstats should be installed (if not attached) when using TDApplied. All examples will analyze simple diagrams which are random deviations of three persistence diagrams called D1, D2 and D3 (each with points in dimension 0 only).
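For concreteness, these three diagrams are just small data frames (the same definitions appear inside the helper function shown further below):

D1 = data.frame(dimension = c(0),birth = c(2),death = c(3))
D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5))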
When desired, random Gaussian noise with a small variance will be added to the birth and death values of the points in these three diagrams (being careful to make sure the points always have appropriate birth and death values), which will be achieved with the following function (only used in this vignette), generate_TDApplied_vignette_data:
generate_TDApplied_vignette_data <- function(num_D1,num_D2,num_D3){
  
  # num_D1 is the number of desired copies of D1, and likewise
  # for num_D2 and num_D3
  
  # create data
  D1 = data.frame(dimension = c(0),birth = c(2),death = c(3))
  D2 = data.frame(dimension = c(0),birth = c(2,0),death = c(3.3,0.5))
  D3 = data.frame(dimension = c(0),birth = c(0),death = c(0.5))
  
  # make noisy copies
  noisy_copies <- lapply(X = 1:(num_D1 + num_D2 + num_D3),FUN = function(X){
    
    # i stores the number of the data frame to make copies of:
    # i = 1 is for D1, i = 2 is for D2 and i = 3 is for D3
    i <- 1
    if(X > num_D1 & X <= num_D1 + num_D2)
    {
      i <- 2
    }
    if(X > num_D1 + num_D2)
    {
      i <- 3
    }
    # store correct data in noisy_copy
    noisy_copy <- get(paste0("D",i))
    
    # add Gaussian noise to birth and death values
    n <- nrow(noisy_copy)
    noisy_copy$dimension <- as.numeric(as.character(noisy_copy$dimension))
    noisy_copy$birth <- noisy_copy$birth + stats::rnorm(n = n,mean = 0,sd = 0.05)
    noisy_copy$death <- noisy_copy$death + stats::rnorm(n = n,mean = 0,sd = 0.05)
    
    # make any birth values which are less than 0 equal 0
    noisy_copy[which(noisy_copy$birth < 0),2] <- 0
    
    # make any birth values which are greater than their death values equal their death values
    noisy_copy[which(noisy_copy$birth > noisy_copy$death),2] <-
      noisy_copy[which(noisy_copy$birth > noisy_copy$death),3]
    
    return(noisy_copy)
  })
  
  # return list containing num_D1 noisy copies of D1, then
  # num_D2 noisy copies of D2, and finally num_D3 noisy copies
  # of D3
  return(noisy_copies)
}
Here is an example of making noisy copies of D1, using the helper defined above (output not shown, since each copy is random):
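# make three noisy copies of D1 (and none of D2 or D3)
noisy_copies_D1 <- generate_TDApplied_vignette_data(3,0,0)

# inspect the first copy - a one-row data frame like D1 with jittered birth and death values
noisy_copies_D1[[1]]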
As the persistence diagram is a descriptor of the underlying shape structure of a dataset, it can be useful to quantify the differences between pairs of persistence diagrams. There are several ways to compute distances between persistence diagrams in the same homological dimension (like dimension 0 for clusters, dimension 1 for loops, etc.). The two most common are called the 2-wasserstein and bottleneck distances (Kerber, Morozov, and Nigmetov 2017). These techniques find an optimal matching of the 2D points in their two input diagrams, and compute a cost of that optimal matching. A point from one diagram is allowed either to be paired (matched) with a point in the other diagram or with its diagonal projection, i.e. the nearest point on the diagonal line \(y=x\) (matching a point to its diagonal projection essentially says that the feature is likely topological noise because it died very soon after it was born).
Allowing points to be paired with their diagonal projections both allows for matchings of persistence diagrams with different numbers of points (which is almost always the case in practice) and also formalizes the idea that some points in a persistence diagram represent noise. The “cost” value associated with a matching is given by either (i) the maximum of the infinity-norm distances between paired points, or (ii) the square root of the sum of squared infinity-norm distances between paired points. The cost of the optimal matching under cost (i) is called the bottleneck distance of persistence diagrams, and the cost of the optimal matching under cost (ii) is called the 2-wasserstein metric of persistence diagrams. Both distance metrics have been used in a number of applications, but the 2-wasserstein metric is able to find more fine-scale differences in persistence diagrams compared to the bottleneck distance. The problem of finding an optimal matching can be solved with the Hungarian algorithm, which is implemented in the R package clue (Hornik 2005).
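In symbols, a sketch consistent with the description above: if \(M\) ranges over all such matchings (where each point may be matched either to a point of the other diagram or to its own diagonal projection), then

\[ d_{Bottleneck}(D_1,D_2) = \inf_{M} \max_{(x,y) \in M} \|x - y\|_{\infty}, \qquad d_{Wasserstein}(D_1,D_2) = \inf_{M} \left( \sum_{(x,y) \in M} \|x - y\|_{\infty}^{2} \right)^{1/2}. \]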
In the picture we can see that there is a “better” matching between D1 and D2 compared to D1 and D3, so the (wasserstein/bottleneck) distance value between D1 and D2 would be smaller than that of D1 and D3.
The wasserstein and bottleneck distances have been implemented in the TDApplied function diagram_distance. We can confirm that the distance between D1 and D2 is smaller than the distance between D1 and D3 for both metrics:
# calculate 2-wasserstein distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.3905125
# calculate 2-wasserstein distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = 2,distance = "wasserstein")
#> [1] 0.559017
# calculate bottleneck distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.3
# calculate bottleneck distance between D1 and D3
diagram_distance(D1,D3,dim = 0,p = Inf,distance = "wasserstein")
#> [1] 0.5
There is a generalization of the 2-wasserstein distance for any \(p \geq 1\), called the \(p\)-wasserstein distance, which can also be computed using the diagram_distance function by varying the parameter p.
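For example, a 3-wasserstein calculation between D1 and D2 looks like the following (a minimal sketch; output omitted):

# calculate 3-wasserstein distance between D1 and D2
diagram_distance(D1,D2,dim = 0,p = 3,distance = "wasserstein")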
Another distance metric between persistence diagrams, which will be useful for kernel calculations, is called the Fisher information metric, \(d_{FIM}(D_1,D_2,\sigma)\) (details can be found in (Le and Yamada 2018)). The idea is to represent the two persistence diagrams as probability density functions, with a 2D-Gaussian point mass centered at each point in both diagrams (including the diagonal projections of the points in the opposite diagram), all of variance \(\sigma^2 > 0\), and calculate how much those distributions agree on their pdf value at each point in the plane (called their Fisher information metric).
Points in the rightmost plot which are close to white in color have the most similar pdf values in the two distributions, and would not contribute much to the distance value; having more points with a red color, however, would contribute to a larger distance value.
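A brief sketch of the construction, following (Le and Yamada 2018) (see that paper for the precise definition): writing \(\Delta_j\) for the set of diagonal projections of the points of the other diagram, each diagram is smoothed into a normalized density

\[ \rho_i(x) \propto \sum_{u \in D_i \cup \Delta_j} \exp\left(-\frac{\|x - u\|^2}{2\sigma^2}\right), \]

and the two densities are then compared via

\[ d_{FIM}(D_1,D_2,\sigma) = \arccos\left( \int \sqrt{\rho_1(x)\,\rho_2(x)}\,dx \right). \]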
The diagram_distance
function can also calculate the Fisher information metric between persistence diagrams:
# Fisher information metric calculation between D1 and D2 for sigma = 1
diagram_distance(D1,D2,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.02354779
# Fisher information metric calculation between D1 and D3 for sigma = 1
diagram_distance(D1,D3,dim = 0,distance = "fisher",sigma = 1)
#> [1] 0.08821907
Again, D1 and D2 are less different than D1 and D3 using the Fisher information metric.
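When many pairwise distances are needed, as in the dimension reduction and inference methods below, the distance_matrix function mentioned in the introduction computes them in parallel. A minimal sketch, assuming distance_matrix takes a list of diagrams plus the same dim/distance/sigma parameters as diagram_distance and a num_workers argument (as used throughout this vignette):

# pairwise Fisher information metric values for a small list of diagrams
g_small <- generate_TDApplied_vignette_data(2,2,2)
D <- distance_matrix(diagrams = g_small,dim = 0,distance = "fisher",sigma = 1,num_workers = 2)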
Dimension reduction is a task in machine learning which is commonly used for data visualization, removing noise in data, and decreasing the number of covariates in a model (which can be helpful in reducing overfitting). One common dimension reduction technique in machine learning is called multidimensional scaling (MDS) (Cox and Cox 2008). MDS takes as input an \(n\) by \(n\) distance (or dissimilarity) matrix \(D\), computed from \(n\) points in a dataset, and outputs an embedding of those points into a Euclidean space of chosen dimension \(k\) which best preserves the inter-point distances. MDS is often used for visualizing data in exploratory analyses, and can be particularly useful when the input data points do not live in a common Euclidean space (as is the case for persistence diagrams). Using the R function cmdscale
from the package stats (R Core Team (2021)), we can compute the optimal embedding of a set of persistence diagrams under any of the three distance metrics with the function diagram_mds. Here is an example of the diagram_mds function projecting nine persistence diagrams, three noisy copies sampled from each of D1, D2 and D3, into 2D space:
# create 9 diagrams based on D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D MDS embedding in dimension 0 with the bottleneck distance
mds <- diagram_mds(diagrams = g,dim = 0,p = Inf,k = 2,num_workers = 2)
# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(mds[,1],mds[,2],xlab = "Embedding coordinate 1",ylab = "Embedding coordinate 2",
main = "MDS plot",col = as.factor(rep(c("D1","D2","D3"),each = 3)),bty = "L")
legend("topright", inset=c(-0.2,0),
legend=levels(as.factor(c("D1","D2","D3"))),
pch=16, col=unique(as.factor(c("D1","D2","D3"))))
The MDS plot shows the clear separation between the three generating diagrams (D1, D2 and D3), and the embedded coordinates could be used for further downstream analyses.
One of the most important inference procedures in classical statistics is the analysis of variance (ANOVA), which can find differences in the means of groups of normally-distributed measurements (Casella, Berger, and Company 2002). Distributions of persistence diagrams and their means can be complicated (see (Turner 2013) and (Turner et al. 2014)). Therefore, a non-parametric permutation test has been proposed which can find differences between groups of persistence diagrams. Such a test was first proposed in (Robinson and Turner 2017), and some variations have been suggested in later publications. In (Robinson and Turner 2017), two groups of persistence diagrams are compared. The null hypothesis, \(H_0\), is that the diagrams from the two groups are generated from shapes with the same type and scale of topological features, i.e. they “come” from the same “shape.” The alternative hypothesis, \(H_A\), is that the underlying type or scale of the features differs between the two groups. In each dimension a p-value is computed, quantifying the evidence against \(H_0\) in that dimension. A measure of within-group distances (a “loss function”) is calculated for the two groups, and that measure is compared to a null distribution obtained by permuting the group labels.
This inference procedure is implemented in the permutation_test
function, with several speedups and additional functionalities. Firstly, the loss function is computed in parallel for scalability, since distance computations can be expensive. Secondly, we store distance calculations as we compute them because these calculations are often repeated. Additional functionality includes allowing for any number of groups (not just two) and allowing for a pairing between groups of the same size, as described in (Abdallah (2021)). When a natural pairing exists between the groups (for example, if the groups contain persistence diagrams from the same subjects of a study under different conditions) we can simulate a more realistic null distribution by restricting the way in which we permute group labels, achieving higher statistical power.
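As a sketch of the paired variant, assuming the pairing option is exposed through a logical argument (written below as paired = TRUE; check the function documentation for the exact interface):

# paired permutation test between two groups of equal size, dimension 0 only
g1 <- generate_TDApplied_vignette_data(5,0,0)
g2 <- generate_TDApplied_vignette_data(0,5,0)
paired_test <- permutation_test(g1,g2,dims = c(0),paired = TRUE,num_workers = 2)
paired_test$p_values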
In order to demonstrate the utility of the permutation test we will detect differences between noisy copies of D1, D2, D3:
# permutation test between three diagrams
g1 <- generate_TDApplied_vignette_data(3,0,0)
g2 <- generate_TDApplied_vignette_data(0,3,0)
g3 <- generate_TDApplied_vignette_data(0,0,3)
perm_test <- permutation_test(g1,g2,g3,
                              num_workers = 2,
                              dims = c(0))
perm_test$p_values
#>          0
#> 0.04761905
As expected, a difference was found (at the \(\alpha = 0.05\) significance level) between the three groups.
The package TDAstats also has a function called permutation_test
which is based on the same test procedure; however, it uses an unpublished distance metric between persistence diagrams and does not use parallelization for scalability. As such, care must be taken if both TDApplied and TDAstats are attached in an R script to use the desired permutation_test
function.
A kernel function is a special (positive semi-definite) symmetric similarity measure between objects in some complicated space which can be used to project data into a space suitable for machine learning (Murphy 2012). Some examples of machine learning techniques which can be “kernelized” when dealing with complicated data are k-means (kernel k-means), principal components analysis (kernel PCA), and support vector machines (SVM) which are inherently based on kernel calculations.
There have been, to date, four main kernels proposed for persistence diagrams. In TDApplied the persistence Fisher kernel (Le and Yamada 2018) has been implemented because of its practical advantages over the other kernels – smaller cross-validation SVM error on a number of test data sets and a faster method for cross validation. For information on the other three kernels see (Kusano, Fukumizu, and Hiraoka 2018), (Carrière, Cuturi, and Oudot 2017), and (Reininghaus et al. 2014).
The persistence Fisher kernel is computed directly from the Fisher information metric between two persistence diagrams: let \(\sigma > 0\) be the parameter for \(d_{FIM}\), and let \(t > 0\). Then the persistence Fisher kernel is defined as \(k_{PF}(D_1,D_2) = \mbox{exp}(-t*d_{FIM}(D_1,D_2,\sigma))\). Computing the persistence Fisher kernel can be achieved with the diagram_kernel
function in TDApplied:
# calculate the kernel value between D1 and D2 with sigma = 2, t = 2
diagram_kernel(D1,D2,dim = 0,sigma = 2,t = 2)
#> [1] 0.9872455
# calculate the kernel value between D1 and D3 with sigma = 2, t = 2
diagram_kernel(D1,D3,dim = 0,sigma = 2,t = 2)
#> [1] 0.9707209
As before, D1 and D2 are more similar than D1 and D3.
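For methods that need all pairwise kernel values, the gram_matrix function mentioned in the introduction assembles them into a Gram matrix in parallel. A minimal sketch, assuming it accepts the same dim/sigma/t parameters as diagram_kernel plus a num_workers argument:

# Gram matrix of persistence Fisher kernel values for a small list of diagrams
g_small <- generate_TDApplied_vignette_data(2,2,2)
K <- gram_matrix(diagrams = g_small,dim = 0,sigma = 2,t = 2,num_workers = 2)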
Kernel k-means (Dhillon, Guan, and Kulis 2004) is a method which can find hidden groups in complex data, extending regular k-means clustering (Murphy 2012) via a kernel. A “kernel distance” is calculated between a persistence diagram and a cluster center using only the kernel function, and the algorithm converges like regular k-means. This algorithm is implemented in the function diagram_kkmeans
as a wrapper of the kernlab function kkmeans
. Moreover, a prediction function predict_diagram_kkmeans
can be used to find the nearest cluster labels for a new set of diagrams. Here is an example clustering three groups of noisy copies from D1, D2 and D3:
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate kmeans clusters with centers = 3, and sigma = t = 2
clust <- diagram_kkmeans(diagrams = g,centers = 3,dim = 0,t = 2,sigma = 2,num_workers = 2)

# display cluster labels
clust$clustering@.Data
#> [1] 1 1 1 3 3 3 2 2 2
As we can see, the diagram_kkmeans
function was able to correctly separate the three generating diagrams D1, D2 and D3 (the cluster labels are arbitrary and therefore may not be 1,1,1,2,2,2,3,3,3; however, they induce the correct partition).
If we wish to predict the cluster label for new persistence diagrams (computed via the largest kernel value to any cluster center), we can use the predict_diagram_kkmeans
function as follows:
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)
# predict cluster labels
predict_diagram_kkmeans(new_diagrams = g_new,clustering = clust,num_workers = 2)
#> [1] 1 1 1 3 3 3 2 2 2
This function correctly predicted the cluster of each new diagram (assigning each diagram to the cluster corresponding to D1, D2 or D3, depending on which diagram it was generated from).
PCA is another dimension reduction technique in machine learning, but can be preferable to MDS in certain situations because it allows for the projection of new data points onto an old embedding model (Murphy 2012). For example, this can be important if PCA is used as a pre-processing step in model fitting. Kernel PCA (kPCA) (Schölkopf, Smola, and Müller 1998) is an extension of regular PCA which uses a kernel to project complex data into a high-dimensional Euclidean space and then uses PCA to project that data into a low-dimensional space. The diagram_kpca
method computes the kPCA embedding of a set of persistence diagrams, and the predict_diagram_kpca
function can be used to project new diagrams using a pre-trained kPCA model. Here is an example using a group of noisy copies of D1, D2 and D3:
# create noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(3,3,3)

# calculate their 2D PCA embedding with sigma = t = 2
pca <- diagram_kpca(diagrams = g,dim = 0,t = 2,sigma = 2,features = 2,num_workers = 2)
# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(pca$pca@rotated[,1],pca$pca@rotated[,2],xlab = "Embedding coordinate 1",
ylab = "Embedding coordinate 2",main = "PCA plot",
col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0),
legend=levels(as.factor(c("D1","D2","D3"))), pch=16,
col=unique(as.factor(c("D1","D2","D3"))))
The function was able to recognize the three groups, and the embedding coordinates can be used for further downstream analysis. However, an important advantage of kPCA over MDS is that in kPCA we can project new points onto an old embedding using the predict_diagram_kpca
function:
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)

# project new diagrams onto old model
new_pca <- predict_diagram_kpca(new_diagrams = g_new,embedding = pca,num_workers = 2)
# plot
par(mar=c(5.1, 4.1, 4.1, 8.1), xpd=TRUE)
plot(new_pca[,1],new_pca[,2],xlab = "Embedding coordinate 1",
ylab = "Embedding coordinate 2",main = "PCA prediction plot",
col = as.factor(rep(c("D1","D2","D3"),each = 3)))
legend("topright",inset = c(-0.2,0),
legend=levels(as.factor(c("D1","D2","D3"))), pch=16,
col=unique(as.factor(c("D1","D2","D3"))))
As we can see, the original three groups, and their approximate locations in 2D space, are preserved during prediction.
SVMs (Murphy 2012) are one of the most popular machine learning techniques for regression and classification tasks. SVMs use a kernel function to project complex data into a high-dimensional space and then find a sparse set of training examples, called “support vectors,” which maximally linearly separate the outcome variable classes (or yield the highest explained variance in the case of regression).
SVMs have been implemented in the function diagram_ksvm
, tailored for input datasets which contain pairs of persistence diagrams and their outcome variable labels. A prediction method is supplied called predict_diagram_ksvm
which can be used to predict the label value of a set of new persistence diagrams given a pre-trained model. A parallelized implementation of cross-validation model-fitting is used based on the remarks in (Le and Yamada 2018) for scalability (which avoids needlessly recomputing persistence Fisher information metric values). Here is an example of fitting an SVM model on a list of persistence diagrams for a classification task (guessing whether the diagram comes from D1, D2 or D3):
# create thirty noisy copies of D1, D2 and D3
g <- generate_TDApplied_vignette_data(10,10,10)

# create response vector
y <- as.factor(rep(c("D1","D2","D3"),each = 10))

# fit model with cross validation
model_svm <- diagram_ksvm(diagrams = g,cv = 2,dim = c(0),
                          y = y,sigma = c(1,0.1),t = c(1,2),
                          num_workers = 2)
We can use the function predict_diagram_ksvm
to predict new diagrams like so:
# create nine new diagrams
g_new <- generate_TDApplied_vignette_data(3,3,3)
# predict
predict_diagram_ksvm(new_diagrams = g_new,model = model_svm,num_workers = 2)
#> [1] D1 D1 D1 D2 D2 D2 D3 D3 D3
#> Levels: D1 D2 D3
As we can see, the best SVM model was able to separate the three generating diagrams. We can gain more information about the best model found during model fitting and the cross-validation (CV) results by accessing different list elements of model_svm.
An important question when presented with two groups of paired persistence diagrams is determining whether the pairings are independent or not. A procedure was described in (Gretton et al. 2007) which can be used to answer this question using kernel computations, and which, importantly, uses a parametric null distribution. The null hypothesis for this test is that the groups are independent, and the alternative hypothesis is that the groups are not independent. A test statistic called the Hilbert-Schmidt independence criterion is calculated, and its value is compared to a gamma distribution whose parameters can be estimated from the data.
This inference procedure has been implemented in the independence_test
function, and returns the p-value of the test in each desired dimension of the diagrams (among other additional information). We would expect to find no dependence between noisy copies of D1, D2 and D3, since each copy is generated randomly:
# create 10 noisy copies of D1 and D2
g1 <- generate_TDApplied_vignette_data(10,0,0)
g2 <- generate_TDApplied_vignette_data(0,10,0)

# do independence test with sigma = t = 1
indep_test <- independence_test(g1,g2,dims = c(0),num_workers = 2)
indep_test$p_values
#>        0
#> 0.332969
The p-value of this test would not be significant at any typical significance threshold, reflecting the fact that there is no real (i.e. non-spurious) dependence between the two groups, as expected.
Benchmarking the diagram_distance and TDA wasserstein Functions

Computing distances (wasserstein/bottleneck) between persistence diagrams is a key feature of some of the main topological data analysis software packages. However, these calculations can be very expensive, rendering practical applications of topological data analysis nearly infeasible. TDApplied strives to provide useful methods for applied topological data analysis, and as such its distance function must be fast, at least compared to the distance calculations provided by other packages. We will compare the runtime of the TDApplied diagram_distance function with that of the R package TDA's wasserstein function to argue that TDApplied provides a much more tractable distance calculation. We will also compare the runtime of the TDApplied diagram_distance function against a python counterpart from scikit-TDA to evaluate the feasibility of a TDApplied counterpart module in python.
We simulated persistence diagrams computed from uniform sampling of two shapes provided in the TDA package – a torus (i.e. a hollow doughnut) with major radius 2 and minor radius 1, and a sphere (i.e. surface of a ball) of radius 1. We sampled \(n\) points from each shape where \(n \in \{100,200,300,\dots,1000\}\), with 10 iterations at each value of \(n\). At each iteration the distance between the calculated persistence diagrams of the sampled torus and of the sampled sphere with \(n\) points was computed in homological dimensions 0, 1 and 2 using the TDApplied function diagram_distance
with p = 2
and the TDA function wasserstein. The runtime in seconds for these operations was recorded on a Windows 10 64-bit machine, with an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz processor with 8 cores and 64GB of RAM. We plot the mean runtime for each package at each value of \(n\) with 95% confidence interval error bars. To avoid a long build for this vignette the results were precomputed and recorded, but the code used for benchmarking can be seen below (not run in this vignette):
# generate persistence diagrams from Tori and spheres with 100,200,...,1000 data points.
runtimes_shape <- data.frame(n_row = numeric(),package = character(),time_in_sec = numeric())
for(n_row in seq(100,1000,100)){
  
  for(iteration in 1:10)
  {
    # simulate pair of diagrams from the desired shapes
    diagram_torus = ripsDiag(X = TDA::torusUnif(n = n_row,a = 1,c = 2),
                             maxdimension = 2,maxscale = 2)
    diagram_sphere = ripsDiag(X = TDA::sphereUnif(n = n_row,d = 2,r = 1),
                              maxdimension = 2,maxscale = 1)
    
    # compute their wasserstein distances in all dimensions and benchmark
    start_time_TDApplied = Sys.time()
    diagram_distance(D1 = diagram_torus,D2 = diagram_sphere,dim = 0,
                     p = 2,distance = "wasserstein")
    diagram_distance(D1 = diagram_torus,D2 = diagram_sphere,dim = 1,
                     p = 2,distance = "wasserstein")
    diagram_distance(D1 = diagram_torus,D2 = diagram_sphere,dim = 2,
                     p = 2,distance = "wasserstein")
    end_time_TDApplied = Sys.time()
    time_diff_TDApplied = as.numeric(end_time_TDApplied - start_time_TDApplied,units = "secs")
    
    start_time_TDA = Sys.time()
    TDA::wasserstein(Diag1 = diagram_torus$diagram,Diag2 = diagram_sphere$diagram,
                     dimension = 0,p = 2)
    TDA::wasserstein(Diag1 = diagram_torus$diagram,Diag2 = diagram_sphere$diagram,
                     dimension = 1,p = 2)
    TDA::wasserstein(Diag1 = diagram_torus$diagram,Diag2 = diagram_sphere$diagram,
                     dimension = 2,p = 2)
    end_time_TDA = Sys.time()
    time_diff_TDA = as.numeric(end_time_TDA - start_time_TDA,units = "secs")
    
    runtimes_shape = rbind(runtimes_shape,data.frame(n_row = n_row,
                                                     package = "TDApplied",
                                                     time_in_sec = time_diff_TDApplied))
    runtimes_shape = rbind(runtimes_shape,data.frame(n_row = n_row,
                                                     package = "TDA",
                                                     time_in_sec = time_diff_TDA))
  }
  print(paste0("Done ",n_row," rows"))
}

# compute means and sd's at each value of rows for both packages
summary_table = data.frame(n_row = numeric(),mean = numeric(),sd = numeric(),
                           package = character())
for(n_row in seq(100,1000,100))
{
  for(p in c("TDApplied","TDA"))
  {
    result = data.frame(n_row = n_row,
                        mean = mean(runtimes_shape[which(runtimes_shape$n_row == n_row
                                                         & runtimes_shape$package == p),3]),
                        sd = sd(runtimes_shape[which(runtimes_shape$n_row == n_row
                                                     & runtimes_shape$package == p),3]),
                        package = p)
    summary_table = rbind(summary_table,result)
  }
}

# plot table
plot(summary_table$n_row[summary_table$package=="TDA"],
     summary_table$mean[summary_table$package=="TDA"],
     type="b",
     xlim=range(summary_table$n_row),
     ylim=range(0,summary_table$mean+1.96*summary_table$sd/sqrt(10)),
     xlab = "Points in shape",ylab = "Mean execution time (sec)")
lines(summary_table$n_row[summary_table$package=="TDApplied"],
      summary_table$mean[summary_table$package=="TDApplied"],
      col=2, type="b")
legend(x = 200,y = 2000,legend = c("TDApplied","TDA"),
       col = c("red","black"),lty = c(1,1),cex = 0.8)
arrows(summary_table$n_row[summary_table$package == "TDApplied"],
       summary_table$mean[summary_table$package == "TDApplied"]
       -1.96*summary_table$sd[summary_table$package == "TDApplied"]/sqrt(10),
       summary_table$n_row[summary_table$package == "TDApplied"],
       summary_table$mean[summary_table$package == "TDApplied"]
       +1.96*summary_table$sd[summary_table$package == "TDApplied"]/sqrt(10),
       length=0.05, angle=90, code=3,col = "red")
arrows(summary_table$n_row[summary_table$package == "TDA"],
       summary_table$mean[summary_table$package == "TDA"]
       -1.96*summary_table$sd[summary_table$package == "TDA"]/sqrt(10),
       summary_table$n_row[summary_table$package == "TDA"],
       summary_table$mean[summary_table$package == "TDA"]
       +1.96*summary_table$sd[summary_table$package == "TDA"]/sqrt(10),
       length=0.05, angle=90, code=3,col = "black")
A linear model can be used to verify that the runtime ratio of TDA to TDApplied grew with the number of data points sampled from each shape (reaching roughly a 100-fold speedup at 1000 data points), suggesting that distance calculations with TDApplied are faster and more scalable:
model <- stats::lm(data = data.frame(y = summary_table$mean[summary_table$package == "TDA"]
                                     /summary_table$mean[summary_table$package == "TDApplied"],
                                     x = seq(100,1000,100)),
                   formula = y ~ x)
summary(model)$coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -14.7529336 3.111817475 -4.740938 1.462190e-03
#> x 0.1095645 0.005015148 21.846711 2.032998e-08
predict(model,newdata = data.frame(x = 1000))[[1]]
#> [1] 94.81155
The fast and scalable distance calculation in TDApplied makes the applications of statistics and machine learning with persistence diagrams more feasible. This is why the TDA distance calculation was not used in the TDApplied package.
Benchmarking the diagram_distance Function against persim's wasserstein Function

While python packages for topological data analysis are out of scope for an R package, in order to fully situate TDApplied in the landscape of topological data analysis software we will benchmark the diagram_distance function against its counterpart from the scikit-TDA collection of libraries, namely the wasserstein function from the persim python module. Note that there is currently no publicly available implementation of the persistence Fisher kernel in python, and as such we will not benchmark kernel calculations. The code for this R vs. python benchmarking is very similar to the code from the previous section, with a few modifications. The general setup is the same, with tori and spheres generated with various numbers of rows and ten iterations at each number of rows. The benchmarking was also carried out on a Windows 10 64-bit machine, with an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz processor with 8 cores and 64GB of RAM. However, the package reticulate (Ushey, Allaire, and Tang 2022) is attached before the benchmarking and two python modules from scikit-TDA are loaded - persim and ripser. Persim will be used to calculate the wasserstein distances via its wasserstein
function, and ripser will be used to create persistence diagrams which can be used as input to the wasserstein
function. Before running this code miniconda was installed through the reticulate package, using the install_miniconda
function, and the modules persim and ripser were installed using the code py_install("persim")
and py_install("ripser")
respectively. The code for benchmarking is included below, but not run in this vignette:
# load reticulate package
library(reticulate)
<- reticulate::import("persim")
persim <- reticulate::import("ripser")
ripser
# generate persistence diagrams from Tori and spheres with 100,200,...,1000 data points.
<- data.frame(n_row = numeric(),package = character(),
runtimes_language time_in_sec = numeric())
for(n_row in seq(100,1000,100)){
for(iteration in 1:10)
{# simulate pair of diagrams from the desired shapes
= TDA::torusUnif(n = n_row,a = 1,c = 2)
torus = TDA::sphereUnif(n = n_row,d = 2,r = 1)
sphere = ripsDiag(X = torus,
diagram_torus maxdimension = 2,maxscale = 2)
= ripsDiag(X = sphere,
diagram_sphere maxdimension = 2,maxscale = 1)
= ripser$ripser(torus,maxdim = 2,thresh = 2)$dgms
diagram_torus_py 1]][which(diagram_torus_py[[1]][,2] == Inf),2] = 2
diagram_torus_py[[2]][which(diagram_torus_py[[2]][,2] == Inf),2] = 2
diagram_torus_py[[3]][which(diagram_torus_py[[3]][,2] == Inf),2] = 2
diagram_torus_py[[= ripser$ripser(sphere,maxdim = 2,thresh = 1)$dgms
diagram_sphere_py 1]][which(diagram_sphere_py[[1]][,2] == Inf),2] = 2
diagram_sphere_py[[2]][which(diagram_sphere_py[[2]][,2] == Inf),2] = 2
diagram_sphere_py[[3]][which(diagram_sphere_py[[3]][,2] == Inf),2] = 2
diagram_sphere_py[[
# compute their wasserstein distances in all dimensions and benchmark
= Sys.time()
start_time_TDApplied diagram_distance(D1 = diagram_torus,D2 = diagram_sphere,dim = 0,
p = 2,distance = "wasserstein")
diagram_distance(D1 = diagram_torus,D2 = diagram_sphere,dim = 1,
p = 2,distance = "wasserstein")
diagram_distance(D1 = diagram_torus,D2 = diagram_sphere,dim = 2,
p = 2,distance = "wasserstein")
= Sys.time()
end_time_TDApplied = as.numeric(end_time_TDApplied - start_time_TDApplied,units = "secs")
time_diff_TDApplied
= Sys.time()
start_time_persim $wasserstein(diagram_torus_py[[1]],diagram_sphere_py[[1]])
persim$wasserstein(diagram_torus_py[[2]],diagram_sphere_py[[2]])
persim$wasserstein(diagram_torus_py[[3]],diagram_sphere_py[[3]])
persim= Sys.time()
end_time_persim = as.numeric(end_time_persim - start_time_persim,units = "secs")
time_diff_persim
= rbind(runtimes_language,data.frame(n_row = n_row,
runtimes_language package = "TDApplied",
time_in_sec = time_diff_TDApplied))
= rbind(runtimes_language,data.frame(n_row = n_row,
runtimes_language package = "persim",
time_in_sec = time_diff_persim))
}print(paste0("Done ",n_row," rows"))
}
# compute means and sd's at each value of rows for both packages
= data.frame(n_row = numeric(),mean = numeric(),sd = numeric(),
summary_table package = character())
for(n_row in seq(100,1000,100))
{for(p in c("TDApplied","persim"))
{= data.frame(n_row = n_row,
result mean = mean(runtimes_language[which(runtimes_language$n_row == n_row
& runtimes_language$package == p),
3]),
sd = sd(runtimes_language[which(runtimes_language$n_row == n_row
& runtimes_language$package == p),
3]),
package = p)
= rbind(summary_table,result)
summary_table
}
}
# plot table
plot(summary_table$n_row[summary_table$package=="TDApplied"],
$mean[summary_table$package=="persim"], type="b",
summary_tablexlim=range(summary_table$n_row),
ylim=range(0,summary_table$mean+1.96*summary_table$sd/sqrt(10)),
xlab = "Points in shape",ylab = "Mean execution time (sec)")
lines(summary_table$n_row[summary_table$package=="TDApplied"],
$mean[summary_table$package=="TDApplied"],
summary_tablecol="red", type="b")
lines(summary_table$n_row[summary_table$package=="persim"],
$mean[summary_table$package=="persim"],
summary_tablecol="black", type="b")
legend(x = 200,y = 20,legend = c("TDApplied","persim"),
col = c("red","black"),lty = c(1,1),cex = 0.8)
arrows(summary_table$n_row[summary_table$package == "TDApplied"],
$mean[summary_table$package == "TDApplied"]
summary_table-1.96*summary_table$sd[summary_table$package == "TDApplied"]/sqrt(10),
$n_row[summary_table$package == "TDApplied"],
summary_table$mean[summary_table$package == "TDApplied"]
summary_table+1.96*summary_table$sd[summary_table$package == "TDApplied"]/sqrt(10),
length=0.05, angle=90, code=3,col = "red")
arrows(summary_table$n_row[summary_table$package == "persim"],
$mean[summary_table$package == "persim"]
summary_table-1.96*summary_table$sd[summary_table$package == "persim"]/sqrt(10),
$n_row[summary_table$package == "persim"],
summary_table$mean[summary_table$package == "persim"]
summary_table+1.96*summary_table$sd[summary_table$package == "persim"]/sqrt(10),
length=0.05, angle=90, code=3,col = "black")
The persim wasserstein function was significantly faster than TDApplied's diagram_distance function. However, a linear model of the runtime ratio of TDApplied to persim against the number of points in the shape suggests that the two functions scale similarly (the ratio is roughly constant, with a mean of about 15):
model <- stats::lm(data = data.frame(y = summary_table$mean[summary_table$package == "TDApplied"]
                                     /summary_table$mean[summary_table$package == "persim"],
                                     x = seq(100,1000,100)),
                   formula = y ~ x)
summary(model)$coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.497041e+01 1.720666280 8.700354 2.375283e-05
#> x 3.619817e-04 0.002773105 0.130533 8.993674e-01
Nevertheless, the raw speed increase in python could be the basis for a very fast python counterpart to the TDApplied package in the future.
There are two main limitations of TDApplied which should be discussed to guide its future improvements. The first limitation is in the function diagram_ksvm
- the only acceptable input is a dataset where the single training feature is a persistence diagram (one diagram for each training example). This may be too inflexible for some applications, where the training features may include several persistence diagrams, or a mix of persistence diagrams, numeric and categorical features. A future update to TDApplied should provide the functionality for allowing any number of training features, of various types, to the diagram_ksvm
function via a weighted sum of kernels. The second limitation of TDApplied is the long runtime of its diagram_distance
function compared to python’s persim wasserstein
function. Even accounting for any runtime overhead introduced by calling python through the reticulate package, the persim wasserstein
function was significantly faster than the TDApplied diagram_distance
function. As such, a future version of TDApplied could use the wasserstein (and bottleneck) distance calculations from the persim module as its distance engine; however, this would introduce a couple of complications. Firstly, a dependency on reticulate, and therefore on python, would be introduced, which is not ideal. Secondly, since the Fisher information metric is not implemented in the persim module, calculating this metric would still require its own R code. While TDApplied has been created with flexibility and scalability at its core, there will always be room for adding more functionality and speedups.
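Regarding the first limitation, the weighted-sum idea rests on the fact that a nonnegative weighted sum of positive semi-definite kernels is again a valid kernel. A rough sketch of the idea (not current TDApplied functionality; K_diagram and K_numeric are assumed to be precomputed Gram matrices over the same training examples, one from persistence diagrams and one from a standard kernel on numeric features):

# blend a persistence-diagram Gram matrix with a Gram matrix on numeric features
w <- 0.7
K_combined <- w * K_diagram + (1 - w) * K_numeric # still positive semi-definite
# K_combined could then be supplied to a kernel method, for example
# kernlab::ksvm(kernlab::as.kernelMatrix(K_combined),y,kernel = "matrix")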
The TDApplied package aims to bridge topological data analysis with researchers and data practitioners in the R community. Current topological data analysis packages in R and python do not provide the ability to carry out standard types of data analysis, namely statistics and machine learning, with persistence diagrams, which greatly limits research and industry interest in topological data analysis. TDApplied was built with performance in mind, with fast native-R implementations of distance calculations between persistence diagrams and parallelization at every possible point. Topological data analysis is an exciting and powerful new field of data analysis, and with TDApplied anyone can access its power for meaningful and creative analyses of data.