Introduction

Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimensional reduction proposed by McInnes and Healy. This vignette demonstrates how to use the umap R package to perform dimensional reduction and visualization with the UMAP method.

Usage

For a practical demonstration, let's use the Iris dataset, which is available in R as the object iris.

head(iris, 3)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa

The first four columns contain measurements and the last column contains a species label. It will be useful to separate those components.

iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
iris.labels = iris[, "Species"]
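The split above selects columns by name. As a minimal alternative sketch, the same split can be made by column type, which may be convenient for datasets whose feature names do not follow a pattern. The helper name is.feature is introduced here only for illustration.

```r
# Select feature columns by type instead of by name
is.feature = sapply(iris, is.numeric)

# Numeric columns hold the measurements; the remaining column is the label
iris.data = iris[, is.feature]
iris.labels = iris[, !is.feature]

colnames(iris.data)
```

For the Iris dataset this should give the same iris.data and iris.labels as the name-based approach, because the four measurement columns are the only numeric ones.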

Now, let's load the umap package and apply the UMAP transformation.

library(umap)
iris.umap = umap(iris.data)

The output is an object, iris.umap. Printing it gives a minimal summary of its contents.

iris.umap
## umap embedding of 150 items in 2 dimensions
## object components: layout, knn, config

The main component of the object is layout, which holds a matrix with coordinates.

head(iris.umap$layout, 3)
##           [,1]      [,2]
## [1,] 2.3576047 -5.290552
## [2,] 1.0353754 -7.342834
## [3,] 0.9800537 -6.799518
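Because the layout has one row per input item, in the same order as the input, it can be combined with the labels. The following is a minimal sketch of that idea. It assumes the umap package is installed and skips the computation otherwise. The column names UMAP1 and UMAP2 are chosen here for illustration and are not set by the package.

```r
embedding = NULL
if (requireNamespace("umap", quietly = TRUE)) {
  iris.data = iris[, grep("Sepal|Petal", colnames(iris))]
  iris.umap = umap::umap(iris.data)
  # One row per flower: two embedding coordinates plus the species label
  embedding = data.frame(UMAP1 = iris.umap$layout[, 1],
                         UMAP2 = iris.umap$layout[, 2],
                         Species = iris$Species)
  print(head(embedding, 3))
}
```

A data frame of this shape is a convenient starting point for plotting functions that expect coordinates and labels in a single table.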

We can now use these coordinates to visualize the dataset. (The custom plot function, plot.iris, is available at the end of this vignette.)

plot.iris(iris.umap, iris.labels)

[Figure: A UMAP visualization of the Iris dataset]

The layout conveys separation between the groups and dispersion within each group. Although this example is simple by construction, the umap package can produce similar visualizations for larger datasets with many thousands of data points.

Tuning UMAP

The example above uses function umap with a single argument, the input dataset, so the embedding is performed with default settings. However, the algorithm can be tuned in several ways. There are two strategies for tuning: via configuration objects and via additional arguments.

Configuration objects

The default configuration object is called umap.defaults. This is a list encoding default values for all the parameters used within the algorithm.

umap.defaults
## umap configuration parameters
##            n.neighbors: 15
##           n.components: 2
##        metric.function: euclidean
##               n.epochs: 200
##                  input: data
##                   init: spectral
##               min.dist: 0.1
##       set.op.mix.ratio: 1
##     local.connectivity: 1
##              bandwidth: 1
##                  alpha: 1
##                  gamma: 1
##   negative.sample.rate: 5
##                      a: NA
##                      b: NA
##                 spread: 1
##                   seed: NA
##            knn.repeats: 1
##                verbose: FALSE

The object is a list of key-value pairs, as printed above. For a brief description of each field, see the documentation in help(umap.defaults) or the original publication.

To create a custom configuration, create a copy of the defaults and then update any of the fields. For example, let's change the seed for random number generation.

custom.config = umap.defaults
custom.config$seed = 123
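When several fields have been changed, it can help to list only those that differ from the defaults. The sketch below does this for any pair of lists. The helper changed.fields is hypothetical (it is not part of the umap package), and the short lists are stand-ins so that the example runs without the package. With umap loaded, custom.config and umap.defaults could be passed instead.

```r
# Stand-in lists for illustration, holding a subset of the fields shown above
defaults = list(n.neighbors = 15, min.dist = 0.1, seed = NA)
custom = defaults
custom$seed = 123

# Return the names of fields whose values differ between two lists
changed.fields = function(config, reference) {
  fields = union(names(config), names(reference))
  differs = vapply(fields,
                   function(f) !identical(config[[f]], reference[[f]]),
                   logical(1))
  fields[differs]
}

changed.fields(custom, defaults)
```

Here only the seed field was modified, so it is the only name reported.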

We can observe the changed settings by inspecting the object again (try it). To perform the UMAP projection with these settings, we can run the projection again and pass the configuration object as a second argument.

iris.umap.2 = umap(iris.data, custom.config)
plot.iris(iris.umap.2, iris.labels,
          main="Another UMAP visualization of the Iris dataset (different seed)")

[Figure: Another UMAP visualization of the Iris dataset (different seed)]

The result is slightly different because the random number generator is initialized with a different seed.

Additional arguments

Another way to customize the algorithm is to specify the non-default parameters explicitly as arguments. To achieve results equivalent to the above, we can use:

iris.umap.3 = umap(iris.data, seed=123)

The coordinates in this new output object should match the ones from iris.umap.2 (check it!).
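One possible way to carry out this check is sketched below. It assumes the umap package is installed and leaves the result as NA otherwise. The names via.config, via.argument, and layouts.match are introduced here for illustration.

```r
layouts.match = NA
if (requireNamespace("umap", quietly = TRUE)) {
  iris.data = iris[, grep("Sepal|Petal", colnames(iris))]

  # Same seed supplied in two ways: configuration object and named argument
  custom.config = umap::umap.defaults
  custom.config$seed = 123
  via.config = umap::umap(iris.data, custom.config)
  via.argument = umap::umap(iris.data, seed = 123)

  # all.equal tolerates tiny floating-point differences
  layouts.match = isTRUE(all.equal(via.config$layout, via.argument$layout))
}
layouts.match
```

Using all.equal rather than identical keeps the comparison robust to negligible numerical differences.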

Implementation

The package provides two implementations of the UMAP method, one written in R and one accessed via an external Python module.

The implementation written in R is the default. This implementation follows the design principles of the UMAP algorithm and its running time scales better-than-quadratically with the number of items (points) in a dataset. It is thus in principle suitable for use on datasets with thousands of points. However, it is not optimized for speed. It is the default because it should be functional without extensive dependencies and because its output provides insight into the inner workings of the algorithm.

The second available implementation is a wrapper for a Python module with the same name. To enable this implementation, specify the argument method:

iris.umap.4 = umap(iris.data, method="python")

This command has several dependencies. To make it work, you must have the reticulate package installed and loaded (use install.packages("reticulate") and library(reticulate)). Furthermore, you must have the umap-learn Python package installed (see the package repo for instructions). If either of these components is not available, the above command will display an error message.
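To see in advance whether these dependencies are in place, a check along the following lines may be useful. This is a sketch, not part of the umap package: the function name python.umap.ready is hypothetical, and it relies on the assumption that umap-learn is imported in Python under the module name "umap".

```r
python.umap.ready = function() {
  # reticulate must be installed to reach Python from R
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(FALSE)
  }
  # Ask reticulate whether the Python module can be found
  available = tryCatch(reticulate::py_module_available("umap"),
                       error = function(e) FALSE)
  isTRUE(available)
}

python.umap.ready()
```

If the function returns FALSE, the default R implementation remains available without any of these dependencies.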

Note that the two implementations cannot produce exactly the same output, because R and Python use different random number generators.

Appendix

The custom plot function used to visualize the Iris dataset:

plot.iris
## function(x, labels,
##          main="A UMAP visualization of the Iris dataset",
##          pad=0.02, cex=0.65, pch=19,
##          cex.main=1, cex.legend=1) {
## 
##   layout = x$layout
##   par(mar=c(0.2,0.7,1.2,0.7), ps=10)
##   xylim = range(layout)
##   xylim = xylim + ((xylim[2]-xylim[1])*pad)*c(-0.5, 0.5)
##   plot(xylim, xylim, type="n", axes=F, frame=F)
##   xylim = par()$usr
##   rect(xylim[1], xylim[1], xylim[2], xylim[2], border="#aaaaaa", lwd=0.2)
##   points(layout[,1], layout[,2], col=iris.colors[as.integer(labels)],
##          cex=cex, pch=pch)
##   mtext(side=3, main, cex=cex.main)
## 
##   labels.u = unique(labels)
##   legend("topright", legend=as.character(labels.u),
##          col=iris.colors[as.integer(labels.u)],
##          bty="n", pch=pch, cex=cex.legend)
## }
## <bytecode: 0x6a548c0>
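The function printed above reads its colors from an object iris.colors, which is not defined in the text of this vignette. As a rough, self-contained sketch of a similar plot, the code below defines its own palette. The names draw.embedding and species.colors and the color values are hypothetical, and a PCA projection serves as a stand-in layout so that the example runs without the umap package.

```r
# Hypothetical palette, one color per species (not the vignette's iris.colors)
species.colors = c(setosa = "#1b9e77", versicolor = "#d95f02",
                   virginica = "#7570b3")

# Draw a two-column layout matrix, coloring points by label
draw.embedding = function(layout, labels, colors = species.colors,
                          main = "A UMAP visualization of the Iris dataset") {
  labels = as.factor(labels)
  plot(layout[, 1], layout[, 2], col = colors[as.character(labels)],
       pch = 19, cex = 0.65, axes = FALSE, xlab = "", ylab = "", main = main)
  box(col = "#aaaaaa")
  legend("topright", legend = levels(labels), col = colors[levels(labels)],
         bty = "n", pch = 19)
  invisible(NULL)
}

# Stand-in layout from PCA so that the example runs without the umap package
stand.in = prcomp(iris[, 1:4])$x[, 1:2]
draw.embedding(stand.in, iris$Species, main = "Stand-in layout (PCA)")
```

With a UMAP result at hand, the layout component of the output object (for example iris.umap$layout) can be passed as the first argument instead of the stand-in.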

Summary of R session:

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /software/opt/R/R-3.4.1/lib/libRblas.so
## LAPACK: /software/opt/anaconda/anaconda3-4.4.0/lib/libmkl_intel_lp64.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] umap_0.1.0.3
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.1   Matrix_1.2-12    magrittr_1.5     tools_3.4.1     
##  [5] reticulate_1.8   Rcpp_0.12.17     codetools_0.2-15 stringi_1.2.2   
##  [9] highr_0.6        grid_3.4.1       knitr_1.20       digest_0.6.15   
## [13] jsonlite_1.5     stringr_1.3.1    lattice_0.20-35  evaluate_0.10.1