Adaptable Radial Axes (ARA) plots (Rubio-Sánchez, Sanchez, and Lehmann 2017) are an interactive, exploratory data visualization technique related to Star Coordinates (Kandogan 2000, 2001) and Biplots (Gabriel 1971; Gower and Hand 1995; Hofmann 2004; Grange, Roux, and Gardner-Lubbe 2009; Greenacre 2010). ARA performs dimensionality reduction by representing high-dimensional numerical data as points in a plane or three-dimensional space. Unlike many other dimensionality reduction methods, ARA plots also visualize variable-specific information through “axis vectors” and their corresponding labeled axis lines, where each vector is associated with a particular data variable. These axis vectors (interactively defined by users, as in Star Coordinates) indicate the directions in which the values of their associated variables increase within the plot. Additionally, the high-dimensional observations are mapped (in an optimal sense) onto the plot in order to allow users to estimate variable values by visually projecting these points onto the labeled axes, as in Biplots.
ARA plot of four variables of a breakfast cereal dataset. High-dimensional data values can be estimated via projections onto the labeled axes, as in Biplots.
In the previous ARA plot, the “axis” vectors associated with the variables are chosen to distinguish between healthier cereals (toward the right) and less healthy ones (toward the left). For instance, points on the right will generally have larger values of Protein and Vitamins, and lower values of Sugar and Calories. Users can obtain approximations of data values by visually projecting the plotted points onto the labeled axes. For example, users would estimate a value of about 57 calories for the point that lies further to the right.
ARA plots are obtained by solving the following general convex optimization problem:
\[\begin{array}{cl} \displaystyle \begin{array}{c}\mathrm{minimize} \\ \mathbf{P} \in \mathbb{R}^{N\times m} \end{array} & \hspace{-0.2cm} \begin{array}{c} \| (\mathbf{P}\mathbf{V}^{\scriptstyle \top} - \mathbf{X})\mathbf{W} \|, \\ \ \end{array} \\[10pt] \text{subject to} & \mathcal{C}. \\ \end{array}\]Here, we use the following notation:
Similarly to Biplots, the product \(\mathbf{P}\mathbf{V}^{\scriptstyle \top}\) represents the estimates of data values in \(\mathbf{X}\) when projecting the mapped points in \(\mathbf{P}\) on to the axes defined by \(\mathbf{V}\). Thus, the goal consists of finding optimal embedded points to minimize the discrepancy between the original data values and the estimates. Lastly, matrix \(\mathbf{W}\) can be used to control relative accuracy of the estimates on each variable.
It is also worth noting that the variables should have a similar scaling. Otherwise, the solutions would focus on providing more accurate estimates for variables with larger values. In addition, centering the data leads to more accurate estimates in general. Thus, in practice, we recommend standardizing the data, which would center the data and force each variable to have unit variance.
In the current version of the aramappings package there are 9 types of ARA plots, which stem from the combination of three types of matrix norms, and three choices regarding constraints.
The current version of the package considers three norms:
The choice of norm reflects the analyst’s tolerance for large estimation errors. The \(\ell^{\infty}\) norm emphasizes minimizing the maximum estimation error across all variables or axes, making it suitable when the worst-case error is critical. The \(\ell^{2}\) norm also penalizes larger errors, but to a lesser extent. In contrast, the \(\ell^{1}\) norm allows for some large estimation errors if doing so leads to significantly smaller errors for other variables.
The current version of the package considers three options regarding the constraints of the ARA optimization problem:
The aramappings package contains nine functions to compute ARA mappings (i.e., the embedded points in \(\mathbf{P}\)), according to the chosen norm and type of constraints. The names of thefunctions are shown in the following table:
| Unconstrained | Exact | Ordered | |
|---|---|---|---|
| \(\ell^{2}\) norm (squared) | ara_unconstrained_l2() |
ara_exact_l2() |
ara_ordered_l2() |
| \(\ell^{1}\) norm | ara_unconstrained_l1() |
ara_exact_l1() |
ara_ordered_l1() |
| \(\ell^{\infty}\) norm | ara_unconstrained_linf() |
ara_exact_linf() |
ara_ordered_linf() |
The package also contains an additional basic function for drawing two-dimensional ARA plots when using standardized data:
draw_ara_plot_2d_standardized()The function is included in order to show the results of the ARA mapping functions (i.e., the embedded points), together wirh the axis vectors and labeled axis lines. However, it does not consider several aspects that may be necessary for generating a clear visualization. Note that ARA should be implemented within an interactive visualization environment that offers extensive features. Such a tool should allow users to dynamically manipulate plots (e.g., by modifying axis vectors; customizing or toggling elements such as axes, labels, and variable names; encoding additional data dimensions through point color, size, or shape; and animating transitions between plots).
Install a stable version from CRAN
Install the development version of aramappings from GitHub with:
or
This section presents several usage examples.
We recommend starting an exploratory analysis using an unconstrained ARA plot with the \(\ell^{2}\) norm, which can be generated very efficiently since the mappings (\(\mathbf{P}\)) can be obtained through a closed-form expression (i.e., a formula). The obvious first step is to load the aramappings package:
In the usage examples we will use the Auto MPG dataset available in packages ascentTraining and grpnet, Kaggle, or the UCI Machine Learning Repository (Frank and Asuncion 2010). In this case, we load the dataset in the ascentTraining package.
Next, we select a subset of numerical variables from the dataset. The
selected variables are specified through a vector containing their
column indices in the original dataset. Furthermore, we rename the data
set to X, simply for clarity with respect to the notation
defined above.
# Define subset of (numerical) variables
selected_variables <- c(1, 4, 5, 6) # 1:"mpg", 4:"horsepower", 5:"weight", 6:"acceleration")
# Retain only selected variables and rename dataset as X
X <- auto_mpg[, selected_variables] # Select a subset of variables
rm(auto_mpg)The ARA functions halt if the data or other parameters contain missing values. Thus, we proceed eliminating any row (data observation) that contains missing values. Naturally, another approach consists of replacing missing values by some other substituted values (imputation).
# Remove rows with missing values from X
N <- nrow(X)
rows_to_delete <- NULL
for (i in 1:N) {
if (sum(is.na(X[i, ])) > 0) {
rows_to_delete <- c(rows_to_delete, -i)
}
}
X <- X[rows_to_delete, ]At this moment X is a data.frame. In order to use the
ARA functions we first need to convert it to a matrix:
In addition, we strongly recommend standardizing the data when using
ARA. We save the result in variable Z since the function
that draws the plot needs the original values in X.
Having preprocessed the data, the next step consists of defining the
axis vectors, which are the rows of \(\mathbf{V}\). These can be obtained
manually (ideally through a graphical user interface), or through an
automatic method. For instante, \(\mathbf{V}\) could be the matrix defining
the Principal Component Analysis transfomation (in that case the ARA
plot would be a Principal Component Biplot). In this example, we simply
define a configuration of vectors in polar coordinates and transform
them to Cartesian coordinates with the pol2cart() function
in package geometry.
# Define axis vectors (2-dimensional in this example)
library(geometry)
r <- c(0.8, 1, 1.2, 1)
theta <- c(225, 100, 315, 80) * 2 * pi / 360
V <- pol2cart(theta, r)It is also possible to define weights in order to control the relative importance of estimating accurately on each axis. This is nevertheless complex, since the accuracy also depends on the length of the axis vectors. In this case, we have set two weigths to 1 and another two to 0.75. Visually, the weights determine the level of gray used to color the axis vectors in the ARA plot.
Having defined the data (Z), axis vectors
(V), and weights (weights), we can proceed to
compute the ARA mapping. For unconstrained ARA plots with the \(\ell^{2}\) norm the solver
parameter should be set to “formula” (the default), in order to obtain
the mapping through a closed-form expression. The following code
computes the embedded points and saves them in mapping$P,
and shows the execution runtime involved in computing the mapping.
# Compute the mapping and print the execution time
start <- Sys.time()
mapping <- ara_unconstrained_l2(
Z,
V,
weights = weights,
solver = "formula"
)
end <- Sys.time()
message(c('Execution time: ',end - start, ' seconds'))
#> Execution time: 0.00945591926574707 secondsARA plots can get cluttered when showing all of the axis lines and
corresponding labels at the tick marks. Thus, the function that we will
use to draw the ARA plot (draw_ara_plot_2d_standardized())
contains a parameter called axis_lines for specifying the
subset of axis lines (and labels) we wish to visualize. In this example
we will show the axes associated with variables “mpg” and
“acceleration”. Note that in the original data frame these were
variables 1 and 6. However, since we only retained four variables
(“mpg”, “horsepower”, “weight”, and “acceleration”), the column index of
“acceleration” is now 4.
# Select variables with labeled axis lines on ARA plot
axis_lines <- c(1, 4) # 1:"mpg", 4:"acceleration")Also, the plotted points can be colored according to the values of a particular variable. In this example, we will color the points according to the value of variable “mpg”.
Finally, we generate the ARA plot by calling
draw_ara_plot_2d_standardized():
# Draw the ARA plot
draw_ara_plot_2d_standardized(
Z,
X,
V,
mapping$P,
weights = weights,
axis_lines = axis_lines,
color_variable = color_variable
)
#> [1] 0Unconstrained ARA plot with the l2 norm of a subset of the Autompg dataset.
Exact and ordered ARA plots do not use weights, but it is necessary to specify the variable for which to obtain exact estimates, or for ordering the plotted points:
The exact ARA mapping for the \(\ell^{2}\) norm is computed through:
# Compute the mapping and print the execution time
start <- Sys.time()
mapping <- ara_exact_l2(
Z,
V,
variable = variable,
solver = "formula"
)
end <- Sys.time()
message(c('Execution time: ',end - start, ' seconds'))
#> Execution time: 0.460579872131348 secondsNote that it is also very efficient since the solution can also be expressed as a formula. Finally, we generate the ARA plot:
# Draw the ARA plot
draw_ara_plot_2d_standardized(
Z,
X,
V,
mapping$P,
axis_lines = axis_lines,
color_variable = color_variable
)
#> [1] 0Exact ARA plot with the l2 norm of a subset of the Autompg dataset. Exact estimates are obtained for variable ‘mpg’.
Note that the colors of the plotted points are in consonance with their position along the “mpg” axis.
The optimization problem for the ordered ARA mapping with the \(\ell^{2}\) is a quadratic program (QP). The following code solves the problem through the clarabel package.
# Compute the mapping and print the execution time
start <- Sys.time()
mapping <- ara_ordered_l2(
Z,
V,
variable = variable,
solver = "clarabel"
)
end <- Sys.time()
message(c('Execution time: ',end - start, ' seconds'))
#> Execution time: 0.0240919589996338 secondsFinally, we generate the ARA plot:
# Draw the ARA plot
draw_ara_plot_2d_standardized(
Z,
X,
V,
mapping$P,
axis_lines = axis_lines,
color_variable = color_variable
)
#> [1] 0Ordered ARA plot with the l2 norm of a subset of the Autompg dataset. The values of ‘mpg’ are ordered correctly along its corresponding axis.
Although the estimates for “mpg” are not exact, the values of “mpg” are ordered correctly along the corresponding axes. In comparison with the exact mapping, the ordered variant produces more accurate estimates for other variables.
ARA mappings of unconstrained and exact problems with the \(\ell^{1}\) or \(\ell^{\infty}\) can be computed in parallel using several cores. The ideal number of cores to use depends on the size of the input data matrix. In general, it grows with the number of observations. Users are advised to analyze runtimes on their computers to estimate an adequate number of cores to use on their data.
In the following example, we first detect the available number of
cores (NCORES) with package parallelly,
and will use half of them (if several are available).
# Detect the number of available CPU cores
library(parallelly)
NCORES <- parallelly::availableCores()
if (NCORES > 1) {
NCORES <- floor(NCORES / 2)
}We then create a cluster object with function
makeCluster() of the parallel package,
specifying the number of cores to use.
In this example we will compute an ARA unconstrained mapping with the
\(\ell^{1}\) norm. This function admits
weights, and lets us choose between several solvers. When using the
solver from package glpkAPI we can specify whether to
use a simplex algorithm, when parameter use_glpkAPI_simplex
is set to TRUE, or an interior point method if it is set to
FALSE. Lastly, the parameter cluster is the
cluster object created above.
# Compute the mapping and print the execution time
start <- Sys.time()
mapping <- ara_unconstrained_l1(
Z,
V,
weights = weights,
solver = "glpkAPI",
use_glpkAPI_simplex = TRUE,
cluster = cl
)
end <- Sys.time()
message(c('Execution time: ',end - start, ' seconds'))
#> Execution time: 0.10847020149231 secondsThe ARA plot generated through:
draw_ara_plot_2d_standardized(
Z,
X,
V,
mapping$P,
weights = weights,
axis_lines = axis_lines,
color_variable = color_variable
)
#> [1] 0Unconstrained ARA plot with the l1 norm of a subset of the Autompg dataset.
Alternatively, the following code computes an exact ARA mapping with the \(\ell^{1}\) norm:
# Compute the mapping and print the execution time
start <- Sys.time()
mapping <- ara_exact_l1(
Z,
V,
variable = variable,
solver = "glpkAPI",
use_glpkAPI_simplex = TRUE,
cluster = cl
)
end <- Sys.time()
message(c('Execution time: ',end - start, ' seconds'))
#> Execution time: 0.0387840270996094 secondsThe ARA plot generated through:
draw_ara_plot_2d_standardized(
Z,
X,
V,
mapping$P,
axis_lines = axis_lines,
color_variable = color_variable
)
#> [1] 0Exact ARA plot with the l1 norm of a subset of the Autompg dataset. Exact estimates are obtained for variable ‘mpg’.
Lastly, the cluster should be stopped at the end of the exploratory analysis process. It is not necessary to start a cluster and stop it after computing a single ARA mapping.