1 Introduction

Exploratory data analysis (EDA) relies on graphical summaries—boxplots, histograms, scatter plots—to reveal a dataset’s salient features before formal modeling. Yet modern data are increasingly recorded not as single scalar values but as intervals, histograms, or full empirical distributions. These richer objects are collectively known as symbolic data (Billard and Diday, 2006). For example, when individual-level measurements are aggregated by group, each variable naturally becomes an interval \([a, b]\) rather than a point value.

Conventional R graphics cannot natively accommodate interval-valued observations. ggInterval (formerly ggESDA) bridges this gap by extending the ggplot2 framework to visualize interval-valued symbolic data. The package provides a family of plot functions with a uniform interface:

ggInterval_<GRAPH_TYPE>(data, mapping = aes(...), ...)

where data is a symbolic data object and mapping uses the standard ggplot2::aes() syntax. Because most plot functions return ggplot2 objects, users can freely add themes, scales, labels, and additional layers.

2 Data Preparation

2.1 Built-in datasets

The package ships with several symbolic datasets. The most commonly used are:

  • facedata – 27 faces (9 subjects \(\times\) 3 replicates) with 6 interval-valued facial measurements (AD, BC, AH, DH, EH, GH).
  • Environment – 14 cities described by 17 variables, including both interval-valued and modal multi-valued variables.
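  • oils – 8 types of oils with 4 interval-valued chemical properties.

library(ggInterval)  # assumed installed; provides the datasets and the ggInterval_*() functions
library(ggplot2)     # needed later for aes(), labs(), theme_minimal(), coord_flip()
data(facedata)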
facedata
#> # A tibble: 27 × 6
#>                   AD              BC                AH                DH
#>  *        <symblc_n>      <symblc_n>        <symblc_n>        <symblc_n>
#>  1 [155.00 : 157.00] [58.00 : 61.01] [100.45 : 103.28] [105.00 : 107.30]
#>  2 [154.00 : 160.01] [57.00 : 64.00] [101.98 : 105.55] [104.35 : 107.30]
#>  3 [154.01 : 161.00] [57.00 : 63.00]  [99.36 : 105.65] [101.04 : 109.04]
#>  4 [168.86 : 172.84] [58.55 : 63.39] [102.83 : 106.53] [122.38 : 124.52]
#>  5 [169.85 : 175.03] [60.21 : 64.38] [102.94 : 108.71] [120.24 : 124.52]
#>  6 [168.76 : 175.15] [61.40 : 63.51] [104.35 : 107.45] [120.93 : 125.18]
#>  7 [155.26 : 160.45] [53.15 : 60.21]   [95.88 : 98.49]   [91.68 : 94.37]
#>  8 [156.26 : 161.31] [51.09 : 60.07]   [95.77 : 99.36]   [91.21 : 96.83]
#>  9 [154.47 : 160.31] [55.08 : 59.03]   [93.54 : 98.98]   [90.43 : 96.43]
#> 10 [164.00 : 168.00] [55.01 : 60.03] [120.28 : 123.04] [117.52 : 121.02]
#> # ℹ 17 more rows
#> # ℹ 2 more variables: EH <symblc_n>, GH <symblc_n>
summary(facedata)
#> $symbolic_interval
#>                        AD              BC                AH                DH
#> Min.    [149.34 : 155.32] [50.36 : 55.23]   [93.54 : 98.49]   [90.43 : 94.37]
#> 1st Qu. [154.56 : 158.91] [53.60 : 59.08] [102.88 : 106.99] [105.18 : 111.07]
#> Median  [163.00 : 167.07] [57.00 : 63.00] [115.26 : 119.60] [114.28 : 117.41]
#> Mean    [162.90 : 162.90] [60.00 : 60.00] [113.17 : 113.17] [113.20 : 113.20]
#> 3rd Qu. [167.13 : 171.19] [61.22 : 65.04] [117.91 : 121.60] [117.10 : 121.72]
#> Max.    [169.85 : 175.15] [66.03 : 69.01] [123.75 : 127.29] [124.08 : 127.78]
#> Std.        [6.82 : 6.82]   [4.42 : 4.42]     [9.08 : 9.08]     [9.48 : 9.48]
#>                      EH              GH
#> Min.    [49.41 : 54.64] [48.27 : 50.61]
#> 1st Qu. [54.65 : 58.49] [51.60 : 56.03]
#> Median  [56.73 : 61.72] [55.32 : 60.46]
#> Mean    [59.85 : 59.85] [57.69 : 57.69]
#> 3rd Qu. [60.96 : 65.80] [58.52 : 63.84]
#> Max.    [63.89 : 69.07] [64.20 : 67.80]
#> Std.      [4.04 : 4.04]   [4.63 : 4.63]

2.2 Converting classical data with classic2sym

Classical (scalar) data can be converted to symbolic interval data by aggregating within groups. The classic2sym() function supports several grouping strategies:

myIris <- classic2sym(iris, groupby = "Species")
myIris$intervalData
#>             Sepal.Length   Sepal.Width  Petal.Length   Petal.Width
#> setosa     [4.30 : 5.80] [2.30 : 4.40] [1.00 : 1.90] [0.10 : 0.60]
#> versicolor [4.90 : 7.00] [2.00 : 3.40] [3.00 : 5.10] [1.00 : 1.80]
#> virginica  [4.90 : 7.90] [2.20 : 3.80] [4.50 : 6.90] [1.40 : 2.50]
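For intuition, grouping by a factor amounts to taking the per-group minimum and maximum of each numeric column. The base-R sketch below reproduces the interval endpoints printed above without calling classic2sym(). The object names (num_cols, lower, upper, intervals) are illustrative and not part of the package, and classic2sym() may do more internally.

```r
# Per-group [min : max] intervals for every numeric column of iris (base R only).
num_cols <- names(iris)[sapply(iris, is.numeric)]

# One row per species, holding the smallest / largest value of each column.
lower <- aggregate(iris[num_cols], by = list(Species = iris$Species), FUN = min)
upper <- aggregate(iris[num_cols], by = list(Species = iris$Species), FUN = max)

# Format the endpoints in the same "[a : b]" style as the printout above.
intervals <- data.frame(row.names = as.character(lower$Species))
for (v in num_cols) {
  intervals[[v]] <- sprintf("[%.2f : %.2f]", lower[[v]], upper[[v]])
}
print(intervals)
```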

The groupby argument accepts:

  • A column name (factor variable) in the data, e.g. "Species".
  • "kmeans" or "hclust" for unsupervised clustering (with k groups).
  • "customize" for user-supplied minimum and maximum data frames.

myIris_km <- classic2sym(iris, groupby = "kmeans", k = 5)
myIris_km$intervalData
#> # A tibble: 5 × 5
#>    Sepal.Length   Sepal.Width  Petal.Length   Petal.Width
#>      <symblc_n>    <symblc_n>    <symblc_n>    <symblc_n>
#> 1 [5.60 : 7.00] [2.20 : 3.40] [4.30 : 5.60] [1.20 : 2.40]
#> 2 [4.90 : 5.80] [3.30 : 4.40] [1.20 : 1.90] [0.10 : 0.60]
#> 3 [4.30 : 5.00] [2.30 : 3.60] [1.00 : 1.90] [0.10 : 0.30]
#> 4 [6.30 : 7.90] [2.50 : 3.80] [5.10 : 6.90] [1.60 : 2.50]
#> 5 [4.90 : 6.10] [2.00 : 3.00] [3.00 : 4.50] [1.00 : 1.70]
#> # ℹ 1 more variable: Species <symblc_m>
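The clustering route can be pictured the same way: cluster the rows first, then take per-cluster ranges. The sketch below uses stats::kmeans() directly. The seed, the nstart value, and the object names are assumptions made for illustration, and because k-means is randomly initialized, the resulting intervals need not match the printout above.

```r
# Cluster the four numeric iris columns, then summarize each cluster as intervals.
set.seed(1)                       # arbitrary seed, only for reproducibility of this sketch
x  <- iris[, 1:4]
cl <- kmeans(x, centers = 5, nstart = 10)$cluster

# Lower and upper endpoints per cluster.
lower <- aggregate(x, by = list(cluster = cl), FUN = min)
upper <- aggregate(x, by = list(cluster = cl), FUN = max)

# One interval-valued row per cluster.
intervals <- data.frame(cluster = lower$cluster)
for (v in names(x)) {
  intervals[[v]] <- sprintf("[%.2f : %.2f]", lower[[v]], upper[[v]])
}
print(intervals)
```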

2.3 Converting RSDA objects with RSDA2sym

If you already have an RSDA symbolic_tbl object, wrap it with RSDA2sym() so it can be used with all ggInterval plot functions:

mySym <- RSDA2sym(Cardiological)
mySym$intervalData

3 Descriptive Statistics

ggInterval provides S3 methods for common statistical summaries on symbolic interval data.

3.1 Summary statistics

summary() reports the minimum, quartiles, median, mean, maximum, and standard deviation for each interval-valued variable:

summary(facedata)
#> $symbolic_interval
#>                        AD              BC                AH                DH
#> Min.    [149.34 : 155.32] [50.36 : 55.23]   [93.54 : 98.49]   [90.43 : 94.37]
#> 1st Qu. [154.56 : 158.91] [53.60 : 59.08] [102.88 : 106.99] [105.18 : 111.07]
#> Median  [163.00 : 167.07] [57.00 : 63.00] [115.26 : 119.60] [114.28 : 117.41]
#> Mean    [162.90 : 162.90] [60.00 : 60.00] [113.17 : 113.17] [113.20 : 113.20]
#> 3rd Qu. [167.13 : 171.19] [61.22 : 65.04] [117.91 : 121.60] [117.10 : 121.72]
#> Max.    [169.85 : 175.15] [66.03 : 69.01] [123.75 : 127.29] [124.08 : 127.78]
#> Std.        [6.82 : 6.82]   [4.42 : 4.42]     [9.08 : 9.08]     [9.48 : 9.48]
#>                      EH              GH
#> Min.    [49.41 : 54.64] [48.27 : 50.61]
#> 1st Qu. [54.65 : 58.49] [51.60 : 56.03]
#> Median  [56.73 : 61.72] [55.32 : 60.46]
#> Mean    [59.85 : 59.85] [57.69 : 57.69]
#> 3rd Qu. [60.96 : 65.80] [58.52 : 63.84]
#> Max.    [63.89 : 69.07] [64.20 : 67.80]
#> Std.      [4.04 : 4.04]   [4.63 : 4.63]
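The Mean and Std. rows are degenerate intervals because they are single numbers. One common definition for interval data is the Bertrand and Goupil (2000) sample mean and variance, which treat values as uniformly spread within each interval. Whether summary() uses exactly these formulas is an assumption here. The sketch applies them to the AD endpoints of the first ten faces shown in Section 2.1, so the result will differ from the full-data summary above.

```r
# AD endpoints of the first ten faces, copied from the facedata printout.
lo <- c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00)
hi <- c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00)
n  <- length(lo)

# Symbolic mean: the average of the interval midpoints.
mid      <- (lo + hi) / 2
sym_mean <- mean(mid)

# Symbolic variance (Bertrand-Goupil): spread between midpoints
# plus the spread inside each interval.
sym_var <- sum(lo^2 + lo * hi + hi^2) / (3 * n) - sym_mean^2
sym_sd  <- sqrt(sym_var)

print(c(mean = sym_mean, sd = sym_sd))
```

The variance decomposes into the variance of the midpoints plus one twelfth of the mean squared interval width, which is why wider intervals inflate the standard deviation even when the centers coincide.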

3.2 Correlation and covariance

cor() and cov() compute association matrices. Several methods are available for interval data, including "centers", "B" (Billard), "BD" (Billard–Diday), and "BG" (Bertrand–Goupil):

cor(facedata)
#>             AD        BC         AH         DH           EH         GH
#> AD 1.000000000 0.6882596  0.3770045  0.6305841  0.005217304  0.1873164
#> BC 0.688259575 1.0000000  0.2910128  0.4634647  0.193673951  0.2351438
#> AH 0.377004536 0.2910128  1.0000000  0.7062072 -0.376548510 -0.6085799
#> DH 0.630584078 0.4634647  0.7062072  1.0000000 -0.471592548 -0.2422946
#> EH 0.005217304 0.1936740 -0.3765485 -0.4715925  1.000000000  0.6889340
#> GH 0.187316425 0.2351438 -0.6085799 -0.2422946  0.688934015  1.0000000
cov(facedata)
#>            AD        BC        AH        DH          EH         GH
#> AD 46.4682449 20.719131  23.32847  40.76781   0.1435936   5.915365
#> BC 20.7191307 19.502140  11.66581  19.41128   3.4532110   4.810633
#> AH 23.3284745 11.665807  82.39925  60.79810 -13.8004527 -25.592151
#> DH 40.7678128 19.411276  60.79810  89.94822 -18.0581801 -10.645536
#> EH  0.1435936  3.453211 -13.80045 -18.05818  16.3012737  12.885936
#> GH  5.9153653  4.810633 -25.59215 -10.64554  12.8859360  21.461256
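As a point of reference, the "centers" approach is usually understood as the classical correlation applied to interval midpoints. The sketch below computes it in base R for AD and BC using the first ten faces from Section 2.1. It is not a call into the package, and the value will differ from the full-data matrix above.

```r
# Endpoints of AD and BC for the first ten faces, copied from the facedata printout.
ad_lo <- c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00)
ad_hi <- c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00)
bc_lo <- c(58.00, 57.00, 57.00, 58.55, 60.21, 61.40, 53.15, 51.09, 55.08, 55.01)
bc_hi <- c(61.01, 64.00, 63.00, 63.39, 64.38, 63.51, 60.21, 60.07, 59.03, 60.03)

# Reduce each interval to its midpoint, then use the ordinary estimators.
ad_mid <- (ad_lo + ad_hi) / 2
bc_mid <- (bc_lo + bc_hi) / 2

cov_centers <- cov(ad_mid, bc_mid)
cor_centers <- cor(ad_mid, bc_mid)
print(c(cov = cov_centers, cor = cor_centers))
```

Because the midpoints discard interval widths, this estimator ignores within-interval variation; the other methods listed above account for it in different ways.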

3.3 Standardization

scale() standardizes symbolic interval data (centering and scaling), which can be useful before multivariate analyses:

facedata_scaled <- scale(facedata)
facedata_scaled
#> <ggInterval>
#>   Public:
#>     clone: function (deep = FALSE) 
#>     clusterResult: NULL
#>     initialize: function (rawData = NULL, statisticsDF = NULL, intervalData = NULL, 
#>     intervalData: data.frame, symbolic_tbl
#>     rawData: NULL
#>     statisticsDF: list
#>   Private:
#>     invalidDataType: function ()

4 Univariate Plots

4.1 Index plot

ggInterval_indexplot() displays the interval range of each observation as a vertical bar. This is useful for spotting outliers and comparing spreads across observations.

ggInterval_indexplot(facedata, aes(x = AD))
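To see what such a plot encodes, a rough equivalent can be built in plain ggplot2 with geom_linerange(), one vertical bar per observation. This is only a sketch on the first ten AD intervals from Section 2.1. The data frame ad and its column names are made up for the example, and the package's own styling will differ.

```r
library(ggplot2)

# First ten AD intervals, copied from the facedata printout.
ad <- data.frame(
  index = 1:10,
  lo = c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00),
  hi = c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00)
)

# One vertical bar per observation, spanning its lower to upper endpoint.
p <- ggplot(ad, aes(x = index, ymin = lo, ymax = hi)) +
  geom_linerange() +
  labs(x = "Observation", y = "AD")
p
```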

4.2 Index image

ggInterval_indexImage() replaces the margin bars of the index plot with a color-coded strip. The column_condition parameter controls whether colors represent column-wise or matrix-wise conditions, and full_strip expands the color strip to the full figure width.

ggInterval_indexImage(facedata, aes(AD),
                      column_condition = TRUE, full_strip = FALSE)

ggInterval_indexImage(facedata, aes(AD),
                      column_condition = TRUE, full_strip = TRUE) +
  coord_flip()

4.3 Boxplot

ggInterval_boxplot() draws an interval-valued box plot in which nested rectangles summarize the distribution of the interval endpoints across observations. Use plotAll = TRUE to display all variables side by side.

ggInterval_boxplot(facedata, aes(AD))

ggInterval_boxplot(facedata, plotAll = TRUE)

4.4 Histogram

ggInterval_hist() constructs a histogram from interval-valued data. Two binning strategies are supported:

  • method = "equal-bin" (default): bins of equal width.
  • method = "unequal-bin": bin boundaries depend on the data distribution.

Note that ggInterval_hist() returns a list; use $plot to extract the ggplot2 object.

ggInterval_hist(facedata, aes(x = AD), bins = 10,
                method = "equal-bin")$plot

ggInterval_hist(facedata, aes(x = AD),
                method = "unequal-bin")$plot
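A common way to build an equal-width histogram from intervals is to let each observation contribute to a bin the fraction of its interval that lies inside that bin, assuming values are uniform within the interval. The base-R sketch below follows that idea for the first ten AD intervals. It is an illustration under that assumption, not the package's code, and the helper overlap() is hypothetical.

```r
# First ten AD intervals, copied from the facedata printout.
lo <- c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00)
hi <- c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00)

# Five equal-width bins covering the whole observed range.
breaks <- seq(min(lo), max(hi), length.out = 6)

# Length of the overlap between interval [a, b] and bin [l, u] (zero if disjoint).
overlap <- function(a, b, l, u) pmax(0, pmin(b, u) - pmax(a, l))

# Each observation adds the share of its interval that falls in the bin.
freq <- sapply(seq_len(length(breaks) - 1), function(k) {
  sum(overlap(lo, hi, breaks[k], breaks[k + 1]) / (hi - lo))
})

print(round(freq, 3))
```

The bin frequencies are fractional but add up to the number of observations, since every interval is fully covered by the bins.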

4.5 Min-max plot

ggInterval_MMplot() marks the minimum and maximum endpoints of each observation’s interval, connected by a line segment. This makes it easy to compare ranges across observations.

ggInterval_MMplot(facedata, aes(AD))

Use plotAll = TRUE to display all variables together:

ggInterval_MMplot(facedata, plotAll = TRUE)

4.6 Center-range plot

ggInterval_CRplot() plots each observation as a point in a two-dimensional space where the x-axis is the center (midpoint) of the interval and the y-axis is the range (spread).

ggInterval_CRplot(facedata, aes(AD))

ggInterval_CRplot(facedata, plotAll = TRUE)
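The two coordinates of the center-range plot are simple functions of the interval endpoints. The base-R lines below compute them for the first ten AD intervals from Section 2.1; the object names are illustrative.

```r
# First ten AD intervals, copied from the facedata printout.
lo <- c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00)
hi <- c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00)

center <- (lo + hi) / 2   # x-axis: interval midpoint
range_ <- hi - lo         # y-axis: interval width

cr <- data.frame(center = center, range = range_)
print(cr)
```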

5 Bivariate Plots

5.1 Scatter plot

ggInterval_scatterplot() visualizes two interval-valued variables simultaneously. Each observation is drawn as a rectangle whose width and height represent the intervals on the x- and y-axes, respectively.

ggInterval_scatterplot(facedata, aes(x = AD, y = BC))
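The rectangle encoding can be mimicked in plain ggplot2 with geom_rect(), mapping the AD endpoints to xmin/xmax and the BC endpoints to ymin/ymax. The sketch uses the first ten faces from Section 2.1. The data frame faces and the fill, alpha, and colour choices are arbitrary and only meant to illustrate the geometry.

```r
library(ggplot2)

# AD and BC endpoints of the first ten faces, copied from the facedata printout.
faces <- data.frame(
  ad_lo = c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00),
  ad_hi = c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00),
  bc_lo = c(58.00, 57.00, 57.00, 58.55, 60.21, 61.40, 53.15, 51.09, 55.08, 55.01),
  bc_hi = c(61.01, 64.00, 63.00, 63.39, 64.38, 63.51, 60.21, 60.07, 59.03, 60.03)
)

# One rectangle per observation: width = AD interval, height = BC interval.
p <- ggplot(faces, aes(xmin = ad_lo, xmax = ad_hi, ymin = bc_lo, ymax = bc_hi)) +
  geom_rect(fill = "grey60", alpha = 0.4, colour = "black") +
  labs(x = "AD", y = "BC")
p
```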

5.2 2D histogram

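ggInterval_2Dhist() partitions the bivariate domain into a grid and, for each cell, accumulates the share of every observation's rectangle that overlaps it, so cell frequencies are generally fractional and sum to the number of observations. The xBins and yBins parameters control the grid resolution.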

ggInterval_2Dhist(facedata, aes(x = AD, y = BC), xBins = 10, yBins = 10)
#> $plot

#> 
#> $`Table (AD, BC)`
#>                 [50:52.23] [52:54.09] [54:55.95] [56:57.82] [58:59.69]
#> [149:151.92]         0.017      0.183      0.334      0.339      0.191
#> [152:154.5]          0.086      0.359      0.511      0.414      0.269
#> [155:157.08]         0.414      1.067      1.162      0.727       1.19
#> [157:159.66]         0.178      0.395      0.522      0.575      0.608
#> [160:162.24]         0.041      0.088      0.132      0.193      0.216
#> [162:164.83]             0      0.003      0.176      0.251      0.187
#> [165:167.41]             0      0.004      0.383      0.618      0.481
#> [167:169.99]             0      0.004      0.243      0.343      0.404
#> [170:172.57]             0          0      0.001      0.001      0.165
#> [173:175.15]             0          0          0          0      0.016
#> Frequency of AD      0.736      2.103      3.464      3.461      3.727
#> Margin of AD         0.027      0.078      0.128      0.128      0.138
#>                 [60:61.55] [62:63.42] [63:65.28] [65:67.15] [67:69.01]
#> [149:151.92]         0.015          0          0          0          0
#> [152:154.5]          0.068      0.039      0.007          0          0
#> [155:157.08]         0.727      0.204      0.036          0          0
#> [157:159.66]          0.29      0.204      0.036          0          0
#> [160:162.24]           0.1      0.062      0.055      0.078      0.078
#> [162:164.83]         0.014          0      0.122      0.526      0.426
#> [165:167.41]         0.054      0.038      0.217      0.888      0.622
#> [167:169.99]         0.562      1.003      1.192      0.802      0.395
#> [170:172.57]         0.701      1.286      0.901      0.458      0.225
#> [173:175.15]         0.252      0.652      0.178          0          0
#> Frequency of AD      2.783      3.488      2.744      2.752      1.746
#> Margin of AD         0.103      0.129      0.102      0.102      0.065
#>                 Frequency of BC Margin of BC
#> [149:151.92]              1.079         0.04
#> [152:154.5]               1.753        0.065
#> [155:157.08]              5.527        0.205
#> [157:159.66]              2.808        0.104
#> [160:162.24]              1.043        0.039
#> [162:164.83]              1.705        0.063
#> [165:167.41]              3.305        0.122
#> [167:169.99]              4.948        0.183
#> [170:172.57]              3.738        0.138
#> [173:175.15]              1.098        0.041
#> Frequency of AD              27             
#> Margin of AD                               1
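The fractional cell values in the table can be understood through a small base-R sketch. Assuming values are uniform within each observation's rectangle, the share of a rectangle falling in a grid cell is the product of its one-dimensional overlap shares. The code below applies this to the first ten faces with a 5 × 5 grid. The helper overlap_share() is hypothetical, and the package may compute its table differently.

```r
# AD and BC endpoints of the first ten faces, copied from the facedata printout.
ad_lo <- c(155.00, 154.00, 154.01, 168.86, 169.85, 168.76, 155.26, 156.26, 154.47, 164.00)
ad_hi <- c(157.00, 160.01, 161.00, 172.84, 175.03, 175.15, 160.45, 161.31, 160.31, 168.00)
bc_lo <- c(58.00, 57.00, 57.00, 58.55, 60.21, 61.40, 53.15, 51.09, 55.08, 55.01)
bc_hi <- c(61.01, 64.00, 63.00, 63.39, 64.38, 63.51, 60.21, 60.07, 59.03, 60.03)

# Share of each interval [lo, hi] that falls in each bin; returns an n x bins matrix.
overlap_share <- function(lo, hi, breaks) {
  sapply(seq_len(length(breaks) - 1), function(j) {
    pmax(0, pmin(hi, breaks[j + 1]) - pmax(lo, breaks[j])) / (hi - lo)
  })
}

x_breaks <- seq(min(ad_lo), max(ad_hi), length.out = 6)   # 5 bins on AD
y_breaks <- seq(min(bc_lo), max(bc_hi), length.out = 6)   # 5 bins on BC

x_share <- overlap_share(ad_lo, ad_hi, x_breaks)
y_share <- overlap_share(bc_lo, bc_hi, y_breaks)

# Cell (j, k) sums, over observations, the product of the AD and BC shares.
freq <- t(x_share) %*% y_share
print(round(freq, 3))
```

Row and column sums of freq give the one-dimensional histogram frequencies, and the grand total equals the number of observations, matching the structure of the marginal rows and columns in the table above.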

Here is the same plot for the oils dataset:

data(oils)
ggInterval_2Dhist(oils, aes(x = GRA, y = FRE), xBins = 5, yBins = 5)
#> $plot

#> 
#> $`Table (GRA, FRE)`
#>                  [-27:-14] [-14:-1] [-1:12] [12:25] [25:38] Frequency of FRE
#> [1:0.87]                 0        0       0     0.3     1.7                2
#> [1:0.89]                 0        0       0       0       0                0
#> [1:0.91]                 0        0       0       0       0                0
#> [1:0.92]                 1      1.2       1       0       0              3.2
#> [1:0.94]             0.684    2.116       0       0       0              2.8
#> Frequency of GRA     1.684    3.316       1     0.3     1.7                8
#> Margin of GRA        0.211    0.414   0.125   0.038   0.212                 
#>                  Margin of FRE
#> [1:0.87]                  0.25
#> [1:0.89]                     0
#> [1:0.91]                     0
#> [1:0.92]                   0.4
#> [1:0.94]                  0.35
#> Frequency of GRA              
#> Margin of GRA                1

6 Multivariate Plots

6.1 Scatter plot matrix

ggInterval_scatterMatrix() produces a pairwise scatter plot matrix for all continuous interval variables in the dataset. Note that this function returns a marrangeGrob object (from gridExtra), not a ggplot2 object.

ggInterval_scatterMatrix(facedata[, 1:3])

6.2 2D histogram matrix

ggInterval_2DhistMatrix() is the matrix analogue of ggInterval_2Dhist(), showing 2D histograms for all variable pairs.

ggInterval_2DhistMatrix(oils, xBins = 5, yBins = 5)

6.3 Index image heatmap

When plotAll = TRUE, ggInterval_indexImage() produces a heatmap-style visualization across all variables, providing an overview of the entire dataset.

ggInterval_indexImage(facedata, plotAll = TRUE)

6.4 Radar plot

ggInterval_radarplot() displays multiple interval-valued variables on radial axes. Each observation is represented by a polygon (or rectangle) whose extent along each axis shows the interval range. The plotPartial argument selects which observations to display.

data(Environment)
ggInterval_radarplot(Environment[, 5:17],
                     plotPartial = 2,
                     showLegend = FALSE,
                     base_circle = TRUE,
                     base_lty = 2,
                     addText = FALSE) +
  labs(title = "Environment: radar plot (default)")

The type = "rect" variant draws rectangles instead of polygons:

ggInterval_radarplot(Environment[, 5:17],
                     plotPartial = 2,
                     type = "rect",
                     showLegend = FALSE,
                     base_circle = TRUE,
                     addText = FALSE) +
  labs(title = "Environment: radar plot (rect)")

6.5 3D scatter plot

ggInterval_3Dscatterplot() visualizes three interval-valued variables, rendering each observation as a cube-like shape projected into two dimensions.

ggInterval_3Dscatterplot(facedata[1:5, ], aes(x = BC, y = EH, z = GH))

7 Principal Component Analysis

ggInterval_PCA() performs vertices-based PCA on interval-valued data. Each interval observation is expanded to its vertices (all \(2^p\) corner combinations), PCA is applied, and the results are projected back to interval form.

pca_result <- ggInterval_PCA(facedata, plot = FALSE)
pca_result$ggplotPCA
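The vertices idea described above can be sketched in base R: expand each observation into its \(2^p\) corners, run an ordinary PCA on the stacked corners with prcomp(), and take the per-observation minimum and maximum of the projected corners as the interval on each component. The example uses three variables (AD, BC, AH) for faces 1, 4, 7, and 10 from Section 2.1. Scaling the variables and keeping two components are choices made for this illustration, and ggInterval_PCA() may differ in these details.

```r
# Lower and upper endpoints (AD, BC, AH) for faces 1, 4, 7, 10 from the printout.
vars  <- c("AD", "BC", "AH")
lower <- matrix(c(155.00, 58.00, 100.45,
                  168.86, 58.55, 102.83,
                  155.26, 53.15,  95.88,
                  164.00, 55.01, 120.28),
                ncol = 3, byrow = TRUE, dimnames = list(NULL, vars))
upper <- matrix(c(157.00, 61.01, 103.28,
                  172.84, 63.39, 106.53,
                  160.45, 60.21,  98.49,
                  168.00, 60.03, 123.04),
                ncol = 3, byrow = TRUE, dimnames = list(NULL, vars))
p <- ncol(lower)

# All 2^p ways to pick the lower (0) or upper (1) endpoint of each variable.
corners <- as.matrix(expand.grid(rep(list(0:1), p)))

# Stack the 2^p vertices of every observation into one matrix.
vertices <- do.call(rbind, lapply(seq_len(nrow(lower)), function(i) {
  t(apply(corners, 1, function(z) ifelse(z == 1, upper[i, ], lower[i, ])))
}))
colnames(vertices) <- vars
obs <- rep(seq_len(nrow(lower)), each = nrow(corners))

# Classical PCA on the vertices, then back to intervals via per-observation ranges.
pca      <- prcomp(vertices, center = TRUE, scale. = TRUE)
scores   <- pca$x[, 1:2]
pc_lower <- apply(scores, 2, function(s) tapply(s, obs, min))
pc_upper <- apply(scores, 2, function(s) tapply(s, obs, max))

print(round(cbind(pc_lower, pc_upper), 2))
```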

Setting poly = TRUE adds a convex-hull polygon connecting the projected vertices for each observation:

pca_poly <- ggInterval_PCA(facedata, poly = TRUE, plot = FALSE)
pca_poly$ggplotPCA

PCA also works with classical data once it has been converted with classic2sym():

myIris <- classic2sym(iris, groupby = "Species")
pca_iris <- ggInterval_PCA(myIris, plot = FALSE)
pca_iris$ggplotPCA

8 Working with ggplot2

Because most ggInterval functions return standard ggplot2 objects, you can customize plots with the full range of ggplot2 features.

Themes and labels:

ggInterval_indexplot(facedata, aes(x = AD)) +
  theme_minimal() +
  labs(title = "Index plot of AD", x = "Observation", y = "AD")

Custom color scales:

p <- ggInterval_hist(facedata, aes(x = AD), bins = 10,
                     method = "equal-bin")$plot
p + scale_fill_manual(values = rainbow(10))

Adding reference lines:

ggInterval_CRplot(facedata, aes(AD)) +
  geom_hline(yintercept = 5, linetype = "dashed", color = "red")

Note that ggInterval_scatterMatrix() returns a marrangeGrob object, so ggplot2 + operators cannot be applied to it directly.

9 References