Vector Binary Tree: Manage Your Data Through Column Names

Chen ZHANG

2024-01-11

What VBTree is designed for

VBTree, short for vector binary tree, is a data structure designed for data with highly structured column names. Summary tables often have column names built from several factors. In my case, the data came from a series of experiments with 2 data types, 7 temperature conditions, 4 strain rate conditions and 1 deformation rate, so the summary table has \(2 \times 7 \times 4 \times 1 = 56\) columns. I extracted the first 50 rows and saved them as ‘datatest’. Let’s see what it looks like:

library(VBTree)
dim(datatest)
#> [1] 50 56
head(datatest[,1:3])
#>   Strain-900-0.001-0.6 Stress-900-0.001-0.6 Strain-900-0.01-0.6
#> 1              0.00009                 2.81             0.00030
#> 2              0.00026                 2.95             0.00052
#> 3              0.00059                 3.37             0.00068
#> 4              0.00076                 3.37             0.00084
#> 5              0.00093                 3.65             0.00112
#> 6              0.00104                 3.79             0.00122
colnames(datatest)
#>  [1] "Strain-900-0.001-0.6"  "Stress-900-0.001-0.6"  "Strain-900-0.01-0.6"  
#>  [4] "Stress-900-0.01-0.6"   "Strain-900-0.1-0.6"    "Stress-900-0.1-0.6"   
#>  [7] "Strain-900-1-0.6"      "Stress-900-1-0.6"      "Strain-950-0.001-0.6" 
#> [10] "Stress-950-0.001-0.6"  "Strain-950-0.01-0.6"   "Stress-950-0.01-0.6"  
#> [13] "Strain-950-0.1-0.6"    "Stress-950-0.1-0.6"    "Strain-950-1-0.6"     
#> [16] "Stress-950-1-0.6"      "Strain-1000-0.001-0.6" "Stress-1000-0.001-0.6"
#> [19] "Strain-1000-0.01-0.6"  "Stress-1000-0.01-0.6"  "Strain-1000-0.1-0.6"  
#> [22] "Stress-1000-0.1-0.6"   "Strain-1000-1-0.6"     "Stress-1000-1-0.6"    
#> [25] "Strain-1050-0.001-0.6" "Stress-1050-0.001-0.6" "Strain-1050-0.01-0.6" 
#> [28] "Stress-1050-0.01-0.6"  "Strain-1050-0.1-0.6"   "Stress-1050-0.1-0.6"  
#> [31] "Strain-1050-1-0.6"     "Stress-1050-1-0.6"     "Strain-1100-0.001-0.6"
#> [34] "Stress-1100-0.001-0.6" "Strain-1100-0.01-0.6"  "Stress-1100-0.01-0.6" 
#> [37] "Strain-1100-0.1-0.6"   "Stress-1100-0.1-0.6"   "Strain-1100-1-0.6"    
#> [40] "Stress-1100-1-0.6"     "Strain-1150-0.001-0.6" "Stress-1150-0.001-0.6"
#> [43] "Strain-1150-0.01-0.6"  "Stress-1150-0.01-0.6"  "Strain-1150-0.1-0.6"  
#> [46] "Stress-1150-0.1-0.6"   "Strain-1150-1-0.6"     "Stress-1150-1-0.6"    
#> [49] "Strain-1200-0.001-0.6" "Stress-1200-0.001-0.6" "Strain-1200-0.01-0.6" 
#> [52] "Stress-1200-0.01-0.6"  "Strain-1200-0.1-0.6"   "Stress-1200-0.1-0.6"  
#> [55] "Strain-1200-1-0.6"     "Stress-1200-1-0.6"

Sometimes I need to extract the data at fixed temperature conditions; in other circumstances I have to export the data at a fixed strain rate. How can this be done without hand-coding the repeat counts of for or while loops? The main idea is to place all the column names into an array or tensor with dimensions \(2 \times 7 \times 4 \times 1\), so that the methods of arrays or tensors become applicable. Because the names repeat regularly in different combinations, it is natural to first split each name into its factors and then put those factors into a data structure that maps correctly between a character vector and an array or tensor. Two intermediate data structures, the double list and the vector binary tree, are designed for this purpose. Here is what they look like:

# Save character vector into chrvec:
chrvec <- colnames(datatest)
unregdl <- chrvec2dl(chrvec) # unregularized double list
print(unregdl) # The purely numeric layer (layer 2) is not sorted numerically, since all elements are treated as character
#> [[1]]
#> [1] "Strain" "Stress"
#> 
#> [[2]]
#> [1] "1000" "1050" "1100" "1150" "1200" "900"  "950" 
#> 
#> [[3]]
#> [1] "0.001" "0.01"  "0.1"   "1"    
#> 
#> [[4]]
#> [1] "0.6"
#> 
#> attr(,"class")
#> [1] "Double.List"
vbt <- dl2vbt(unregdl)
print(vbt) # elements in layer 2 were sorted
#> $tree
#> $tree[[1]]
#> [1] "Strain" "Stress"
#> 
#> $tree[[2]]
#> $tree[[2]][[1]]
#> [1] "900"  "950"  "1000" "1050" "1100" "1150" "1200"
#> 
#> $tree[[2]][[2]]
#> $tree[[2]][[2]][[1]]
#> [1] "0.001" "0.01"  "0.1"   "1"    
#> 
#> $tree[[2]][[2]][[2]]
#> $tree[[2]][[2]][[2]][[1]]
#> [1] "0.6"
#> 
#> $tree[[2]][[2]][[2]][[2]]
#> list()
#> 
#> 
#> 
#> 
#> 
#> $dims
#> [1] 2 7 4 1
#> 
#> attr(,"class")
#> [1] "Vector.Binary.Tree"
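
The difference between the two printouts above (layer 2 in character order versus numeric order) can be illustrated with a minimal base R sketch. This is not the package's implementation: `chrvec_demo` and the helper `sort_layer` are hypothetical names introduced only for illustration, and the sketch assumes that every name has the same number of "-" separated factors.

```r
# Hypothetical stand-in for colnames(datatest): a few names in the same pattern.
chrvec_demo <- c("Strain-900-0.001-0.6", "Stress-900-0.001-0.6",
                 "Strain-1000-0.01-0.6", "Stress-1000-0.01-0.6",
                 "Strain-950-1-0.6",     "Stress-950-1-0.6")

# Split every name on "-" and stack the pieces into a character matrix,
# one row per name and one column per layer.
parts <- do.call(rbind, strsplit(chrvec_demo, "-", fixed = TRUE))

# One character vector of unique levels per layer.
layers <- lapply(seq_len(ncol(parts)), function(j) unique(parts[, j]))

# Sort a layer numerically when all of its levels parse as numbers,
# otherwise sort it as plain text.
sort_layer <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  if (anyNA(num)) sort(x) else x[order(num)]
}
layers_sorted <- lapply(layers, sort_layer)
print(layers_sorted)
```

Sorting the numeric layers by value rather than as text is what keeps "900" ahead of "1000", which matters later when levels are addressed by position.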

The column names are thus split into four layers in both the double list and the vector binary tree, with 2, 7, 4 and 1 levels respectively. Using these data structures, we can readily convert the whole set of names into a tensor or an array. The demonstration:

ts <- dl2ts(unregdl) # Convert from double list to tensor
print(ts)
#> , , 1, 1
#> 
#>       I2
#> I1     [,1]                   [,2]                   [,3]                   
#>   [1,] "Strain-900-0.001-0.6" "Strain-950-0.001-0.6" "Strain-1000-0.001-0.6"
#>   [2,] "Stress-900-0.001-0.6" "Stress-950-0.001-0.6" "Stress-1000-0.001-0.6"
#>       I2
#> I1     [,4]                    [,5]                    [,6]                   
#>   [1,] "Strain-1050-0.001-0.6" "Strain-1100-0.001-0.6" "Strain-1150-0.001-0.6"
#>   [2,] "Stress-1050-0.001-0.6" "Stress-1100-0.001-0.6" "Stress-1150-0.001-0.6"
#>       I2
#> I1     [,7]                   
#>   [1,] "Strain-1200-0.001-0.6"
#>   [2,] "Stress-1200-0.001-0.6"
#> 
#> , , 2, 1
#> 
#>       I2
#> I1     [,1]                  [,2]                  [,3]                  
#>   [1,] "Strain-900-0.01-0.6" "Strain-950-0.01-0.6" "Strain-1000-0.01-0.6"
#>   [2,] "Stress-900-0.01-0.6" "Stress-950-0.01-0.6" "Stress-1000-0.01-0.6"
#>       I2
#> I1     [,4]                   [,5]                   [,6]                  
#>   [1,] "Strain-1050-0.01-0.6" "Strain-1100-0.01-0.6" "Strain-1150-0.01-0.6"
#>   [2,] "Stress-1050-0.01-0.6" "Stress-1100-0.01-0.6" "Stress-1150-0.01-0.6"
#>       I2
#> I1     [,7]                  
#>   [1,] "Strain-1200-0.01-0.6"
#>   [2,] "Stress-1200-0.01-0.6"
#> 
#> , , 3, 1
#> 
#>       I2
#> I1     [,1]                 [,2]                 [,3]                 
#>   [1,] "Strain-900-0.1-0.6" "Strain-950-0.1-0.6" "Strain-1000-0.1-0.6"
#>   [2,] "Stress-900-0.1-0.6" "Stress-950-0.1-0.6" "Stress-1000-0.1-0.6"
#>       I2
#> I1     [,4]                  [,5]                  [,6]                 
#>   [1,] "Strain-1050-0.1-0.6" "Strain-1100-0.1-0.6" "Strain-1150-0.1-0.6"
#>   [2,] "Stress-1050-0.1-0.6" "Stress-1100-0.1-0.6" "Stress-1150-0.1-0.6"
#>       I2
#> I1     [,7]                 
#>   [1,] "Strain-1200-0.1-0.6"
#>   [2,] "Stress-1200-0.1-0.6"
#> 
#> , , 4, 1
#> 
#>       I2
#> I1     [,1]               [,2]               [,3]               
#>   [1,] "Strain-900-1-0.6" "Strain-950-1-0.6" "Strain-1000-1-0.6"
#>   [2,] "Stress-900-1-0.6" "Stress-950-1-0.6" "Stress-1000-1-0.6"
#>       I2
#> I1     [,4]                [,5]                [,6]               
#>   [1,] "Strain-1050-1-0.6" "Strain-1100-1-0.6" "Strain-1150-1-0.6"
#>   [2,] "Stress-1050-1-0.6" "Stress-1100-1-0.6" "Stress-1150-1-0.6"
#>       I2
#> I1     [,7]               
#>   [1,] "Strain-1200-1-0.6"
#>   [2,] "Stress-1200-1-0.6"
#> 
#> attr(,"class")
#> [1] "tensor"
arr <- vbt2arr(vbt) # Convert from vector binary tree to array
print(arr)
#> , , 1, 1
#> 
#>      [,1]                   [,2]                   [,3]                   
#> [1,] "Strain-900-0.001-0.6" "Strain-950-0.001-0.6" "Strain-1000-0.001-0.6"
#> [2,] "Stress-900-0.001-0.6" "Stress-950-0.001-0.6" "Stress-1000-0.001-0.6"
#>      [,4]                    [,5]                    [,6]                   
#> [1,] "Strain-1050-0.001-0.6" "Strain-1100-0.001-0.6" "Strain-1150-0.001-0.6"
#> [2,] "Stress-1050-0.001-0.6" "Stress-1100-0.001-0.6" "Stress-1150-0.001-0.6"
#>      [,7]                   
#> [1,] "Strain-1200-0.001-0.6"
#> [2,] "Stress-1200-0.001-0.6"
#> 
#> , , 2, 1
#> 
#>      [,1]                  [,2]                  [,3]                  
#> [1,] "Strain-900-0.01-0.6" "Strain-950-0.01-0.6" "Strain-1000-0.01-0.6"
#> [2,] "Stress-900-0.01-0.6" "Stress-950-0.01-0.6" "Stress-1000-0.01-0.6"
#>      [,4]                   [,5]                   [,6]                  
#> [1,] "Strain-1050-0.01-0.6" "Strain-1100-0.01-0.6" "Strain-1150-0.01-0.6"
#> [2,] "Stress-1050-0.01-0.6" "Stress-1100-0.01-0.6" "Stress-1150-0.01-0.6"
#>      [,7]                  
#> [1,] "Strain-1200-0.01-0.6"
#> [2,] "Stress-1200-0.01-0.6"
#> 
#> , , 3, 1
#> 
#>      [,1]                 [,2]                 [,3]                 
#> [1,] "Strain-900-0.1-0.6" "Strain-950-0.1-0.6" "Strain-1000-0.1-0.6"
#> [2,] "Stress-900-0.1-0.6" "Stress-950-0.1-0.6" "Stress-1000-0.1-0.6"
#>      [,4]                  [,5]                  [,6]                 
#> [1,] "Strain-1050-0.1-0.6" "Strain-1100-0.1-0.6" "Strain-1150-0.1-0.6"
#> [2,] "Stress-1050-0.1-0.6" "Stress-1100-0.1-0.6" "Stress-1150-0.1-0.6"
#>      [,7]                 
#> [1,] "Strain-1200-0.1-0.6"
#> [2,] "Stress-1200-0.1-0.6"
#> 
#> , , 4, 1
#> 
#>      [,1]               [,2]               [,3]               
#> [1,] "Strain-900-1-0.6" "Strain-950-1-0.6" "Strain-1000-1-0.6"
#> [2,] "Stress-900-1-0.6" "Stress-950-1-0.6" "Stress-1000-1-0.6"
#>      [,4]                [,5]                [,6]               
#> [1,] "Strain-1050-1-0.6" "Strain-1100-1-0.6" "Strain-1150-1-0.6"
#> [2,] "Stress-1050-1-0.6" "Stress-1100-1-0.6" "Stress-1150-1-0.6"
#>      [,7]               
#> [1,] "Strain-1200-1-0.6"
#> [2,] "Stress-1200-1-0.6"
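
To make the mapping from levels to array positions concrete, here is a rough base R sketch that builds a comparable name array by hand. It is an illustration under stated assumptions, not how `vbt2arr` or `dl2ts` work internally; `lv`, `names_flat` and `arr_demo` are hypothetical names, and the levels are copied from the printed output above.

```r
# Levels per layer, copied from the printed vector binary tree above.
lv <- list(c("Strain", "Stress"),
           c("900", "950", "1000", "1050", "1100", "1150", "1200"),
           c("0.001", "0.01", "0.1", "1"),
           "0.6")

# expand.grid varies the first factor fastest, which matches the
# column-major order R uses when filling an array.
grid <- expand.grid(lv, stringsAsFactors = FALSE)
names_flat <- do.call(paste, c(grid, sep = "-"))

# Reshape the 56 names into a 2 x 7 x 4 x 1 character array.
arr_demo <- array(names_flat, dim = lengths(lv))

dim(arr_demo)
arr_demo[2, 3, 2, 1]  # layer 1 level 2, layer 2 level 3, layer 3 level 2
```

Each index position corresponds to one layer, so fixing an index fixes one experimental factor and leaving it empty traverses all of its levels.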

Batch data processing through array or tensor

Because the regularized double list, the vector binary tree and the tensor (or array) map onto one another uniquely, a regularized double list is needed to set indices correctly:

regdl <- vbt2dl(vbt)
print(regdl)
#> [[1]]
#> [1] "Strain" "Stress"
#> 
#> [[2]]
#> [1] "900"  "950"  "1000" "1050" "1100" "1150" "1200"
#> 
#> [[3]]
#> [1] "0.001" "0.01"  "0.1"   "1"    
#> 
#> [[4]]
#> [1] "0.6"
#> 
#> attr(,"class")
#> [1] "Double.List"

It can be seen that the temperatures are saved in layer 2 with 7 levels, while the strain rates are saved in layer 3 with 4 levels. Array methods are now available. For example, to select the ‘Stress’ data (1st layer, 2nd level) across all temperature conditions at a fixed strain rate of 0.01 (3rd layer, 2nd level), execute the following code:

subset1 <- datatest[, arr[2,,2,1]]
head(subset1)
#>   Stress-900-0.01-0.6 Stress-950-0.01-0.6 Stress-1000-0.01-0.6
#> 1                3.37                2.39                 3.65
#> 2                3.79                2.81                 3.65
#> 3                4.07                2.67                 3.65
#> 4                4.49                2.95                 3.65
#> 5                5.05                3.23                 3.93
#> 6                5.61                3.37                 3.93
#>   Stress-1050-0.01-0.6 Stress-1100-0.01-0.6 Stress-1150-0.01-0.6
#> 1                2.670                0.000                3.091
#> 2                2.669                2.951                3.231
#> 3                2.949                3.370                3.370
#> 4                3.229                3.369                3.509
#> 5                3.509                3.508                3.788
#> 6                3.509                3.367                4.067
#>   Stress-1200-0.01-0.6
#> 1                3.231
#> 2                3.511
#> 3                3.650
#> 4                3.790
#> 5                4.069
#> 6                4.208
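
Counting level positions by eye becomes error-prone as layers grow. One possible way to look positions up by value is sketched below in base R. The helper `level_index` is hypothetical and not part of VBTree, and `regdl_demo` is a plain list copied from the printed regularized double list; the sketch assumes a double list can be indexed like an ordinary list.

```r
# Levels per layer, copied from the regularized double list printed above.
regdl_demo <- list(c("Strain", "Stress"),
                   c("900", "950", "1000", "1050", "1100", "1150", "1200"),
                   c("0.001", "0.01", "0.1", "1"),
                   "0.6")

# Hypothetical helper: positions of one or more levels within a given layer.
# Stops with an error if any requested level is absent.
level_index <- function(dl, layer, values) {
  idx <- match(values, dl[[layer]])
  if (anyNA(idx)) stop("level not found in layer ", layer)
  idx
}

level_index(regdl_demo, 1, "Stress")           # position of "Stress" in layer 1
level_index(regdl_demo, 3, "0.01")             # position of "0.01" in layer 3
level_index(regdl_demo, 2, c("1000", "1200"))  # several levels at once
```

With such a helper, the indices 2 and 2 used in the subset above could be derived from the level names "Stress" and "0.01" instead of being typed in directly.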

To automatically draw the Stress-Strain plots at a fixed temperature (1050, for example: 2nd layer, 4th level) across all strain rate conditions, try the following code:

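xbatch <- arr[1,4,,1] # Strain column names at T=1050, all strain rates
ybatch <- arr[2,4,,1] # Stress column names at T=1050, all strain rates
regdl <- arr2dl(arr)

# plot() draws as a side effect, so its return value need not be stored
for (i in seq_along(xbatch)) {
  plot(datatest[,xbatch[i]], datatest[,ybatch[i]], xlab="Strain", ylab="Stress", main=paste("in T=1050, SR=",regdl[[3]][i], sep = ""))
}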

Tensors are handled in the same way as arrays.

Advanced batch data processing through vector binary tree

If a highly customized selection of conditions is needed, the vector binary tree becomes useful. For example, suppose I need the Stress-Strain plots for temperatures from 1000 to 1200 at only two strain rates, 0.01 and 1. The vector binary tree supports queries through a handmade double list, which can be customized freely. First, let us look at the full vector binary tree:

print(vbt)
#> $tree
#> $tree[[1]]
#> [1] "Strain" "Stress"
#> 
#> $tree[[2]]
#> $tree[[2]][[1]]
#> [1] "900"  "950"  "1000" "1050" "1100" "1150" "1200"
#> 
#> $tree[[2]][[2]]
#> $tree[[2]][[2]][[1]]
#> [1] "0.001" "0.01"  "0.1"   "1"    
#> 
#> $tree[[2]][[2]][[2]]
#> $tree[[2]][[2]][[2]][[1]]
#> [1] "0.6"
#> 
#> $tree[[2]][[2]][[2]][[2]]
#> list()
#> 
#> 
#> 
#> 
#> 
#> $dims
#> [1] 2 7 4 1
#> 
#> attr(,"class")
#> [1] "Vector.Binary.Tree"

The desired elements are the 3rd to 7th in layer 2 and the 2nd and 4th in layer 3. We can make two double lists to specify and extract the desired Stress and Strain subsets. The demonstration is:

subStrain_dl <- list(1, c(3:7), c(2,4), 1)
subStress_dl <- list(2, c(3:7), c(2,4), 1)
# query the original vector binary tree
# and save the results as new double lists:
subStrain_dl2 <- advbtinq(vbt, subStrain_dl) 
subStress_dl2 <- advbtinq(vbt, subStress_dl)
print(subStrain_dl2)
#> [[1]]
#> [1] "Strain"
#> 
#> [[2]]
#> [1] "1000" "1050" "1100" "1150" "1200"
#> 
#> [[3]]
#> [1] "0.01" "1"   
#> 
#> [[4]]
#> [1] "0.6"
#> 
#> attr(,"class")
#> [1] "Double.List"
print(subStress_dl2)
#> [[1]]
#> [1] "Stress"
#> 
#> [[2]]
#> [1] "1000" "1050" "1100" "1150" "1200"
#> 
#> [[3]]
#> [1] "0.01" "1"   
#> 
#> [[4]]
#> [1] "0.6"
#> 
#> attr(,"class")
#> [1] "Double.List"
xbatch2 <- as.vector(dl2arr(subStrain_dl2))
ybatch2 <- as.vector(dl2arr(subStress_dl2))
print(xbatch2)
#>  [1] "Strain-1000-0.01-0.6" "Strain-1050-0.01-0.6" "Strain-1100-0.01-0.6"
#>  [4] "Strain-1150-0.01-0.6" "Strain-1200-0.01-0.6" "Strain-1000-1-0.6"   
#>  [7] "Strain-1050-1-0.6"    "Strain-1100-1-0.6"    "Strain-1150-1-0.6"   
#> [10] "Strain-1200-1-0.6"
print(ybatch2)
#>  [1] "Stress-1000-0.01-0.6" "Stress-1050-0.01-0.6" "Stress-1100-0.01-0.6"
#>  [4] "Stress-1150-0.01-0.6" "Stress-1200-0.01-0.6" "Stress-1000-1-0.6"   
#>  [7] "Stress-1050-1-0.6"    "Stress-1100-1-0.6"    "Stress-1150-1-0.6"   
#> [10] "Stress-1200-1-0.6"
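
Since the plotting step pairs the two vectors element by element, it can be worth confirming the alignment programmatically rather than by eye. A small base R check is sketched here; `x_demo` and `y_demo` are hypothetical copies of the printed name vectors above, typed in by hand so that the sketch runs on its own.

```r
# Name vectors copied from the printed output above.
x_demo <- c("Strain-1000-0.01-0.6", "Strain-1050-0.01-0.6", "Strain-1100-0.01-0.6",
            "Strain-1150-0.01-0.6", "Strain-1200-0.01-0.6", "Strain-1000-1-0.6",
            "Strain-1050-1-0.6",    "Strain-1100-1-0.6",    "Strain-1150-1-0.6",
            "Strain-1200-1-0.6")
y_demo <- c("Stress-1000-0.01-0.6", "Stress-1050-0.01-0.6", "Stress-1100-0.01-0.6",
            "Stress-1150-0.01-0.6", "Stress-1200-0.01-0.6", "Stress-1000-1-0.6",
            "Stress-1050-1-0.6",    "Stress-1100-1-0.6",    "Stress-1150-1-0.6",
            "Stress-1200-1-0.6")

# Drop the leading data-type factor; what remains identifies the condition.
cond_x <- sub("^Strain-", "", x_demo)
cond_y <- sub("^Stress-", "", y_demo)

# TRUE only if every x name is paired with the y name of the same condition.
aligned <- identical(cond_x, cond_y)
aligned
```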

Their respective orders match perfectly. The next step is similar to what we did in the previous section:

# plot() draws as a side effect, so its return value need not be stored
for (i in seq_along(xbatch2)) {
  plot(datatest[, xbatch2[i]], datatest[, ybatch2[i]], xlab="Strain", ylab="Stress", main=ybatch2[i])
}

Advantage of VBTree

It is commonly said that R is relatively slow compared with other popular programming languages, especially when data operations such as melt and reshape are frequent. In my opinion, an efficient logic for data management matters more than clever tricks in data treatment. None of the demos above performs any melt, bind or reshape operation on the original data, yet batch data processing can still be implemented.

Let’s check the sizes of all the objects we used:

# For original data:
object.size(datatest)
#> 31072 bytes
# For tensor and array:
object.size(ts)
#> 6400 bytes
object.size(arr)
#> 5568 bytes
# For vector binary tree:
object.size(vbt)
#> 2080 bytes
# For double list:
object.size(regdl)
#> 1408 bytes

The datatest packaged in VBTree contains only the first 50 rows, for demonstration purposes. The full data set has far more than 50 rows. All of these data can be managed in a structured way through VBTree, using an object as small as 1408 bytes.
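
The point that the index structure does not grow with the data can be illustrated with a base R sketch on synthetic data. Everything here is an assumption made for illustration: the data frames are zero-filled stand-ins rather than the real measurements, and `lv`, `arr_demo` and `make_df` are hypothetical names; the name array is rebuilt by hand instead of through VBTree.

```r
# Levels per layer, copied from the double list shown earlier.
lv <- list(c("Strain", "Stress"),
           c("900", "950", "1000", "1050", "1100", "1150", "1200"),
           c("0.001", "0.01", "0.1", "1"),
           "0.6")
names_flat <- do.call(paste, c(expand.grid(lv, stringsAsFactors = FALSE), sep = "-"))
arr_demo <- array(names_flat, dim = lengths(lv))

# Hypothetical helper: a zero-filled data frame with n rows and the 56 columns.
make_df <- function(n) {
  df <- as.data.frame(matrix(0, nrow = n, ncol = length(names_flat)))
  colnames(df) <- names_flat
  df
}
small <- make_df(50)
large <- make_df(5000)

# The data grows with the number of rows; the name array is unchanged
# and indexes both data frames equally well.
object.size(small) < object.size(large)
sub_large <- large[, arr_demo[2, , 2, 1]]
dim(sub_large)
```

The same name array serves a 50-row and a 5000-row table, because it stores only column names and never a copy of the data.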