collapse and dplyr

Fast (Weighted) Aggregations and Transformations in a Piped Workflow

Sebastian Krantz

2020-08-27

collapse is a C/C++ based package for data transformation and statistical computing in R. Its aims are:

  1. To facilitate complex data transformation, exploration and computing tasks in R.
  2. To help make R code fast, flexible, parsimonious and programmer friendly.

This vignette focuses on the integration of collapse and the popular dplyr package by Hadley Wickham. In particular it will demonstrate how using collapse’s fast functions and some fast alternatives for dplyr verbs can substantially facilitate and speed up basic data manipulation, grouped and weighted aggregations and transformations, and panel data computations (i.e. between- and within-transformations, panel-lags, differences and growth rates) in a dplyr (piped) workflow.


Notes:


1. Fast Aggregations

A key feature of collapse is its broad set of Fast Statistical Functions (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fNobs, fNdistinct), which are able to substantially speed up column-wise, grouped and weighted computations on vectors, matrices or data frames. The functions are S3 generic, with a default (vector), matrix and data frame method, as well as a grouped_df method for grouped tibbles used by dplyr. The grouped tibble method has the following arguments:

FUN.grouped_df(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
               use.g.names = FALSE, keep.group_vars = TRUE, [keep.w = TRUE,] ...)

where w is a weight variable, and TRA can be used to transform x using the computed statistics and one of 10 available transformations ("replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%", discussed in section 2). na.rm efficiently removes missing values and is TRUE by default. use.g.names generates new row-names from the unique combinations of groups (default: disabled), whereas keep.group_vars (default: enabled) keeps the grouping columns, as is customary in the native data %>% group_by(...) %>% summarize(...) workflow in dplyr. Finally, keep.w regulates whether a weighting variable is also aggregated and saved in a column. For fsum, fmean, fmedian, fnth, fvar, fsd and fmode this will compute the sum of the weights in each group, whereas fprod returns the product of the weights.
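
As a minimal illustrative sketch of these arguments (assuming collapse and dplyr, loaded below, are attached; output omitted), using the base mtcars data:

mtcars %>% group_by(cyl) %>% fmean(wt)        # weighted group means; the weight column wt is summed and kept
mtcars %>% group_by(cyl) %>% fmean(TRA = "-") # center the data by subtracting group means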

With that in mind, let’s consider some straightforward applications.

1.1 Simple Aggregations

Consider the Groningen Growth and Development Center 10-Sector Database included in collapse and introduced in the main vignette:

library(collapse)
head(GGDC10S)
#   Country Regioncode             Region Variable Year      AGR      MIN       MAN        PU
# 1     BWA        SSA Sub-saharan Africa       VA 1960       NA       NA        NA        NA
# 2     BWA        SSA Sub-saharan Africa       VA 1961       NA       NA        NA        NA
# 3     BWA        SSA Sub-saharan Africa       VA 1962       NA       NA        NA        NA
# 4     BWA        SSA Sub-saharan Africa       VA 1963       NA       NA        NA        NA
# 5     BWA        SSA Sub-saharan Africa       VA 1964 16.30154 3.494075 0.7365696 0.1043936
# 6     BWA        SSA Sub-saharan Africa       VA 1965 15.72700 2.495768 1.0181992 0.1350976
#         CON      WRT      TRA     FIRE      GOV      OTH      SUM
# 1        NA       NA       NA       NA       NA       NA       NA
# 2        NA       NA       NA       NA       NA       NA       NA
# 3        NA       NA       NA       NA       NA       NA       NA
# 4        NA       NA       NA       NA       NA       NA       NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710

# Summarize the Data: 
# descr(GGDC10S, cols = is.categorical)
# aperm(qsu(GGDC10S, ~Variable, cols = is.numeric))

Simple column-wise computations using the fast functions and pipe operators are performed as follows:

library(dplyr)

GGDC10S %>% fNobs                       # Number of Observations
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#       5027       5027       5027       5027       5027       4364       4355       4355       4354 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#       4355       4355       4355       4355       3482       4248       4364
GGDC10S %>% fNdistinct                  # Number of distinct values
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#         43          6          6          2         67       4353       4224       4353       4237 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#       4339       4344       4334       4349       3470       4238       4364
GGDC10S %>% select_at(6:16) %>% fmedian # Median
#        AGR        MIN        MAN         PU        CON        WRT        TRA       FIRE        GOV 
#  4394.5194   173.2234  3718.0981   167.9500  1473.4470  3773.6430  1174.8000   960.1251  3928.5127 
#        OTH        SUM 
#  1433.1722 23186.1936
GGDC10S %>% select_at(6:16) %>% fmean   # Mean
#        AGR        MIN        MAN         PU        CON        WRT        TRA       FIRE        GOV 
#  2526696.5  1867908.9  5538491.4   335679.5  1801597.6  3392909.5  1473269.7  1657114.8  1712300.3 
#        OTH        SUM 
#  1684527.3 21566436.8
GGDC10S %>% fmode                       # Mode
#            Country         Regioncode             Region           Variable               Year 
#              "USA"              "ASI"             "Asia"              "EMP"             "2010" 
#                AGR                MIN                MAN                 PU                CON 
# "171.315882316326"                "0" "4645.12507642586"                "0" "1.34623115930777" 
#                WRT                TRA               FIRE                GOV                OTH 
# "21.8380052682527" "8.97743416914571" "40.0701608636442"                "0" "3626.84423577048" 
#                SUM 
# "37.4822945751317"
GGDC10S %>% fmode(drop = FALSE)         # Keep data structure intact
#   Country Regioncode Region Variable Year      AGR MIN      MAN PU      CON      WRT      TRA
# 1     USA        ASI   Asia      EMP 2010 171.3159   0 4645.125  0 1.346231 21.83801 8.977434
#       FIRE GOV      OTH      SUM
# 1 40.07016   0 3626.844 37.48229

Moving on to grouped statistics, we can compute the average value added and employment by sector and country using:

GGDC10S %>% 
  group_by(Variable, Country) %>%
  select_at(6:16) %>% fmean
# # A tibble: 85 x 13
#    Variable Country     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       1420.   52.1   1932.  1.02e2 7.42e2 1.98e3 6.49e2  628.   2043.  9.92e2 1.05e4
#  2 EMP      BOL        964.   56.0    235.  5.35e0 1.23e2 2.82e2 1.15e2   44.6    NA   3.96e2 2.22e3
#  3 EMP      BRA      17191.  206.    6991.  3.65e2 3.52e3 8.51e3 2.05e3 4414.   5307.  5.71e3 5.43e4
#  4 EMP      BWA        188.   10.5     18.1 3.09e0 2.53e1 3.63e1 8.36e0   15.3    61.1 2.76e1 3.94e2
#  5 EMP      CHL        702.  101.     625.  2.94e1 2.96e2 6.95e2 2.58e2  272.     NA   1.00e3 3.98e3
#  6 EMP      CHN     287744. 7050.   67144.  1.61e3 2.09e4 2.89e4 1.39e4 4929.  22669.  3.10e4 4.86e5
#  7 EMP      COL       3091.  145.    1175.  3.39e1 5.24e2 2.07e3 4.70e2  649.     NA   1.73e3 9.89e3
#  8 EMP      CRI        231.    1.70   136.  1.43e1 5.76e1 1.57e2 4.24e1   54.9   128.  6.51e1 8.87e2
#  9 EMP      DEW       2490.  407.    8473.  2.26e2 2.09e3 4.44e3 1.48e3 1689.   3945.  9.99e2 2.62e4
# 10 EMP      DNK        236.    8.03   507.  1.38e1 1.71e2 4.55e2 1.61e2  181.    549.  1.11e2 2.39e3
# # ... with 75 more rows

Similarly, we can aggregate using any of the other functions above.

It is important not to use dplyr’s summarize together with these functions, since that would eliminate their speed gain. These functions are fast because they are executed only once and carry out the grouped computations in C++, whereas summarize applies the function to each group in the grouped tibble.
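
To make the contrast concrete, a minimal sketch (output omitted): the first call dispatches fmean once on the grouped tibble, the second applies it separately to every group and column via summarise_all.

GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean                 # fast: one grouped C++ computation
GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% summarise_all(fmean)  # slow: fmean called per group and column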


Excursus: What is Happening Behind the Scenes?

To better explain this point it is perhaps good to shed some light on what is happening behind the scenes of dplyr and collapse. Fundamentally both packages follow different computing paradigms:

dplyr is an efficient implementation of the Split-Apply-Combine computing paradigm. Data is split into groups, these data-chunks are then passed to a function carrying out the computation, and finally recombined to produce the aggregated data.frame. This modus operandi is evident in the grouping mechanism of dplyr. When a data.frame is passed through group_by, a ‘groups’ attribute is attached:

GGDC10S %>% group_by(Variable, Country) %>% attr("groups")
# # A tibble: 85 x 3
#    Variable Country       .rows
#  * <chr>    <chr>   <list<int>>
#  1 EMP      ARG            [62]
#  2 EMP      BOL            [61]
#  3 EMP      BRA            [62]
#  4 EMP      BWA            [52]
#  5 EMP      CHL            [63]
#  6 EMP      CHN            [62]
#  7 EMP      COL            [61]
#  8 EMP      CRI            [62]
#  9 EMP      DEW            [61]
# 10 EMP      DNK            [64]
# # ... with 75 more rows

This object is a data.frame giving the unique groups and, in the third (last) column, vectors containing the indices of the rows belonging to each group. A command like summarize uses this information to split the data.frame into groups which are then passed sequentially to the function used and later recombined. These steps are also done in C++, which makes dplyr quite efficient.
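
As a rough base R sketch of this paradigm (only illustrative; dplyr's actual implementation is in C++): split a column into group chunks, apply the function to each chunk, and recombine.

# Split-apply-combine on a single column using base R
chunks <- split(GGDC10S$AGR, list(GGDC10S$Variable, GGDC10S$Country), drop = TRUE)
head(unlist(lapply(chunks, median, na.rm = TRUE)))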

Now collapse is based around one-pass grouped computations at the C++ level using its own grouped statistical functions. In other words, the data is not split and recombined at all; instead the entire computation is performed in a single C++ loop that runs through the data and completes the computations for all groups simultaneously. This modus operandi is also evident in collapse grouping objects. The method GRP.grouped_df takes a dplyr grouping object from a grouped tibble and efficiently converts it to a collapse grouping object:

GGDC10S %>% group_by(Variable, Country) %>% GRP %>% str
# List of 8
#  $ N.groups   : int 85
#  $ group.id   : int [1:5027] 46 46 46 46 46 46 46 46 46 46 ...
#  $ group.sizes: int [1:85] 62 61 62 52 63 62 61 62 61 64 ...
#  $ groups     :List of 2
#   ..$ Variable: chr [1:85] "EMP" "EMP" "EMP" "EMP" ...
#   .. ..- attr(*, "label")= chr "Variable"
#   .. ..- attr(*, "format.stata")= chr "%9s"
#   ..$ Country : chr [1:85] "ARG" "BOL" "BRA" "BWA" ...
#   .. ..- attr(*, "label")= chr "Country"
#   .. ..- attr(*, "format.stata")= chr "%9s"
#  $ group.vars : chr [1:2] "Variable" "Country"
#  $ ordered    : logi [1:2] TRUE TRUE
#  $ order      : NULL
#  $ call       : language GRP.grouped_df(X = .)
#  - attr(*, "class")= chr "GRP"

This object is a list where the first three elements give the number of groups, the group-id to which each row belongs and a vector of group-sizes. A function like fsum uses this information to (for each column) create a result vector of size ‘N.groups’ and then run through the column, using the ‘group.id’ vector to add the i’th data point to the group.id[i]’th element of the result vector. When the loop is finished, the grouped computation is also finished.
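
A rough R sketch of this single-pass logic (only illustrative; the real loop is written in C++ and also handles missing values, weights, etc.):

grouped_sum <- function(x, group.id, N.groups) {
  out <- numeric(N.groups)                      # one result slot per group
  for (i in seq_along(x))                       # single pass through the data
    out[group.id[i]] <- out[group.id[i]] + x[i]
  out
}
g <- GGDC10S %>% group_by(Variable, Country) %>% GRP
head(grouped_sum(replace(GGDC10S$SUM, is.na(GGDC10S$SUM), 0), g$group.id, g$N.groups))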

It is thus clear that collapse is faster than dplyr, since its method of computing involves fewer steps and it does not need to call statistical functions multiple times. See the benchmark section.


1.2 More Speed using collapse Verbs

collapse fast functions do not reach their maximal performance on a grouped tibble created with group_by because of the additional cost of converting the grouping object incurred by GRP.grouped_df. This cost is already minimized through the use of C++, but we can do even better by replacing group_by with collapse::fgroup_by. fgroup_by works like group_by but does the grouping with collapse::GRP (up to 10x faster than group_by) and simply attaches a collapse grouping object to the grouped_df. Thus the speed gain is two-fold: faster grouping and no conversion cost when calling collapse functions.

Another improvement comes from replacing the dplyr verb select with collapse::fselect and, for selection using column names, indices or functions, using collapse::get_vars instead of select_at or select_if. Next to get_vars, collapse also introduces the functions num_vars, cat_vars, char_vars, fact_vars, logi_vars and Date_vars to efficiently select columns by type.

GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian
# # A tibble: 85 x 13
#    Variable Country     AGR     MIN     MAN     PU    CON    WRT    TRA   FIRE     GOV    OTH    SUM
#    <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
#  1 EMP      ARG       1325.   47.4   1988.  1.05e2 7.82e2 1.85e3 5.80e2  464.   1739.   866.  9.74e3
#  2 EMP      BOL        943.   53.5    167.  4.46e0 6.60e1 1.32e2 9.70e1   15.3    NA    384.  1.84e3
#  3 EMP      BRA      17481.  225.    7208.  3.76e2 4.05e3 6.45e3 1.58e3 4355.   4450.  4479.  5.19e4
#  4 EMP      BWA        175.   12.2     13.1 3.71e0 1.90e1 2.11e1 6.75e0   10.4    53.8   31.2 3.61e2
#  5 EMP      CHL        690.   93.9    607.  2.58e1 2.30e2 4.84e2 2.05e2  106.     NA    900.  3.31e3
#  6 EMP      CHN     293915  8150.   61761.  1.14e3 1.06e4 1.70e4 9.56e3 4328.  19468.  9954.  4.45e5
#  7 EMP      COL       3006.   84.0   1033.  3.71e1 4.19e2 1.55e3 3.91e2  655.     NA   1430.  8.63e3
#  8 EMP      CRI        216.    1.49   114.  7.92e0 5.50e1 8.98e1 2.55e1   19.6   122.    60.6 7.19e2
#  9 EMP      DEW       2178   320.    8459.  2.47e2 2.10e3 4.45e3 1.53e3 1656    3700    900   2.65e4
# 10 EMP      DNK        187.    3.75   508.  1.36e1 1.65e2 4.61e2 1.61e2  169.    642.   104.  2.42e3
# # ... with 75 more rows

library(microbenchmark)
microbenchmark(collapse = GGDC10S %>% fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fmedian,
               hybrid = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmedian,
               dplyr = GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% summarise_all(median, na.rm = TRUE))
# Unit: microseconds
#      expr       min       lq      mean    median       uq       max neval cld
#  collapse   945.152  1038.64  1087.736  1062.961  1123.65  1395.861   100 a  
#    hybrid 12194.148 12581.71 13352.293 12830.273 13593.58 30239.488   100  b 
#     dplyr 59320.071 61269.95 64786.978 63674.774 67513.17 87006.580   100   c

Benchmarks on the different components of this code and with larger data are provided under ‘Benchmarks’. Note that a grouped tibble created with fgroup_by can no longer be used for grouped computations with dplyr verbs like mutate or summarize. To avoid errors with these functions and print.grouped_df, [.grouped_df etc., the classes assigned after fgroup_by are reshuffled, so that the data.frame is treated by the dplyr ecosystem like a normal tibble:

class(group_by(GGDC10S, Variable, Country))
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

class(fgroup_by(GGDC10S, Variable, Country))
# [1] "tbl_df"     "tbl"        "grouped_df" "data.frame"

Also note that fselect and get_vars are not full drop-in replacements for select because they do not have a grouped_df method:

GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% head(3)
# # A tibble: 3 x 13
# # Groups:   Variable, Country [1]
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 VA       BWA        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% head(3)
# # A tibble: 3 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Since by default keep.group_vars = TRUE in the Fast Statistical Functions, the end result is nevertheless the same:

GGDC10S %>% group_by(Variable, Country) %>% select_at(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
#   Variable Country    AGR   MIN   MAN     PU   CON   WRT   TRA   FIRE   GOV   OTH    SUM
#   <chr>    <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
# 1 EMP      ARG      1420.  52.1 1932. 102.    742. 1982.  649.  628.  2043.  992. 10542.
# 2 EMP      BOL       964.  56.0  235.   5.35  123.  282.  115.   44.6   NA   396.  2221.
# 3 EMP      BRA     17191. 206.  6991. 365.   3525. 8509. 2054. 4414.  5307. 5710. 54273.
GGDC10S %>% group_by(Variable, Country) %>% get_vars(6:16) %>% fmean %>% head(3)
# # A tibble: 3 x 13
#   Variable Country    AGR   MIN   MAN     PU   CON   WRT   TRA   FIRE   GOV   OTH    SUM
#   <chr>    <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>
# 1 EMP      ARG      1420.  52.1 1932. 102.    742. 1982.  649.  628.  2043.  992. 10542.
# 2 EMP      BOL       964.  56.0  235.   5.35  123.  282.  115.   44.6   NA   396.  2221.
# 3 EMP      BRA     17191. 206.  6991. 365.   3525. 8509. 2054. 4414.  5307. 5710. 54273.

Another useful verb introduced by collapse is fgroup_vars, which can be used to efficiently obtain the grouping columns or grouping variables from a grouped tibble:

# fgroup_vars fully supports grouped tibbles created with group_by or fgroup_by: 
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
#   Variable Country
#   <chr>    <chr>  
# 1 VA       BWA    
# 2 VA       BWA    
# 3 VA       BWA
GGDC10S %>% fgroup_by(Variable, Country) %>% fgroup_vars %>% head(3)
# # A tibble: 3 x 2
#   Variable Country
#   <chr>    <chr>  
# 1 VA       BWA    
# 2 VA       BWA    
# 3 VA       BWA

# The other possibilities:
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("unique") %>% head(3)
# # A tibble: 3 x 2
#   Variable Country
#   <chr>    <chr>  
# 1 EMP      ARG    
# 2 EMP      BOL    
# 3 EMP      BRA
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("names")
# [1] "Variable" "Country"
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("indices")
# [1] 4 1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_indices")
# Variable  Country 
#        4        1
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("logical")
#  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
GGDC10S %>% group_by(Variable, Country) %>% fgroup_vars("named_logical")
#    Country Regioncode     Region   Variable       Year        AGR        MIN        MAN         PU 
#       TRUE      FALSE      FALSE       TRUE      FALSE      FALSE      FALSE      FALSE      FALSE 
#        CON        WRT        TRA       FIRE        GOV        OTH        SUM 
#      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE      FALSE

Another collapse verb to mention here is fsubset, a faster alternative to dplyr::filter, which also allows columns to be flexibly selected after the subset condition:

# Two equivalent calls, the first is substantially faster
GGDC10S %>% fsubset(Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(3)
#   Country Year      AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
# 1     BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2     BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3     BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193

GGDC10S %>% filter(Variable == "VA" & Year > 1990) %>% select(Country, Year, AGR:GOV) %>% head(3)
#   Country Year      AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
# 1     BWA 1991 303.1157 2646.950 472.6488 160.6079 580.0876 806.7509 232.7884 432.6965 1073.263
# 2     BWA 1992 333.4364 2690.939 537.4274 178.4532 678.7320 725.2577 285.1403 517.2141 1234.012
# 3     BWA 1993 404.5488 2624.928 567.3420 219.2183 634.2797 771.8253 349.7458 673.2540 1487.193

collapse also offers roworder, frename, colorder, ftransform/TRA and fungroup as fast replacements for dplyr::arrange, dplyr::rename, dplyr::relocate, dplyr::mutate, and dplyr::ungroup.
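
A small sketch combining a few of these verbs (only illustrative, output omitted):

GGDC10S %>%
  fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
  roworder(Country, -Year) %>%            # sort by Country, then descending Year
  colorder(Year, Country) %>%             # move Year before Country
  ftransform(AGR_share = AGR / SUM) %>%   # add a computed column
  head(3)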

1.3 Multi-Function Aggregations

One can also aggregate with multiple functions at the same time. For such operations it is often necessary to use curly braces { to prevent first argument injection so that %>% cbind(FUN1(.), FUN2(.)) does not evaluate as %>% cbind(., FUN1(.), FUN2(.)):

GGDC10S %>%
  fgroup_by(Variable, Country) %>%
  get_vars(6:16) %>% {
    cbind(fmedian(.),
          add_stub(fmean(., keep.group_vars = FALSE), "mean_"))
    } %>% head(3)
#   Variable Country        AGR       MIN       MAN         PU        CON      WRT        TRA
# 1      EMP     ARG  1324.5255  47.35255 1987.5912 104.738825  782.40283 1854.612  579.93982
# 2      EMP     BOL   943.1612  53.53538  167.1502   4.457895   65.97904  132.225   96.96828
# 3      EMP     BRA 17480.9810 225.43693 7207.7915 375.851832 4054.66103 6454.523 1580.81120
#         FIRE      GOV       OTH       SUM   mean_AGR  mean_MIN  mean_MAN    mean_PU  mean_CON
# 1  464.39920 1738.836  866.1119  9743.223  1419.8013  52.08903 1931.7602 101.720936  742.4044
# 2   15.34259       NA  384.0678  1842.055   964.2103  56.03295  235.0332   5.346433  122.7827
# 3 4354.86210 4449.942 4478.6927 51881.110 17191.3529 206.02389 6991.3710 364.573404 3524.7384
#    mean_WRT  mean_TRA  mean_FIRE mean_GOV  mean_OTH  mean_SUM
# 1 1982.1775  648.5119  627.79291 2043.471  992.4475 10542.177
# 2  281.5164  115.4728   44.56442       NA  395.5650  2220.524
# 3 8509.4612 2054.3731 4413.54448 5307.280 5710.2665 54272.985

The function add_stub used above is a collapse function adding a prefix (default) or suffix to variable names. The collapse function add_vars provides a more efficient alternative to cbind.data.frame. The idea here is ‘adding’ variables to the data.frame in the first argument, i.e. the attributes of the first argument are preserved, so the expression below still gives a tibble instead of a data.frame:

GGDC10S %>%
  fgroup_by(Variable, Country) %>% {
   add_vars(get_vars(., "Reg", regex = TRUE) %>% ffirst, # Regular expression matching column names
            num_vars(.) %>% fmean(keep.group_vars = FALSE) %>% add_stub("mean_"), # num_vars selects all numeric variables
            fselect(., PU:TRA) %>% fmedian(keep.group_vars = FALSE) %>% add_stub("median_"), 
            fselect(., PU:CON) %>% fmin(keep.group_vars = FALSE) %>% add_stub("min_"))      
  } %>% head(3)
# # A tibble: 3 x 22
#   Variable Country Regioncode Region mean_Year mean_AGR mean_MIN mean_MAN mean_PU mean_CON mean_WRT
#   <chr>    <chr>   <chr>      <chr>      <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
# 1 EMP      ARG     LAM        Latin~     1980.    1420.     52.1    1932.  102.       742.    1982.
# 2 EMP      BOL     LAM        Latin~     1980      964.     56.0     235.    5.35     123.     282.
# 3 EMP      BRA     LAM        Latin~     1980.   17191.    206.     6991.  365.      3525.    8509.
# # ... with 11 more variables: mean_TRA <dbl>, mean_FIRE <dbl>, mean_GOV <dbl>, mean_OTH <dbl>,
# #   mean_SUM <dbl>, median_PU <dbl>, median_CON <dbl>, median_WRT <dbl>, median_TRA <dbl>,
# #   min_PU <dbl>, min_CON <dbl>

Another nice feature of add_vars is that it can also very efficiently reorder columns i.e. bind columns in a different order than they are passed. This can be done by simply specifying the positions the added columns should have in the final data frame, and then add_vars shifts the first argument columns to the right to fill in the gaps.

GGDC10S %>%
  fsubset(Variable == "VA", Country, AGR, SUM) %>% 
  fgroup_by(Country) %>% {
   add_vars(fgroup_vars(.,"unique"),
            fmean(., keep.group_vars = FALSE) %>% add_stub("mean_"),
            fsd(., keep.group_vars = FALSE) %>% add_stub("sd_"), 
            pos = c(2,4,3,5))
  } %>% head(3)
#   Country  mean_AGR    sd_AGR   mean_SUM    sd_SUM
# 1     ARG 14951.292 33061.413  152533.84 301316.25
# 2     BOL  3299.718  4456.331   22619.18  33172.98
# 3     BRA 76870.146 59441.696 1200562.67 976963.14

A much more compact solution to multi-function and multi-type aggregation is offered by the function collapg:

# This aggregates numeric columns using the mean (fmean) and categorical columns using the mode (fmode)
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg %>% head(3)
# # A tibble: 3 x 16
#   Variable Country Regioncode Region  Year    AGR   MIN   MAN     PU   CON   WRT   TRA   FIRE   GOV
#   <chr>    <chr>   <chr>      <chr>  <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
# 1 EMP      ARG     LAM        Latin~ 1980.  1420.  52.1 1932. 102.    742. 1982.  649.  628.  2043.
# 2 EMP      BOL     LAM        Latin~ 1980    964.  56.0  235.   5.35  123.  282.  115.   44.6   NA 
# 3 EMP      BRA     LAM        Latin~ 1980. 17191. 206.  6991. 365.   3525. 8509. 2054. 4414.  5307.
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>

By default it aggregates numeric columns using fmean and categorical columns using fmode, and preserves the order of all columns. Changing these defaults is very easy:

# This aggregates numeric columns using the median (fmedian) and categorical columns using the last value (flast)
GGDC10S %>% fgroup_by(Variable, Country) %>% collapg(fmedian, flast) %>% head(3)
# # A tibble: 3 x 16
#   Variable Country Regioncode Region  Year    AGR   MIN   MAN     PU    CON   WRT    TRA   FIRE
#   <chr>    <chr>   <chr>      <chr>  <dbl>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
# 1 EMP      ARG     LAM        Latin~ 1980.  1325.  47.4 1988. 105.    782.  1855.  580.   464. 
# 2 EMP      BOL     LAM        Latin~ 1980    943.  53.5  167.   4.46   66.0  132.   97.0   15.3
# 3 EMP      BRA     LAM        Latin~ 1980. 17481. 225.  7208. 376.   4055.  6455. 1581.  4355. 
# # ... with 3 more variables: GOV <dbl>, OTH <dbl>, SUM <dbl>

One can apply multiple functions to both numeric and/or categorical data:

GGDC10S %>% fgroup_by(Variable, Country) %>%
  collapg(list(fmean, fmedian), list(first, fmode, flast)) %>% head(3)
# # A tibble: 3 x 32
#   Variable Country first.Regioncode fmode.Regioncode flast.Regioncode first.Region fmode.Region
#   <chr>    <chr>   <chr>            <chr>            <chr>            <chr>        <chr>       
# 1 EMP      ARG     LAM              LAM              LAM              Latin Ameri~ Latin Ameri~
# 2 EMP      BOL     LAM              LAM              LAM              Latin Ameri~ Latin Ameri~
# 3 EMP      BRA     LAM              LAM              LAM              Latin Ameri~ Latin Ameri~
# # ... with 25 more variables: flast.Region <chr>, fmean.Year <dbl>, fmedian.Year <dbl>,
# #   fmean.AGR <dbl>, fmedian.AGR <dbl>, fmean.MIN <dbl>, fmedian.MIN <dbl>, fmean.MAN <dbl>,
# #   fmedian.MAN <dbl>, fmean.PU <dbl>, fmedian.PU <dbl>, fmean.CON <dbl>, fmedian.CON <dbl>,
# #   fmean.WRT <dbl>, fmedian.WRT <dbl>, fmean.TRA <dbl>, fmedian.TRA <dbl>, fmean.FIRE <dbl>,
# #   fmedian.FIRE <dbl>, fmean.GOV <dbl>, fmedian.GOV <dbl>, fmean.OTH <dbl>, fmedian.OTH <dbl>,
# #   fmean.SUM <dbl>, fmedian.SUM <dbl>

Applying multiple functions to only numeric (or only categorical) data also allows the result to be returned in a long format:

GGDC10S %>% fgroup_by(Variable, Country) %>%
  collapg(list(fmean, fmedian), cols = is.numeric, return = "long") %>% head(3)
# # A tibble: 3 x 15
#   Function Variable Country  Year    AGR   MIN   MAN     PU   CON   WRT   TRA   FIRE   GOV   OTH
#   <chr>    <chr>    <chr>   <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
# 1 fmean    EMP      ARG     1980.  1420.  52.1 1932. 102.    742. 1982.  649.  628.  2043.  992.
# 2 fmean    EMP      BOL     1980    964.  56.0  235.   5.35  123.  282.  115.   44.6   NA   396.
# 3 fmean    EMP      BRA     1980. 17191. 206.  6991. 365.   3525. 8509. 2054. 4414.  5307. 5710.
# # ... with 1 more variable: SUM <dbl>

Finally, collapg also makes it very easy to apply aggregator functions to certain columns only:

GGDC10S %>% fgroup_by(Variable, Country) %>%
  collapg(custom = list(fmean = 6:8, fmedian = 10:12)) %>% head(3)
# # A tibble: 3 x 8
#   Variable Country fmean.AGR fmean.MIN fmean.MAN fmedian.CON fmedian.WRT fmedian.TRA
#   <chr>    <chr>       <dbl>     <dbl>     <dbl>       <dbl>       <dbl>       <dbl>
# 1 EMP      ARG         1420.      52.1     1932.       782.        1855.       580. 
# 2 EMP      BOL          964.      56.0      235.        66.0        132.        97.0
# 3 EMP      BRA        17191.     206.      6991.      4055.        6455.      1581.

To understand more about collapg, look it up in the documentation (?collapg).

1.4 Weighted Aggregations

Weighted aggregations are possible with the functions fsum, fprod, fmean, fmedian, fnth, fmode, fvar and fsd. The implementation is such that by default (option keep.w = TRUE) these functions also aggregate the weights, so that further weighted computations can be performed on the aggregated data. fprod saves the product of the weights, whereas the other functions save the sum of the weights in a column next to the grouping variables. If na.rm = TRUE (the default), rows with missing weights are omitted from the computation.

# This computes a frequency-weighted grouped standard-deviation, taking the total EMP / VA as weight
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
  fselect(AGR:SUM) %>% fsd(SUM) %>% head(3)
# # A tibble: 3 x 13
#   Variable Country  sum.SUM    AGR   MIN   MAN    PU   CON   WRT    TRA   FIRE   GOV   OTH
#   <chr>    <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>
# 1 EMP      ARG      653615.  225.   22.2  176. 20.5   285.  856.  195.   493.  1123.  506.
# 2 EMP      BOL      135452.   99.7  17.1  168.  4.87  123.  324.   98.1   69.8   NA   258.
# 3 EMP      BRA     3364925. 1587.   73.8 2952. 93.8  1861. 6285. 1306.  3003.  3621. 4257.

# This computes a weighted grouped mode, taking the total EMP / VA as weight
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
  fselect(AGR:SUM) %>% fmode(SUM) %>% head(3)
# # A tibble: 3 x 13
#   Variable Country  sum.SUM    AGR   MIN    MAN    PU   CON    WRT   TRA   FIRE    GOV    OTH
#   <chr>    <chr>      <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>
# 1 EMP      ARG      653615.  1162. 127.   2164. 152.  1415.  3768. 1060.  1748.  4336.  1999.
# 2 EMP      BOL      135452.   819.  37.6   604.  10.8  433.   893.  333.   321.    NA   1057.
# 3 EMP      BRA     3364925. 16451. 313.  11841. 388.  8154. 21860. 5169. 12011. 12149. 14235.

The weighted variance / standard deviation is currently only implemented with frequency weights.

Weighted aggregations may also be performed with collapg. By default fsum is used to compute a sum of the weights, but it is also possible here to aggregate the weights with other functions:

# This aggregates numeric columns using the weighted mean (the default) and categorical columns using the weighted mode (the default).
# Weights (column SUM) are aggregated using both the sum and the maximum. 
GGDC10S %>% group_by(Variable, Country) %>% 
  collapg(w = SUM, wFUN = list(fsum, fmax)) %>% head(3)
# # A tibble: 3 x 17
#   Variable Country fsum.SUM fmax.SUM Regioncode Region  Year    AGR   MIN   MAN     PU   CON    WRT
#   <chr>    <chr>      <dbl>    <dbl> <chr>      <chr>  <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
# 1 EMP      ARG      653615.   17929. LAM        Latin~ 1985.  1361.  56.5 1935. 105.    811.  2217.
# 2 EMP      BOL      135452.    4508. LAM        Latin~ 1987.   977.  57.9  296.   7.07  167.   400.
# 3 EMP      BRA     3364925.  102572. LAM        Latin~ 1989. 17746. 238.  8466. 389.   4436. 11376.
# # ... with 4 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>

2. Fast Transformations

collapse also provides some fast transformations that significantly extend the scope and speed of manipulations that can be performed with dplyr::mutate.

2.1 Fast Transform and Compute Variables

The function ftransform can be used to manipulate columns in the same ways as mutate:

GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
  ftransform(AGR_perc = AGR / SUM * 100,  # Computing % of VA in Agriculture
             AGR_mean = fmean(AGR),       # Average Agricultural VA
             AGR = NULL, SUM = NULL) %>%  # Deleting columns AGR and SUM
             head
#   Country Year AGR_perc AGR_mean
# 1     BWA 1960       NA  5137561
# 2     BWA 1961       NA  5137561
# 3     BWA 1962       NA  5137561
# 4     BWA 1963       NA  5137561
# 5     BWA 1964 43.49132  5137561
# 6     BWA 1965 39.96990  5137561

Instead of column = value type arguments, it is also possible to pass a single list of transformed variables to ftransform, which will be regarded in the same way as an evaluated list of column = value arguments:

# This replaces variables mpg, carb and wt by their log
mtcars %>% ftransform(fselect(., mpg, carb, wt) %>% lapply(log)) %>% head
#                        mpg cyl disp  hp drat        wt  qsec vs am gear      carb
# Mazda RX4         3.044522   6  160 110 3.90 0.9631743 16.46  0  1    4 1.3862944
# Mazda RX4 Wag     3.044522   6  160 110 3.90 1.0560527 17.02  0  1    4 1.3862944
# Datsun 710        3.126761   4  108  93 3.85 0.8415672 18.61  1  1    4 0.0000000
# Hornet 4 Drive    3.063391   6  258 110 3.08 1.1678274 19.44  1  0    3 0.0000000
# Hornet Sportabout 2.928524   8  360 175 3.15 1.2354715 17.02  0  0    3 0.6931472
# Valiant           2.895912   6  225 105 2.76 1.2412686 20.22  1  0    3 0.0000000

# This adds the log of mpg, carb and wt
mtcars %>% ftransform(fselect(., mpg, carb, wt) %>% lapply(log) %>% add_stub("log.")) %>% head
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  log.mpg  log.carb    log.wt
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522 1.3862944 0.9631743
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522 1.3862944 1.0560527
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761 0.0000000 0.8415672
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391 0.0000000 1.1678274
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524 0.6931472 1.2354715
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912 0.0000000 1.2412686

If only the computed columns need to be returned, fcompute provides an efficient alternative:

GGDC10S %>% fsubset(Variable == "VA", Country, Year, AGR, SUM) %>%
  fcompute(AGR_perc = AGR / SUM * 100,
           AGR_mean = fmean(AGR)) %>% head
#   AGR_perc AGR_mean
# 1       NA  5137561
# 2       NA  5137561
# 3       NA  5137561
# 4       NA  5137561
# 5 43.49132  5137561
# 6 39.96990  5137561

ftransform and fcompute are an order of magnitude faster than mutate, but they do not support grouped computations using arbitrary functions. We will see that this is hardly a limitation as collapse provides very efficient and elegant alternative programming mechanisms…

2.2 Replacing and Sweeping out Statistics

All statistical (scalar-valued) functions in the collapse package (fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fNobs, fNdistinct) have a TRA argument which can be used to efficiently transform data by either (column-wise) replacing data values with computed statistics or sweeping the statistics out of the data. Operations can be specified using either an integer or quoted operator / string. The 10 operations supported by TRA are:

  1. "replace_fill" - replace data, including missing values, with the computed statistics
  2. "replace" - replace data but preserve missing values
  3. "-" - subtract (e.g. center data on a statistic)
  4. "-+" - subtract group statistics, but add back the overall average statistic
  5. "/" - divide (e.g. scale data by a statistic)
  6. "%" - compute percentages (i.e. divide and multiply by 100)
  7. "+" - add
  8. "*" - multiply
  9. "%%" - modulus (remainder from division by the statistic)
  10. "-%%" - subtract modulus

Simple transformations are again straightforward to specify:

# This subtracts the median value from all data points i.e. centers on the median
GGDC10S %>% num_vars %>% fmedian(TRA = "-") %>% head
#   Year       AGR       MIN       MAN        PU       CON       WRT       TRA      FIRE       GOV
# 1  -22        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 2  -21        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 3  -20        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 4  -19        NA        NA        NA        NA        NA        NA        NA        NA        NA
# 5  -18 -4378.218 -169.7294 -3717.362 -167.8456 -1472.787 -3767.399 -1173.141 -959.0059 -3923.690
# 6  -17 -4378.792 -170.7277 -3717.080 -167.8149 -1472.101 -3766.578 -1172.861 -958.8783 -3922.817
#         OTH       SUM
# 1        NA        NA
# 2        NA        NA
# 3        NA        NA
# 4        NA        NA
# 5 -1430.831 -23148.71
# 6 -1430.494 -23146.85

# This replaces all data points with the mode
GGDC10S %>% char_vars %>% fmode(TRA = "replace") %>% head
#   Country Regioncode Region Variable
# 1     USA        ASI   Asia      EMP
# 2     USA        ASI   Asia      EMP
# 3     USA        ASI   Asia      EMP
# 4     USA        ASI   Asia      EMP
# 5     USA        ASI   Asia      EMP
# 6     USA        ASI   Asia      EMP

Similarly for grouped transformations:

# Replacing data with the first quartile (25th percentile)
GGDC10S %>%
  fselect(Variable, Country, AGR:SUM) %>% 
   fgroup_by(Variable, Country) %>% fnth(0.25, TRA = "replace_fill") %>% head(3)
# # A tibble: 3 x 13
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 VA       BWA      61.3  21.7  23.1  6.31  23.2  26.7  8.98  11.3  27.0  10.1  220.
# 2 VA       BWA      61.3  21.7  23.1  6.31  23.2  26.7  8.98  11.3  27.0  10.1  220.
# 3 VA       BWA      61.3  21.7  23.1  6.31  23.2  26.7  8.98  11.3  27.0  10.1  220.

# Scaling sectoral data by Variable and Country
GGDC10S %>%
  fselect(Variable, Country, AGR:SUM) %>% 
   fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% head
# # A tibble: 6 x 13
#   Variable Country     AGR      MIN      MAN       PU      CON      WRT      TRA     FIRE      GOV
#   <chr>    <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
# 1 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
# 2 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
# 3 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
# 4 VA       BWA     NA      NA       NA       NA       NA       NA       NA       NA       NA      
# 5 VA       BWA      0.0270  5.56e-4  5.23e-4  3.88e-4  5.11e-4  0.00194  0.00154  5.23e-4  0.00134
# 6 VA       BWA      0.0260  3.97e-4  7.23e-4  5.03e-4  1.04e-3  0.00220  0.00180  5.83e-4  0.00158
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>

The benchmarks below will demonstrate that these internal sweeping and replacement operations, performed entirely in C++, compute significantly faster than using dplyr::mutate, especially as the number of groups grows large. The S3 generic nature of the Fast Statistical Functions further allows us to perform grouped mutations on the fly (together with ftransform or fcompute), without first needing to create a grouped tibble:

# AGR_gmed = TRUE if AGR is greater than its median value, grouped by Variable and Country
# Note: This calls fmedian.default
settransform(GGDC10S, AGR_gmed = AGR > fmedian(AGR, list(Variable, Country), TRA = "replace"))
tail(GGDC10S, 3)
#      Country Regioncode                       Region Variable Year      AGR      MIN      MAN
# 5025     EGY       MENA Middle East and North Africa      EMP 2010 5205.529 28.99641 2435.549
# 5026     EGY       MENA Middle East and North Africa      EMP 2011 5185.919 27.56394 2373.814
# 5027     EGY       MENA Middle East and North Africa      EMP 2012 5160.590 24.78083 2348.434
#            PU      CON      WRT      TRA     FIRE      GOV OTH      SUM AGR_gmed
# 5025 307.2712 2732.953 2977.063 1992.274 801.2984 5538.946  NA 22019.88     TRUE
# 5026 317.9979 2795.264 3020.236 2048.335 814.7403 5635.522  NA 22219.39     TRUE
# 5027 324.9332 2931.196 3109.522 2065.004 832.4770 5735.623  NA 22532.56     TRUE

Weights are easily added to any grouped transformation:

# This subtracts weighted group means from the data, using the SUM column as weights. 
GGDC10S %>%
  fselect(Variable, Country, AGR:SUM) %>% 
   fgroup_by(Variable, Country) %>% fmean(SUM, "-") %>% head
# # A tibble: 6 x 13
#   Variable Country   SUM    AGR     MIN    MAN    PU    CON    WRT    TRA   FIRE    GOV    OTH
#   <chr>    <chr>   <dbl>  <dbl>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
# 2 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
# 3 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
# 4 VA       BWA      NA      NA      NA     NA    NA     NA     NA     NA     NA     NA     NA 
# 5 VA       BWA      37.5 -1301. -13317. -2965. -529. -2746. -6540. -2157. -4431. -7551. -2613.
# 6 VA       BWA      39.3 -1302. -13318. -2964. -529. -2745. -6540. -2156. -4431. -7550. -2613.

Sequential operations are also easily performed:

# This scales and then subtracts the median
GGDC10S %>%
  fselect(Variable, Country, AGR:SUM) %>% 
   fgroup_by(Variable, Country) %>% fsd(TRA = "/") %>% fmedian(TRA = "-")
# # A tibble: 5,027 x 13
#    Variable Country    AGR    MIN    MAN     PU    CON     WRT     TRA    FIRE    GOV     OTH    SUM
#  * <chr>    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
#  1 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  2 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  3 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  4 VA       BWA     NA     NA     NA     NA     NA     NA      NA      NA      NA     NA      NA    
#  5 VA       BWA     -0.182 -0.235 -0.183 -0.245 -0.118 -0.0820 -0.0724 -0.0661 -0.108 -0.0848 -0.146
#  6 VA       BWA     -0.183 -0.235 -0.183 -0.245 -0.117 -0.0817 -0.0722 -0.0660 -0.108 -0.0846 -0.146
#  7 VA       BWA     -0.180 -0.235 -0.183 -0.245 -0.117 -0.0813 -0.0720 -0.0659 -0.107 -0.0843 -0.145
#  8 VA       BWA     -0.177 -0.235 -0.183 -0.245 -0.117 -0.0826 -0.0724 -0.0659 -0.107 -0.0841 -0.146
#  9 VA       BWA     -0.174 -0.235 -0.183 -0.245 -0.117 -0.0823 -0.0717 -0.0661 -0.108 -0.0848 -0.146
# 10 VA       BWA     -0.173 -0.234 -0.182 -0.243 -0.115 -0.0821 -0.0715 -0.0660 -0.108 -0.0846 -0.145
# # ... with 5,017 more rows

Of course it is also possible to combine multiple functions as in the aggregation section, or to add variables to existing data:

# This adds a groupwise observation count next to each column
add_vars(GGDC10S, seq(7,27,2)) <- GGDC10S %>%
    fgroup_by(Variable, Country) %>% fselect(AGR:SUM) %>%
    fNobs("replace_fill") %>% add_stub("N_")

head(GGDC10S)
#   Country Regioncode             Region Variable Year      AGR N_AGR      MIN N_MIN       MAN N_MAN
# 1     BWA        SSA Sub-saharan Africa       VA 1960       NA    47       NA    47        NA    47
# 2     BWA        SSA Sub-saharan Africa       VA 1961       NA    47       NA    47        NA    47
# 3     BWA        SSA Sub-saharan Africa       VA 1962       NA    47       NA    47        NA    47
# 4     BWA        SSA Sub-saharan Africa       VA 1963       NA    47       NA    47        NA    47
# 5     BWA        SSA Sub-saharan Africa       VA 1964 16.30154    47 3.494075    47 0.7365696    47
# 6     BWA        SSA Sub-saharan Africa       VA 1965 15.72700    47 2.495768    47 1.0181992    47
#          PU N_PU       CON N_CON      WRT N_WRT      TRA N_TRA     FIRE N_FIRE      GOV N_GOV
# 1        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 2        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 3        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 4        NA   47        NA    47       NA    47       NA    47       NA     47       NA    47
# 5 0.1043936   47 0.6600454    47 6.243732    47 1.658928    47 1.119194     47 4.822485    47
# 6 0.1350976   47 1.3462312    47 7.064825    47 1.939007    47 1.246789     47 5.695848    47
#        OTH N_OTH      SUM N_SUM AGR_gmed
# 1       NA    47       NA    47       NA
# 2       NA    47       NA    47       NA
# 3       NA    47       NA    47       NA
# 4       NA    47       NA    47       NA
# 5 2.341328    47 37.48229    47    FALSE
# 6 2.678338    47 39.34710    47    FALSE
rm(GGDC10S)

There are lots of other examples one could construct using the 10 operations and 14 functions listed above; the examples provided just outline the suggested programming basics. Performance considerations make it very much worthwhile to spend some time thinking about how complex operations can be implemented in this programming framework, before defining some function in R and applying it to the data using dplyr::mutate.
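
For example, a sketch combining fsum with the "%" operation to express each sector as a percentage of its group total (output omitted):

GGDC10S %>%
  fgroup_by(Variable, Country) %>% get_vars(6:16) %>%
  fsum(TRA = "%") %>% head(3)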

2.3 More Control using the TRA Function

Towards this end, calling TRA() directly also facilitates more complex and customized operations. Behind the scenes of the TRA = ... argument, the Fast Statistical Functions first compute the grouped statistics on all columns of the data, and these statistics are then directly fed into a C++ function that uses them to replace data points or sweep them out of the data in one of the 10 ways described above. This function can also be called directly under the name TRA.

Fundamentally, TRA is a generalization of base::sweep for column-wise grouped operations. Direct calls to TRA enable more control over inputs and outputs.

The two operations below are equivalent, although the first is slightly more efficient as it only requires one method dispatch and one check of the inputs:

# This divides by the product
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
    get_vars(6:16) %>% fprod(TRA = "/") %>% head
# # A tibble: 6 x 11
#          AGR        MIN        MAN        PU        CON        WRT       TRA      FIRE        GOV
#        <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>     <dbl>      <dbl>
# 1 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 2 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 3 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 4 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 5  1.29e-105  2.81e-127  1.40e-101  4.44e-74  4.19e-102  3.97e-113  6.91e-92  1.01e-97  2.51e-117
# 6  1.24e-105  2.00e-127  1.94e-101  5.75e-74  8.55e-102  4.49e-113  8.08e-92  1.13e-97  2.96e-117
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>

# Same thing
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
    get_vars(6:16) %>% 
     TRA(fprod(., keep.group_vars = FALSE), "/") %>% head # [same as TRA(.,fprod(., keep.group_vars = FALSE),"/")]
# # A tibble: 6 x 11
#          AGR        MIN        MAN        PU        CON        WRT       TRA      FIRE        GOV
#        <dbl>      <dbl>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>     <dbl>      <dbl>
# 1 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 2 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 3 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 4 NA         NA         NA         NA        NA         NA         NA        NA        NA        
# 5  1.29e-105  2.81e-127  1.40e-101  4.44e-74  4.19e-102  3.97e-113  6.91e-92  1.01e-97  2.51e-117
# 6  1.24e-105  2.00e-127  1.94e-101  5.75e-74  8.55e-102  4.49e-113  8.08e-92  1.13e-97  2.96e-117
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>

TRA.grouped_df was designed such that it matches the columns of the statistics (aggregated columns) to those of the original data, and only transforms matching columns while returning the whole data frame. Thus it is easily possible to only apply a transformation to the first two sectors:

# This only demeans Agriculture (AGR) and Mining (MIN)
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
    TRA(fselect(., AGR, MIN) %>% fmean(keep.group_vars = FALSE), "-") %>% head
# # A tibble: 6 x 16
#   Country Regioncode Region Variable  Year   AGR    MIN    MAN     PU    CON   WRT   TRA  FIRE   GOV
#   <chr>   <chr>      <chr>  <chr>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA     SSA        Sub-s~ VA        1960   NA     NA  NA     NA     NA     NA    NA    NA    NA   
# 2 BWA     SSA        Sub-s~ VA        1961   NA     NA  NA     NA     NA     NA    NA    NA    NA   
# 3 BWA     SSA        Sub-s~ VA        1962   NA     NA  NA     NA     NA     NA    NA    NA    NA   
# 4 BWA     SSA        Sub-s~ VA        1963   NA     NA  NA     NA     NA     NA    NA    NA    NA   
# 5 BWA     SSA        Sub-s~ VA        1964 -446. -4505.  0.737  0.104  0.660  6.24  1.66  1.12  4.82
# 6 BWA     SSA        Sub-s~ VA        1965 -446. -4506.  1.02   0.135  1.35   7.06  1.94  1.25  5.70
# # ... with 2 more variables: OTH <dbl>, SUM <dbl>

Since TRA is already built into all Fast Statistical Functions as an argument, it is best used in computations where grouped statistics are computed using some other function.

# Same as above, with one line of code using fmean.data.frame and ftransform...
GGDC10S %>% ftransform(fmean(list(AGR = AGR, MIN = MIN), list(Variable, Country), TRA = "-")) %>% head
#   Country Regioncode             Region Variable Year       AGR       MIN       MAN        PU
# 1     BWA        SSA Sub-saharan Africa       VA 1960        NA        NA        NA        NA
# 2     BWA        SSA Sub-saharan Africa       VA 1961        NA        NA        NA        NA
# 3     BWA        SSA Sub-saharan Africa       VA 1962        NA        NA        NA        NA
# 4     BWA        SSA Sub-saharan Africa       VA 1963        NA        NA        NA        NA
# 5     BWA        SSA Sub-saharan Africa       VA 1964 -445.8739 -4505.178 0.7365696 0.1043936
# 6     BWA        SSA Sub-saharan Africa       VA 1965 -446.4485 -4506.176 1.0181992 0.1350976
#         CON      WRT      TRA     FIRE      GOV      OTH      SUM
# 1        NA       NA       NA       NA       NA       NA       NA
# 2        NA       NA       NA       NA       NA       NA       NA
# 3        NA       NA       NA       NA       NA       NA       NA
# 4        NA       NA       NA       NA       NA       NA       NA
# 5 0.6600454 6.243732 1.658928 1.119194 4.822485 2.341328 37.48229
# 6 1.3462312 7.064825 1.939007 1.246789 5.695848 2.678338 39.34710

Another potential use of TRA is to do computations in two- or more steps, for example if both aggregated and transformed data are needed, or if computations are more complex and involve other manipulations in-between the aggregating and sweeping part:

# Get grouped tibble
gGGDC <- GGDC10S %>% fgroup_by(Variable, Country)

# Get aggregated data
gsumGGDC <- gGGDC %>% fselect(AGR:SUM) %>% fsum
head(gsumGGDC)
# # A tibble: 6 x 13
#   Variable Country     AGR     MIN     MAN     PU     CON    WRT    TRA   FIRE     GOV    OTH    SUM
#   <chr>    <chr>     <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
# 1 EMP      ARG      8.80e4   3230.  1.20e5  6307.  4.60e4 1.23e5 4.02e4 3.89e4  1.27e5 6.15e4 6.54e5
# 2 EMP      BOL      5.88e4   3418.  1.43e4   326.  7.49e3 1.72e4 7.04e3 2.72e3 NA      2.41e4 1.35e5
# 3 EMP      BRA      1.07e6  12773.  4.33e5 22604.  2.19e5 5.28e5 1.27e5 2.74e5  3.29e5 3.54e5 3.36e6
# 4 EMP      BWA      8.84e3    493.  8.49e2   145.  1.19e3 1.71e3 3.93e2 7.21e2  2.87e3 1.30e3 1.85e4
# 5 EMP      CHL      4.42e4   6389.  3.94e4  1850.  1.86e4 4.38e4 1.63e4 1.72e4 NA      6.32e4 2.51e5
# 6 EMP      CHN      1.73e7 422972.  4.03e6 96364.  1.25e6 1.73e6 8.36e5 2.96e5  1.36e6 1.86e6 2.91e7

# Get transformed (scaled) data
head(TRA(gGGDC, gsumGGDC, "/"))
# # A tibble: 6 x 16
#   Country Regioncode Region Variable  Year      AGR      MIN      MAN       PU      CON      WRT
#   <chr>   <chr>      <chr>  <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
# 1 BWA     SSA        Sub-s~ VA        1960 NA       NA       NA       NA       NA       NA      
# 2 BWA     SSA        Sub-s~ VA        1961 NA       NA       NA       NA       NA       NA      
# 3 BWA     SSA        Sub-s~ VA        1962 NA       NA       NA       NA       NA       NA      
# 4 BWA     SSA        Sub-s~ VA        1963 NA       NA       NA       NA       NA       NA      
# 5 BWA     SSA        Sub-s~ VA        1964  7.50e-4  1.65e-5  1.66e-5  1.03e-5  1.57e-5  6.82e-5
# 6 BWA     SSA        Sub-s~ VA        1965  7.24e-4  1.18e-5  2.30e-5  1.33e-5  3.20e-5  7.72e-5
# # ... with 5 more variables: TRA <dbl>, FIRE <dbl>, GOV <dbl>, OTH <dbl>, SUM <dbl>

As discussed, whether using the argument to fast statistical functions or TRA directly, these data transformations are essentially a two-step process: Statistics are first computed and then used to transform the original data.

Although both steps are efficiently done in C++, it would be even more efficient to do them in a single step without materializing all the statistics before transforming the data. Such slightly more efficient functions are provided for the very commonly applied tasks of centering and averaging data by groups (widely known as ‘between’-group and ‘within’-group transformations), and scaling and centering data by groups (also known as ‘standardizing’ data).

2.4 Faster Centering, Averaging and Standardizing

The functions fbetween and fwithin are slightly more memory efficient implementations of fmean invoked with different TRA options:

GGDC10S %>% # Same as ... %>% fmean(TRA = "replace")
  fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween %>% head(2)
# # A tibble: 2 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

GGDC10S %>% # Same as ... %>% fmean(TRA = "replace_fill")
  fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fbetween(fill = TRUE) %>% head(2)
# # A tibble: 2 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH    SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
# 1  462. 4509.  942.  216.  895. 1948.  635. 1359. 2373.  773. 14112.
# 2  462. 4509.  942.  216.  895. 1948.  635. 1359. 2373.  773. 14112.

GGDC10S %>% # Same as ... %>% fmean(TRA = "-")
  fgroup_by(Variable, Country) %>% get_vars(6:16) %>% fwithin %>% head(2)
# # A tibble: 2 x 11
#     AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Apart from higher speed, fwithin has a mean argument to assign an arbitrary mean to centered data, the default being mean = 0. A very common choice for such an added mean is just the overall mean of the data, which can be added in by invoking mean = "overall.mean":

GGDC10S %>% 
  fgroup_by(Variable, Country) %>% 
    fselect(Country, Variable, AGR:SUM) %>% fwithin(mean = "overall.mean") %>% head(3)
# # A tibble: 3 x 13
#   Country Variable   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA     VA          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 BWA     VA          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 BWA     VA          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

This can also be done using weights. The code below uses the SUM column as weights: within each variable and group it subtracts out the weighted mean and then adds the overall weighted column mean back to the centered columns. The SUM column itself is kept as is and placed after the grouping columns.

GGDC10S %>% 
  fgroup_by(Variable, Country) %>% 
    fselect(Country, Variable, AGR:SUM) %>% fwithin(SUM, mean = "overall.mean") %>% head(3)
# # A tibble: 3 x 13
#   Country Variable   SUM   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH
#   <chr>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 BWA     VA          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 2 BWA     VA          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# 3 BWA     VA          NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

Another argument to fwithin is the theta parameter, allowing partial- or quasi-demeaning operations, e.g. fwithin(gdata, theta = theta) is equal to gdata - theta * fbetween(gdata). This is particularly useful to prepare data for variance components (also known as ‘random-effects’) estimation.
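To see this identity in action, below is a minimal sketch on the numeric GGDC10S sector columns; the value theta = 0.5 and the object names num, g, qd_fun and qd_man are purely illustrative choices.

# A minimal sketch of the quasi-demeaning identity (theta = 0.5 is arbitrary)
num <- get_vars(GGDC10S, 6:16)              # the numeric sector columns AGR:SUM
g   <- GRP(GGDC10S, ~ Variable + Country)   # grouping object

qd_fun <- fwithin(num, g, theta = 0.5)      # quasi-demeaning via fwithin
qd_man <- num - 0.5 * fbetween(num, g)      # manual equivalent
all.equal(qd_fun, qd_man)                   # should be TRUE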

Apart from fbetween and fwithin, the function fscale exists to efficiently scale and center data, to avoid sequential calls such as ... %>% fsd(TRA = "/") %>% fmean(TRA = "-").

# This efficiently scales and centers (i.e. standardizes) the data
GGDC10S %>%
  fgroup_by(Variable, Country) %>%
    fselect(Country, Variable, AGR:SUM) %>% fscale
# # A tibble: 5,027 x 13
#    Country Variable    AGR    MIN    MAN     PU    CON    WRT    TRA   FIRE    GOV    OTH    SUM
#  * <chr>   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  2 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  3 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  4 BWA     VA       NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
#  5 BWA     VA       -0.738 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
#  6 BWA     VA       -0.739 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.596 -0.676
#  7 BWA     VA       -0.736 -0.717 -0.668 -0.805 -0.692 -0.603 -0.589 -0.635 -0.656 -0.595 -0.676
#  8 BWA     VA       -0.734 -0.717 -0.668 -0.805 -0.692 -0.604 -0.589 -0.635 -0.655 -0.595 -0.676
#  9 BWA     VA       -0.730 -0.717 -0.668 -0.805 -0.692 -0.604 -0.588 -0.635 -0.656 -0.596 -0.676
# 10 BWA     VA       -0.729 -0.716 -0.667 -0.803 -0.690 -0.603 -0.588 -0.635 -0.656 -0.596 -0.675
# # ... with 5,017 more rows

fscale also has additional mean and sd arguments allowing the user to (group-) scale data to an arbitrary mean and standard deviation. Setting mean = FALSE just scales the data but preserves the means, and is thus different from fsd(..., TRA = "/") which simply divides all values by the standard deviation:

# Saving grouped tibble
gGGDC <- GGDC10S %>%
  fgroup_by(Variable, Country) %>%
    fselect(Country, Variable, AGR:SUM)

# Original means
head(fmean(gGGDC)) 
# # A tibble: 6 x 13
#   Variable Country     AGR    MIN     MAN      PU     CON    WRT    TRA   FIRE     GOV    OTH    SUM
#   <chr>    <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
# 1 EMP      ARG       1420.   52.1  1932.   102.     742.  1.98e3 6.49e2  628.   2043.  9.92e2 1.05e4
# 2 EMP      BOL        964.   56.0   235.     5.35   123.  2.82e2 1.15e2   44.6    NA   3.96e2 2.22e3
# 3 EMP      BRA      17191.  206.   6991.   365.    3525.  8.51e3 2.05e3 4414.   5307.  5.71e3 5.43e4
# 4 EMP      BWA        188.   10.5    18.1    3.09    25.3 3.63e1 8.36e0   15.3    61.1 2.76e1 3.94e2
# 5 EMP      CHL        702.  101.    625.    29.4    296.  6.95e2 2.58e2  272.     NA   1.00e3 3.98e3
# 6 EMP      CHN     287744. 7050.  67144.  1606.   20852.  2.89e4 1.39e4 4929.  22669.  3.10e4 4.86e5

# Mean Preserving Scaling
head(fmean(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
#   Variable Country     AGR    MIN     MAN      PU     CON    WRT    TRA   FIRE     GOV    OTH    SUM
#   <chr>    <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
# 1 EMP      ARG       1420.   52.1  1932.   102.     742.  1.98e3 6.49e2  628.   2043.  9.92e2 1.05e4
# 2 EMP      BOL        964.   56.0   235.     5.35   123.  2.82e2 1.15e2   44.6    NA   3.96e2 2.22e3
# 3 EMP      BRA      17191.  206.   6991.   365.    3525.  8.51e3 2.05e3 4414.   5307.  5.71e3 5.43e4
# 4 EMP      BWA        188.   10.5    18.1    3.09    25.3 3.63e1 8.36e0   15.3    61.1 2.76e1 3.94e2
# 5 EMP      CHL        702.  101.    625.    29.4    296.  6.95e2 2.58e2  272.     NA   1.00e3 3.98e3
# 6 EMP      CHN     287744. 7050.  67144.  1606.   20852.  2.89e4 1.39e4 4929.  22669.  3.10e4 4.86e5
head(fsd(fscale(gGGDC, mean = FALSE)))
# # A tibble: 6 x 13
#   Variable Country   AGR   MIN   MAN    PU   CON   WRT   TRA  FIRE   GOV   OTH   SUM
#   <chr>    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 EMP      ARG      1.    1.    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.  
# 2 EMP      BOL      1.    1.00  1.    1.00  1.00  1.    1.    1.   NA     1.    1.  
# 3 EMP      BRA      1.    1.    1.    1.00  1.    1.00  1.00  1.00  1.    1.00  1.00
# 4 EMP      BWA      1.00  1.00  1.    1.    1.    1.00  1.    1.00  1.    1.00  1.00
# 5 EMP      CHL      1.    1.    1.00  1.    1.    1.    1.00  1.   NA     1.    1.00
# 6 EMP      CHN      1.    1.    1.    1.00  1.00  1.    1.    1.    1.00  1.00  1.

One can also set mean = "overall.mean", which group-centers columns on the overall mean as illustrated with fwithin. Another interesting option is setting sd = "within.sd". This group-scales data such that every group has a standard deviation equal to the within-standard deviation of the data:

# Just using VA data for this example
gGGDC <- GGDC10S %>%
  fsubset(Variable == "VA", Country, AGR:SUM) %>% 
      fgroup_by(Country)

# This calculates the within-standard deviation of all columns
fsd(num_vars(ungroup(fwithin(gGGDC))))
#       AGR       MIN       MAN        PU       CON       WRT       TRA      FIRE       GOV       OTH 
#  45046972  40122220  75608708   3062688  30811572  44125207  20676901  16030868  20358973  18780869 
#       SUM 
# 306429102

# This scales all groups to take on the within-standard deviation while preserving group means 
fsd(fscale(gGGDC, mean = FALSE, sd = "within.sd"))
# # A tibble: 43 x 12
#    Country      AGR      MIN      MAN     PU     CON     WRT     TRA    FIRE     GOV     OTH     SUM
#    <chr>      <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#  1 ARG       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  2 BOL       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7 NA       1.88e7  3.06e8
#  3 BRA       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  4 BWA       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  5 CHL       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7 NA       1.88e7  3.06e8
#  6 CHN       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  7 COL       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7 NA       1.88e7  3.06e8
#  8 CRI       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
#  9 DEW       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
# 10 DNK       4.50e7   4.01e7   7.56e7 3.06e6  3.08e7  4.41e7  2.07e7  1.60e7  2.04e7  1.88e7  3.06e8
# # ... with 33 more rows

A grouped scaling operation with both mean = "overall.mean" and sd = "within.sd" thus efficiently harmonizes all groups in the first two moments without changing the fundamental properties (in terms of level and scale) of the data, as sketched below.
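As a minimal sketch of such a harmonization (reusing the VA-grouped gGGDC data from above; the object name harm and the choice to inspect only the first three groups are illustrative):

harm <- fscale(gGGDC, mean = "overall.mean", sd = "within.sd")  # harmonize means and scales

head(fmean(harm), 3)  # group means should now all equal the overall column means
head(fsd(harm), 3)    # group sds should now all equal the within-standard deviations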

2.5 Lags / Leads, Differences and Growth Rates

This section introduces 3 further powerful collapse functions: flag, fdiff and fgrowth. The first function, flag, efficiently computes sequences of fully identified lags and leads on time series and panel data. The following code computes 1 fully identified panel-lag and 1 fully identified panel-lead of each variable in the data:

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% flag(-1:1, Year)
# # A tibble: 5,027 x 36
#    Country Variable  Year F1.AGR   AGR L1.AGR F1.MIN   MIN L1.MIN F1.MAN    MAN L1.MAN  F1.PU     PU
#  * <chr>   <chr>    <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 BWA     VA        1960   NA    NA     NA    NA    NA     NA    NA     NA     NA     NA     NA    
#  2 BWA     VA        1961   NA    NA     NA    NA    NA     NA    NA     NA     NA     NA     NA    
#  3 BWA     VA        1962   NA    NA     NA    NA    NA     NA    NA     NA     NA     NA     NA    
#  4 BWA     VA        1963   16.3  NA     NA     3.49 NA     NA     0.737 NA     NA      0.104 NA    
#  5 BWA     VA        1964   15.7  16.3   NA     2.50  3.49  NA     1.02   0.737 NA      0.135  0.104
#  6 BWA     VA        1965   17.7  15.7   16.3   1.97  2.50   3.49  0.804  1.02   0.737  0.203  0.135
#  7 BWA     VA        1966   19.1  17.7   15.7   2.30  1.97   2.50  0.938  0.804  1.02   0.203  0.203
#  8 BWA     VA        1967   21.1  19.1   17.7   1.84  2.30   1.97  0.750  0.938  0.804  0.203  0.203
#  9 BWA     VA        1968   21.9  21.1   19.1   5.24  1.84   2.30  2.14   0.750  0.938  0.578  0.203
# 10 BWA     VA        1969   23.1  21.9   21.1  10.2   5.24   1.84  4.15   2.14   0.750  1.12   0.578
# # ... with 5,017 more rows, and 22 more variables: L1.PU <dbl>, F1.CON <dbl>, CON <dbl>,
# #   L1.CON <dbl>, F1.WRT <dbl>, WRT <dbl>, L1.WRT <dbl>, F1.TRA <dbl>, TRA <dbl>, L1.TRA <dbl>,
# #   F1.FIRE <dbl>, FIRE <dbl>, L1.FIRE <dbl>, F1.GOV <dbl>, GOV <dbl>, L1.GOV <dbl>, F1.OTH <dbl>,
# #   OTH <dbl>, L1.OTH <dbl>, F1.SUM <dbl>, SUM <dbl>, L1.SUM <dbl>

If the time-variable passed does not exactly identify the data (i.e. because of gaps or repeated values in each group), all 3 functions will issue appropriate error messages. flag, fdiff and fgrowth support unbalanced panels with different start and end periods and duration of coverage for each individual, but not irregular panels. A workaround for such panels exists with the function seqid which generates a new panel-id identifying consecutive time-sequences at the sub-individual level, see ?seqid.
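As a minimal illustration of what seqid does (using a toy integer time vector, not the GGDC10S data): consecutive runs of the time variable share an id and any gap starts a new one, so that grouping on the original panel-id together with such a sequence id yields consecutive sequences that flag, fdiff and fgrowth can process.

years <- c(2001L, 2002L, 2003L, 2006L, 2007L)  # hypothetical time variable with a gap
seqid(years)
# The three consecutive years share id 1 and the gap before 2006 starts id 2,
# so the ids are 1 1 1 2 2 (returned as a grouping vector with some attributes)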

It is also possible to omit the time-variable if one is certain that the data is sorted:

GGDC10S %>%
  fselect(Variable, Country,AGR:SUM) %>% 
    fgroup_by(Variable, Country) %>% flag
# # A tibble: 5,027 x 13
#    Variable Country   AGR   MIN    MAN     PU    CON   WRT   TRA  FIRE   GOV   OTH   SUM
#  * <chr>    <chr>   <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1 VA       BWA      NA   NA    NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  2 VA       BWA      NA   NA    NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  3 VA       BWA      NA   NA    NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  4 VA       BWA      NA   NA    NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  5 VA       BWA      NA   NA    NA     NA     NA     NA    NA    NA    NA    NA     NA  
#  6 VA       BWA      16.3  3.49  0.737  0.104  0.660  6.24  1.66  1.12  4.82  2.34  37.5
#  7 VA       BWA      15.7  2.50  1.02   0.135  1.35   7.06  1.94  1.25  5.70  2.68  39.3
#  8 VA       BWA      17.7  1.97  0.804  0.203  1.35   8.27  2.15  1.36  6.37  2.99  43.1
#  9 VA       BWA      19.1  2.30  0.938  0.203  0.897  4.31  1.72  1.54  7.04  3.31  41.4
# 10 VA       BWA      21.1  1.84  0.750  0.203  1.22   5.17  2.44  1.03  5.03  2.36  41.1
# # ... with 5,017 more rows

fdiff computes sequences of lagged / leaded and iterated differences, as well as quasi-differences and log-differences, on time series and panel data. The code below computes the 1- and 10-year first and second differences of each variable in the data:

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1:2, Year)
# # A tibble: 5,027 x 47
#    Country Variable  Year D1.AGR D2.AGR L10D1.AGR L10D2.AGR D1.MIN D2.MIN L10D1.MIN L10D2.MIN D1.MAN
#  * <chr>   <chr>    <dbl>  <dbl>  <dbl>     <dbl>     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>  <dbl>
#  1 BWA     VA        1960 NA     NA            NA        NA NA     NA            NA        NA NA    
#  2 BWA     VA        1961 NA     NA            NA        NA NA     NA            NA        NA NA    
#  3 BWA     VA        1962 NA     NA            NA        NA NA     NA            NA        NA NA    
#  4 BWA     VA        1963 NA     NA            NA        NA NA     NA            NA        NA NA    
#  5 BWA     VA        1964 NA     NA            NA        NA NA     NA            NA        NA NA    
#  6 BWA     VA        1965 -0.575 NA            NA        NA -0.998 NA            NA        NA  0.282
#  7 BWA     VA        1966  1.95   2.53         NA        NA -0.525  0.473        NA        NA -0.214
#  8 BWA     VA        1967  1.47  -0.488        NA        NA  0.328  0.854        NA        NA  0.134
#  9 BWA     VA        1968  1.95   0.488        NA        NA -0.460 -0.788        NA        NA -0.188
# 10 BWA     VA        1969  0.763 -1.19         NA        NA  3.41   3.87         NA        NA  1.39 
# # ... with 5,017 more rows, and 35 more variables: D2.MAN <dbl>, L10D1.MAN <dbl>, L10D2.MAN <dbl>,
# #   D1.PU <dbl>, D2.PU <dbl>, L10D1.PU <dbl>, L10D2.PU <dbl>, D1.CON <dbl>, D2.CON <dbl>,
# #   L10D1.CON <dbl>, L10D2.CON <dbl>, D1.WRT <dbl>, D2.WRT <dbl>, L10D1.WRT <dbl>, L10D2.WRT <dbl>,
# #   D1.TRA <dbl>, D2.TRA <dbl>, L10D1.TRA <dbl>, L10D2.TRA <dbl>, D1.FIRE <dbl>, D2.FIRE <dbl>,
# #   L10D1.FIRE <dbl>, L10D2.FIRE <dbl>, D1.GOV <dbl>, D2.GOV <dbl>, L10D1.GOV <dbl>,
# #   L10D2.GOV <dbl>, D1.OTH <dbl>, D2.OTH <dbl>, L10D1.OTH <dbl>, L10D2.OTH <dbl>, D1.SUM <dbl>,
# #   D2.SUM <dbl>, L10D1.SUM <dbl>, L10D2.SUM <dbl>

Log-differences of the form \(log(x_t) - log(x_{t-s})\) are also easily computed.

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% fdiff(c(1, 10), 1, Year, log = TRUE)
# # A tibble: 5,027 x 25
#    Country Variable  Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
#  * <chr>   <chr>    <dbl>     <dbl>        <dbl>     <dbl>        <dbl>     <dbl>        <dbl>
#  1 BWA     VA        1960   NA                NA    NA               NA    NA               NA
#  2 BWA     VA        1961   NA                NA    NA               NA    NA               NA
#  3 BWA     VA        1962   NA                NA    NA               NA    NA               NA
#  4 BWA     VA        1963   NA                NA    NA               NA    NA               NA
#  5 BWA     VA        1964   NA                NA    NA               NA    NA               NA
#  6 BWA     VA        1965   -0.0359           NA    -0.336           NA     0.324           NA
#  7 BWA     VA        1966    0.117            NA    -0.236           NA    -0.236           NA
#  8 BWA     VA        1967    0.0796           NA     0.154           NA     0.154           NA
#  9 BWA     VA        1968    0.0972           NA    -0.223           NA    -0.223           NA
# 10 BWA     VA        1969    0.0355           NA     1.05            NA     1.05            NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# #   Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# #   L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# #   Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>

It is also possible to compute quasi-differences and quasi-log-differences of the form \(x_t - \rho x_{t-s}\) or \(log(x_t) - \rho log(x_{t-s})\):

GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% fdiff(t = Year, rho = 0.95)
# # A tibble: 5,027 x 14
#    Country Variable  Year    AGR    MIN    MAN      PU     CON    WRT    TRA   FIRE    GOV    OTH
#  * <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#  1 BWA     VA        1960 NA     NA     NA     NA      NA      NA     NA     NA     NA     NA    
#  2 BWA     VA        1961 NA     NA     NA     NA      NA      NA     NA     NA     NA     NA    
#  3 BWA     VA        1962 NA     NA     NA     NA      NA      NA     NA     NA     NA     NA    
#  4 BWA     VA        1963 NA     NA     NA     NA      NA      NA     NA     NA     NA     NA    
#  5 BWA     VA        1964 NA     NA     NA     NA      NA      NA     NA     NA     NA     NA    
#  6 BWA     VA        1965  0.241 -0.824  0.318  0.0359  0.719   1.13   0.363  0.184  1.11   0.454
#  7 BWA     VA        1966  2.74  -0.401 -0.163  0.0743  0.0673  1.56   0.312  0.174  0.955  0.449
#  8 BWA     VA        1967  2.35   0.427  0.174  0.0101 -0.381  -3.55  -0.323  0.246  0.988  0.465
#  9 BWA     VA        1968  2.91  -0.345 -0.141  0.0101  0.365   1.08   0.804 -0.427 -1.66  -0.780
# 10 BWA     VA        1969  1.82   3.50   1.43   0.385   2.32    0.841  0.397  0.252  0.818  0.385
# # ... with 5,017 more rows, and 1 more variable: SUM <dbl>

The quasi-differencing feature was added to fdiff to facilitate the preparation of time series and panel data for least-squares estimations suffering from serial correlation following Cochrane & Orcutt (1949).
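To sketch how this might feed into such a Cochrane-Orcutt style correction, below is a minimal, hypothetical single-series example; the simulated x and y, the AR(1) error process and the residual-based estimate of rho are all assumptions of this illustration, not part of the vignette's data.

# A minimal sketch of Cochrane-Orcutt style quasi-differencing (simulated data)
set.seed(1)
x <- cumsum(rnorm(100))                                  # hypothetical regressor
y <- 2 * x + as.numeric(arima.sim(list(ar = 0.7), 100))  # outcome with AR(1) errors

res <- resid(lm(y ~ x))                 # OLS residuals
rho <- cor(res[-1], res[-length(res)])  # estimated serial correlation

# Quasi-difference both series with the estimated rho and re-estimate the slope
yq <- fdiff(y, rho = rho)
xq <- fdiff(x, rho = rho)
coef(lm(yq ~ xq))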

Finally, fgrowth computes growth rates in the same way. By default exact growth rates are computed in percentage terms using \((x_t-x_{t-s}) / x_{t-s} \times 100\) (the default argument is scale = 100). The user can also request growth rates obtained by log-differencing using \(log(x_t/ x_{t-s}) \times 100\).

# Exact growth rates, computed as: (x/lag(x) - 1) * 100
GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year)
# # A tibble: 5,027 x 25
#    Country Variable  Year G1.AGR L10G1.AGR G1.MIN L10G1.MIN G1.MAN L10G1.MAN G1.PU L10G1.PU G1.CON
#  * <chr>   <chr>    <dbl>  <dbl>     <dbl>  <dbl>     <dbl>  <dbl>     <dbl> <dbl>    <dbl>  <dbl>
#  1 BWA     VA        1960  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  2 BWA     VA        1961  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  3 BWA     VA        1962  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  4 BWA     VA        1963  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  5 BWA     VA        1964  NA           NA   NA          NA   NA          NA  NA         NA   NA  
#  6 BWA     VA        1965  -3.52        NA  -28.6        NA   38.2        NA  29.4       NA  104. 
#  7 BWA     VA        1966  12.4         NA  -21.1        NA  -21.1        NA  50.0       NA    0  
#  8 BWA     VA        1967   8.29        NA   16.7        NA   16.7        NA   0         NA  -33.3
#  9 BWA     VA        1968  10.2         NA  -20          NA  -20          NA   0         NA   35.7
# 10 BWA     VA        1969   3.61        NA  185.         NA  185.         NA 185.        NA  185. 
# # ... with 5,017 more rows, and 13 more variables: L10G1.CON <dbl>, G1.WRT <dbl>, L10G1.WRT <dbl>,
# #   G1.TRA <dbl>, L10G1.TRA <dbl>, G1.FIRE <dbl>, L10G1.FIRE <dbl>, G1.GOV <dbl>, L10G1.GOV <dbl>,
# #   G1.OTH <dbl>, L10G1.OTH <dbl>, G1.SUM <dbl>, L10G1.SUM <dbl>

# Log-difference growth rates, computed as: log(x / lag(x)) * 100
GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year, logdiff = TRUE)
# # A tibble: 5,027 x 25
#    Country Variable  Year Dlog1.AGR L10Dlog1.AGR Dlog1.MIN L10Dlog1.MIN Dlog1.MAN L10Dlog1.MAN
#  * <chr>   <chr>    <dbl>     <dbl>        <dbl>     <dbl>        <dbl>     <dbl>        <dbl>
#  1 BWA     VA        1960     NA              NA      NA             NA      NA             NA
#  2 BWA     VA        1961     NA              NA      NA             NA      NA             NA
#  3 BWA     VA        1962     NA              NA      NA             NA      NA             NA
#  4 BWA     VA        1963     NA              NA      NA             NA      NA             NA
#  5 BWA     VA        1964     NA              NA      NA             NA      NA             NA
#  6 BWA     VA        1965     -3.59           NA     -33.6           NA      32.4           NA
#  7 BWA     VA        1966     11.7            NA     -23.6           NA     -23.6           NA
#  8 BWA     VA        1967      7.96           NA      15.4           NA      15.4           NA
#  9 BWA     VA        1968      9.72           NA     -22.3           NA     -22.3           NA
# 10 BWA     VA        1969      3.55           NA     105.            NA     105.            NA
# # ... with 5,017 more rows, and 16 more variables: Dlog1.PU <dbl>, L10Dlog1.PU <dbl>,
# #   Dlog1.CON <dbl>, L10Dlog1.CON <dbl>, Dlog1.WRT <dbl>, L10Dlog1.WRT <dbl>, Dlog1.TRA <dbl>,
# #   L10Dlog1.TRA <dbl>, Dlog1.FIRE <dbl>, L10Dlog1.FIRE <dbl>, Dlog1.GOV <dbl>, L10Dlog1.GOV <dbl>,
# #   Dlog1.OTH <dbl>, L10Dlog1.OTH <dbl>, Dlog1.SUM <dbl>, L10Dlog1.SUM <dbl>

fdiff and fgrowth can also perform leaded (forward) differences and growth rates (i.e. ... %>% fgrowth(-c(1, 10), 1:2, Year) would compute 1- and 10-year leaded first and second growth rates). Again it is possible to perform sequential operations:

# This computes the 1 and 10-year growth rates, for the current period and lagged by one period
GGDC10S %>%
  fselect(-Region, -Regioncode) %>% 
    fgroup_by(Variable, Country) %>% fgrowth(c(1, 10), 1, Year) %>% flag(0:1, Year)
# # A tibble: 5,027 x 47
#    Country Variable  Year G1.AGR L1.G1.AGR L10G1.AGR L1.L10G1.AGR G1.MIN L1.G1.MIN L10G1.MIN
#  * <chr>   <chr>    <dbl>  <dbl>     <dbl>     <dbl>        <dbl>  <dbl>     <dbl>     <dbl>
#  1 BWA     VA        1960  NA        NA           NA           NA   NA        NA          NA
#  2 BWA     VA        1961  NA        NA           NA           NA   NA        NA          NA
#  3 BWA     VA        1962  NA        NA           NA           NA   NA        NA          NA
#  4 BWA     VA        1963  NA        NA           NA           NA   NA        NA          NA
#  5 BWA     VA        1964  NA        NA           NA           NA   NA        NA          NA
#  6 BWA     VA        1965  -3.52     NA           NA           NA  -28.6      NA          NA
#  7 BWA     VA        1966  12.4      -3.52        NA           NA  -21.1     -28.6        NA
#  8 BWA     VA        1967   8.29     12.4         NA           NA   16.7     -21.1        NA
#  9 BWA     VA        1968  10.2       8.29        NA           NA  -20        16.7        NA
# 10 BWA     VA        1969   3.61     10.2         NA           NA  185.      -20          NA
# # ... with 5,017 more rows, and 37 more variables: L1.L10G1.MIN <dbl>, G1.MAN <dbl>,
# #   L1.G1.MAN <dbl>, L10G1.MAN <dbl>, L1.L10G1.MAN <dbl>, G1.PU <dbl>, L1.G1.PU <dbl>,
# #   L10G1.PU <dbl>, L1.L10G1.PU <dbl>, G1.CON <dbl>, L1.G1.CON <dbl>, L10G1.CON <dbl>,
# #   L1.L10G1.CON <dbl>, G1.WRT <dbl>, L1.G1.WRT <dbl>, L10G1.WRT <dbl>, L1.L10G1.WRT <dbl>,
# #   G1.TRA <dbl>, L1.G1.TRA <dbl>, L10G1.TRA <dbl>, L1.L10G1.TRA <dbl>, G1.FIRE <dbl>,
# #   L1.G1.FIRE <dbl>, L10G1.FIRE <dbl>, L1.L10G1.FIRE <dbl>, G1.GOV <dbl>, L1.G1.GOV <dbl>,
# #   L10G1.GOV <dbl>, L1.L10G1.GOV <dbl>, G1.OTH <dbl>, L1.G1.OTH <dbl>, L10G1.OTH <dbl>,
# #   L1.L10G1.OTH <dbl>, G1.SUM <dbl>, L1.G1.SUM <dbl>, L10G1.SUM <dbl>, L1.L10G1.SUM <dbl>

3. Benchmarks

This section seeks to demonstrate that the functionality introduced in the preceding 2 sections indeed produces code that evaluates substantially faster than native dplyr.

To do this properly, the different components of a typical piped call (selecting / subsetting, ordering, grouping, and performing some computation) are benchmarked separately on 2 different data sizes.

All benchmarks are run on a Windows 8.1 laptop with a 2x 2.2 GHz Intel i5 processor, 8 GB DDR3 RAM and a Samsung 850 EVO SSD.

3.1 Data

Benchmarks are run on the original GGDC10S data used throughout this vignette and on a larger dataset with approx. 1 million observations, obtained by replicating and row-binding GGDC10S 200 times while maintaining unique groups.

# This shows the groups in GGDC10S
GRP(GGDC10S, ~ Variable + Country)
# collapse grouping object of length 5027 with 85 ordered groups
# 
# Call: GRP.default(X = GGDC10S, by = ~Variable + Country), X is unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    4.00   53.00   62.00   59.14   63.00   65.00 
# 
# Groups with sizes: 
# EMP.ARG EMP.BOL EMP.BRA EMP.BWA EMP.CHL EMP.CHN 
#      62      61      62      52      63      62 
#   ---
# VA.TWN VA.TZA VA.USA VA.VEN VA.ZAF VA.ZMB 
#     63     52     65     63     52     52

# This replicates the data 200 times 
data <- replicate(200, GGDC10S, simplify = FALSE) 
# This function appends the number i to the Country and Variable columns of each dataset
uniquify <- function(x, i) ftransform(x, lapply(unclass(x)[c(1,4)], paste0, i))
# Making datasets unique and row-binding them
data <- unlist2d(Map(uniquify, data, as.list(1:200)), idcols = FALSE)
fdim(data)
# [1] 1005400      16

# This shows the groups in the replicated data
GRP(data, ~ Variable + Country)
# collapse grouping object of length 1005400 with 17000 ordered groups
# 
# Call: GRP.default(X = data, by = ~Variable + Country), X is unordered
# 
# Distribution of group sizes: 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    4.00   53.00   62.00   59.14   63.00   65.00 
# 
# Groups with sizes: 
# EMP1.ARG1 EMP1.BOL1 EMP1.BRA1 EMP1.BWA1 EMP1.CHL1 EMP1.CHN1 
#        62        61        62        52        63        62 
#   ---
# VA99.TWN99 VA99.TZA99 VA99.USA99 VA99.VEN99 VA99.ZAF99 VA99.ZMB99 
#         63         52         65         63         52         52

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1924737 102.8    3670734 196.1  3670734 196.1
# Vcells 19870476 151.6   28315229 216.1 23044570 175.9

3.2 Selecting, Subsetting, Ordering and Grouping

## Selecting columns
# Small
microbenchmark(dplyr = select(GGDC10S, Country, Variable, AGR:SUM),
               collapse = fselect(GGDC10S, Country, Variable, AGR:SUM))
# Unit: microseconds
#      expr      min       lq       mean    median       uq      max neval cld
#     dplyr 2798.415 2980.931 3159.70634 3127.3000 3306.245 3923.851   100   b
#  collapse   10.710   17.404   26.15927   28.7835   34.808   56.673   100  a

# Large
microbenchmark(dplyr = select(data, Country, Variable, AGR:SUM),
               collapse = fselect(data, Country, Variable, AGR:SUM))
# Unit: microseconds
#      expr      min       lq      mean   median       uq      max neval cld
#     dplyr 2771.641 2826.752 3038.0593 2936.976 3174.825 3767.664   100   b
#  collapse   11.156   15.619   25.7219   29.006   34.361   56.228   100  a

## Subsetting rows 
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA"),
               collapse = fsubset(GGDC10S, Variable == "VA"))
# Unit: microseconds
#      expr      min       lq      mean    median        uq      max neval cld
#     dplyr 1373.102 1637.505 1797.1978 1716.0440 1888.2945 2913.547   100   b
#  collapse  157.972  209.513  266.3651  229.8175  317.0585  497.566   100  a

# Large
microbenchmark(dplyr = filter(data, Variable == "VA"),
               collapse = fsubset(data, Variable == "VA"))
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval cld
#     dplyr 16.846720 17.146822 22.628601 17.511629 19.158280 171.76675   100   b
#  collapse  7.695085  7.825166  9.181815  7.993847  8.666342  23.32132   100  a

## Ordering rows
# Small
microbenchmark(dplyr = arrange(GGDC10S, desc(Country), Variable, Year),
               collapse = roworder(GGDC10S, -Country, Variable, Year))
# Unit: microseconds
#      expr      min       lq      mean    median        uq       max neval cld
#     dplyr 7784.781 8136.646 8736.6524 8491.4130 8959.9725 14771.671   100   b
#  collapse  559.147  624.523  786.6669  697.4845  876.4295  2740.403   100  a

# Large
microbenchmark(dplyr = arrange(data, desc(Country), Variable, Year),
               collapse = roworder(data, -Country, Variable, Year), times = 2)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval cld
#     dplyr 2601.7312 2601.7312 2632.6191 2632.6191 2663.5069 2663.5069     2   b
#  collapse  344.6215  344.6215  346.6969  346.6969  348.7724  348.7724     2  a


## Grouping 
# Small
microbenchmark(dplyr = group_by(GGDC10S, Country, Variable),
               collapse = fgroup_by(GGDC10S, Country, Variable))
# Unit: microseconds
#      expr      min        lq      mean   median       uq      max neval cld
#     dplyr 2763.162 2898.8215 3088.2130 2994.988 3264.074 3832.817   100   b
#  collapse  348.519  367.4845  402.0911  396.044  414.564  785.395   100  a

# Large
microbenchmark(dplyr = group_by(data, Country, Variable),
               collapse = fgroup_by(data, Country, Variable), times = 10)
# Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval cld
#     dplyr 76.63803 77.36809 78.87136 78.39401 80.88586 81.68731    10   b
#  collapse 68.31864 68.74615 72.99447 71.08538 71.57669 96.47059    10  a

## Computing a new column 
# Small
microbenchmark(dplyr = mutate(GGDC10S, NEW = AGR+1),
               collapse = ftransform(GGDC10S, NEW = AGR+1))
# Unit: microseconds
#      expr      min       lq      mean    median       uq      max neval cld
#     dplyr 2479.349 2586.226 2717.8860 2659.4095 2825.860 3310.261   100   b
#  collapse   21.866   26.775   36.8424   33.2455   45.964   70.061   100  a

# Large
microbenchmark(dplyr = mutate(data, NEW = AGR+1),
               collapse = ftransform(data, NEW = AGR+1))
# Unit: milliseconds
#      expr      min       lq     mean   median       uq      max neval cld
#     dplyr 6.479508 6.751273 8.913357 7.048919 7.416626 37.09250   100   b
#  collapse 1.751520 3.723040 4.420390 4.000828 4.069104 30.74999   100  a

## All combined with pipes 
# Small
microbenchmark(dplyr = filter(GGDC10S, Variable == "VA") %>% 
                       select(Country, Year, AGR:SUM) %>% 
                       arrange(desc(Country), Year) %>%
                       mutate(NEW = AGR+1) %>%
                       group_by(Country),
               collapse = fsubset(GGDC10S, Variable == "VA", Country, Year, AGR:SUM) %>% 
                       roworder(-Country, Year) %>%
                       ftransform(NEW = AGR+1) %>%
                       fgroup_by(Country))
# Unit: microseconds
#      expr       min        lq      mean     median        uq       max neval cld
#     dplyr 14700.718 15442.380 16238.182 16080.2905 16786.254 21672.882   100   b
#  collapse   698.377   779.148   841.774   813.2855   898.742  1129.898   100  a

# Large
microbenchmark(dplyr = filter(data, Variable == "VA") %>% 
                       select(Country, Year, AGR:SUM) %>% 
                       arrange(desc(Country), Year) %>%
                       mutate(NEW = AGR+1) %>%
                       group_by(Country),
               collapse = fsubset(data, Variable == "VA", Country, Year, AGR:SUM) %>% 
                       roworder(-Country, Year) %>%
                       ftransform(NEW = AGR+1) %>%
                       fgroup_by(Country), times = 10)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq      max neval cld
#     dplyr 29.952104 30.812915 35.031735 32.106808 32.820580 53.31448    10   b
#  collapse  7.094883  8.230582  8.733189  8.909993  9.348877 10.03476    10  a

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1930366 103.1    3670734 196.1  3670734 196.1
# Vcells 21386676 163.2   44612936 340.4 44612936 340.4

3.3 Aggregation

## Grouping the data
cgGGDC10S <- fgroup_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
gGGDC10S <- group_by(GGDC10S, Variable, Country) %>% fselect(-Region, -Regioncode)
cgdata <- fgroup_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
gdata <- group_by(data, Variable, Country) %>% fselect(-Region, -Regioncode)
rm(data, GGDC10S) 
gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1947353 104.0    3670734 196.1  3670734 196.1
# Vcells 20487216 156.4   44612936 340.4 44612936 340.4

## Conversion of grouping object: this extra time would be required in all hybrid calls,
## i.e. when calling collapse functions on data grouped with dplyr::group_by
# Small
microbenchmark(GRP(gGGDC10S))
# Unit: microseconds
#           expr     min       lq     mean median      uq     max neval
#  GRP(gGGDC10S) 167.342 190.1015 204.8187 195.68 214.868 553.793   100

# Large
microbenchmark(GRP(gdata))
# Unit: milliseconds
#        expr      min       lq     mean   median       uq      max neval
#  GRP(gdata) 30.90529 33.20525 34.67392 34.65956 35.94654 46.42175   100


## Sum 
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sum, na.rm = TRUE),
               collapse = fsum(cgGGDC10S))
# Unit: microseconds
#      expr      min        lq       mean   median         uq       max neval cld
#     dplyr 8585.348 9057.2545 10219.6699 9580.255 10114.4135 19238.159   100   b
#  collapse  238.296  276.6735   296.1522  294.746   316.6125   490.426   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, sum, na.rm = TRUE),
               collapse = fsum(cgdata), times = 10)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval cld
#     dplyr 579.45496 584.42393 613.59644 598.77769 624.97484 738.12035    10   b
#  collapse  40.33181  40.54601  42.07606  42.07909  42.54966  47.16876    10  a

## Mean
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, mean.default, na.rm = TRUE),
               collapse = fmean(cgGGDC10S))
# Unit: microseconds
#      expr       min        lq       mean    median         uq       max neval cld
#     dplyr 11492.648 11748.793 13004.0464 12048.226 12505.1825 31512.631   100   b
#  collapse   253.469   264.179   296.0227   307.688   312.5965   420.812   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, mean.default, na.rm = TRUE),
               collapse = fmean(cgdata), times = 10)
# Unit: milliseconds
#      expr        min         lq       mean     median        uq        max neval cld
#     dplyr 1312.46756 1506.54399 1606.47536 1664.64309 1705.4015 1779.43256    10   b
#  collapse   43.48589   43.93347   45.26195   45.17917   45.9871   49.31209    10  a

## Median
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, median, na.rm = TRUE),
               collapse = fmedian(cgGGDC10S))
# Unit: microseconds
#      expr      min         lq       mean     median        uq        max neval cld
#     dplyr 50473.67 52100.6860 57787.6456 54558.1685 59640.700 212904.930   100   b
#  collapse   489.98   515.6385   568.5235   560.7095   598.641    849.655   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, median, na.rm = TRUE),
               collapse = fmedian(cgdata), times = 2)
# Unit: milliseconds
#      expr        min         lq        mean      median          uq         max neval cld
#     dplyr 9880.28083 9880.28083 10086.79484 10086.79484 10293.30884 10293.30884     2   b
#  collapse   89.30252   89.30252    91.89656    91.89656    94.49059    94.49059     2  a

## Standard Deviation
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, sd, na.rm = TRUE),
               collapse = fsd(cgGGDC10S))
# Unit: microseconds
#      expr      min        lq       mean    median        uq       max neval cld
#     dplyr 23725.62 24611.419 26140.1147 25158.518 26656.124 35508.773   100   b
#  collapse   422.15   453.164   492.8132   485.517   515.862   883.123   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, sd, na.rm = TRUE),
               collapse = fsd(cgdata), times = 2)
# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval cld
#     dplyr 4304.44404 4304.44404 4380.66706 4380.66706 4456.89008 4456.89008     2   b
#  collapse   81.43251   81.43251   82.12664   82.12664   82.82078   82.82078     2  a

## Maximum
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, max, na.rm = TRUE),
               collapse = fmax(cgGGDC10S))
# Unit: microseconds
#      expr       min        lq       mean     median         uq       max neval cld
#     dplyr 11240.964 11646.603 12413.9514 12253.7220 12704.2085 20804.039   100   b
#  collapse   178.945   207.059   236.9262   241.1965   251.6835   548.438   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, max, na.rm = TRUE),
               collapse = fmax(cgdata), times = 10)
# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval cld
#     dplyr 1044.13785 1062.44157 1100.07827 1083.97366 1103.79975 1223.72999    10   b
#  collapse   24.18391   24.85061   25.75836   25.68442   26.44282   28.02878    10  a

## First Value
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, first),
               collapse = ffirst(cgGGDC10S, na.rm = FALSE))
# Unit: microseconds
#      expr      min         lq       mean     median        uq       max neval cld
#     dplyr 27466.06 28773.7895 30433.8506 29823.3625 31313.381 37054.127   100   b
#  collapse    60.69    74.9695   104.9933   111.1155   126.065   213.306   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, first),
               collapse = ffirst(cgdata, na.rm = FALSE), times = 10)
# Unit: milliseconds
#      expr        min          lq        mean      median          uq         max neval cld
#     dplyr 4878.13483 5045.077204 5289.121344 5196.811698 5375.949782 6093.325884    10   b
#  collapse    4.44953    4.526284    4.917643    4.608171    4.914073    6.536181    10  a

## Number of Distinct Values
# Small
microbenchmark(dplyr = summarise_all(gGGDC10S, n_distinct, na.rm = TRUE),
               collapse = fNdistinct(cgGGDC10S))
# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval cld
#     dplyr 370.935750 378.919332 399.013004 385.050768 399.750147 542.351303   100   b
#  collapse   1.276267   1.327139   1.500403   1.387606   1.523264   5.796303   100  a

# Large
microbenchmark(dplyr = summarise_all(gdata, n_distinct, na.rm = TRUE),
               collapse = fNdistinct(cgdata), times = 5)
# Unit: milliseconds
#      expr        min         lq       mean     median         uq        max neval cld
#     dplyr 74045.7033 74201.0000 74950.3911 74891.2018 75278.3091 76335.7411     5   b
#  collapse   317.5298   319.3764   326.7462   325.0972   329.0956   342.6321     5  a

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1950123 104.2    3670734 196.1  3670734 196.1
# Vcells 20493548 156.4   44612936 340.4 44612936 340.4

Below are some additional benchmarks for weighted aggregations and aggregations using the statistical mode, which cannot easily or efficiently be performed with dplyr.

## Weighted Mean
# Small
microbenchmark(fmean(cgGGDC10S, SUM)) 
# Unit: microseconds
#                   expr     min       lq     mean  median      uq     max neval
#  fmean(cgGGDC10S, SUM) 285.152 301.4405 326.9209 325.315 334.686 453.387   100

# Large 
microbenchmark(fmean(cgdata, SUM), times = 10) 
# Unit: milliseconds
#                expr      min       lq     mean   median       uq      max neval
#  fmean(cgdata, SUM) 51.64685 52.21269 53.04709 52.45746 53.25825 55.89691    10

## Weighted Standard-Deviation
# Small
microbenchmark(fsd(cgGGDC10S, SUM)) 
# Unit: microseconds
#                 expr     min      lq     mean   median      uq     max neval
#  fsd(cgGGDC10S, SUM) 433.752 441.115 468.5864 462.0895 476.592 673.387   100

# Large 
microbenchmark(fsd(cgdata, SUM), times = 10) 
# Unit: milliseconds
#              expr     min       lq     mean   median       uq      max neval
#  fsd(cgdata, SUM) 81.2134 82.15186 84.33088 83.91788 85.15689 89.66487    10

## Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S)) 
# Unit: milliseconds
#              expr      min       lq     mean   median       uq     max neval
#  fmode(cgGGDC10S) 1.587748 1.637504 1.679478 1.663387 1.697747 1.95367   100

# Large 
microbenchmark(fmode(cgdata), times = 10) 
# Unit: milliseconds
#           expr      min       lq     mean   median       uq      max neval
#  fmode(cgdata) 381.3137 391.2128 425.0835 404.3425 424.7772 549.6336    10

## Weighted Statistical Mode
# Small
microbenchmark(fmode(cgGGDC10S, SUM)) 
# Unit: milliseconds
#                   expr      min      lq     mean   median       uq      max neval
#  fmode(cgGGDC10S, SUM) 1.793468 1.84791 1.935726 1.890302 2.007889 2.626164   100

# Large 
microbenchmark(fmode(cgdata, SUM), times = 10) 
# Unit: milliseconds
#                expr      min       lq     mean   median       uq      max neval
#  fmode(cgdata, SUM) 471.0955 478.6665 491.8299 481.7983 515.7568 517.8626    10

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1949570 104.2    3670734 196.1  3670734 196.1
# Vcells 20490142 156.4   44612936 340.4 44612936 340.4

3.4 Transformation


## Replacing with group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, sum, na.rm = TRUE),
               collapse = fsum(cgGGDC10S, TRA = "replace_fill"))
# Unit: microseconds
#      expr      min       lq       mean    median        uq       max neval cld
#     dplyr 9422.508 9950.641 10949.9220 10214.373 10769.504 27304.075   100   b
#  collapse  291.399  340.710   366.2888   363.692   391.359   553.793   100  a

# Large
microbenchmark(dplyr = mutate_all(gdata, sum, na.rm = TRUE),
               collapse = fsum(cgdata, TRA = "replace_fill"), times = 10)
# Unit: milliseconds
#      expr        min         lq     mean     median        uq       max neval cld
#     dplyr 1152.99314 1200.45642 1273.494 1249.90417 1298.4063 1512.5563    10   b
#  collapse   58.45658   79.04419  116.200   94.37434  113.0353  329.2473    10  a

## Dividing by group sum
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x/sum(x, na.rm = TRUE)),
               collapse = fsum(cgGGDC10S, TRA = "/"))
# Unit: microseconds
#      expr      min       lq       mean    median         uq       max neval cld
#     dplyr 9622.872 10204.11 11141.4289 10644.109 11165.1020 23135.235   100   b
#  collapse  548.884   581.46   631.4444   619.615   652.6365  1326.693   100  a

# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x/sum(x, na.rm = TRUE)),
               collapse = fsum(cgdata, TRA = "/"), times = 10)
# Unit: milliseconds
#      expr       min        lq      mean    median        uq       max neval cld
#     dplyr 1241.6062 1398.6664 1530.9812 1556.4277 1632.4495 1861.7946    10   b
#  collapse  113.1424  121.9201  136.9692  133.0716  143.1489  196.7169    10  a

## Centering
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) x-mean.default(x, na.rm = TRUE)),
               collapse = fwithin(cgGGDC10S))
# Unit: microseconds
#      expr       min         lq       mean    median        uq       max neval cld
#     dplyr 12829.158 13595.3640 14916.2687 13954.146 14407.980 35788.124   100   b
#  collapse   310.589   348.7425   379.9439   374.625   400.507   700.608   100  a

# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) x-mean.default(x, na.rm = TRUE)),
               collapse = fwithin(cgdata), times = 10)
# Unit: milliseconds
#      expr        min         lq      mean    median        uq       max neval cld
#     dplyr 2205.50124 2646.09750 2706.8577 2763.9730 2831.4105 2870.2822    10   b
#  collapse   64.86068   90.56183  140.7654  112.0254  131.0632  374.7248    10  a

## Centering and Scaling (Standardizing)
# Small
microbenchmark(dplyr = mutate_all(gGGDC10S, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
               collapse = fscale(cgGGDC10S))
# Unit: microseconds
#      expr       min        lq       mean     median         uq       max neval cld
#     dplyr 31800.906 32996.848 36621.5260 34130.0930 37528.0415 57398.085   100   b
#  collapse   516.754   565.172   634.9027   596.1865   639.2495  1398.092   100  a

# Large
microbenchmark(dplyr = mutate_all(gdata, function(x) (x-mean.default(x, na.rm = TRUE))/sd(x, na.rm = TRUE)),
               collapse = fscale(cgdata), times = 2)
# Unit: milliseconds
#      expr       min        lq      mean    median       uq      max neval cld
#     dplyr 6435.4363 6435.4363 6704.2027 6704.2027 6972.969 6972.969     2   b
#  collapse  106.2894  106.2894  122.0602  122.0602  137.831  137.831     2  a

## Lag
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, dplyr::lag),
               collapse_unordered = flag(cgGGDC10S),
               dplyr_ordered = mutate_all(gGGDC10S, dplyr::lag, order_by = "Year"),
               collapse_ordered = flag(cgGGDC10S, t = Year))
# Unit: microseconds
#                expr        min         lq        mean      median         uq        max neval cld
#     dplyr_unordered  47350.386  48740.669  51035.0391  49742.4930  52168.069  68811.747   100  b 
#  collapse_unordered    375.294    442.231    468.3410    468.1135    490.426    582.799   100 a  
#       dplyr_ordered 128288.897 131983.154 136240.5114 134911.4270 139166.616 161151.871   100   c
#    collapse_ordered    322.637    375.964    408.4056    394.4825    421.257   1137.038   100 a

# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, dplyr::lag),
               collapse_unordered = flag(cgdata),
               dplyr_ordered = mutate_all(gdata, dplyr::lag, order_by = "Year"),
               collapse_ordered = flag(cgdata, t = Year), times = 2)
# Unit: milliseconds
#                expr         min          lq        mean      median          uq         max neval
#     dplyr_unordered  8975.91694  8975.91694  9083.08228  9083.08228  9190.24763  9190.24763     2
#  collapse_unordered    53.55188    53.55188    62.14437    62.14437    70.73686    70.73686     2
#       dplyr_ordered 25813.19249 25813.19249 26739.83750 26739.83750 27666.48252 27666.48252     2
#    collapse_ordered    96.89318    96.89318   106.13697   106.13697   115.38075   115.38075     2
#  cld
#   b 
#  a  
#    c
#  a

## First-Difference (unordered)
# Small
microbenchmark(dplyr_unordered = mutate_all(gGGDC10S, function(x) x - dplyr::lag(x)),
               collapse_unordered = fdiff(cgGGDC10S))
# Unit: microseconds
#                expr      min         lq       mean    median        uq       max neval cld
#     dplyr_unordered 62157.31 63476.4170 66662.2014 64777.896 68252.823 95926.613   100   b
#  collapse_unordered   386.45   460.5275   513.5147   497.789   528.803  1019.229   100  a

# Large
microbenchmark(dplyr_unordered = mutate_all(gdata, function(x) x - dplyr::lag(x)),
               collapse_unordered = fdiff(cgdata), times = 2)
# Unit: milliseconds
#                expr         min          lq        mean      median          uq         max neval
#     dplyr_unordered 12545.43710 12545.43710 13110.48386 13110.48386 13675.53062 13675.53062     2
#  collapse_unordered    64.10741    64.10741    73.32665    73.32665    82.54589    82.54589     2
#  cld
#    b
#   a

gc()
#            used  (Mb) gc trigger  (Mb) max used  (Mb)
# Ncells  1951949 104.3    3670735 196.1  3670735 196.1
# Vcells 21539057 164.4   65092104 496.7 65092101 496.7

Below are again some benchmarks for transformations that are not easily or efficiently performed with dplyr, such as centering on the overall mean, mean-preserving scaling, weighted scaling and centering, sequences of lags / leads, and (iterated) panel-differences and growth rates.

# Centering on overall mean
microbenchmark(fwithin(cgdata, mean = "overall.mean"), times = 10)
# Unit: milliseconds
#                                    expr      min       lq     mean   median       uq     max neval
#  fwithin(cgdata, mean = "overall.mean") 65.08112 83.53835 92.52791 95.56872 103.4503 115.175    10

# Weighted Centering
microbenchmark(fwithin(cgdata, SUM), times = 10)
# Unit: milliseconds
#                  expr      min       lq     mean  median      uq      max neval
#  fwithin(cgdata, SUM) 64.75179 65.81118 84.23177 81.5579 101.527 106.3509    10
microbenchmark(fwithin(cgdata, SUM, mean = "overall.mean"), times = 10)
# Unit: milliseconds
#                                         expr      min       lq     mean  median       uq      max
#  fwithin(cgdata, SUM, mean = "overall.mean") 67.02944 67.83268 99.77674 87.0155 103.1862 233.4667
#  neval
#     10

# Weighted Scaling and Standardizing
microbenchmark(fsd(cgdata, SUM, TRA = "/"), times = 10)
# Unit: milliseconds
#                         expr      min       lq     mean  median       uq     max neval
#  fsd(cgdata, SUM, TRA = "/") 143.8683 156.2623 164.8902 167.915 170.7149 185.994    10
microbenchmark(fscale(cgdata, SUM), times = 10)
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval
#  fscale(cgdata, SUM) 97.76158 101.3057 117.8158 112.4665 135.0901 144.7045    10

# Sequence of lags and leads
microbenchmark(flag(cgdata, -1:1), times = 10)
# Unit: milliseconds
#                expr      min       lq     mean   median       uq      max neval
#  flag(cgdata, -1:1) 72.85832 110.3707 200.5462 231.0337 250.2536 280.4177    10

# Iterated difference
microbenchmark(fdiff(cgdata, 1, 2), times = 10)
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq    max neval
#  fdiff(cgdata, 1, 2) 61.83958 65.19715 85.75602 86.90238 99.94373 110.68    10

# Growth Rate
microbenchmark(fgrowth(cgdata,1), times = 10)
# Unit: milliseconds
#                expr      min       lq     mean   median       uq      max neval
#  fgrowth(cgdata, 1) 64.35151 79.62878 90.42032 96.65288 101.8938 110.4622    10

References

Timmer, M. P., de Vries, G. J., & de Vries, K. (2015). “Patterns of Structural Change in Developing Countries.” In J. Weiss & M. Tribe (Eds.), Routledge Handbook of Industry and Development (pp. 65-83). Routledge.

Cochrane, D. & Orcutt, G. H. (1949). “Application of Least Squares Regression to Relationships Containing Auto-Correlated Error Terms”. Journal of the American Statistical Association. 44 (245): 32–61.

Prais, S. J. & Winsten, C. B. (1954). “Trend Estimators and Serial Correlation”. Cowles Commission Discussion Paper No. 383. Chicago.


  1. Row-wise operations are not supported by TRA.