# collapse 1.8.9

• Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).

• .pseries / .indexed_series methods now also update the implicit class of the vector (attached after "pseries") if the data type changed. E.g. calling a function like fgrowth on an integer pseries changes the data type to double, but previously the “integer” class was still attached after “pseries”.

• Fixed bad testing for SE inputs in fgroup_by() and findex_by(). See #320.

• Added rsplit.matrix method.

• descr() now by default also reports 10% and 90% quantiles for numeric variables (in line with STATA’s detailed summary statistics), and can also be applied to ‘pseries’ / ‘indexed_series’. Furthermore, descr() itself now has an argument stepwise such that descr(big_data, stepwise = TRUE) yields computation of summary statistics on a variable-by-variable basis (and the finished ‘descr’ object is returned invisibly). The printed result is thus identical to print(descr(big_data), stepwise = TRUE), with the difference that the latter first does the entire computation whereas the former computes statistics on demand.
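A minimal sketch of the stepwise interface, assuming collapse is attached (wlddev is the example dataset shipped with the package):

```r
library(collapse)

# Statistics are computed and printed one variable at a time;
# the finished 'descr' object is returned invisibly.
d <- descr(wlddev, stepwise = TRUE)

# Same printed output, but the entire computation is done upfront:
print(descr(wlddev), stepwise = TRUE)
```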

• Function ss() has a new argument check = TRUE. Setting check = FALSE allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.
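A sketch of the unchecked fast path (only safe when the indices are known to be valid positive integers):

```r
library(collapse)

# Standard checked subsetting of rows 3-5, columns 1 and 4
s1 <- ss(mtcars, 3:5, c(1L, 4L))

# check = FALSE skips validation of the indices: faster in tight loops,
# but out-of-range or non-positive indices are the caller's responsibility
s2 <- ss(mtcars, 3:5, c(1L, 4L), check = FALSE)
identical(s1, s2)
```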

• Function get_vars() has a new argument rename allowing select-renaming of columns in standard evaluation programming, e.g. get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE). The default is rename = FALSE, to warrant full backwards compatibility. See #327.

• Added helper function setattrib(), to set a new attribute list for an object by reference + invisible return. This is different from the existing function setAttrib() (note the capital A), which takes a shallow copy of list-like objects and returns the result.

# collapse 1.8.8

• flm and fFtest are now internal generics with an added formula method, e.g. flm(mpg ~ hp + carb, mtcars, weights = wt) or fFtest(mpg ~ hp + carb | vs + am, mtcars, weights = wt), in addition to the programming interface. Thanks to Grant McDermott for suggesting this.

• Added method as.data.frame.qsu, to efficiently turn the default array outputs from qsu() into tidy data frames.
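For example (a sketch; qsu with a by argument returns a 3D array of statistics per group):

```r
library(collapse)

s <- qsu(mtcars, by = ~ cyl)   # 3D array: variables x statistics x groups
head(as.data.frame(s))         # tidy data frame, one row per variable/group
```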

• Major improvements to setv and copyv, generalizing the scope of operations that can be performed to all common cases. This means that even simple base R operations such as X[v] <- R can now be done significantly faster using setv(X, v, R).
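A sketch of the replacement-by-reference pattern (note that setv modifies its first argument in place):

```r
library(collapse)

x <- rnorm(1e6)
v <- x < 0           # logical selector

y <- x
y[v] <- 0            # base R: allocates intermediate vectors

setv(x, v, 0)        # modifies x in place, no intermediate allocations
all.equal(x, y)      # TRUE
```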

• n and qtab can now be added to options("collapse_mask") e.g. options(collapse_mask = c("manip", "helper", "n", "qtab")). This will export a function n() to get the (group) count in fsummarise and fmutate (which can also always be done using GRPN() but n() is more familiar to dplyr users), and will mask table() with qtab(), which is principally a fast drop-in replacement, but with some different further arguments.

• Added C-level helper function all_funs, which fetches all the functions called in an expression, similar to setdiff(all.names(x), all.vars(x)) but better because it takes account of the syntax. For example let x = quote(sum(sum)) i.e. we are summing a column named sum. Then all.names(x) = c("sum", "sum") and all.vars(x) = "sum" so that the difference is character(0), whereas all_funs(x) returns "sum". This function makes collapse smarter when parsing expressions in fsummarise and fmutate and deciding which ones to vectorize.
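The sum(sum) example, as a runnable sketch:

```r
library(collapse)

x <- quote(sum(sum))                 # summing a column named 'sum'
setdiff(all.names(x), all.vars(x))   # character(0): the names cancel out
all_funs(x)                          # "sum": only the function position counts
```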

# collapse 1.8.7

• Fixed a bug in fscale.pdata.frame where the default C++ method was being called instead of the list method (i.e. the method didn’t work at all).

• Fixed 2 minor rchk issues (the remaining ones are spurious).

• fsum has an additional argument fill (default fill = FALSE). Setting fill = TRUE initializes the result vector with 0 instead of NA when na.rm = TRUE, so that fsum(NA, fill = TRUE) gives 0 like base::sum(NA, na.rm = TRUE).
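A short sketch of the difference:

```r
library(collapse)

fsum(c(NA_real_, NA_real_))               # NA: all values missing
fsum(c(NA_real_, NA_real_), fill = TRUE)  # 0, like sum(..., na.rm = TRUE)
fsum(c(1, NA, 2), fill = TRUE)            # 3: unchanged where data exist
```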

• Slight performance increase in fmean with groups if na.rm = TRUE (the default).

• Significant performance improvement when using base R expressions involving multiple functions and one column e.g. mid_col = (min(col) + max(col)) / 2 or lorentz_col = cumsum(sort(col)) / sum(col) etc. inside fsummarise and fmutate. Instead of evaluating such expressions on a data subset of one column for each group, they are now turned into a function e.g. function(x) cumsum(sort(x)) / sum(x) which is applied to a single vector split by groups.

• fsummarise now also adds groupings to transformation functions and operators, which allows full vectorization of more complex tasks involving transformations which are subsequently aggregated. A prime example is grouped bivariate linear model fitting, which can now be done using mtcars |> fgroup_by(cyl) |> fsummarise(slope = fsum(W(mpg), hp) / fsum(W(mpg)^2)). Before 1.8.7 it was necessary to do a mutate step first e.g. mtcars |> fgroup_by(cyl) |> fmutate(dm_mpg = W(mpg)) |> fsummarise(slope = fsum(dm_mpg, hp) / fsum(dm_mpg^2)), because fsummarise did not add groupings to transformation functions like fwithin/W. Thanks to Brodie Gaslam for making me aware of this.

• Argument return.groups from GRP.default is now also available in fgroup_by, allowing grouped data frames without materializing the unique grouping columns. This allows more efficient mutate-only operations, e.g. mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmutate(across(hp:carb, fscale)). Similarly, for aggregation with dropping of grouping columns, mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmean() is equivalent to, but faster than, mtcars |> fgroup_by(cyl) |> fmean(keep.group_vars = FALSE).

# collapse 1.8.6

• Fixed further minor issues:
• some inline functions in TRA.c needed to be declared ‘static’ to be local in scope (#275)
• timeid.Rd now uses zoo package conditionally and limits size of printout

# collapse 1.8.5

• Fixed some issues flagged by CRAN:
• Installation on some Linux distributions failed because omp.h was included after Rinternals.h
• Some signed integer overflows while running tests caused UBSAN warnings. (This happened inside a hash function where overflows are not a problem. I changed to unsigned int to avoid the UBSAN warning.)
• Ensured that package passes R CMD Check without suggested packages

# collapse 1.8.4

• Makevars text substitution hack to have CRAN accept a package that combines C, C++ and OpenMP. Thanks also to @MichaelChirico for pointing me in the right direction.

# collapse 1.8.3

• Significant speed improvement in qF/qG (factor-generation) for character vectors with more than 100,000 obs and many levels if sort = TRUE (the default). For details see the method argument of ?qF.

• Optimizations in fmode and fndistinct for singleton groups.

# collapse 1.8.2

• Fixed some rchk issues found by Thomas Kalibera from CRAN.

• Faster funique.default method.

• group now also internally optimizes on ‘qG’ objects.

# collapse 1.8.1

• Added function fnunique (yet another alternative to data.table::uniqueN, kit::uniqLen or dplyr::n_distinct, and principally a simple wrapper for attr(group(x), "N.groups")). At present fnunique generally outperforms the others on data frames.
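A sketch showing the equivalence stated above:

```r
library(collapse)

x <- sample.int(1000L, 1e6, replace = TRUE)
fnunique(x)                  # number of distinct values
attr(group(x), "N.groups")   # same result: what fnunique principally wraps
```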

• finteraction has an additional argument factor = TRUE. Setting factor = FALSE returns a ‘qG’ object, which is more efficient if only an integer id, but no factor object itself, is required.

• Operators (see .OPERATOR_FUN) have been improved a bit such that id-variables selected in the .data.frame (by, w or t arguments) or .pdata.frame methods (variables in the index) are not computed upon even if they are numeric (since the default is cols = is.numeric). In general, if cols is a function used to select columns of a certain data type, id variables are excluded from computation even if they are of that data type. It is still possible to compute on id variables by explicitly selecting them using names or indices passed to cols, or including them in the lhs of a formula passed to by.

• Further efforts to facilitate adding the group-count in fsummarise and fmutate:

• if options(collapse_mask = "all") is set before loading the package, an additional function n() is exported that works just like dplyr::n().
• otherwise the same can now always be done using GRPN(). The previous uses of GRPN are unaltered i.e. GRPN can also:
• fetch group sizes directly from a grouping object or grouped data frame, i.e. data |> gby(id) |> GRPN() or data %>% gby(id) %>% ftransform(N = GRPN(.)) (note the dot).
• compute group sizes on the fly, for example fsubset(data, GRPN(id) > 10L) or fsubset(data, GRPN(list(id1, id2)) > 10L) or GRPN(data, by = ~ id1 + id2).

# collapse 1.8.0

collapse 1.8.0, released in mid-May 2022, brings enhanced support for indexed computations on time series and panel data by introducing flexible ‘indexed_frame’ and ‘indexed_series’ classes and surrounding infrastructure, makes a modest start on OpenMP multithreading as well as data transformation by reference in statistical functions, and enhances the package’s descriptive statistics toolset.

### Changes to functionality

• Functions Recode and replace_non_finite, deprecated since collapse v1.1.0, and is.regular, deprecated since collapse v1.5.1 and clashing with a more important function in the zoo package, are now removed.

• Fast Statistical Functions operating on numeric data (such as fmean, fmedian, fsum, fmin, fmax, …) now preserve attributes in more cases. Previously these functions did not preserve attributes for simple computations using the default method, and only preserved attributes in grouped computations if !is.object(x) (see NEWS section for collapse 1.4.0). This meant that fmin and fmax did not preserve the attributes of Date or POSIXct objects, and none of these functions preserved ‘units’ objects (used a lot by the sf package). Now, attributes are preserved if !inherits(x, "ts"), that is, the new default of these functions is to generally keep attributes, except for ‘ts’ objects where doing so obviously causes an unwanted error (note that ‘xts’ and others are handled by the matrix or data.frame method where other principles apply, see NEWS for 1.4.0). An exception is made for the functions fnobs and fndistinct, where the previous default is kept.

• Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) now internally process time objects passed to the t argument (where is.object(t) && is.numeric(unclass(t))) via a new function called timeid, which turns them into integer vectors based on the greatest common divisor (GCD) (see below). Previously such objects were converted to factor. This can change the behavior of code, e.g. a ‘Date’ variable representing monthly data may be regular when converted to factor, but is now irregular and regarded as daily data (with a GCD of 1) because of the different day counts of the months. Users should fix such code either by calling qG on the time variable (for grouping / factor conversion) or by using appropriate classes, e.g. zoo::yearmon. Note that plain numeric vectors where !is.object(t) are still used directly for indexation without passing them through timeid (which can still be applied manually if desired).

• BY now has an argument reorder = TRUE, which reorders the result to match the original order of rows if NROW(result) == NROW(x) (like fmutate). Previously the result was just in order of the groups, regardless of the length of the output. To obtain the former outcome users need to set reorder = FALSE.

• options("collapse_DT_alloccol") was removed, the default is now fixed at 100. The reason is that data.table automatically expands the range of overallocated columns if required (so the option is not really necessary), and calling R options from C slows down C code and can cause problems in parallel code.

### Bug Fixes

• Fixed a bug in fcumsum that caused a segfault during grouped operations on larger data, due to flawed internal memory allocation. Thanks @Gulde91 for reporting #237.

• Fixed a bug in across caused by two function(x) statements being passed in a list e.g. mtcars |> fsummarise(acr(mpg, list(ssdd = function(x) sd(x), mu = function(x) mean(x)))). Thanks @trang1618 for reporting #233.

• Fixed an issue in across() when logical vectors were used to select columns on grouped data, e.g. mtcars %>% gby(vs, am) %>% smr(acr(startsWith(names(.), "c"), fmean)) now works without error.

• qsu gives proper output for length 1 vectors e.g. qsu(1).

• collapse depends on R > 3.3.0, due to the use of newer C-level macros introduced then. The earlier indication of R > 2.1.0 was only based on R-level code and misleading. Thanks @ben-schwen for reporting #236. I will try to maintain this dependency for as long as possible, without being too restrained by development in R’s C API and the ALTREP system in particular, which collapse might utilize in the future.

• Introduction of ‘indexed_frame’, ‘indexed_series’ and ‘index_df’ classes: fast and flexible indexed time series and panel data classes that inherit from plm’s ‘pdata.frame’, ‘pseries’ and ‘pindex’ classes. These classes take full advantage of collapse’s computational infrastructure, are class-agnostic, i.e. they can be superimposed upon any data frame or vector / matrix-like object while maintaining most of the functionality of that object, support both time series and panel data, natively handle irregularity, and support ad-hoc computations inside arbitrary data masking functions and model formulas. This infrastructure comprises additional functions and methods, and modification of some existing functions and ‘pdata.frame’ / ‘pseries’ methods.

• New functions: findex_by/iby, findex/ix, unindex, reindex, is_irregular, to_plm.

• New methods: [.indexed_series, [.indexed_frame, [<-.indexed_frame, $.indexed_frame, $<-.indexed_frame, [[.indexed_frame, [[<-.indexed_frame, [.index_df, fsubset.pseries, fsubset.pdata.frame, funique.pseries, funique.pdata.frame, roworder(v) (internal), na_omit (internal), print.indexed_series, print.indexed_frame, print.index_df, Math.indexed_series, Ops.indexed_series.

• Modification of ‘pseries’ and ‘pdata.frame’ methods for functions flag/L/F, fdiff/D/Dlog, fgrowth/G, fcumsum, psmat, psacf/pspacf/psccf, fscale/STD, fbetween/B, fwithin/W, fhdbetween/HDB, fhdwithin/HDW, qsu and varying to take advantage of ‘indexed_frame’ and ‘indexed_series’ while continuing to work as before with ‘pdata.frame’ and ‘pseries’.

For more information and details see help("indexing").

• Added function timeid: Generation of an integer-id/time-factor from time or date sequences represented by integer or double vectors (such as ‘Date’, ‘POSIXct’, ‘ts’, ‘yearmon’, ‘yearquarter’ or plain integers / doubles) by a numerically quite robust greatest common divisor method (see below). This function is used internally in findex_by, reindex and also in evaluation of the t argument to functions like flag/fdiff/fgrowth whenever is.object(t) && is.numeric(unclass(t)) (see also note above).

• Programming helper function vgcd to efficiently compute the greatest common divisor from a vector of positive integer or double values (which should ideally be unique and sorted as well; timeid uses vgcd(sort(unique(diff(sort(unique(na_rm(x)))))))). Precision for doubles is up to 6 digits.
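A sketch of the GCD step underlying timeid, on a small made-up time vector:

```r
library(collapse)

x <- c(4, 12, 20, 34)        # e.g. unique, sorted integer time points
steps <- diff(x)             # gaps between observations: 8, 8, 14
vgcd(sort(unique(steps)))    # 2: the implied regular time step
```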

• Programming helper function frange: A significantly faster alternative to base::range, which calls both min and max. Note that frange inherits collapse’s global na.rm = TRUE default.
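A short sketch contrasting the defaults:

```r
library(collapse)

x <- c(3.5, NA, -2, 7)
frange(x)              # -2 7: na.rm = TRUE is the collapse default
base::range(x)         # NA NA, unless na.rm = TRUE is passed explicitly
```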

• Added function qtab/qtable: A versatile and computationally more efficient alternative to base::table. Notably, it also supports tabulations with frequency weights, and computation of a statistic over combinations of variables. Objects are of class ‘qtab’ that inherits from ‘table’. Thus all ‘table’ methods apply to it.
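A sketch of plain and weighted tabulation:

```r
library(collapse)

qtab(mtcars$cyl, mtcars$vs)                  # plain frequency table

# With frequency weights: cell entries are sums of wt within each cell
qtab(mtcars$cyl, mtcars$vs, w = mtcars$wt)
```

Since the result inherits from ‘table’, methods like prop.table() apply directly.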

### Improvements

• Full data.table support, including reference semantics (set*, :=)!! There is some complex C-level programming behind data.table’s operations by reference. Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the data.table, and an ".internal.selfref" attribute containing an external pointer is used to check if any shallow copy was made using base R commands like <-. This is done to avoid even a shallow copy of the data.table in manipulations using := (and is in my opinion not worth it, as even large tables are shallow copied by base R (>= 3.1.0) within microseconds, and all of this complicates development immensely). Previously, collapse treated data.tables like any other data frame, using shallow copies in manipulations and preserving the attributes (thus ignoring how data.table works internally). This produced a warning whenever you wanted to use data.table reference semantics (set*, :=) after passing the data.table through a collapse function such as collap, fselect, fsubset, fgroup_by etc. From v1.6.0, I have adopted essential C code from data.table to do the overallocation and generate the ".internal.selfref" attribute, thus seamless workflows combining collapse and data.table are now possible. This comes at a cost of about 2-3 microseconds per function, as to do this I have to shallow copy the data.table again and add extra column pointers and an ".internal.selfref" attribute telling data.table that this table was not copied (it seems to be the only way to do it for now). This integration encompasses all data manipulation functions in collapse, but not the Fast Statistical Functions themselves.
Thus you can do agDT <- DT %>% fselect(id, col1:coln) %>% collap(~id, fsum); agDT[, newcol := 1], but you would need to add a qDT call after a function like fsum if you want to use reference semantics without incurring a warning: agDT <- DT %>% fselect(id, col1:coln) %>% fgroup_by(id) %>% fsum %>% qDT; agDT[, newcol := 1]. collapse appears to be the first package that attempts to account for data.table’s internal workings without importing data.table, and qDT is now the fastest way to create a fully functional data.table from any R object. A global option "collapse_DT_alloccol" was added to regulate how many columns collapse overallocates when creating data.tables. The default is 100, which is lower than the data.table default of 1024. This was done to increase the efficiency of the additional shallow copies, and may be changed by the user.

• Programming enabled with fselect and fgroup_by (you can now pass vectors containing column names or indices). Note that instead of fselect you should use get_vars for standard eval programming.

• fselect and fsubset support in-place renaming, e.g. fselect(data, newname = var1, var3:varN), fsubset(data, vark > varp, newname = var1, var3:varN).

• collap supports renaming columns in the custom argument, e.g. collap(data, ~ id, custom = list(fmean = c(newname = "var1", "var2"), fmode = c(newname = 3), flast = is_date)).

• Performance improvements: fsubset / ss return the data or perform a simple column subset without deep copying the data if all rows are selected through a logical expression. fselect and get_vars, num_vars etc. are slightly faster through data frame subsetting done fully in C. ftransform / fcompute use alloc instead of base::rep to replicate a scalar value which is slightly more efficient.

• fcompute now has a keep argument, to preserve several existing columns when computing columns on a data frame.

• replace_NA now has a cols argument, so we can do replace_NA(data, cols = is.numeric), to replace NA’s in numeric columns. I note that for big numeric data data.table::setnafill is the most efficient solution.

• fhdbetween and fhdwithin have an effect argument in plm methods, allowing centering on selected identifiers. The default is still to center on all panel identifiers.

• The plot method for panel series matrices and arrays plot.psmat was improved slightly. It now supports custom colours and drawing of a grid.

• settransform and settransformv can now be called without attaching the package, e.g. collapse::settransform(data, ...). Previously these errored when collapse was not loaded, because they are simply wrappers around data <- ftransform(data, ...). I’d like to note from a discussion that avoiding shallow copies with <- (e.g. via :=) does not appear to yield noticeable performance gains. Where data.table is faster on big data this mostly has to do with parallelism and sometimes with algorithms, generally not memory efficiency.

• Functions setAttrib, copyAttrib and copyMostAttrib only make a shallow copy of lists, not of atomic vectors (which amounts to doing a full copy and is inefficient). Thus atomic objects are now modified in-place.

• Small improvements: Calling qF(x, ordered = FALSE) on an ordered factor will remove the ordered class, the operators L, F, D, Dlog, G, B, W, HDB, HDW and functions like pwcor now work on unnamed matrices or data frames.

# collapse 1.5.3

• A test that occasionally fails on Mac is removed, and all unit testing is now removed from CRAN. collapse has close to 10,000 unit tests covering all central pieces of code. Half of these tests depend on generated data, and for some reasons there is always a test or two that occasionally fail on some operating system (usually not Windows), requiring me to submit a patch. This is not constructive to either the development or the use of this package, therefore tests are now removed from CRAN. They are still run on codecov.io, and every new release is thoroughly tested on Windows.

# collapse 1.5.2

### Changes to Functionality

• The first argument of ftransform was renamed to .data from X. This was done to enable the user to transform columns named “X”. For the same reason the first argument of frename was renamed to .x from x (not .data, to make it explicit that .x can be any R object with a “names” attribute). It is not possible to deprecate X and x without at the same time undoing the benefits of the argument renaming, thus this change is immediate and code-breaking in rare cases where the first argument is explicitly set.

• The function is.regular to check whether an R object is atomic or list-like is deprecated and will be removed before the end of the year. This was done to avoid a namespace clash with the zoo package (#127).

### Bug Fixes

• unlist2d produced a subsetting error if an empty list was present in the list-tree. This is now fixed; empty or NULL elements in the list-tree are simply ignored (#99).

### Additions

• A function fsummarise was added to facilitate translating dplyr / data.table code to collapse. Like collap, it is only very fast when used with the Fast Statistical Functions.

• A function t_list is made available to efficiently transpose lists of lists.

### Improvements

• C files are compiled -O3 on Windows, which gives a boost of around 20% for the grouping mechanism applied to character data.

# collapse 1.5.1

A small patch for 1.5.0 that:

• Fixes a numeric precision issue when grouping doubles (e.g. before qF(wlddev$LIFEEX) gave an error, now it works). • Fixes a minor issue with fhdwithin when applied to pseries and fill = FALSE. # collapse 1.5.0 collapse 1.5.0, released early January 2021, presents important refinements and some additional functionality. ### Back to CRAN • I apologize for inconveniences caused by the temporal archival of collapse from December 19, 2020. This archival was caused by the archival of the important lfe package on the 4th of December. collapse depended on lfe for higher-dimensional centering, providing the fhdbetween / fhdwithin functions for generalized linear projecting / partialling out. To remedy the damage caused by the removal of lfe, I had to rewrite fhdbetween / fhdwithin to take advantage of the demeaning algorithm provided by fixest, which has some quite different mechanics. Beforehand, I made some significant changes to fixest::demean itself to make this integration happen. The CRAN deadline was the 18th of December, and I realized too late that I would not make this. A request to CRAN for extension was declined, so collapse got archived on the 19th. I have learned from this experience, and collapse is now sufficiently insulated that it will not be taken off CRAN even if all suggested packages were removed from CRAN. ### Bug Fixes • Segfaults in several Fast Statistical Functions when passed numeric(0) are fixed (thanks to @eshom and @acylam, #101). The default behavior is that all collapse functions return numeric(0) again, except for fnobs, fndistinct which return 0L, and fvar, fsd which return NA_real_. ### Changes to Functionality • Functions fhdwithin / HDW and fhdbetween / HDB have been reworked, delivering higher performance and greater functionality: For higher-dimensional centering and heterogeneous slopes, the demean function from the fixest package is imported (conditional on the availability of that package). 
The linear prediction and partialling out functionality is now built around flm and also allows for weights and different fitting methods. • In collap, the default behavior of give.names = "auto" was altered when used together with the custom argument. Before the function name was always added to the column names. Now it is only added if a column is aggregated with two different functions. I apologize if this breaks any code dependent on the new names, but this behavior just better reflects most common use (applying only one function per column), as well as STATA’s collapse. • For list processing functions like get_elem, has_elem etc. the default for the argument DF.as.list was changed from TRUE to FALSE. This means if a nested lists contains data frame’s, these data frame’s will not be searched for matching elements. This default also reflects the more common usage of these functions (extracting entire data frame’s or computed quantities from nested lists rather than searching / subsetting lists of data frame’s). The change also delivers a considerable performance gain. • Vignettes were outsourced to the website. This nearly halves the size of the source package, and should induce users to appreciate the built-in documentation. The website also makes for much more convenient reading and navigation of these book-style vignettes. ### Additions • Added a set of 10 operators %rr%, %r+%, %r-%, %r*%, %r/%, %cr%, %c+%, %c-%, %c*%, %c/% to facilitate and speed up row- and column-wise arithmetic operations involving a vector and a matrix / data frame / list. For example X %r*% v efficiently multiplies every row of X with v. Note that more advanced functionality is already provided in TRA(), dapply() and the Fast Statistical Functions, but these operators are intuitive and very convenient to use in matrix or matrix-style code, or in piped expressions. • Added function missing_cases (opposite of complete.cases and faster for data frame’s / lists). 
• Added function allNA for atomic vectors. • New vignette about using collapse together with data.table, available online. ### Improvements • Time series functions and operators flag / L / F, fdiff / D / Dlog and fgrowth / G now natively support irregular time series and panels, and feature a ‘complete approach’ i.e. values are shifted around taking full account of the underlying time-dimension! • Functions pwcor and pwcov can now compute weighted correlations on the pairwise or complete observations, supported by C-code that is (conditionally) imported from the weights package. • fFtest now also supports weights. • collap now provides an easy workaround to aggregate some columns using weights and others without. The user may simply append the names of Fast Statistical Functions with _uw to disable weights. Example: collapse::collap(mtcars, ~ cyl, custom = list(fmean_uw = 3:4, fmean = 8:10), w = ~ wt) aggregates columns 3 through 4 using a simple mean and columns 8 through 10 using the weighted mean. • The parallelism in collap using parallel::mclapply has been reworked to operate at the column-level, and not at the function level as before. It is still not available for Windows though. The default number of cores was set to mc.cores = 2L, which now gives an error on windows if parallel = TRUE. • function recode_char now has additional options ignore.case and fixed (passed to grepl), for enhanced recoding character data based on regular expressions. • rapply2d now has classes argument permitting more flexible use. • na_rm and some other internal functions were rewritten in C. na_rm is now 2x faster than x[!is.na(x)] with missing values and 10x faster without missing values. # collapse 1.4.2 • An improvement to the [.GRP_df method enabling the use of most data.table methods (such as :=) on a grouped data.table created with fgroup_by. • Some documentation updates by Kevin Tappe. 
# collapse 1.4.1 collapse 1.4.1 is a small patch for 1.4.0 that: • fixes clang-UBSAN and rchk issues in 1.4.0 (minor bugs in compiled code resulting, in this case, from trying to coerce a NaN value to integer, and failing to protect a shallow copy of a variable). • Adds a method [.GRP_df that allows robust subsetting of grouped objects created with fgroup_by (thanks to Patrice Kiener for flagging this). # collapse 1.4.0 collapse 1.4.0, released early November 2020, presents some important refinements, particularly in the domain of attribute handling, as well as some additional functionality. The changes make collapse smarter, more broadly compatible and more secure, and should not break existing code. ### Changes to Functionality • Deep Matrix Dispatch / Extended Time Series Support: The default methods of all statistical and transformation functions dispatch to the matrix method if is.matrix(x) && !inherits(x, "matrix") evaluates to TRUE. This specification avoids invoking the default method on classed matrix-based objects (such as multivariate time series of the xts / zoo class) not inheriting a ‘matrix’ class, while still allowing the user to manually call the default method on matrices (objects with implicit or explicit ‘matrix’ class). The change implies that collapse’s generic statistical functions are now well suited to transform xts / zoo and many other time series and matrix-based classes. • Fully Non-Destructive Piped Workflow: fgroup_by(x, ...) now only adds a class grouped_df, not classes table_df, tbl, grouped_df, and preserves all classes of x. This implies that workflows such as x %>% fgroup_by(...) %>% fmean etc. yields an object xAG of the same class and attributes as x, not a tibble as before. collapse aims to be as broadly compatible, class-agnostic and attribute preserving as possible. 
• Thorough and Controlled Object Conversions: Quick conversion functions qDF, qDT and qM now have additional arguments keep.attr and class providing precise user control over object conversions in terms of classes and other attributes assigned / maintained. The default (keep.attr = FALSE) yields hard conversions removing all but essential attributes from the object. E.g. before qM(EuStockMarkets) would just have returned EuStockMarkets (because is.matrix(EuStockMarkets) is TRUE) whereas now the time series class and ‘tsp’ attribute are removed. qM(EuStockMarkets, keep.attr = TRUE) returns EuStockMarkets as before. • Smarter Attribute Handling: Drawing on the guidance given in the R Internals manual, the following standards for optimal non-destructive attribute handling are formalized and communicated to the user: • The default and matrix methods of the Fast Statistical Functions preserve attributes of the input in grouped aggregations (‘names’, ‘dim’ and ‘dimnames’ are suitably modified). If inputs are classed objects (e.g. factors, time series, checked by is.object), the class and other attributes are dropped. Simple (non-grouped) aggregations of vectors and matrices do not preserve attributes, unless drop = FALSE in the matrix method. An exemption is made in the default methods of functions ffirst, flast and fmode, which always preserve the attributes (as the input could well be a factor or date variable). • The data frame methods are unaltered: All attributes of the data frame and columns in the data frame are preserved unless the computation result from each column is a scalar (not computing by groups) and drop = TRUE (the default). • Transformations with functions like flag, fwithin, fscale etc. are also unaltered: All attributes of the input are preserved in the output (regardless of whether the input is a vector, matrix, data.frame or related classed object). 
  The same holds for transformation options modifying the input (“-”, “-+”, “/”, “+”, “*”, “%%”, “-%%”) when using the TRA() function or the TRA = "..." argument to the Fast Statistical Functions.

  • For the TRA ‘replace’ and ‘replace_fill’ options, the data type of STATS is preserved, not that of x. This provides better results particularly with functions like fnobs and fndistinct. E.g. previously fnobs(letters, TRA = "replace") would have returned the observation counts coerced to character, because letters is character. Now the result is integer typed. For attribute handling this means that the attributes of x are preserved unless x is a classed object and the data types of x and STATS do not match. An exemption to this rule is made if x is a factor and an integer (non-factor) replacement is passed to STATS. In that case the attributes of x are copied exempting the ‘class’ and ‘levels’ attributes, e.g. so that fnobs(iris$Species, TRA = "replace") gives an integer vector, not a (malformed) factor. In the unlikely event that STATS is a classed object, the attributes of STATS are preserved and the attributes of x discarded.
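The type-preservation rules above can be illustrated briefly (a sketch, assuming collapse is attached):

```r
library(collapse)

# STATS (integer observation counts) determines the result type:
fnobs(letters, TRA = "replace")       # integer vector, not character

# Factor input with integer replacement: 'class' and 'levels' are dropped,
# so the result is a plain integer vector, not a malformed factor
fnobs(iris$Species, TRA = "replace")
```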

• Reduced Dependency Burden: The dependency on the lfe package was made optional. Functions fhdwithin / fhdbetween can only perform higher-dimensional centering if lfe is available. Linear prediction and centering with a single factor (among a list of covariates) is still possible without installing lfe. This change means that collapse now only depends on base R and Rcpp and is supported down to R version 2.10.

• Added function rsplit for efficient (recursive) splitting of vectors and data frames.

• Added function fdroplevels for very fast missing level removal + added argument drop to qF and GRP.factor, the default is drop = FALSE. The addition of fdroplevels also enhances the speed of the fFtest function.

• fgrowth supports annualizing / compounding growth rates through the added power argument.
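For example (a sketch; the compounding convention is assumed to be ((x_t / x_{t-n})^power - 1) * scale):

```r
library(collapse)

# Annualized month-on-month growth of a monthly series:
# ((x_t / x_{t-1})^12 - 1) * 100
g <- fgrowth(AirPassengers, n = 1, power = 12)
```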

• A function flm was added for bare-bones (weighted) linear regression fitting using different efficient methods: 4 from base R (.lm.fit, solve, qr, chol), fastLm from RcppArmadillo (if installed), or fastLm from RcppEigen (if installed).
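A sketch of the programming interface (the argument names w and method are assumed from this description):

```r
library(collapse)

y <- mtcars$mpg
X <- cbind(Intercept = 1, hp = mtcars$hp, carb = mtcars$carb)

# Weighted least-squares coefficients via the Cholesky method
flm(y, X, w = mtcars$wt, method = "chol")
```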

• Added function qTBL to quickly convert R objects to tibble.

• Helpers setAttrib, copyAttrib and copyMostAttrib are exported for fast attribute handling in R. Similar to attributes<-(), these functions return a shallow copy of the first argument with the set of attributes replaced, but they do not check attribute validity like attributes<-() does. This can yield large performance gains with big objects.
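A brief sketch of these helpers (no attribute validity checks are performed, so the user must ensure the attributes fit the object):

```r
library(collapse)

x <- structure(rnorm(10), units = "cm")

# Shallow copy of a vector with x's full attribute set replaced in:
y <- setAttrib(seq_len(10), attributes(x))

# Copy attributes from x to a new object (copyMostAttrib skips
# 'names', 'dim' and 'dimnames'):
z <- copyMostAttrib(seq_len(10), x)
```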

• Helper cinv added, wrapping the expression chol2inv(chol(x)) (efficient inverse of a symmetric, positive definite matrix via Choleski factorization).
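For instance (a small sketch):

```r
library(collapse)

X <- cbind(1, mtcars$hp, mtcars$wt)
XtX <- crossprod(X)      # symmetric, positive definite
cinv(XtX)                # equivalent to chol2inv(chol(XtX))
```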

• A shortcut gby is now available to abbreviate the frequently used fgroup_by function.

• A print method for grouped data frames of any class was added.

### Improvements

• Faster internal methods for factors for funique, fmode and fndistinct.
• The grouped_df methods for flag, fdiff, fgrowth now also support multiple time variables to identify a panel e.g. data %>% fgroup_by(region, person_id) %>% flag(1:2, list(month, day)).

• More security features for fsubset.data.frame / ss; ss is now an internal generic and also supports subsetting matrices.

• In some functions (like na_omit), passing double values (e.g. 1 instead of integer 1L) or negative indices to the cols argument produced an error or unexpected behavior. This is now fixed in all functions.

• Fixed a bug in helper function all_obj_equal occurring if objects are not all equal.

• Some performance improvements through increased use of pointers and C API functions.

# collapse 1.3.2

collapse 1.3.2, released mid September 2020:

• Fixed a small bug in fndistinct for grouped distinct value counts on logical vectors.

• Additional security for ftransform, which now efficiently checks the names of the data and replacement arguments for uniqueness, and also allows computing and transforming list-columns.

• Added function ftransformv to facilitate transforming selected columns with a function - a very efficient replacement for dplyr::mutate_if and dplyr::mutate_at.

• frename now allows additional arguments to be passed to a renaming function.

# collapse 1.3.1

collapse 1.3.1, released end of August 2020, is a patch for v1.3.0 that takes care of some unit test failures on certain operating systems (mostly because of numeric precision issues). It provides no changes to the code or functionality.

# collapse 1.3.0

collapse 1.3.0, released mid August 2020:

### Changes to Functionality

• dapply and BY now drop all unnecessary attributes if return = "matrix" or return = "data.frame" are explicitly requested (the default return = "same" still seeks to preserve the input data structure).

• unlist2d now saves integer rownames if row.names = TRUE and a list of matrices without rownames is passed, and id.factor = TRUE generates a normal factor not an ordered factor. It is however possible to write id.factor = "ordered" to get an ordered factor id.

• fdiff argument logdiff renamed to log, and taking logs is now done in R (reduces size of C++ code and does not generate as many NaN’s). logdiff may still be used, but it may be deactivated in the future. Also in the matrix and data.frame methods for flag, fdiff and fgrowth, columns are only stub-renamed if more than one lag/difference/growth rate is computed.

• Added fnth for fast (grouped, weighted) n’th element/quantile computations.

• Added roworder(v) and colorder(v) for fast row and column reordering.

• Added frename and setrename for fast and flexible renaming (by reference).

• Added function fungroup, as replacement for dplyr::ungroup, intended for use with fgroup_by.

• fmedian now supports weights, computing a decently fast (grouped) weighted median based on radix ordering.

• fmode now has the option to compute min and max mode, the default is still simply the first mode.

• fwithin now supports quasi-demeaning (added argument theta) and can thus be used to manually estimate random-effects models.

• funique is now generic with a default vector and data.frame method, providing fast unique values and rows of data. The default was changed to sort = FALSE.

• The shortcut gvr was created for get_vars(..., regex = TRUE).

• A helper .c was introduced for non-standard concatenation (i.e. .c(a, b) == c("a", "b")).

### Improvements

• fmode and fndistinct have become a bit faster.

• fgroup_by now preserves data.tables.

• ftransform now also supports a data.frame as replacement argument, which automatically replaces matching columns and adds unmatched ones. Also ftransform<- was created as a more formal replacement method for this feature.

• collap: columns selected through the cols argument are returned in the order selected if keep.col.order = FALSE. The argument sort.row is deprecated and replaced by the argument sort. In addition, the decreasing and na.last arguments were added and are handed down to GRP.default.

• radixorder ‘sorted’ attribute is now always attached.

• stats::D, which is masked when collapse is attached, is now preserved through the methods D.expression and D.call.

• GRP gained an option call = FALSE to omit the call to match.call, for a minor performance improvement.

• Several small performance improvements through rewriting some internal helper functions in C and reworking some R code.

• Performance improvements for some helper functions, setRownames / setColnames, na_insert etc.

• Increased scope of testing statistical functions. The functionality of the package is now secured by 7700 unit tests covering all central bits and pieces.

# collapse 1.2.1

collapse 1.2.1, released end of May 2020:

• Minor fixes for 1.2.0 issues that prevented correct installation on Mac OS X, and a vignette rebuilding error on Solaris.

• fmode.grouped_df with groups and weights now saves the sum of the weights instead of the max (this makes more sense as the max only applies if all elements are unique).

# collapse 1.2.0

collapse 1.2.0, released mid May 2020:

### Changes to Functionality

• grouped_df methods for fast statistical functions now always attach the grouping variables to the output in aggregations, unless argument keep.group_vars = FALSE. (Formerly, grouping variables were only attached if also present in the data. Code hinging on this feature should be adjusted.)

• qF ordered argument default was changed to ordered = FALSE, and the NA level is only added if na.exclude = FALSE. Thus qF now behaves exactly like as.factor.

• Recode is deprecated in favor of recode_num and recode_char and will be removed soon. Similarly, replace_non_finite was renamed to replace_Inf.

• In mrtl and mctl the argument ret was renamed return and now takes descriptive character arguments (the previous version was a direct C++ export and unsafe; code written with these functions should be adjusted).

• GRP argument order is deprecated in favor of argument decreasing. order can still be used but will be removed at some point.

### Bug Fixes

• Fixed a bug in flag where unused factor levels caused a group size error.

### Additions

• Added a suite of functions for fast data manipulation:

• fselect selects variables from a data frame and is equivalent but much faster than dplyr::select.
• fsubset is a much faster version of base::subset to subset vectors, matrices and data.frames. The function ss was also added as a faster alternative to [.data.frame.
• ftransform is a much faster update of base::transform, to transform data frames by adding, modifying or deleting columns. The function settransform does all of that by reference.
• fcompute is equivalent to ftransform but returns a new data frame containing only the columns computed from an existing one.
• na_omit is a much faster and enhanced version of base::na.omit.
• replace_NA efficiently replaces missing values in multi-type data.
• Added function fgroup_by as a much faster version of dplyr::group_by based on collapse grouping. It attaches a ‘GRP’ object to a data frame, but only works with collapse’s fast functions. This allows dplyr like manipulations that are fully collapse based and thus significantly faster, i.e. data %>% fgroup_by(g1,g2) %>% fselect(cola,colb) %>% fmean. Note that data %>% dplyr::group_by(g1,g2) %>% dplyr::select(cola,colb) %>% fmean still works, in which case the dplyr ‘group’ object is converted to ‘GRP’ as before. However data %>% fgroup_by(g1,g2) %>% dplyr::summarize(...) does not work.

• Added function varying to efficiently check the variation of multi-type data over a dimension or within groups.

• Added function radixorder, same as base::order(..., method = "radix") but more accessible and with built-in grouping features.

• Added functions seqid and groupid for generalized run-length type id variable generation from grouping and time variables. seqid in particular strongly facilitates lagging / differencing irregularly spaced panels using flag, fdiff etc.
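A short sketch of the two id generators (default arguments omitted):

```r
library(collapse)

# groupid: a new id whenever the value changes (run-length type ids)
groupid(c("a", "a", "b", "b", "a"))

# seqid: a new id whenever an integer sequence is interrupted,
# e.g. at gaps in the time variable of an irregularly spaced panel
seqid(c(1L, 2L, 3L, 5L, 6L))
```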

• fdiff now supports quasi-differences, i.e. $$x_t - \rho x_{t-1}$$, and quasi-log differences, i.e. $$log(x_t) - \rho log(x_{t-1})$$. An arbitrary $$\rho$$ can be supplied.
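A quasi-differencing sketch (assuming the coefficient is passed through an argument named rho):

```r
library(collapse)

x <- abs(rnorm(100)) + 1

fdiff(x, rho = 0.95)               # x_t - 0.95 * x_{t-1}
fdiff(x, rho = 0.95, log = TRUE)   # log(x_t) - 0.95 * log(x_{t-1})
```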

• Added a Dlog operator for faster access to log-differences.

### Improvements

• Faster grouping with GRP and faster factor generation with added radix method + automatic dispatch between hash and radix method. qF is now ~ 5x faster than as.factor on character and around 30x faster on numeric data. Also qG was enhanced.

• Further slight speed tweaks here and there.

• collap now provides more control for weighted aggregations with additional arguments w, keep.w and wFUN to aggregate the weights as well. The defaults are keep.w = TRUE and wFUN = fsum. A specialty of collap remains that keep.by and keep.w also work for externally passed objects, so code of the form collap(data, by, FUN, catFUN, w = data$weights) will now have an aggregated weights vector in the first column.
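For example (a sketch using mtcars, with wt serving as the weight column):

```r
library(collapse)

# Weighted aggregation by cyl: with the defaults keep.w = TRUE and
# wFUN = fsum, the aggregated weights appear as the first column
collap(mtcars, ~ cyl, fmean, w = ~ wt)
```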

• qsu now also allows weights to be passed in formula i.e. qsu(data, by = ~ group, pid = ~ panelid, w = ~ weights).

• fgrowth has a scale argument, the default is scale = 100 which provides growth rates in percentage terms (as before), but this may now be changed.

• All statistical and transformation functions now have a hidden list method, so they can be applied to unclassed list-objects as well. An error is however provided in grouped operations with unequal-length columns.

# collapse 1.1.0

collapse 1.1.0, released early April 2020:

• Fixed remaining gcc10, LTO and valgrind issues in C/C++ code, and added some more tests (there are now ~ 5300 tests ensuring that collapse statistical functions perform as expected).

• Fixed the issue that supplying an unnamed list to GRP(), i.e. GRP(list(v1, v2)) would give an error. Unnamed lists are now automatically named ‘Group.1’, ‘Group.2’, etc…

• Fixed an issue where, when aggregating by a single id in collap() (i.e. collap(data, ~ id1)), the id would be coded as a factor in the aggregated data.frame. All variables including ids now retain their class and attributes in the aggregated data.

• Added weights (w) argument to fsum and fprod.

• Added an argument mean = 0 to fwithin / W. This allows simple and grouped centering on an arbitrary mean, 0 being the default. For grouped centering, mean = "overall.mean" can be specified, which centers the data on the overall mean. The logical argument add.global.mean = TRUE used to toggle this in collapse 1.0.0 is therefore deprecated.

• Added arguments mean = 0 (the default) and sd = 1 (the default) to fscale / STD. These arguments allow (grouped) scaling and centering of data to an arbitrary mean and standard deviation. Setting mean = FALSE just scales the data while preserving the mean(s). Special options for grouped scaling are mean = "overall.mean" (same as fwithin / W) and sd = "within.sd", which scales the data such that the standard deviation of each group equals the within-standard deviation (the standard deviation computed on the group-centered data). Thus group-scaling a panel dataset with mean = "overall.mean" and sd = "within.sd" harmonizes the data across all groups in terms of both mean and variance. The fast algorithm for variance calculation toggled with stable.algo = FALSE was removed from fscale. Welford's numerically stable algorithm, used by default, is fast enough for all practical purposes. The fast algorithm is still available for fvar and fsd.
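A sketch of harmonizing panel groups in both mean and variance:

```r
library(collapse)

# Scale mpg within cyl groups so that every group has the overall mean
# of mpg and the within-group (group-centered) standard deviation
fscale(mtcars$mpg, g = mtcars$cyl, mean = "overall.mean", sd = "within.sd")
```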

• Added the modulus (%%) and subtract modulus (-%%) operations to TRA().

• Added the function finteraction, for fast interactions, and as_character_factor to coerce a factor, or all factors in a list, to character (analogous to as_numeric_factor). Also exported the function ckmatch, for matching with error message showing non-matched elements.

# collapse 1.0.0 and earlier

• First version of the package featuring only the functions collap and qsu based on code shared by Sebastian Krantz on R-devel, February 2019.

• Major rework of the package using Rcpp and data.table internals, introduction of fast statistical functions and operators and expansion of the scope of the package to a broad set of data transformation and exploration tasks. Several iterations of enhancing speed of R code. Seamless integration of collapse with dplyr, plm and data.table. CRAN release of collapse 1.0.0 on 19th March 2020.