--- title: "Working with Expressions: Helper Functions for Advanced Data Manipulation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Working with Expressions: Helper Functions for Advanced Data Manipulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", echo = TRUE, warning = FALSE, message = FALSE ) ``` ## Introduction Three blocks accept expressions: **expression filter**, **mutate**, and **summarize**. Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers: - Common functions for mutate and summarize - Helper functions: `across()`, `where()`, `if_any()`, `if_all()`, `pick()` - Column selection helpers - Practical examples See also: [dplyr column-wise operations](https://dplyr.tidyverse.org/articles/colwise.html) and [dplyr window functions](https://dplyr.tidyverse.org/articles/window-functions.html). ## Expression Blocks **Expression filter block**: Logical expressions to keep rows - Expressions return TRUE/FALSE for each row - Use comparison operators: `>`, `<`, `==`, `!=`, `>=`, `<=` - Combine with `&` (AND) or `|` (OR) - Helpers: `if_any()`, `if_all()` **Mutate block**: Create or modify columns - Perform calculations on existing columns - Use arithmetic, logs, lags, cumulative functions - Helper: `across()` to transform multiple columns **Summarize block**: Aggregate data - Reduce rows to summary statistics - Use aggregation functions: `mean()`, `sum()`, `n()` - Helpers: `across()` for multiple columns, `pick()` for custom functions ## Useful Functions for Mutate ### Arithmetic - `+`, `-`, `*`, `/`, `^` - basic operations - `%/%` - integer division - `%%` - remainder (modulo) ### Logs and exponentials - `log()`, `log2()`, `log10()` - logarithms - `exp()` - exponential - `sqrt()` - square root ### Offsets - `lag()` - previous values - `lead()` - next values ### Cumulative aggregates - `cumsum()` - cumulative sum - `cummean()` - cumulative mean - `cummin()`, `cummax()` - cumulative min/max - `cumprod()` - cumulative product ### Ranking - `row_number()` - sequential ranks (1, 2, 3, 4) - `min_rank()` - ranks with ties (1, 2, 2, 4) - `dense_rank()` - ranks without gaps (1, 2, 2, 3) - `percent_rank()` - percentile (0 to 1) - `ntile(n)` - divide into n bins ### Logical comparisons - `<`, `>`, `<=`, `>=`, `==`, `!=` - `&` (and), `|` (or), `!` (not) - `%in%` - test membership ### Examples ``` # Calculate percentage across(c(hp, wt), \(x) x / sum(x) * 100) # Lag differences mpg - lag(mpg) # Cumulative sums by group (use by parameter) cumsum(sales) # Rank values min_rank(desc(hp)) ``` See [dplyr window functions](https://dplyr.tidyverse.org/articles/window-functions.html) for more. ## Useful Functions for Summarize ### Center - `mean()` - average - `median()` - middle value ### Spread - `sd()` - standard deviation - `IQR()` - interquartile range - `mad()` - median absolute deviation ### Range - `min()`, `max()` - minimum and maximum - `quantile(x, 0.25)` - percentiles ### Position - `first()`, `last()`, `nth(x, 2)` - positional values ### Count - `n()` - count rows - `n_distinct()` - count unique values - `sum(!is.na(x))` - count non-missing ### Sums and products - `sum()` - total - `prod()` - product ### Examples ``` # Basic statistics across(where(is.numeric), list(mean = mean, sd = sd)) # Count by group (use by parameter) n() # Multiple stats list( avg = mean(hp), min = min(hp), max = max(hp), count = n() ) ``` For handling missing values, add `na.rm = TRUE`: ``` mean(hp, na.rm = TRUE) across(where(is.numeric), \(x) mean(x, na.rm = TRUE)) ``` See [dplyr summarise](https://dplyr.tidyverse.org/reference/summarise.html) for more. ## Column Selection Helpers Select columns by name pattern or type (used inside `across()`, `if_any()`, `if_all()`): - `everything()` - all columns - `starts_with("prefix")` - columns starting with prefix - `ends_with("suffix")` - columns ending with suffix - `contains("text")` - columns containing text - `where(is.numeric)` - columns by type (`is.character`, `is.factor`) - `c(col1, col2)` - specific columns Combine selections: ``` c(starts_with("Sepal"), ends_with("Width")) where(is.numeric) & starts_with("x") ``` ## The across() Function Apply the same operation to multiple columns. **Syntax**: `across(.cols, .fns, .names = NULL)` - `.cols` - which columns (use selection helpers) - `.fns` - function(s) to apply - `.names` - control output names (default: `{.col}_{.fn}`) ### In Mutate Block Transform multiple columns: ``` # Round all numeric columns across(where(is.numeric), round) # Scale to 0-1 range across(c(mpg, hp, wt), \(x) x / max(x)) # Log transform with custom names across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}") ``` Use `\(x)` to create anonymous functions where `x` represents the current column. ### In Summarize Block Calculate statistics for multiple columns: ``` # Mean of all numeric columns across(where(is.numeric), mean) # Multiple functions across(c(hp, wt), list(mean = mean, sd = sd)) # With grouping (use by parameter) across(everything(), n_distinct) ``` ### Custom names ``` # Default: col_fn across(c(mpg, hp), list(mean = mean, sd = sd)) # Result: mpg_mean, mpg_sd, hp_mean, hp_sd # Custom: fn.col across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}") # Result: mean.mpg, sd.mpg, mean.hp, sd.hp ``` ## The pick() Function Select columns as a data frame for custom functions. **Syntax**: `pick(.cols)` Use in summarize block with custom functions that need a data frame: ``` # With custom function calc_stats(pick(everything())) # Select specific columns my_function(pick(c(hp, wt, mpg))) ``` ### Relationship to unpack parameter When your expression returns a data frame, use the **unpack** option: - **unpack = FALSE** (default): Result is a single list-column - **unpack = TRUE**: Columns are spread into separate columns Example: ``` # Custom function that returns data frame calc_stats <- function(df) { data.frame(mean_x = mean(df$x), sd_x = sd(df$x)) } # In summarize block with unpack = TRUE: calc_stats(pick(everything())) # Result: mean_x and sd_x as separate columns ``` ## Filter Helpers: if_any() and if_all() Check conditions across multiple columns in the expression filter block. **if_any()**: TRUE when condition is true for **at least one** column ``` # Rows with any NA if_any(everything(), is.na) # Any numeric column > 100 if_any(where(is.numeric), \(x) x > 100) # Search across text columns if_any(where(is.character), \(x) x == "setosa") ``` **if_all()**: TRUE when condition is true for **all** columns ``` # All numeric columns positive if_all(where(is.numeric), \(x) x > 0) # All width measurements > 2 if_all(ends_with("Width"), \(x) x > 2) # No missing values if_all(everything(), \(x) !is.na(x)) ``` ## Common Patterns ### Mutate Block ``` # Round numeric columns across(where(is.numeric), round) # Scale to max across(c(mpg, hp, wt), \(x) x / max(x)) # Uppercase text across(where(is.character), toupper) # Log transform across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}") ``` ### Summarize Block ``` # Means (add grouping with by parameter) across(where(is.numeric), mean) # Multiple statistics across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max)) # Count non-missing across(everything(), \(x) sum(!is.na(x))) ``` ### Expression Filter Block ``` # Complete rows only if_all(everything(), \(x) !is.na(x)) # Any negative if_any(where(is.numeric), \(x) x < 0) # All widths > threshold if_all(ends_with("Width"), \(x) x > 2.5) # Search text columns if_any(where(is.character), \(x) grepl("pattern", x)) ``` ## Tips **Start simple**: Test on one column, then use `across()` **Check preview**: Verify results in the block preview **Unpack option**: In summarize, enable unpack when expressions return data frames **Combine helpers**: Use `&` and `|` to combine selections ``` across(where(is.numeric) & starts_with("Sepal"), mean) ``` **Function syntax**: - Direct: `round` (no arguments needed) - Anonymous function: `\(x) round(x, 2)` (with arguments) - The `x` represents the current column **Missing values**: Add `na.rm = TRUE` to aggregation functions ``` across(where(is.numeric), \(x) mean(x, na.rm = TRUE)) ``` ## Learn More For comprehensive documentation on column-wise operations, see: - [Column-wise operations vignette](https://dplyr.tidyverse.org/articles/colwise.html) - detailed guide to across(), where(), if_any(), and if_all() - [across() reference](https://dplyr.tidyverse.org/reference/across.html) - complete documentation with all arguments and options - [R for Data Science (2e)](https://r4ds.hadley.nz/data-transform.html) - comprehensive tutorial on data transformation These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.