--- title: "Quick Start" author: "Gilles Colling" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Quick Start} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(keyed) library(dplyr) set.seed(42) ``` ## The Problem: Silent Data Corruption You receive monthly customer exports from a CRM system. The data should have unique `customer_id` values and complete `email` addresses. One month, someone upstream changes the export logic. Now `customer_id` has duplicates and some emails are missing. **Without explicit checks, you won't notice until something breaks downstream**—wrong row counts after a join, duplicated invoices, failed email campaigns. ```{r} # January export: clean data january <- data.frame( customer_id = c(101, 102, 103, 104, 105), email = c("alice@example.com", "bob@example.com", "carol@example.com", "dave@example.com", "eve@example.com"), segment = c("premium", "basic", "premium", "basic", "premium") ) # February export: corrupted upstream (duplicates + missing email) february <- data.frame( customer_id = c(101, 102, 102, 104, 105), # Note: 102 is duplicated email = c("alice@example.com", "bob@example.com", NA, "dave@example.com", "eve@example.com"), segment = c("premium", "basic", "basic", "basic", "premium") ) ``` The February data looks fine at a glance: ```{r} head(february) nrow(february) # Same row count ``` But it will silently corrupt your analysis. --- ## The Solution: Make Assumptions Explicit **keyed** catches these issues by making your assumptions explicit: ```{r error=TRUE} # Define what you expect: customer_id is unique january_keyed <- january |> key(customer_id) |> lock_no_na(email) # This works - January data is clean january_keyed ``` Now try the same with February's corrupted data: ```{r error=TRUE} # This fails immediately - duplicates detected february |> key(customer_id) ``` The error catches the problem **at import time**, not downstream when you're debugging a mysterious row count mismatch. --- ## Workflow 1: Monthly Data Validation **Goal**: Validate each month's export against expected constraints before processing. **Challenge**: Data quality varies month-to-month. Silent corruption causes cascading errors. **Strategy**: Define keys and assumptions once, apply consistently to each import. ### Define validation function ```{r} validate_customer_export <- function(df) { df |> key(customer_id) |> lock_no_na(email) |> lock_nrow(min = 1) } # January: passes january_clean <- validate_customer_export(january) summary(january_clean) ``` ### Keys survive transformations Once defined, keys persist through dplyr operations: ```{r} # Filter preserves key premium_customers <- january_clean |> filter(segment == "premium") has_key(premium_customers) get_key_cols(premium_customers) # Mutate preserves key enriched <- january_clean |> mutate(domain = sub(".*@", "", email)) has_key(enriched) ``` ### Strict enforcement If an operation breaks uniqueness, keyed errors and tells you to use `unkey()` first: ```{r error=TRUE} # This creates duplicates - keyed stops you january_clean |> mutate(customer_id = 1) ``` To proceed, you must explicitly acknowledge breaking the key: ```{r} january_clean |> unkey() |> mutate(customer_id = 1) ``` --- ## Workflow 2: Safe Joins **Goal**: Join customer data with orders without accidentally duplicating rows. **Challenge**: Join cardinality mistakes are common and hard to debug. 
A "one-to-one" join that's actually one-to-many silently inflates your data. **Strategy**: Use `diagnose_join()` to understand cardinality *before* joining. ### Create sample data ```{r} customers <- data.frame( customer_id = 1:5, name = c("Alice", "Bob", "Carol", "Dave", "Eve"), tier = c("gold", "silver", "gold", "bronze", "silver") ) |> key(customer_id) orders <- data.frame( order_id = 1:8, customer_id = c(1, 1, 2, 3, 3, 3, 4, 5), amount = c(100, 150, 200, 50, 75, 125, 300, 80) ) |> key(order_id) ``` ### Diagnose before joining ```{r} diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE) ``` The diagnosis shows: - **Cardinality is one-to-many**: Each customer can have multiple orders - **Coverage**: Shows how many keys match vs. don't match Now you know what to expect. A `left_join()` will create 8 rows (one per order), not 5 (one per customer). ### Compare key structures ```{r} compare_keys(customers, orders) ``` This shows the join key exists in both tables but with different uniqueness properties—essential information before joining. --- ## Workflow 3: Row Identity Tracking **Goal**: Track which original rows survive through a complex pipeline. **Challenge**: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data. **Strategy**: Use `add_id()` to attach stable identifiers that survive transformations. ### Add row IDs ```{r} # Add UUIDs to rows customers_tracked <- customers |> add_id() customers_tracked ``` ### IDs survive transformations ```{r} # Filter: IDs persist gold_customers <- customers_tracked |> filter(tier == "gold") get_id(gold_customers) # Compare with original compare_ids(customers_tracked, gold_customers) ``` The comparison shows exactly which rows were lost (filtered out) and which were preserved. ### Combining data with ID handling When appending new data, `bind_id()` handles ID conflicts: ```{r} batch1 <- data.frame(x = 1:3) |> add_id() batch2 <- data.frame(x = 4:6) # No IDs yet # bind_id assigns new IDs to batch2 and checks for conflicts combined <- bind_id(batch1, batch2) combined ``` --- ## Workflow 4: Drift Detection **Goal**: Detect when data changes unexpectedly between pipeline runs. **Challenge**: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions. **Strategy**: Commit snapshots with `commit_keyed()` and check for drift with `check_drift()`. ### Commit a reference snapshot ```{r} # Commit current state as reference reference_data <- data.frame( region_id = c("US", "EU", "APAC"), tax_rate = c(0.08, 0.20, 0.10) ) |> key(region_id) |> commit_keyed() ``` ### Check for drift ```{r} # No changes yet check_drift(reference_data) ``` ### Detect changes ```{r} # Simulate upstream change: EU tax rate changed modified_data <- reference_data modified_data$tax_rate[2] <- 0.21 # Drift detected! check_drift(modified_data) ``` The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate. 
### Cleanup

```{r}
# Remove snapshots when done
clear_all_snapshots()
```

---

## Quick Reference

### Core Functions

| Function | Purpose |
|----------|---------|
| `key()` | Define key columns (validates uniqueness) |
| `unkey()` | Remove key |
| `has_key()`, `get_key_cols()` | Query key status |

### Assumption Checks

| Function | Validates |
|----------|-----------|
| `lock_unique()` | No duplicate values |
| `lock_no_na()` | No missing values |
| `lock_complete()` | All expected values present |
| `lock_coverage()` | Reference values covered |
| `lock_nrow()` | Row count within bounds |

### Diagnostics

| Function | Purpose |
|----------|---------|
| `diagnose_join()` | Analyze join cardinality |
| `compare_keys()` | Compare key structures |
| `compare_ids()` | Compare row identities |
| `find_duplicates()` | Find duplicate key values |
| `key_status()` | Quick status summary |

### Row Identity

| Function | Purpose |
|----------|---------|
| `add_id()` | Add UUIDs to rows |
| `get_id()` | Retrieve row IDs |
| `bind_id()` | Combine data with ID handling |
| `make_id()` | Create deterministic IDs from columns |
| `check_id()` | Validate ID integrity |

### Drift Detection

| Function | Purpose |
|----------|---------|
| `commit_keyed()` | Save reference snapshot |
| `check_drift()` | Compare against snapshot |
| `list_snapshots()` | View saved snapshots |
| `clear_snapshot()` | Remove specific snapshot |

---

## When to Use Something Else

keyed is designed for **flat-file workflows** without database infrastructure. If you need:

| Need | Better Alternative |
|------|--------------------|
| Enforced schema | Database (SQLite, DuckDB) |
| Version history | Git, git2r |
| Full data validation | pointblank, validate |
| Production pipelines | targets |

keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.

---

## See Also

- [Design Philosophy](philosophy.html) - The reasoning behind keyed's approach
- [Function Reference](https://gillescolling.com/keyed/reference/index.html) - Complete API documentation