---
title: "Getting Started with quanteda.tidy"
author: "Ken Benoit"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with quanteda.tidy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##"
)
```

## Introduction

**quanteda.tidy** extends the **quanteda** package with **dplyr**-style verbs
for manipulating corpus objects. These functions operate on document variables
(docvars) while preserving the text content and structure of quanteda objects.

Note that **quanteda.tidy** is very different from **tidytext**. While tidytext
converts text to data frames with one token per row, **quanteda.tidy** keeps your
corpus intact and extends **dplyr** functions to work directly with quanteda
objects.

```{r setup, message=FALSE}
library(quanteda.tidy)
```
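
As a minimal sketch of what "preserving the text content and structure" means,
the chunk below adds a docvar with `mutate()` and then checks that the result is
still a corpus with the same documents and texts. The object name
`corp_century` is only illustrative, and the checks use the standard
**quanteda** helpers `is.corpus()` and `ndoc()`.

```{r still-a-corpus}
library(quanteda.tidy)

# Add a docvar; the texts themselves are not expected to change
corp_century <- data_corpus_inaugural %>%
  mutate(century = floor(Year / 100) + 1)

# Is the result still a quanteda corpus?
is.corpus(corp_century)

# Same number of documents as the original corpus?
ndoc(corp_century) == ndoc(data_corpus_inaugural)

# Same texts as the original corpus?
identical(as.character(corp_century), as.character(data_corpus_inaugural))
```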

## Overview of Functions

The functions in **quanteda.tidy** are organized into four categories,
following the [dplyr documentation](https://dplyr.tidyverse.org/reference/):

```{r function-table, echo=FALSE}
func_table <- data.frame(
  Category = c(
    rep("Rows", 5),
    rep("Columns", 6),
    rep("Groups of rows", 2),
    "Pairs of data frames"
  ),
  Function = c(
    # Rows
    "`filter()`", "`slice()`, `slice_head()`, `slice_tail()`",
    "`slice_sample()`", "`slice_min()`, `slice_max()`", "`arrange()`, `distinct()`",
    # Columns
    "`select()`", "`rename()`, `rename_with()`", "`relocate()`",
    "`mutate()`, `transmute()`", "`pull()`", "`glimpse()`",
    # Groups
    "`add_count()`", "`add_tally()`",
    # Pairs
    "`left_join()`"
  ),
  Description = c(
    # Rows
    "Subset documents based on docvar conditions",
    "Subset documents by position",
    "Randomly sample documents",
    "Select documents with min/max docvar values",
    "Reorder documents; keep unique documents",
    # Columns
    "Keep or drop docvars by name",
    "Rename docvars",
    "Change docvar column order",
    "Create or modify docvars",
    "Extract a single docvar as a vector",
    "Get a quick overview of the corpus",
    # Groups
    "Add count by group as a docvar",
    "Add total count as a docvar",
    # Pairs
    "Join corpus with external data frame"
  )
)
knitr::kable(func_table, caption = "quanteda.tidy functions by category")
```

## Verbs That Operate on Rows

These functions subset, reorder, or select documents based on their document
variables or positions.

### Filtering documents

Use `filter()` to keep documents that match specified conditions:

```{r filter}
# Keep speeches by presidents named Roosevelt (Theodore and Franklin D.)
data_corpus_inaugural %>%
  filter(President == "Roosevelt") %>%
  summary()
```
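
For readers who already know **quanteda**, it may help to compare `filter()`
with `corpus_subset()`. The sketch below assumes the two select the same
documents for a simple condition and checks this by comparing document names;
`via_filter` and `via_subset` are illustrative names.

```{r filter-vs-subset}
library(quanteda.tidy)

# Subset with the dplyr-style verb
via_filter <- data_corpus_inaugural %>%
  filter(President == "Roosevelt")

# Subset with quanteda's own corpus_subset(), for comparison
via_subset <- corpus_subset(data_corpus_inaugural, President == "Roosevelt")

# Do both approaches select the same documents?
identical(docnames(via_filter), docnames(via_subset))
```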

### Slicing documents by position

Use `slice()` and its variants to select documents by position:

```{r slice}
# First 3 documents
slice(data_corpus_inaugural, 1:3)

# First 10%
slice_head(data_corpus_inaugural, prop = 0.10)

# Last 3 documents
slice_tail(data_corpus_inaugural, n = 3)
```

Use `slice_sample()` to draw a random sample of documents:

```{r slice-sample}
set.seed(42)
slice_sample(data_corpus_inaugural, n = 5)
```

Use `slice_min()` and `slice_max()` to select the documents with the smallest
or largest values of a docvar:

```{r slice-minmax}
# Add token counts first
corp <- data_corpus_inaugural %>%
  mutate(n_tokens = ntoken(data_corpus_inaugural))

# Shortest speeches
slice_min(corp, n_tokens, n = 3)

# Longest speeches
slice_max(corp, n_tokens, n = 3)
```

### Arranging documents

Use `arrange()` to reorder documents:

```{r arrange}
# Sort alphabetically by president
data_corpus_inaugural[1:5] %>%
  arrange(President)

# Sort by year descending
data_corpus_inaugural[1:5] %>%
  arrange(desc(Year))
```

### Keeping distinct documents

Use `distinct()` to keep only unique combinations of docvar values:

```{r distinct}
# Keep the first document for each distinct value of President (a surname)
data_corpus_inaugural %>%
  distinct(President, .keep_all = TRUE) %>%
  summary(n = 10)
```

## Verbs That Operate on Columns

These functions create, modify, rename, reorder, or select document variables.

### Selecting docvars

Use `select()` to keep or drop docvars:

```{r select}
data_corpus_inaugural %>%
  select(President, Year) %>%
  summary(n = 5)
```

### Renaming docvars

Use `rename()` for direct renaming:

```{r rename}
data_corpus_inaugural %>%
  rename(LastName = President, Given = FirstName) %>%
  summary(n = 5)
```

Use `rename_with()` to rename using a function:

```{r rename-with}
data_corpus_inaugural %>%
  rename_with(toupper) %>%
  summary(n = 5)
```

### Relocating docvars

Use `relocate()` to change column order:

```{r relocate}
data_corpus_inaugural %>%
  relocate(Party, President) %>%
  summary(n = 5)
```

### Creating and modifying docvars

Use `mutate()` to add new docvars or modify existing ones:

```{r mutate}
data_corpus_inaugural %>%
  mutate(
    fullname = paste(FirstName, President, sep = " "),
    century = floor(Year / 100) + 1
  ) %>%
  summary(n = 5)
```
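
Because `mutate()` works on document variables, a roughly equivalent result can
be obtained with **quanteda**'s `docvars()` replacement function. The following
is a sketch for comparison only, not a description of how `mutate()` is
implemented; `via_mutate` and `via_docvars` are illustrative names.

```{r mutate-vs-docvars}
library(quanteda.tidy)

# dplyr-style: add a docvar inside a pipeline
via_mutate <- data_corpus_inaugural %>%
  mutate(century = floor(Year / 100) + 1)

# quanteda-style: assign the docvar directly on a copy of the corpus
via_docvars <- data_corpus_inaugural
docvars(via_docvars, "century") <- floor(docvars(via_docvars, "Year") / 100) + 1

# Compare the new docvar from both approaches
all.equal(docvars(via_mutate, "century"), docvars(via_docvars, "century"))
```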

Use `transmute()` to create new docvars and drop all others:

```{r transmute}
data_corpus_inaugural %>%
  transmute(
    speech_id = paste(Year, President, sep = "-"),
    party = Party
  ) %>%
  summary(n = 5)
```

### Extracting docvars

Use `pull()` to extract a single docvar as a vector:

```{r pull}
data_corpus_inaugural %>%
  filter(Year >= 2000) %>%
  pull(President)
```

### Getting an overview

Use `glimpse()` (from **tibble**) to see a compact summary:

```{r glimpse}
glimpse(data_corpus_inaugural)
```

## Verbs That Operate on Groups of Rows

These functions compute summaries or add variables based on groups.

### Counting observations

Use `add_count()` to add a count variable by group:

```{r add-count}
# Count speeches per value of President (a surname, so namesakes are pooled)
data_corpus_inaugural %>%
  add_count(President, name = "n_speeches") %>%
  filter(n_speeches > 1) %>%
  summary(n = 10)
```

Use `add_tally()` to add the total count:

```{r add-tally}
data_corpus_inaugural %>%
  slice(1:5) %>%
  add_tally() %>%
  summary()
```

## Verbs That Operate on Pairs of Data Frames

These functions combine a corpus with an external data frame.

### Joining with external data

Use `left_join()` to add columns from a data frame to your corpus:

```{r left-join}
# Create some external data
party_colors <- data.frame(
  Party = c("Democratic", "Republican", "none", "Federalist",
            "Democratic-Republican", "Whig"),
  color = c("blue", "red", "gray", "purple", "green", "orange")
)

# Join to corpus
data_corpus_inaugural %>%
  left_join(party_colors, by = "Party") %>%
  summary(n = 10)
```

#### Special handling of document names

`left_join()` provides special handling for joining on document names. Use
`"docname"` in the `by` argument to match on document names even when
`"docname"` is not a docvar:

```{r left-join-docname}
# Create data with document name as key
doc_metadata <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  notes = c("First inaugural", "Second inaugural", "First Adams speech")
)

# Join using docname
data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata, by = "docname") %>%
  summary()
```

You can also match document names to a differently-named column:

```{r left-join-docname2}
doc_metadata2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)

data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata2, by = c("docname" = "doc_id")) %>%
  summary()
```
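
In a left join, every document in the corpus should be kept, and documents
without a matching row in the data frame would be expected to receive `NA` for
the joined columns. The sketch below checks this under that assumption, using a
small illustrative data frame (`doc_notes`) that covers only three of the first
five documents.

```{r left-join-unmatched}
library(quanteda.tidy)

# Illustrative metadata covering only some of the documents
doc_notes <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  notes = c("First inaugural", "Second inaugural", "First Adams speech")
)

joined <- data_corpus_inaugural[1:5] %>%
  left_join(doc_notes, by = "docname")

# All five documents should still be present
ndoc(joined)

# Which documents have no matching note?
is.na(pull(joined, notes))
```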

## Piping Operations

Because these verbs take a corpus as input and return a corpus (with the
exception of `pull()`, which returns a vector), they can be chained with the
pipe operator:

```{r piping}
data_corpus_inaugural %>%
  # Add metadata
  mutate(
    decade = floor(Year / 10) * 10,
    n_tokens = ntoken(data_corpus_inaugural)
  ) %>%
  # Filter to 20th century
  filter(Year >= 1900, Year < 2000) %>%
  # Keep only relevant columns
  select(President, Party, decade, n_tokens) %>%
  # Sort by speech length
  arrange(desc(n_tokens)) %>%
```
  summary(n = 10)
```
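
The examples in this vignette use the **magrittr** pipe (`%>%`). Since the verbs
are ordinary functions whose first argument is the corpus, the native pipe
(`|>`) should work in the same way; this sketch assumes R 4.1.0 or later, where
the native pipe is available. The names `with_magrittr` and `with_native` are
illustrative.

```{r native-pipe}
library(quanteda.tidy)

# The same pipeline written with each pipe operator
with_magrittr <- data_corpus_inaugural %>%
  filter(Year >= 2000) %>%
  pull(President)

with_native <- data_corpus_inaugural |>
  filter(Year >= 2000) |>
  pull(President)

# Do both pipelines return the same vector?
identical(with_magrittr, with_native)
```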