---
title: "Getting Started with gutenbergr"
description: >
  A simple introduction to the gutenbergr package
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Getting Started with gutenbergr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| label: setup
#| include: false
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  fig.width = 7,
  fig.height = 6,
  warning = FALSE,
  message = FALSE
)
```

The gutenbergr package helps you download and process public domain works from [Project Gutenberg](http://www.gutenberg.org/). This vignette introduces the package's metadata datasets and core downloading functionality.

## Required Libraries

```{r}
#| label: windows-check
#| include: false
tryCatch(
  library(gutenbergr),
  error = function(e) {
    # Fallback for Windows check environments
    devtools::load_all("..")
  }
)
```

```{r}
#| label: packages
library(dplyr)
library(stringr)
```

## Exploring the Metadata

### `gutenberg_metadata`

The `gutenberg_metadata` dataset contains information about each work in the Project Gutenberg collection:

```{r}
#| label: metadata
gutenberg_metadata
```

You can filter this to find specific works:

```{r}
#| label: filter-metadata
gutenberg_metadata |>
  filter(title == "Persuasion")
```

The metadata currently in the package was last updated on **`r format(attr(gutenberg_metadata, "date_updated"), '%d %B %Y')`**.

### `gutenberg_works()`

In most analyses, you'll want to filter for English works, avoid duplicates, and include only books with downloadable text.
The `gutenberg_works()` function does this automatically:

```{r}
#| label: works
gutenberg_works()
```

You can also filter directly within the function:

```{r}
#| label: works-filter
gutenberg_works(author == "Austen, Jane")

# Using regular expressions
gutenberg_works(str_detect(author, "Austen"))

# Multiple conditions
gutenberg_works(author == "Dickens, Charles", has_text == TRUE)
```

### `gutenberg_subjects`

The `gutenberg_subjects` dataset pairs works with Library of Congress classifications and subject headings:

```{r}
#| label: subjects
gutenberg_subjects
```

This is useful for finding works by genre or topic:

```{r}
#| label: filter-subjects
# Find detective stories
gutenberg_subjects |>
  filter(subject == "Detective and mystery stories")

# Find Sherlock Holmes stories
gutenberg_subjects |>
  filter(grepl("Holmes, Sherlock", subject))
```

You can join this with `gutenberg_works()` to download books by subject:

```{r}
#| label: join-subjects
#| eval: false
# Get IDs of detective stories
detective_ids <- gutenberg_subjects |>
  filter(subject == "Detective and mystery stories") |>
  inner_join(gutenberg_works(), by = "gutenberg_id") |>
  pull(gutenberg_id)

# Download a sample
detective_stories <- gutenberg_download(
  detective_ids[1:5],
  meta_fields = c("title", "author")
)
```

### `gutenberg_authors`

The `gutenberg_authors` dataset contains author information including aliases and birth/death years:

```{r}
#| label: authors
gutenberg_authors
```

This can be useful for filtering by author characteristics:

```{r}
#| label: filter-authors
#| eval: false
# Find works by 19th century authors
nineteenth_century_gutenberg_authors <- gutenberg_authors |>
  filter(birthdate >= 1800, birthdate < 1900) |>
  inner_join(gutenberg_works(), by = "gutenberg_author_id")
```

## Downloading Books

### Single Book

Download a book using its Gutenberg ID with `gutenberg_download()`:

```{r}
#| label: download-single
#| eval: false
persuasion <- gutenberg_download(105, meta_fields = c("title",
  "author"))
```

```{r}
#| label: download-single-display
#| echo: false
persuasion <- filter(gutenbergr::sample_books, gutenberg_id == 105)
```

```{r}
#| label: show-persuasion
persuasion
```

The result is a tibble with:

* `gutenberg_id` - the book's ID
* `text` - one row per line of text
* any columns requested through `meta_fields` (here, `title` and `author`)

### Multiple Books

Download multiple books by providing a vector of Gutenberg IDs:

```{r}
#| label: download-multiple
#| eval: false
books <- gutenberg_download(c(105, 109))
```

```{r}
#| label: download-multiple-display
#| echo: false
books <- gutenbergr::sample_books
```

```{r}
#| label: show-books
books
```

### Adding Metadata

Use the `meta_fields` argument to include additional information:

```{r}
#| label: download-with-meta
#| eval: false
books <- gutenberg_download(c(105, 109), meta_fields = c("title", "author"))
```

```{r}
#| label: show-books-count
books |>
  count(title)
```

### Downloading from `gutenberg_works()`

You can pipe the output of `gutenberg_works()` directly into `gutenberg_download()`:

```{r}
#| label: download-pipe
#| eval: false
# Download all of Aristotle's works with titles
aristotle_books <- gutenberg_works(author == "Aristotle") |>
  gutenberg_download(meta_fields = "title")
```

## What's Next?

Now that you have book texts as tibbles, you can:

* Perform text analysis with the [tidytext](https://github.com/juliasilge/tidytext) package
* See the [Text Mining Example](text-mining.html) vignette for a complete analysis workflow
* Explore the [Natural Language Processing CRAN View](https://CRAN.R-project.org/view=NaturalLanguageProcessing) for more text analysis packages

## Additional Resources

* Match Wikipedia data with [WikipediR](https://cran.r-project.org/package=WikipediR) or [wikipediatrend](https://cran.r-project.org/package=wikipediatrend)
* Parse author names with [humaniformat](https://cran.r-project.org/package=humaniformat)
* Predict gender from names with [gender](https://cran.r-project.org/package=gender)
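
Before reaching for any of the packages above, note that a downloaded tibble can already be summarised with dplyr and stringr alone. The chunk below is only a sketch: `toy_book` is a hypothetical, hand-built stand-in for the output of `gutenberg_download()` so that it runs without network access, and counting whitespace-separated tokens is a rough approximation of a word count, not a proper tokenizer.

```{r}
#| label: sketch-line-summary
library(dplyr)
library(stringr)

# Hypothetical stand-in for a downloaded book (not real package data):
# one row per line of text, keyed by gutenberg_id.
toy_book <- tibble(
  gutenberg_id = c(105L, 105L, 105L, 105L),
  text = c("Persuasion", "", "by Jane Austen", "Chapter 1")
)

# Drop blank lines, count whitespace-separated tokens on each line,
# then total the lines and tokens per book.
line_summary <- toy_book |>
  filter(text != "") |>
  mutate(n_words = str_count(text, "\\S+")) |>
  group_by(gutenberg_id) |>
  summarise(lines = n(), words = sum(n_words))

line_summary
```

The same pipeline should apply unchanged to a multi-book tibble, since the summary is grouped by `gutenberg_id`.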