---
title: "HTML Tables"
author: "Duncan Garmonsway"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{HTML Tables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
This vignette for the [unpivotr](https://github.com/nacnudus/unpivotr) package
demonstrates unpivoting HTML tables of various kinds.
The HTML files are in the package directory at `system.file("extdata",
c("rowspan.html", "colspan.html", "row-and-colspan.html", "nested.html",
"url.html"), package = "unpivotr")`.
```{r, echo = TRUE}
library(dplyr)
library(rvest)
library(htmltools)
library(unpivotr)
```
## Rowspan and colspan examples
If a table has cells merged across rows or columns (or both), `as_cells()`
does not fill the merged cell's contents across the rows or columns that it
spans. This differs from some other packages, e.g. `rvest`. However, if merged
cells leave a table non-square (rows of unequal length), then `as_cells()` pads
the missing cells with blanks.
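As a minimal sketch of that behaviour, the following builds a small table inline rather than reading one of the package files. The table contents are invented for illustration; the column names `row`, `col` and `html` are the ones used on `as_cells()` output elsewhere in this vignette.

```r
library(rvest)
library(unpivotr)

# A made-up 2 x 2 table whose first cell is merged across both rows
page <- read_html('
<table>
  <tr><th rowspan="2">merged</th><td>top</td></tr>
  <tr><td>bottom</td></tr>
</table>')

# rvest returns one data frame per table
filled <- html_table(page)[[1]]

# unpivotr returns one data frame of cells per table, one row per cell
cells <- as_cells(page)[[1]]
cells[, c("row", "col", "html")]
```

Printing `filled` next to `cells` makes the two treatments of the merged cell easy to compare.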
### Rowspan
```{r, echo = TRUE}
rowspan <- system.file("extdata", "rowspan.html", package = "unpivotr")
includeHTML(rowspan)
# rvest
rowspan %>%
  read_html() %>%
  html_table()
# unpivotr
rowspan %>%
  read_html() %>%
  as_cells()
```
### Colspan
```{r, echo = TRUE}
colspan <- system.file("extdata", "colspan.html", package = "unpivotr")
includeHTML(colspan)
# rvest
colspan %>%
  read_html() %>%
  html_table()
# unpivotr
colspan %>%
  read_html() %>%
  as_cells()
```
### Both rowspan and colspan: non-square
```{r, echo = TRUE}
rowandcolspan <- system.file("extdata",
                             "row-and-colspan.html",
                             package = "unpivotr")
includeHTML(rowandcolspan)
# rvest
rowandcolspan %>%
  read_html() %>%
  html_table()
# unpivotr
rowandcolspan %>%
  read_html() %>%
  as_cells()
```
## Nested example
`as_cells()` never descends into cells. If a cell contains a table, then to
parse that table call `as_cells()` again on the `html` of that cell.
```{r, echo = TRUE}
nested <- system.file("extdata", "nested.html", package = "unpivotr")
includeHTML(nested)
# rvest parses both tables
nested %>%
read_html() %>%
html_table(fill = TRUE)
# unpivotr
x <-
  nested %>%
  read_html() %>%
  as_cells() %>%
  .[[1]]
x
# The html of the table inside a cell
cell <-
  x %>%
  dplyr::filter(row == 2, col == 2) %>%
  .$html
cell
# Parsing the table inside the cell
cell %>%
  read_html() %>%
  as_cells()
```
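The chunk above picks out the nested table by its known position (`row == 2, col == 2`). When the position is not known in advance, one possible approach, sketched here on an invented inline table, is to look for cells whose `html` contains a `<table` tag. The `grepl()` test is a simplification assumed for illustration, not a feature of unpivotr.

```r
library(rvest)
library(unpivotr)

# An invented outer table with a second table inside one of its cells
page <- read_html('
<table>
  <tr><td>plain</td>
      <td><table><tr><td>inner 1</td><td>inner 2</td></tr></table></td></tr>
</table>')

outer <- as_cells(page)[[1]]

# Flag cells whose HTML contains a nested table; cells with NA html are skipped
has_table <- !is.na(outer$html) & grepl("<table", outer$html, fixed = TRUE)

# Parse each nested table separately, as in the chunk above
inner <- lapply(outer$html[has_table], function(cell) {
  as_cells(read_html(cell))[[1]]
})
```

Each element of `inner` is the cell-by-cell data frame of one nested table, in the order in which the flagged cells appear in `outer`.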
## URL example
A motivation for using `unpivotr::as_cells()` is that it extracts more than
just text -- it can extract whatever part of the HTML you need.
Here, we extract URLs.
```{r, echo = TRUE}
urls <- system.file("extdata", "url.html", package = "unpivotr")
includeHTML(urls)
# Return the href of every link in one cell's HTML (NA for an empty cell)
cell_url <- function(x) {
  if (is.na(x)) return(NA)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
}
# Return the text of every link in one cell's HTML (NA for an empty cell)
cell_text <- function(x) {
  if (is.na(x)) return(NA)
  x %>%
    read_html() %>%
    html_nodes("a") %>%
    html_text()
}
urls %>%
  read_html() %>%
  as_cells() %>%
  .[[1]] %>%
  mutate(text = purrr::map(html, cell_text),
         url = purrr::map(html, cell_url)) %>%
  tidyr::unnest(cols = c(text, url))
```
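The same pattern extends to other parts of a cell's HTML. As a rough sketch, the hypothetical helper `cell_attr()` below (not part of unpivotr or rvest) returns one attribute of the first node matching a CSS selector. It is applied here to invented cell strings rather than to the package's `url.html`, and it uses only rvest functions already used in this vignette, plus `html_node()`.

```r
library(rvest)

# Hypothetical helper: the value of `attr` on the first node matching `css`
# in one cell's HTML, or NA if the cell is empty or has no matching node
cell_attr <- function(x, css, attr) {
  if (is.na(x)) return(NA_character_)
  node <- html_node(read_html(x), css)
  html_attr(node, attr)
}

# Invented cell HTML, in the shape of the `html` column returned by as_cells()
cells_html <- c(
  '<td><a href="https://example.com/a">A</a></td>',
  '<td>no link here</td>',
  NA
)

hrefs <- vapply(cells_html, cell_attr, character(1),
                css = "a", attr = "href", USE.NAMES = FALSE)
hrefs
```

Because the helper always returns a single string, it can be used with `purrr::map_chr()` inside `mutate()` without a later `unnest()`, at the cost of keeping only the first match in each cell.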