---
title: "Introduction to surveytable"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to surveytable}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`surveytable` is an R package for conveniently tabulating estimates from **complex surveys**.

* If you deal with survey objects in R (created with `survey::svydesign()`), then this package is for you.
* Works with **complex surveys** (data systems that involve survey design variables, like weights and strata).
* Works with **unweighted** data as well.

# Preliminaries

## Concepts

There are two important concepts that we need to learn and distinguish:

1) A **data frame** is a standard way of storing data in R. A data frame is rectangular data. Variables are in columns, observations are in rows. Example:

```{r}
head(iris)
```

A data frame, in an of itself, cannot represent a complex survey. This is because, just by looking at a data frame, R does not know what the sampling weights are, what the strata are, etc. Even if the variables that represent the sampling weights, etc, are part of the data frame, just by looking at the data frame, R does not know which variable represents the weights or other survey design variables. 

You can get a data frame into R in many different ways. If your data is currently in a comma-separated values (CSV) file, you can use `read.csv()`. If it's in a SAS file, you can use a package like `haven` or [importsurvey](https://cdcgov.github.io/importsurvey/). If it's already in R format, use `readRDS()`, and so on.

2) A **survey object** is an object that describes a survey. It tells R what the sampling weights are, what the strata are, and so on. A data frame can be converted into a survey object using the `survey::svydesign()` function; if a survey uses replicate weights, the `survey::svrepdesign()` function should be used.

Generally speaking, you only need to convert a data frame to a survey object once. After it has been converted, you can save it with `saveRDS()` (or similar). In the future, you can load it with `readRDS()`. You do not need to re-convert a data frame to a survey object every time. 

## NAMCS

Examples in this tutorial use a survey called the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF). NAMCS is "an annual nationally representative sample survey of visits to non-federal office-based patient care physicians, excluding anesthesiologists, radiologists, and pathologists." Note that the unit of observation is visits, not patients – this distinction is important since a single patient can make multiple visits.

The `surveytable` package comes with a data frame of selected variables from NAMCS, called `namcs2019sv_df` (`sv` = selected variables; `df` = data frame). The survey object of this survey is called `namcs2019sv`. 

`namcs2019sv` is the object that we analyze. You really only need `namcs2019sv`. The reason that the package has `namcs2019sv_df` is to illustrate how to convert the data frame to the survey object. 

## More concepts

When importing data from another source, such as SAS or CSV, analysts should be aware of the standard way in which variables are handled in R. 

* Specifically, categorical variables should be stored as `factor`. 
* While true / false variables could be stored as `factor` as well, some programming tasks are easier if they are stored as `logical`. 
* Unknown values should be stored as missing (`NA`). If a variable contains "special values", such as a negative value indicating that the age is missing, those "special values" need to be converted to `NA`.

Variables in `namcs2019sv_df` are already stored correctly. Thus, 

* `AGER` (patient's age group) is a `factor` variable; 
* `PAYNOCHG` (which indicates whether there was no charge for the physician visit) is a `logical` variable; and 
* `AGE` (patient's age in years) is a `numeric` variable.

```{r, results=FALSE, message=FALSE, warning=FALSE}
library(surveytable)
```

```{r}
class(namcs2019sv_df$AGER)
class(namcs2019sv_df$PAYNOCHG)
class(namcs2019sv_df$AGE)
```

## Create a survey object

As seen below, tables produced by `surveytable` are clearer if either the variable names themselves are descriptive, or if the variables have the `"label"` attribute that is descriptive. In `namcs2019sv_df`, all variables already have the `"label"` attribute set. For example, while the variable name `AGE` itself is not very descriptive, the variable does have a more descriptive `"label"` attribute:

```{r}
attr(namcs2019sv_df$AGE, "label")
```

Documentation for the NAMCS survey provides the names of the survey design variables. Specifically, in NAMCS, 

* cluster ID's, also known as primary sampling units (PSU's), are given in `CPSUM`; 
* strata are given in `CSTRATM`; and 
* sampling weights are given in `PATWT`. 

Thus, the `namcs2019sv_df` data frame can be turned into a survey object as follows:

```{r}
mysurvey = survey::svydesign(ids = ~ CPSUM
  , strata = ~ CSTRATM
  , weights = ~ PATWT
  , data = namcs2019sv_df)
```

Tables produced by `surveytable` are clearer if either the name of the survey object is descriptive, or if the object has the `"label"` attribute that is descriptive. Let's set this attribute for `mysurvey`:

```{r}
attr(mysurvey, "label") = "NAMCS 2019 PUF"
```

The `mysurvey` object should now be the same as `namcs2019sv`. Let's verify this:

```{r}
all.equal(namcs2019sv, mysurvey)
```

We have just successfully created a survey object from a data frame. 

# Begin analysis

First, specify the survey object that you'd like to analyze.

```{r, results='asis'}
library(surveytable)
set_survey(namcs2019sv)
```

Check the survey label, survey design variables, and the number of observations to verify that it all looks correct.

For this example, we do want to turn on certain NCHS-specific options, such as identifying low-precision estimates. If you do not care about identifying low-precision estimates, you can skip this command. To turn on the NCHS-specific options:

```{r, results='asis'}
set_opts(mode = "NCHS")
```

## List variables

The `var_list()` function lists the variables in the survey. To avoid unintentionally listing all the variables in a survey, which can be many, the starting characters of variable names are specified. For example, to list the variables that start with the letters `age`, type:

```{r, results='asis'}
var_list("age")
```

The table lists 

* variable name; 
* class, which is the type of variable; and 
* variable label, which is the long name of the variable. 

Common classes are `factor` (categorical variable), `logical` (yes / no variable), and `numeric`.

# Tabulate categorical and logical variables

The main function of the `surveytable` package is `tab()`, which tabulates variables. It operates on categorical and logical variables, and presents both estimated counts, with their standard errors (SEs) and 95% confidence intervals (CIs), and percentages, with their SEs and CIs. For example, to tabulate `AGER`, type:

```{r, results='asis'}
tab("AGER")
```

The table title shows the variable label (the long variable name) and the survey label.

For each level of the variable, the table shows:

* the estimated count, its standard error, and its 95% confidence interval; and 
* the estimated percentage, its standard error, and its 95% confidence interval.

**Low-precision estimates.** Optionally, the `tab()` function, as well as the other tabulation functions that are discussed below, can automatically identify low-precision estimates using algorithms developed at NCHS. For counts, rates, and percentages, the functions flag estimates if, according to the algorithms, they should not be presented, should be reviewed by a clearance official, or should be presented with a footnote. If no estimates are flagged by the checks, the table has a footnote that indicates this. If the checks do identify an estimate, that is denoted in an additional column and in the table footnote.

Turn on this functionality using any of the following: `set_opts(lpe = TRUE)`,
`set_opts(mode = "nchs")`, `set_survey(*, mode = "nchs")`, or `options(surveytable.find_lpe = TRUE)`.

As an example, let's tabulate `PAYNOCHG`: 

```{r, results='asis'}
tab("PAYNOCHG")
```

This table tells us that the estimated number of visits in which there was no charge for the visit has low precision. Intuitively, we can see that the CI for this count estimate is very wide, indicating high uncertainty. 

The CIs that are displayed are the ones that are used by the NCHS presentation standards. Specifically, for counts, the tables show the log Student's t 95% CI, with adaptations for complex surveys; for percentages, they show the 95% Korn and Graubard CI.

**Drop missing values.** Some variables might contain missing values (`NA`). Consider the following variable, which is not part of the actual survey, but was constructed specifically for this example:

```{r, results='asis'}
tab("SPECCAT.bad")
```

To calculate percentages based on the non-missing values only, use the `drop_na` argument:

```{r, results='asis'}
tab("SPECCAT.bad", drop_na = TRUE)
```

The above table gives percentages based only on the knowns, that is, based only on non-`NA` values. 

**Multiple tables.** Multiple tables can be created with a single command: 

```{r, results='asis'}
tab("MDDO", "SPECCAT", "MSA")
```

## Entire population
Estimate the total count for the entire population using the `total()` command:

```{r, results='asis'}
total()
```

## Subsets or interactions

To create a table of `AGER` for each value of the variable `SEX`, type:

```{r, results='asis'}
tab_subset("AGER", "SEX")
```

In addition to giving the long name of the variable being tabulated, the title of each table reflects the value of the subsetting variable (in this case, either `Female` or `Male`).

With the `tab_subset()` command, in each table (that is, in each subset), the percentages add up to 100%.

The `tab_cross()` function is similar -- it crosses or interacts two variables and generates a table using this new variable. Thus, to create a table of the interaction of `AGER` and  `SEX`, type:

```{r, results='asis'}
tab_cross("AGER", "SEX")
```

While the estimated counts produced by `tab_subset()` and `tab_cross()` are the same, the percentages are different. 

* With the `tab_subset()` command, within each table (that is, within each subset), the percentages add up to 100%. 
* On the other hand, with `tab_cross()`, the percentages across the entire population add up to 100%.

# Tabulate numeric variables

The `tab()` and `tab_subset()` functions also work with numeric variables, though with such  variables, the output is different. To tabulate `NUMMED` (number of medications), a numeric variable, type:

```{r, results='asis'}
tab("NUMMED")
```

As before, the table title shows the variable label (the long variable name) and the survey label.

The table shows the percentage of values that are not missing (not `NA`), the mean, the standard error of the mean (SEM), and the standard deviation (SD). 

Subsetting works too:

```{r, results='asis'}
tab_subset("NUMMED", "AGER")
```

# Perform statistical hypothesis testing

The `tab_subset()` function makes it easy to perform hypothesis testing by using the `test` argument. When the argument is `TRUE`, a test of association is performed. In addition, t-tests for all pairs of levels are performed as well.

## Categorical variables

Consider the relationship between `AGER` an `SPECCAT`:

```{r, results='asis'}
tab_subset("AGER", "SPECCAT", test = TRUE)
```

According to these tables, there is an association between physician specialty type and patient age. For instance, for patients under 15 years, there is a statistical difference between primary care physician specialty and medical care specialty. But for older patients, such as in the 45-64 age group, there is no statistical difference between the two specialty types.  

As another example, consider the relationship between `MRI` and `SPECCAT`:

```{r, results='asis'}
tab_subset("MRI", "SPECCAT", test = TRUE)
```

According to these tables, there is no statistical association between MRI and physician specialty. For each of the 3 specialty types, a minority of visits have MRI's. For the visits with MRI's, there was no statistical difference between specialty types. 

As a general rule of thumb, since there is no statistical association between MRI and physician specialty, presenting this tabulation would not be particularly interesting, especially since the subsetting decreases the sample size and therefore also decreases the estimate reliability. Instead, it would generally make more sense to just tabulate `MRI` without subsetting by `SPECCAT`.

## Numeric variables

The relationship between `NUMMED` and `AGER`:

```{r, results='asis'}
tab_subset("NUMMED", "AGER", test = TRUE)
```

According to these tables, there is an association between the number of medications and age category. `NUMMED` is statistically similar for the "Under 15 years" and "15-24 years" `AGER` categories. It is statistically different for all other pairs of age categories. 

Finally, let's look at the relationship between `NUMMED` and `SPECCAT`:

```{r, results='asis'}
tab_subset("NUMMED", "SPECCAT", test = TRUE)
```

According to these tables, there is no association between the number of medications and physician specialty type. `NUMMED` is statistically similar for all pairs of physician specialties. 

As a general rule of thumb, since there is no statistical association between the number of medications and physician specialty, presenting this tabulation would not be particularly interesting, especially since the subsetting decreases the sample size and therefore also decreases the estimate reliability. Instead, it would generally make more sense to just tabulate `NUMMED` without subsetting by `SPECCAT`.

## Categorical variables (single variable)

To test whether any pair of `SPECCAT` levels is statistically similar or different, type:

```{r, results='asis'}
tab("SPECCAT", test = TRUE)
```

According to this, surgical and medical care specialties are statistically similar, and are statistically different from primary care.

# Calculate rates

A rate is a ratio of count estimates based on the survey in question divided by population size, which is assumed to be known. For example, the number of physician visits per 100 people in the population is a rate: the number of physician visits is estimated from the `namcs2019sv` survey, while the number of people in the population comes from another source. 

To calculate rates, in addition to the survey, we need a source of information on population size. You would typically use a function such as `read.csv()` to load the population figures and get them into the correct format. The `surveytable` package comes with an object called `uspop2019` that contains several population figures for use in these examples.

Let's examine `uspop2019`:

```{r}
class(uspop2019)
names(uspop2019)
```

The overall population size for the country as a whole is:
```{r}
uspop2019$total
```

Once we have the overall population size, the overall rate is:
```{r, results='asis'}
total_rate(uspop2019$total)
```

To calculate the rates for a particular variable, we need to provide a data frame with a column called `Level` that matches the levels of the variable in the survey, and a column called `Population` that gives the size of the population for that level.

For example, for `AGER`, this data frame as follows:

```{r}
uspop2019$AGER
```

Now that we have the appropriate population figures, the rates table is obtained by typing:

```{r, results='asis'}
tab_rate("AGER", uspop2019$AGER)
```

To calculate the rates for one variable (`AGER`) by another variable (`SEX`), we need population figures in the following format:
```{r}
uspop2019$`AGER x SEX`
```

With this data frame, the rates table is obtained by typing:

```{r, results='asis'}
tab_subset_rate("AGER", "SEX", uspop2019$`AGER x SEX`)
```

# Create or modify variables

In some situations, it might be necessary to modify survey variables, or to create new ones. This section describes how to do this.

**Convert factor to logical.** The variable `MAJOR` (major reason for this visit) has several levels. 

```{r, results='asis'}
tab("MAJOR")
```

Notice that one of the levels is called `"Preventive care"`. Suppose an analyst is only interested in whether or not a visit is a preventive care visit -- they are not interested in the other visit types. They can create a new variable called `Preventive care visits` that is `TRUE` for preventive care visits and `FALSE` for all other types of visits, as follows:

```{r, results='asis'}
var_case("Preventive care visits", "MAJOR", "Preventive care")
tab("Preventive care visits")
```

This creates a logical variable that is `TRUE` for preventive care visits and then tabulates it. When using the `var_case()` function, specify the name of the new logical variable to be created, an existing factor variable, and one or more levels of the factor variable that should be set to `TRUE` in the logical variable. 

Thus, if an analyst is interested in surgery-related visits, which are indicated by two different levels of `MAJOR`, they could type:

```{r, results='asis'}
var_case("Surgery-related visits"
  , "MAJOR"
  , c("Pre-surgery", "Post-surgery"))
tab("Surgery-related visits")
```

**Collapse levels.** The variable `PRIMCARE` (whether the physician is this patient's primary care provider) has levels `Unknown` and `Blank`, among others. 

```{r, results='asis'}
tab("PRIMCARE")
```

To collapse `Unknown` and `Blank` into a single level, type:

```{r, results='asis'}
var_collapse("PRIMCARE", "Unknown if PCP", c("Unknown", "Blank"))
tab("PRIMCARE")
```

**Convert numeric to factor.** The variable `AGE` is numeric.

```{r, results='asis'}
tab("AGE")
```

To create a new variable of age categories based on `AGE`, type:

```{r, results='asis'}
var_cut("Age group"
   , "AGE"
   , c(-Inf, -0.1, 0, 4, 14, 64, Inf)
   , c(NA, "Under 1", "1-4", "5-14", "15-64", "65 and over"))
tab("Age group")
```

In the `var_cut()` command, specify the following information:

* name of the new categorical variable; 
* name of the existing numeric variable;
* cut points -- note that the intervals are inclusive on the right; and  
* category labels.

Be cognizant of any "special values" that the numeric variable might have. In some data systems, negative values indicate unknowns, which should be coded as `NA`. That's what we do here -- any value between `-Inf` and `-0.1` gets coded as missing (`NA`). Though in this particular data, there are no unknowns and no "special values".

**Check whether any variable is true.** For a series of logical variables, you can check whether any of them are `TRUE` using the `var_any()` command. 

A physician visit is considered to be an "imaging services" visit if it had any of a number of imaging services ordered or provided. Imaging services are indicated using logical variables, such as `MRI` and `XRAY`. To create the `Imaging services` variable, type:

```{r, results='asis'}
var_any("Imaging services"
  , c("ANYIMAGE", "BONEDENS", "CATSCAN", "ECHOCARD", "OTHULTRA"
  , "MAMMO", "MRI", "XRAY", "OTHIMAGE"))
tab("Imaging services")
```

**Interact variables.** The `tab_cross()` function creates a table of an interaction of two variables, but it does not save the interacted variable. To create the interacted variable, use the `var_cross()` command:

```{r}
var_cross("Age x Sex", "AGER", "SEX")
```

Specify the name of the new variable as well as names of the two variables to interact.

**Copy a variable.** Create a new variable that is a copy of another variable using `var_copy()`. You can modify the copy, while the original remains unchanged. For example:

```{r, results='asis'}
var_copy("Age group", "AGER")
var_collapse("Age group", "65+", c("65-74 years", "75 years and over"))
var_collapse("Age group", "25-64", c("25-44 years", "45-64 years"))
tab("AGER", "Age group")
```

Here, the `AGER` variable remains unchanged, while the `Age group` variable has fewer categories.


# Save the output

The `tab*` and `total*` functions have an argument called `csv` that specifies the name of a comma-separated values (CSV) file to save the output to. Alternatively, you can name the default CSV output file using the `set_opts()` function. For example, the following directs `surveytable` to send all future output to a CSV file, create some tables, and then turn off sending output to the file:

```r
set_opts(csv = "output.csv")
```

```{r, results='asis'}
tab("MDDO")
set_opts(csv = "")
```

If the tabulation functions are called from within an R Markdown notebook or a Quarto document, they produce HTML or LaTeX tables, as appropriate. This makes it easy to incorporate the output of the `surveytable` package directly into documents, presentations, "shiny" web apps, and other output types.

Finally, the tabulation functions return the tables that they produce. More advanced analysts can use this functionality to integrate `surveytable` into other programming tasks.