vtree
Subsets play an important role in almost any data analysis. Imagine a data set of countries, with variables named `population`, `continent`, `landlocked`, and a variety of other variables representing economic and political characteristics. We might wish to examine subsets of the data set based on the `continent` variable. Within each of these subsets, we might wish to examine nested subsets based on the `population` variable, for example, countries with populations under 30 million and over 30 million. We might continue to a third level of nesting based on the `landlocked` variable. Nested subsets help us to answer questions like the following: Among African countries with a population over 30 million, what percentage are landlocked?
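To make the question concrete, here is a minimal base R sketch of nested subsetting done by hand. The `countries` data frame and all of its values are invented purely for illustration.

```r
# Hypothetical data set of six made-up countries (values are not real)
countries <- data.frame(
  name       = c("A", "B", "C", "D", "E", "F"),
  continent  = c("Africa", "Africa", "Africa", "Africa", "Europe", "Asia"),
  population = c(45e6, 110e6, 12e6, 60e6, 83e6, 35e6),
  landlocked = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Level 1: subset by continent
africa <- subset(countries, continent == "Africa")

# Level 2: nested subset by population
africa_large <- subset(africa, population > 30e6)

# Level 3: percentage landlocked within the nested subset
pct_landlocked <- 100 * mean(africa_large$landlocked)
print(pct_landlocked)
```

Even with three variables, each level needs its own intermediate object and its own denominator, which is where manual bookkeeping tends to go wrong.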
Even in simple situations like this, it can be a chore to keep track of nested subsets and to calculate percentages. And as the number of subsetting variables increases, the number of nested subsets grows rapidly. When there are missing values in the data set, the task becomes even more complicated. For these reasons, it is best not to perform nested subsetting by hand.
Nested subsets arise in many situations. Consider, for example, flow diagrams for clinical studies. In a typical study, participants are a subset of eligible patients who in turn are a subset of patients who were assessed. Because mistakes inevitably arise during manual calculation and transcription, it is not uncommon for flow diagrams in published studies to contain errors. And although the errors that make it to publication are often small, they can occasionally be disastrous.
A solution to the problem of calculating nested subsets and displaying information about them is presented here. The idea is to represent nested subsets of a data set (where the subsets are defined by specific variables in the data set) using a tree structure, which we call a variable tree. `vtree` is a flexible tool for calculating and drawing variable trees. It can be used to explore a data set, to construct figures for reports and publications, and to double-check code to ensure that subsetting has been performed correctly.
The examples that follow use a data set of 46 fictitious patients called `FakeData`. Based on this data set, the variable tree below depicts subsets defined by `Sex` (M or F) nested within subsets defined by disease `Severity` (Mild, Moderate, Severe, or NA). Later we’ll see how variable trees with more than two variables are particularly useful.
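A tree of this kind could be produced with a call along the following lines. This is a sketch that assumes the vtree package is installed and that `FakeData` is available once the package is loaded.

```r
library(vtree)

# Severity is listed first, so Sex is nested within Severity
tree <- vtree(FakeData, "Severity Sex")
tree
```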
A variable tree consists of nodes connected by arrows. At the top of the diagram above, the root node of the tree contains all 46 patients. The rest of the nodes are arranged in successive levels, where each level corresponds to a variable. Note that this highlights one difference between variable trees and some other kinds of trees: at each level of a variable tree, regardless of the branch, the nodes represent values of the same variable. In contrast, consider decision trees, which can have splits on different variables at the same level.
Continuing with the variable tree above, the nodes immediately below the root represent values of `Severity` and are referred to as the children of the root node. In this case, `Severity` was missing (NA) for 6 patients, and there is a node for these patients. Inside each of the nodes, the number of patients is displayed and, except in the missing-value node, the corresponding percentage is also shown. Note that, by default, `vtree` displays “valid” percentages, i.e. the denominator used to calculate the percentage is the total number of non-missing values, 40.
The nodes in the next level (which is the final level for this tree) correspond to values of `Sex`. These nodes represent males and females within subsets defined by each value of `Severity`. In each of these nodes the percentage is calculated in terms of the number of patients in its parent node.
Like any node, a missing-value node can have children. For example, of the 6 patients for whom `Severity` is missing, 3 are female and 3 are male. By default, `vtree` displays the full missing-value structure of the specified variables in the data frame.
Also by default, `vtree` automatically assigns a color palette to each variable. `Severity` has been assigned red hues (lightest for Mild, darkest for Severe), while `Sex` has been assigned blue hues (light blue for females, dark blue for males). The node representing missing values of `Severity` is colored white to draw attention to it.
A tree with two variables is similar to a two-way contingency table. In the example above, `Sex` is shown within levels of `Severity`. This corresponds to the following contingency table, where the percentages within each column add to 100%. These are called column percentages.
Sex | Mild | Moderate | Severe | NA |
---|---|---|---|---|
F | 11 (58%) | 11 (69%) | 2 (40%) | 3 (50%) |
M | 8 (42%) | 5 (31%) | 3 (60%) | 3 (50%) |
Likewise, a tree with `Severity` shown within levels of `Sex` corresponds to a contingency table with row percentages.
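The difference between column and row percentages can be checked in base R. The sketch below re-enters the counts from the table above by hand rather than reading them from `FakeData`.

```r
# Counts transcribed from the contingency table (rows: Sex, columns: Severity)
counts <- matrix(
  c(11, 11, 2, 3,
     8,  5, 3, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("F", "M"),
                  Severity = c("Mild", "Moderate", "Severe", "NA"))
)

# Column percentages: Sex within each level of Severity
col_pct <- round(100 * prop.table(counts, margin = 2))

# Row percentages: Severity within each level of Sex
row_pct <- round(100 * prop.table(counts, margin = 1))

print(col_pct)
print(row_pct)
```

The column percentages match the leaf nodes of a tree specified as `"Severity Sex"`, while the row percentages correspond to a tree specified as `"Sex Severity"`.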
The contingency table above is more compact than the corresponding variable tree, but some people find the variable tree easier to interpret. When three or more variables are of interest, multi-way contingency tables can be used. These are typically displayed using several two-way tables. In this situation, variable trees are generally easier to interpret.
It is also noteworthy that contingency tables are not always more compact than variable trees. When most cells of a large contingency table are empty (in which case the table is said to be sparse), the corresponding variable tree may be more compact since empty nodes are not shown.
Variable trees are thus an appealing alternative to multi-way contingency tables and can also be used to display a wide variety of information including:
- multi-way intersections (often shown in Venn diagrams),
- flow diagrams involving a sequence of inclusion/exclusion steps,
- longitudinal events.
`vtree` is designed to be quick and easy to use, so that it is convenient for data exploration, but also flexible enough that it can be used to prepare publication-ready figures. To generate a basic variable tree, it is only necessary to provide `vtree` with a data frame and some variable names. However, extra features make `vtree` much more useful. `vtree` provides:
- control over labeling, colors, legends, line wrapping, text formatting and other customization features;
- flexible pruning to remove parts of the tree that are of lesser interest, which is particularly useful when a tree gets large;
- display of information about other variables in each node, including a variety of summary statistics;
- special displays for indicator variables, patterns of values, and missing value patterns;
- support for REDCap checkbox variables; and
- features for dichotomizing variables and checking for outliers.
`vtree` is built on open-source software: in particular Richard Iannone’s DiagrammeR package, which provides an interface to the Graphviz software using the htmlwidgets framework. A formal description of variable trees follows.
The root node of the variable tree represents the entire data frame. The root node has a child for each observed value of the first variable that was specified. Each of these child nodes represents a subset of the data frame with a specific value of the variable, and is labeled with the number of observations in the subset and the corresponding percentage of the number of observations in the entire data frame. The nth level below the root of the variable tree corresponds to the nth variable specified. Apart from the root node, each node in the variable tree represents the subset of its parent defined by a specific observed value of the variable at that level of the tree, and is labeled with the number of observations in that subset and the corresponding percentage of the number of observations in its parent node.
Note that a node always represents at least one observation. And unlike a contingency table, which can have empty cells, a variable tree has no empty nodes.
`vtree` function
Consider a data frame named `df`, which includes discrete variables `v1` and `v2`. In this case, a variable tree can be displayed using the following command:
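A minimal sketch of such a command follows. The contents of `df` are invented here so that the example is self-contained; only the names `df`, `v1`, and `v2` come from the text.

```r
library(vtree)

# Hypothetical data frame with two discrete variables, for illustration only
df <- data.frame(
  v1 = c("a", "a", "b", "b", "b"),
  v2 = c("x", "y", "x", "x", "y")
)

# First argument: the data frame; second: variable names separated by whitespace
tree <- vtree(df, "v1 v2")
tree
```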
For additional details about how variables can be specified, see the section on specification of variables below.
Numerous additional parameters can be supplied. For example, by default `vtree` produces a horizontal tree (that is, a tree that grows from left to right), but sometimes a vertical tree is preferable. When `horiz=FALSE` is specified, `vtree` generates a vertical tree.
To display a variable tree for a single variable, use the following command:
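For instance, a single-variable tree for `Severity` might be requested as sketched below (assuming the vtree package and its `FakeData` data set). The second call adds `horiz=FALSE` to show the vertical layout described earlier.

```r
library(vtree)

# Single-variable tree, default (horizontal) layout
tree_h <- vtree(FakeData, "Severity")

# The same tree drawn vertically
tree_v <- vtree(FakeData, "Severity", horiz = FALSE)
tree_v
```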
Next, consider a vertical variable tree with two variables, `Severity` and `Sex`. A less colorful display with more spacing can be requested by specifying `plain=TRUE`:
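A call of roughly this form would produce such a tree (a sketch, assuming the vtree package and `FakeData`):

```r
library(vtree)

# Vertical two-variable tree with the plainer color scheme
tree <- vtree(FakeData, "Severity Sex", horiz = FALSE, plain = TRUE)
tree
```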
At the top, the root node represents the entire data frame. Moving down, each subsequent level of the tree corresponds to a different variable (first `Severity`, then `Sex`). Within each level, each node represents the subset of its parent node where the variable has a specific value. For example, the level for `Severity` has nodes Mild, Moderate, Severe, and NA (which represents missing values). Displayed in each node is the number of observations and (except in the NA node) the conditional percentage, i.e. the number of observations in the node expressed as a percentage of the observations in its parent node.
By default, “valid percentages” are shown, i.e. the denominator is the total number of non-missing values. In the case of `Severity`, there are 6 missing values, so the denominator is 46 - 6, or 40. There are 19 Mild cases, and 19/40 = 0.475, so the percentage shown is 48%. No percentage is shown in the NA node since missing values are not included in the denominator.
Alternatively, if you don’t want to use valid percentages, specify `vp=FALSE`, and the denominator will be the total number of observations, including any missing values. In this case, a percentage is shown in each of the nodes, including any NA nodes.
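The two denominators can be compared with a few lines of base R arithmetic. This sketch uses only the counts quoted in the text and does not call `vtree`.

```r
# Counts taken from the text: 46 patients, 6 with missing Severity, 19 Mild
n_total   <- 46
n_missing <- 6
n_mild    <- 19

# Valid percentage (the default): missing values are excluded from the denominator
n_valid   <- n_total - n_missing
valid_pct <- round(100 * n_mild / n_valid)

# Percentage with vp=FALSE: the denominator includes missing values
total_pct <- round(100 * n_mild / n_total)

cat("valid:", valid_pct, "%  overall:", total_pct, "%\n")
```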
If you don’t wish to see percentages, specify `showpct=FALSE`, or if you don’t need to see counts, specify `showcount=FALSE`.
To include a legend, specify `showlegend=TRUE`. Next to each level of the tree, the variable name is displayed together with color discs and the values they correspond to. For each of the values, overall (marginal) counts are shown, together with percentages.
When the legend is shown, the node labels become redundant, since the colors identify the values of the variables (although the labels may aid readability). If you prefer, you can hide the node labels by specifying `shownodelabels=FALSE`:
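For example, a legend without node labels could be requested as follows (a sketch, assuming the vtree package and `FakeData`):

```r
library(vtree)

# Show the legend and rely on colors instead of node labels
tree <- vtree(FakeData, "Severity Sex",
              showlegend = TRUE, shownodelabels = FALSE)
tree
```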
The legend shows how colors are assigned to the different values of each variable, and additionally provides marginal (that is, overall) counts and percentages for each variable. Since `Severity` is the first variable in the tree (i.e., it is not nested within another variable), the marginal counts and percentages for `Severity` are identical to those displayed in the nodes. In contrast, for `Sex`, the marginal counts and percentages are different from what is shown in the nodes because the nodes for `Sex` are nested within levels of `Severity`.
(Unfortunately the NA circle in the legend is oddly sized and positioned due to an issue with the corresponding unicode symbols.)
When a variable tree is large, it can be difficult to display it in a readable way. One approach that helps is to display the tree horizontally and also to put the node labels on the same line as the counts and percentages by specifying `sameline=TRUE`. For example, the following results in nodes with single-line labels such as Moderate, 16 (40%), etc.:
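A sketch of such a call, assuming the vtree package and `FakeData` (the horizontal layout is the default, so `horiz` is not set):

```r
library(vtree)

# Put each node label on the same line as its count and percentage
tree <- vtree(FakeData, "Severity Sex", sameline = TRUE)
tree
```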
By default, next to each level of the tree, `vtree` shows the variable name. These can be removed by specifying `showvarnames=FALSE`.
By default, `vtree` wraps text onto the next line whenever a space occurs after at least 20 characters. This can be adjusted, for example, to 15 characters, by specifying `splitwidth=15`. To disable line splitting, specify `splitwidth=Inf`. Text wrapping in the legend is controlled independently. To set the splitting in the legend to 8 characters, specify `lsplitwidth=8`. Also note that in the legend, text wrapping can take place not only at spaces, but also at any of the following characters: `.` `-` `+` `_` `=` `/`
This concludes the mini-tutorial. The rest of this vignette details the many features of `vtree`. There is also a section of examples using data from the built-in R datasets package.
Pruning a tree means removing specified nodes along with their descendants. This is useful when a variable tree gets too big, or when you are only interested in certain parts of the tree. For convenience, `vtree` provides several different ways to prune a tree, described below.
`prune` parameter
Suppose you don’t want the tree to include individuals whose disease is Mild or Moderate. You can use the `prune` parameter to remove those nodes, and all of their descendants.
The `prune` parameter is specified as a list with an element named for each variable you wish to prune. In the example below the list has one element, named `Severity`. That element in turn is a vector `c("Mild","Moderate")` indicating the values to prune.
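Putting the pieces together, the call might look like this (a sketch, assuming the vtree package and `FakeData`):

```r
library(vtree)

# Remove the Mild and Moderate nodes of Severity, along with their descendants
tree <- vtree(FakeData, "Severity Sex",
              prune = list(Severity = c("Mild", "Moderate")))
tree
```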
Caution: Once a variable tree has been pruned, it is no longer complete. This can sometimes be confusing since not all observations are present at some levels of the tree. It is particularly important to avoid pruning missing value nodes, since this makes it hard to interpret “valid” percentages (i.e. percentages calculated using the number of non-missing observations as denominator).
`prunebelow` parameter
A disadvantage of the `prune` parameter is that in the resulting tree, the counts shown in child nodes may not add up to the counts shown in the parent node. For example, in the variable tree above, of a total of 46 patients, 5 have Severe disease and `Severity` is unknown for 6. One might wonder what happened to the other 35 patients.
An alternative is to prune below the specified nodes. In this case, this means that the Mild and Moderate nodes will be shown, but not their descendants.
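A sketch of pruning below the same two nodes, assuming `prunebelow` takes the same list format as `prune`:

```r
library(vtree)

# Keep the Mild and Moderate nodes themselves, but drop everything beneath them
tree <- vtree(FakeData, "Severity Sex",
              prunebelow = list(Severity = c("Mild", "Moderate")))
tree
```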
`keep` and `follow` parameters
Instead of specifying the nodes that should be discarded, sometimes it is more convenient to specify the nodes that should be retained. The `keep` parameter is used to specify nodes that should not be pruned (all other nodes at that level of the tree will be pruned). The `follow` parameter is like the `keep` parameter except that no nodes at that level of the tree will be pruned. Instead, those nodes that are not “followed” will be pruned at the next level.
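The two parameters can be contrasted with a pair of calls like the following. This is a sketch that assumes both take the same list format as `prune`; the choice of the Moderate node is arbitrary.

```r
library(vtree)

# keep: retain the Moderate node at the Severity level, pruning the others
tree_keep <- vtree(FakeData, "Severity Sex",
                   keep = list(Severity = "Moderate"))

# follow: show every Severity node, but only continue below Moderate
tree_follow <- vtree(FakeData, "Severity Sex",
                     follow = list(Severity = "Moderate"))
tree_follow
```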
By default, `vtree` labels variables and nodes exactly as they are in the data frame. For presentation purposes it is often useful to change these labels.
`labelvar` parameter
If `Severity` in fact represents severity on day 1, you might want it to appear that way in the variable tree. To do this, use the `labelvar` parameter, which is specified as a vector whose element names are variable names. For example, if `Severity` represents initial severity, you can specify `labelvar=c(Severity="Initial severity")`.
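In a complete call this might look as follows (a sketch, assuming the vtree package and `FakeData`):

```r
library(vtree)

# Display "Initial severity" in place of the variable name Severity
tree <- vtree(FakeData, "Severity Sex", horiz = FALSE,
              labelvar = c(Severity = "Initial severity"))
tree
```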
`labelnode` parameter
By default, `vtree` labels nodes (except for the root node) using the values of the variable in question. (If the variable is a factor, the levels of the factor are used.) Sometimes it is convenient to instead specify custom labels for nodes. You can use the `labelnode` argument to relabel the values. For example, you might want to use “Male” and “Female” instead of “M” and “F”. The `labelnode` argument is specified as a list whose element names are variable names. To substitute `New label` for `Old label`, the syntax is: `"New label"="Old label"`. Thus the full specification is: `labelnode=list(Sex=c(Male="M",Female="F"))`.
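A sketch of a full call using this specification follows. The `Group` variable is taken from the later examples in this vignette, and the layout settings are arbitrary.

```r
library(vtree)

# Relabel the values of Sex: "M" becomes "Male" and "F" becomes "Female"
tree <- vtree(FakeData, "Group Sex", horiz = FALSE,
              labelnode = list(Sex = c(Male = "M", Female = "F")))
tree
```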
`tlabelnode` parameter
Suppose in the example above that `Group` A represents children and `Group` B represents adults. In `Group` A, we would like to use the labels “girl” and “boy”, while in `Group` B we would like to use “woman” and “man”. The `labelnode` parameter cannot handle this situation because the values of `Sex` need to be labeled differently in different branches of the tree. The `tlabelnode` parameter allows “targeted” node labels.
vtree(FakeData,"Group Sex",horiz=FALSE,
labelnode=list(Group=c(Child="A",Adult="B")),
tlabelnode=list(
c(Group="A",Sex="F",label="girl"),
c(Group="A",Sex="M",label="boy"),
c(Group="B",Sex="F",label="woman"),
c(Group="B",Sex="M",label="man")))
Graphviz, the open source graph visualization software that provides the basis for `vtree`, supports a variety of text formatting (including boldface, colors, etc.). This is used in `vtree` to control formatting of text such as node labels.
NOTE: The section after this one shows how to use an easy alternative to HTML-like codes.
Graphviz implements “HTML-like labels”, including:
- `<BR/>` means insert a line break
- `<BR ALIGN='LEFT'/>` means make the preceding line left-justified and insert a line break
- `<I> ... </I>` means display text in italics
- `<B> ... </B>` means display text in bold
- `<SUP> ... </SUP>` means display text in superscript, but note that the font size does not change
- `<SUB> ... </SUB>` means display text in subscript, but again note that the font size does not change
- `<FONT POINT-SIZE='10'> ... </FONT>` means set font to 10 point
- `<FONT FACE='Times-Roman'> ... </FONT>` means set font to Times-Roman
- `<FONT COLOR='red'> ... </FONT>` means set font to red
See https://www.graphviz.org/doc/info/shapes.html#html for more details.
Note: To use these HTML-like codes, it is necessary to specify `HTMLtext=TRUE`.
By default, the `vtree` package uses markdown-style codes for text formatting.
- `\n` means insert a line break
- `\n*l` means make the preceding line left-justified and insert a line break
- `*...*` means display text in italics
- `**...**` means display text in bold
- `^...^` means display text in superscript (using 10 point font)
- `~...~` means display text in subscript (using 10 point font)
- `%%red ...%%` means display text in red (or whichever color is specified)
`text` parameter
Suppose you wish to add the italicized text “Excluding new diagnoses” to any Mild nodes in the tree. The parameter `text` lets you add text to nodes. It is specified as a list with an element named for each variable. In the example below the list has one element, named `Severity`. That element in turn is a vector `c(Mild="\n*Excluding\nnew diagnoses*")` indicating that the Mild node should include additional text using Markdown-style formatting (i.e. there is a line break and the asterisks around the text indicate that it should be displayed in italics):
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
text=list(Severity=c(Mild="\n*Excluding\nnew diagnoses*")))
`ttext` parameter
In the example above, suppose that new diagnoses are only excluded from Mild cases in `Group` B. The `text` parameter, however, adds text to all Mild nodes, so in situations like this it is not sufficient. Instead, you can use the `ttext` parameter to target exactly which nodes should have the specified text.
The `ttext` parameter requires that you specify the full path from the root of the tree to the node in question, along with the text in question. The `ttext` parameter is specified as a list so that multiple targeted text strings can be specified at once. For example:
vtree(FakeData,"Group Severity",horiz=FALSE,showvarnames=FALSE,
ttext=list(
c(Group="B",Severity="Mild",text="\n*Excluding\nnew diagnoses*"),
c(Group="A",text="\nSweden"),
c(Group="B",text="\nNorway")))
For convenience, vtree allows you to specify the variable names in a single character string (with the variable names separated by whitespace). If, however, any of the variable names have internal spaces, the variable names must be specified as a vector of character strings.
Additionally, there are several modifiers that can be used to change the way that variables are represented in a tree.
`is.na:`
If an individual variable name is preceded by `is.na:`, that variable will be replaced by a missing value indicator in the variable tree. (This differs from the `check.is.na` parameter, described below, which is used to replace all of the specified variables with missing value indicators.)
`stem:`
In datasets exported from REDCap, checkboxes are represented using multiple variables. The `stem:` prefix makes it easier to work with them. This is described in the section on REDCap checkboxes later in this vignette.
`tri:`
The `tri:` prefix is useful for identifying values of a numeric variable that are extreme compared to the other values in a node. Note: Unlike other variable specifications, which take effect at the level of the entire data frame, the `tri:` prefix takes effect within each node.
The effect of this variable specification is to trichotomize a numeric variable within a node based on the median and the IQR, both computed within that node. The result is three categories:
- “mid”: values within plus or minus 1.5×IQR of the median,
- “high”: values more than 1.5×IQR above the median,
- “low”: values more than 1.5×IQR below the median.
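The trichotomization rule can be illustrated in base R. The helper `trichotomize()` below is hypothetical and written only to mirror the rule as stated; it is not vtree's internal code, and its use of R's default quantile method for the IQR is an assumption.

```r
# Illustrative helper (hypothetical, not part of vtree): classify values of x
# relative to the median and IQR of x, following the rule described above.
trichotomize <- function(x) {
  med    <- median(x, na.rm = TRUE)
  spread <- 1.5 * IQR(x, na.rm = TRUE)
  out <- rep("mid", length(x))
  out[which(x > med + spread)] <- "high"
  out[which(x < med - spread)] <- "low"
  out[is.na(x)] <- NA
  out
}

# Made-up values standing in for the observations within a single node
scores <- c(1, 2, 3, 4, 5, 100)
result <- trichotomize(scores)
print(result)
```

Because the median and IQR are computed from the values passed in, applying the helper separately to each node's subset reproduces the "within each node" behavior described above.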
`variable=value`
When a variable takes on a large number of different values, it will result in a very large variable tree. One solution is to prune the tree, for example by keeping just one node. An alternative is to specify the value of the variable that is of primary interest. The result will be to dichotomize the variable at that value. For example, if `Severity=Mild` is specified, the `Severity` variable will be dichotomized between `Mild` and `Not Mild`.
`variable<value`, `variable>value`
These two specifications are used to dichotomize a numeric variable, splitting above and below a specified value. This can be useful for identifying subsets with extreme values.
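As a sketch, the two kinds of dichotomization could be combined in one variable specification. The threshold of 10 for `Score` is an arbitrary choice made for illustration.

```r
library(vtree)

# Dichotomize Severity at "Mild", then split the numeric Score at 10
tree <- vtree(FakeData, "Severity=Mild Score>10")
tree
```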
It is often useful to display information about other variables (apart from those that define the tree) in the nodes of a variable tree. This is particularly useful for numeric variables, which generally cannot be used to build the tree since they have too many distinct values. For example, we might wish to display the mean age for individuals in each node. Or we might wish to list the ID numbers for individuals in each node. The `summary` argument can be used to flexibly specify additional information to display.
The `summary` parameter is specified as a character string that starts with the name of the variable for which a summary is desired. This is followed by a space, and then the rest of the string specifies what to display. Special codes (see table below) are used to indicate the type of summary desired, control which nodes display the summary, etc. For example, `%mean%` indicates that the mean of the specified variable should be shown. Thus to display the mean of the numeric variable `Score`, you could specify `summary="Score \nmean score: %mean%"`. Note that the part of the string following the first space is `"\nmean score: %mean%"`. This specifies that in each node, after the usual frequency and percentage, the summary should start on a new line with the words “mean score:” followed by the mean.
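In a complete call, this summary specification might be used as follows (a sketch, assuming the vtree package and `FakeData`; the layout settings are arbitrary):

```r
library(vtree)

# Show the mean of Score on a new line in every node of the Severity tree
tree <- vtree(FakeData, "Severity", horiz = FALSE,
              summary = "Score \nmean score: %mean%")
tree
```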
The following codes can be used to show summary information:
code | result |
---|---|
`%mean%` | mean |
`%SD%` | standard deviation |
`%min%` | minimum |
`%max%` | maximum |
`%pX%` | Xth percentile (e.g. p50 means the 50th percentile) |
`%median%` | median, i.e. p50 |
`%IQR%` | IQR, i.e. p25, p75 |
`%npct%` | frequency and percentage of a logical variable. By default “valid percentages” are used. Any missing values are also reported. |
`%pct%` | same as `%npct%` but percentage only (with no parentheses). |
`%list%` | list of individual values, separated by commas |
`%listlines%` | list of individual values, each on a separate line |
`%mv%` | the number of missing values |
`%v%` | the name of the variable |
The `summary` argument can use any number of these codes, mixed with text and formatting codes.
Sometimes it is useful to display summary information for more than one variable. To do this, specify `summary` as a vector of character strings:
vtree(FakeData,"Severity",horiz=FALSE,showvarnames=FALSE,splitwidth=Inf,sameline=TRUE,
summary=c(
"Score \nScore: mean (SD) %mean% (%SD%)",
"Pre \nPre: range %min%, %max%"))
Suppose we want to know the proportion of patients with a viral infection within each severity level. We could display a variable tree for the variables `Severity` and `Viral`. But that would show a separate node for TRUE and FALSE values of `Viral`, and we only need to know the percentage for the TRUE values. If what we’re looking for is simply the number and percentage of patients with viral infection in each severity group, the `%npct%` code can be used. This results in a simpler tree:
vtree(FakeData,"Severity",summary="Viral \nViral %npct%",horiz=FALSE,
showvarnames=FALSE,sameline=TRUE)
Note that in each node, “mv” indicates the number of missing values (if any).
The `%pct%` code is the same as the `%npct%` code except that it does not show the frequency, only the percentage (without parentheses).
It is sometimes convenient to see individual values of a variable in each node. For example, it is often convenient to see ID numbers. To do this, use the `%list%` code. By default this information will be displayed in each node. When a value occurs more than once in the subset, it will be followed by a count of the number of repetitions in parentheses. The `%list%` code separates values by commas. Alternatively, the `%listlines%` code can be used to put each value on a new line.
When there are many IDs, it is often convenient to truncate the output. The `%trunc=N%` code specifies that, after N characters, summary information should be truncated with “…”.
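A sketch combining `%list%` with truncation follows. It assumes that `FakeData` contains an identifier column named `id`, and the truncation length of 40 is arbitrary.

```r
library(vtree)

# List the ids in each node, truncating the list after 40 characters
# (assumption: FakeData has an identifier column called id)
tree <- vtree(FakeData, "Severity Sex",
              summary = "id \nid = %list%%trunc=40%")
tree
```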
`%noroot%`, `%leafonly%`, `%var=v%`, and `%node=n%` codes
By default, summary information is shown in all nodes. However, it may also be convenient to only show it in certain nodes. The following codes are available:
code | summary information restricted to: |
---|---|
`%noroot%` | all nodes except the root |
`%leafonly%` | leaf nodes |
`%var=v%` | nodes of variable v |
`%node=n%` | nodes named n |
As in the specification of variables for structuring a variable tree, there are special ways to specify variables in the `summary` parameter. For example, if we wish to know the proportion of patients in each node whose `Category` is single, we specify `Category=single` in the `summary` argument.
Continuous variables such as `Score` can be dichotomized using notation such as `Score>10` or `Score<20`.
`runsummary` parameter
Sometimes it is desirable to show summaries only in nodes with certain characteristics. Consider the following example: suppose that in patients with a viral infection, the presence or absence of two features of the virus (Feature 1 and Feature 2) is recorded. To simulate this situation, let’s make up variables `Feature1` and `Feature2` but set them to NA if the `Viral` variable is FALSE or NA.
set.seed(1234)
FakeData$Feature1 <- rbinom(nrow(FakeData),1,0.5)
FakeData$Feature1[!FakeData$Viral | is.na(FakeData$Viral)] <- NA
FakeData$Feature2 <- rbinom(nrow(FakeData),1,0.5)
FakeData$Feature2[!FakeData$Viral | is.na(FakeData$Viral)] <- NA
Here is a tree for `Sex` and `Category` showing three summaries in each leaf node: the number and percent of patients with viral infections and number and percent of viruses with Feature 1 and with Feature 2.
vtree(FakeData,"Sex Category",sameline=TRUE,splitwidth=150,varminwidth=c(Category=6),
summary=c(
"Viral \nviral: %npct%%leafonly%",
"Feature1 , F1: %npct%%leafonly%",
"Feature2 , F2: %npct%%leafonly%"))
But note that in two of the nodes there are no viral infections, so in these nodes there are no virus features to describe. The tree would be improved if we could turn off the summaries in nodes where there are no viral infections. We can do this by specifying, for each of the three summaries, a function that returns TRUE or FALSE for each node. The first function, `function(x) TRUE`, always returns TRUE.
The next two functions, each written as `function(x) any(x$Viral,na.rm=TRUE)`, return TRUE if there is at least one viral infection in the node.
These three functions are specified (in a list) using the `runsummary` parameter:
vtree(FakeData,"Sex Category",sameline=TRUE,splitwidth=150,varminwidth=c(Category=6),
summary=c(
"Viral \nviral: %npct%%leafonly%",
"Feature1 , F1: %npct%%leafonly%",
"Feature2 , F2: %npct%%leafonly%"),
runsummary=list(
function(x) TRUE,
function(x) any(x$Viral,na.rm=TRUE),
function(x) any(x$Viral,na.rm=TRUE)))
Each leaf node in a tree provides the frequency of a particular pattern (combination) of values of the variables. For example, in a variable tree for `Severity` and `Sex`, the leaf nodes correspond to Mild F, Mild M, Moderate F, Moderate M, etc. If these patterns themselves are used as the first variable in a tree, then the tree will be “detangled”; that is, each branch of the tree will represent a unique pattern. A “pattern tree” can be easily produced by specifying `pattern=TRUE`:
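A sketch of such a call, assuming the vtree package and `FakeData`:

```r
library(vtree)

# One branch per observed combination of Severity and Sex
tree <- vtree(FakeData, "Severity Sex", pattern = TRUE)
tree
```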
Pattern trees are easier to read, but they involve a considerable loss of information, since they only represent the nth-level subsets (where n is the number of variables).
Note that, by default, the root node is not shown when `pattern=TRUE` is specified, because this simplifies the display (in fact, without the root node, it is no longer a tree!). A disadvantage of this is that the total sample size is not shown. The root node can be shown by specifying `showroot=TRUE`.
This tree has two other special characteristics. First, note that after the first level (representing `pattern`), counts and percentages are not shown, since they are not informative: by definition, all nodes within a branch have the same count. Second, note that in place of arrows, undirected line segments are shown. This is because, without percentages, arrows have no particular significance. That being said, in some cases there is a natural ordering to the variables (as is the case with longitudinal variables). To show arrows, specify `seq=TRUE` instead of `pattern=TRUE`, and a “sequence” (i.e. an ordered pattern) will be shown.
Summaries can be shown in pattern trees (using the `summary` parameter), but they only appear in the pattern node (or the sequence node if `seq=TRUE`).
When `pattern=TRUE` is specified, the variable tree looks a lot like a table. In fact a data frame containing the information from the pattern tree can be exported by specifying `ptable=TRUE`:
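A call of roughly this form would return the pattern table as a data frame (a sketch, assuming the vtree package and `FakeData`):

```r
library(vtree)

# Return the pattern information as a data frame instead of drawing a tree
tab <- vtree(FakeData, "Severity Sex", ptable = TRUE)
print(tab)
```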
## n pct Severity Sex
## 1 2 4 Severe F
## 2 3 7 <NA> F
## 3 3 7 <NA> M
## 4 3 7 Severe M
## 5 5 11 Moderate M
## 6 8 17 Mild M
## 7 11 24 Mild F
## 8 11 24 Moderate F
This compact representation may be convenient for display in a manuscript.
Summaries can be very useful in pattern tables. If a single summary is requested, it appears in the `summary_1` variable in the data frame. If additional summaries are requested they appear as `summary_2`, `summary_3`, etc.
## n pct Severity Sex summary_1 summary_2
## 1 2 4 Severe F 28.0 -0.4
## 2 3 7 <NA> F 6.3 -0.1
## 3 3 7 <NA> M 23.7 -0.9
## 4 3 7 Severe M 44.0 -0.3
## 5 5 11 Moderate M 8.2 -0.7
## 6 8 17 Mild M 6.3 0.2
## 7 11 24 Mild F 15.7 -0.4
## 8 11 24 Moderate F 21.5 0.0
A Venn diagram is used to show all intersections between several sets. When there are more than three sets, Venn diagrams are hard to construct and to read. Additionally, Venn diagrams cannot represent missing values.
One alternative is to use variable trees. In the following example, the variables `Ind1` through `Ind4` are indicators of set membership (0 = not a member of the set, 1 = member). Convenient settings for such variables are requested by specifying `Venn=TRUE`:
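A sketch of such a call, assuming the vtree package and `FakeData`:

```r
library(vtree)

# Branching tree of four set-membership indicators with Venn-style settings
tree <- vtree(FakeData, "Ind1 Ind2 Ind3 Ind4", Venn = TRUE)
tree
```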
Note that for simplicity, node labels are not shown. Instead, dark colors indicate set membership, while light colors indicate non-membership. Also, percentages are not shown by default. Specifying `showpct=TRUE` displays percentages.
In contrast to a Venn diagram, which shows all intersections, a variable tree only shows information on specific intersections, determined by the ordering of the variables.
Specifying `pattern=TRUE` produces an even simpler representation, since only the full n-way intersections (where n is the number of variables) are represented:
vtree(FakeData,"Ind1 Ind2 Ind3 Ind4",Venn=TRUE,pattern=TRUE,
palette=c(Ind1=1,Ind2=2,Ind3=3,Ind4=4))
Note that in the call to vtree above, the palette parameter is specified so that the color palettes match those of the preceding variable tree. (Otherwise, pattern gets palette number 1, then Ind1 gets palette number 2, and so on.)
Although this tree provides less information than the branching-style tree, it is more easily interpreted. This kind of tree is also useful for investigating incomplete longitudinal data.
check.is.na parameter
The check.is.na parameter is used to produce a tree that only shows whether the specified variables are missing or not. By default, pattern=TRUE is also set when check.is.na=TRUE. Whereas the variables that vtree uses to build variable trees are usually categorical, this is a situation where non-categorical variables can be used, because their missingness is represented instead of their actual values.
Specifying ptable=TRUE produces this information in a data frame:
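The calls are not reproduced in this text. A sketch of how the missingness tree and the corresponding data frame might be requested, assuming the FakeData data frame that ships with vtree and the four variables that appear in the table columns:

```r
library(vtree)

# Tree showing only whether each variable is missing or not.
na_tree <- vtree(FakeData, "Severity Age Pre Post", check.is.na = TRUE)

# The same information as a data frame, one row per missingness pattern.
na_table <- vtree(FakeData, "Severity Age Pre Post",
                  check.is.na = TRUE, ptable = TRUE)
print(na_table)
```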
##    n pct MISSING_Severity MISSING_Age MISSING_Pre MISSING_Post
## 1  1   2          not N/A         N/A         N/A      not N/A
## 2  1   2          not N/A     not N/A         N/A          N/A
## 3  1   2          not N/A     not N/A         N/A      not N/A
## 4  1   2          not N/A     not N/A     not N/A          N/A
## 5  2   4              N/A         N/A     not N/A      not N/A
## 6  4   9              N/A     not N/A     not N/A      not N/A
## 7  4   9          not N/A         N/A     not N/A      not N/A
## 8 32  70          not N/A     not N/A     not N/A      not N/A
Note that the columns n and pct represent the frequency and percentage of the total number of cases.
It may be useful to identify the ids for these patterns. Here the results are truncated to 15 characters:
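One way this might be requested is sketched below. It assumes that the %list% summary code can be combined with a %trunc=n% code that truncates the summary text. Treat the exact code name as an assumption and check it against the vtree documentation:

```r
library(vtree)

# List the ids belonging to each missingness pattern, truncated to
# 15 characters (assumed %trunc=15% code).
id_table <- vtree(FakeData, "Severity Age Pre Post",
                  check.is.na = TRUE,
                  summary = "id %list%%trunc=15%",
                  ptable = TRUE)
print(id_table)
```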
##    n pct MISSING_Severity MISSING_Age MISSING_Pre MISSING_Post          summary_1
## 1  1   2          not N/A         N/A         N/A      not N/A                124
## 2  1   2          not N/A     not N/A         N/A          N/A                118
## 3  1   2          not N/A     not N/A         N/A      not N/A                108
## 4  1   2          not N/A     not N/A     not N/A          N/A                104
## 5  2   4              N/A         N/A     not N/A      not N/A           112, 135
## 6  4   9              N/A     not N/A     not N/A      not N/A 103, 116, 126, ...
## 7  4   9          not N/A         N/A     not N/A      not N/A 105, 119, 128, ...
## 8 32  70          not N/A     not N/A     not N/A      not N/A 101, 102, 106, ...
Consider the following fictitious data about a randomized controlled trial (RCT):
##      id   eligible     randomized group        followup analyzed
## 1   001   Eligible     Randomized     B     Followed up Analyzed
## 2   002   Eligible Not randomized  <NA>            <NA>     <NA>
## 3   003   Eligible     Randomized     A Not followed up     <NA>
## 4   004   Eligible     Randomized     B     Followed up Analyzed
## 5   005   Eligible     Randomized     A     Followed up Analyzed
## 6   006 Ineligible           <NA>  <NA>            <NA>     <NA>
## 7   007   Eligible     Randomized     A     Followed up Analyzed
## 8   008 Ineligible           <NA>  <NA>            <NA>     <NA>
## 9   009   Eligible     Randomized     A     Followed up Analyzed
## 10 0010 Ineligible           <NA>  <NA>            <NA>     <NA>
## 11 0011   Eligible     Randomized     B     Followed up Analyzed
## 12 0012 Ineligible           <NA>  <NA>            <NA>     <NA>
The CONSORT diagram (http://www.consort-statement.org/) shows the flow of patients through the study, starting with those who meet eligibility criteria, then those who are randomized to each group, etc. It is easy to produce a rudimentary version of a CONSORT diagram in vtree. The key step is to prune branches for those who are not eligible, not randomized, etc. This can be done using the keep parameter:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
keep=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
Note that this does not include all of the additional information for a full CONSORT diagram (exclusion reasons and counts, as well as numbers of patients who received their allocated interventions, who discontinued intervention, and who were excluded from analysis). It does, however, provide the main flow information.
Additional information can be obtained by examining the nodes for patients in the pruned branches. The follow parameter allows that:
v7 <- vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility")
Finally, it may be useful to see the identification numbers in each node. This can be done using the summary parameter with the %list% code. Since IDs are not as useful in the root node, the %noroot% code is also specified here:
vtree(FakeRCT,"eligible randomized group followup analyzed",plain=TRUE,
follow=list(eligible="Eligible",randomized="Randomized",followup="Followed up"),
horiz=FALSE,showvarnames=FALSE,title="Assessed for eligibility",
summary="id \nid: %list% %noroot%")
In datasets exported from REDCap, checkboxes (i.e. the boxes where you select all that apply) are represented in a special way. For each item in a checklist, a separate variable is created. Suppose survey respondents were asked to select which flavors of ice cream (Chocolate, Vanilla, Strawberry) they like. Within REDCap, the variable name for this list of checkboxes is IceCream, but when the dataset is exported, individual variables IceCream___1 (representing Chocolate), IceCream___2 (Vanilla), and IceCream___3 (Strawberry) are created. When the dataset is read into R, the names of the flavors are embedded in the attributes of these variables.
vtree includes a feature designed to make REDCap checkbox variables easier to use. Instead of typing:
you can use a special syntax where stem: precedes the REDCap variable name:
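The two forms contrasted above can be sketched as follows. The data frame survey is a hypothetical stand-in for a REDCap export. Because it lacks the attributes a real export carries, the sketch passes choicechecklist=FALSE so that vtree does not try to extract choice names:

```r
library(vtree)

# Hypothetical stand-in for a REDCap export: one 0/1 column per checkbox.
survey <- data.frame(
  IceCream___1 = c(1, 0, 1, 1, 0, 1),  # Chocolate
  IceCream___2 = c(0, 1, 1, 0, 0, 1),  # Vanilla
  IceCream___3 = c(1, 1, 0, 0, 0, 1)   # Strawberry
)

# Explicit form: every checkbox variable is typed out.
explicit_tree <- vtree(survey, "IceCream___1 IceCream___2 IceCream___3")

# Stem form: vtree expands the stem to the matching checkbox variables.
stem_tree <- vtree(survey, "stem:IceCream", choicechecklist = FALSE)
```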
By default, vtree will also extract the names of the choices and create variables with those names. (This can be disabled by specifying choicechecklist=FALSE.)
An especially convenient way to display checkbox variables with vtree is:
vtree has three additional parameters to access Graphviz attributes: graphattr (for graph attributes), nodeattr (node attributes), and edgeattr (edge attributes). A full list of Graphviz attributes is available in the Graphviz documentation.
For example, two edge attributes are arrowhead, which specifies the type of arrow, and penwidth, which specifies the thickness of the edge (in points). (Note that penwidth is also a node attribute.) To draw a variable tree without any arrows on the edges and with thick edges, use:
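A sketch of such a call, assuming that edgeattr accepts a comma-separated string of Graphviz attribute settings. The penwidth value of 5 is an arbitrary choice for illustration:

```r
library(vtree)

# arrowhead=none removes the arrowheads; penwidth=5 draws thick edges.
thick_edge_tree <- vtree(FakeData, "Severity Sex",
                         edgeattr = "arrowhead=none,penwidth=5")
thick_edge_tree
```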
The minimum width of nodes (in inches) can be specified using the node attribute width:
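A sketch, assuming that nodeattr takes a Graphviz attribute string in the same way. The value of 2 inches is arbitrary:

```r
library(vtree)

# Every node is drawn at least 2 inches wide.
wide_node_tree <- vtree(FakeData, "Severity Sex", nodeattr = "width=2")
wide_node_tree
```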
vtree
Specifying getscript=TRUE lets you capture the DOT script representing a flowchart. Here is an example:
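A sketch of a call that captures a script of this kind, assuming the FakeData data frame that ships with vtree. The single variable Severity is chosen to match the script shown in this section:

```r
library(vtree)

# With getscript=TRUE, vtree returns the DOT script as text instead of
# rendering the tree.
dotscript <- vtree(FakeData, "Severity", getscript = TRUE)
cat(dotscript)
```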
digraph vtree {
graph [layout = dot, compound=true, nodesep=0.1, ranksep=0.5, fontsize=12]
node [fontname = Helvetica, fontcolor = black,shape = rectangle, color = black,margin=0.1]
rankdir=LR;
Node_L0[style=invisible]
Node_L1[label=<<FONT POINT-SIZE="18"><FONT COLOR="#DE2D26"><B>Severity </B></FONT></FONT><BR/>> shape=none margin=0]
edge[style=invis];
Node_L0->Node_L1
edge[style=solid]
Node_1->Node_2 Node_1->Node_3 Node_1->Node_4 Node_1->Node_5
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_2[label=<Mild<BR/>19 (48%)> color=black style="rounded,filled" fillcolor=<#FEE0D2> ]
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_3[label=<Moderate<BR/>16 (40%)> color=black style="rounded,filled" fillcolor=<#FC9272> ]
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_4[label=<Severe<BR/>5 (12%)> color=black style="rounded,filled" fillcolor=<#DE2D26> ]
Node_1[label=<46> color=black style="rounded,filled" fillcolor=<#EFF3FF>]
Node_5[label=<NA<BR/>6> color=black style="rounded,filled" fillcolor=<white> ]
}
If you wish to edit this code directly, it can be pasted into one of these online Graphviz editors:
The R datasets package is loaded in R by default. The following datasets can be used to illustrate vtree. Note that the variable trees generated by the commands below are not shown. The reader can try these commands to see what the variable trees look like, and experiment with many other possibilities.
The esoph data set (data from a case-control study of esophageal cancer in Ille-et-Vilaine, France) has 88 different combinations of age group, alcohol consumption, and tobacco consumption. Let’s examine the mean number of cases among patients aged 75 and older compared to the rest of the patients:
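A sketch of one way to request this, using a variable=value specification to split the age groups into 75+ versus the rest, and a %mean% summary of the ncases column:

```r
library(vtree)

# "agegp=75+" splits on whether the age group is 75+.
# The summary reports the mean of ncases in each node.
esoph_tree <- vtree(esoph, "agegp=75+", summary = "ncases %mean%")
esoph_tree
```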
The HairEyeColor data set is an array representing a contingency table (also called a crosstab or crosstabulation). Before vtree can be applied to this data set, it is necessary to convert the table of crosstabulated frequencies to a data frame of cases. For convenience, the vtree package includes a helper function to do this, called crosstabToCases. It is adapted from a function listed on the Cookbook for R website.
There are a lot of combinations, but let’s say we are especially interested in green eyes (versus non-green eyes). We can use the variable specification Eye=Green to do this:
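The conversion and the tree can be sketched as follows. The object name hec and the choice to nest Sex under eye color are illustrative:

```r
library(vtree)

# Expand the contingency table into one row per person.
hec <- crosstabToCases(HairEyeColor)

# Hair color, then green versus non-green eyes, then sex.
hec_tree <- vtree(hec, "Hair Eye=Green Sex")
hec_tree
```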
The Titanic dataset is a 4-dimensional array of counts. First, let’s convert it to a data frame of individuals:
We’ll specify sameline=TRUE so that it fits a bit better:
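A sketch of both steps. The object name titanic and the variable ordering are illustrative choices, and the tree simply ends with Survived rather than using a summary:

```r
library(vtree)

# One row per passenger or crew member.
titanic <- crosstabToCases(Titanic)

# sameline=TRUE puts counts on the same line as the node label,
# which makes the tree more compact.
titanic_tree <- vtree(titanic, "Class Sex Survived", sameline = TRUE)
titanic_tree
```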
The mtcars data set was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
The rownames of the data set contain the names of the cars. Let’s move that information into a column. To do that, we’ll make a slightly altered version of the data frame which we’ll call mt:
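This step needs only base R. A minimal sketch, in which the column name name is an illustrative choice:

```r
# Copy mtcars and move the car names from the row names into a column.
mt <- mtcars
mt$name <- rownames(mtcars)
rownames(mt) <- NULL
head(mt[, c("name", "cyl", "gear", "carb", "hp")])
```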
Now let’s look at the mean and standard deviation of horsepower (HP) by number of carburetors, nested within number of gears, and in turn nested within number of cylinders.
We might also like to list the names of cars by number of carburetors nested within number of gears:
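The two trees described above can be sketched as follows. The block rebuilds mt so that it runs on its own. The summary strings follow the "variable, newline, text with codes" form used elsewhere in this document and assume that the %SD% code is available alongside %mean%:

```r
library(vtree)

# Data frame with car names in a column (illustrative column name).
mt <- mtcars
mt$name <- rownames(mtcars)
rownames(mt) <- NULL

# Mean and SD of horsepower by carburetors within gears within cylinders.
hp_tree <- vtree(mt, "cyl gear carb",
                 summary = "hp \nmean (SD) HP %mean% (%SD%)")

# Names of cars by carburetors within gears.
names_tree <- vtree(mt, "gear carb", summary = "name \n%list%")
```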
The UCBAdmissions data set consists of aggregate data on applicants to graduate school at Berkeley for the six largest departments in 1973, classified by admission and sex. According to the data set Details, “This data set is frequently used for illustrating Simpson’s paradox, see Bickel et al. (1975). At issue is whether the data show evidence of sex bias in admission practices. There were 2691 male applicants, of whom 1198 (44.5%) were admitted, compared with 1835 female applicants of whom 557 (30.4%) were admitted.” Furthermore, “the apparent association between admission and sex stems from differences in the tendency of males and females to apply to the individual departments (females used to apply more to departments with higher rejection rates).”
First, we’ll convert the crosstab data to a data frame of cases, ucb:
Next, let’s look at admission rates by Gender, nested within department:
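A sketch of both steps. Placing Admit as the last variable is one simple way to see admission percentages within each department and gender, and is not necessarily the form used originally:

```r
library(vtree)

# One row per applicant.
ucb <- crosstabToCases(UCBAdmissions)

# Department, then gender, then admission outcome with percentages.
ucb_tree <- vtree(ucb, "Dept Gender Admit", sameline = TRUE)
ucb_tree
```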
The ChickWeight data set is from an experiment on the effect of diet on early growth of chicks. Let’s look at the mean weight of chicks at birth (0 days of age) and 4 days of age, nested within type of diet.
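One way to do this is sketched below. It subsets the data to days 0 and 4 in base R before calling vtree. The keep parameter would be an alternative way to restrict the Time nodes:

```r
library(vtree)

# Keep only the measurements at 0 and 4 days of age.
cw <- as.data.frame(ChickWeight)
cw <- cw[cw$Time %in% c(0, 4), ]

# Mean weight by time point, nested within diet.
chick_tree <- vtree(cw, "Diet Time", summary = "weight \nmean weight %mean%")
chick_tree
```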
The InsectSprays data set contains counts of insects in agricultural experimental units treated with different insecticides. Let’s look at the mean and standard deviation of those counts by insecticide.
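A sketch, assuming the %mean% and %SD% summary codes. The label text in the summary string is illustrative:

```r
library(vtree)

# Mean and SD of the insect counts for each insecticide.
spray_tree <- vtree(InsectSprays, "spray",
                    summary = "count \nmean (SD) count %mean% (%SD%)")
spray_tree
```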
The ToothGrowth data set contains the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
Let’s examine the percentage with length > 20 by dose nested within delivery method:
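One simple way to get these percentages is sketched below. It builds a derived indicator variable in base R and uses it as the last level of the tree. The name long_tooth and its labels are hypothetical, and a summary-based approach may be preferable in practice:

```r
library(vtree)

# Derived indicator for length over 20 (hypothetical variable name).
tg <- ToothGrowth
tg$long_tooth <- ifelse(tg$len > 20, "over 20", "20 or under")

# Delivery method, then dose, then the indicator with percentages.
tooth_tree <- vtree(tg, "supp dose long_tooth")
tooth_tree
```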
vtree uses the DiagrammeR package (which in turn is built on the open source graph visualization software, Graphviz).
DiagrammeR (and hence vtree) automatically renders to HTML using the htmlwidgets framework (for example, in the RStudio Viewer pane, or from R in a browser window). However, it is sometimes useful to generate a graphics file. For example, to include a variable tree in a Microsoft Word document, you need to create a PNG file. Another reason to generate a PNG file is that HTML files that use htmlwidgets can be large, and if they contain several widgets they can be slow to load. The function grVizToPNG solves this problem by converting a variable tree into a PNG file.
grVizToPNG function
Suppose you saved the output of a call to vtree to an object called example1:
You can use grVizToPNG to create a PNG file called example1.png like this:
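The two steps can be sketched as follows. The variables passed to vtree are illustrative, and the conversion is assumed to need the supporting packages for SVG and PNG conversion to be installed:

```r
library(vtree)

# Save the tree in an object. The PNG file name is derived from the
# object name, so this produces example1.png.
example1 <- vtree(FakeData, "Severity Sex")

# Convert to PNG. width is optional and raises the resolution.
grVizToPNG(example1, width = 3000)
```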
Notes:
The name of the graphics file (example1.png) is automatically derived from the name of the object (example1).
The width or height arguments can be used to override the default resolution. For example, specifying width=3000 results in a fairly high-resolution image.
Before creating the PNG file, grVizToPNG first creates an SVG file. But Microsoft Word cannot handle SVG files, which is why a PNG file must be created.
To keep things tidy, you can also specify a folder (say a subfolder of the working directory) where the PNG and SVG files will be stored. To do this, specify this argument: folder="MyFolder".
Suppose you are using R Markdown, and wish to embed the PNG image generated by calling grVizToPNG into your output (for example a Word document). If you want the image scaled to, say, 3 inches tall, add this code inline (i.e. not in a code chunk):
![](example1.png){ height=3in }
If, in your call to grVizToPNG, you specified that graphics files should be stored in a subfolder called MyFolder, use the following code:
![](MyFolder/example1.png){ height=3in }