summarytools is an R package providing tools to neatly and quickly summarize data, with functions that most R programmers once wished were included in base R. It can also make R a little easier to use for newbies. With a few lines of simple code, you can get a good look at the data at hand.
Emphasis has been put on both what is presented and how, so that the package can serve as both a data exploration tool and a reporting tool, used either on its own for minimal reports or integrated into a larger set of tools such as RStudio’s tools for rmarkdown and knitr.
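The package is built around four main functions:

- freq(): frequency tables
- ctable(): cross-tabulations
- descr(): descriptive statistics for numerical data, in the vein of fivenum() and other similar functions
- dfSummary(): data frame summaries

All summarytools objects returned by the main functions can be:

- displayed as plain text in the R console
- rendered in rmarkdown documents
- displayed as HTML in a Web browser or in RStudio’s Viewer
- written to text or HTML files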
Text-based output relies on the pander package, while html output relies on RStudio’s htmltools.
To show what default (console) outputs look like, we’ll first generate a frequency table for iris$Species.
freq(iris$Species)
Frequencies
Species
Data frame: iris
Type: Factor (unordered)
                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
---------------- ------ --------- -------------- --------- --------------
          setosa     50     33.33          33.33     33.33          33.33
      versicolor     50     33.33          66.67     33.33          66.67
       virginica     50     33.33         100.00     33.33         100.00
            <NA>      0                               0.00         100.00
           Total    150    100.00         100.00    100.00         100.00
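For comparison, the counts and valid percentages in this table can be approximated with base R alone. The following is only a sketch of the arithmetic, not the way freq() is implemented; the object and column names (freq_sketch, Pct.Valid, Pct.Valid.Cum) are made up for the illustration.

```r
# Base R sketch of some of the columns shown by freq(); not summarytools' own code.
counts <- table(iris$Species, useNA = "always")   # counts, including an <NA> entry
valid  <- counts[!is.na(names(counts))]           # keep non-missing categories only
pct_valid <- 100 * prop.table(valid)              # share of non-missing values

freq_sketch <- data.frame(
  Freq          = as.vector(valid),
  Pct.Valid     = round(as.vector(pct_valid), 2),
  Pct.Valid.Cum = round(cumsum(as.vector(pct_valid)), 2),
  row.names     = names(valid)
)
print(freq_sketch)
```

Getting the missing-value row, the totals and a readable layout takes noticeably more code than this, which is the work freq() does for you.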
To get familiar with the output styles, try different values for style= and see how results look in the console.
When using style='rmarkdown' with freq() or descr(), the generated outputs are ready for markdown rendering. With dfSummary(), options for style are “multiline” (default) and “grid”, and plain.ascii=FALSE must be used to have proper line feeds in multiline cells.
Note: In an .Rmd document with knitr, always set the chunk option results='asis':
```{r, results='asis'}
library(summarytools)
freq(tobacco$smoker, style='rmarkdown')
```
smoker
Data frame: tobacco
Type: Factor (unordered)
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
Yes | 298 | 29.80 | 29.80 | 29.80 | 29.80 |
No | 702 | 70.20 | 100.00 | 70.20 | 100.00 |
<NA> | 0 | | | 0.00 | 100.00 |
Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |
The descr() function accepts both vectors and data frames. Given a data frame, it shows statistics for all the numerical variables it contains. We’ll use one of the datasets included in the package.
data(exams)
descr(exams[ ,3:5], style='rmarkdown')
Data Frame: exams
N: 30
| | french | math | geography |
---|---|---|---|
Mean | 73.94 | 73.54 | 70.04 |
Std.Dev | 10.79 | 9.19 | 10.65 |
Min | 44.80 | 55.60 | 47.20 |
Median | 73.60 | 73.75 | 68.50 |
Max | 94.70 | 93.20 | 96.30 |
MAD | 7.56 | 9.93 | 12.31 |
IQR | 8.50 | 13.35 | 11.90 |
CV | 6.85 | 8.00 | 6.58 |
Skewness | 0.03 | 0.12 | 0.10 |
SE.Skewness | 0.43 | 0.44 | 0.43 |
Kurtosis | 0.45 | -0.58 | -0.03 |
N.Valid | 29.00 | 28.00 | 29.00 |
Pct.Valid | 96.67 | 93.33 | 96.67 |
To show variables in rows and statistics in columns instead, use transpose=TRUE:
descr(exams, style = 'rmarkdown', transpose = TRUE)
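To make the two layouts concrete, here is a rough base R sketch that computes a few of the statistics reported by descr() and shows them in both orientations. It is only an illustration under assumptions: it uses the built-in iris data rather than exams, covers a subset of the statistics, and is not how descr() works internally.

```r
# Base R sketch (not summarytools code): a few of the statistics descr() reports,
# computed for every numeric column of the built-in iris data.
num_cols <- iris[sapply(iris, is.numeric)]

stats <- sapply(num_cols, function(x) {
  c(Mean    = mean(x, na.rm = TRUE),
    Std.Dev = sd(x, na.rm = TRUE),
    Min     = min(x, na.rm = TRUE),
    Median  = median(x, na.rm = TRUE),
    Max     = max(x, na.rm = TRUE),
    MAD     = mad(x, na.rm = TRUE),
    IQR     = IQR(x, na.rm = TRUE))
})

round(stats, 2)      # statistics in rows, variables in columns
round(t(stats), 2)   # transposed: variables in rows, statistics in columns
```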
Let a few examples speak for themselves. First, a bare-bones cross-tabulation.
with(tobacco, ctable(smoker, diseased, prop = 'n', totals = FALSE))
Cross-Tabulation
smoker * diseased
Data Frame: tobacco
-------- ---------- ----- -----
           diseased   Yes    No
  smoker
     Yes              125   173
      No               99   603
-------- ---------- ----- -----
Then show proportions, by row.
with(tobacco, ctable(smoker, diseased, prop = 'r'))
Cross-Tabulation / Row proportions
smoker * diseased
Data Frame: tobacco
-------- ---------- ------------- ------------- ---------------
           diseased           Yes            No           Total
  smoker
     Yes              125 (41.9%)   173 (58.1%)    298 (100.0%)
      No               99 (14.1%)   603 (85.9%)    702 (100.0%)
   Total              224 (22.4%)   776 (77.6%)   1000 (100.0%)
-------- ---------- ------------- ------------- ---------------
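For readers who want to see what "proportions by row" means in terms of arithmetic, here is a small base R sketch. It uses the built-in mtcars data as a stand-in (an assumption made so the example runs without the tobacco data) and is not summarytools code.

```r
# Base R sketch with a stand-in dataset (mtcars); not summarytools code.
tab <- with(mtcars, table(cyl, am))             # plain counts
row_pct <- 100 * prop.table(tab, margin = 1)    # each row sums to 100

addmargins(tab)        # counts with row and column totals
round(row_pct, 1)      # row percentages
```

ctable() combines counts, proportions and totals like these in a single table.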
The type of table generated by ctable() is unfortunately not (yet) supported by rmarkdown. But we can turn to the render method to circumvent this:
crosstable <- with(tobacco, ctable(smoker, diseased))
print(crosstable, method='render', footnote = NA)
| | diseased | | |
---|---|---|---|
smoker | Yes | No | Total |
Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |
dfSummary() is the most elaborate function of the package. It incorporates elements of freq() and descr(), but goes further with its graphs (not yet supported with rmarkdown) and other attributes.
dfSummary(tobacco, style='grid', plain.ascii = FALSE, graph.col = FALSE)
tobacco
N: 1000
No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
---|---|---|---|---|---|
1 | gender [factor] | | 489 (50.0%) | 978 (97.8%) | 22 (2.2%) |
2 | age [numeric] | mean (sd) : 49.6 (18.29) | 63 distinct val. | 975 (97.5%) | 25 (2.5%) |
3 | age.gr [factor] | | 258 (26.5%) | 975 (97.5%) | 25 (2.5%) |
4 | BMI [numeric] | mean (sd) : 25.73 (4.49) | 974 distinct val. | 974 (97.4%) | 26 (2.6%) |
5 | smoker [factor] | | 298 (29.8%) | 1000 (100%) | 0 (0%) |
6 | cigs.per.day [numeric] | mean (sd) : 6.78 (11.88) | 37 distinct val. | 965 (96.5%) | 35 (3.5%) |
7 | diseased [factor] | | 224 (22.4%) | 1000 (100%) | 0 (0%) |
8 | disease [character] | | 36 (16.2%) | 222 (22.2%) | 778 (77.8%) |
9 | samp.wgts [numeric] | mean (sd) : 1 (0.08) | 0.86!: 267 (26.7%) | 1000 (100%) | 0 (0%) |
For dfSummary(), we can use the styles “multiline” (default) or “grid”. We must however specify plain.ascii=FALSE when using markdown, otherwise line feeds in multiline cells will not render properly.
Using the file= parameter with the view() or print() functions, we can redirect output into text files. Setting append=TRUE appends results to an existing text file:
my_summary <- dfSummary(tobacco)
print(my_summary, file = "tobacco.txt", style = "grid") # Creates tobacco.txt
my_stats <- descr(tobacco)
print(my_stats, file="tobacco.txt", append = TRUE) # Appends results to tobacco.txt
As you may have noticed, the style argument was used when calling the print() function. We could also have used it when calling the dfSummary() and descr() functions, in which case the style would have been stored in the objects’ properties. Using this argument with print() overrides the style that is stored in the object. It is one of several arguments that can be used that way. See the documentation for print() for details.
summarytools uses Bootstrap’s stylesheets to generate standalone HTML documents that can be displayed in a Web browser or in RStudio’s Viewer using the generic print() function:
print(dfSummary(tobacco), method = 'browser') # Displays results in default Web Browser
print(dfSummary(tobacco), method = 'viewer') # Displays results in RStudio's Viewer
view(dfSummary(tobacco)) # Same as line above -- view() is a wrapper function
Using the file= argument with an .html extension simply generates an HTML document (without opening it).
print(dfSummary(tobacco), file = '~/Documents/tobacco_summary.html')
Here is a picture of the output:
As with simple text files, you can also append additional content to existing HTML reports.
summarytools functions support the use of by(), with(), and lapply(), at least when used in moderation.
Since objects generated by those native functions have their own class (they are special lists containing summarytools objects), they are not sent to the package’s generic print method automatically. For best results, the following method is recommended: first, store the object generated by one of the native functions. Then, use view(), either with method='pander' to show results in the console, or without the method argument to see (HTML) results in the Viewer or Browser.
stats <- by(data = exams$geography, INDICES = exams$gender, FUN = descr, style = 'rmarkdown')
view(stats, method = 'pander')
geography
Data Frame: exams
Group: gender = Girl
N: 15
| | geography |
---|---|
Mean | 67.27 |
Std.Dev | 8.26 |
Min | 50.40 |
Median | 67.30 |
Max | 78.90 |
MAD | 9.34 |
IQR | 10.20 |
CV | 8.14 |
Skewness | -0.34 |
SE.Skewness | 0.58 |
Kurtosis | -0.90 |
N.Valid | 15.00 |
Pct.Valid | 100.00 |
Group: gender = Boy
N: 15
| | geography |
---|---|
Mean | 73.00 |
Std.Dev | 12.35 |
Min | 47.20 |
Median | 71.20 |
Max | 96.30 |
MAD | 11.34 |
IQR | 15.48 |
CV | 5.91 |
Skewness | -0.13 |
SE.Skewness | 0.60 |
Kurtosis | -0.48 |
N.Valid | 14.00 |
Pct.Valid | 93.33 |
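The reason the grouped result needs this extra step can be seen with base R alone. In the sketch below (which uses summary() and the built-in iris data as stand-ins, so it runs without summarytools), the value returned by by() has class "by": it is a list-like container with one element per group, and printing it dispatches on that class rather than on the class of its elements.

```r
# Base R sketch: by() wraps per-group results in an object of class "by".
res <- by(data = iris$Sepal.Length, INDICES = iris$Species, FUN = summary)

class(res)          # the container's own class, not the class of its elements
names(res)          # one element per group
res[["setosa"]]     # the result for a single group can be extracted directly
```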
There are many things you can do to build elaborate, fine-tuned reports. Let’s mention a few…

- The caption= argument
- Custom CSS can be added: you can specify custom classes for any table you generate. For instance:
view(with(tobacco, ctable(gender, smoker)),
report.title = "Summary of the tobacco sample data frame",
html.table.class = "table table-bordered table-striped table-responsive",
footnote = "Extended use of Bootstrap classes")
what.is()
When developing, we often use a number of functions to obtain an object’s properties. what.is() lumps together the results of such functions (class(), typeof(), attributes() and others).
what.is(iris)
$properties
property value
1 class data.frame
2 typeof list
3 mode list
4 storage.mode list
5 dim 150 x 5
6 length 5
7 is.object TRUE
8 object.type S3
9 object.size 7088 Bytes
$attributes.lengths
names row.names class
5 150 1
$extensive.is
[1] "is.data.frame" "is.list" "is.object" "is.recursive"
[5] "is.unsorted"
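For reference, these are some of the individual base R calls whose results appear in the output above. This is only a sketch of what one would otherwise type by hand, not a description of how what.is() is implemented.

```r
# Some of the base R calls whose results what.is() gathers in one place.
class(iris)               # "data.frame"
typeof(iris)              # "list"
mode(iris)                # "list"
dim(iris)                 # 150 rows, 5 columns
names(attributes(iris))   # attribute names only, without their contents
```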
Check the project’s page for more examples; from there you can also submit feature requests or report problems you encounter.
To install the development version of the package, use:
install.packages('devtools')
library(devtools)
install_github('dcomtois/summarytools', ref='dev-current')
The source of this document is an .Rmd file; knitr’s chunk option results has been set to 'asis', to make sure formatting is not coming from knitr itself.