Data often is not perfect. A traditional analyst workflow sees one combing through dataset, column subset by column subset, and investigating flaws and inacurracies such as outliers and missing values. datareportR attempts to simplify this manual process by providing the analyst information about all the columns in a dataset a single glance.
datareportR consists of a small but powerful
function, render_data_report()
, which takes input data and
outputs a “data report” as a pdf or HTML document. Under the hood, it
uses the powerful skimr::skim()
and
diffdf::diffdf()
functions to create the report. See below
for an example on the iris
dataset.
See the below screenshots for a the data report generated on the iris dataset, comparing it to a permuted version. The function call was as follows:
render_data_report(
df_input = iris,
df_input_old = iris_permuted,
save_rmd_dir = getwd(),
save_report_dir = getwd(),
include_skim = TRUE,
include_diffdf = TRUE,
output_format = "html"
)
datareportR can be installed from my r-universe:
install.packages("datareportR", repos = "https://bryantco.r-universe.dev")
datareportR is intended for use on medium-sized data (both in terms of rows and columns). For a graph that compares the time to render the report, see below. The x-axis is number of rows (1,000, 10,000, and 100,000), and the colors represent the numer of features (columns). On average, across a small set of 10 runs, it took a maximum overall of nearly 2 minutes to render the report for a dataset with 100,000 rows and 300 features.
In my personal use, I have used the package, with no issues, to create a data report for a dataset topping out at 85 million rows.
Benchmarks were performed on my personal machine, which has 32 GB of
RAM. These benchmarks are suggestive, but the time to run
datareportR
on your own computer might be different.