Vroom Benchmarks

vroom is a new approach to reading delimited and fixed width data into R.

It stems from the observation that when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

Therefore, you can obtain very rapid input by first performing a fast indexing step and then using the Altrep framework, available in R 3.5 and later, to access the values in a lazy / delayed fashion.

How it works

The initial reading of the file simply records the location of each individual record; the actual values are not read into R. Altrep vectors are created for each column in the data, and each holds a pointer to the index and the memory mapped file. When these vectors are indexed, the value is read from the memory mapping.
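
The idea can be illustrated with a toy sketch in base R. This is not vroom's implementation: a raw vector stands in for the memory mapped file, the index only records line boundaries, and read_field() is a hypothetical helper that parses a single value on request.

```r
# Toy illustration of "index first, parse later" (NOT vroom's implementation).
lines <- c("id,payment_type,fare", "1,CSH,6.5", "2,CRD,12", "3,UNK,5.5")
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")   # binary mode so every line ends in a single "\n"
writeLines(lines, con)
close(con)

# Stand-in for the memory mapped file: the raw bytes of the file.
raw_bytes <- readBin(path, what = "raw", n = file.size(path))

# Indexing step: record where each line starts and ends, parse nothing.
newlines <- which(raw_bytes == as.raw(0x0a))            # 1-based newline positions
line_starts <- c(0, newlines[-length(newlines)])        # 0-based start of each line

# Lazy access: convert and split only the one record that is asked for.
read_field <- function(row, col) {
  first <- line_starts[row + 1] + 1   # row + 1 skips the header line
  last <- newlines[row + 1] - 1       # last byte before the newline
  record <- rawToChar(raw_bytes[first:last])
  strsplit(record, ",", fixed = TRUE)[[1]][col]
}

read_field(3, 2)   # parses only the third data record
```

Nothing is converted to an R string until read_field() is called, which mirrors why the initial read can be cheap while later access pays the parsing cost.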

This means initial reading is extremely fast; in the real world dataset below it takes roughly 1/6 the time of the multi-threaded data.table::fread() (about 1.4s versus 8.6s). Sampling operations are likewise extremely fast, as only the data actually included in the sample is read. This means things like the tibble print method, head(), tail(), x[sample(), ] etc. have very low overhead. Filtering can also be fast: only the columns included in the filter selection have to be fully read, and only the data in the filtered rows needs to be read from the remaining columns. Grouped aggregations likewise only need to read the grouping variables and the variables being aggregated.

Once a particular vector is fully materialized the speed for all subsequent operations should be identical to a normal R vector.

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once, it can be efficiently queried and subset.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading, the benchmarks also do some manipulation of the data afterwards to try to provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others, they are excluded from the plots but retained in the tables.

Taxi Trip Dataset

This real world dataset is the 2013 Freedom of Information Law (FOIL) Taxi Trip Data from the NYC Taxi and Limousine Commission, originally posted at https://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table, trip_fare_1.csv, is 1.55G in size.

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

code: bench/taxi

All benchmarks were run on an Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume.

The benchmarks labeled vroom_base use vroom with base functions for manipulation. vroom_dplyr uses vroom to read the file and dplyr functions to manipulate it. data.table uses fread() to read the file and data.table functions to manipulate it, and readr uses readr to read the file and dplyr to manipulate it. By default vroom only uses Altrep for character vectors; these runs are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types, and vroom(altrep: none) disables Altrep entirely.
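
As a hedged sketch, the three Altrep settings can be requested through the altrep argument of vroom(). The mapping of argument values to the benchmark labels is an assumption made here for illustration, and the tiny file is made up; the actual scripts in bench/taxi may set this differently.

```r
library(vroom)

# Tiny hypothetical file standing in for the taxi data
path <- tempfile(fileext = ".csv")
writeLines(c("payment_type,fare_amount", "CSH,6.5", "CRD,12", "UNK,5.5"), path)

# Assumed equivalent of "altrep: normal": Altrep for character columns only
normal <- vroom(path, delim = ",", col_types = "cd", altrep = "chr")

# Assumed equivalent of "altrep: full": Altrep for every supported type
full <- vroom(path, delim = ",", col_types = "cd", altrep = TRUE)

# Assumed equivalent of "altrep: none": ordinary R vectors, parsed up front
none <- vroom(path, delim = ",", col_types = "cd", altrep = FALSE)

# The values are the same whichever setting is used; only when they are
# parsed differs.
identical(normal$payment_type, none$payment_type)
all(full$fare_amount == none$fare_amount)
```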

The following operations are performed.

  • The data is read
  • print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
  • head()
  • tail()
  • Sampling 100 random rows
  • Filtering for “UNK” payment type, which matches 6,434 rows (0.0435% of the total).
  • Aggregation of mean fare amount per payment type.
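
As a rough sketch, the operations listed above look like this when vroom is paired with base functions. The file, its contents, and the object names are made up for illustration; the actual benchmark scripts live in bench/taxi and may differ.

```r
library(vroom)

# Small hypothetical stand-in for trip_fare_1.csv
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    payment_type = rep(c("CSH", "CRD", "UNK"), length.out = 300),
    fare_amount = rep(c(6, 10, 20), length.out = 300)
  ),
  path,
  row.names = FALSE
)

x <- vroom(path, delim = ",", col_types = "cd")       # read
print(x)                                              # print
head(x)                                               # head
tail(x)                                               # tail
smp <- x[sample(nrow(x), 100), ]                      # sample 100 random rows
unk <- x[x$payment_type == "UNK", ]                   # filter for "UNK" payment
agg <- tapply(x$fare_amount, x$payment_type, mean)    # mean fare per payment type
```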

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
vroom            base                  TRUE    6.35GB   1.4s      155ms  2ms   1ms   1ms     13.8s     1m 27.8s   1m 43.1s
vroom            dplyr                 TRUE    6.41GB   1.3s      83ms   2ms   1ms   13ms    14.3s     51.2s      1m 6.9s
read.delim       base                  -       6.2GB    1m 4.7s   5ms    1ms   1ms   1ms     307ms     732ms      1m 5.8s
readr            dplyr                 -       5.18GB   26s       96ms   3ms   1ms   16ms    195ms     394ms      26.7s
vroom            dplyr                 FALSE   4.92GB   15.2s     93ms   2ms   1ms   14ms    829ms     1s         17.2s
data.table       data.table            -       4.69GB   8.6s      24ms   1ms   1ms   1ms     219ms     1s         9.9s

(N.B. Rcpp used in the dplyr implementation fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases has additional overhead when using full Altrep.)

All numeric data

All numeric data is really a worst case scenario for vroom. The index takes about as much memory as the parsed data. Also, because parsing doubles can be done quickly in parallel and text representations of doubles are at most ~25 characters, there isn’t a great deal of savings from delayed parsing.

For these reasons (and because the data.table implementation is very fast), vroom is a bit slower than fread for pure numeric data.

However, because vroom is multi-threaded, it is a bit quicker than readr and read.delim for this type of data.

Long

code: bench/all_numeric-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
read.delim       base                  -       4.79GB   2m 1.3s   1.5s   1ms   1ms   2ms     4.7s      37ms       2m 7.5s
readr            dplyr                 -       2.82GB   13.2s     64ms   2ms   1ms   16ms    18ms      58ms       13.3s
vroom            dplyr                 FALSE   2.75GB   1.3s      49ms   1ms   1ms   15ms    19ms      49ms       1.5s
vroom            base                  FALSE   2.69GB   1.3s      50ms   1ms   1ms   3ms     6ms       57ms       1.4s
vroom            dplyr                 TRUE    3.29GB   563ms     65ms   2ms   1ms   15ms    36ms      448ms      1.1s
vroom            base                  TRUE    3.28GB   557ms     57ms   1ms   1ms   3ms     29ms      466ms      1.1s
data.table       data.table            -       2.72GB   269ms     14ms   1ms   1ms   4ms     6ms       26ms       316ms

Wide

code: bench/all_numeric-wide

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
read.delim       base                  -       14.42GB  9m 54s    141ms  7ms   7ms   10ms    77ms      5ms        9m 54.3s
readr            dplyr                 -       5.46GB   57.9s     99ms   3ms   3ms   27ms    18ms      41ms       58s
vroom            base                  FALSE   5.34GB   6.1s      66ms   3ms   3ms   6ms     6ms       7ms        6.2s
vroom            dplyr                 FALSE   5.34GB   5.8s      64ms   3ms   3ms   107ms   15ms      40ms       6s
vroom            dplyr                 TRUE    7.25GB   1.5s      83ms   4ms   4ms   23ms    21ms      99ms       1.7s
vroom            base                  TRUE    7.25GB   1.5s      71ms   5ms   17ms  5ms     9ms       65ms       1.7s
data.table       data.table            -       5.48GB   1.3s      128ms  1ms   1ms   4ms     4ms       4ms        1.4s

All character data

All character data is a best case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

Long

code: bench/all_character-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
read.delim       base                  -       4.47GB   1m 49.6s  8ms    1ms   1ms   2ms     28ms      307ms      1m 50s
readr            dplyr                 -       4.35GB   1m 6.3s   95ms   2ms   1ms   17ms    21ms      228ms      1m 6.7s
vroom            dplyr                 FALSE   4.3GB    51.8s     50ms   2ms   1ms   16ms    21ms      152ms      52s
data.table       data.table            -       4.73GB   39s       15ms   1ms   1ms   4ms     18ms      150ms      39.2s
vroom            base                  TRUE    3.22GB   566ms     47ms   2ms   1ms   3ms     421ms     3.7s       4.8s
vroom            dplyr                 TRUE    3.21GB   599ms     56ms   1ms   1ms   15ms    399ms     1.5s       2.6s

Wide

code: bench/all_character-wide

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
read.delim       base                  -       13.06GB  9m 19.1s  169ms  7ms   7ms   25ms    210ms     55ms       9m 19.6s
readr            dplyr                 -       12.21GB  5m 52.6s  207ms  4ms   3ms   29ms    36ms      54ms       5m 52.9s
vroom            dplyr                 FALSE   12.14GB  4m 17.1s  66ms   3ms   3ms   28ms    36ms      38ms       4m 17.2s
data.table       data.table            -       12.64GB  2m 54.3s  139ms  2ms   2ms   28ms    162ms     14ms       2m 54.7s
vroom            base                  TRUE    6.57GB   1.4s      60ms   5ms   4ms   5ms     74ms      383ms      1.9s
vroom            dplyr                 TRUE    6.57GB   1.4s      61ms   5ms   4ms   39ms    82ms      175ms      1.7s

Reading multiple delimited files

code: bench/taxi_multiple

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns for a total file size of 18.4G.
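
A minimal sketch of reading several files in one call, assuming they share the same columns. The directory, file names, and contents below are hypothetical stand-ins for the 12 trip fare files.

```r
library(vroom)

# Two small hypothetical files with identical columns
dir <- tempfile("fares")
dir.create(dir)
for (i in 1:2) {
  writeLines(
    c("payment_type,fare_amount", "CSH,6.5", "CRD,12"),
    file.path(dir, sprintf("trip_fare_%d.csv", i))
  )
}

files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# A character vector of paths is read into a single data frame
all_fares <- vroom(files, delim = ",", col_types = "cd")
nrow(all_fares)
```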

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
vroom            base                  TRUE    88.3GB   19.8s     3s     1ms   1ms   1ms     6m 50.2s  35m 30.2s  42m 43.2s
vroom            dplyr                 TRUE    88GB     35.3s     2.6s   1ms   1ms   13ms    6m 43s    15m 17.2s  22m 38.1s
readr            dplyr                 -       63.5GB   8m 33s    817ms  1ms   1ms   15ms    4.2s      13.4s      8m 51.5s
vroom            dplyr                 FALSE   63.1GB   3m 51.5s  2.2s   2ms   1ms   14ms    11s       7.3s       4m 12s
data.table       data.table            -       59.5GB   1m 41.3s  7ms    1ms   1ms   1ms     1.1s      4.6s       1m 46.9s

Reading fixed width files

United States Census 5-Percent Public Use Microdata Sample files

This fixed width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://web.archive.org/web/20150908055439/https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state, and the state of California was used in this benchmark.

The data totals 2,342,339 rows and 37 columns with a total file size of 677M.
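
For orientation, a fixed width file can be read with vroom_fwf() and a column specification such as fwf_widths(). The widths, column names, and records below are invented for the sketch and do not reflect the actual PUMS record layout.

```r
library(vroom)

# Hypothetical fixed width records: 2-char state, 5-digit serial, 1-char sex
path <- tempfile(fileext = ".txt")
writeLines(c("CA00123M", "CA00456F"), path)

x <- vroom_fwf(
  path,
  fwf_widths(c(2, 5, 1), col_names = c("state", "serial", "sex")),
  col_types = "cic"   # character, integer, character
)
x
```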

Census data benchmarks

code: bench/fwf

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter    aggregate  total
read.delim       base                  -       6.17GB   18m 0.6s  16ms   1ms   2ms   3ms     497ms     90ms       18m 1.2s
readr            dplyr                 -       6.19GB   29.4s     47ms   2ms   1ms   17ms    97ms      96ms       29.7s
vroom            dplyr                 FALSE   5.96GB   15.1s     45ms   1ms   1ms   16ms    516ms     94ms       15.8s
vroom            dplyr                 TRUE    4.62GB   4.1s      48ms   2ms   1ms   16ms    779ms     3.2s       8.2s
vroom            base                  TRUE    4.65GB   166ms     56ms   1ms   1ms   7ms     776ms     4.6s       5.6s

Writing delimited files

code: bench/taxi_writing

The benchmarks write out the taxi trip dataset in a few different ways.
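
A small sketch of two of these variants with vroom_write(), using a made-up data frame in place of the taxi data. The multithreaded_gzip and zstandard rows presumably pipe the output through external tools, which is not shown here because it depends on what is installed on the machine.

```r
library(vroom)

# Hypothetical stand-in for the taxi trip data
x <- data.frame(payment_type = c("CSH", "CRD"), fare_amount = c(6.5, 12))

# Uncompressed, tab-delimited (the default delimiter of vroom_write())
plain <- tempfile(fileext = ".tsv")
vroom_write(x, plain)

# gzip compression is selected from the ".gz" file extension
gz <- tempfile(fileext = ".tsv.gz")
vroom_write(x, gz)

# Reading the compressed file back returns the same values
back <- vroom(gz, delim = "\t", col_types = "cd")
back
```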

compression         base      data.table  readr    vroom
gzip                3m 18.1s  1m 6.8s     2m 1.1s  1m 12.4s
multithreaded_gzip  1m 39.7s  8.7s        54s      8s
zstandard           1m 37.9s  NA          52.9s    12.5s
uncompressed        1m 36.1s  1.5s        51.5s    1.6s

Session and package information

package     version  date        source
base        4.1.0    2021-05-18  local
data.table  1.14.0   2021-02-21  RSPM (R 4.1.0)
dplyr       1.0.6    2021-05-05  RSPM (R 4.1.0)
readr       1.4.0    2020-10-05  RSPM (R 4.1.0)
vroom       1.5.0    2021-05-28  local