Memory usage

One of the reasons that dplyr is fast is that it is very careful about when it makes copies of columns. This vignette describes how this works, and gives you some useful tools for understanding the memory usage of data frames in R.

The first tool we'll use is dplyr::location(). It tells us three things about a data frame:

location(iris)
## <0x7fae6dc445b0>
## Variables:
##  * Sepal.Length: <0x7fae6724be00>
##  * Sepal.Width:  <0x7fae6ae64600>
##  * Petal.Length: <0x7fae6ae7c200>
##  * Petal.Width:  <0x7fae6e22aa00>
##  * Species:      <0x7fae62f759e0>
## Attributes:
##  * names:        <0x7fae6dc44618>
##  * row.names:    <0x7fae62f75c60>
##  * class:        <0x7fae6de53fc8>

It's useful to know the memory address, because if the address changes, then you know R has made a copy. Copies are bad because it takes time to copy a vector. This isn't usually a bottleneck if you have a few thousand values, but if you have millions or tens of millions it starts to take up a significant amount of time. Unnecessary copies are also bad because they take up memory.

R tries to avoid making copies where possible. For example, if you just assign iris to another variable, it continues to the point same location:

iris2 <- iris
location(iris2)
## <0x7fae6dc445b0>
## Variables:
##  * Sepal.Length: <0x7fae6724be00>
##  * Sepal.Width:  <0x7fae6ae64600>
##  * Petal.Length: <0x7fae6ae7c200>
##  * Petal.Width:  <0x7fae6e22aa00>
##  * Species:      <0x7fae62f759e0>
## Attributes:
##  * names:        <0x7fae6dc44618>
##  * row.names:    <0x7fae62f9a340>
##  * class:        <0x7fae6de53fc8>

Rather than carefully comparing long memory locations, we can instead use the dplyr::changes() function to highlights changes between two versions of a data frame. This shows us that iris and iris2 are identical: both names point to the same location in memory.

changes(iris2, iris)
## <identical>

What do you think happens if you modify a single column of iris2? In R 3.1.0 and above, R knows enough to only modify one column and leave the others pointing to the existing location:

iris2$Sepal.Length <- iris2$Sepal.Length * 2
changes(iris, iris2)
## Changed variables:
##              old            new           
## Sepal.Length 0x7fae6724be00 0x7fae6d750c00
## 
## Changed attributes:
##              old            new           
## row.names    0x7fae62fa1590 0x7fae62fa1810

(This was not the case prior to R 3.1.0: R created a deep copy of the entire data frame.)

dplyr is similarly smart

iris3 <- mutate(iris, Sepal.Length = Sepal.Length * 2)
changes(iris3, iris)
## Changed variables:
##              old            new           
## Sepal.Length 0x7fae6e015800 0x7fae6724be00
## 
## Changed attributes:
##              old            new           
## class        0x7fae63fdf128 0x7fae6de53fc8
## names        0x7fae6ad30140 0x7fae6dc44618
## row.names    0x7fae62fc70e0 0x7fae62fc7360

It's smart enough to create only one new column: all the other columns continue to point at their old locations. You might notice that the attributes have still been copied. This has little impact on performance because the attributes are usually short vectors and copying makes the internal dplyr code considerably simpler.

dplyr never makes copies unless it has to:

This means that dplyr lets you work with data frames with very little memory overhead.

data.table takes this idea one step further than dplyr, and provides functions that modify a data table in place. This avoids the need to copy the pointers to existing columns and attributes, and provides speed up when you have many columns. dplyr doesn't do this with data frames (although it could) because I think it's safer to keep data immutable: all dplyr data frame methods return a new data frame, even while they share as much data as possible.