Reference classes in R are very useful for some situations, but using them has a cost. In this document, I’ll explore the costs in memory and speed of standard R reference classes vs. other reference objects which are created in different ways.
library(microbenchmark)
options(microbenchmark.unit = "us")
library(pryr) # For object_size function
library(R6)
Here are a number of ways of creating reference objects in R, starting with the most complicated (standard R reference class) and ending with the simplest (an environment created by a closure).
A_rc <- setRefClass("A_rc",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 1) .self$x <<- x,
    getx = function() x,
    inc = function(n = 1) x <<- x + n
  )
)
R6 classes are similar to R’s standard reference objects, but they are simpler.
B_r6 <- R6Class("B_r6",
  public = list(
    x = NULL,
    initialize = function(x = 1) self$x <<- x,
    getx = function() x,
    inc = function(n = 1) x <<- x + n
  )
)
Objects of this type also have an automatically-created self member:
print(B_r6$new())
#> <B_r6>
#> Public:
#> getx: function
#> inc: function
#> initialize: function
#> self: environment
#> x: 1
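The B_r6 definition above relies on the behavior of R6 1.0 (the version listed in the session info), where methods can read and assign x directly. As a hedged sketch, assuming a newer R6 release in which classes are portable by default and fields must be reached through self, a hypothetical equivalent could look like this. B_r6_portable is an illustrative name, not one of the benchmarked classes.

```r
library(R6)

# Hypothetical variant of B_r6 for R6 releases with portable classes,
# where fields are accessed through `self` rather than by bare name.
B_r6_portable <- R6Class("B_r6_portable",
  public = list(
    x = NULL,
    initialize = function(x = 1) self$x <- x,
    getx = function() self$x,
    inc = function(n = 1) {
      self$x <- self$x + n
      invisible(self)
    }
  )
)

b <- B_r6_portable$new()
b$inc(2)
b$getx()  # 3
```

Going through self costs an extra lookup, so timings for this variant would not be directly comparable to the B_r6 numbers reported in this document.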
By default, a class attribute is added to the objects generated by R6 classes. This attribute adds a slight performance penalty because R will use S3 dispatch when using $ on the object.
It’s possible to generate objects without the class attribute by using class=FALSE:
C_r6_noclass <- R6Class("C_r6_noclass",
  public = list(
    x = NULL,
    initialize = function(x = 1) self$x <<- x,
    getx = function() x,
    inc = function(n = 1) x <<- x + n
  ),
  class = FALSE
)
This is a variant of the previous type of reference class, but this version has public and private members.
D_r6_priv <- R6Class("D_r6_priv",
  private = list(x = NULL),
  public = list(
    initialize = function(x = 1) private$x <<- x,
    getx = function() x,
    inc = function(n = 1) x <<- x + n
  )
)
Instead of a single self object which refers to all items in an object, these objects have self (which refers to the public items) and private.
print(D_r6_priv$new())
#> <D_r6_priv>
#> Public:
#> getx: function
#> inc: function
#> initialize: function
#> private: environment
#> self: environment
#> Private:
#> x: 1
The next type is simply an environment with a class attached to it, created by calling a function.
E_closure_class <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  getx <- function() x
  self <- environment()
  class(self) <- "D_closure"
  self
}
Even though x isn’t declared in the function body, it gets captured because it’s an argument to the function.
# Roundabout way to print the contents of an E object
str(as.list.environment(E_closure_class()))
#> List of 4
#> $ self:Class 'D_closure' <environment: 0x7fdf1f10cc68>
#> $ getx:function ()
#> ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 3 11 3 22 11 22 3 3
#> .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fdf1f03f3f0>
#> $ inc :function (n = 1)
#> ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 2 10 2 36 10 36 2 2
#> .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fdf1f03f3f0>
#> $ x : num 1
Objects created this way are very similar to those created by B_r6. The main difference is that those created by B_r6 contain an initialize function:
str(as.list.environment(B_r6$new()))
#> List of 5
#> $ self :Classes 'B_r6', 'R6' <environment: 0x7fdf1e602380>
#> $ inc :function (n = 1)
#> ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 6 11 6 37 11 37 6 6
#> .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fdf1f05e740>
#> $ getx :function ()
#> ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 5 12 5 23 12 23 5 5
#> .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fdf1f05e740>
#> $ initialize:function (x = 1)
#> ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 4 18 4 45 18 45 4 4
#> .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fdf1f05e740>
#> $ x : num 1
This is the simplest type of reference object:
F_closure_noclass <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  getx <- function() x
  environment()
}
There are two differences between E and F: objects of type F don’t have a class attribute, and they don’t have a self object.
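The two differences can be checked directly in base R. The following is a minimal sketch using stand-in constructors that mirror E_closure_class and F_closure_noclass; the names E_like and F_like are hypothetical and used only for illustration.

```r
# Illustrative constructors mirroring E_closure_class and F_closure_noclass;
# the names E_like and F_like are hypothetical.
E_like <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  getx <- function() x
  self <- environment()
  class(self) <- "E_like"
  self
}

F_like <- function(x = 1) {
  inc <- function(n = 1) x <<- x + n
  getx <- function() x
  environment()
}

e <- E_like()
f <- F_like()

class(e)                                     # "E_like"
class(f)                                     # "environment"
exists("self", envir = e, inherits = FALSE)  # TRUE
exists("self", envir = f, inherits = FALSE)  # FALSE

# Both behave as reference objects: a second name points at the same state.
f_alias <- f
f_alias$inc(4)
f$getx()                                     # 5
```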
For all the timings using microbenchmark(), the results are reported in microseconds, and the most useful value is probably the median column.
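The median is less sensitive than the max to occasional slow runs, which is one reason to prefer it. As a small sketch of how to pull just that column out of a benchmark, the example below times two cheap base-R expressions that serve only as stand-ins. It assumes the microbenchmark package is installed and that summary() accepts a unit argument.

```r
library(microbenchmark)

# Two cheap base-R expressions, used here only as stand-ins for illustration.
mb <- microbenchmark(
  env  = new.env(),
  list = list(x = 1),
  times = 100
)

# summary() returns a data frame; unit = "us" requests microseconds.
s <- summary(mb, unit = "us")
s[, c("expr", "median")]
```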
How much memory does a single instance of each object take, and how much memory does each additional object take?
# Utility functions for calculating sizes

# Size of one object created by `expr`, and the extra size of a second one.
# Passing several objects to object_size() counts shared components only once.
obj_size <- function(expr, .env = parent.frame()) {
  size_n <- function(n = 1) {
    objs <- lapply(seq_len(n), function(x) eval(expr, .env))
    as.numeric(do.call(object_size, objs))
  }
  data.frame(one = size_n(1), incremental = size_n(2) - size_n(1))
}

# Apply obj_size() to each unevaluated argument; rows are labelled with the
# argument name, or with the deparsed expression when no name is given.
obj_sizes <- function(..., .env = parent.frame()) {
  exprs <- as.list(match.call(expand.dots = FALSE)$...)
  names(exprs) <- lapply(seq_along(exprs),
    FUN = function(n) {
      name <- names(exprs)[n]
      if (is.null(name) || name == "") paste(deparse(exprs[[n]]), collapse = " ")
      else name
    })
  sizes <- mapply(obj_size, exprs, MoreArgs = list(.env = .env), SIMPLIFY = FALSE)
  do.call(rbind, sizes)
}
Sizes of each type of object, in bytes:
obj_sizes(
A_rc$new(),
B_r6$new(),
C_r6_noclass$new(),
D_r6_priv$new(),
E_closure_class(),
F_closure_noclass()
)
#> one incremental
#> A_rc$new() 472072 1368
#> B_r6$new() 12040 728
#> C_r6_noclass$new() 12368 672
#> D_r6_priv$new() 12608 840
#> E_closure_class() 10720 624
#> F_closure_noclass() 9288 512
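To see what these two columns imply for many objects, here is a rough worked calculation using the numbers from the table above. It assumes every object after the first costs the same incremental amount, which is an extrapolation rather than a measurement.

```r
# Sizes in bytes, copied from the table above.
sizes <- data.frame(
  type        = c("A_rc", "B_r6", "C_r6_noclass", "D_r6_priv",
                  "E_closure_class", "F_closure_noclass"),
  one         = c(472072, 12040, 12368, 12608, 10720, 9288),
  incremental = c(1368, 728, 672, 840, 624, 512)
)

# Estimated total for n objects, assuming each object after the first
# costs the same incremental amount (an extrapolation, not a measurement).
n <- 1000
sizes$est_total <- sizes$one + (n - 1) * sizes$incremental
sizes[, c("type", "est_total")]
```

Under this assumption, 1000 A_rc objects would take roughly 1.8 MB versus roughly 0.7 MB for B_r6. The large one-time cost of reference classes matters less as the number of objects grows, while the per-object cost stays a little under twice as high.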
It looks like using a reference class takes up a huge amount of memory, but much of that is shared between reference classes. Adding another object from a different reference class doesn’t require much more memory – around 38KB:
A_rc2 <- setRefClass("A_rc2",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 2) .self$x <<- x,
    inc = function(n = 2) x <<- x * n
  )
)
# Size of a new A_rc2 object, over and above an A_rc object
as.numeric(object_size(A_rc$new(), A_rc2$new()) - object_size(A_rc$new()))
#> [1] 37688
How much time does it take to create one of these objects? (The median time is probably the most informative.)
speed <- microbenchmark(
A_rc$new(),
B_r6$new(),
C_r6_noclass$new(),
D_r6_priv$new(),
E_closure_class(),
F_closure_noclass()
)
speed
#> Unit: microseconds
#> expr min lq median uq max neval
#> A_rc$new() 296.00 310.00 321.00 331.00 1,200.00 100
#> B_r6$new() 29.60 32.30 34.80 41.00 56.00 100
#> C_r6_noclass$new() 21.30 23.80 26.10 31.20 62.50 100
#> D_r6_priv$new() 35.20 38.30 41.90 50.60 877.00 100
#> E_closure_class() 1.86 2.94 3.40 3.73 62.90 100
#> F_closure_noclass() 0.79 1.34 1.55 1.85 4.55 100
R reference classes are much slower to instantiate than the other types of classes, with a median of 0.3207 milliseconds per instantiation, roughly nine times the median for B_r6.
How much time does it take to access a field in an object? First we’ll make some objects:
A <- A_rc$new()
B <- B_r6$new()
C <- C_r6_noclass$new()
D <- D_r6_priv$new()
E <- E_closure_class()
F <- F_closure_noclass()
Getting a value:
microbenchmark(
A_rc = A$x,
B_r6 = B$x,
C_r6_noclass = C$x,
D_r6_priv = D$private$x,
E_closure_class = E$x,
F_closure_noclass = F$x
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> A_rc 9.260 9.890 10.200 10.500 48.60 100
#> B_r6 1.900 2.140 2.370 2.560 35.80 100
#> C_r6_noclass 0.172 0.281 0.339 0.411 9.24 100
#> D_r6_priv 2.040 2.370 2.610 2.860 24.90 100
#> E_closure_class 1.500 1.690 1.960 2.220 7.69 100
#> F_closure_noclass 0.188 0.272 0.339 0.394 1.30 100
Setting a value:
microbenchmark(
A_rc = A$x <- 4,
B_r6 = B$x <- 4,
C_r6_noclass = C$x <- 4,
D_r6_priv = D$private$x <- 4,
E_closure_class = E$x <- 4,
F_closure_noclass = F$x <- 4
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> A_rc 50.600 58.700 60.70 67.30 215.00 100
#> B_r6 2.810 3.510 3.90 4.45 7.66 100
#> C_r6_noclass 0.762 1.060 1.22 1.40 15.30 100
#> D_r6_priv 5.470 6.780 7.44 8.52 41.80 100
#> E_closure_class 2.580 3.000 3.36 3.78 9.70 100
#> F_closure_noclass 0.715 0.933 1.05 1.28 7.08 100
The differences within the pairs B, C, and E, F are due to overhead from the class attribute. Because B and E have a class attribute, R will check whether there is a $ method for their class. All of the objects A, B, D, and E have a class, while C and F do not.
The standard reference class is slowest.
How much overhead is there when calling a method from one of these objects?
microbenchmark(
A_rc = A$getx(),
B_r6 = B$getx(),
C_r6_noclass = C$getx(),
D_r6_priv = D$getx(),
E_closure_class = E$getx(),
F_closure_noclass = F$getx()
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> A_rc 9.950 10.700 11.000 11.400 184.00 100
#> B_r6 2.210 2.500 2.680 3.000 36.00 100
#> C_r6_noclass 0.341 0.480 0.553 0.640 6.16 100
#> D_r6_priv 2.350 2.640 2.880 3.120 8.76 100
#> E_closure_class 1.790 2.200 2.430 2.600 8.43 100
#> F_closure_noclass 0.356 0.484 0.554 0.608 1.64 100
As expected, method call speed is very close to the field access speed – in this case there’s just the small additional overhead of calling a function.
Standard reference classes are the slowest by a large margin.
The difference between the pairs B, C, and E, F is probably due to S3 method lookup for the $ function – there could be a $.myclass method which would be called for myclass objects.
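The lookup described here can be observed in base R. The sketch below defines a $.myclass method that counts how often it is reached; the object, the counter, and the method body are illustrative assumptions, not part of the benchmarks.

```r
# An environment carrying the hypothetical class "myclass".
obj <- new.env()
obj$x <- 10
class(obj) <- "myclass"

# Count how often the S3 method is reached.
dispatch_count <- 0
`$.myclass` <- function(e, name) {
  dispatch_count <<- dispatch_count + 1
  get(name, envir = e, inherits = FALSE)
}

obj$x            # 10, via `$.myclass`
dispatch_count   # 1

# Without the class attribute, `$` uses the built-in environment lookup.
class(obj) <- NULL
obj$x            # 10
dispatch_count   # still 1
```

Even when no such method exists, R still has to look for one whenever the object has a class attribute, which is the overhead seen in the timings.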
With standard reference class objects, you can modify fields using the <<- operator, or by using the .self object. For example, compare the inc() methods of these two classes:
rc_self <- setRefClass("rc_self",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 1) .self$x <- x,
    inc = function(n = 1) .self$x <- x + n
  )
)
rc_no_self <- setRefClass("rc_no_self",
  fields = list(x = "numeric"),
  methods = list(
    initialize = function(x = 1) .self$x <- x,
    inc = function(n = 1) x <<- x + n
  )
)
R6 classes are similar, except they use self instead of .self:
r6_self <- R6Class("r6_self",
  public = list(
    x = 1,
    inc = function(n = 1) self$x <- x + n
  )
)
r6_no_self <- R6Class("r6_no_self",
  public = list(
    x = 1,
    inc = function(n = 1) x <<- x + n
  )
)
rc_self_obj <- rc_self$new()
rc_no_self_obj <- rc_no_self$new()
r6_self_obj <- r6_self$new()
r6_no_self_obj <- r6_no_self$new()
microbenchmark(
rc_self = rc_self_obj$inc(),
rc_no_self = rc_no_self_obj$inc(),
r6_self = r6_self_obj$inc(),
r6_no_self = r6_no_self_obj$inc()
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> rc_self 57.50 60.10 61.70 67.20 148.0 100
#> rc_no_self 36.00 37.80 39.00 41.20 245.0 100
#> r6_self 5.42 6.73 7.69 8.28 18.1 100
#> r6_no_self 2.79 3.26 3.76 4.19 11.1 100
Using .self or self adds some overhead, which makes sense when you consider how R searches for objects.
When the method accesses x without using .self, R first searches in the execution environment but doesn’t find x there, so it then searches in the parent environment, finds x there, and assigns the value.
When using .self, R searches for .self in the function’s execution environment but doesn’t find it there, so it looks in the parent environment (which also happens to be the object environment, as well as the environment that .self points to) and does find it there. Then it looks in the .self environment for x, and assigns the value.
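The two lookup paths can be reproduced with a plain closure. This is a minimal base-R sketch, not the reference class implementation: make_obj, inc_direct, and inc_self are hypothetical names, and self here stands in for .self or self.

```r
# Hypothetical constructor; `self` plays the role of .self / self.
make_obj <- function(x = 1) {
  self <- environment()
  # x is not in the method's own frame, so R finds it one level up,
  # in the enclosing (object) environment.
  inc_direct <- function(n = 1) x <<- x + n
  # self is found one level up first; x is then looked up inside self.
  inc_self <- function(n = 1) self$x <- self$x + n
  self
}

obj <- make_obj()

# The methods' enclosing environment is the object itself,
# and self points at that same environment.
identical(environment(obj$inc_direct), obj)  # TRUE
identical(obj$self, obj)                     # TRUE

obj$inc_direct()
obj$inc_self()
obj$x                                        # 3
```

Both methods end up modifying the same x; the self version simply takes one more step to get there.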
Additionally, there is some overhead when the environment has a class attribute.
r6_self_obj <- r6_self$new()
r6_no_self_obj <- r6_no_self$new()
r6_self_noclass_obj <- r6_self$new()
class(r6_self_noclass_obj) <- NULL
r6_no_self_noclass_obj <- r6_no_self$new()
class(r6_no_self_noclass_obj) <- NULL
microbenchmark(
r6_self = r6_self_obj$inc(),
r6_no_self = r6_no_self_obj$inc(),
r6_self_noclass = r6_self_noclass_obj$inc(),
r6_no_self_noclass = r6_no_self_noclass_obj$inc()
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> r6_self 5.340 5.620 5.79 6.04 41.00 100
#> r6_no_self 2.560 2.740 2.82 2.96 37.80 100
#> r6_self_noclass 1.070 1.250 1.38 1.51 6.04 100
#> r6_no_self_noclass 0.562 0.682 0.78 0.87 2.98 100
The following benchmark compares member access time for lists vs. environments, with and without a class attribute. If the object has a class attribute, R will use method lookup for $, which adds overhead.
list_noclass <- list(x = 10)
list_class <- structure(list(x = 10), class = "foo")
env_noclass <- new.env()
env_noclass$x <- 10
env_class <- structure(new.env(), class = "foo")
env_class$x <- 10
microbenchmark(
list_noclass = list_noclass$x,
list_class = list_class$x,
env_noclass = env_noclass$x,
env_class = env_class$x
)
#> Unit: microseconds
#> expr min lq median uq max neval
#> list_noclass 0.177 0.208 0.245 0.284 33.60 100
#> list_class 1.390 1.460 1.520 1.640 22.80 100
#> env_noclass 0.176 0.226 0.286 0.308 2.67 100
#> env_class 1.450 1.530 1.610 1.740 4.40 100
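Whether R attempts method lookup for $ depends on whether the object has a class attribute, and base R's is.object() reports exactly that. As a quick check, the sketch below rebuilds the same four objects and queries the flag for each.

```r
list_noclass <- list(x = 10)
list_class   <- structure(list(x = 10), class = "foo")
env_noclass  <- new.env()
env_class    <- structure(new.env(), class = "foo")

# is.object() is TRUE only when a class attribute is set.
sapply(
  list(list_noclass = list_noclass, list_class = list_class,
       env_noclass = env_noclass, env_class = env_class),
  is.object
)
```

The two objects for which this returns TRUE are the two slow rows in the timings above, for lists and environments alike.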
R6 class objects take less memory and are faster than standard reference class objects. Reference classes do provide additional features, such as type checking of fields, but these aren’t, in my opinion, enough to offset the performance penalty and especially the issues with S4 (which reference classes are built on). Another advantage to R6 objects is that they are simpler and easier to understand than R’s reference class objects.
This document was generated with:
sessionInfo()
#> R version 3.1.1 (2014-07-10)
#> Platform: x86_64-apple-darwin13.1.0 (64-bit)
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] R6_1.0 pryr_0.1 microbenchmark_1.3-0
#>
#> loaded via a namespace (and not attached):
#> [1] codetools_0.2-8 digest_0.6.4 evaluate_0.5.5 formatR_0.10
#> [5] htmltools_0.2.4 knitr_1.6 Rcpp_0.11.2 rmarkdown_0.2.49
#> [9] stringr_0.6.2 tools_3.1.1 yaml_2.1.13