tabplot
We test the speed of the tabplot
package with a dataset of over 10 million records.
For this purpose we replicate the diamonds dataset from the ggplot2
package 200 times.
This dataset contains 53,940 records and 10 variables.
require(ggplot2)
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
data(diamonds)
## add some NA's
is.na(diamonds$price) <- diamonds$cut == "Ideal"
is.na(diamonds$cut) <- (runif(nrow(diamonds)) > 0.8)
n <- nrow(diamonds)
N <- 200L * n
## convert to ff format (not enough memory otherwise)
require(ffbase)
diamondsff <- as.ffdf(diamonds)
nrow(diamondsff) <- N
# fill with identical data
for (i in chunk(diamondsff, by=n)){
diamondsff[i,] <- as.data.frame(diamonds)
}
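The loop above writes the same 53,940 rows into successive slices of the ffdf. The same pattern can be sketched in plain base R (an in-memory analogue for illustration only; the real loop relies on ff's chunk() precisely so the data never have to fit in RAM):

```r
## In-memory analogue of the chunked fill (illustrative only).
small <- data.frame(x = 1:5, y = letters[1:5])
n <- nrow(small)
times <- 4L

## Fill a larger data frame chunk by chunk, mirroring the ff loop.
big <- small[0, ]
for (start in seq(1L, times * n, by = n)) {
  big[start:(start + n - 1L), ] <- small
}
nrow(big)  # 20
```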
The preparation step is the most time-consuming: for each column, the rank order is determined.
system.time(
p <- tablePrepare(diamondsff)
)
## user system elapsed
## 16.95 3.03 20.31
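Conceptually, the work done here amounts to computing, for every column, the permutation that sorts it. A base-R sketch of that idea (illustrative only; tablePrepare operates on ff vectors and caches the orderings, so this is not its actual implementation):

```r
## Per-column rank order: the row permutation that sorts each column.
df <- data.frame(a = c(3.2, 1.1, 2.7), b = c(10L, 30L, 20L))
orders <- lapply(df, order)
orders$a  # 2 3 1: row 2 holds the smallest value of 'a'
```

With these orderings precomputed, subsequent tableplot calls can aggregate already-sorted data, which is why the preparation cost is paid only once.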
To focus on the processing time of the tableplot function, the plot
argument is set to FALSE.
system.time(
tab <- tableplot(p, plot=FALSE)
)
## user system elapsed
## 2.658 0.403 3.072
The following tableplots are based on samples of respectively 100, 1,000, and 10,000 objects per bin.
system.time(
tab <- tableplot(p, sample=TRUE, sampleBinSize=1e2, plot=FALSE)
)
## user system elapsed
## 0.030 0.017 0.047
system.time(
tab <- tableplot(p, sample=TRUE, sampleBinSize=1e3, plot=FALSE)
)
## user system elapsed
## 0.159 0.051 0.211
system.time(
tab <- tableplot(p, sample=TRUE, sampleBinSize=1e4, plot=FALSE)
)
## user system elapsed
## 1.274 0.043 1.317
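The timings scale roughly linearly with sampleBinSize, since only the sampled rows in each bin are aggregated. The idea can be sketched in base R (a simplified illustration; the names nBins and binMeans are made up here, and tableplot's internal sampling is more involved):

```r
set.seed(1)
x <- rnorm(1e6)          # stand-in for one column of the big dataset
nBins <- 100L            # number of row bins
binSize <- length(x) %/% nBins
sampleBinSize <- 100L    # rows drawn per bin, as in the first timing

## Estimate each bin's mean from a small sample instead of all its rows.
binMeans <- vapply(seq_len(nBins), function(b) {
  rows <- (b - 1L) * binSize + seq_len(binSize)  # rows belonging to bin b
  mean(x[sample(rows, sampleBinSize)])
}, numeric(1))
length(binMeans)  # 100
```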