In this example, we are going to investigate loading a large data frame. First, we'll generate a large matrix of random numbers and save it as a csv file:
```r
## For fast computers
# N = 1e5
## Slower computers
N = 1e4
m = as.data.frame(matrix(runif(N), ncol = 1000))
write.csv(m, file = "example.csv", row.names = FALSE)
```
We can read the file back in again using `read.csv()`:

```r
dd = read.csv("example.csv")
```
To get a baseline result, time the `read.csv()` function call above, e.g.

```r
system.time(read.csv("example.csv"))
```
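If you want to keep the baseline around for later comparison, one option (a sketch; the variable name `baseline` is our own) is to extract just the elapsed time:

```r
# Keep only the elapsed time for later comparisons
# (`baseline` is our own choice of name)
baseline = system.time(read.csv("example.csv"))["elapsed"]
baseline
```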
We will now look at ways of speeding up this step.
One option is the `colClasses` argument in `read.csv()`: for example, if we have 1000 columns that all have data type numeric, then:

```r
read.csv(file = "example.csv", colClasses = rep("numeric", 1000))
```
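If the column classes are not known up front, a common pattern (sketched here; the variable names are ours) is to read a handful of rows, let R infer the classes, and reuse them for the full read:

```r
# Read a small sample to infer column classes, then reuse them
# for the full read (`sample_rows` and `classes` are our own names)
sample_rows = read.csv("example.csv", nrows = 5)
classes = sapply(sample_rows, class)
system.time(read.csv("example.csv", colClasses = classes))
```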
Another option is to store the data in R's binary serialization format using the `saveRDS()` and `readRDS()` functions:

```r
saveRDS(m, file = "example.RData")
readRDS(file = "example.RData")
```
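To quantify the difference, you could time the binary read against the CSV baseline (a minimal sketch):

```r
# Compare binary deserialization with the CSV baseline
system.time(readRDS(file = "example.RData"))
system.time(read.csv("example.csv"))
```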
Compare the speed of `read_csv()` from the readr package to `read.csv()`.
How does `fread()` from the data.table package compare to the other solutions?
Which of the above give the biggest speed-ups? Are there any downsides to using these techniques? Do your results depend on the number of columns or the number of rows?
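A minimal sketch of one way to run the comparison, assuming the readr and data.table packages are installed:

```r
# Benchmark the three CSV readers on the same file
# (assumes readr and data.table are installed)
microbenchmark::microbenchmark(times = 10,
  base  = read.csv("example.csv"),
  readr = readr::read_csv("example.csv"),
  dt    = data.table::fread("example.csv"))
```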
Create a matrix `d_m`:

```r
## For fast computers
# d_m = matrix(1:1000000, ncol = 1000)
## Slower computers
d_m = matrix(1:10000, ncol = 100)
dim(d_m)
```

and a data frame `d_df`:

```r
d_df = as.data.frame(d_m)
colnames(d_df) = paste0("c", seq_along(d_df))
```
Using the following code, calculate the relative differences between selecting the first column/row of a data frame and a matrix.

```r
microbenchmark::microbenchmark(times = 1000, unit = "ms", # milliseconds
                               d_m[1, ], d_df[1, ], d_m[, 1], d_df[, 1])
```
Can you explain the result? Try varying the number of replications.
When selecting columns in a data frame, there are a few different methods. For example:

```r
d_df$c10
d_df[, 10]
d_df[, "c10"]
d_df[, colnames(d_df) == "c10"]
```

Compare these four methods.
Using the `object.size()` function, compare the object size of a matrix and a data frame.
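For example (a minimal sketch, using the objects created above):

```r
# Compare the memory footprint of the matrix and the data frame
print(object.size(d_m), units = "Kb")
print(object.size(d_df), units = "Kb")
```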