In this example, we are going to investigate loading a large data frame. First, we'll generate a large matrix of random numbers and save it as a CSV file:
```r
## For fast computers
# N = 1e5
## Slower computers
N = 1e4
m = as.data.frame(matrix(runif(N), ncol = 1000))
write.csv(m, file = "example.csv", row.names = FALSE)
```
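As an optional sanity check, you can confirm the file was written and see how big it is on disk; this sketch uses only base R:

```r
# Optional check: does the file exist, and how large is it (in bytes)?
file.exists("example.csv")
file.size("example.csv")
```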
We can read the file back in again using read.csv():
```r
dd = read.csv("example.csv")
```
To get a baseline result, time the read.csv() function call above, e.g.
```r
system.time(read.csv("example.csv"))
```
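If you want to keep the baseline around to compare against later, one simple approach (a sketch, base R only) is to store the elapsed component of the timing:

```r
# Store the baseline elapsed time (in seconds) for later comparison
baseline = system.time(read.csv("example.csv"))["elapsed"]
baseline
```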
We will now look at ways of speeding up this step.
One option is to specify the column classes via the colClasses argument in read.csv(). For example, if we have 1000 columns that all have data type numeric, then:

```r
read.csv(file = "example.csv", colClasses = rep("numeric", 1000))
```

Another option is to save and reload the object using the saveRDS() and readRDS() functions:
```r
saveRDS(m, file = "example.RData")
readRDS(file = "example.RData")
```

Compare the speed of read_csv() from the readr package to read.csv().
How does fread() from the data.table package compare to the other solutions?
Which of the above give the biggest speed-ups? Are there any downsides to using these techniques? Do your results depend on the number of columns or the number of rows?
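One way to put the timings side by side is a single microbenchmark run. This is only a sketch: it assumes the microbenchmark, readr and data.table packages are installed, that example.RData was created with saveRDS() as above, and the labels (base, colClasses, rds, readr, fread) are just illustrative names:

```r
library(microbenchmark)
microbenchmark(times = 10,
  base       = read.csv("example.csv"),
  colClasses = read.csv("example.csv", colClasses = rep("numeric", 1000)),
  rds        = readRDS("example.RData"),
  readr      = readr::read_csv("example.csv"),
  fread      = data.table::fread("example.csv")
)
```

A small times value keeps the benchmark quick; increase it for more stable estimates.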
Next, create a matrix d_m:

```r
## For fast computers
# d_m = matrix(1:1000000, ncol = 1000)
## Slower computers
d_m = matrix(1:10000, ncol = 100)
dim(d_m)
```
and a data frame d_df:
```r
d_df = as.data.frame(d_m)
colnames(d_df) = paste0("c", seq_along(d_df))
```
Using the following code, calculate the relative differences between selecting the first row/column of a matrix and of a data frame.
```r
microbenchmark::microbenchmark(times = 1000, unit = "ms", # milliseconds
                               d_m[1, ], d_df[1, ], d_m[, 1], d_df[, 1])
```
Can you explain the result? Try varying the number of replications.
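For instance, to vary the number of replications you can re-run the benchmark with a different times value (a sketch):

```r
# Fewer replications: quicker to run, but noisier estimates
microbenchmark::microbenchmark(times = 10, unit = "ms",
                               d_m[1, ], d_df[1, ], d_m[, 1], d_df[, 1])
```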
When selecting columns in a data frame, there are a few different methods. For example,
```r
d_df$c10
d_df[, 10]
d_df[, "c10"]
d_df[, colnames(d_df) == "c10"]
```
Compare these four methods.
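One way to compare them is again with microbenchmark (a sketch; microseconds are chosen as the unit because these operations are fast):

```r
microbenchmark::microbenchmark(times = 1000, unit = "us", # microseconds
                               d_df$c10,
                               d_df[, 10],
                               d_df[, "c10"],
                               d_df[, colnames(d_df) == "c10"])
```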
Using the object.size() function, compare the object size of a matrix and a data frame.
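A minimal sketch using base R only:

```r
# Memory used by each object, in bytes
object.size(d_m)
object.size(d_df)
# format() gives a more readable unit
format(object.size(d_df), units = "Kb")
```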