In this example, we are going to investigate loading a large data frame. First, we'll generate a large matrix of random numbers and save it as a CSV file:
```r
## For fast computers
# N = 1e5
## Slower computers
N = 1e4
m = as.data.frame(matrix(runif(N), ncol = 1000))
write.csv(m, file = "example.csv", row.names = FALSE)
```
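As an optional sanity check, you can confirm the file was written and see how big it is on disk; this sketch uses only base R:

```r
# Optional check: does the file exist, and how large is it (in bytes)?
file.exists("example.csv")
file.size("example.csv")
```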
We can read the file back in again using read.csv():
```r
dd = read.csv("example.csv")
```
To get a baseline result, time the read.csv() function call above, e.g.
```r
system.time(read.csv("example.csv"))
```
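If you want to keep the baseline around to compare against later, one simple approach (a sketch, base R only) is to store the elapsed component of the timing:

```r
# Store the baseline elapsed time (in seconds) for later comparison
baseline = system.time(read.csv("example.csv"))["elapsed"]
baseline
```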
We will now look at ways of speeding up this step.
One option is to specify the column classes via the colClasses argument in read.csv(). For example, if we have 1000 columns that all have data type numeric, then:

```r
read.csv(file = "example.csv", colClasses = rep("numeric", 1000))
```

Another option is to save and reload the object using the saveRDS() and readRDS() functions:
```r
saveRDS(m, file = "example.RData")
readRDS(file = "example.RData")
```

Compare the speed of read_csv() from the readr package to read.csv().
How does fread() from the data.table package compare to the other solutions?
Which of the above give the biggest speed-ups? Are there any downsides to using these techniques? Do your results depend on the number of columns or the number of rows?
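One way to put the timings side by side is a single microbenchmark run. This is only a sketch: it assumes the microbenchmark, readr and data.table packages are installed, that example.RData was created with saveRDS() as above, and the labels (base, colClasses, rds, readr, fread) are just illustrative names:

```r
library(microbenchmark)
microbenchmark(times = 10,
  base       = read.csv("example.csv"),
  colClasses = read.csv("example.csv", colClasses = rep("numeric", 1000)),
  rds        = readRDS("example.RData"),
  readr      = readr::read_csv("example.csv"),
  fread      = data.table::fread("example.csv")
)
```

A small times value keeps the benchmark quick; increase it for more stable estimates.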
Next, create a matrix d_m:

```r
## For fast computers
# d_m = matrix(1:1000000, ncol = 1000)
## Slower computers
d_m = matrix(1:10000, ncol = 100)
dim(d_m)
```
and a data frame d_df:
```r
d_df = as.data.frame(d_m)
colnames(d_df) = paste0("c", seq_along(d_df))
```
Using the following code, calculate the relative differences between selecting the first row/column of a matrix and of a data frame.
```r
microbenchmark::microbenchmark(times = 1000, unit = "ms", # milliseconds
                               d_m[1, ], d_df[1, ], d_m[, 1], d_df[, 1])
```
Can you explain the result? Try varying the number of replications.
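For instance, to vary the number of replications you can re-run the benchmark with a different times value (a sketch):

```r
# Fewer replications: quicker to run, but noisier estimates
microbenchmark::microbenchmark(times = 10, unit = "ms",
                               d_m[1, ], d_df[1, ], d_m[, 1], d_df[, 1])
```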
When selecting columns in a data frame, there are a few different methods. For example,
```r
d_df$c10
d_df[, 10]
d_df[, "c10"]
d_df[, colnames(d_df) == "c10"]
```
Compare these four methods.
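One way to compare them is again with microbenchmark (a sketch; microseconds are chosen as the unit because these operations are fast):

```r
microbenchmark::microbenchmark(times = 1000, unit = "us", # microseconds
                               d_df$c10,
                               d_df[, 10],
                               d_df[, "c10"],
                               d_df[, colnames(d_df) == "c10"])
```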
Using the object.size() function, compare the object size of a matrix and a data frame.
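A minimal sketch using base R only:

```r
# Memory used by each object, in bytes
object.size(d_m)
object.size(d_df)
# format() gives a more readable unit
format(object.size(d_df), units = "Kb")
```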