Exercise 1

In this example, we are going to investigate loading a large data frame. First, we'll generate a large matrix of random numbers and save it as a csv file:

## For fast computers
# N = 1e5
## Slower computers
N = 1e4
m = as.data.frame(matrix(runif(N), ncol = 1000))
write.csv(m, file = "example.csv", row.names = FALSE)

We can read the file back in again using read.csv():

dd = read.csv("example.csv")

To get a baseline result, time the read.csv() function call above, e.g.

system.time(read.csv("example.csv"))

We will now look at ways of speeding up this step.

  1. Explicitly define the classes of each column using colClasses in read.csv(). For example, if we have 1000 columns that all contain numeric data, then read.csv(file = "example.csv", colClasses = rep("numeric", 1000))
  2. Use the saveRDS() and readRDS() functions, e.g. saveRDS(m, file = "example.RData") followed by readRDS(file = "example.RData")
  3. Compare the speed of read_csv() from the readr package to read.csv()

  4. How does fread() from the data.table package compare to the other solutions? A sketch timing read_csv() and fread() follows this list.
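
As a starting point for questions 3 and 4, the sketch below times each reader on the same file. This assumes the readr and data.table packages are installed, and that example.csv is the file created above; the exact timings will vary between machines.

library("readr")
library("data.table")
# Time each reader on the same file
system.time(read.csv("example.csv"))   # base R
system.time(read_csv("example.csv"))   # readr
system.time(fread("example.csv"))      # data.table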

Which of the above give the biggest speed-ups? Are there any downsides to using these techniques? Do your results depend on the number of columns or the number of rows?
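
To check whether the shape of the data matters, one option is to regenerate the file with different dimensions and repeat the timings. The sketch below uses an illustrative choice of dimensions and the made-up file name example_long.csv; it is not part of the original exercise.

# Same number of values as before, but many rows and few columns
m_long = as.data.frame(matrix(runif(N), ncol = 10))
write.csv(m_long, file = "example_long.csv", row.names = FALSE)
system.time(read.csv("example_long.csv"))
system.time(readr::read_csv("example_long.csv"))
system.time(data.table::fread("example_long.csv"))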

Exercise 2

  1. In this question, we'll compare matrices and data frames (an illustrative timing sketch follows the definitions below). Suppose we have a matrix, d_m:
## For fast computers
# d_m = matrix(1:1000000, ncol = 1000)
## Slower computers
d_m = matrix(1:10000, ncol = 100)
dim(d_m)

and a data frame d_df:

d_df = as.data.frame(d_m)
colnames(d_df) = paste0("c", seq_along(d_df))
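
The comparison itself can be made by timing the same operation on d_m and d_df. The sketch below is only an illustration of the kind of operations you might time, not the prescribed questions for this exercise; it assumes the microbenchmark package is installed.

library("microbenchmark")
# Compare equivalent operations on the matrix and the data frame
microbenchmark(
  matrix_column       = d_m[, 1],
  data_frame_column   = d_df[, 1],
  matrix_row_sums     = rowSums(d_m),
  data_frame_row_sums = rowSums(d_df)
)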

