```r
knitr::opts_chunk$set(
  comment = "#>",
  collapse = TRUE,
  cache = TRUE,
  fig.align = "center",
  fig.pos = "t"
)
```

Exercises

Consider the following benchmark to evaluate different functions for calculating the cumulative sum of the whole numbers from 1 to 100:

```r
x = 1:100 # initiate vector to cumulatively sum

# Method 1: with a for loop (10 lines)
cs_for = function(x){
  for(i in x){
    if(i == 1){
      xc = x[i]
    } else {
      xc = c(xc, sum(x[1:i]))
    }
  }
  xc
}

# Method 2: with apply (3 lines)
cs_apply = function(x){
  sapply(x, function(x) sum(1:x))
}

# Method 3: with the built-in cumsum() function (1 line)
cumsum(x)
```
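The benchmark call itself is not shown above; a minimal version using the microbenchmark package (consistent with the call used in question 2 below) would be:

```r
library("microbenchmark")
microbenchmark(cs_for(x), cs_apply(x), cumsum(x))
```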
  1. Which method is fastest and how many times faster is it?

    • The cumsum() function is the fastest; on the system this document was compiled on, it was around 300 times faster than the slowest method (cs_for()).
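    One way to check this on your own machine (a sketch; the exact ratio is machine-dependent):

    ```r
    library("microbenchmark")
    x = 1:100
    res = summary(microbenchmark(cs_for(x), cs_apply(x), cumsum(x)))
    # ratio of the slowest median (cs_for) to the fastest (cumsum)
    res$median[1] / res$median[3]
    ```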
  2. Run the same benchmark, but with the results reported in seconds, on a vector of all the whole numbers from 1 to 50,000. Hint: also use the argument times = 1 so that each command is only run once, keeping the run time manageable (even with a single evaluation the benchmark may take a minute or more to complete, depending on your system). Does the relative time difference increase or decrease? By how much?

    ```r library("microbenchmark") x = 1:5e4 # initiate vector to cumulatively sum microbenchmark(cs_for(x), cs_apply(x), cumsum(x), times = 1, unit = "s")

    > Unit: seconds

    > expr min lq mean median

    > cs_for(x) 16.753493973 16.753493973 16.753493973 16.753493973

    > cs_apply(x) 4.680862044 4.680862044 4.680862044 4.680862044

    > cumsum(x) 0.000144897 0.000144897 0.000144897 0.000144897

    ```

    The relative difference increases, by roughly three orders of magnitude. When x = 1:100, cumsum(x) is around 250 times faster than cs_for(x); in the results above, the ratio is around $17 / 0.0001$, i.e. over 100,000 times. This is because cs_for() re-sums the first $i$ elements at every iteration, so its cost grows quadratically with the length of x, whereas cumsum() makes a single pass over the vector.
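    The ratio can be computed directly from the median times reported above:

    ```r
    # slowest median divided by fastest median, from the results above
    16.753493973 / 0.000144897
    #> [1] 115623.5
    ```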

  3. Test how long the different methods for subsetting the data frame df, presented in Chapter 1, take on your computer. Is it faster or slower at subsetting than the computer on which this book was compiled?

    • See Chapter 1 for details: if the median time was greater than 19.38, 19.64 and 14.48 microseconds for the respective methods, your computer is slower, as is the case in the benchmark results shown below:

    ```r library("microbenchmark") df = data.frame(v = 1:4, name = letters[1:4]) microbenchmark(df[3, 2], df[3, "name"], df$name[3])

    > Unit: microseconds

    > expr min lq mean median uq max neval cld

    > df[3, 2] 21.791 22.5320 26.30720 23.256 24.3980 181.617 100 b

    > df[3, "name"] 21.453 22.8750 25.30676 23.275 24.5845 65.978 100 b

    > df$name[3] 15.474 16.5945 19.25959 17.338 18.1585 58.752 100 a

    ```
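    To compare your machine against the reference figures programmatically, a sketch (the reference medians are those quoted in the bullet above, and summary() is assumed to report microseconds, as in the output above):

    ```r
    res = summary(microbenchmark(df[3, 2], df[3, "name"], df$name[3]))
    # TRUE means your computer is slower than the book's for that method
    res$median > c(19.38, 19.64, 14.48)
    ```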

  4. Use system.time() and a for() loop to test how long it takes to perform the subsetting operation 50,000 times. Before testing this, do you think it will take more or less than 1 second for each subsetting method? Hint: the tests are shown below:

    ```r
    df = data.frame(v = 1:4, name = letters[1:4])

    # Test how long it takes to subset the data frame 50,000 times:
    system.time(for(i in 1:50000) df[3, 2])
    system.time(for(i in 1:50000) df[3, "name"])
    system.time(for(i in 1:50000) df$name[3])
    ```

    Note that, unlike in the cumsum(x) example, the relative times of the different methods do not change here: each operation is simply replicated many times. A back-of-envelope estimate from the medians above suggests times on either side of one second: roughly 23 µs × 50,000 ≈ 1.2 s for the first two methods and 17 µs × 50,000 ≈ 0.9 s for the third.

  5. Bonus exercise: try profiling a section of code you have written using profvis. Where are the bottlenecks? Were they where you expected?

    • See Chapter 1 for details.
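    As a minimal starting point, a profvis sketch profiling the cs_for() function defined above (substitute a section of your own code):

    ```r
    library("profvis")
    profvis({
      x = 1:5e4
      cs_for(x) # the repeated summing shows up as the main bottleneck
    })
    ```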

