1. Overview of Bonus Material {-}

a. Goals {-}

This Bonus Material provides some introductory worked examples for:

  • Navigating the file system from R
  • Benchmarking file import and export options
  • Profiling your code
  • Creating and executing a Bash R script
  • Parallelizing code

b. Data set {-}

We will use the wolf SNP data (Schweizer et al., 2016) from the Week 11 worked example. The genetic data are individual-based, and are input as allele counts (i.e. 0/1/2) for each locus. We are using a randomly sampled subset of 10,000 single nucleotide polymorphism (SNP) markers from the full data set (which contains 42,587 SNPs).

c. Required R packages {-}

library(LandGenCourse)
#library(microbenchmark)
#library(profvis)
#library(here)
#library(readr)
#library(data.table)
library(feather)
#library(rio)
#library(devtools)
#library(parallel)
#library(doParallel)
#library(knitr)
#library(compiler)

2. Navigating the file system {-}

This Bonus Material assumes that you are running R in RStudio within an R project. Therefore, the default working directory when you enter commands in the console should be your R project folder. As discussed in the video (Part 2), the default location when executing code from an R Notebook is the folder where the notebook is stored. If you downloaded this Bonus Material from 'LandGenCourse', it should be stored in a folder 'downloads' in your project folder.

More about R projects: https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects.

a. Where am I? {-}

Let's see where we are.

getwd()
here::here()
Sys.getenv("HOME")
R.home()

Question: Copy-paste the commands in the chunk above directly into the Console and run them there. Do you get the same paths?

b. Accessing a system file {-}

The example data set is available in the 'extdata' folder of the 'LandGenCourse' package. This is how we import the data in the Week 11 Worked Example:

gen <- read.csv(system.file("extdata", "wolf_geno_samp_10000.csv", 
                            package = "LandGenCourse"), row.names=1)
dim(gen)

Let's unpack this code. What does the function system.file do? Let's compare it to R.home.

R.home()
system.file()

So, system.file goes to the R home and, without additional arguments, locates the folder 'library' with the subfolder 'base'.
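We can check this equivalence directly. A minimal sketch (this should typically return TRUE on a default installation, though path formatting may differ on some systems, especially Windows):

# Compare system.file's default to the 'base' package folder under R home:
identical(system.file(), file.path(R.home(), "library", "base"))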

What happens when we add the arguments?

system.file(package = "LandGenCourse")
system.file("extdata", "wolf_geno_samp_10000.csv", 
                            package = "LandGenCourse")

Compare the paths to those from the previous chunk!

This is pretty cool! To be honest, I have no clue where such files are stored, and the absolute path would be different anyway on my Mac and on my Windows machine.

c. File manipulation {-}

We can use R's file manipulation functions to do things with the file before even importing the data. Let's check the file size:

myFile <- system.file("extdata", "wolf_geno_samp_10000.csv", 
                            package = "LandGenCourse")
file.size(myFile)
cat("File size: ", file.size(myFile) / 10^6, " MB")

The number returned is in bytes, hence we divide by 1 million (10^6) to get megabytes (MB).

Other useful functions include file.exists, file.copy, file.rename, file.remove and file.info (see help file: '?files').
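For example (a quick sketch using 'myFile' from above):

file.exists(myFile)        # TRUE if the file is found
file.info(myFile)$mtime    # time the file was last modified
basename(myFile)           # file name without the path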

What other files are available? The function dir lists all files and subfolders in a specific folder. The question then is which folder to specify?

Question: What do you think the following commands will return?

Give it a try:

dir()
dir(here::here())
dir(system.file(package = "LandGenCourse"))
dir(system.file("extdata", package = "LandGenCourse"))
dir(myFile)

3. Benchmarking file import and export options {-}

See also Chapter 5 in "Efficient R Programming": https://csgillespie.github.io/efficientR/input-output.html

a. Benchmark methods for importing csv files {-}

The file 'myFile' is about 2 MB, which is a reasonable size for comparing the speed of different import and export functions.

Let's benchmark the function read.csv used in Week 11. We use the function microbenchmark from the package microbenchmark to compare the speed of four different functions that can import a 'csv' file.

Note: Here we execute each function only once to save time; typically you would set times = 10 or so. Also, read_csv will print a warning about a missing column name. This is because the first column here contains the row names and does not have a column name. We'll ignore this here, as we can use the first column as an example to compare how character data are being imported.

x = myFile
microbenchmark::microbenchmark(times = 1, unit = "ms", 
          read.csv(x), readr::read_csv(x, show_col_types = FALSE), data.table::fread(x),
          rio::import(x))

Would it be faster if we first loaded the packages so that we could call the functions directly?

Note: you can list several independent commands on the same line by separating them with a semicolon ';'. Also, the chunk setting message=FALSE here suppresses the message from read_csv.

library(readr); library(data.table); library(rio); library(microbenchmark)

microbenchmark(times = 1, unit = "ms", 
          read.csv(x), read_csv(x, show_col_types = FALSE), fread(x), import(x))

Yes, the import was faster when the packages were already loaded.

Overall, fread and import were on the order of 50 times faster than read.csv and read_csv! The two had practically the same speed, which is hardly surprising: for 'csv' files, import uses the function fread.

The beauty of import is that it can handle a wide range of file types (and the list keeps growing): csv, xls, xlsx, html, xml, json, feather, R, RData, rda, rds, psv, tsv, sas7bdat, xpt, sav, dta, por, rec, mtp, syd, dbf, arff, dif, fwf, csv.gz, CSVY, fst, mat, ods, yml, as well as Fortran files and clipboard imports (Mac and Windows). It recognizes the file type from the extension and uses an appropriate import function.
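As a minimal illustration ('mtcars' is a built-in example data set; the file is written to a temporary directory):

library(rio)
tmpFile <- file.path(tempdir(), "mtcars.csv")
export(mtcars, tmpFile)   # writes a csv because of the '.csv' extension
head(import(tmpFile))     # the matching import method is chosen the same way
# Swapping the extension (e.g., '.xlsx' or '.feather') switches the format,
# provided the supporting package is installed.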

Note: there is also a function Import in the car package that similarly aims to provide an easy way to import various file formats. However, car::Import can be very slow (slower than 'read.csv'), whereas rio::import is fast.

b. Check handling of character data {-}

The functions differ not only in their speed but also in how they handle text data (character or factor?), missing values etc.

The first column in 'myFile' is an ID variable that should be used as row names. Let's compare what the four methods did with this. The following code determines, for each import method, the class of the first column (IDs), and the class (or classes) of the resulting object.

Note: here we use double square brackets to subset the first column. Strictly speaking, we interpret 'gen' as a list of vectors. With a data.frame, we could also access the first column with gen[,1]. However, this would not return what we want for 'tbl' or 'data.table' objects. Always double-check.

gen <- read.csv(x); c(class(gen[[1]]), class(gen))
gen <- read_csv(x, show_col_types = FALSE); c(class(gen[[1]]), class(gen))
gen <- fread(x); c(class(gen[[1]]), class(gen))
gen <- import(x); c(class(gen[[1]]), class(gen))

c. Binary files {-}

Binary files are not readable by users (or other software) but provide an efficient way of storing data. Let's compare file size and input/output speed between text files (csv) and different types of binary files (RData, rds, feather). We'll also export the 'csv' file so that we have it in the same location.

First we make sure an output folder exists in the R project:

if(!dir.exists(paste0(here::here(),"/output"))) dir.create(paste0(here::here(),"/output"))
gen <- import(myFile)

export(gen, file.path(here::here(), "output", "gen.csv"))
save(gen, file=file.path(here::here(), "output", "gen.RData"))
saveRDS(gen, file=file.path(here::here(), "output", "gen.rds"))
export(gen, file=file.path(here::here(), "output", "gen.feather"))

c(csv=file.size(file.path(here::here(), "output", "gen.csv")),
  RData=file.size(file.path(here::here(), "output", "gen.RData")),
  rds=file.size(file.path(here::here(), "output", "gen.rds")),
  feather=file.size(file.path(here::here(), "output", "gen.feather")))/10^6

Let's benchmark the import again. We can use the function import for all of them. This is so fast that we can actually do it 10 times.

microbenchmark(times = 10, unit = "ms", 
          csv= import(file.path(here::here(), "output", "gen.csv")),
          RData=import(file.path(here::here(), "output", "gen.RData")),
          rds=import(file.path(here::here(), "output", "gen.rds")),
          feather=import(file.path(here::here(), "output", "gen.feather")))

Look at the column 'mean'. Importing any of the binary files was at least twice as fast as importing the 'csv' file with the underlying function fread (which was already 50 times faster than read.csv).

Here's my recommendation for saving R objects/data efficiently: use 'rds' for saving individual R objects, 'feather' if you need to exchange data with Python, and keep a plain-text 'csv' copy for long-term storage.

Note: the developer of feather does not recommend using it for long-term data storage since its stability with future updates to R or Python can't be guaranteed: https://github.com/wesm/feather/issues/183

Why 'rds' and not 'RData'? In practice, the main advantage of 'rds' is convenience when importing data.

# Let's delete any copy of 'gen' from the workspace:
rm(gen)

# Create object 'myData' in a single step from 'rds' file:
myData <- readRDS(file.path(here::here(), "output", "gen.rds"))

# Two steps when importing 'RData': first, load the stored object:
load(file.path(here::here(), "output", "gen.RData")) 
# then assign to the new object 'myData':
myData <- gen

Note that when you use load, the object name is NOT taken from the file name! This means that you may not know what object you are loading, if the object and file names are different.

Let's test this. Here we save the object 'gen' in file 'gen2.RData', then load it.

# Export 'gen' to a file with a different name 'gen2.RData':
save(gen, file=file.path(here::here(), "output", "gen2.RData"))
rm(gen)

# Load:
load(file.path(here::here(), "output", "gen2.RData")) 

# What is the name of the loaded object?
exists("gen")
exists("gen2")

We see that an object 'gen' exists (TRUE), but an object 'gen2' does not exist (FALSE). The name of the loaded object is thus 'gen'.

d. Should you save your R workspace? {-}

When you close RStudio, you may be asked whether you want to save your workspace. If you do, all objects in memory are written to a hidden '.RData' file that is silently reloaded the next time you start a session. In general, you should not do this: your analyses stay reproducible only if every object can be re-created by running your scripts.

Also, in the vein of reproducible research, do not save multiple copies of your data set. Instead: keep a single master copy of the raw data, and re-create any derived objects from your scripts (saving intermediate results, e.g., as 'rds' files, only where computations are expensive).

e. Compile your functions {-}

A recommended way of keeping your code tidy is to write functions for anything you will do more than once.

To further speed up your code, you can compile your function with 'cmpfun':

myFunction <- function() {
    sum(rnorm(1000))/1000
}
myFunction.cmp <- compiler::cmpfun(myFunction)

microbenchmark::microbenchmark(myFunction(), myFunction.cmp())

Question: Which of the following times are most different between the uncompiled and the compiled versions of this simple function?

In this case, compiling mainly reduced the duration of the slowest 25% of runs (those above the upper quartile, 'uq'), which brought down the mean processing time. Note that recent versions of R byte-compile functions just-in-time by default, so the gains from 'cmpfun' are often modest.

4. Profiling your code {-}

a. Named chunks {-}

A simple way to identify parts of your code that may be slow is to: name the chunks in your R Notebook, knit or run the notebook, and note which named chunks take longest to execute.

To name a chunk, click on the wheel symbol at the top right of the grey chunk area and enter a one-word name.

Here's an example of a named chunk: the name 'myChunkName' has been added in the curly brackets {r, myChunkName}. You can add a name manually in the same way.


More generally, this is where chunk options are added in the R Notebook. Here's a long list of chunk options: https://yihui.name/knitr/options/
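For example, a chunk header that combines a name with a few common options might look like this (illustrative):

{r myChunkName, message=FALSE, warning=FALSE}

Here, message=FALSE and warning=FALSE suppress messages and warnings in the knitted output.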

b. Profiling some lines of code {-}

RStudio has a built-in menu for profiling. Check it out:

dat <- data.frame(
     x = rnorm(5e5), 
     y = rnorm(5e5))   
mean(dat$x)
with(dat, cor(x, y))
lm(y ~ x, data=dat)

You can achieve the same with a call to the function 'profvis' of the 'profvis' package:

profvis::profvis({
  dat <- data.frame(
       x = rnorm(5e5), 
       y = rnorm(5e5))   
  mean(dat$x)
  with(dat, cor(x, y))
  lm(y ~ x, data=dat)
})

The results will be opened in a 'Profile' tab. The upper part has two tabs, 'Flame Graph' and 'Data' (the lower part is less intuitive to interpret; we'll ignore it here).

The results may depend on the speed of your computer. Check the sample interval at the bottom of the 'Profile' tab: time was estimated by checking every 10 milliseconds what the computer was doing. Some lines must have run too fast to be recorded.
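If fast lines are missed, you can sample more finely with the 'interval' argument (a sketch; note that very short intervals may give unreliable results):

# Sample every 5 ms instead of the default 10 ms:
profvis::profvis({
  dat <- data.frame(x = rnorm(5e5), y = rnorm(5e5))
  mean(dat$x)
}, interval = 0.005)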

c. Converting an R Notebook to an R script {-}

Unfortunately, we can't profile an entire R Notebook (as far as I know). However, we can extract the R code as a script file, then profile the script file.

Copy an R Notebook file (Testfile.Rmd) to the downloads folder. This is just so that we have an example of an .Rmd file to extract the code from.

file.copy(from=system.file("extdata", "Testfile.Rmd", package = "LandGenCourse"),
                     to=file.path(here::here(), "downloads", "Testfile.Rmd"))

Extract the R code from the R Notebook and save it in a script file Testfile.R in the downloads folder. You may adapt the infile and outfile to extract the code from any R Notebook saved in your project.

infile = here::here("downloads/Testfile.Rmd")
outfile = here::here("downloads/Testfile.R")
knitr::purl(infile, outfile)

Let's open the two files and compare them. Note that file.show opens a simple text version, whereas file.edit shows the colored versions commonly displayed by the RStudio editor.

file.show(infile)
file.show(outfile)

#file.edit(infile)
#file.edit(outfile)

Question: Compare the two files (they should be open; check the tabs of the source pane).

d. Profiling an R script {-}

We can now use the function source to read and execute the code in the R script. We use the function Rprof to start and stop the profiling.

#Rprof(filename=here::here("downloads/Rprof.out"))
#source(outfile)
#Rprof(NULL)

With the function summaryRprof, we get a summary by function. First, let's check the total time spent executing the code, and the sample interval (in seconds). Note that profiling works in this way: as the code is executed, R checks at regular time intervals which function is being executed at that very moment. The summaryRprof function then tabulates the number of intervals, and thus the time used, by function.

#summaryRprof(here::here("downloads/Rprof.out"))[c("sampling.time", "sample.interval")] 

The element $by.total reports, for each function, the total time during which it was anywhere on the call stack, i.e., including the time spent in functions that it called.

#summaryRprof(here::here("downloads/Rprof.out"))$by.total

Note: The functions are listed by total time, in decreasing order. The two functions eval and source are related to evaluating (running) the code from the sourced R script file.

If you need a more detailed summary, $by.self attributes to each function only the time spent executing the function itself (excluding functions it called), including internal functions.

If you want to profile memory use rather than processing time, see here: https://developer.r-project.org/memory-profiling.html

5. Creating and executing a Bash R script {-}

a. Run R script directly in the Terminal {-}

Now that we have a stand-alone R script ('Testfile.R', extracted in the previous section), we can run it from the command line in the terminal (shell). After navigating to the correct folder (here: 'downloads'), type:

Rscript Testfile.R

This will source the file and execute the R code.
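You can also pass arguments to a script on the command line and retrieve them inside the script with commandArgs. A minimal sketch ('yourScript.R' and the argument are hypothetical):

# Inside yourScript.R: read trailing command-line arguments (as character)
args <- commandArgs(trailingOnly = TRUE)
n <- as.numeric(args[1])   # e.g., a sample size passed as the first argument
mean(rnorm(n))

You would then call it as, e.g.: Rscript yourScript.R 1000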

b. Create a Bash R script {-}

If you want to run your code on a node or cluster, you may need to take this one step further and include the R code in a Bash script. In a Bash script, you can add commands that govern resource use when submitting a job to a node or cluster. Bash scripts are the way of giving instructions to the cluster's scheduler (e.g., SLURM) about how to manage input and output files.

To execute our R code as a Bash script, we need to add a few lines.

Let's modify the previous code and write it into a Bash file. As an additional challenge, our code contains two figures, which won't be written anywhere unless we change the code to write them into a file:

Note: We use single quotes for the file name here, 'my_plot.png', as they are nested within a set of double quotes. R pretty much considers single and double quotes as synonyms, which allows us to nest them either way: '""' or "''".

myPath <- file.path(here::here(), "output/myBashScript.sh")
fileConn <- file(myPath)
writeLines(c("#!/bin/bash",
             "R --slave << EOF",
             "x <- rnorm(100)",
             "mean(x)",
             "png('my_plot.png', height = 400, width = 800)",
             "par(mfrow=c(1,2))", 
             "hist(x)", 
             "qqnorm(x)",
             "dev.off()",
             "EOF"), fileConn)
close(fileConn)
file.show(myPath)

c. Executing a Bash R script {-}

On Mac / Unix / Linux, this is straightforward:

  1. Open terminal:
    • From the RStudio menu, select 'Tools' > 'Terminal' > 'New Terminal'. This will open a Terminal tab in RStudio. Alternatively, you could select 'Tools' > 'Shell' to open a shell in a new window outside RStudio.
    • Check the prompt: it should start with the name of your computer, then a colon, then the name of your project folder, then your user name followed by '$'.
  2. Navigate to Bash file:
    • Enter ls to list the content of the project folder.
    • Enter cd output to change directory to the subfolder 'output'.
    • Repeat ls to list the content of the 'output' folder. The Bash script 'myBashScript.sh' should be listed there.
  3. Execute Bash file:
    • Enter chmod +x myBashScript.sh to change file permission for the script.
    • Enter ./myBashScript.sh to execute the script.
  4. Find output:
    • The output from mean(x) is printed in the terminal; it should look something like this: [1] -0.07731751 (your value will differ, since x is random).
    • This is followed by 'null device' and the number 1, which tells us that a graphics device has been closed.
    • Enter ls to list the content of the project folder. The graphics file 'my_plot.png' should now be listed.

Use R again to open the graphics file (the code here first checks whether the file exists):

myPNG <- file.path(here::here(), "output/my_plot.png")
if(file.exists(myPNG))
{
  file.show(myPNG)
}

d. Moving to a node or cluster? {-}

The example bash file and advice in this section have been provided by Hossam Abdel Moniem, thanks!

Here's an annotated example of a bash file that contains instructions for submitting a job to a single node (a single machine with multiple/many cores).

Note: A copy of the file 'BashExample.sh' should also have been copied into the downloads folder inside your project folder.

writeLines(readLines(system.file("extdata", "BashExample.sh", package = "LandGenCourse")))

Instead of including the R code directly in the bash file, the last line here executes an R script with the Rscript command.

Notice the second-last line. Obviously, on the node, R and any relevant packages need to be pre-installed. Different users may need different configurations (different packages or versions) installed, hence each installation has a name, which needs to be specified in the bash script.
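As a hypothetical illustration, we could write such a script from R in the same way as before (a sketch only: the resource values, the module name 'R/4.1.2', the script names, and the SLURM directives shown here are placeholders that depend on your cluster):

myCluster <- file.path(here::here(), "output", "myClusterScript.sh")
writeLines(c("#!/bin/bash",
             "# Request 8 cores and 1 hour of wall time (placeholders):",
             "#SBATCH --cpus-per-task=8",
             "#SBATCH --time=01:00:00",
             "# Load the named R installation (hypothetical name):",
             "module load R/4.1.2",
             "# Run the analysis script:",
             "Rscript myScript.R"), myCluster)
file.show(myCluster)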

Note: Make sure that all packages that you need (and their dependencies), as well as the package unixtools, have been installed on the node or cluster (i.e., they are part of the installation you will be using). Install unixtools with: install.packages("unixtools", repos = "http://rforge.net/")

Further reading:

e. Store your session info and package versions {-}

If you use any R packages, load them at the beginning of your script file with library. Make sure the packages are installed on the system where you will be running the Bash R script.

A big issue with R is that package updates may break your code. At least at the end of any project (such as the analyses for a manuscript), save your session information.

Here we use the function session_info from the devtools package (preferred over the base R function sessionInfo). We store the information as an object, 'Session', of class 'session_info', which has two list elements, $platform and $packages:

Platform:

Session <- devtools::session_info()
Session$platform

Packages: here we display only the first six lines, as the list may be long.

head(Session$packages)

Exporting the session information is a bit tricky because 'Session' is not in tabular format, and what we want to export is the formatted output, not the object itself.

Also, we will add a time stamp to the file name. This achieves two goals: avoid overwriting earlier files, and keep a record of the date of the session information.

today <- format(Sys.Date(), "%Y-%m-%d")
outFile <- paste0("SessionInfo", "_", today, ".txt")
outPath <- file.path(here::here(), "output", outFile)
writeLines(capture.output(devtools::session_info()), outPath)
file.show(outPath)

More advanced ways of handling the problem of package versions, to make sure you can run your code in the future without compatibility issues, include: project-local package libraries with version snapshots (e.g., the packages 'renv' or its predecessor 'packrat'), and containerizing the entire analysis environment (e.g., with Docker).
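For instance, with 'renv' (a minimal sketch, assuming the package is installed; commented out, as these commands modify your project):

#renv::init()      # create a project-local package library
#renv::snapshot()  # record the exact package versions in a lockfile
#renv::restore()   # later (or elsewhere), re-install the recorded versions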

More on the topic: https://timogrossenbacher.ch/2017/07/a-truly-reproducible-r-workflow/

f. Potential Windows issues {-}

Check that Bash is installed and ready to use with RStudio: from the RStudio menu, select 'Tools' > 'Global Options' > 'Terminal' and check whether a Bash shell (e.g., 'Git Bash') is available for new terminals.

If this does not work, try installing 'Git for Windows', which will also install Bash: http://neondataskills.org/setup/setup-git-bash-R

Here's a detailed multi-part tutorial on running R from the command line, with R scripts and Bash R scripts, with some additional information on Windows:

6. Parallelizing code {-}

Note that the following issues may create problems when developing Bash R scripts on Windows that you want to run, e.g., on a Linux cluster or another Unix-type system: Windows line endings (CRLF instead of LF) in script files, backslashes in file paths, and the lack of process forking on Windows (which affects functions such as mclapply, see below).

a. Replace 'lapply' by 'mclapply' with package 'parallel' {-}

Note: while this will run on Windows without causing an error, it will only be faster on Mac / Unix / Linux.

With the package 'parallel', it is really easy to use all cores of your local machine (as long as you are on a Mac / Unix / Linux system). Let's check the number of cores available:

library(parallel)
detectCores()

Question: How many cores does your machine have?

Note: Here we check whether the operating system is 'Windows', in which case we set nCores = 1. This means that we will use all cores on Mac or Linux, but only one core on Windows. This is to avoid problems on Windows machines.

nCores <- detectCores()
if(Sys.info()[['sysname']]=="Windows") nCores = 1
nCores

NOTE (April 2021): the package build does not work with multiple cores, so nCores is set to 1 in the code below.

x <- gen[,-1]
m1 <- lapply(x, mean, na.rm=TRUE)
#m2 <- mclapply(x, mean, na.rm=TRUE, mc.cores=nCores)  # Use this line when running the code yourself
m2 <- mclapply(x, mean, na.rm=TRUE, mc.cores=1)        # Replace this line with the previous line

Let's benchmark four ways of calculating the mean of each column in our example data set 'gen' with 94 rows and 10,000 columns (SNPs):

method1 <- function(x) {colMeans(x, na.rm=TRUE)}
method2 <- function(x) {for(i in 1:ncol(x)) mean(x[,i], na.rm=TRUE)}
method3 <- function(x) {lapply(x, mean, na.rm=TRUE)}
#method4 <- function(x) {mclapply(x, mean, na.rm=TRUE, mc.cores=nCores)} # Use this line when running the code yourself
method4 <- function(x) {mclapply(x, mean, na.rm=TRUE, mc.cores=1)}  # Replace this line with the previous line

microbenchmark::microbenchmark(times = 10, unit = "ms",
                               method1(x), method2(x), method3(x), method4(x))

Question: Compare the mean times of the four methods. Which was fastest, and by how much?

Note: Obviously you might only expect to see a gain in speed if nCores > 1.

b. Replace 'for' by 'foreach' with package 'doParallel' {-}

On Windows, it is easier to use the package 'doParallel' with the function 'foreach'. Here's a detailed introduction: https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf.

The following code is commented out to avoid problems when knitting the Notebook. You may uncomment and run it.

library(doParallel)
#cl <- makeCluster(2)
#cl
#registerDoParallel(cl)
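When you are done with the parallel computations, free the workers again (commented out here to match the chunk above; 'cl' is the cluster object created above):

#stopCluster(cl)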

Now we adapt the code in two steps: first, we replace the 'for' loop by 'foreach' with the '%do%' operator; then we switch '%do%' to '%dopar%' to run the iterations in parallel (commented out here).

m1 <- for(i in 1:ncol(x)) mean(x[,i], na.rm=TRUE)
m2 <- foreach(i = 1:ncol(x)) %do% (mean(x[,i], na.rm=TRUE))
#m3 <- foreach(i = 1:ncol(x)) %dopar% (mean(x[,i], na.rm=TRUE))

Let's benchmark this again. We'll only do 3 replicates this time (uncomment before running this code).

#method1 <- function(x) {colMeans(x, na.rm=TRUE)}
#method2 <- function(x) {for(i in 1:ncol(x)) mean(x[,i], na.rm=TRUE)}
#method3 <- function(x) {lapply(x, mean, na.rm=TRUE)}
#method4 <- function(x) {mclapply(x, mean, na.rm=TRUE, mc.cores=nCores)}
#method5 <- function(x) {foreach(i = 1:ncol(x)) %do% (mean(x[,i], na.rm=TRUE))}
#method6 <- function(x) {foreach(i = 1:ncol(x)) %dopar% (mean(x[,i], na.rm=TRUE))}

#microbenchmark::microbenchmark(times = 3, unit = "ms",method1(x), method2(x), method3(x), method4(x), method5(x), method6(x))

Whether parallelisation is faster depends on the type of task and on your system. In this case, both versions that used parallelisation were actually slower than the sequential code (at least on my system).

Of course, this will not always be the case; whether parallelisation pays off may depend on: the amount of work per iteration (the overhead of distributing tasks and collecting results must be small relative to the computation itself), the number of cores available, and how much data must be passed to the workers.

LandGenCourse::detachAllPackages()

