knitr::opts_chunk$set( collapse = TRUE, message = FALSE, warning = FALSE, comment = "#>", fig.align = "center" )
https://rnabioco.github.io/practical-data-analysis/
Read the Guide to RMarkdown for an exhaustive description of the various formats and options for using RMarkdown documents. Note that HTMLs and slides for this class were all made from Rmd.
You can speed up knitting of your Rmds by using caching to store the results from each chunk, instead of rerunning them each time. Note that if you modify the code chunk, previous caching is ignored.
For each chunk, set {r, cache = TRUE}
Or the option can be set globally at top of the document. Like this:
knitr::opts_chunk$set(cache = TRUE)
Run once, save, and load instead of rerunning resource intensive parts.
If you have non-deterministic functions, such as kmeans
, remember to set.seed
, or save and load result objects.
In most cases, what you need is already made into well-documented packages, and you don't have to reinvent the wheel (but sometimes you should?). Depending on where the package is curated, installation is different. Some examples below:
Gviz
- visualize gene modelVennDiagram
- making custom venn diagramsemo
- inserting emojis into Rmd# BiocManager::install("Gviz") # from bioconductor vignette("Gviz") # install.packages("eulerr") # from CRAN plot(eulerr::euler(list(set1 = c("geneA", "geneB", "geneC"), set2 = c("geneC", "geneD")))) # devtools::install_github("hadley/emo") # from github emo::ji("smile")
Advantages of using version control include:
rolling back code if needed
branched development, tackling individual issues/tasks
collaboration, etc
Git was first created by Linus Torvalds for coordinating development of Linux. Read this guide for Getting started and try out the Interactive practice.
```{bash git, eval = FALSE}
ls git status # list changes git blame assignments.Rmd # see who contributed
This can be handled by Rstudio as well (new tab next to `Connections` and `Build`) ### Put your code on GitHub As you write more code, especially as functions and script pipelines, hosting and documenting them on GitHub is great way to make them portable and searchable. Even the free tier of GitHub accounts now has private repositories (repo). If you have any interest in a career in data science/informatics, GitHub is also a common showcase of what (and how well/often) you can code. After some accumulation of code, definitely put your GitHub link on your CV/resume. ### Example repo (RBI) - [valr](https://github.com/rnabioco/valr) Refer to package examples and [guide to building packages](https://r-pkgs.org/index.html) to build your own package and website documentation. ### Asking for help with other packages on GitHub Every package should include README, installation instructions, and a maintained `issues` page where questions and bugs can be reported and addressed. Example: [readr GitHub page](https://github.com/tidyverse/readr/issues) Don't be afraid to file new issues, but very often your problems are already answered in the `closed` section. ## 4. Other tips/best practices ### a. Code and data organization Read this: [A Quick Guide to Organizing Computational Biology Projects](https://doi.org/10.1371/journal.pcbi.1000424) ```bash NEWS.md # markdown document for tracking progress data # data directory for storing raw and processed data dbases # database files downloaded from other places docs # publication documents and random project files your-project.Rproj # Make an Rproject in Rstudio for every project results # store all RMarkdown here src # store useful scripts here
here
, folder structure managementhttps://github.com/jennybc/here_here
here::here() # always points to top-level of current project here::here("vignettes", "class-12.Rmd") # never confused about folder structure
styler
, clean up code readabilityRefer to this style guide often, so you don't have to go back to make the code readable/publishable later.
styler::style_text(" my_fun <- function(x, y, z) { x+ z } ") #> my_fun <- function(x, #> y, #> z) { #> x + #> z #> } # styler::style_file # for an entire file # styler::style_dir # or folder
microbenchmark
and profvis
# example, compare base and readr csv reading functions path_to_file <- system.file("extdata", "gene_tibble.csv", package = "pbda") res <- microbenchmark::microbenchmark( base = read.csv(path_to_file), readr = readr::read_csv(path_to_file), times = 5 ) print(res, signif = 2) microbenchmark:::autoplot.microbenchmark(res) # example, looking at each step of a script library(dplyr) p <- profvis::profvis({ path_to_file <- system.file("extdata", "gene_tibble.csv", package = "pbda") c1 <- readr::read_csv(path_to_file) c1 <- c1 %>% filter(padj < 0.00001) %>% mutate(gene = stringr::str_to_lower(gene)) }) p
shiny
, interactive web app for data explorationMaking an interactive interface to data and plotting is easy in R. Examples and corresponding code can be found at https://shiny.rstudio.com/gallery/.
Save functions in .R files to easily source()
them later. If you have code that does not need input from beginning to end, .R files can be executed from the command line/terminal as well.
```{bash script, eval = FALSE}
Rscript /Users/rf/teach/practical-data-analysis/R/mat.R ```
The R studio community forums are a great resource for asking questions about tidyverse related packages.
StackOverflow provides user-contributed questions and answers on a variety of topics.
For help with bioconductor packages, visit the Bioc support page
Find out if others are having similar issues by searching the issue on the package GitHub page.
Hadley Wickham has written very good (free) ebooks on using R.
For general bioinformatics advice, we found the following text very useful.
Bioinformatics Data Skills by Vince Buffalo
For statistics related to bioinformatics, this free course is excellent:
PH525x series - Biomedical Data Science
For more detailed descriptions of single cell RNA-Seq analysis:
https://scrnaseq-course.cog.sanger.ac.uk/website/index.html
https://satijalab.org/seurat/vignettes.html
Rstudio links to common ones here: Help
-> Cheatsheets
. More are hosted online, such as for regular expressions.
Useful to keep your own stash too.
The RBI fellows hold standing office hours on Thursdays, walk-ins 1-2pm, by appointment 2-4pm, in RC1-South Room 9101. We are happy to help out with coding and RNA/DNA-related informatics questions. Send us an email to let us know you will be stopping by (rbi.fellows@cuanschutz.edu
).
No one writes perfect code. A certain highly-regarded and cited scRNA-seq package drew cell type progression arrows in the wrong direction, for months before the issue was fixed!
If you suspect bugs or mishandled edge cases, go to the package GitHub and search the issues section to see if the problem has been reported or fixed. If not, submit an issue that describes the problem. The reprex package makes it easy to produce well-formatted reproducible examples that demonstrate the problem. Often developers will be thankful for your help with making their software better.
Excel:
Gets unwieldy with large files
Poor plotting options
Hard to do dplyr-like manipulations
Hard to integrate with other informatics tools
Annoying to apply same analysis over multiple files
And random things like this:
{ width=70% }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.