readme.MD

catcher - Drop In Caching for R Functions

Build Status codecov

Author: Chris Hua / chua@wharton.upenn.edu

There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton

Easily speed up repeated expensive operations by saving the results on disk and loading from disk on subsequent runs, with many quality-of-life enhancements for data scientists.

Installation

Install from this repository:

if(!require("devtools")) install.packages("devtools")
devtools::install_github("stillmatic/catcher")

Why catcher

This package is very light but includes a number of important helpers: advanced management of the cache and when to use the cache. It does not require use of databases or any other dependencies. Its core functions are easy to understand and can easily replace existing code without breaking old functionality. This package additionally takes care of all necessary hashing to store files.

Existing R caching packages are either too complicated, e.g. R.cache, or too barebones, e.g. simpleRCache. This package attempts to create a sensible middle ground, with a number of tools essential for data scientists built in. For example, we offer first-class handling of stale cached data and notation of when data was loaded. Existing

In terms of design, this package is essentially a hashtable held solely on disk and 'instantiated' by examining the files in your cache directory.

This package was inspired by some of the work I did on Rbnb while at Airbnb Data Science this summer. Particular thanks are due to Ricardo Bion and Jason Goodman for their earlier work on this problem.

Usage

The main function is cache_op():

cache_op("read.csv", 
  "https://cdn.rawgit.com/Keno/8573181/raw/7e97f56f521d1f49b966e04457687e87da1b062b/gistfile1.txt", 
  header = T)

This is equivalent to:

read.csv("https://cdn.rawgit.com/Keno/8573181/raw/7e97f56f521d1f49b966e04457687e87da1b062b/gistfile1.txt", 
  header = T)

You can also use this library to create wrapped versions of your own functions. Take for example a hypothetical function sql that connects to your SQL database:

sql <- function(query, user = Sys.getEnv("db_auth_user"), pass = Sys.getEnv("db_auth_pass")) {
  # ...
}

sql("SELECT * FROM dim_users LIMIT 100")

You can wrap this function easily with the following:

sql_c <- function(query, ...) {
  catcher::cache_op("sql", query, ...)
}

sql_c("SELECT * FROM dim_users LIMIT 100")

Easily reproducible proof of concept:

sin_c <- function(query, ...) { 
  catcher::cache_op("sin", query, ...)
}

sin_c(pi)
# reading from cache created at 2016-09-03 14:03:32; 0.05 days old.
# [1] 1.224606e-16

Options

The function cache_op() takes a few additional arguments which customize the behavior of the caching.

When wrapping your own custom functions with caches, you can modify these parameters however you want. Let's say you have some function get_ga() that gathers Google Analytics data for the last week, so you want to redownload the data if it's been more than 7 days since the last update. Then, you would initiate it as:

get_ga_c <- function(query, ...) { 
  catcher::cache_op("get_ga", max_lifetime = 7, query, ...)
}

# or, if you want to expose catcher's options:

get_ga_c <- function(query, max_lifetime = 7, ...) {
  catcher::cache_op("get_ga", max_lifetime = max_lifetime, query, ...)
}


stillmatic/catcher documentation built on May 30, 2019, 5:44 p.m.