cache: Apply a database caching layer to an arbitrary R function.


Description

Requesting data from an API or performing queries can sometimes yield dataframe outputs that are easily cached according to some primary key. This function makes it possible to cache the output along the primary key, while only using the uncached function on those records that have not been computed before.
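The idea can be sketched without a database. The following is a minimal in-memory illustration, not the package's implementation: `slow_lookup`, `make_cached`, and `computed_ids` are hypothetical names, and a data.frame held in a closure stands in for the database table.

```r
# Hypothetical stand-in for a slow API call: returns one row per requested id
# and records which ids were actually computed.
computed_ids <- c()
slow_lookup <- function(id) {
  computed_ids <<- c(computed_ids, id)
  data.frame(id = id, value = id^2)
}

# Minimal caching layer keyed on one column. `store` plays the role of the
# database table (assumption: the real package persists this in a database).
make_cached <- function(uncached_function, key) {
  store <- NULL
  function(ids) {
    cached_ids <- if (is.null(store)) ids[0] else intersect(ids, store[[key]])
    new_ids <- setdiff(ids, cached_ids)
    if (length(new_ids) > 0) {
      # Only the keys never seen before reach the uncached function.
      store <<- rbind(store, uncached_function(new_ids))
    }
    out <- store[match(ids, store[[key]]), , drop = FALSE]
    rownames(out) <- NULL
    out
  }
}

cached_lookup <- make_cached(slow_lookup, key = "id")
first  <- cached_lookup(c(1, 2, 3))  # computes ids 1, 2, 3
second <- cached_lookup(c(2, 3, 4))  # computes only id 4
```

After both calls, `computed_ids` holds each id once, which shows that the second call reused the stored rows for ids 2 and 3.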

Usage

cache(uncached_function, key, salt, con, prefix = deparse(uncached_function),
  env = "cache", batch_size = 100, safe_columns = FALSE,
  blacklist = NULL)

Arguments

uncached_function

function. The function to cache.

key

character. A character vector of primary keys. If key is unnamed, the user guarantees that uncached_function has these as formal arguments and that it returns a data.frame containing columns with at least those names. For example, if we are caching a function that looks like function(author) { ... }, we expect its output to be a data.frame containing an "author" column with one record for each author. In this situation, key = "author". Otherwise, if key is a named vector of length 1, its name must match the key argument of uncached_function, and its value must match one of the columns of the returned data.frame.
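The two forms of key can be illustrated with a small check. This is a sketch only: `check_key`, `by_author`, and `by_author_id` are hypothetical and not part of cachemeifyoucan; the helper simply tests the contract described above against a sample output.

```r
# Hypothetical helper: does `key` satisfy the contract for this function?
check_key <- function(uncached_function, key, sample_output) {
  # For an unnamed key, argument names and column names coincide.
  arg_names <- if (is.null(names(key))) unname(key) else names(key)
  col_names <- unname(key)
  all(arg_names %in% names(formals(uncached_function))) &&
    all(col_names %in% colnames(sample_output))
}

# Unnamed key: the argument and the output column are both called "author".
by_author <- function(author) {
  data.frame(author = author, name_length = nchar(author))
}

# Named key: the argument is `author_id`, but the output column is "id".
by_author_id <- function(author_id) {
  data.frame(id = author_id, remainder = author_id %% 7)
}

ok_unnamed <- check_key(by_author, "author", by_author("Austen"))
ok_named   <- check_key(by_author_id, c(author_id = "id"), by_author_id(1:3))
# An unnamed key fails here because the output has no "author_id" column.
bad_key    <- check_key(by_author_id, "author_id", by_author_id(1:3))
```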

salt

character. The names of the formal arguments of uncached_function for which a unique value at calltime should use a different database table. In other words, if uncached_function has arguments id, x, y, but different kinds of data.frames (i.e., ones with different types and/or column names) will be returned depending on the value of x or y, then we can set salt = c("x", "y") to use a different database table for each combination of values of x and y. For example, if x and y are only allowed to be TRUE or FALSE, with potentially four different kinds of data.frame outputs, then up to four tables would be created.
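As a rough illustration of the four-table case, the combinations of two logical salt arguments can be enumerated in base R. The naming scheme below (`my_prefix` plus the pasted salt values) is an assumption made for readability; the package derives its own table names.

```r
# All combinations of two logical salt arguments, x and y.
salt_values <- expand.grid(x = c(TRUE, FALSE), y = c(TRUE, FALSE))

# Hypothetical naming scheme: one table name per combination of salt values.
table_names <- paste(
  "my_prefix",
  paste0("x", as.integer(salt_values$x)),
  paste0("y", as.integer(salt_values$y)),
  sep = "_"
)

print(table_names)
```

Each distinct combination maps to its own name, which is why a salt argument with many possible values leads to many tables.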

con

SQLConnection or character. Database connection object, or character path to database.yml file. In the latter case, you will have to specify an env parameter that determines the environment used for the database.yml file.

prefix

character. Database table prefix. A different prefix should be used for each cached function so that there are no table collisions. Optional, but highly recommended. By default, the deparsed name of the uncached_function parameter.

env

character. The environment of the database connection if con is a yaml configuration file. By default, "cache".

batch_size

integer. Usually, the uncached operation is slow (or we would not have to cache it!), whereas fetching data from the database is fast. The batch_size parameter controls the size of the chunks in which the uncached operation is computed and cached. This makes the process more robust to failures: results for batches that have already completed remain stored even when an error occurs midway through. The default is 100.

Note that the batchman package should be installed for batching to take effect.
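The benefit of batching can be sketched in base R. This is an illustration of the idea only, not how batchman or the package implements it; `flaky_lookup` and `compute_in_batches` are hypothetical names.

```r
# Hypothetical slow function that fails whenever id 13 is requested.
flaky_lookup <- function(id) {
  if (13 %in% id) stop("upstream API error")
  data.frame(id = id, value = id * 10)
}

# Process keys in chunks of `batch_size`, keeping whatever finished
# before the first error.
compute_in_batches <- function(f, ids, batch_size = 100) {
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  stored <- NULL
  for (batch in batches) {
    result <- tryCatch(f(batch), error = function(e) NULL)
    if (is.null(result)) break  # earlier batches remain in `stored`
    stored <- rbind(stored, result)
  }
  stored
}

# Batches are 1:5, 6:10, 11:15, 16:20; the third one fails.
partial <- compute_in_batches(flaky_lookup, 1:20, batch_size = 5)
```

With a single batch of 20 the whole computation would be lost; with batches of 5, the rows for ids 1 to 10 survive the failure.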

safe_columns

logical or function. If safe_columns = TRUE and a caching call would add columns to an already existing cache table, the function will instead raise an error. If safe_columns is a function, that function will be called with no arguments and must return TRUE for this to work; this is mainly so you can write your own error message. If safe_columns is FALSE, the additional columns will be added. Defaults to FALSE.

blacklist

list. Any elements in this list will be blocked from caching. This is useful for implementing a conditional cache or adding more safety around your caching layer. Defaults to NULL, no blacklist.

Value

A function with a caching layer that does not call uncached_function with already computed records, but retrieves those results from an underlying database table.

Examples

## Not run: 
# These examples assume you have a database connection object
# (as specified in the DBI package) in a local variable `con`.

# Imagine we have a function that returns a data.frame of information
# about IMDB titles through their API. It takes an integer vector of
# IDs and returns a data.frame with an "id" column, with one row for
# each title. (for example, 111161 would correspond to
# http://www.imdb.com/title/tt111161/ which is The Shawshank Redemption).
amazon_info <- function(id) {
  # Call external API.
}

# Sending HTTP requests to Amazon and waiting for the response is
# slow, so if we ask for some IDs that have
# already been computed in the past, it would be useful to not
# make additional HTTP requests for those records. For example,
# we may want to do some processing on all Amazon titles. However,
# new records are created each day. Instead of parsing all
# the historical records on each execution, we would like to only
# parse new records; old records would be retrieved from a database
# table that had the same column names as a typical output data.frame
# of the `amazon_info` function.
cached_amazon_info <- cachemeifyoucan::cache(amazon_info, key = 'id', con = con)

# By using the `cache` function, we are asking for the following:
#   (1) If we call `cached_amazon_info` with a vector of integer IDs,
#       take the subset of IDs that have already been returned from
#       a previous call to `cached_amazon_info`. Retrieve the data.frame
#       for these records from an underlying database table.
#   (2) The remaining IDs (those we have never passed to `cached_amazon_info`)
#       should be fed to the base `amazon_info` function as if we had
#       called it with this subset. This will yield another data.frame that
#       was computed using live HTTP requests.
# The `cached_amazon_info` function will return the union (rbind) of these
# two data sets as one single data set, as if we had called `amazon_info`
# by itself. It will also cache the second data set so another identical
# call to `cached_amazon_info` will not trigger any additional HTTP requests.

###
# Salts
###

# Imagine our `amazon_info` function is slightly more complicated:
# instead of always returning the same information about film titles,
# it has an additional parameter `type` that controls whether we
# want info about the filmography or about the reviews. The output
# of this function will still be data.frames with an `id` column
# and one row for each title, but the other columns can be different
# now depending on the `type` parameter.
amazon_info2 <- function(id, type = 'filmography') {
  if (identical(type, 'filmography')) { return(amazon_info(id)) }
  else { return(review_amazon_info(id)) } # Assume we have this other function
}

# If we wish to cache `amazon_info2`, we need to use different underlying
# database tables depending on the given `type`. One table may have
# columns like `num_actors` or `film_length` and the other may have
# columns such as `num_reviews` and `avg_rating`.
cached_amazon_info2 <- cachemeifyoucan::cache(amazon_info2, key = 'id',
  salt = 'type', con = con)

# We have told the caching layer to use the `type` parameter as the "salt".
# This means different values of `type` will use different underlying
# database tables for caching. It is up to the user to construct a
# function like `amazon_info2` well so that it always returns a data.frame
# with exactly the same column names if the `type` parameter is held fixed.
# The salt should usually consist of a collection of parameters (typically
# only one, `type` as in this example) that have a small number of possible
# values; otherwise, many database tables would be created for different
# values of the salt. Consider the following example.

bad_amazon_filmography <- function(id, actor_id) {
  # Given a single actor_id and a vector of title IDs,
  # return information about that actor's role in the film.
}
bad_cached_amazon_filmography <-
  cachemeifyoucan::cache(bad_amazon_filmography, key = 'id',
    salt = 'actor_id', con = con)

# We will now be creating a separate table each time we call
# `bad_cached_amazon_filmography` for a different actor!

###
# Prefixes
###

# It is very important to give the function you are caching a prefix:
# when it is stored in the database, its table name will be the prefix
# combined with some string derived from the values in the salt.

cached_review_amazon_info <- cachemeifyoucan::cache(review_amazon_info,
  key = 'id', con = con)

# Remember our `review_amazon_info` function from an earlier example?
# If we attempted to cache it without a prefix while also caching
# the vanilla `amazon_info` function, the same database table would be
# used for both functions! Since function representation in R is complex
# and there is no good way in general to determine whether two functions
# are identical, it is up to the user to determine a good prefix for
# their function (usually the function's name) so that it does not clash
# with other database tables.

cached_amazon_info <- cachemeifyoucan::cache(amazon_info,
  prefix = 'amazon_info', key = 'id', con = con)
cached_review_amazon_info <- cachemeifyoucan::cache(review_amazon_info,
  prefix = 'review_amazon_info', key = 'id', con = con)

# We will now use different database tables for these two functions.

###
# force.
###

# `force.` is a reserved argument for the to-be-cached function. If
# it is specified to be `TRUE`, the caching layer will forcibly
# repopulate the database tables for the given ids. The default value
# is `FALSE`.

cached_amazon_info <- cachemeifyoucan::cache(amazon_info,
  prefix = 'amazon_info', key = 'id', con = con)
cached_amazon_info(c(10, 20), force. = TRUE) # Will forcibly repopulate.

###
# Advanced features
###

# We can use multiple primary keys and salts.
grab_sql_table <- function(table_name, year, month, dbname = 'default') {
  # Imagine we have some function that given a table name
  # and a database name returns a data.frame with aggregate
  # information about records created in that table from a
  # given year and month (e.g., ensuring each table has a
  # created_at column). This function will return a data.frame
  # with one record for each year-month pair, with at least
  # the columns "year" and "month".
}

cached_sql_table <- cachemeifyoucan::cache(grab_sql_table,
  key = c('year', 'month'), salt = c('table_name', 'dbname'), con = con,
  prefix = 'sql_table')

# We would like to use a separate table to cache each combination of
# table_name and dbname. Note that the character vectors passed into
# the `key` and `salt` parameters have to exactly match the names of
# the formal arguments in the initial function, and the `key` values
# must also be names of columns in the returned data.frame. If these
# do not agree, you can wrap your function. For example, if the
# data.frame returned has 'mth' and 'yr' columns, you could instead
# cache the wrapper:
wrap_sql_table <- function(table_name, yr, mth, dbname = 'default') {
  grab_sql_table(table_name = table_name, year = yr, month = mth, dbname = dbname)
}

###
# Debugging option `cachemeifyoucan.debug`
###

# Sometimes it might be interesting to take a look at the underlying database
# tables for debugging purposes. However, the contents of the database are
# somewhat obfuscated. If you set the `cachemeifyoucan.debug` option to TRUE,
# then every time you execute a cached function you will see some additional
# metadata printed out, helping you navigate the database. An example output
# looks like this:
#
# Using table name: amazon_data_c3204c0a47beb9238a787058d4f03834
# Shard dimensions:
#   shard1_f8e8e2b41ac5c783d0954ce588f220fc: 45 rows * 308 columns
# 11 cached keys
# 5 uncached keys


## End(Not run)

robertzk/cachemeifyoucan documentation built on May 27, 2019, 10:34 a.m.