Requesting data from an API or performing queries can sometimes yield dataframe outputs that are easily cached according to some primary key. This function makes it possible to cache the output along the primary key, while only using the uncached function on those records that have not been computed before.
uncached_function |
function. The function to cache. |
key |
character. A character vector of primary keys. If |
salt |
character. The names of the formal arguments of |
con |
SQLConnection or character. Database connection object, or
character path to database.yml file. In the latter case, you will have to
specify an |
prefix |
character. Database table prefix. A different prefix should
be used for each cached function so that there are no table collisions.
Optional, but highly recommended. By default, the deparsed name of the
|
env |
character. The environment of the database connection if con
is a YAML configuration file. By default, |
batch_size |
integer. Usually, the uncached operation is slow
(or we would not have to cache it!). However, fetching data from the
database is fast. To handle this dichotomy, the Note that the batchman package should be installed for batching to take effect. |
safe_columns |
logical or function. If safe_columns = |
blacklist |
list. Any elements in this list will be blocked from caching.
This is useful for implementing a conditional cache or adding more safety around
your caching layer. Defaults to |
A function with a caching layer that does not call uncached_function with already computed records, but retrieves those results from an underlying database table.
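To make the behaviour of the returned function concrete, here is a minimal in-memory sketch of the same idea. It is not the package's implementation: `cache_sketch`, `slow_square`, and the `store` data.frame that stands in for the database table are hypothetical names introduced only for illustration, and the sketch assumes a single key column.

```r
# Hypothetical in-memory stand-in for a key-based cache (illustration only).
cache_sketch <- function(uncached_function, key = "id") {
  store <- NULL  # plays the role of the underlying database table
  function(ids) {
    ids <- unique(ids)
    # Keys already present in the store do not need to be recomputed.
    cached_ids <- if (is.null(store)) ids[0] else intersect(ids, store[[key]])
    new_ids <- setdiff(ids, cached_ids)
    if (length(new_ids) > 0) {
      # Only the uncached keys reach the slow function.
      fresh <- uncached_function(new_ids)
      store <<- if (is.null(store)) fresh else rbind(store, fresh)
    }
    # Return cached and fresh rows together, in the order the keys were given.
    result <- store[store[[key]] %in% ids, , drop = FALSE]
    result <- result[order(match(result[[key]], ids)), , drop = FALSE]
    rownames(result) <- NULL
    result
  }
}

# A toy "slow" function that records every batch of ids it is asked for.
calls <- list()
slow_square <- function(id) {
  calls[[length(calls) + 1]] <<- id
  data.frame(id = id, square = id^2)
}

cached_square <- cache_sketch(slow_square, key = "id")
first  <- cached_square(c(1, 2))  # both ids are computed
second <- cached_square(c(2, 3))  # only id 3 is computed; id 2 comes from the store
```

After the two calls, `calls` holds the batches that actually reached `slow_square`, so it can be inspected to confirm that id 2 was computed only once.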
## Not run:
# These examples assume you have a database connection object
# (as specified in the DBI package) in a local variable `con`.
# Imagine we have a function that returns a data.frame of information
# about IMDB titles through their API. It takes an integer vector of
# IDs and returns a data.frame with an "id" column, with one row for
# each title. For example, 111161 would correspond to
# http://www.imdb.com/title/tt0111161/, which is The Shawshank Redemption.
amazon_info <- function(id) {
  # Call external API.
}
# Sending HTTP requests to Amazon and waiting for the response is
# slow, so if we ask for some IDs that have
# already been computed in the past, it would be useful to not
# make additional HTTP requests for those records. For example,
# we may want to do some processing on all Amazon titles. However,
# new records are created each day. Instead of parsing all
# the historical records on each execution, we would like to only
# parse new records; old records would be retrieved from a database
# table that had the same column names as a typical output data.frame
# of the `amazon_info` function.
cached_amazon_info <- cachemeifyoucan::cache(amazon_info, key = 'id', con = con)
# By using the `cache` function, we are asking for the following:
# (1) If we call `cached_amazon_info` with a vector of integer IDs,
# take the subset of IDs that have already been returned from
# a previous call to `cached_amazon_info`. Retrieve the data.frame
# for these records from an underlying database table.
# (2) The remaining IDs (those we have never passed to `cached_amazon_info`)
# should be fed to the base `amazon_info` function as if we had
# called it with this subset. This will yield another data.frame that
# was computed using live HTTP requests.
# The `cached_amazon_info` function will return the union (rbind) of these
# two data sets as one single data set, as if we had called `amazon_info`
# by itself. It will also cache the second data set so another identical
# call to `cached_amazon_info` will not trigger any additional HTTP requests.
###
# Salts
###
# Imagine our `amazon_info` function is slightly more complicated:
# instead of always returning the same information about film titles,
# it has an additional parameter `type` that controls whether we
# want info about the filmography or about the reviews. The output
# of this function will still be a data.frame with an `id` column
# and one row for each title, but the other columns can be different
# now depending on the `type` parameter.
amazon_info2 <- function(id, type = 'filmography') {
  if (identical(type, 'filmography')) {
    amazon_info(id)
  } else {
    review_amazon_info(id) # Assume we have this other function
  }
}
# If we wish to cache `amazon_info2`, we need to use different underlying
# database tables depending on the given `type`. One table may have
# columns like `num_actors` or `film_length` and the other may have
# columns such as `num_reviews` and `avg_rating`.
cached_amazon_info2 <- cachemeifyoucan::cache(amazon_info2, key = 'id',
  salt = 'type', con = con)
# We have told the caching layer to use the `type` parameter as the "salt".
# This means different values of `type` will use different underlying
# database tables for caching. It is up to the user to construct a
# function like `amazon_info2` well so that it always returns a data.frame
# with exactly the same column names if the `type` parameter is held fixed.
# The salt should usually consist of a collection of parameters (typically
# only one, `type` as in this example) that have a small number of possible
# values; otherwise, many database tables would be created for different
# values of the salt. Consider the following example.
bad_amazon_filmography <- function(id, actor_id) {
  # Given a single actor_id and a vector of title IDs,
  # return information about that actor's role in the film.
}
bad_cached_amazon_filmography <-
  cachemeifyoucan::cache(bad_amazon_filmography, key = 'id',
    salt = 'actor_id', con = con)
# We will now be creating a separate table each time we call
# `bad_amazon_filmography` for a different actor!
###
# Prefixes
###
# It is very important to give the function you are caching a prefix:
# when it is stored in the database, its table name will be the prefix
# combined with some string derived from the values in the salt.
cached_review_amazon_info <- cachemeifyoucan::cache(review_amazon_info,
  key = 'id', con = con)
# Remember our `review_amazon_info` function from an earlier example?
# If we attempted to cache it without a prefix while also caching
# the vanilla `amazon_info` function, the same database table would be
# used for both functions! Since function representation in R is complex
# and there is no good way in general to determine whether two functions
# are identical, it is up to the user to determine a good prefix for
# their function (usually the function's name) so that it does not clash
# with other database tables.
cached_amazon_info <- cachemeifyoucan::cache(amazon_info,
  prefix = 'amazon_info', key = 'id', con = con)
cached_review_amazon_info <- cachemeifyoucan::cache(review_amazon_info,
  prefix = 'review_amazon_info', key = 'id', con = con)
# We will now use different database tables for these two functions.
###
# force.
###
# `force.` is a reserved argument for the to-be-cached function. If
# it is specified to be `TRUE`, the caching layer will forcibly
# repopulate the database tables for the given ids. The default value
# is `FALSE`.
cached_amazon_info <- cachemeifyoucan::cache(amazon_info,
  prefix = 'amazon_info', key = 'id', con = con)
cached_amazon_info(c(10, 20), force. = TRUE) # Will forcibly repopulate.
###
# Advanced features
###
# We can use multiple primary keys and salts.
grab_sql_table <- function(table_name, year, month, dbname = 'default') {
  # Imagine we have some function that given a table name
  # and a database name returns a data.frame with aggregate
  # information about records created in that table from a
  # given year and month (e.g., ensuring each table has a
  # created_at column). This function will return a data.frame
  # with one record for each year-month pair, with at least
  # the columns "year" and "month".
}
cached_sql_table <- cachemeifyoucan::cache(grab_sql_table,
  key = c('year', 'month'), salt = c('table_name', 'dbname'), con = con,
  prefix = 'sql_table')
# We would like to use a separate table to cache each combination of
# table_name and dbname. Note that the character vector passed into
# the `key` parameter has to exactly match the names of the formal
# arguments in the initial function, and must also match the names of
# columns in the returned data.frame. If these do not agree,
# you can wrap your function. For example, if the data.frame returned
# has 'mth' and 'yr' columns, you could instead cache the wrapper:
wrap_sql_table <- function(table_name, yr, mth, dbname = 'default') {
  grab_sql_table(table_name = table_name, year = yr, month = mth, dbname = dbname)
}
###
# Debugging option `cachemeifyoucan.debug`
###
# Sometimes it is useful to look at the underlying database tables for
# debugging purposes. However, the contents of the database are
# somewhat obfuscated. If you set the `cachemeifyoucan.debug` option to TRUE,
# then every time you execute a cached function you will see some additional
# metadata printed out, helping you navigate the database. An example output
# looks like this:
#   Using table name: amazon_data_c3204c0a47beb9238a787058d4f03834
#   Shard dimensions:
#   shard1_f8e8e2b41ac5c783d0954ce588f220fc: 45 rows * 308 columns
#   11 cached keys
#   5 uncached keys
## End(Not run)
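The Salts and Prefixes examples above say that the table name is the prefix combined with some string derived from the salt values, and the debug output shows a prefix followed by a 32-character hexadecimal string. The sketch below illustrates one way such a name could be derived. The helper `table_name_sketch` is hypothetical and the choice of an MD5 digest over a sorted `name=value` description is an assumption; the package's actual scheme is not described on this page and may differ.

```r
# Hypothetical helper (not part of cachemeifyoucan): derive a table name
# from a prefix and a named list of salt values.
table_name_sketch <- function(prefix, salt_values) {
  # Sort by argument name so the same salt always maps to the same table,
  # regardless of the order in which the arguments were supplied.
  salt_values <- salt_values[order(names(salt_values))]
  description <- paste(
    names(salt_values),
    vapply(salt_values, as.character, character(1)),
    sep = "=", collapse = ";"
  )
  # tools::md5sum() hashes files, so write the description to a temp file.
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(description, tmp)
  paste0(prefix, "_", unname(tools::md5sum(tmp)))
}

name_users <- table_name_sketch("sql_table",
  list(table_name = "users", dbname = "default"))
name_users_reordered <- table_name_sketch("sql_table",
  list(dbname = "default", table_name = "users"))
name_orders <- table_name_sketch("sql_table",
  list(table_name = "orders", dbname = "default"))
```

Under this assumption, every distinct combination of salt values yields a distinct table name, which is why a salt with many possible values (such as `actor_id` in the `bad_amazon_filmography` example) would lead to many tables, and why two functions cached with the same prefix and salt would share one.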