track-package: Overview of track package


Description

The track package sets up a link between R objects in memory and files on disk so that objects are automatically saved to files when they are changed. R objects in files are read in on demand and do not consume memory prior to being referenced. The track package also tracks times when objects are created and modified, and caches some basic characteristics of objects to allow for fast summaries of objects.

Each object is stored in a separate RData file using the standard format used by save(), so that objects can be manually picked out of or added to the track database if needed. The track database is a directory, usually named rdatadir, that contains an RData file for each object and several housekeeping files that are either plain text or RData files.
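The one-object-per-file layout described above can be illustrated with base R alone. This is a minimal sketch, not the package's own code: the directory name dbdir is a hypothetical stand-in for a tracking directory, and the sketch does not create or update any of the housekeeping files.

```r
# Sketch only: one object per file, in the format written by save().
# 'dbdir' is a hypothetical stand-in for a tracking directory.
dbdir <- file.path(tempdir(), "rdatadir_sketch")
dir.create(dbdir, showWarnings = FALSE)

x <- 123
y <- matrix(1:6, ncol = 2)

# Write each object to its own file
save(x, file = file.path(dbdir, "x.rda"))
save(y, file = file.path(dbdir, "y.rda"))

# Pick a single object back out without reading the others
e <- new.env()
loaded <- load(file.path(dbdir, "y.rda"), envir = e)  # returns the object names
loaded_y <- get("y", envir = e)
```

Loading into a scratch environment rather than the workspace avoids overwriting a variable of the same name.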

Tracking works by replacing a tracked variable with an active binding, which, when accessed, looks up information in an associated 'tracking environment' and reads or writes the corresponding RData file and/or gets or assigns the variable in the tracking environment. In the default mode of operation, R variables that are accessed are stored in memory for the duration of the top-level task (i.e., one expression evaluated from the prompt). A callback that is called each time a top-level task completes does three major things:

The track package also provides a self-contained incremental history saving function that writes the most recent command to the file .Rincr_history at the end of each top-level task, along with a time stamp that does not appear in the interactive history. The standard history functionality (savehistory/loadhistory) in R writes the history only at the end of the session. Thus, if the R session terminates abnormally, history is lost.
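The active-binding mechanism described earlier in this section can be sketched with base R's makeActiveBinding(). This is only an illustration of the idea, assuming a single file-backed variable with no caching; the names store, backing_file, and bind_to_file are hypothetical and are not part of the track package.

```r
# Sketch only: a file-backed variable built on makeActiveBinding().
# 'store', 'backing_file', and 'bind_to_file' are hypothetical names.
store <- new.env()   # stands in for the environment being tracked
backing_file <- file.path(tempdir(), "v.rda")

bind_to_file <- function(name, file, env) {
  makeActiveBinding(name, function(value) {
    if (missing(value)) {
      # Read access: load the object from disk on demand
      tmp <- new.env()
      load(file, envir = tmp)
      get("value", envir = tmp)
    } else {
      # Write access: save the new value straight to disk
      save(value, file = file)
      invisible(value)
    }
  }, env)
}

bind_to_file("v", backing_file, store)
store$v <- 1:3   # triggers the write branch
store$v          # triggers the read branch
```

Nothing is held in memory between accesses here, which roughly corresponds to running without a cache; the package itself layers caching and bookkeeping on top of this idea.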

Details

There are four main reasons to use the track package:

There is an option to control whether tracked objects are cached in memory as well as being stored on disk. By default, objects are cached in memory for the duration of a top-level task. To save time when working with collections of objects that will all fit in memory, turn on caching and turn off cache flushing with track.options(cache=TRUE, cachePolicy="none"), or start tracking with track.start(..., cache=TRUE, cachePolicy="none"). A possible future improvement is to allow conditional and/or more intelligent caching of objects. Some of the data that would be needed for this is already collected in the access counts and times recorded in the tracking summary.

Here is a brief example of tracking some variables in the global environment:

> library(track)
> # By default, track.start() uses/creates a db in the dir
> # 'rdatadir' in the current working directory; supply arg
> # dir= to change.
> track.start()
> x <- 123                  # Variable 'x' is now tracked
> y <- matrix(1:6, ncol=2)  # 'y' is assigned & tracked
> z1 <- list("a", "b", "c")
> z2 <- Sys.time()
> track.summary(size=FALSE) # See a summary of tracked vars
            class    mode extent length            modified TA TW
x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
> # (TA="total accesses", TW="total writes")
> ls(all=TRUE)
[1] "x"  "y"  "z1" "z2"
> track.stop(pos=1)              # Stop tracking
> ls(all=TRUE)
character(0)
>
> # Restart using the tracking dir -- the variables reappear
> track.start() # Start using the same tracking dir again ("rdatadir")
> ls(all=TRUE)
[1] "x"  "y"  "z1" "z2"
> track.summary(size=FALSE)
            class    mode extent length            modified TA TW
x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
> track.stop(pos=1)
>
> # the files in the tracking directory:
> list.files("rdatadir", all=TRUE)
[1] "."                    ".."
[3] "filemap.txt"          ".trackingSummary.rda"
[5] "x.rda"                "y.rda"
[7] "z1.rda"               "z2.rda"
>

There are several points to note:

List of basic functions and common calling patterns

For straightforward use of the track package, only a single call to track.start() need be made to start automatically tracking the global environment. If it is desired to save untrackable variables at the end of the session, track.stop() should be called before calling save.image() or q('yes'), because track.stop() will ensure that tracked variables are saved to disk and then remove them from the global environment, leaving save.image() to save only the untracked or untrackable variables. The basic functions used in automatic tracking are as follows:

For the non-automatic mode, four other functions cover the majority of common usage:

Complete list of functions and common calling patterns

The track package provides many additional functions for controlling how tracking is performed (e.g., whether or not tracked variables are cached in memory), examining the state of tracking (showing which variables are tracked, untracked, orphaned, masked, etc.), and repairing tracking environments and databases that have become inconsistent or incomplete (this may result from resource limitations, e.g., being unable to write a save file due to lack of disk space, or from manual tinkering, e.g., dropping a new save file into a tracking directory).

The functions that can be used to set up and take down tracking are:

Functions for tracking and stopping tracking variables:

Functions for getting status of tracking and summaries of variables:

The remaining functions allow the user to more closely manage variable tracking, but are less likely to be of use to new users.

Functions for getting status of tracking and summaries of variables:

Functions for managing tracking and tracked variables:

Functions used internally as part of auto-tracking (generally not called by the user when auto-tracking is running):

Lower-level functions for managing tracking and tracked variables (generally not called by the user when auto-tracking is running):

Functions for recovering from errors (caused by bugs or by multiple sessions updating bookkeeping data):

Design and internals of tracking:

Note

Some special kinds of objects don't work properly if referenced as active bindings and/or stored in a save file. One example is RODBC connections. To make it easy to work with such objects, two ways of excluding variables from automatic tracking are provided: the autoTrackExcludePattern option (a vector of regular expressions: variables whose names match one of these will not be tracked); and the autoTrackExcludeClass option (a vector of class names: variables whose class matches one of these will not be tracked). New values can be added to these options as follows:

track.options(autoTrackExcludePattern="regexp")
track.options(autoTrackExcludeClass="classname")
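To make the two exclusion rules concrete, here is a rough base-R sketch of how a name-pattern check and a class check might be combined. The helper is_excluded is hypothetical and is not part of the track package; the exact matching semantics the package uses are assumed here (regular-expression match on the name, membership test on the class vector).

```r
# Hypothetical helper, not part of track: would a variable be excluded
# under pattern-based or class-based rules?
is_excluded <- function(name, value,
                        patterns = character(0), classes = character(0)) {
  # TRUE if the name matches any of the regular expressions
  by_name <- any(vapply(patterns, function(p) grepl(p, name), logical(1)))
  # TRUE if any class of the value is in the excluded set
  by_class <- any(class(value) %in% classes)
  by_name || by_class
}

# A fake object carrying the class "RODBC", for illustration only
fake_conn <- structure(list(), class = "RODBC")

is_excluded(".tmp1", 1, patterns = "^\\.tmp")                 # name rule
is_excluded("con", fake_conn, classes = "RODBC")              # class rule
is_excluded("x", 1, patterns = "^\\.tmp", classes = "RODBC")  # neither rule
```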

Tracking is not particularly suitable for storing objects that contain environments, because those environments and their contents will be written out in full in the saved file (in a live R session, environments are references, and there can be multiple references to one environment). Functions are among the most common objects that contain environments, and those environments can contain data objects local to the function (e.g., see the examples in the R FAQ in the section "Lexical scoping" under "What are the differences between R and S?" http://cran.r-project.org/doc/FAQ/R-FAQ.html#Lexical-scoping). Additionally, the results of some modeling functions contain environments, e.g., an lm object holds several references to the environment that contains the data. When an lm object is saved, the environment containing the data, and all the other objects in that environment, can be saved in the same file.

To work with large data objects and modeling functions, consider first creating a tracking database that contains the data objects. Then, in a different R session (which can be running at the same time), use track.attach to attach the db of data objects at pos=2 on the search list. When working in this way, the data objects will only be kept in memory when being used, and modeling functions that record environments in their results can be used successfully (though beware of modeling functions that store large amounts of data in their results). Alternatively, use modeling functions that do not store references to environments. The utility function show.envs from the track package will show what environments are referenced within an object (though it is not guaranteed to find them all).
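The cost of saving objects that reference environments can be seen with base R's serialize(), which writes the same kind of content that save() does. The following is a small sketch under that assumption; all function and variable names in it are illustrative.

```r
# A closure carries its enclosing environment when serialized.
make_counter <- function() {
  big <- numeric(1e5)        # about 800 KB, local to this call
  function() length(big)
}
counter <- make_counter()

# Same function body, but with the enclosing environment replaced by the
# global environment, which is serialized as a reference only
stripped <- counter
environment(stripped) <- globalenv()

size_closure  <- length(serialize(counter, NULL))
size_stripped <- length(serialize(stripped, NULL))

# An lm fit created inside a function references that function's frame,
# so unrelated objects in the frame are written out with it.
fit_in_function <- function() {
  unused <- numeric(1e5)     # never used by the model
  d <- data.frame(x = 1:10, y = (1:10) + rep(c(0.1, -0.1), 5))
  lm(y ~ x, data = d)
}
size_fit <- length(serialize(fit_in_function(), NULL))
```

Comparing size_closure with size_stripped shows how much of the serialized size comes from the captured environment rather than from the function itself.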

Author(s)

Tony Plate <tplate@acm.org>

References

Roger D. Peng. Interacting with data using the filehash package. R News, 6(4):19-24, October 2006. http://cran.r-project.org/doc/Rnews

David E. Brahm. Delayed data packages. R News, 2(3):11-12, December 2002. http://cran.r-project.org/doc/Rnews

See Also

Design of the track package.

Potential future features of the track package.

Documentation for save and load (in 'base' package).

Documentation for makeActiveBinding and related functions (in 'base' package).

Inspiration from the packages g.data and filehash.

Description of the facility (addTaskCallback) for adding a callback function that is called at the end of each top-level task (each time R returns to the prompt after completing a command): http://developer.r-project.org/TaskHandlers.pdf.
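As a rough illustration of this facility, the sketch below defines a task callback that appends each completed top-level expression, with a time stamp, to a file. It is a hypothetical stand-in for incremental history saving: the file name, record format, and function name log_task are assumptions and are not those used by the track package.

```r
# Hypothetical sketch: log each top-level expression with a time stamp.
history_file <- file.path(tempdir(), "incr_history_sketch.txt")

log_task <- function(expr, value, ok, visible) {
  stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  cat("## ", stamp, "\n",
      paste(deparse(expr), collapse = "\n"), "\n",
      file = history_file, append = TRUE, sep = "")
  TRUE   # returning TRUE keeps the callback registered
}

# Register only in interactive sessions; undo with removeTaskCallback(id)
if (interactive()) id <- addTaskCallback(log_task)

# The callback can also be exercised directly:
log_task(quote(x <- 123), 123, TRUE, FALSE)
```

Because the file is appended to after every task, its contents survive an abnormal end of the session.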

Examples

##############################################################
# Warning: running this example will cause variables currently
# in the R global environment to be written to .RData files
# in a tracking database on the filesystem under R's temporary
# directory, and will cause the variables to be removed from
# the R global environment.
# It is recommended to run this example with a fresh R session
# with no important variables in the global environment.
##############################################################

library(track)
# Start tracking the global environment using a tmp directory
# Default tracking db dir is 'rdatadir' in the current working
# directory; omit the dir= argument to use this.
if (!is.element('tmpenv', search())) attach(new.env(), name='tmpenv', pos=2)
assign('tmpdatadir', pos='tmpenv', value=file.path(tempdir(), 'rdatadir1'))
track.start(dir=tmpdatadir)
a <- 1
b <- 2
ls()
track.status()
track.summary()
track.info()
track.stop()
# Variables are now gone because default action of track.stop()
# is to not read all tracked variables into memory (this could
# exhaust memory and/or be very time consuming).
ls()
# bring them back
track.start(dir=tmpdatadir)
ls()
# It is possible to keep tracked vars after stopping tracking:
track.stop(keepVars=TRUE)
ls()

track documentation built on May 2, 2019, 10:22 a.m.