The track package sets up a link between R objects in memory and files on disk so that objects are automatically saved to files when they are changed. R objects in files are read in on demand and do not consume memory prior to being referenced. The track package also tracks times when objects are created and modified, and caches some basic characteristics of objects to allow for fast summaries of objects.
Each object is stored in a separate RData file using the standard format used by save(), so that objects can be manually picked out of or added to the track database if needed. The track database is a directory, usually named rdatadir, that contains an RData file for each object and several housekeeping files that are either plain text or RData files.
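The one-object-per-file layout can be illustrated with base R alone. The following is a minimal sketch, not the track package's own code: the directory name and the file name x.rda are made up for illustration (the package keeps its own mapping from variable names to files), and only save() and load() are used.

```r
# Sketch: one RData file per object, written with save() and read with load().
# 'db_dir' and the file name are hypothetical, chosen only for this example.
db_dir <- file.path(tempdir(), "rdatadir_sketch")
dir.create(db_dir, showWarnings = FALSE)

x <- matrix(1:6, ncol = 2)
save(list = "x", file = file.path(db_dir, "x.rda"))

# Pick the object back out by hand, into a scratch environment so that
# nothing in the workspace is overwritten.
scratch <- new.env()
loaded_names <- load(file.path(db_dir, "x.rda"), envir = scratch)
restored <- get("x", envir = scratch)
```

Because each file holds a single object in the standard format, the same two calls are all that is needed to inspect or repair a file by hand.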
Tracking works by replacing a tracked variable with an activeBinding, which when accessed looks up information in an associated 'tracking environment' and reads or writes the corresponding RData file and/or gets or assigns the variable in the tracking environment. In the default mode of operation, R variables that are accessed are stored in memory for the duration of the top-level task (i.e., one expression evaluated from the prompt). A callback that is called each time a top-level task completes does three major things:
detects newly created or deleted variables, and adds them to or removes them from the tracking database as appropriate,
writes changed variables to the database, and
deletes cached objects from memory.
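The mechanism described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: track_sketch(), sync_sketch(), file_for(), the cache environment and the dirty vector are all hypothetical names, and a plain environment stands in for the global environment.

```r
# Sketch of the mechanism only; none of these names come from the track package.
db_dir <- file.path(tempdir(), "binding_sketch")
dir.create(db_dir, showWarnings = FALSE)
cache <- new.env()      # plays the role of the 'tracking environment'
dirty <- character(0)   # names of variables changed since the last sync
user_env <- new.env()   # stands in for the global environment

file_for <- function(name) file.path(db_dir, paste0(name, ".rda"))

track_sketch <- function(name, value) {
  assign(name, value, envir = cache)
  dirty <<- union(dirty, name)
  makeActiveBinding(name, function(v) {
    if (!missing(v)) {                       # assignment: cache it, mark as changed
      assign(name, v, envir = cache)
      dirty <<- union(dirty, name)
      return(invisible(v))
    }
    if (!exists(name, envir = cache, inherits = FALSE)) {
      load(file_for(name), envir = cache)    # read from file on demand
    }
    get(name, envir = cache, inherits = FALSE)
  }, user_env)
}

# What a top-level task callback would do: write changes, then drop cached copies.
sync_sketch <- function() {
  for (name in dirty) save(list = name, file = file_for(name), envir = cache)
  dirty <<- character(0)
  rm(list = ls(cache, all.names = TRUE), envir = cache)
  invisible(TRUE)
}

track_sketch("x", 123)
sync_sketch()                                # x now lives only on disk
value_after_reload <- get("x", envir = user_env)
```

In an interactive session, a function that calls sync_sketch() and returns TRUE could be registered with addTaskCallback() so that the sync runs after each top-level command; the sketch calls it by hand to keep the example self-contained.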
The track package also provides a self-contained incremental history saving function that writes the most recent command to the file .Rincr_history at the end of each top-level task, along with a time stamp that does not appear in the interactive history. The standard history functionality (savehistory/loadhistory) in R writes the history only at the end of the session. Thus, if the R session terminates abnormally, history is lost.
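The idea of incremental history saving can be sketched with a task callback from base R. This is not the package's own function: log_command(), the file name and the layout of the time stamp line are assumptions made for illustration.

```r
# Sketch: append each completed top-level command to a file with a time stamp.
# The file name and the line format are assumptions made for this example.
history_file <- file.path(tempdir(), "incr_history_sketch.txt")

log_command <- function(expr, value, ok, visible) {
  stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  cat("##------ ", stamp, " ------##\n",
      paste(deparse(expr), collapse = "\n"), "\n",
      file = history_file, sep = "", append = TRUE)
  TRUE   # returning TRUE keeps the callback registered
}

# Register so that R calls log_command() after every top-level task.
callback_id <- addTaskCallback(log_command, name = "incr_history_sketch")
```

Because the file is appended to after every command, its contents survive an abnormal end of the session. The callback can be taken down again with removeTaskCallback("incr_history_sketch").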
There are four main reasons to use the track package:
conveniently handle many moderately-large objects that would collectively exhaust memory or be inconvenient to manage in files by manually using save(), load(), and/or save.image().
have changed or newly created objects saved automatically at the end of each top-level command, which ensures objects are preserved in the event of accidental or abnormal termination of the R session, and which also makes startup and saving much faster when many large objects in the global environment must be loaded or saved.
keep track of creation and modification times on objects
get fast summaries of basic characteristics of objects - class, size, dimension, etc.
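As an illustration of the kind of basic characteristics that can be cached, the sketch below computes a one-row summary for an object using base R only. The function summarize_object() is hypothetical and its columns merely imitate the summary output shown later on this page; the package's actual summary contains more fields (such as times and access counts).

```r
# Sketch: compute a one-row summary of an object's basic characteristics.
# 'summarize_object' is a hypothetical helper, not part of the track package.
summarize_object <- function(obj) {
  d <- dim(obj)
  if (!is.null(d)) {
    extent <- paste0("[", paste(d, collapse = "x"), "]")
  } else if (is.list(obj)) {
    extent <- paste0("[[", length(obj), "]]")
  } else {
    extent <- paste0("[", length(obj), "]")
  }
  data.frame(class = paste(class(obj), collapse = ","),
             mode = mode(obj),
             extent = extent,
             length = length(obj),
             size = as.numeric(object.size(obj)),
             stringsAsFactors = FALSE)
}

summary_rows <- rbind(summarize_object(123),
                      summarize_object(matrix(1:6, ncol = 2)),
                      summarize_object(list("a", "b", "c")))
rownames(summary_rows) <- c("x", "y", "z1")
```

Storing such rows alongside the data means a summary can be printed without reading any of the objects back into memory.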
There is an option to control whether tracked objects are cached in memory as well as being stored on disk. By default, objects are cached in memory for the duration of a top-level task. To save time when working with collections of objects that will all fit in memory, turn on caching and turn off cache-flushing with track.options(cache=TRUE, cachePolicy="none"), or start tracking with track.start(..., cache=TRUE, cachePolicy="none"). A possible future improvement is to allow conditional and/or more intelligent caching of objects. Some of the data that would be needed for this is already collected in the access counts and times that are recorded in the tracking summary.
Here is a brief example of tracking some variables in the global environment:
```r
> library(track)
> # By default, track.start() uses/creates a db in the dir
> # 'rdatadir' in the current working directory; supply arg
> # dir= to change.
> track.start()
> x <- 123 # Variable 'x' is now tracked
> y <- matrix(1:6, ncol=2) # 'y' is assigned & tracked
> z1 <- list("a", "b", "c")
> z2 <- Sys.time()
> track.summary(size=F) # See a summary of tracked vars
            class    mode extent length            modified TA TW
x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
> # (TA="total accesses", TW="total writes")
> ls(all=TRUE)
[1] "x" "y" "z1" "z2"
> track.stop(pos=1) # Stop tracking
> ls(all=TRUE)
character(0)
>
> # Restart using the tracking dir -- the variables reappear
> track.start() # Start using the same tracking dir again ("rdatadir")
> ls(all=TRUE)
[1] "x" "y" "z1" "z2"
> track.summary(size=F)
            class    mode extent length            modified TA TW
x         numeric numeric    [1]      1 2007-09-07 08:50:58  0  1
y          matrix numeric  [3x2]      6 2007-09-07 08:50:58  0  1
z1           list    list  [[3]]      3 2007-09-07 08:50:58  0  1
z2 POSIXt,POSIXct numeric    [1]      1 2007-09-07 08:50:58  0  1
> track.stop(pos=1)
>
> # the files in the tracking directory:
> list.files("rdatadir", all=TRUE)
[1] "."                    ".."
[3] "filemap.txt"          ".trackingSummary.rda"
[5] "x.rda"                "y.rda"
[7] "z1.rda"               "z2.rda"
>
```
There are several points to note:
The global environment is the default environment for tracking – it is possible to track variables in other environments, but that environment must be supplied as an argument to the track functions.
By default, newly created or deleted variables are automatically added to or removed from the tracking database. This feature can be disabled by supplying auto=FALSE to track.start(), or by calling track.auto(FALSE).
When tracking is stopped, all tracked variables are saved on disk and are no longer accessible until tracking is started again.
The objects are each stored in their own file in the tracking dir, in the format used by save()/load() (RData files).
For straightforward use of the track package, only a single call to track.start() need be made to start automatically tracking the global environment. If it is desired to save untrackable variables at the end of the session, track.stop() should be called before calling save.image() or q('yes'), because track.stop() will ensure that tracked variables are saved to disk and then remove them from the global environment, leaving save.image() to save only the untracked or untrackable variables. The basic functions used in automatic tracking are as follows:
track.start(dir=...): start tracking the global environment, with files saved in dir (the default is rdatadir).
track.summary(): print a summary of the basic characteristics of tracked variables: name, class, extent, and creation, modification and access times.
track.info(): print a summary of which tracking databases are currently active.
track.stop(pos=, all=): stop tracking. Any unsaved tracked variables are saved to disk. Unless keepVars=TRUE is supplied, all tracked variables become unavailable until tracking starts again.
track.attach(dir=..., pos=): attach an existing tracking database to the search list at the specified position. The default when attaching at positions other than 1 is to use readonly mode, but in non-readonly mode, changes to variables in the attached environment will be automatically saved to the database.
track.rescan(pos=): rescan a tracking directory that was attached by track.attach() at a position other than 1, and that is preferably readonly.
For the non-automatic mode, four other functions cover the majority of common usage:
track.start(dir=..., auto=TRUE/FALSE): start tracking the global environment, with files saved in dir
track(x): start tracking x - x in the global environment is replaced by an active binding and x is saved in its corresponding file in the tracking directory and, if caching is on, in the tracking environment
track(x <- value): start tracking x
track(list=c('x', 'y')): start tracking specified variables
track(all=TRUE): start tracking all untracked variables in the global environment
untrack(x): stop tracking variable x - the R object x is put back as an ordinary object in the global environment
untrack(all=TRUE): stop tracking all variables in the global environment (but tracking is still set up)
untrack(list=...): stop tracking specified variables
track.remove(x): completely remove all traces of x from the global environment, tracking environment and tracking directory. Note that if variable x in the global environment is tracked, remove(x) will make x an "orphaned" variable: remove(x) will just remove the active binding from the global environment, and leave x in the tracked environment and on file, and x will reappear after restarting tracking.
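Why remove(x) leaves an orphan behind can be seen in a small base R sketch. The two environments below are hypothetical stand-ins for the global environment and the tracking environment; no track functions are used.

```r
# Sketch of why remove(x) orphans a tracked variable. All names are hypothetical.
tracking_env <- new.env()   # where the real value lives
user_env <- new.env()       # stands in for the global environment

assign("x", 42, envir = tracking_env)
makeActiveBinding("x", function() get("x", envir = tracking_env), user_env)

rm("x", envir = user_env)   # removes only the active binding

binding_gone <- !exists("x", envir = user_env, inherits = FALSE)
value_remains <- exists("x", envir = tracking_env, inherits = FALSE)
```

The binding is only a pointer to the stored value, so removing it does not touch the value itself. This is the situation that track.remove(x) avoids by cleaning up all three places.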
The track package provides many additional functions for controlling how tracking is performed (e.g., whether or not tracked variables are cached in memory), examining the state of tracking (showing which variables are tracked, untracked, orphaned, masked, etc.) and repairing tracking environments and databases that have become inconsistent or incomplete (this may result from resource limitations, e.g., being unable to write a save file due to lack of disk space, or from manual tinkering, e.g., dropping a new save file into a tracking directory).
The functions that can be used to set up and take down tracking are:
track.start(dir=...): start tracking, using the supplied directory
track.stop(): stop tracking (any unsaved tracked variables are saved to disk and all tracked variables become unavailable until tracking starts again)
track.dir(): return the path of the tracking directory
Functions for tracking and stopping tracking variables:
track(x), track(var <- value), track(list=...), track(all=TRUE): start tracking variable(s)
track.load(file=...): load some objects from an RData file into the tracked environment
untrack(x, keep.in.db=FALSE), untrack(list=...), untrack(all=TRUE): stop tracking variable(s) - the value is left in place, and optionally, it is also left in the database
Functions for getting status of tracking and summaries of variables:
track.summary(): return a data frame containing a summary of the basic characteristics of tracked variables: name, class, extent, and creation, modification and access times.
track.status(): return a data frame containing information about the tracking status of variables: whether they are saved to disk or not, etc.
track.info(): return a data frame containing information about which tracking dbs are currently active.
track.mem(): return a data frame containing information about number of objects and memory usage in tracking dbs.
env.is.tracked(): tell whether an environment is currently tracked
The remaining functions allow the user to more closely manage variable tracking, but are less likely to be of use to new users.
Functions for getting status of tracking and summaries of variables:
tracked(): return the names of tracked variables
untracked(): return the names of untracked variables
untrackable(): return the names of variables that cannot be tracked
track.unsaved(): return the names of variables whose copy on file is out-of-date
track.orphaned(): return the names of once-tracked variables that have lost their active binding (should not happen)
track.masked(): return the names of once-tracked variables whose active binding has been overwritten by an ordinary variable (should not happen)
Functions for managing tracking and tracked variables:
track.options(): examine and set options to control tracking
track.load(): load variables from a saved RData file into the tracking session
track.copy() and track.move(): copy or move variables from one tracking db to another
track.rename(): rename variables in a tracking db
track.rescan(): reload variable values from disk (can forget all cached vars and remove tracked vars that no longer exist)
track.auto(): turn auto-tracking on or off
Functions used internally as part of auto-tracking (generally not called by the user when auto-tracking is running):
track.sync(): write unsaved variables to disk, and remove excess objects from memory. This function can be called by the user if they wish to remove excess objects from memory during a memory-intensive top-level command.
track.sync.callback(): calls track.sync(); this function is installed as a task callback (to be called each time a top-level task is completed, see taskCallback). This function is not exported by the track package.
track.auto.monitor(): an additional callback that monitors the existence of the callback to track.sync.callback and re-instates it if missing. This function is not exported by the track package.
Lower-level functions for managing tracking and tracked variables (generally not called by the user when auto-tracking is running):
track.remove(): completely remove all traces of a tracked variable
track.save(): write unsaved variables to disk
track.flush(): write unsaved variables to disk, and remove from memory
track.forget(): delete cached versions without saving to file (the object saved in the file will be retrieved next time the variable is accessed)
Functions for recovering from errors (caused by bugs or by multiple sessions updating bookkeeping data):
track.rebuild(): rebuild tracking information from objects in memory or on disk
Design and internals of tracking:
See help page track.design
Some special kinds of objects don't work properly if referenced as active bindings and/or stored in a save file. One example is RODBC connections. To make it easy to work with such objects, two ways of excluding variables from automatic tracking are provided: the autoTrackExcludePattern option (a vector of regular expressions: variables whose names match one of these will not be tracked); and the autoTrackExcludeClass option (a vector of class names: variables whose class matches one of these will not be tracked). New values can be added to these options as follows:
```r
track.options(autoTrackExcludePattern="regexp")
track.options(autoTrackExcludeClass="classname")
```
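How such exclusion rules might be applied can be sketched in base R. This is only an illustration of the idea: should_exclude() and its arguments are hypothetical, the example patterns are made up, and the package's actual matching rules may differ in detail.

```r
# Sketch: decide whether a variable should be skipped, given name patterns
# and class names. 'should_exclude' is hypothetical, not the package's API.
should_exclude <- function(name, value,
                           patterns = character(0), classes = character(0)) {
  name_hit <- any(vapply(patterns, function(p) grepl(p, name), logical(1)))
  class_hit <- inherits(value, classes)
  name_hit || class_hit
}

exclude_patterns <- c("^\\.", "^tmp")        # example patterns only
exclude_classes <- c("RODBC")
conn <- structure(list(), class = "RODBC")   # stand-in for a real connection

skip_conn <- should_exclude("conn", conn, exclude_patterns, exclude_classes)
skip_tmp  <- should_exclude("tmp1", 1, exclude_patterns, exclude_classes)
skip_x    <- should_exclude("x", 1, exclude_patterns, exclude_classes)
```

Here the connection stand-in is skipped because of its class, tmp1 because of its name, and x is left to be tracked.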
Tracking is not particularly suitable for storing objects that contain environments, because those environments and their contents will be fully written out in the saved file (in a live R session, environments are references, and there can be multiple references to one environment). Functions are one of the most common objects that contain environments, which can contain data objects local to the function (e.g., see the examples in the R FAQ in the section "Lexical scoping" under "What are the differences between R and S?" http://cran.r-project.org/doc/FAQ/R-FAQ.html#Lexical-scoping). Additionally, the results of some modeling functions contain environments, e.g., lm holds several references to the environment that contains the data. When an lm object is save'ed, the environment containing the data, and all the other objects in that environment, can be saved in the same file. To work with large data objects and modeling functions, consider first creating a tracking database that contains the data objects. Then, in a different R session (which can be running at the same time), use track.attach to attach the db of data objects at pos=2 on the search list. When working in this way, the data objects will only be kept in memory when being used, and modeling functions that record environments in their results can be successfully used (though beware of modeling functions that store large amounts of data in their results). Alternatively, use modeling functions that do not store references to environments. The utility function show.envs from the track package will show what environments are referenced within an object (though it is not guaranteed to find them all).
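The cost of saving an object that carries an environment can be demonstrated with base R. The sketch below is illustrative only: make_counter() is a made-up function, and serialize() is used as a stand-in for what save() writes, since both store a closure together with its enclosing environment.

```r
# Sketch: a closure drags its enclosing environment into its serialized form.
make_counter <- function() {
  big <- numeric(1e5)      # large object local to the function's environment
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()

size_with_env <- length(serialize(counter, NULL))

# The same function with the global environment as its environment.
# The global environment is stored as a reference, so 'big' is not included.
bare <- counter
environment(bare) <- globalenv()
size_without_env <- length(serialize(bare, NULL))
```

The first size includes the roughly 800 KB vector big even though the function never uses it, which is the effect to watch for when tracking functions or model fits.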
Tony Plate <tplate@acm.org>
Roger D. Peng. Interacting with data using the filehash package. R News, 6(4):19-24, October 2006. http://cran.r-project.org/doc/Rnews
David E. Brahm. Delayed data packages. R News, 2(3):11-12, December 2002. http://cran.r-project.org/doc/Rnews
Design of the track package.
Potential future features of the track package.
Documentation for save and load (in 'base' package).
Documentation for makeActiveBinding and related functions (in 'base' package).
Inspiration from the packages g.data and filehash.
Description of the facility (addTaskCallback) for adding a callback function that is called at the end of each top-level task (each time R returns to the prompt after completing a command): http://developer.r-project.org/TaskHandlers.pdf.
```r
##############################################################
# Warning: running this example will cause variables currently
# in the R global environment to be written to .RData files
# in a tracking database on the filesystem under R's temporary
# directory, and will cause the variables to be removed from
# the R global environment.
# It is recommended to run this example with a fresh R session
# with no important variables in the global environment.
##############################################################
library(track)
# Start tracking the global environment using a tmp directory
# Default tracking db dir is 'rdatadir' in the current working
# directory; omit the dir= argument to use this.
if (!is.element('tmpenv', search())) attach(new.env(), name='tmpenv', pos=2)
assign('tmpdatadir', pos='tmpenv', value=file.path(tempdir(), 'rdatadir1'))
track.start(dir=tmpdatadir)
a <- 1
b <- 2
ls()
track.status()
track.summary()
track.info()
track.stop()
# Variables are now gone because default action of track.stop()
# is to not read all tracked variables into memory (this could
# exhaust memory and/or be very time consuming).
ls()
# bring them back
track.start(dir=tmpdatadir)
ls()
# It is possible to keep tracked vars after stopping tracking:
track.stop(keepVars=TRUE)
ls()
```