data.table inherits from
data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by
A[B] syntax in R where
A is a matrix and
B is a 2-column matrix. Since a
data.table is a
data.frame, it is compatible with R functions and packages that only accept
The 10 minute quick start guide to
data.table may be a good place to start:
vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation:
vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on datatable-help or the Stack Overflow data.table tag. To report a bug please type:
Please check the homepage for up to the minute news.
Tip: one of the quickest ways to learn the features is to type
example(data.table) and study the output at the prompt.
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)
## S3 method for class 'data.table'
x[i, j, by, keyby, with = TRUE,
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
mult = "all",
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
which = FALSE,
verbose = getOption("datatable.verbose"), # default: FALSE
allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE
drop = NULL,
on = NULL # join without setting keys, new feature from v1.9.6+
data.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
... is a
TRUE will retain the rownames of that object in a column named
Character vector of one or more column names which is passed to
setkey. It may be a single comma separated string such as
key="x,y,z", or a vector of names such as
Integer, logical or character vector, single column numeric
matrix, expression of column names,
integer and logical vectors work the same way they do in
[.data.frame. Other than
NAs in logical
i are treated as
FALSE and a single
NA logical is not recycled to match the number of rows, as it is in
character is matched to the first column of
expression is evaluated within the frame of the
data.table (i.e. it sees column names as if they are variables) and can evaluate to any of the other types.
i is a
x must have a key.
i is joined to
x's key and the rows in
x that match are returned. An equi-join is performed between each column in
i to each column in
x's key; i.e., column 1 of
i is matched to the 1st column of
x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If
i has fewer columns than
x's key then not all of x's key columns will be joined to (a common use case) and many rows of
x will (ordinarily) match to each row of
i has more columns than
x's key, the columns of
i not involved in the join are included in the result. If
i also has a key, it is
i's key columns that are used to match to
x's key columns (column 1 of
i's key is joined to column 1 of
x's key, column 2 of
i's key to column 2 of
x's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of
x's key are joined to in order, either from column 1 onwards of
i is unkeyed, or from column 1 onwards of
i's key. In code, the number of join columns is determined by
min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)).
All types of 'i' may be prefixed with
!. This signals a not-join or not-select should be performed. Throughout
data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present. See examples.
i is an expression of column names that evaluates to
list, a join is performed. We call this a self join.
i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.
A single column name, single expresson of column names,
list() of expressions of column names, an expression or function call that evaluates to
data.table which are
lists, too), or (when
with=FALSE) a vector of names or positions to select.
j is evaluated within the frame of the
data.table; i.e., it sees column names as if they are variables. Use
j=list(...) to return multiple columns and/or expressions of columns. A single column or single expression returns that type, usually a vector. See the examples.
A single unquoted column name, a
list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.
list() of expressions is evaluated within the frame of the
data.table (i.e. it sees column names as if they are variables). The
data.table is then grouped by the
j is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups.
j=list(...) may be omitted when there is just one expression, for convenience, typically a single expression such as
by contains the first
n columns of
x's key, we call this a keyed by. In a keyed by the groups appear contiguously in RAM and memory is copied in bulk internally, for extra speed. Otherwise, we call it an ad hoc by. Ad hoc by is still many times faster than
tapply, for example, but just not as fast as keyed by when datasets are very large, in particular when the size of each group is large. Not to be confused with
keyby= defined below.
i is a
j for the groups in 'DT' that each row in
i joins to. That is, you can join (in
i) and aggregate (in
j) simultaneously. We call this grouping by each i. It is particularly memory efficient as you don't have to materialise the join result only to aggregate later. Please refer to this Stackoverflow answer for a more detailed explanation until we roll out vignettes.
Advanced: When grouping, symbols
.GRP may be used in the
j expression, defined as follows.
.SD is a
data.table containing the Subset of
x's Data for each group, excluding any columns used in
.BY is a
list containing a length 1 vector for each item in
by. This can be useful when
by is not known in advance. The
by variables are also available to
j directly by name; useful for example for titles of graphs if
j is a plot command, or to branch with
if() depending on the value of a group variable.
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by
.N is the number of rows in
x matched to, for each row of
i, regardless of whether
0. It is renamed to
N (no dot) in the result (otherwise a column called
".N" could conflict with the
.N variable, see FAQ 4.6 for more details and example), unless it is explicity named; e.g.,
.I is an integer vector equal to
seq_len(nrow(x)). While grouping, it holds for each item in the group, it's row location in
x. This is useful to subset in
DT[, .I[which.max(somecol)], by=grp].
.GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.
.GRP are read only. Their bindings are locked and attempting to assign to them will generate an error. If you wish to manipulate
.SD before returning it, take a
copy(.SD) first (see FAQ 4.5). Using
:= in the
.SD is reserved for future use as a (tortuously) flexible way to update
DT by reference by group (even when groups are not contiguous in an ad hoc by).
Advanced: In the
X[Y,j] form of grouping, the
j expression sees variables in
X first, then
Y. We call this join inherited scope. If the variable is not in
Y then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.
An ad-hoc-by or keyed-by (just as
by= defined above) but with an additional
setkey() run on the
by columns of the result afterwards, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted. Out loud we read
keyby= as by= then setkey. Otherwise, 'by=' can be relied on to return the groups in the order they appear in the data.
j is evaluated within the frame of
x; column names can be used as variables. When
j is a character vector of column names or a numeric vector of column positions to select, and the value returned is always a
with=FALSE is often useful in
data.table to select columns dynamically.
match. When a row in
i has no match to
nomatch=NA (default) means
NA is returned for
x's non-join columns for that row of
0 means no rows will be returned for that row of
i. The default value (used when
nomatch is not supplied) can be changed from
When multiple rows in
x match to the row in
mult controls which are returned:
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If
i's row matches to all but the last
x join column, and its value in the last
i join column falls in a gap (including after the last observation in
x for that group), then the prevailing value in
x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in
x's key, the last key column is a date (or time, or datetime) and all the columns of
x's key are joined to. A common idiom is to select a contemporaneous regular time series (
dts) across a set of identifiers (
DT has a 2-column key (id,date) and
CJ stands for cross join.
roll is a positive number, this limits how far values are carried forward.
roll=TRUE is equivalent to
roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use
-Inf for unlimited roll back.
"nearest", the nearest value is joined to.
A logical vector length 2 (a single logical is recycled). When rolling forward (e.g.
roll=TRUE) if a value is past the last observation within each group defined by the join columns,
rollends=TRUE will roll the last value forwards.
rollends=TRUE will roll the first value backwards if the value is before it. If
rollends=FALSE the value of
i must fall in a gap in
x but not after the end or before the beginning of the data, for that group defined by all but the last join column. When
roll is a finite number, that limit is also applied when rolling the ends.
TRUE returns the row numbers of
i matches to.
NA returns the row numbers of
i that have no match in
x. By default
FALSE and the rows in
x that match are returned.
Advanced. Specifies the columns of
x included in
.SD. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g.,
TRUE turns on status and information messages to the console. Turn this on by default using
options(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.
FALSE prevents joins that would result in more than
nrow(x)+nrow(i) rows. This is usually caused by duplicate values in
i's join columns, each of which join to the same group in 'x' over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in
data.table: where every row in
i joins to every row in
nrow(x)*nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.
Never used by
data.table. Do not use. It needs to be here because
data.table inherits from
A named atomic vector of column names indicating which columns in
i should be joined to which columns in
data.table builds on base R functionality to reduce 2 types of time :
programming time (easier to write, read, debug and maintain)
It combines database like operations such as
by and provides similar joins that
merge provides but faster. This is achieved by using R's column based ordered in-memory
eval within the environment of a
[.data.table mechanism to condense the features, and compiled C to make certain operations fast.
The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.
[.data.frame, compound queries can be concatenated on one line; e.g.,
# sum(v) by colA then return the 6 largest which are under 300
j expression does not have to return data; e.g.,
# produce a set of plots (likely to pdf) returning no data
Z) can be joined in many ways; e.g.,
data.table is a
list of vectors, just like a
data.frame. However :
it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
it has enhanced functionality in
[.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
list is a
data.table columns may be type
list. Columns of type
list can contain mixed types. Each item in a column of type
list may be different lengths. This is true of
Several methods are provided for
merge and others.
check.names are supplied they must be written in full because R does not allow partial argument names after '
...'. For example,
data.table(DF,keep=TRUE) will create a
TRUE and this is correct behaviour;
data.table(DF,keep.rownames=TRUE) was intended.
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see NEWS for 1.6.3, and
IDateTime instead. IDateTime has methods to convert to and from POSIXlt.
data.table homepage: http://datatable.r-forge.r-project.org/
User reviews: http://crantastic.org/packages/data-table
## Not run:
example(data.table) # to run these examples at the prompt
## End(Not run)
DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a) # TRUE
is.list(DF) # TRUE
is.list(DT) # TRUE
is.data.frame(DT) # TRUE
DT # 2nd row
DT[,v] # v column (as vector)
DT[,list(v)] # v column (as data.table)
DT[2:3,sum(v)] # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")] # just for j's side effect
DT[c(FALSE,TRUE)] # even rows (usual recycling)
DT[,2,with=FALSE] # 2nd column
colNum = 2
DT[,colNum,with=FALSE] # same
setkey(DT,x) # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x") # same (v in setkeyv stands for vector)
setkeyv(DT,v) # same
# key(DT)<-"x" # copies whole table, please use set* functions instead
DT["a"] # binary search (fast)
DT[x=="a"] # same; i.e. binary search (fast)
DT[,sum(v),by=x] # keyed by
DT[,sum(v),by=key(DT)] # same
DT[,sum(v),by=y] # ad hoc by
DT["a",sum(v)] # j for one group
DT[c("a","b"),sum(v),by=.EACHI] # j for two groups
X = data.table(c("b","c"),foo=c(4,2))
DT[X] # join
DT[X,sum(v),by=.EACHI] # join and eval j for each row in i
DT[X,mult="first"] # first row of each group
DT[X,mult="last"] # last row of each group
DT[X,sum(v)*foo,by=.EACHI] # join inherited scope
setkey(DT,x,y) # 2-column key
setkeyv(DT,c("x","y")) # same
DT["a"] # join to 1st column of key
DT[.("a")] # same, .() is an alias for list()
DT[list("a")] # same
DT[.("a",3)] # join to 2 columns
DT[.("a",3:6)] # join 4 rows (2 missing)
DT[.("a",3:6),nomatch=0] # remove missing
DT[.("a",3:6),roll=TRUE] # rolling join (locf)
DT[,sum(v),by=.(y%%2)] # by expression
DT[,.SD,by=x] # 2nd row of each group
DT[,tail(.SD,2),by=x] # last 2 rows of each group
DT[,lapply(.SD,sum),by=x] # apply through columns by group
by=.(x,y%%2)] # by 2 expressions
DT[,sum(v),x][V1<20] # compound query
DT[,sum(v),x][order(-V1)] # ordering results
print(DT[,z:=42L]) # add new column by reference
print(DT[,z:=NULL]) # remove column by reference
print(DT["a",v:=42L]) # subassign to existing v column by reference
print(DT["b",v2:=84L]) # subassign to new column by reference (NA padded)
DT[,m:=mean(v),by=x] # add new column by reference by group
# NB: postfix  is shortcut to print()
DT[,.SD[which.min(v)],by=x] # nested query by group
DT[!.("a")] # not join
DT[!"a"] # same
DT[!2:4] # all rows other than 2:4
DT[x!="b" | y!=3] # not yet optimized, currently vector scans
DT[!.("b",3)] # same result but much faster
# new feature: 'on' argument, from v1.9.6+
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), mul=6:8)
DT1[DT2, on=c(x="x")] # join on columns 'x'
DT1[DT2, on="x"] # same as above
DT1[DT2, .(sum(a) * mul), by=.EACHI, on="x"] # using by=.EACHI
DT1[DT2, .(sum(a) * mul), by=.EACHI, on="x", nomatch=0L] # using by=.EACHI
# Follow r-help posting guide, support is here (*not* r-help) :
## Not run:
test.data.table() # over 1,300 low level tests
update.packages() # keep up to date
## End(Not run)