data.table: Enhanced data.frame
In data.table: Extension of data.frame

Description Usage Arguments Details Note References See Also Examples

data.table inherits from data.frame. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame.

The 10 minute quick start guide to data.table may be a good place to start: vignette("datatable-intro"). Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation: vignette("datatable-faq"). If you have read and searched these documents and the help page below, please feel free to ask questions on datatable-help or the Stack Overflow data.table tag. To report a bug please type: bug.report(package="data.table").

Please check the homepage for up to the minute news.

Tip: one of the quickest ways to learn the features is to type example(data.table) and study the output at the prompt.

data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)

## S3 method for class 'data.table'
x[i, j, by, keyby, with = TRUE,
  nomatch = getOption("datatable.nomatch"),                   # default: NA_integer_
  mult = "all",
  roll = FALSE,
  rollends = if (roll=="nearest") c(TRUE,TRUE)
             else if (roll>=0) c(FALSE,TRUE)
             else c(TRUE,FALSE),
  which = FALSE,
  .SDcols,
  verbose = getOption("datatable.verbose"),                   # default: FALSE
  allow.cartesian = getOption("datatable.allow.cartesian"),   # default: FALSE
  drop = NULL,
  rolltolast = FALSE   # deprecated
  ]

`...`	Just as `...` in `data.frame`. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
`keep.rownames`	If `...` is a `matrix` or `data.frame`, `TRUE` will retain the rownames of that object in a column named `rn`.
`check.names`	Just as `check.names` in `data.frame`.
`key`	Character vector of one or more column names which is passed to `setkey`. It may be a single comma separated string such as `key="x,y,z"`, or a vector of names such as `key=c("x","y","z")`.
`x`	A `data.table`.
`i`	Integer, logical or character vector, expression of column names, `list` or `data.table`. integer and logical vectors work the same way they do in `[.data.frame`. Other than `NA`s in logical `i` are treated as `FALSE` and a single `NA` logical is not recycled to match the number of rows, as it is in `[.data.frame`. character is matched to the first column of `x`'s key. expression is evaluated within the frame of the `data.table` (i.e. it sees column names as if they are variables) and can evaluate to any of the other types. When `i` is a `data.table`, `x` must have a key. `i` is joined to `x` using `x`'s key and the rows in `x` that match are returned. An equi-join is performed between each column in `i` to each column in `x`'s key; i.e., column 1 of `i` is matched to the 1st column of `x`'s key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If `i` has fewer columns than `x`'s key then not all of x's key columns will be joined to (a common use case) and many rows of `x` will (ordinarily) match to each row of `i`. If `i` has more columns than `x`'s key, the columns of `i` not involved in the join are included in the result. If `i` also has a key, it is `i`'s key columns that are used to match to `x`'s key columns (column 1 of `i`'s key is joined to column 1 of `x`'s key, column 2 of `i`'s key to column 2 of `x`'s key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of `x`'s key are joined to in order, either from column 1 onwards of `i` when `i` is unkeyed, or from column 1 onwards of `i`'s key. In code, the number of join columns is determined by `min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i))`. All types of 'i' may be prefixed with `!`. This signals a not-join or not-select should be performed. Throughout `data.table` documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present. See examples. Advanced: When `i` is an expression of column names that evaluates to `data.table` or `list`, a join is performed. We call this a self join. Advanced: When `i` is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.
`j`	A single column name, single expresson of column names, `list()` of expressions of column names, an expression or function call that evaluates to `list` (including `data.frame` and `data.table` which are `list`s, too), or (when `with=FALSE`) a vector of names or positions to select. `j` is evaluated within the frame of the `data.table`; i.e., it sees column names as if they are variables. Use `j=list(...)` to return multiple columns and/or expressions of columns. A single column or single expression returns that type, usually a vector. See the examples.
`by`	A single unquoted column name, a `list()` of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names. The `list()` of expressions is evaluated within the frame of the `data.table` (i.e. it sees column names as if they are variables). The `data.table` is then grouped by the `by` and `j` is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. `j=list(...)` may be omitted when there is just one expression, for convenience, typically a single expression such as `sum(colB)`; e.g., `DT[,sum(colB),by=colA]`. When `by` contains the first `n` columns of `x`'s key, we call this a keyed by. In a keyed by the groups appear contiguously in RAM and memory is copied in bulk internally, for extra speed. Otherwise, we call it an ad hoc by. Ad hoc by is still many times faster than `tapply`, for example, but just not as fast as keyed by when datasets are very large, in particular when the size of each group is large. Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in `i`. When `i` is a `data.table`, `DT[i,j]` evaluates `j` for each row of `i`. We call this by without by or grouping by i. Hence, the self join `DT[data.table(unique(colA)),j]` is identical to `DT[,j,by=colA]`. Advanced: When grouping by `by` or by `i`, symbols `.SD`, `.BY`, `.N`, `.I` and `.GRP` may be used in the `j` expression, defined as follows. `.SD` is a `data.table` containing the Subset of `x`'s Data for each group, excluding any columns used in `by` (or `keyby`). `.BY` is a `list` containing a length 1 vector for each item in `by`. This can be useful when `by` is not known in advance. The `by` variables are also available to `j` directly by name; useful for example for titles of graphs if `j` is a plot command, or to branch with `if()` depending on the value of a group variable. `.N` is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by `i`, `.N` is the number of rows in `x` matched to, for each row of `i`, regardless of whether `nomatch` is `NA` or `0`. It is renamed to `N` (no dot) in the result (otherwise a column called `".N"` could conflict with the `.N` variable, see FAQ 4.6 for more details and example), unless it is explicity named; e.g., `DT[,list(total=.N),by=a]`. `.I` is an integer vector length `.N` holding the row locations in `x` for this group. This is useful to subset in j; e.g. DT[,.I[which.max(somecol)],by=grp]. `.GRP` is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc. `.SD`, `.BY`, `.N`, `.I` and `.GRP` are read only. Their bindings are locked and attempting to assign to them will generate an error. If you wish to manipulate `.SD` before returning it, take a `copy(.SD)` first (see FAQ 4.5). Using `:=` in the `j` of `.SD` is reserved for future use as a (tortuously) flexible way to update `DT` by reference by group (even when groups are not contiguous in an ad hoc by). Advanced: In the `X[Y,j]` form of grouping, the `j` expression sees variables in `X` first, then `Y`. We call this join inherited scope. If the variable is not in `X` or `Y` then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.
`keyby`	An ad hoc by just as `by` but with an additional `setkey()` on the `by` columns of the result, for convenience. Not to be confused with a keyed by as defined above.
`with`	By default `with=TRUE` and `j` is evaluated within the frame of `x`. The column names can be used as variables. When `with=FALSE`, `j` is a vector of names or positions to select.
`nomatch`	Same as `nomatch` in `match`. When a row in `i` has no match to `x`'s key, `nomatch=NA` (default) means `NA` is returned for `x`'s non-join columns for that row of `i`. `0` means no rows will be returned for that row of `i`. The default value (used when `nomatch` is not supplied) can be changed from `NA` to `0` using `options(datatable.nomatch=0)`.
`mult`	When multiple rows in `x` match to the row in `i`, `mult` controls which are returned: `"all"` (default), `"first"` or `"last"`.
`roll`	Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If `roll=TRUE` and `i`'s row matches to all but the last `x` join column, and its value in the last `i` join column falls in a gap (including after the last observation in `x` for that group), then the prevailing value in `x` is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in `x`'s key, the last key column is a date (or time, or datetime) and all the columns of `x`'s key are joined to. A common idiom is to select a contemporaneous regular time series (`dts`) across a set of identifiers (`ids`): `DT[CJ(ids,dts),roll=TRUE]` where `DT` has a 2-column key (id,date) and `CJ` stands for cross join. When `roll` is a positive number, this limits how far values are carried forward. `roll=TRUE` is equivalent to `roll=+Inf`. When `roll` is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use `-Inf` for unlimited roll back. When `roll` is `"nearest"`, the nearest value is joined to.
`rollends`	A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. `roll=TRUE`) if a value is past the last observation within each group defined by the join columns, `rollends[2]=TRUE` will roll the last value forwards. `rollends[1]=TRUE` will roll the first value backwards if the value is before it. If `rollends=FALSE` the value of `i` must fall in a gap in `x` but not after the end or before the beginning of the data, for that group defined by all but the last join column. When `roll` is a finite number, that limit is also applied when rolling the ends.
`which`	`TRUE` returns the row numbers of `x` that `i` matches to. `NA` returns the row numbers of `i` that have no match in `x`. By default `FALSE` and the rows in `x` that match are returned.
`.SDcols`	Advanced. Specifies the columns of `x` included in `.SD`. May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g., `DT[,lapply(.SD,sum),by="x,y",.SDcols=301:350]`.
`verbose`	`TRUE` turns on status and information messages to the console. Turn this on by default using `options(datatable.verbose=TRUE)`. The quantity and types of verbosity may be expanded in future.
`allow.cartesian`	`FALSE` prevents joins that would result in more than `max(nrow(x),nrow(i))` rows. This is usually caused by duplicate values in `i`'s join columns, each of which join to the same group in 'x' over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in `data.table`: where every row in `i` joins to every row in `x` (a `nrow(x)*nrow(i)` row result). 'cartesian' is just meant in a 'large multiplicative' sense.
`drop`	Never used by `data.table`. Do not use. It needs to be here because `data.table` inherits from `data.frame`. See `vignette("datatable-faq")`.
`rolltolast`	Deprecated. Setting `rolltolast=TRUE` is converted to `roll=TRUE;rollends=FALSE` for backwards compatibility.

data.table builds on base R functionality to reduce 2 types of time :

programming time (easier to write, read, debug and maintain)
compute time

It combines database like operations such as subset, with and by and provides similar joins that merge provides but faster. This is achieved by using R's column based ordered in-memory data.frame structure, eval within the environment of a list, the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.

The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.

As with [.data.frame, compound queries can be concatenated on one line; e.g.,

1 2	DT[,sum(v),by=colA][V1<300][tail(order(V1))] # sum(v) by colA then return the 6 largest which are under 300

The j expression does not have to return data; e.g.,

1 2	DT[,plot(colB,colC),by=colA] # produce a set of plots (likely to pdf) returning no data

Multiple data.tables (e.g. X, Y and Z) can be joined in many ways; e.g.,

    X[Y][Z]
    X[Z][Y]
    X[Y[Z]]
    X[Z[Y]]

A data.table is a list of vectors, just like a data.frame. However :

it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
it has enhanced functionality in [.data.table for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.

Since a list is a vector, data.table columns may be type list. Columns of type list can contain mixed types. Each item in a column of type list may be different lengths. This is true of data.frame, too.

Several methods are provided for data.table, including is.na, na.omit, t, rbind, cbind, merge and others.

If keep.rownames or check.names are supplied they must be written in full because R does not allow partial argument names after '...'. For example, data.table(DF,keep=TRUE) will create a column called "keep" containing TRUE and this is correct behaviour; data.table(DF,keep.rownames=TRUE) was intended.

POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see NEWS for 1.6.3, and IDateTime instead. IDateTime has methods to convert to and from POSIXlt.

data.table homepage: http://datatable.r-forge.r-project.org/
User reviews: http://crantastic.org/packages/data-table
http://en.wikipedia.org/wiki/Binary_search
http://en.wikipedia.org/wiki/Radix_sort

data.frame, [.data.frame, as.data.table, setkey, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime, unique.data.table, copy, :=, alloc.col, truelength, rbindlist

## Not run: 
example(data.table)  # to run these examples at the prompt
## End(Not run)

DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a)      # TRUE
is.list(DF)                # TRUE
is.list(DT)                # TRUE

is.data.frame(DT)          # TRUE

tables()

DT[2]                      # 2nd row
DT[,v]                     # v column (as vector)
DT[,list(v)]               # v column (as data.table)
DT[2:3,sum(v)]             # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")]        # just for j's side effect
DT[c(FALSE,TRUE)]          # even rows (usual recycling)

DT[,2,with=FALSE]          # 2nd column
colNum = 2
DT[,colNum,with=FALSE]     # same

setkey(DT,x)               # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x")            # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v)              # same
# key(DT)<-"x"             # copies whole table, please use set* functions instead

DT["a"]                    # binary search (fast)
DT[x=="a"]                 # vector scan (slow)

DT[,sum(v),by=x]           # keyed by
DT[,sum(v),by=key(DT)]     # same
DT[,sum(v),by=y]           # ad hoc by

DT["a",sum(v)]             # j for one group
DT[c("a","b"),sum(v)]      # j for two groups

X = data.table(c("b","c"),foo=c(4,2))
X

DT[X]                      # join
DT[X,sum(v)]               # join and eval j for each row in i
DT[X,mult="first"]         # first row of each group
DT[X,mult="last"]          # last row of each group
DT[X,sum(v)*foo]           # join inherited scope

setkey(DT,x,y)             # 2-column key
setkeyv(DT,c("x","y"))     # same

DT["a"]                    # join to 1st column of key
DT[J("a")]                 # same. J() stands for Join, an alias for list()
DT[list("a")]              # same
DT[.("a")]                 # same. In the style of package plyr.
DT[J("a",3)]               # join to 2 columns
DT[.("a",3)]               # same
DT[J("a",3:6)]             # join 4 rows (2 missing)
DT[J("a",3:6),nomatch=0]   # remove missing
DT[J("a",3:6),roll=TRUE]   # rolling join (locf)

DT[,sum(v),by=list(y%%2)]  # by expression
DT[,.SD[2],by=x]           # 2nd row of each group
DT[,tail(.SD,2),by=x]      # last 2 rows of each group
DT[,lapply(.SD,sum),by=x]  # apply through columns by group

DT[,list(MySum=sum(v),
         MyMin=min(v),
         MyMax=max(v)),
    by=list(x,y%%2)]       # by 2 expressions

DT[,sum(v),x][V1<20]       # compound query
DT[,sum(v),x][order(-V1)]  # ordering results

print(DT[,z:=42L])         # add new column by reference
print(DT[,z:=NULL])        # remove column by reference
print(DT["a",v:=42L])      # subassign to existing v column by reference
print(DT["b",v2:=84L])     # subassign to new column by reference (NA padded)

DT[,m:=mean(v),by=x][]     # add new column by reference by group
                           # NB: postfix [] is shortcut to print()

DT[,.SD[which.min(v)],by=x][]  # nested query by group

DT[!J("a")]                # not join
DT[!"a"]                   # same
DT[!2:4]                   # all rows other than 2:4
DT[x!="b" | y!=3]          # multiple vector scanning approach, slow
DT[!J("b",3)]              # same result but much faster


# Follow r-help posting guide, support is here (*not* r-help) :
# datatable-help@lists.r-forge.r-project.org
# or
# http://stackoverflow.com/questions/tagged/data.table

## Not run: 
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")

test.data.table()          # over 1,200 low level tests

update.packages()          # keep up to date

## End(Not run)