In data.table parlance, all
set* functions change their input
by reference. That is, no copy is made at all, other than temporary
working memory, which is as large as one column. The only other
operator that modifies input by reference is
:=. Check out the
See Also section below for other set* functions.
setkey() sorts a
data.table and marks it as sorted (with an
attribute sorted). The sorted columns are the key. The key can be any
columns in any order. The columns are always sorted in ascending order. The table
is changed by reference and is therefore very memory efficient.
key() returns the
data.table's key if it exists, and
NULL if none exists.
haskey() returns a logical
TRUE or FALSE depending on whether the
data.table has a key (or not).
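As a small illustration of the three functions described above, the sketch below keys a toy table and inspects the result. The table DT and its columns id and grp are made up for this example.

```r
library(data.table)

# Hypothetical toy table, not taken from the documentation
DT <- data.table(id = c(3L, 1L, 2L), grp = c("b", "a", "b"))

haskey(DT)           # FALSE: no key has been set yet
key(DT)              # NULL

setkey(DT, grp, id)  # sorts DT by grp, then id, by reference

haskey(DT)           # TRUE
key(DT)              # "grp" "id"
DT$id                # 1 2 3 (rows are now a/1, b/2, b/3)
```

Note that the column order is unchanged; only the row order and the key are affected.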
The columns to sort by. Do not quote the column names. If
A character vector (only) of column names.
Output status and information.
TRUE changes the order of the data in RAM. FALSE adds a secondary key a.k.a. index.
setkey reorders (or sorts) the rows of a data.table by the columns
provided. In versions after 1.8.10, for
integer columns, a modified version
of base's counting sort is implemented, which allows negative values as well. It
is extremely fast, but is limited by the range of integer values being <= 1e5. If
that fails, it falls back to a (fast) 4-pass radix sort for integers, implemented
based on Pierre Terdiman's and Michael Herf's code (see links below). Similarly,
a very fast 6-pass radix order for columns of type
double is also implemented.
This gives a speed-up of about 5-8x compared to versions <= 1.8.10 for setkey
and all internal
sort operations. Fast radix sorting is also
The sort is stable; i.e., the order of ties (if any) is preserved, in both
the older and the newer versions. In versions
<= 1.8.10, for columns of type integer,
the sort is attempted with the very fast
"radix" method in
sort.list. If that fails, the sort reverts to the default method in
order. For character vectors, data.table
takes advantage of R's internal global string cache and implements a very efficient
order, which is also exported.
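To make the counting sort idea above more concrete, here is a minimal base R sketch of a stable counting-sort ordering for integers that also handles negative values by shifting them. It is only an illustration of the technique, not data.table's C implementation; the function name counting_order and its max_range argument (mirroring the 1e5 range limit mentioned above) are hypothetical.

```r
# Illustrative counting-sort order for integer vectors (hypothetical helper).
# Returns NULL when the value range exceeds max_range, signalling that a
# caller would fall back to another method (e.g. a radix sort).
counting_order <- function(x, max_range = 1e5) {
  stopifnot(is.integer(x), !anyNA(x))
  if (length(x) == 0L) return(integer(0))
  lo  <- min(x)
  rng <- as.numeric(max(x)) - lo + 1    # as.numeric avoids integer overflow
  if (rng > max_range) return(NULL)     # range too wide for counting
  offset <- x - lo + 1L                 # shift so the smallest value maps to 1
  counts <- tabulate(offset, nbins = rng)
  starts <- cumsum(counts) - counts     # elements that precede each bucket
  ord <- integer(length(x))
  for (i in seq_along(x)) {             # forward pass keeps ties in input order
    b <- offset[i]
    starts[b] <- starts[b] + 1L
    ord[starts[b]] <- i
  }
  ord
}

x <- c(3L, -2L, 3L, 0L, -2L)
counting_order(x)                       # same permutation as order(x)
counting_order(c(0L, 200000L))          # NULL: range exceeds 1e5
```

The cost is proportional to the number of elements plus the value range, which is why a range limit is needed before the method stops paying off.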
In v1.7.8, the
key<- syntax was deprecated. The
<- method copies
the whole table and we know of no way to avoid that copy without a change in
R itself. Please use the
set* functions instead, which make no copy at all.
setkey accepts unquoted column names for convenience, whilst
setkeyv accepts one vector of column names.
The problem (for
data.table) with the copy by
key<- (other than
being slower) is that R doesn't maintain the over allocated truelength, but it
looks as though it has. Adding a column by reference using
:= after a
key<- was therefore a memory overwrite and eventually a segfault; the
over allocated memory wasn't really there after the copy. data.tables
now have an attribute
.internal.selfref to catch and warn about such copies.
This attribute has been implemented in a way that is friendly with
For the same reason, please use the other
set* functions which modify
objects by reference, rather than using the
<- operator which results
in copying the entire object.
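One way to see that set* functions work by reference is to compare the table's memory address before and after calling them. The sketch below uses data.table's address() helper for this; the table and its column names are made up for the example.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(a = 3:1, b = c("z", "y", "x"))

before <- address(DT)          # memory address of the table object

setnames(DT, "a", "n")         # rename a column by reference
setkey(DT, b)                  # sort and key by reference

identical(address(DT), before) # TRUE: still the same object, no copy made
names(DT)                      # "n" "b"
key(DT)                        # "b"
```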
It isn't good programming practice, in general, to use column numbers rather
than names. This is why
setkey and setkeyv only accept column names.
If you use column numbers then bugs (possibly silent) can more easily creep into
your code as time progresses if changes are made elsewhere in your code; e.g., if
you add, remove or reorder columns in a few months' time, a
setkey by column
number will then refer to a different column, possibly returning incorrect results
with no warning. (A similar concept exists in SQL, where
"select * from ..."
is considered poor programming style when a robust, maintainable system is
required.) If you really wish to use column numbers, it's possible but
deliberately a little harder; e.g., setkeyv(DT, names(DT)[1:2]).
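The hazard described above can be reproduced in a few lines. In this sketch (toy table and column names are hypothetical), the same positional expression keys on a different column once the column order has changed elsewhere in the code.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(id = c(2L, 1L), value = c("x", "y"))

# Keying by position: column 1 happens to be "id" today
setkeyv(DT, names(DT)[1])
key(DT)                            # "id"

# Later, the columns are reordered somewhere else in the code base
setcolorder(DT, c("value", "id"))

# The same positional call now silently keys on a different column
setkeyv(DT, names(DT)[1])
key(DT)                            # "value"
```

Writing setkey(DT, id) instead would keep referring to the intended column regardless of its position.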
The input is modified by reference, and returned (invisibly) so it can be used
in compound statements; e.g.,
setkey(DT,a)[J("foo")]. If you require a
copy, take a copy first (using
DT2 = copy(DT)). copy() may also
sometimes be useful before
:= is used to subassign to a column by reference.
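The following sketch shows why an explicit copy() can matter before a subassignment with :=. The table and variable names are invented for the example.

```r
library(data.table)

DT  <- data.table(a = 1:3)   # hypothetical toy table
DT2 <- DT                    # a second name for the same table, no copy
DT3 <- copy(DT)              # an independent copy

DT[2L, a := 99L]             # subassign to column a by reference

DT2$a                        # 1 99 3: DT2 sees the change
DT3$a                        # 1 2 3: the explicit copy is unaffected
```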
Despite its name, the "radix" method of sort.list
invokes a counting sort in R, not a radix sort. See do_radixsort in
src/main/sort.c. A counting sort, however, is particularly suitable for
sorting integers and factors, and we like it. In fact we like it so much that
data.table contains a counting sort algorithm for character vectors
using R's internal global string cache. This is particularly fast for character
vectors containing many duplicates, such as grouped data in a key column. This
means that character is often preferred to factor. Factors are still fully
supported, in particular ordered factors (where the levels are not in
alphabetical order).
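The practical difference between a character key and an ordered factor key can be sketched as follows. The example assumes that a factor key column is sorted by its level order rather than by its labels; the tables and the size column are hypothetical.

```r
library(data.table)

sizes <- c("small", "large", "medium")

# Same values, once as character and once as an ordered factor
by_chr <- data.table(size = sizes)
by_fct <- data.table(
  size = factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
)

setkey(by_chr, size)
setkey(by_fct, size)

by_chr$size                # "large" "medium" "small": sorted as strings
as.character(by_fct$size)  # "small" "medium" "large": sorted by level order
```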
# Type 'example(setkey)' to run these at prompt and browse output

DT = data.table(A=5:1,B=letters[5:1])
DT # before
setkey(DT,B)          # re-orders table and marks it sorted.
DT # after
tables()              # KEY column reports the key'd columns
key(DT)
keycols = c("A","B")
setkeyv(DT,keycols)   # rather than key(DT)<-keycols (which copies entire table)

DT = data.table(A=5:1,B=letters[5:1])
DT2 = DT                        # does not copy
setkey(DT2,B)                   # does not copy-on-write to DT2
identical(DT,DT2)               # TRUE. DT and DT2 are two names for the same keyed table

DT = data.table(A=5:1,B=letters[5:1])
DT2 = copy(DT)                  # explicit copy() needed to copy a data.table
setkey(DT2,B)                   # now just changes DT2
identical(DT,DT2)               # FALSE. DT and DT2 are now different tables