Description Usage Arguments Details Value Note See Also Examples
chmatch
returns a vector of the positions of (first) matches of its first argument in its second. Both arguments must be character vectors.
%chin%
is like %in%
, but for character vectors.
1 2 3 4 |
x |
character vector: the values to be matched, or the values to be ordered or grouped |
table |
character vector: the values to be matched against. |
nomatch |
the value to be returned in the case when no match is found. Note that it is coerced to integer. |
Fast versions of match
, %in%
and order
, optimised for character vectors. chgroup
groups together duplicated values but retains the group order (according the first appearance order of each group), efficiently. They have been primarily developed for internal use by data.table, but have been exposed since that seemed appropriate.
Strings are already cached internally by R (CHARSXP
) and that is utilised by these functions. No hash table is built or cached, so the first call is the same speed as subsequent calls. Essentially, a counting sort (similar to base::sort.list(x,method="radix")
, see setkey
) is implemented using the (almost) unused truelength of CHARSXP as the counter. Where R has used truelength of CHARSXP (where a character value is shared by a variable name), the non zero truelengths are stored first and reinstated afterwards. Each of the ch*
functions implements a variation on this theme. Remember that internally in R, length of a CHARSXP is the nchar of the string and DATAPTR is the string itself.
Methods that do build and cache a hash table (such as the fastmatch package) are much faster on subsequent calls (almost instant) but a little slower on the first. Therefore chmatch
may be particularly suitable for ephemeral vectors (such as local variables in functions) or tasks that are only done once. Much depends on the length of x
and table
, how many unique strings each contains, and whether the position of the first match is all that is required.
It may be possible to speed up fastmatch's hash table build time by using the technique in data.table
, and we have suggested this to its author. If successful, fastmatch would then be fastest in all cases.
As match
and %in%
. chorder
and chgroup
return an integer index vector.
The name charmatch
was taken by charmatch
, hence chmatch
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | # Please type 'example(chmatch)' to run this and see timings on your machine
# N is set small here (1e5) because CRAN runs all examples and tests every night, to catch
# any problems early as R itself changes and other packages run.
# The comments here apply when N has been changed to 1e7.
N = 1e5
u = as.character(as.hexmode(1:10000))
y = sample(u,N,replace=TRUE)
x = sample(u)
# With N=1e7 ...
system.time(a <- match(x,y)) # 4.8s
system.time(b <- chmatch(x,y)) # 0.9s Faster than 1st fmatch
identical(a,b)
if (fastmatchloaded<-suppressWarnings(require(fastmatch))) {
print(system.time(c <- fmatch(x,y))) # 2.1s Builds and caches hash
print(system.time(c <- fmatch(x,y))) # 0.00s Uses hash
identical(a,c)
}
system.time(a <- x %in% y) # 4.8s
system.time(b <- x %chin% y) # 0.9s
identical(a,b)
if (fastmatchloaded) {
match <- fmatch # fmatch is drop in replacement
print(system.time(c <- match(x,y))) # 0.00s
print(system.time(c <- x %in% y)) # 4.8s %in% still prefers base::match
# Anyone know how to get %in% to use fmatch (without masking %in% too)?
rm(match)
identical(a,c)
}
# Different example with more unique strings ...
u = as.character(as.hexmode(1:(N/10)))
y = sample(u,N,replace=TRUE)
x = sample(u,N,replace=TRUE)
system.time(a <- match(x,y)) # 34.0s
system.time(b <- chmatch(x,y)) # 6.4s
identical(a,b)
if (fastmatchloaded) {
print(system.time(c <- fmatch(x,y))) # 7.9s
print(system.time(c <- fmatch(x,y))) # 4.0s
identical(a,c)
}
|
user system elapsed
0.004 0.000 0.003
user system elapsed
0.000 0.004 0.002
[1] TRUE
Loading required package: fastmatch
user system elapsed
0.001 0.000 0.003
user system elapsed
0 0 0
[1] TRUE
user system elapsed
0.003 0.000 0.003
user system elapsed
0.002 0.000 0.001
[1] TRUE
user system elapsed
0 0 0
user system elapsed
0.002 0.000 0.003
[1] TRUE
user system elapsed
0.006 0.000 0.006
user system elapsed
0.003 0.000 0.003
[1] TRUE
user system elapsed
0.003 0.000 0.003
user system elapsed
0.000 0.000 0.001
[1] TRUE
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.