index: Indexing vectors

Description Usage Arguments Details Value Theory Benefits Limitations Open questions Note Author(s) References See Also Examples

View source: R/index.R

Description

Indexing allows to extract small fractions of large vectors very quickly.

Usage

1
2
3
4
5
6
 index(x, uni = NULL, batch = NULL, verbose = FALSE)
 ## S3 method for class 'index'
c(...)
 indexAddTree(obj, batch = NULL)
 indexDelTree(obj)
 indexAutobatch(n, batch = 64)

Arguments

x

a vector (currently only character)

uni

set to TRUE or FALSE to save checking for duplicates (default NULL for checking)

batch

set to preferred batch size to influence RAM – speed – trade-off (default NULL for indexAutobatch)

verbose

set to TRUE to report timing of index creation (default FALSE)

n

number of elements to build a tree on

obj

an object of class ‘index’

...

objects of class ‘index’

Details

Basic functions creating, modifying and removing indices
index create an index object and build a tree
c.index concatenate index objects (currently not tuned)
indexAutobatch calculates the optimal index resolution (size of a leaf) given number of elements and desired batch size (default 64)
indexAddTree build or rebuild tree (Calloc)
indexDelTree remove tree (Free)
rm removing index object removes tree at next garbage collection gc
Index information information, printing and retrieving all values
indexNodes returns number of tree nodes
indexBytes returns indes size in bytes
print.index prints index info and optionally tree
str.index removes class and calls NextMethod("str")
length.index identical to length of original vector
names.index currently forbidden
names<-.index currently forbidden
Basic access information, printing and retrieving all values
sort.index identical to sort of original vector, but much faster
order.index identical to order of original vector, but much faster
[.index index[] returns original vector, subsetting works identical to susetting original vector [ (via NextMethod)
[<-.index currently forbidden
is.na.index identical to is.na of original vector, but much faster
Low level search low level search functions return positions in index order (sorted)
indexFind finding exact values in index
indexFindlike finding values in index that begin like search value (character indices only)
Mid level search mid level search functions return positions in index order (sorted)
indexFindInterval finding a sequence of exact or approximate values
indexMatch finding positions of vector of search values
High level search high level search functions return positions in original order (unsorted)
indexEQ index EQual value
indexNE index NotEqual value
indexLT index LowerThan value
indexGT index GreaterThan value
indexLE index LowerEqual value
indexGE index GreaterEqual value
High level operators high level operators return logical vectors in original order (unsorted)
==.index index EQual value
!=.index index NotEqual value
<.index index LowerThan value
>.index index GreaterThan value
<=.index index LowerEqual value
>=.index index GreaterEqual value
match high level matching
match.index use this to match in an index instead of match
match.rindex use this to match in an rindex instead of match

Value

An object of class ‘index’, i.e. a list with components

val

sorted vector of values

pos

integer vector of original positions of sorted values

n

number of values (including NAs)

nNA

number of NAs

batch

resolution of tree

uni

logical flagging the index as unique (TRUE) or non-unique (FALSE)

tree

external pointer to C-tree

Theory

Linear search has O(n) time complexity. By using an index tree, search time can be reduced to O(log(n)). An index that can be used with any R-vector x of length n needs to store the original values together with the original positions, i.e. list(val=x, pos=order(x)), thus requires – strongly simplified – 2*n RAM. If we store this information in a binary tree, each value pair {val,pos} is stored in its own node together with two pointers, the memory requirements are – strongly simplified – 4*n RAM. The b-tree stores more than one value and two pointers in one node and thus minimizes the number of nodes that need to be read (from disk). However, used in RAM, b-trees increase the search time because they impurify logarithmic search across nodes with linear search within nodes. By contrast, the t-tree is optimized for RAM: it stores only two pointers but many sorted values within each node: for branching only min and max values need to be searched, linear search is only required at the final leaf node. However, realizing a t-tree within R's memory model requires additional overhead for the SEXPREC data structures. We avoid that by defining a static read only tree (and save implementation of insert and delete operations). The b*tree and t*tree versions of the mentioned indices reduce the size of the search nodes by storing the data itself in the leafnodes only. This leads to some redundant storage but speeds up search. Some implementations connect the leafnodes by extra pointers to speed up linear search. If we take this principle to the extreme, we can save these extra pointers and merge all leave nodes into one single big leaf. By doing so we loose the ability to update the index, but we gain a static read-only tree structure that supports very fast linear search as well as logarithmic search.

1
2
3
4
5
6
7
           .
         /   \
       .       .
     /   \   /   \   C-tree
   _________________  R-val
   _________________  R-pos
  

We implement this efficiently by storing the sorted values vector together with the order positions as standard R (SEXPREC) objects and add a pure C tree that is built from pointered struct nodes. The leaf nodes do use integer addressing instead of pointers to identify the associated part of the SEXPREC vector (pointers can't be used because the R garbage collector may move the vector). The topnode is linked into R using an external pointer. The tree itself can be removed explicitely from memory using indexDelTree. If the index object containing the external pointer is deleted, the tree will be freed from memory at the next garbage collection through its finalization method. If the index object does not contain a valid external pointer to the tree – e.g. when loading an index object from disk – the tree will be quickly transparently rebuild or can be build explicitely via indexAddTree.

Benefits

Limitations

Open questions

Note

For each FOOindexFOO related to the ‘index’ class a function FOOrindexFOO exists related to the ‘rindex’ class. The ‘rindex’ class is a pure R-prototype and is kept for regression-testing using binregtest from package regtest. The regression tests are in dontshow sections in the examples of this help. If you run example(index) the regression tests will (unavoidably) trigger warnings.

Author(s)

Jens Oehlschl<e4>gel

References

Tobin J. Lehman, Michael J. Carey (1986) A Study of Index Structures for Main Memory Database Management Systems. Proceedings of the 12th International Conference on Very Large Data Bases, 294 – 303.
Pfaff, Ben (2004). An Introduction to Binary Search Trees and Balanced Trees, Libavl Binary Search Tree Library. Free Software Foundation, Inc.

See Also

order, sort , match

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
  #library(rindex)
  x <- sample(c(rep("c", 5), paste("d", 1:5, sep=""), c(letters[c(1,2,5,6)], NA)))

  cat("\n")
  cat("creating an index from atomic vector\n")
  i <- index(x)
  i
  cat("creating an index by combining indices\n")
  i2 <- c(i,i)
  i2
  cat("if the index (or index$tree) is removed, the C-tree is removed at the next garbage collection\n")
  cat("the index tree can also removed and created explicitely\n")
  i <- indexDelTree(i)
  i
  i <- indexAddTree(i, batch=3)
  print(i, tree=TRUE)
  indexNodes(i)
  indexBytes(i)

  cat("\n")
  cat("extracting the original vector\n")
  i[]
  cat("subsetting works as expected\n")
  i[1:3]
  cat("accessing the sorted data is much faster\n")
  sort(i)[1:3]
  cat("accessing the ordering is also much faster (order.index is not dispatched since order is not yet generic)\n")
  order.index(i)[1:3]

  identical(is.na(i),is.na(x))
  identical(length(i),length(x))

  cat("\n")
  cat("LOW LEVEL SEARCH returns position in SORTED VECTOR\n")
  cat("low level search for position of lowest instance of value in the index\n")
  indexFind(i, "c")
  cat("low level search for position of highest instance of value in the index\n")
  indexFind(i, "c", findlow=FALSE)
  cat("low level search for position of lowest instance beginning like search value\n")
  indexFindlike(i,"d")
  cat("low level search for position of highest instance beginning like search value\n")
  indexFindlike(i,"d",findlow=FALSE)

  cat("\n")
  cat("MID LEVEL SEARCH also returns position in SORTED VECTOR\n")
  cat("mid level search for a set of values\n")
  indexMatch(i,c("c","f"), findlow=TRUE)  # giving parameter findlow= suppresses the warning issued on non-unique indices
  sort(i)[indexMatch(i,c("c","f"), findlow=TRUE)]
  i[indexMatch(i,c("c","f"), findlow=TRUE, what="pos")]
  indexMatch(i,c("c","f"), findlow=TRUE, what="val")
  indexMatch(i,c("c","f"), findlow=TRUE, what="pos")
  indexMatch(i,c("c","f"), findlow=FALSE, what="pos")

  cat("mid level search for interval of values\n")
  indexFindInterval(i,"b","f")
  cat("by default the searched endpoints are included\n")
  sort(i)[indexFindInterval(i,"b","f")]
  cat("but they can be excluded\n")
  sort(i)[indexFindInterval(i,"b","f",high.include=FALSE)]
  cat("by default the searched endpoints need not to be present\n")
  sort(i)[indexFindInterval(i,"a1","e1")]
  cat("but this can be required\n")
  sort(i)[indexFindInterval(i,"a1","e1",low.exact=TRUE)]
  cat("each of the searched endpoints can be defined via indexFindlike\n")
  sort(i)[indexFindInterval(i,"c","d",FUN=indexFindlike)]


  cat("\n")
  cat("HIGH LEVEL SEARCH returns POSITION(s) IN ORIGINAL VECTOR but in SEQUENCE OF INDEX\n")
  indexEQ(i,"d3")
  indexNE(i,"d3")
  indexLT(i,"d3")
  indexLE(i,"d3")
  indexGT(i,"d3")
  indexGE(i,"d3")
  cat("searching for several values returns a list\n")
  indexEQ(i,c("b","c","z",NA))
  indexEQ(i,c("b","c","z",NA), what="val")


  cat("\n")
  cat("HIGH LEVEL OPERATORS returns TRUE/FALSE AT ORIGINAL VECTOR POSITIONS\n")
  i=="d3"
  i!="d3"
  i<"d3"
  i<="d3"
  i>"d3"
  i>="d3"

  cat("HIGH LEVEL match.index and  behave as expected\n")
  match(c("b","c","z",NA), x)
  match.index(c("b","c","z",NA), i)



## Not run: 

   cat("function timefactor helps with timing\n")

   n <-  1000000
   x <- sample(1:n)
   names(x) <- paste("a", x, sep="")
   d <- data.frame(x=as.vector(x), row.names=names(x))

   nsub  <- 100
   i <- sample(1:n, nsub)
   ni <- names(x)[i]

   ind <- index(names(x), verbose=TRUE)
   ind

   # test vectors
   cat("character susetting is by magnitude slower than integer subsettting\n")
   timefactor( x[ni] , x[i] , 10, 10000)
   cat("character susetting is approx as slow as matching\n")
   timefactor( x[ni] , x[match(ni,names(x))] , 10, 10)
   cat("for small fractions of n indexing is much faster\n")
   timefactor( x[ni] , x[indexMatch(ind,ni)] , 10, 100)

   # test dataframes
   cat("character susetting is by magnitude slower than integer subsettting\n")
   timefactor( d[ni,] , d[i,] , 1, 100)
   cat("obvious implementation problem (in R-2.5.1 subsetting is much slower than via matching)\n")
   timefactor( d[ni,] , d[match(ni,rownames(d)),] , 1, 1)
   cat("for small fractions of n indexing is much faster\n")
   timefactor( d[ni,] , d[indexMatch(ind,ni),] , 1, 10)


## End(Not run)

rindex documentation built on Sept. 1, 2018, 1:04 a.m.

Related to index in rindex...