hash.size | R Documentation |
Compute minimum hash size to reduce collision rate
hash.size(df)
df |
|
To reduce collision rate, the hash size should be
equal or superior to the nearest power of two to the number
of unique values in the input data.frame
.
The value computed is a theorical minimum hash size. It just means that in the best situation it may be possible that all computed hash can be stored with this hash size.
Intuitively, if the distribution of hash generated by the algorithm
was perfect, when the computed size is used, each permutation of bits
of the hash vector would correspond to one unique original value of
your data.frame
.
Because a bit value is {0,1}
, the computed size is 2^x
with a x
big enough to have a hash vector containing each
original value.
In real life, there will be some collisions if the computed
size is used because the distribution of hash is not
perfect. However, the hashing algorithm Murmur3
is known to
minimize the number of collisions and is also very performant.
The only known solution to have zero collision is to build a dictionnary of values, and for each new value to hash, check in the dictionnary if the hash value already exists. It is not performant at all.
If you increase the computed size (by multiplying it by 2^x
,
it is up to you to choose a x
), you will reduce the collision rate.
If you use a value under the computed size,
there is a 100
There is a trade-off between collision rate and memory used to store hash. Machine learning algorithms usually deal well with collisions when the rate is reasonable.
The hash size of feature hashing as a positive integer.
Michael Benesty
data(ipinyou)
#First try with a size of 2^10
mat1 <- hashed.model.matrix(~., ipinyou.train, 2^10, create.mapping = TRUE)
#Extract mapping
mapping1 <- hash.mapping(mat1)
#Rate of collision
mean(duplicated(mapping1))
#Second try, the size is computed
size <- hash.size(ipinyou.train)
mat2 <- hashed.model.matrix(~., ipinyou.train, size, create.mapping = TRUE)
#Extract mapping
mapping2 <- hash.mapping(mat2)
#Rate of collision
mean(duplicated(mapping2))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.