R/hash.size.R

Defines functions hash.size

Documented in hash.size

#'@title Compute minimum hash size to reduce collision rate
#'
#'@importFrom magrittr %>%
#'
#'@param df \code{data.frame} with data to hash
#'
#'@return The hash size of feature hashing as a positive integer.
#'
#'@details
#'To reduce collision rate, the hash size should be 
#'equal or superior to the nearest power of two to the number
#'of unique values in the input \code{data.frame}.
#'
#'The value computed is a theorical minimum hash size.
#'It just means that in the best situation it may be 
#'possible that all computed hash can be stored with
#'this hash size.
#'
#'Intuitively, if the distribution of hash generated by the algorithm
#'was perfect, when the computed size is used, each permutation of bits
#'of the hash vector would correspond to one unique original value of
#'your \code{data.frame}. 
#'
#'Because a bit value is \code{\{0,1\}}, the computed size is \code{2^x}
#'with a \code{x} big enough to have a hash vector containing each
#'original value.
#'
#'In real life, there will be some collisions if the computed
#'size is used because the distribution of hash is not
#'perfect. However, the hashing algorithm \code{Murmur3} is known to
#'minimize the number of collisions and is also very performant.
#'
#'The only known solution to have zero collision is to build a
#'dictionnary of values, and for each new value to hash, check in the
#'dictionnary if the hash value already exists. It is not performant at all.
#'
#'If you increase the computed size (by multiplying it by \code{2^x}, 
#'it is up to you to choose a \code{x}), you will reduce the collision rate.
#'If you use a value under the computed size,
#'there is a 100% chance of collisions.
#'
#'There is a trade-off between collision rate and memory
#'used to store hash. Machine learning algorithms usually deal
#'well with collisions when the rate is reasonable.
#'
#'@examples
#'data(ipinyou)
#'
#'#First try with a size of 2^10
#'mat1 <- hashed.model.matrix(~., ipinyou.train, 2^10, create.mapping = TRUE)
#'
#'#Extract mapping
#'mapping1 <- hash.mapping(mat1)
#'#Rate of collision
#'mean(duplicated(mapping1))
#'
#'#Second try, the size is computed
#'size <- hash.size(ipinyou.train)
#'mat2 <- hashed.model.matrix(~., ipinyou.train, size, create.mapping = TRUE)
#'
#'#Extract mapping
#'mapping2 <- hash.mapping(mat2)
#'#Rate of collision
#'mean(duplicated(mapping2))
#'
#'@author Michael Benesty
#'@export
hash.size <- function(df) {
   sapply(df, function(x) unique(x) %>% length) %>% sum %>% log2 %>% ceiling %>% 2^.
}

# Avoid error messages during CRAN check.
# The reason is that these variables are never declared
# They are mainly column names inferred by Data.table...
globalVariables(c("."))
wush978/FeatureHashing documentation built on Oct. 23, 2022, 10:16 a.m.