hash.size: Compute minimum hash size to reduce collision rate
In FeatureHashing: Creates a Model Matrix via Feature Hashing with a Formula Interface

hash.size

R Documentation

Compute minimum hash size to reduce collision rate

Description

Compute minimum hash size to reduce collision rate

Usage

hash.size(df)

Arguments

`df`	`data.frame` with data to hash

Details

To reduce collision rate, the hash size should be equal or superior to the nearest power of two to the number of unique values in the input data.frame.

The value computed is a theorical minimum hash size. It just means that in the best situation it may be possible that all computed hash can be stored with this hash size.

Intuitively, if the distribution of hash generated by the algorithm was perfect, when the computed size is used, each permutation of bits of the hash vector would correspond to one unique original value of your data.frame.

Because a bit value is {0,1}, the computed size is 2^x with a x big enough to have a hash vector containing each original value.

In real life, there will be some collisions if the computed size is used because the distribution of hash is not perfect. However, the hashing algorithm Murmur3 is known to minimize the number of collisions and is also very performant.

The only known solution to have zero collision is to build a dictionnary of values, and for each new value to hash, check in the dictionnary if the hash value already exists. It is not performant at all.

If you increase the computed size (by multiplying it by 2^x, it is up to you to choose a x), you will reduce the collision rate. If you use a value under the computed size, there is a 100

There is a trade-off between collision rate and memory used to store hash. Machine learning algorithms usually deal well with collisions when the rate is reasonable.

Value

The hash size of feature hashing as a positive integer.

Author(s)

Michael Benesty

Examples

data(ipinyou)

#First try with a size of 2^10
mat1 <- hashed.model.matrix(~., ipinyou.train, 2^10, create.mapping = TRUE)

#Extract mapping
mapping1 <- hash.mapping(mat1)
#Rate of collision
mean(duplicated(mapping1))

#Second try, the size is computed
size <- hash.size(ipinyou.train)
mat2 <- hashed.model.matrix(~., ipinyou.train, size, create.mapping = TRUE)

#Extract mapping
mapping2 <- hash.mapping(mat2)
#Rate of collision
mean(duplicated(mapping2))

FeatureHashing documentation built on May 29, 2024, 8:19 a.m.

FeatureHashing index

README.md FeatureHashing Sentiment Analysis via R, FeatureHashing and XGBoost

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

FeatureHashing
Creates a Model Matrix via Feature Hashing with a Formula Interface

hash.size: Compute minimum hash size to reduce collision rate
In FeatureHashing: Creates a Model Matrix via Feature Hashing with a Formula Interface

Compute minimum hash size to reduce collision rate

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to hash.size in FeatureHashing...

R Package Documentation

Browse R Packages

We want your feedback!

FeatureHashing Creates a Model Matrix via Feature Hashing with a Formula Interface

hash.size: Compute minimum hash size to reduce collision rate In FeatureHashing: Creates a Model Matrix via Feature Hashing with a Formula Interface

Compute minimum hash size to reduce collision rate

Description

Usage

Arguments

Details

Value

Author(s)

Examples

Related to hash.size in FeatureHashing...

R Package Documentation

Browse R Packages

We want your feedback!

FeatureHashing
Creates a Model Matrix via Feature Hashing with a Formula Interface

hash.size: Compute minimum hash size to reduce collision rate
In FeatureHashing: Creates a Model Matrix via Feature Hashing with a Formula Interface