Description

prune_embeddings() subsets a (commonly large) pre-trained word-vector
matrix into a smaller embedding matrix, with one vector per vocabulary
term.
Usage

prune_embeddings(vocab, embeddings, nbuckets = attr(vocab, "nbuckets"),
                 max_in_bucket = 30)
Arguments

vocab          a vocabulary, as returned by vocab() or prune_vocab().

embeddings     the embeddings matrix. The terms dimension must be
               named. If both dimensions are named, the orientation
               is detected automatically from the dimension names.

nbuckets       how many buckets to create for unknown terms (terms in
               the corpus not present in the vocabulary).

max_in_bucket  at most this many embedding vectors are averaged into
               each unknown or missing bucket (see Details). A lower
               number results in faster processing.
Details

prune_embeddings() is commonly used in conjunction with the sequence
generators (tix_mat(), tix_seq() and tix_df()). When a term in a
corpus is not present in the vocabulary (unknown), it is hashed into
one of nbuckets buckets. Embeddings hashed into the same bucket are
averaged to produce the embedding for that bucket. The maximum number
of embeddings averaged per bucket is controlled with the
max_in_bucket parameter.

Similarly, when a term from the vocabulary is not present in the
embedding matrix (missing), max_in_bucket embeddings are averaged to
produce the missing embedding. Separate buckets are used for
"missing" and "unknown" embeddings because nbuckets can be 0.
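The hash-and-average scheme described above can be sketched as follows.
This is an illustrative, language-neutral reimplementation in Python, not
mlvocab's actual code: the function name bucket_embeddings, the md5-based
hash, and the zero-vector fallback for empty buckets are all assumptions
made for the sketch.

```python
import hashlib
import numpy as np

def bucket_embeddings(embeddings, terms, nbuckets, max_in_bucket=30):
    """Average at most `max_in_bucket` embedding rows into each of
    `nbuckets` buckets, assigning terms to buckets by hashing."""
    dim = embeddings.shape[1]
    buckets = [[] for _ in range(nbuckets)]
    for row, term in zip(embeddings, terms):
        # Deterministic stand-in for mlvocab's internal hash (an assumption).
        b = int(hashlib.md5(term.encode()).hexdigest(), 16) % nbuckets
        if len(buckets[b]) < max_in_bucket:  # cap the work done per bucket
            buckets[b].append(row)
    # A bucket that received no terms falls back to a zero vector here.
    return np.vstack([np.mean(rows, axis=0) if rows else np.zeros(dim)
                      for rows in buckets])
```

Each bucket row is a plain arithmetic mean of the embeddings hashed into
it, which is why lowering max_in_bucket trades a rougher average for less
computation.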
Examples

corpus <-
  list(a = c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
       b = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
             "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
v <- vocab(corpus)
v2 <- prune_vocab(v, max_terms = 7, nbuckets = 2)

enames <- c("the", "quick", "brown", "fox", "jumps")
emat <- matrix(rnorm(50), nrow = 5, dimnames = list(enames, NULL))
prune_embeddings(v2, emat)
prune_embeddings(v2, t(emat)) # automatic detection of the orientation

vembs <- prune_embeddings(v2, emat)
all(vembs[enames, ] == emat[enames, ])