View source: R/preprocessing.R

make_sampling_table | R Documentation |

Generates a word rank-based probabilistic sampling table.

```
make_sampling_table(size, sampling_factor = 1e-05)
```

`size` | Int, number of possible words to sample. |
`sampling_factor` | The sampling factor in the word2vec formula. |

Used for generating the `sampling_table` argument for `skipgrams()`. `sampling_table[[i]]` is the probability of sampling the i-th most common word in a dataset (more common words should be sampled less frequently, for balance).

The sampling probabilities are generated according to the sampling distribution used in word2vec:

`p(word) = min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor))`
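As a minimal sketch, the formula can be evaluated directly in base R. The helper name `word2vec_keep_prob` is hypothetical and not part of the package; it only illustrates how the probability depends on the ratio `word_frequency / sampling_factor`.

```r
# Hypothetical helper (not part of keras): evaluates the word2vec
# sampling probability for a vector of relative word frequencies.
word2vec_keep_prob <- function(word_frequency, sampling_factor = 1e-05) {
  ratio <- word_frequency / sampling_factor
  # sqrt(ratio) / ratio equals 1 / sqrt(ratio); pmin() caps it at 1 element-wise
  pmin(1, sqrt(ratio) / ratio)
}

# Rare words are always kept; a word 100x more frequent than
# sampling_factor is kept about 10% of the time.
word2vec_keep_prob(c(1e-06, 1e-05, 1e-03))
# roughly 1, 1, 0.1
```

Words whose frequency is at or below `sampling_factor` get probability 1; above that threshold the probability falls off as the inverse square root of the ratio.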

We assume that the word frequencies follow Zipf's law (s=1) to derive a numerical approximation of frequency(rank):

`frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`

where `gamma` is the Euler-Mascheroni constant.
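The approximation can be sketched in base R as follows. The function name `zipf_frequency` and the number of digits used for `gamma` are assumptions made for illustration. The denominator matches the asymptotic expansion of `rank * H(rank)`, where `H` is the harmonic number, so the result should be close to `1 / (rank * sum(1 / (1:rank)))`.

```r
# Hypothetical helper (not part of keras): approximate frequency of the
# word with the given rank under Zipf's law with s = 1.
zipf_frequency <- function(rank) {
  gamma <- 0.5772156649  # Euler-Mascheroni constant (truncated)
  1 / (rank * (log(rank) + gamma) + 1 / 2 - 1 / (12 * rank))
}

ranks <- 1:100
approx_freq <- zipf_frequency(ranks)

# Comparison against the exact harmonic-number form, for illustration only
exact_freq <- 1 / (ranks * cumsum(1 / ranks))
max(abs(approx_freq - exact_freq) / exact_freq)  # small relative error
```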

An array of length `size` where the i-th entry is the probability that a word of rank i should be sampled.
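To make the shape of the returned table concrete, here is a pure-R sketch that combines the Zipf approximation with the word2vec formula. It is not the package's implementation: the name `sketch_sampling_table` is hypothetical, ranks are assumed to run from 1 to `size`, and the actual function may differ in indexing and in the precision used for `gamma`.

```r
# Hypothetical re-implementation for illustration; assumes ranks 1..size.
sketch_sampling_table <- function(size, sampling_factor = 1e-05) {
  gamma <- 0.5772156649  # Euler-Mascheroni constant (truncated)
  rank <- seq_len(size)
  # 1 / frequency(rank) under the Zipf (s = 1) approximation
  inv_frequency <- rank * (log(rank) + gamma) + 1 / 2 - 1 / (12 * rank)
  # word_frequency / sampling_factor
  ratio <- 1 / (inv_frequency * sampling_factor)
  pmin(1, sqrt(ratio) / ratio)
}

tbl <- sketch_sampling_table(10)
length(tbl)  # one entry per rank
tbl[1]       # the most common word has the lowest sampling probability
```

Under these assumptions the entries increase with rank, so rarer words are sampled more often than common ones.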

The word2vec formula is: `p(word) = min(1, sqrt(word_frequency / sampling_factor) / (word_frequency / sampling_factor))`

Other text preprocessing: `pad_sequences()`, `skipgrams()`, `text_hashing_trick()`, `text_one_hot()`, `text_to_word_sequence()`
