Collocations: Collocations model.


Description

Creates Collocations model which can be used for phrase extraction.

Usage

Collocations

Format

R6Class object.

Fields

collocation_stat

data.table with collocation (phrase) statistics. Useful for filtering out non-relevant phrases.

Usage

For usage details see Methods, Arguments and Examples sections.

model = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0,
                         lfmd_min = -Inf, llr_min = 0, sep = "_")
model$partial_fit(it, ...)
model$fit(it, n_iter = 1, ...)
model$transform(it)
model$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
model$collocation_stat

Methods

$new(vocabulary = NULL, collocation_count_min = 50, sep = "_")

Constructor for the Collocations model. For a description of the arguments see the Arguments section.

$fit(it, n_iter = 1, ...)

Fits the Collocations model to the input iterator it. The model iterates over it n_iter times, so multi-word phrases can be learned hierarchically (for example, "could_have" learned on the first pass can combine with "been" on the second pass to form "could_have_been"). Invisibly returns collocation_stat.

$partial_fit(it, ...)

Iterates once over the data and learns collocations. Invisibly returns collocation_stat. This is the workhorse for $fit().

$transform(it)

Transforms the input iterator using the learned collocations model. The result is a new itoken or itoken_parallel iterator which produces tokens with learned phrases collapsed into single tokens.

$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)

Filters out non-relevant phrases with low scores. Users can also do this directly by modifying the collocation_stat object.
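
For example, one might first inspect the statistics and then call prune() again with stricter thresholds. A minimal sketch (the threshold values below are arbitrary and chosen only for illustration):

# phrases sorted by PMI, strongest first (collocation_stat is a data.table)
model$collocation_stat[order(-pmi)]
# drop everything below stricter PMI / gensim cut-offs
model$prune(pmi_min = 8, gensim_min = 10)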

Arguments

model

A Collocations model object

n_iter

number of iterations over the data

pmi_min, gensim_min, lfmd_min, llr_min

minimal values of the corresponding scores required to collapse a pair of tokens into a collocation:

  • pointwise mutual information

  • "gensim" scores - https://radimrehurek.com/gensim/models/phrases.html adapted from word2vec paper

  • log-frequency biased mutual dependency

  • Dunning's logarithm of the ratio between the likelihoods of the hypotheses of dependence and independence

See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8101&rep=rep1&type=pdf and http://www.aclweb.org/anthology/I05-1050 for details. Also inspect model$collocation_stat for better intuition; a short numeric sketch of these scores follows the Arguments section.

it

An input itoken or itoken_parallel iterator

vocabulary

text2vec_vocabulary - if provided, the model will only consider collocations composed of terms from this vocabulary
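
To build intuition for the scores above, here is a rough numeric sketch of their standard definitions, expressed with the counts reported in collocation_stat (n_i, n_j, n_ij) plus the total token count n, which is not stored in the table. The values of n and delta below are assumptions chosen so that the sketch reproduces the "special effects" row of the example output; treat this as an illustration rather than the package's exact implementation.

# counts for one candidate phrase ("special" + "effects" in the example below)
n_i   = 8      # occurrences of the prefix token
n_j   = 7      # occurrences of the suffix token
n_ij  = 7      # occurrences of the pair
n     = 15151  # total tokens in the corpus (assumed; back-solved from the output)
delta = 5      # assumed discount, here equal to collocation_count_min

pmi    = log2((n_ij / n) / ((n_i / n) * (n_j / n)))                      # ~ 10.89
lfmd   = log2((n_ij / n)^2 / ((n_i / n) * (n_j / n))) + log2(n_ij / n)   # ~ -11.27
gensim = (n_ij - delta) * n / (n_i * n_j)                                # ~ 541.1
# llr (Dunning's log-likelihood ratio) needs the full 2x2 contingency table
# of the pair and is omitted from this sketch

Note that under this definition a phrase occurring exactly collocation_count_min times gets a gensim score of 0, which matches the zeros in the example output.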

Examples

library(text2vec)
data("movie_review")

preprocessor = function(x) {
  gsub("[^[:alnum:]\\s]", replacement = " ", tolower(x))
}
sample_ind = 1:100
tokens = word_tokenizer(preprocessor(movie_review$review[sample_ind]))
it = itoken(tokens, ids = movie_review$id[sample_ind])
system.time(v <- create_vocabulary(it))
v = prune_vocabulary(v, term_count_min = 5)

model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it, n_iter = 2)
model$collocation_stat

it2 = model$transform(it)
v2 = create_vocabulary(it2)
v2 = prune_vocabulary(v2, term_count_min = 5)
# check what phrases model has learned
setdiff(v2$term, v$term)
# [1] "main_character"  "jeroen_krabb"    "boogey_man"      "in_order"
# [5] "couldn_t"        "much_more"       "my_favorite"     "worst_film"
# [9] "have_seen"       "characters_are"  "i_mean"          "better_than"
# [13] "don_t_care"      "more_than"       "look_at"         "they_re"
# [17] "each_other"      "must_be"         "sexual_scenes"   "have_been"
# [21] "there_are_some"  "you_re"          "would_have"      "i_loved"
# [25] "special_effects" "hit_man"         "those_who"       "people_who"
# [29] "i_am"            "there_are"       "could_have_been" "we_re"
# [33] "so_bad"          "should_be"       "at_least"        "can_t"
# [37] "i_thought"       "isn_t"           "i_ve"            "if_you"
# [41] "didn_t"          "doesn_t"         "i_m"             "don_t"

# and same way we can create document-term matrix which contains
# words and phrases!
dtm = create_dtm(it2, vocab_vectorizer(v2))
# check that dtm contains phrases
which(colnames(dtm) == "jeroen_krabb")

Example output

   user  system elapsed 
  0.160   0.004   0.171 
INFO [2018-07-10 14:02:39] iteration 1 - found 42 collocations
INFO [2018-07-10 14:02:40] iteration 2 - found 46 collocations
        prefix    suffix n_i n_j n_ij       pmi      lfmd     gensim rank_pmi
 1:     jeroen     krabb   5   5    5 11.565197 -11.56520   0.000000        1
 2:    special   effects   8   7    7 10.887125 -11.27242 541.107143        2
 3:     boogey       man   5  25    5  9.243269 -13.88713   0.000000        3
 4: could_have      been   8  23    7  9.138369 -12.95607 161.010870        4
 5:        hit       man  11  25    7  8.591193 -13.56835 110.189091        5
 6:     sexual    scenes  13  19    6  8.523721 -14.08061  61.340081        6
 7:       each     other   8  30    5  8.302163 -14.82823   0.000000        7
 8:       main character  11  27    5  7.994734 -15.13566   0.000000        8
 9:         my  favorite  50   6    5  7.980235 -15.15016   0.000000        9
10:      don_t      care  34  10    5  7.767113 -15.29818   0.000000       10
11:         we        re  37  23    8  7.154110 -14.62014  53.411281       11
12:         at     least  78  16   11  7.061155 -13.79423  72.841346       12
13:        isn         t  12 115   12  7.041635 -13.56269  76.852899       13
14:     couldn         t   5 115    5  7.041635 -16.08876   0.000000       14
15:        don         t  34 115   34  7.041635 -10.55769 112.373146       15
16:      doesn         t  17 115   17  7.041635 -12.55769  92.998465       16
17:       didn         t  14 115   14  7.041635 -13.11791  84.695031       17
18:     better      than  22  27    5  6.994734 -16.13566   0.000000       18
19:  there_are      some  13  58    6  6.881118 -15.65811  19.645889       19
20:       have      been  74  29   15  6.726582 -13.23389  70.601118       20
21:       more      than  42  27    7  6.547275 -15.61227  26.721340       21
22:       look        at  12  78    5  6.338689 -16.79171   0.000000       22
23:          t      care 115  10    6  6.304670 -16.29966  13.174783       23
24:     should        be  17  91    8  6.291868 -15.48238  29.381383       24
25:      could      have  21  74    8  6.285355 -15.48890  29.249035       25
26:      those       who  21  70    7  6.172880 -15.98666  20.613605       26
27:       must        be  14  91    6  6.156938 -16.44739  11.892465       27
28:     people       who  26  70    7  5.864758 -16.29478  16.649451       28
29:        you        re 108  23    9  5.778601 -15.65580  24.397746       29
30:       they        re  62  23    5  5.731295 -17.39910   0.000000       30
31:       much      more  36  42    5  5.646811 -17.48358   0.000000       31
32:          i     loved 311   6    6  5.606355 -16.99797   8.119507       32
33:          i         m 311  18   17  5.523892 -14.07543  32.478028       33
34:         in     order 286   6    5  5.464220 -17.66617   0.000000       34
35:       have      seen  74  24    5  5.414638 -17.71576   0.000000       35
36:        can         t  38 115   12  5.378670 -15.22566  24.269336       36
37:          i        am 311   9    7  5.243785 -16.91576  10.826009       37
38:          i   thought 264  19   12  5.147217 -15.39201  20.672049       38
39:         if       you  52 108   13  5.132238 -15.24113  21.582621       39
40:          i      mean 311   7    5  5.120928 -18.00947   0.000000       40
41:          i        ve 311  17   12  5.103854 -15.50047  20.059958       41
42:      worst      film  13 170    5  5.099223 -18.03117   0.000000       42
43: characters       are  21 106    5  5.088816 -18.04158   0.000000       43
44:      there       are  56 106   13  5.052290 -15.32108  20.419137       44
45:         so       bad  96  39    8  5.016761 -16.75749  12.140224       45
46:      would      have  38  74    6  5.014707 -17.58962   5.387980       46
        prefix    suffix n_i n_j n_ij       pmi      lfmd     gensim rank_pmi
    rank_lfmd rank_gensim
 1:         3          31
 2:         2           1
 3:        11          32
 4:         5           2
 5:         9           4
 6:        13          10
 7:        15          33
 8:        16          34
 9:        17          35
10:        20          46
11:        14          11
12:        10           8
13:         8           7
14:        30          36
15:         1           3
16:         4           5
17:         6           6
18:        31          37
19:        28          23
20:         7           9
21:        26          15
22:        36          38
23:        33          25
24:        23          13
25:        24          14
26:        29          20
27:        34          27
28:        32          24
29:        27          16
30:        39          39
31:        40          40
32:        38          29
33:        12          12
34:        42          41
35:        43          42
36:        18          17
37:        37          28
38:        22          19
39:        19          18
40:        44          43
41:        25          22
42:        45          44
43:        46          45
44:        21          21
45:        35          26
46:        41          30
    rank_lfmd rank_gensim
 [1] "have_seen"       "don_t_care"      "i_mean"          "my_favorite"    
 [5] "better_than"     "more_than"       "worst_film"      "jeroen_krabb"   
 [9] "main_character"  "boogey_man"      "they_re"         "couldn_t"       
[13] "in_order"        "look_at"         "much_more"       "characters_are" 
[17] "each_other"      "would_have"      "sexual_scenes"   "i_loved"        
[21] "have_been"       "must_be"         "there_are_some"  "you_re"         
[25] "people_who"      "hit_man"         "there_are"       "i_am"           
[29] "special_effects" "could_have_been" "those_who"       "we_re"          
[33] "should_be"       "so_bad"          "at_least"        "can_t"          
[37] "isn_t"           "i_thought"       "i_ve"            "if_you"         
[41] "didn_t"          "doesn_t"         "i_m"             "don_t"          
[1] 51
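
The fitted model can also be applied to documents it was not trained on. A minimal sketch continuing the example above (the index range 101:200 is arbitrary and chosen only for illustration):

# tokenize a new slice of reviews with the same preprocessing as before
new_ind = 101:200
new_tokens = word_tokenizer(preprocessor(movie_review$review[new_ind]))
it_new = itoken(new_tokens, ids = movie_review$id[new_ind])
# collapse the phrases learned above in the new documents
it_new2 = model$transform(it_new)
v_new = create_vocabulary(it_new2)
# learned phrases such as "special_effects" show up here if they occur in the new texts
intersect(v_new$term, setdiff(v2$term, v$term))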
