token_count: Count Fixed Tokens

View source: R/token_count.R

Description

Count the occurrences of tokens within a vector of strings. This function differs from term_count in that term_count is regex based, allowing for fuzzy matching, whereas this function searches only for lowercased tokens (words, number sequences, or punctuation). This makes token_count faster but less flexible.
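
For example, a fixed token search matches only exact, lowercased tokens, whereas term_count accepts regular expressions. A minimal sketch of the contrast using the sam_i_am data from the Examples below (the regexes passed to term_count are illustrative only; term.list is term_count's regex analogue of token.list):

token_count(sam_i_am, grouping.var = NULL,
    token.list = list(person = c('sam', 'i')))
term_count(sam_i_am, grouping.var = NULL,
    term.list = list(person = c('\\bsam\\b', '\\bi\\b')))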

Usage

token_count(
  text.var,
  grouping.var = NULL,
  token.list,
  stem = FALSE,
  keep.punctuation = TRUE,
  pretty = ifelse(isTRUE(grouping.var), FALSE, TRUE),
  group.names,
  meta.sep = "__",
  meta.names = c("meta"),
  ...
)

Arguments

text.var

The text string variable.

grouping.var

The grouping variable(s). The default NULL treats all of the text as one group. Also takes a single grouping variable or a list of one or more grouping variables. If TRUE, an id variable created with seq_along(text.var) is used as the grouping.

token.list

A list of named character vectors of tokens. The search will combine the counts for tokens supplied in the same vector. Tokens are defined by the regular expression "^([a-z' ]+|[0-9.]+|[[:punct:]]+)$" and should conform to this standard. token_count can also be used in a hierarchical fashion; that is, a list of tokens can be passed and counted, and then a second (or more) pass can be taken with a new set of tokens on only those rows/text elements that were left untagged (i.e., their count rowSums are zero). This is accomplished by passing a list of lists of search tokens. See the hierarchical tokens section of the Examples for a demonstration.

stem

logical. If TRUE the search is done after the terms have been stemmed.

keep.punctuation

logical. If TRUE, punctuation marks are counted as tokens.

pretty

logical. If TRUE pretty printing is used. Pretty printing can be turned off globally by setting options(termco_pretty = FALSE).

group.names

A vector of names that corresponds to the grouping variable(s). Generally for internal use.

meta.sep

A character separator (or character vector of separators) used to break up the term list names (tags). This generates a merge table attribute on the output containing the supplied tags along with the meta and sub tags produced by the separator breaks (see the sketch at the end of this list).

meta.names

A vector of names corresponding to the meta tags generated by meta.sep.

...

Other arguments passed to q_dtm.
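
The meta tag arguments above can be illustrated with a hypothetical token list whose tag names embed a meta tag before the "__" separator (the tags below are illustrative only, not part of the package):

tagged <- token_count(sam_i_am, grouping.var = TRUE,
    token.list = list(
        food__breakfast = c('eggs', 'ham'),
        animal__wild = c('fox', 'mouse')
    ),
    meta.sep = "__", meta.names = c("meta")
)
## the merge table attribute described under meta.sep can then be inspected:
attributes(tagged)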

Value

Returns a tibble object of token counts by grouping variable. The output has all of the same features as a term_count object, meaning functions that work on a term_count object will operate on a token_count object as well.
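
For instance (a brief sketch; plot() and as_dtm() also appear in the Examples below):

tc <- token_count(sam_i_am, grouping.var = TRUE,
    token.list = list(person = c('sam', 'i')))
plot(tc)     ## same plot method that works for term_count objects
as_dtm(tc)   ## coerce to a document-term matrix, as with term_count output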

Examples

token_list <- list(
    person = c('sam', 'i'),
    place = c('here', 'house'),
    thing = c('boat', 'fox', 'rain', 'mouse', 'box', 'eggs', 'ham'),
    no_like = c('not like')
)

token_count(sam_i_am, grouping.var = TRUE, token.list = token_list)
token_count(sam_i_am, grouping.var = NULL, token.list = token_list)

## Not run: 
x <- presidential_debates_2012[["dialogue"]]

## Model the most frequent bigrams (collocations) in the dialogue as fixed tokens
bigrams <- frequent_ngrams(x, gram.length = 2)$collocation
bigram_model <- token_count(x, TRUE, token.list = as_term_list(bigrams))
as_dtm(bigram_model)

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, lexicon, textshape)

## Convert the NRC emotion lexicon into a named token list
token_list <- lexicon::nrc_emotions %>%
    textshape::column_to_rownames() %>%
    t() %>%
    textshape::as_list()

presidential_debates_2012 %>%
     with(token_count(dialogue, TRUE, token_list))

presidential_debates_2012 %>%
     with(token_count(dialogue, list(person, time), token_list))

presidential_debates_2012 %>%
     with(token_count(dialogue, list(person, time), token_list)) %>%
     plot()

## End(Not run)

## hierarchical tokens
token_list <- list(
    list(
        person = c('sam', 'i')
    ),
    list(
        place = c('here', 'house'),
        thing = c('boat', 'fox', 'rain', 'mouse', 'box', 'eggs', 'ham')
    ),
    list(
        no_like = c('not like'),
        thing = c('train', 'goat')
    )
)

(x <- token_count(sam_i_am, grouping.var = TRUE, token.list = token_list))
attributes(x)[['pre_collapse_coverage']]

## External Dictionaries
## Not run: 
## dictionary from quanteda
require(quanteda); require(stringi); require(textreadr)

## Nadra Pencle and Irina Malaescu (2016) What's in the Words? Development and Validation of a
##   Multidimensional Dictionary for CSR and Application Using Prospectuses. Journal of Emerging
##   Technologies in Accounting: Fall 2016, Vol. 13, No. 2, pp. 109-127.

dict_corporate_social_responsibility <- "https://provalisresearch.com/Download/CSR.zip" %>%
    textreadr::download() %>%
    unzip(exdir = tempdir()) %>%
    `[`(1) %>%
    dictionary(file = .)

## Convert the quanteda dictionary to a term list and normalize its entries
## (underscores to spaces, parentheticals removed, characters outside printable
## ASCII replaced with apostrophes)
csr <- dict_corporate_social_responsibility %>%
    as_term_list() %>%
    lapply(function(x){
        x %>%
            stringi::stri_replace_all_fixed('_', ' ') %>%
            stringi::stri_replace_all_regex('\\s*\\(.+?\\)', '') %>%
            stringi::stri_replace_all_regex('[^ -~]', "'")
    })

presidential_debates_2012 %>%
     with(token_count(dialogue, list(time, person), csr)) %>%
     plot()


## End(Not run)
