Convert phrases into single tokens

Description

Replace multi-word phrases in text(s) with a compound version in which the words of each phrase are joined by the concatenator (by default, the "_" character) to form a single token. Removing the whitespace delimiter prevents the phrase from being split into separate tokens during subsequent processing.
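The effect described above can be approximated in a few lines of base R. This is only a sketch of the idea, not the quanteda implementation: join_phrase() is a hypothetical helper introduced here for illustration, and it handles a single fixed phrase with no wildcards or case folding.

```r
# Base R sketch of the idea only; not how phrasetotoken() is implemented.
# join_phrase() is a hypothetical helper, not part of quanteda.
join_phrase <- function(texts, phrase, concatenator = "_") {
  # Build the compound form, e.g. "capital gains tax" -> "capital_gains_tax"
  compound <- gsub(" ", concatenator, phrase, fixed = TRUE)
  # Replace every literal occurrence of the phrase with its compound form
  gsub(phrase, compound, texts, fixed = TRUE)
}

txt <- "The new law included a capital gains tax."
joined <- join_phrase(txt, "capital gains tax")
joined

# Splitting on whitespace now keeps the phrase together as one piece
pieces <- strsplit(joined, " ", fixed = TRUE)[[1]]
pieces
```

Because the whitespace inside the phrase is gone, a whitespace-based tokenizer has nothing to split on within it.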

Usage

phrasetotoken(object, phrases, ...)

## S4 method for signature 'corpus,ANY'
phrasetotoken(object, phrases, ...)

## S4 method for signature 'character,dictionary'
phrasetotoken(object, phrases, ...)

## S4 method for signature 'character,collocations'
phrasetotoken(object, phrases, ...)

## S4 method for signature 'character,character'
phrasetotoken(object, phrases,
  concatenator = "_", valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, ...)

Arguments

object

source texts, a character or character vector

phrases

a dictionary object that contains some phrases, defined as multiple words delimited by whitespace, up to 9 words long; a quanteda collocation object created by collocations; or a character vector of such phrases

...

additional arguments passed through to the core "character,character" method

concatenator

the concatenation character that will connect the words making up the multi-word phrases. The default "_" is highly recommended, since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class [P], will be removed).

valuetype

how to interpret word matching patterns: "glob" for "glob"-style wildcarding; "regex" for regular expressions; "fixed" for words as is

case_insensitive

if TRUE, ignore case when matching
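As a rough illustration of how the three valuetype settings and case_insensitive differ, the sketch below uses base R pattern matching on single words. It is an analogy only: the words vector is made up for the example, and quanteda's own matching rules may differ in detail (for instance in how patterns are anchored).

```r
# Illustration with base R only; quanteda's matching may differ in detail.
words <- c("word", "words", "Word", "sword")

# "glob": "*" matches any run of characters, "?" exactly one character.
# utils::glob2rx() translates a glob pattern into an anchored regex.
glob_hits <- grepl(utils::glob2rx("word*"), words)

# "regex": the pattern is a regular expression (unanchored in this sketch).
regex_hits <- grepl("word", words)

# "fixed": the pattern is compared literally, so "*" is an ordinary character.
fixed_hits <- words == "word*"

# Ignoring case, as case_insensitive = TRUE does, also picks up "Word".
glob_hits_ci <- grepl(utils::glob2rx("word*"), words, ignore.case = TRUE)

data.frame(words, glob_hits, regex_hits, fixed_hits, glob_hits_ci)
```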

Value

character or character vector of texts with phrases replaced by compound "words" joined by the concatenator

Author(s)

Kenneth Benoit

Examples

mytexts <- c("The new law included a capital gains tax, and an inheritance tax.",
             "New York City has raised taxes: an income tax and a sales tax.")
mydict <- dictionary(list(tax=c("tax", "income tax", "capital gains tax", "inheritance tax")))
(cw <- phrasetotoken(mytexts, mydict))
dfm(cw, verbose=FALSE)

# when used as a dictionary for dfm creation
mydfm2 <- dfm(cw, dictionary = lapply(mydict, function(x) gsub(" ", "_", x)))
mydfm2
# to pick up "taxes" in the second text, set valuetype = "regex"
mydfm3 <- dfm(cw, dictionary = lapply(mydict, phrasetotoken, mydict),
              valuetype = "regex")
mydfm3
## one more token counted for "tax" than before

# using a dictionary to pre-process multi-word expressions
myDict <- dictionary(list(negative = c("bad* word*", "negative", "awful text"),
                          positive = c("good stuff", "like? th??")))
txt <- c("I liked this, when we can use bad words, in awful text.",
         "Some damn good stuff, like the text, she likes that too.")
phrasetotoken(txt, myDict)

# on simple text
phrasetotoken("This is a simpler version of multi word expressions.", "multi word expression*")
