as.tokens: Coercion, checking, and combining functions for tokens objects

Description

Coercion functions to and from tokens objects, checks for whether an object is a tokens object, and functions to combine tokens objects.

Usage

## S3 method for class 'tokens'
as.list(x, ...)

## S3 method for class 'tokens'
as.character(x, use.names = FALSE, ...)

is.tokens(x)

## S3 method for class 'tokens'
unlist(x, recursive = FALSE, use.names = TRUE)

## S3 method for class 'tokens'
t1 + t2

## S3 method for class 'tokens'
c(...)

as.tokens(x, concatenator = "_", ...)

## S3 method for class 'list'
as.tokens(x, concatenator = "_", ...)

## S3 method for class 'tokens'
as.tokens(x, ...)

## S3 method for class 'spacyr_parsed'
as.tokens(
  x,
  concatenator = "/",
  include_pos = c("none", "pos", "tag"),
  use_lemma = FALSE,
  ...
)

Arguments

x

object to be coerced or checked

...

additional arguments used by specific methods. For c.tokens, these are the tokens objects to be concatenated.

use.names

logical; preserve names if TRUE. For as.character and unlist only.

recursive

a required argument for unlist but inapplicable to tokens objects

t1

the first tokens object to be combined

t2

the second tokens object to be combined

concatenator

character used to join the elements of multi-word expressions; the default is the underscore character "_". See Details.

include_pos

character; whether and which part-of-speech tag to use: "none" uses no part-of-speech indicator, "pos" uses the pos variable, "tag" uses the tag variable. The tag is appended to the token, separated by the concatenator (see the sketch after this list).

use_lemma

logical; if TRUE, use the lemma rather than the raw token
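
For spacyr_parsed input, a minimal sketch (assuming the spacyr package and a working spaCy installation; not part of this page's examples):

library(spacyr)
spacy_initialize()  # start the spaCy backend
# spacy_parse() returns a spacyr_parsed data.frame with token, lemma, and pos columns
parsed <- spacy_parse("The cats were sitting on the mat.")
# append each token's "pos" tag after the "/" concatenator, and use lemmas
as.tokens(parsed, include_pos = "pos", use_lemma = TRUE)
# yields tokens such as "cat/NOUN" and "sit/VERB"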

Details

The concatenator is used to automatically generate dictionary values for multi-word expressions in tokens_lookup() and dfm_lookup(). The underscore character is commonly used to join the elements of multi-word expressions (e.g. "piece_of_cake", "New_York"), but other characters (e.g. whitespace " " or a hyphen "-") can also be used. In those cases, users must specify the concatenator used in their tokens so that the conversion treats that character as the inter-word delimiter when reading in the elements that will become the tokens.
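
A small sketch of this behavior (assuming quanteda is attached), using whitespace as the concatenator; compare the hyphen version under Examples:

dict <- dictionary(list(country = "United States"))
lis <- list(c("The", "United States", "is", "one", "country", "."))
# record whitespace as the concatenator so that tokens_lookup() joins the
# elements of multi-word dictionary values with " " when matching
toks <- as.tokens(lis, concatenator = " ")
tokens_lookup(toks, dict)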

Value

as.list returns a simple list of characters from a tokens object.

as.character returns a character vector from a tokens object.

is.tokens returns TRUE if the object is of class tokens, FALSE otherwise.

unlist returns a simple vector of characters from a tokens object.

c(...) and + return a tokens object whose documents have been combined into a single sequence of documents.

as.tokens returns a quanteda tokens object.

Examples

# combining tokens
toks1 <- tokens(c(doc1 = "a b c d e", doc2 = "f g h"))
toks2 <- tokens(c(doc3 = "1 2 3"))
toks1 + toks2
c(toks1, toks2)


# create tokens object from list of characters with custom concatenator
dict <- dictionary(list(country = "United States",
                   sea = c("Atlantic Ocean", "Pacific Ocean")))
lis <- list(c("The", "United-States", "has", "the", "Atlantic-Ocean",
              "and", "the", "Pacific-Ocean", "."))
toks <- as.tokens(lis, concatenator = "-")
tokens_lookup(toks, dict)
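
# A further sketch (not in the original examples) of the coercion and
# checking functions described under Value:
toks3 <- tokens(c(doc1 = "a b c"))
is.tokens(toks3)                  # TRUE
as.list(toks3)                    # named list of character vectors
as.character(toks3)               # a single character vector of all tokens
unlist(toks3, use.names = TRUE)   # flattened vector with document-derived names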
