View source: R/spacy_tokenize.R
Description

Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
Usage

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)
Arguments

x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

what: the unit for splitting the text; available alternatives are "word" (word segmentation) and "sentence" (sentence segmentation)

remove_punct: remove punctuation tokens

remove_url: remove tokens that look like a URL or email address

remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty")

remove_separators: remove spaces as separators; all other remove functionalities (e.g. remove_punct) have to be FALSE when output = "data.frame" is selected

remove_symbols: remove symbols; the symbols are either SYM in the pos field, or currency symbols

padding: if TRUE, leave an empty string where the removed tokens previously existed; this preserves a positional match between the original and the selected tokens

multithread: logical; if TRUE, the processing is parallelized using spaCy's multithreading

output: type of returned object, either "list" or "data.frame"

...: not used directly
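As an illustration of how the removal options and the output argument combine, here is a minimal sketch. It assumes spaCy and a language model are installed and reachable via spacy_initialize(); the input string is invented for illustration.

```r
library("spacyr")
spacy_initialize()

# Drop punctuation and number-like tokens, and return a data.frame
# (one row per token) instead of the default list of character vectors.
spacy_tokenize(
  "Version 3.1415 was released!",
  remove_punct = TRUE,
  remove_numbers = TRUE,
  output = "data.frame"
)

spacy_finalize()
```

Setting padding = TRUE instead would keep an empty string in place of each removed token, so token positions line up with the unfiltered text.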
Value

Either a list or a data.frame of tokens, depending on the output argument.
Examples

spacy_initialize()

txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)