asTokenlist: Create a tokenlist data.frame


Description

A tokenlist is a data.frame in which each row represents one token of a text (e.g., a word, lemma or n-gram). This function creates a tokenlist that is ordered by document (the 'doc_id' column) and by the position of the token within the text (the 'position' column).
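To make this structure concrete, the following base R sketch builds a data.frame with the same three columns by hand. It does not call asTokenlist; the example texts and the naive whitespace split are illustrative assumptions, whereas the real function relies on quanteda for tokenization.

```r
# Hypothetical input texts (not taken from the package documentation)
texts <- c("the cat sat", "dogs bark")

# Naive whitespace tokenization, used here only for illustration
tokens <- strsplit(tolower(texts), " ", fixed = TRUE)

# One row per token: document id, position within the document, token text
tokenlist <- data.frame(
  doc_id   = rep(seq_along(tokens), lengths(tokens)),
  position = sequence(lengths(tokens)),
  word     = unlist(tokens),
  stringsAsFactors = FALSE
)

# Order by document and then by position within the document
tokenlist <- tokenlist[order(tokenlist$doc_id, tokenlist$position), ]
print(tokenlist)
```

Each document contributes as many rows as it has tokens, and 'position' restarts at 1 for every document.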

Usage

asTokenlist(x, doc_id = NULL, language = "english", use_stemming = F,
  lowercase = T, remove_stopwords = F, ngrams = 1,
  doc.col = getOption("doc.col", "doc_id"),
  position.col = getOption("position.col", "position"),
  word.col = getOption("word.col", "word"), ...)

Arguments

x

An object that can be transformed into a tokenlist. This can be 1) a list of the tokenizedTexts class (quanteda), 2) a data.frame with document id, position and word columns (see Details for the column names), or 3) a character vector, in which case the tokenize function of the quanteda package is used.

doc_id

If the input is a tokenizedTexts list or a character vector, a doc_id vector can be given to define the document ids (otherwise, the list or vector indices are used).

doc.col

The name of the document id column. Defaults to "doc_id", unless a global default is specified using setTokenlistColnames().

position.col

The name of the column giving the position of the token in a document. Defaults to "position", unless a global default is specified using setTokenlistColnames().

word.col

The name of the column containing the token text. Defaults to "word", unless a global default is specified using setTokenlistColnames().

...

If x is a character vector, additional arguments are passed to the tokenize function of the quanteda package.
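The fallback described for the doc_id argument can be sketched in base R as follows. The helper resolve_doc_ids is hypothetical and not part of semnet, and the length check is an added assumption rather than documented behaviour.

```r
# Hypothetical helper, not part of semnet: mimics the documented doc_id fallback
resolve_doc_ids <- function(x, doc_id = NULL) {
  # No ids supplied: fall back to the indices of the list or vector
  if (is.null(doc_id)) return(seq_along(x))
  # Assumption: one id per document is expected
  stopifnot(length(doc_id) == length(x))
  doc_id
}

texts <- c("first text", "second text")

ids_default <- resolve_doc_ids(texts)                       # indices of texts
ids_custom  <- resolve_doc_ids(texts, doc_id = c("a", "b")) # user-supplied ids
```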

Details

Tokenization is handled by the tokenize function of the quanteda package. Additional arguments (...) are passed on to that function.

The default column names for the tokenlist are "doc_id", "position" and "word". Functions in semnet that take a tokenlist as an argument assume that these column names are used. If alternative column names are preferred, they can be specified in two ways. First, they can be set per call using the doc.col, position.col and word.col parameters. Second, defaults can be set globally using the setTokenlistColnames() function.
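The Usage signature resolves these defaults with getOption(), so the two mechanisms can be illustrated in base R. The function show_cols below is hypothetical and only copies that default pattern. Calling options() directly stands in for setTokenlistColnames(), on the assumption (not confirmed by this page) that the global defaults are stored as R options.

```r
# Hypothetical function using the same default pattern as the Usage signature
show_cols <- function(doc.col = getOption("doc.col", "doc_id"),
                      position.col = getOption("position.col", "position"),
                      word.col = getOption("word.col", "word")) {
  c(doc = doc.col, position = position.col, word = word.col)
}

# Start from a clean state and remember the previous option values
old <- options(doc.col = NULL, position.col = NULL, word.col = NULL)

builtin_defaults <- show_cols()                    # no options set
per_call         <- show_cols(word.col = "token")  # override for one call

options(word.col = "lemma")                        # assumed global mechanism
global_default   <- show_cols()                    # picks up the option

options(old)                                       # restore previous options
```

An explicit argument always takes precedence over the option, because getOption() is only evaluated when the argument is left at its default.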


kasperwelbers/semnet documentation built on May 20, 2019, 7:38 a.m.