as_tokenindex: Prepare a tokenIndex

View source: R/token_index.r

as_tokenindex R Documentation

Prepare a tokenIndex

Description

Creates a tokenIndex data.table. Accepts any data.frame, provided that the required columns (doc_id, sentence, token_id, parent, relation) are present. The name of each of these columns must be one of the candidate names specified in the corresponding argument.

The data in the data.frame will not be changed, with three exceptions. First, the column names will be changed if the default names are not used. Second, if a token has itself as its parent (which some parsers use to indicate the root), the parent is set to NA (as used in other parsers) to prevent infinite cycles. Third, the data will be sorted by doc_id, sentence, token_id.
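The three exceptions can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data.frame and the explicit renaming to doc_id, sentence, token_id, parent and relation are assumptions made for illustration only.

```r
# Hypothetical toy input that uses the non-default candidate column names
tokens <- data.frame(
  document_id   = c("d1", "d1", "d1"),
  sentence_id   = c(1, 1, 1),
  token_id      = c(2, 1, 3),
  head_token_id = c(2, 2, 2),  # token 2 is its own parent, i.e. the root
  dep_rel       = c("ROOT", "nsubj", "dobj"),
  stringsAsFactors = FALSE
)

# 1. Rename the columns (assumed target names, for illustration)
names(tokens) <- c("doc_id", "sentence", "token_id", "parent", "relation")

# 2. A token that has itself as parent gets parent NA
tokens$parent[tokens$parent == tokens$token_id] <- NA

# 3. Sort by doc_id, sentence, token_id
tokens <- tokens[order(tokens$doc_id, tokens$sentence, tokens$token_id), ]
```

After these steps the root token has parent NA and the rows appear in token order within the sentence.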

Usage

as_tokenindex(
  tokens,
  doc_id = c("doc_id", "document_id"),
  sentence = c("sentence", "sentence_id"),
  token_id = c("token_id"),
  parent = c("parent", "head_token_id"),
  relation = c("relation", "dep_rel"),
  paragraph = NULL
)

Arguments

tokens

A data.frame, data.table, or tokenindex.

doc_id

candidate names for the document id column

sentence

candidate names for sentence (id/index) column

token_id

candidate names for the token id column. Has to be numeric. (Some parsers return token ids as numbers with a prefix, such as t_1 or w_1; these need to be converted to numbers first.)

parent

candidate names for the parent id column. Has to be numeric.

relation

candidate names for the relation column

paragraph

Optionally, the name of a column with paragraph ids. This is only necessary if sentences are numbered per paragraph, and therefore not unique within documents. If given, sentences are re-indexed to be unique within documents.
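Two of the arguments above imply some preparation of the input: token_id and parent have to be numeric, and sentence ids that restart in each paragraph are not unique within a document. The following base R sketch shows one way to handle both by hand. The toy data and the helper strip_prefix are hypothetical, and the re-indexing shown is only an assumption about what "unique within documents" amounts to, not the package's own code.

```r
# Hypothetical toy input: prefixed ids, sentence numbering restarts per paragraph
tokens <- data.frame(
  doc_id    = c("d1", "d1", "d1", "d1"),
  paragraph = c(1, 1, 2, 2),
  sentence  = c(1, 1, 1, 1),
  token_id  = c("t_1", "t_2", "t_1", "t_2"),
  parent    = c("t_2", NA, "t_2", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop any non-digit prefix and convert to numeric
strip_prefix <- function(x) as.numeric(sub("^[^0-9]*", "", x))
tokens$token_id <- strip_prefix(tokens$token_id)
tokens$parent   <- strip_prefix(tokens$parent)

# Number each (paragraph, sentence) pair in order of appearance per document.
# Assumes rows are already in document order.
key <- paste(tokens$paragraph, tokens$sentence)
tokens$sentence <- ave(
  seq_along(key), tokens$doc_id,
  FUN = function(i) match(key[i], unique(key[i]))
)
```

In this toy case the two sentences that were both numbered 1 end up as sentences 1 and 2 of document d1.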

Value

a tokenIndex

Examples

library(rsyntax)
as_tokenindex(tokens_corenlp)

rsyntax documentation built on June 7, 2022, 9:07 a.m.