cnlp_annotate: Run the annotation pipeline on a set of documents

View source: R/annotate.R

cnlp_annotate    R Documentation

Run the annotation pipeline on a set of documents

Description

Runs the cleanNLP annotators over a given corpus of text using the desired backend. The details of which annotators to run and how to run them are specified by first calling one of: cnlp_init_stringi, cnlp_init_spacy, or cnlp_init_udpipe.

Usage

cnlp_annotate(
  input,
  backend = NULL,
  verbose = 10,
  text_name = "text",
  doc_name = "doc_id"
)

Arguments

input

an object containing the data to parse. Either a character vector with the texts (optional names can be given to provide document ids) or a data frame. The data frame should have a column named 'text' containing the raw text to parse; if there is a column named 'doc_id', it is treated as a document identifier. The names of the text and document id columns can be changed by setting text_name and doc_name. This conforms with corpus objects respecting the Text Interchange Format (TIF), while allowing for some variation.
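As a minimal sketch of the two accepted input forms (the texts and document ids below are invented for illustration):

cnlp_init_stringi()

# 1. character vector input; names become document ids
txt <- c(doc1 = "Hello world.", doc2 = "A second document.")
anno <- cnlp_annotate(txt, verbose = FALSE)

# 2. TIF-style data frame input with 'doc_id' and 'text' columns
df <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c("Hello world.", "A second document."),
  stringsAsFactors = FALSE
)
anno <- cnlp_annotate(df, verbose = FALSE)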

backend

name of the backend to use. Defaults to the most recently initialized backend.

verbose

set to a positive integer n to display a progress message every nth record. The default is 10. Set to a non-positive integer to turn off messages. Logical input is converted to an integer, so it is also possible to set TRUE (1) to display a message for every document and FALSE (0) to turn off messages.
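A short illustration of these settings (the value 100 is arbitrary, and a backend is assumed to have been initialized already, e.g. with cnlp_init_stringi()):

cnlp_annotate(un, verbose = 100)    # progress message every 100 documents
cnlp_annotate(un, verbose = TRUE)   # converted to 1: message for every document
cnlp_annotate(un, verbose = FALSE)  # converted to 0: no messages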

text_name

column name containing the text input. The default is 'text'. This parameter is ignored when input is a character vector.

doc_name

column name containing the document ids. The default is 'doc_id'. This parameter is ignored when input is a character vector.
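For example, a sketch of annotating a data frame whose columns do not use the default names (the column names 'id' and 'body' here are purely illustrative):

df <- data.frame(
  id = c("a", "b"),
  body = c("First text.", "Second text."),
  stringsAsFactors = FALSE
)
anno <- cnlp_annotate(df, doc_name = "id", text_name = "body")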

Details

The returned object is a named list in which each element contains a data frame. The document table contains one row for each document, along with all of the metadata that was passed as input. The tokens table has one row for each token detected in the input. The first three columns are always "doc_id" (an index of the input document), "sid" (an integer index of the sentence number), and "tid" (an integer index of the specific token). Together, these form a primary key for each row.

Other columns provide extracted data about each token; these differ slightly based on which backend, language, and options are supplied. A sketch of working with the dependency columns follows the list.

  • token: detected token, as given in the original input

  • token_with_ws: detected token along with white space; in theory, collapsing this field through an entire document will yield the original text

  • lemma: lemmatised version of the token; the exact form depends on the chosen language and backend

  • upos: the universal part of speech code; see https://universaldependencies.org/u/pos/all.html for more information

  • xpos: language-dependent part of speech code; the specific categories and their meaning depend on the chosen backend, model, and language

  • feats: other extracted linguistic features, typically given as Universal Dependencies (https://universaldependencies.org/u/feat/index.html), but can be model dependent; currently only provided by the udpipe backend

  • tid_source: the token id (tid) of the head word for the dependency relationship starting from this token; for the token attached to the root, this will be given as zero

  • relation: the dependency relation, usually provided using Universal Dependencies (more information available at https://universaldependencies.org/), but could differ for a specific model
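As a sketch of how tid_source can be used, the self-join below attaches each token's head word. This assumes a backend that produces dependency output (the udpipe backend, for example, which may download a model on first use; the stringi backend does not provide these columns):

cnlp_init_udpipe()
anno <- cnlp_annotate(un)
tok <- anno$token

# join each token to its head on (doc_id, sid, tid_source) = (doc_id, sid, tid)
heads <- merge(
  tok,
  tok[, c("doc_id", "sid", "tid", "token")],
  by.x = c("doc_id", "sid", "tid_source"),
  by.y = c("doc_id", "sid", "tid"),
  suffixes = c("", "_head"),
  all.x = TRUE    # tokens attached to the root (tid_source == 0) have no head row
)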

Value

a named list with components "token", "document" and (when running spacy with NER) "entity".
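A minimal sketch of accessing these components, assuming the spacy backend and its Python dependencies are installed (the entity table is only produced in that case):

cnlp_init_spacy()
anno <- cnlp_annotate(un)
head(anno$document)    # one row per input document, plus metadata
head(anno$token)       # one row per detected token
head(anno$entity)      # only present when spacy runs named entity recognition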

Author(s)

Taylor B. Arnold, taylor.arnold@acm.org

Examples

cnlp_init_stringi()    # initialize the language-agnostic stringi backend
cnlp_annotate(un)      # annotate the bundled 'un' corpus
