crf_cbind_attributes: Enrich a data.frame by adding frequently used CRF attributes

View source: R/feature_engineering.R

crf_cbind_attributes    R Documentation

Enrich a data.frame by adding frequently used CRF attributes

Description

The CRF attributes implemented in this function are simply the neighbouring information of a given field: for example, the previous word, the next word, or the combination of the previous 2 words. The function cbinds these neighbouring attributes as columns to the provided data.frame.

By default it adds the following columns to the data.frame:

  • the term itself (term[t])

  • the next term (term[t+1])

  • the term after that (term[t+2])

  • the previous term (term[t-1])

  • the term before the previous term (term[t-2])

  • as well as all combinations of these terms (bigrams/trigrams/...), where at most ngram_max terms are combined.

See the examples.
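To make the idea of neighbouring attributes concrete, here is a minimal base R sketch of what "previous term" and "next term" mean within a sequence. The helper shift_within and the column names pos_prev, pos_next and pos_bigram are hypothetical and exist only for this illustration. They are not the names used by crf_cbind_attributes, and padding the sequence boundaries with NA is an assumption rather than documented behaviour of the function.

```r
# Hypothetical helper (not part of crfsuite): shift a vector by n positions
# within each group, padding with NA where no neighbour exists.
shift_within <- function(x, group, n) {
  out <- x
  for (idx in split(seq_along(x), group)) {
    src <- seq_along(idx) + n                  # position of the neighbour
    src[src < 1 | src > length(idx)] <- NA     # outside the sequence
    out[idx] <- x[idx][src]
  }
  out
}

toy <- data.frame(doc_id = c(1, 1, 1, 2, 2),
                  pos    = c("Art", "N", "V", "Pron", "V"),
                  stringsAsFactors = FALSE)

toy$pos_prev <- shift_within(toy$pos, toy$doc_id, -1)   # term[t-1]
toy$pos_next <- shift_within(toy$pos, toy$doc_id, 1)    # term[t+1]

# Bigram of the previous and current term, joined with a separator
toy$pos_bigram <- paste(toy$pos_prev, toy$pos, sep = "-")
toy$pos_bigram[is.na(toy$pos_prev)] <- NA

toy
```

Note that the first row of document 2 has no previous term: neighbours are never taken from a different sequence.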

Usage

crf_cbind_attributes(
  data,
  terms,
  by,
  from = -2,
  to = 2,
  ngram_max = 3,
  sep = "-"
)

Arguments

data

a data.frame which will be coerced to a data.table (cbinding will be done by reference on the existing data.frame)

terms

a character vector of column names in data. For each of these columns the function looks at the preceding and following rows and cbinds that information to the data

by

a character vector of column names in data indicating the fields which define the sequence. Preceding/following terms are only looked up within the groups defined by these columns. Typically this will be a document identifier or sentence identifier in an NLP context.

from

integer, by default set to -2, indicating to look up to 2 terms before the current term

to

integer, by default set to 2, indicating to look up to 2 terms after the current term

ngram_max

integer indicating the maximum number of terms to combine (2 means bigrams, 3 trigrams, ...)

sep

character indicating how to combine the previous/next/current terms. Defaults to '-'.
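As a rough illustration of how from, to and ngram_max interact, the sketch below lists the offset windows that would be combined if only contiguous offsets were joined. The helper offset_windows is hypothetical, and the restriction to contiguous windows is an assumption made for this sketch. The page only states that up to ngram_max terms are combined, so check the column names of the actual result to see which combinations the function builds.

```r
# Hypothetical helper (not part of crfsuite): enumerate contiguous windows
# of 1..ngram_max offsets between from and to.
offset_windows <- function(from = -2, to = 2, ngram_max = 3) {
  offsets <- seq(from, to)
  windows <- list()
  for (n in seq_len(min(ngram_max, length(offsets)))) {
    for (start in seq_len(length(offsets) - n + 1)) {
      windows[[length(windows) + 1]] <- offsets[start:(start + n - 1)]
    }
  }
  windows
}

# Same settings as the first example: one term back, one term ahead
w <- offset_windows(from = -1, to = 1, ngram_max = 3)
window_labels <- vapply(w, paste, character(1), collapse = ",")
window_labels
```

Under this assumption, widening the window or raising ngram_max increases the number of added columns quickly, which is worth keeping in mind for wide data.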

Examples

library(crfsuite)

## Simulate 1000 part-of-speech tags spread over 10 documents
x <- data.frame(doc_id = sort(sample.int(n = 10, size = 1000, replace = TRUE)))
x$pos <- sample(c("Art", "N", "Prep", "V", "Adv", "Adj", "Conj", 
                  "Punc", "Num", "Pron", "Int", "Misc"), 
                  size = nrow(x), replace = TRUE)
## Add the previous/current/next tag and their combinations, per document
x <- crf_cbind_attributes(x, terms = "pos", by = "doc_id", 
                          from = -1, to = 1, ngram_max = 3)
head(x)


## Example on some real data
x <- ner_download_modeldata("conll2002-nl")
x <- crf_cbind_attributes(x, terms = c("token", "pos"), 
                          by = c("doc_id", "sentence_id"),
                          ngram_max = 3, sep = "|")


crfsuite documentation built on Sept. 17, 2023, 1:06 a.m.