txt_tagsequence: Identify a contiguous sequence of tags as 1 being entity

View source: R/utils.R

txt_tagsequenceR Documentation

Identify a contiguous sequence of tags as 1 being entity

Description

This function allows to identify contiguous sequences of text which have the same label or which follow the IOB scheme.
Named Entity Recognition or Chunking frequently follows the IOB tagging scheme where "B" means the token begins an entity, "I" means it is inside an entity, "E" means it is the end of an entity and "O" means it is not part of an entity. An example of such an annotation would be 'New', 'York', 'City', 'District' which can be tagged as 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC'.
The function looks for such sequences which start with 'B-LOC' and combines all subsequent labels of the same tagging group into 1 category. This sequence of words also gets a unique identifier such that the terms 'New', 'York', 'City', 'District' would get the same sequence identifier.

Usage

txt_tagsequence(x, entities)

Arguments

x

a character vector of categories in the sequence of occurring (e.g. B-LOC, I-LOC, I-PER, B-PER, O, O, B-PER)

entities

a list of groups, where each list element contains

  • start: A length 1 character string with the start element identifying a sequence start. E.g. 'B-LOC'

  • labels: A character vector containing all the elements which are considered being part of a same labelling sequence, including the starting element. E.g. c('B-LOC', 'I-LOC', 'E-LOC')

The list name of the group defines the label that will be assigned to the entity. If entities is not provided each possible value of x is considered an entity. See the examples.

Value

a list with elements entity_id and entity where

  • entity is a character vector of the same length as x containing entities , constructed by recoding x to the names of names(entities)

  • entity_id is an integer vector of the same length as x containing unique identifiers identfying the compound label sequence such that e.g. the sequence 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC' (New York City District) would get the same entity_id identifier.

See the examples.

Examples

x <- data.frame(
  token = c("The", "chairman", "of", "the", "Nakitoma", "Corporation", 
           "Donald", "Duck", "went", "skiing", 
            "in", "the", "Niagara", "Falls"),
  upos = c("DET", "NOUN", "ADP", "DET", "PROPN", "PROPN", 
           "PROPN", "PROPN", "VERB", "VERB", 
           "ADP", "DET", "PROPN", "PROPN"),
  label = c("O", "O", "O", "O", "B-ORG", "I-ORG", 
            "B-PERSON", "I-PERSON", "O", "O", 
            "O", "O", "B-LOCATION", "I-LOCATION"), stringsAsFactors = FALSE)
x[, c("sequence_id", "group")] <- txt_tagsequence(x$upos)
x

##
## Define entity groups following the IOB scheme
## and combine B-LOC I-LOC I-LOC sequences as 1 group (e.g. New York City) 
groups <- list(
 Location = list(start = "B-LOC", labels = c("B-LOC", "I-LOC", "E-LOC")),
 Organisation =  list(start = "B-ORG", labels = c("B-ORG", "I-ORG", "E-ORG")),
 Person = list(start = "B-PER", labels = c("B-PER", "I-PER", "E-PER")), 
 Misc = list(start = "B-MISC", labels = c("B-MISC", "I-MISC", "E-MISC")))
x[, c("entity_id", "entity")] <- txt_tagsequence(x$label, groups)
x

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.