txt_tagsequence: Identify a contiguous sequence of tags as being 1 entity
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

txt_tagsequence

R Documentation

Identify a contiguous sequence of tags as being 1 entity

Description

This function identifies contiguous sequences of text which have the same label or which follow the IOB scheme.
Named Entity Recognition or Chunking frequently follows the IOB tagging scheme, where "B" means the token begins an entity, "I" means it is inside an entity and "O" means it is not part of an entity; extended variants of the scheme add "E" for the end of an entity. An example of such an annotation would be 'New', 'York', 'City', 'District', which can be tagged as 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC'.
The function looks for such sequences, which start with a tag such as 'B-LOC', and combines all subsequent labels of the same tagging group into 1 category. Each sequence of words also gets a unique identifier, so that the terms 'New', 'York', 'City', 'District' would all get the same sequence identifier.
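
The simplest case, contiguous runs of the same label, can be illustrated with a minimal base R sketch. This is not the package's implementation, only an illustration of the idea; the helper name run_ids is hypothetical.

```r
# Hypothetical helper: give every contiguous run of identical labels its own id.
run_ids <- function(labels) {
  runs <- rle(labels)
  # repeat the index of each run as many times as the run is long
  rep(seq_along(runs$lengths), times = runs$lengths)
}

labels <- c("DET", "NOUN", "PROPN", "PROPN", "VERB")
run_ids(labels)
# the two adjacent "PROPN" tokens share one id: 1 2 3 3 4
```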

Usage

txt_tagsequence(x, entities)

Arguments

x

a character vector of categories in order of occurrence (e.g. B-LOC, I-LOC, I-PER, B-PER, O, O, B-PER)

entities

a list of groups, where each list element contains

start: A length 1 character string with the start element identifying a sequence start. E.g. 'B-LOC'
labels: A character vector containing all the elements which are considered part of the same labelling sequence, including the starting element. E.g. c('B-LOC', 'I-LOC', 'E-LOC')

The list name of the group defines the label that will be assigned to the entity. If entities is not provided, each possible value of x is considered an entity. See the examples.
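
How start and labels interact can be sketched in base R. The helper tag_sequences below is hypothetical and simplified, not udpipe's code: it assumes a new sequence opens at a start tag or wherever the group changes, and it assumes tags that belong to no group get NA, which may differ from what txt_tagsequence returns for such tags.

```r
# Hypothetical helper, for illustration only (not udpipe's implementation).
tag_sequences <- function(x, entities) {
  # map every tag to the name of the group whose labels contain it
  lookup <- unlist(lapply(names(entities), function(nm) {
    labs <- entities[[nm]]$labels
    setNames(rep(nm, length(labs)), labs)
  }))
  entity <- unname(lookup[x])  # NA for tags in no group (assumption)
  starts <- vapply(entities, function(e) e$start, character(1))
  previous <- c(NA_character_, head(entity, -1))
  # a sequence opens at a start tag, or where the group differs from the previous token
  opens <- !is.na(entity) & (x %in% starts | is.na(previous) | entity != previous)
  entity_id <- ifelse(is.na(entity), NA_integer_, cumsum(opens))
  list(entity_id = entity_id, entity = entity)
}

groups <- list(
  Location = list(start = "B-LOC", labels = c("B-LOC", "I-LOC", "E-LOC")),
  Person   = list(start = "B-PER", labels = c("B-PER", "I-PER", "E-PER")))
res <- tag_sequences(c("B-LOC", "I-LOC", "O", "B-PER", "B-PER", "I-PER"), groups)
res$entity_id  # 1 1 NA 2 3 3
res$entity     # "Location" "Location" NA "Person" "Person" "Person"
```

Note how the second 'B-PER' opens a new sequence even though it directly follows another Person tag: this is what the start element is for.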

Value

a list with elements entity_id and entity, where

entity is a character vector of the same length as x containing the entity labels, constructed by recoding x to names(entities)
entity_id is an integer vector of the same length as x containing unique identifiers identifying the compound label sequence, such that e.g. the sequence 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC' (New York City District) would get the same entity_id identifier.

See the examples.

Examples

library(udpipe)
## Example data: one row per token, with an IOB-style label per token
x <- data.frame(
  token = c("The", "chairman", "of", "the", "Nakitoma", "Corporation",
            "Donald", "Duck", "went", "skiing",
            "in", "the", "Niagara", "Falls"),
  upos = c("DET", "NOUN", "ADP", "DET", "PROPN", "PROPN",
           "PROPN", "PROPN", "VERB", "VERB",
           "ADP", "DET", "PROPN", "PROPN"),
  label = c("O", "O", "O", "O", "B-ORG", "I-ORG",
            "B-PER", "I-PER", "O", "O",
            "O", "O", "B-LOC", "I-LOC"), stringsAsFactors = FALSE)
## Without entities, each contiguous run of the same upos value is one sequence
x[, c("sequence_id", "group")] <- txt_tagsequence(x$upos)
x

##
## Define entity groups following the IOB scheme
## and combine B-LOC I-LOC I-LOC sequences as 1 group (e.g. New York City) 
groups <- list(
 Location = list(start = "B-LOC", labels = c("B-LOC", "I-LOC", "E-LOC")),
 Organisation =  list(start = "B-ORG", labels = c("B-ORG", "I-ORG", "E-ORG")),
 Person = list(start = "B-PER", labels = c("B-PER", "I-PER", "E-PER")), 
 Misc = list(start = "B-MISC", labels = c("B-MISC", "I-MISC", "E-MISC")))
x[, c("entity_id", "entity")] <- txt_tagsequence(x$label, groups)
x

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.
