Conditional Random Fields for NLP"

options(width = 1000)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = (require(udpipe, quietly = TRUE)))

Model building

CRFsuite

This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:

  1. fit a Conditional Random Field model on your own training data and
  2. use that model to predict labels (e.g. named entities, chunks or any categories you have in mind) on new data.

If you are unfamiliar with Conditional Random Field (CRF) models, this tutorial is an excellent introduction: https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf

Data format

In order to build a CRF model, you need to have

  1. sequences of labels (the hidden state Y) and
  2. attributes of the observations corresponding to the labels (X).

Regarding the label sequence:

Generally the labels follow an IOB-type scheme, with labels such as B-ORG, I-ORG, B-YOUROWNLABEL, I-YOUROWNLABEL or O, indicating the beginning of a certain category (B-), the continuation of that category (I-) or a token outside any category (O).
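
To make this concrete, below is a small, made-up example of a token sequence with IOB labels (using LOC as a location category); the label column plays the role of the hidden state Y, the token is the observation.

data.frame(token = c("Ik", "woon", "in", "Den", "Haag", "."),
           label = c("O", "O", "O", "B-LOC", "I-LOC", "O"))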

Regarding the attributes of the label observations:

The attributes of the observations are typically the term itself, the neighbouring terms, the part-of-speech tags, the neighbouring part-of-speech tags, or any specific feature you can extract which is relevant to your business domain (e.g. the number of digits in the token, how far it is from the start or end of the document, whether the token is capitalised, whether it contains an ampersand, ...).
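
As a small, purely illustrative sketch using base R only (the tokens below are made up), such hand-crafted attributes could be computed like this:

tokens <- c("BNOSAC", "ontwikkelt", "R", "&", "NLP-oplossingen", "anno", "2023")
data.frame(token          = tokens,
           is_capitalised = grepl("^[[:upper:]]", tokens),
           has_ampersand  = grepl("&", tokens, fixed = TRUE),
           nr_of_digits   = nchar(gsub("[^[:digit:]]", "", tokens)))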

Example data

As an example, let's get some Dutch data for Named Entity Recognition which was distributed as part of the CoNLL-2002 shared task. This dataset contains one row per term and provides the entity label as well as the part-of-speech tag for each term.

library(udpipe)
udmodel <- udpipe_download_model("dutch")
library(crfsuite)
x <- ner_download_modeldata("conll2002-nl")
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = inherits(x, "data.frame") && !udmodel$download_failed)
subset(x, doc_id == 100)
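
As an optional sanity check, you can inspect the distribution of the entity labels before building a model:

## Distribution of the IOB entity labels across the full dataset
table(x$label)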

Attributes

As basic feature enrichment we add the part-of-speech tag of the preceding and the next term, and do the same for the token itself; we will use these attributes later when building the model. The R package data.table has a convenient shift function for this.

library(data.table)
x <- as.data.table(x)
x <- x[, pos_previous   := shift(pos, n = 1, type = "lag"),    by = list(doc_id)]
x <- x[, pos_next       := shift(pos, n = 1, type = "lead"),   by = list(doc_id)]
x <- x[, token_previous := shift(token, n = 1, type = "lag"),  by = list(doc_id)]
x <- x[, token_next     := shift(token, n = 1, type = "lead"), by = list(doc_id)]

Note that CRFsuite handles all attributes equivalently. In order to distinguish between the columns, we need to prepend a column-specific prefix to each attribute, similar to what is shown at http://www.chokkan.org/software/crfsuite/tutorial.html. This is done using the txt_sprintf function, which is similar to sprintf but handles NA values gracefully (see the small example after the next code block).

x <- x[, pos_previous   := txt_sprintf("pos[w-1]=%s", pos_previous),     by = list(doc_id)]
x <- x[, pos_next       := txt_sprintf("pos[w+1]=%s", pos_next),         by = list(doc_id)]
x <- x[, token_previous := txt_sprintf("token[w-1]=%s", token_previous), by = list(doc_id)]
x <- x[, token_next     := txt_sprintf("token[w+1]=%s", token_next),     by = list(doc_id)]
subset(x, doc_id == 100, select = c("doc_id", "token", "token_previous", "token_next"))
x <- as.data.frame(x)
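
The following small check (not needed for model building) illustrates the graceful NA handling mentioned above: for the first token of a document there is no previous token, and txt_sprintf keeps that attribute as NA instead of producing the literal text "NA" as sprintf would.

sprintf("token[w-1]=%s", NA)
txt_sprintf("token[w-1]=%s", NA)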

Model

Train your own CRF model

Once you have data which are tagged with your own categories, you can build a CRF model. For the data from above, we split it into a training and a test dataset.

crf_train <- subset(x, data == "ned.train")
crf_test  <- subset(x, data == "testa")
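
As a quick optional check on the size of both sets:

## Number of annotated tokens available for training and testing
c(train = nrow(crf_train), test = nrow(crf_test))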

And start building your model.

model <- crf(y = crf_train$label, 
             x = crf_train[, c("pos", "pos_previous", "pos_next", 
                               "token", "token_previous", "token_next")], 
             group = crf_train$doc_id, 
             method = "lbfgs", file = "tagger.crfsuite",
             options = list(max_iterations = 25, feature.minfreq = 5, c1 = 0, c2 = 1)) 
model
stats <- summary(model)
plot(stats$iterations$loss, pch = 20, type = "b", 
     main = "Loss evolution", xlab = "Iteration", ylab = "Loss")
weights     <- coefficients(model, encoding = "UTF-8")
head(weights$transitions)
head(subset(weights$states, label %in% "B-LOC"), n = 10)

Use the model

You can use the model to get predictions of the named entities / chunks / categories you have trained. Below this is done on the holdout data: provide the model, your data with the attributes, and indicate the group the attributes belong to.

scores <- predict(model, 
                  newdata = crf_test[, c("pos", "pos_previous", "pos_next", 
                                         "token", "token_previous", "token_next")], 
                  group = crf_test$doc_id)
crf_test$entity <- scores$label
table(crf_test$entity, crf_test$label)

Create training data

In order to facilitate creating training data on your own text, with your own categories, a Shiny app is shipped inside this R package. In short, this app allows you to annotate chunks of your own text with the categories of your choice, creating training data such as the airbnb_chunks dataset used further below.

To start the app, make sure you have the following packages installed.

install.packages("shiny")
install.packages("flexdashboard")
install.packages("DT")
install.packages("writexl")

And run the app with

rmarkdown::run(file = system.file(package = "crfsuite", "app", "annotation.Rmd"))

The app was developed with shiny 1.4.0.2, flexdashboard 0.5.1.1, DT 0.13, writexl 1.2 and rmarkdown 1.6.

CRFSuite Shiny App

Improve the model

When building the model, you need to

  1. tune the parameters of the training algorithm and
  2. provide observation attributes which are relevant to your domain (feature engineering).

Both are covered below.

Model goodness of fit

In order to see which hyperparameters each training algorithm accepts, look e.g. at

crf_options("lbfgs")
crf_options("l2sgd")

If you train the model with different algorithm parameters, you will probably want to see Precision / Recall / F1 statistics in order to compare the models alongside their hyperparameters. You can easily get these with the caret R package.

library(caret)
## confusionMatrix expects factors with identical levels, so align them first
lvls     <- union(crf_test$entity, crf_test$label)
overview <- confusionMatrix(factor(crf_test$entity, levels = lvls), 
                            factor(crf_test$label, levels = lvls), 
                            mode = "prec_recall")
overview$overall
overview$byClass[, c("Precision", "Recall", "F1")]

Example with feature engineering

To obtain better models, you need to do feature engineering specific to your business domain.

The example below starts from scratch, assuming that you have plain text and have annotated some chunks using the app in this package. The manually annotated dataset is shown below.

library(crfsuite)
library(udpipe)
library(data.table)
data(airbnb_chunks, package = "crfsuite")
str(airbnb_chunks)

We want to build a classifier for the following categories:

table(airbnb_chunks$chunk_entity)

In order to build the training dataset, we need to have data at the token level. In the example below, this is done using the udpipe R package (https://CRAN.R-project.org/package=udpipe).

## Annotate text data with udpipe (version >= 0.7)
udmodel <- udpipe_download_model("dutch")
udmodel <- udpipe_load_model(udmodel$file_model)
airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
airbnb_tokens <- udpipe(x = airbnb_tokens, object = udmodel)
str(airbnb_tokens)

Once you have the data with one row per doc_id/token, you can enrich it with the chunk entity. Next, using the function crf_cbind_attributes, we enrich the training data by adding relevant attributes of the words in the neighbourhood of each word. We also add a basic column indicating whether the term is at the beginning or the end of the sentence (bos/eos). Based on that dataset, a model can be built.

x <- merge(airbnb_chunks, airbnb_tokens)
table(x$chunk_entity)
## Indicate beginning of sequence and end of sequence and sequence position
x <- as.data.table(x)
x <- x[, bos := sprintf("BOS+%s", (1:.N)-1), by = list(doc_id)]
x <- x[, eos := sprintf("EOS-%s", (.N:1)-1), by = list(doc_id)]
x <- as.data.frame(x)
## Add preceding and next tokens and parts of speech tags
x <- crf_cbind_attributes(x, terms = c("lemma", "upos"), 
                          by = c("doc_id", "sentence_id"), ngram_max = 3, sep = "|")

attributes <- c("bos", "eos", grep("lemma|upos", colnames(x), value=TRUE))
model <- crf(y = x$chunk_entity, x = x[, attributes], 
             group = x$doc_id, 
             method = "lbfgs") 
scores <- predict(model, newdata = x[, attributes], group = x$doc_id)
barplot(table(scores$label[scores$label != "O"]), col = "royalblue", cex.names = 0.75)
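
To apply such a model to completely new text, the same preprocessing has to be repeated on that text. The sketch below does this on a made-up document (newtext is purely illustrative); it assumes crf_cbind_attributes generates the same column names as during training because the same terms, ngram_max and sep are used.

## Sketch: tokenise new text with udpipe, rebuild the training attributes and predict
newtext   <- data.frame(doc_id = "new_doc_1",
                        text   = "Gezellig appartement in hartje Brussel.",
                        stringsAsFactors = FALSE)
newtokens <- udpipe(x = newtext, object = udmodel)
newtokens <- as.data.table(newtokens)
newtokens <- newtokens[, bos := sprintf("BOS+%s", (1:.N)-1), by = list(doc_id)]
newtokens <- newtokens[, eos := sprintf("EOS-%s", (.N:1)-1), by = list(doc_id)]
newtokens <- as.data.frame(newtokens)
newtokens <- crf_cbind_attributes(newtokens, terms = c("lemma", "upos"),
                                  by = c("doc_id", "sentence_id"), ngram_max = 3, sep = "|")
newscores <- predict(model, newdata = newtokens[, attributes], group = newtokens$doc_id)
data.frame(token = newtokens$token, entity = newscores$label)
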
invisible(if(file.exists("annotator.crfsuite")) file.remove("annotator.crfsuite"))
invisible(if(file.exists("tagger.crfsuite")) file.remove("tagger.crfsuite"))
invisible(if(file.exists(udmodel$file)) file.remove(udmodel$file))

Now it's up to you.

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be


