options(width = 1000) knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = (require(udpipe, quietly = TRUE)))
This R package wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:
For users unfamiliar with Conditional Random Field (CRF) models, you can read this excellent tutorial https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
In order to build a CRF model, you need to have
Generally the labels follow the IOB
type of scheme which look something like: B-ORG, I-ORG, B-YOUROWNLABEL, I-YOUROWNLABEL or O. Indicating the beginning of a certain category (B-)
, the intermediate part of a certain category (I-)
or outside the category (O)
.
I went to the New York City District on holidays
would e.g. be labelled as
O, O, O, O, B-LOCATION, I-LOCATION, I-LOCATION, I-LOCATION, O, O
The attributes of the observations are mostly something like the term itself, the neighbouring terms, the parts of speech, the neighbouring parts of speech or any specific feature you can extract and which is relevant to your business domain (e.g. the number of numbers in the token, how far is it from the start of the document or end of the document, is the token capitalised, does it contain an ampersand, ...).
As an example, let's get some data in Dutch for doing Named Entity Recognition which was distributed as part of the CoNLL-2002 shared task challenge. This dataset contains 1 row per term and provides entity labels as well as the parts of speech tag for each term.
library(udpipe) udmodel <- udpipe_download_model("dutch")
library(crfsuite) x <- ner_download_modeldata("conll2002-nl")
knitr::opts_chunk$set(echo = TRUE, message = FALSE, comment = NA, eval = inherits(x, "data.frame") && !udmodel$download_failed)
subset(x, doc_id == 100)
As basic feature enrichment we add the parts of speech tag of the preceding and the next term which we will use later when building the model and do the same for the token. The R package data.table has a nice shift function for this.
library(data.table) x <- as.data.table(x) x <- x[, pos_previous := shift(pos, n = 1, type = "lag"), by = list(doc_id)] x <- x[, pos_next := shift(pos, n = 1, type = "lead"), by = list(doc_id)] x <- x[, token_previous := shift(token, n = 1, type = "lag"), by = list(doc_id)] x <- x[, token_next := shift(token, n = 1, type = "lead"), by = list(doc_id)]
Note that CRFsuite handles all attributes equivalently, in order to distinguish between the columns, we need to prepend the column name logic to each column similar as shown at http://www.chokkan.org/software/crfsuite/tutorial.html. This is done using a custom txt_sprintf function which is similar as sprintf but handles NA values gracefully.
x <- x[, pos_previous := txt_sprintf("pos[w-1]=%s", pos_previous), by = list(doc_id)] x <- x[, pos_next := txt_sprintf("pos[w+1]=%s", pos_next), by = list(doc_id)] x <- x[, token_previous := txt_sprintf("token[w-1]=%s", token_previous), by = list(doc_id)] x <- x[, token_next := txt_sprintf("token[w-1]=%s", token_next), by = list(doc_id)] subset(x, doc_id == 100, select = c("doc_id", "token", "token_previous", "token_next")) x <- as.data.frame(x)
Once you have data which are tagged with your own categories, you can build a CRF model. On the previous data, split it into a training and test dataset.
crf_train <- subset(x, data == "ned.train") crf_test <- subset(x, data == "testa")
And start building your model.
model <- crf(y = crf_train$label, x = crf_train[, c("pos", "pos_previous", "pos_next", "token", "token_previous", "token_next")], group = crf_train$doc_id, method = "lbfgs", file = "tagger.crfsuite", options = list(max_iterations = 25, feature.minfreq = 5, c1 = 0, c2 = 1)) model stats <- summary(model) plot(stats$iterations$loss, pch = 20, type = "b", main = "Loss evolution", xlab = "Iteration", ylab = "Loss")
weights <- coefficients(model, encoding = "UTF-8") head(weights$transitions) head(subset(weights$states, label %in% "B-LOC"), n = 10)
You can use the model to get predictions of the named entity / chunks / categories you have trained. Below this is done on the holdout data. Provide the model, your data with the attributes and indicate the group the attributes belong to.
scores <- predict(model, newdata = crf_test[, c("pos", "pos_previous", "pos_next", "token", "token_previous", "token_next")], group = crf_test$doc_id) crf_test$entity <- scores$label table(crf_test$entity, crf_test$label)
In order to facilitate creating training data on your own data, with your own categories, a Shiny app is put inside this R package. To go short, this app allows you to:
To start the app, make sure you have the following packages installed.
install.packages("shiny") install.packages("flexdashboard") install.packages("DT") install.packages("writexl")
And run the app with
rmarkdown::run(file = system.file(package = "crfsuite", "app", "annotation.Rmd"))
The app was developed with shiny 1.4.0.2, flexdashboard 0.5.1.1, DT 0.13, writexl 1.2 and rmarkdown 1.6
When building the model, you need to
In order to identify the parameters of the algorithm, look e.g. at
crf_options("lbfgs") crf_options("l2sgd")
If you train the model with different algorithm parameters, you probably are interested to see the Precision / Recall / F1 statistics to compare them alongside the model hyperparameters. You can easily get these with the caret R package.
library(caret) overview <- confusionMatrix(crf_test$entity, crf_test$label, mode = "prec_recall") overview$overall overview$byClass[, c("Precision", "Recall", "F1")]
To obtain better models, you need to do feature engineering specific to your business domain.
This example below starts from scratch assuming that you have plain text and you annotated some chunks using the app in this package. Below the manually annotated dataset is shown.
library(crfsuite) library(udpipe) library(data.table) data(airbnb_chunks, package = "crfsuite") str(airbnb_chunks)
We want to build a classifier for the following categories:
table(airbnb_chunks$chunk_entity)
In order to build the training dataset, we need to have data at the token level. In the example below, this is done using the udpipe R package (https://CRAN.R-project.org/package=udpipe).
## Annotate text data with udpipe (version >= 0.7) udmodel <- udpipe_download_model("dutch") udmodel <- udpipe_load_model(udmodel$file_model)
airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")]) airbnb_tokens <- udpipe(x = airbnb_tokens, object = udmodel) str(airbnb_tokens)
Once you have the data in 1 row per doc_id/token, you can enrich this with the chunk entity. Next, by using the function crf_cbind_attributes
we enrich the training data by adding relevant attributes of words in the neighbourhood of the word. We added also a basic column indicating if the term is in the beginning or end of the sentence (bos/eos). Based on that dataset, a model can be built.
x <- merge(airbnb_chunks, airbnb_tokens) table(x$chunk_entity) ## Indicate beginning of sequence and end of sequence and sequence position x <- as.data.table(x) x <- x[, bos := sprintf("BOS+%s", (1:.N)-1), by = list(doc_id)] x <- x[, eos := sprintf("EOS-%s", (.N:1)-1), by = list(doc_id)] x <- as.data.frame(x) ## Add preceding and next tokens and parts of speech tags x <- crf_cbind_attributes(x, terms = c("lemma", "upos"), by = c("doc_id", "sentence_id"), ngram_max = 3, sep = "|") attributes <- c("bos", "eos", grep("lemma|upos", colnames(x), value=TRUE)) model <- crf(y = x$chunk_entity, x = x[, attributes], group = x$doc_id, method = "lbfgs") scores <- predict(model, newdata = x[, attributes], group = x$doc_id) barplot(table(scores$label[scores$label != "O"]), col = "royalblue", cex.names = 0.75)
invisible(if(file.exists("annotator.crfsuite")) file.remove("annotator.crfsuite")) invisible(if(file.exists("tagger.crfsuite")) file.remove("tagger.crfsuite")) invisible(if(file.exists(udmodel$file)) file.remove(udmodel$file))
Now up to you.
Need support in text mining. Contact BNOSAC: http://www.bnosac.be
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.