In the following example, we walk-through a Conditional Random Fields NER model training and prediction.

This challenging annotator will require the user to provide either a labeled dataset during fit() stage, or use external CoNLL 2003 resources to train. It may optionally use an external word embeddings set and a list of additional entities.

The CRF Annotator will also require Part-of-speech tags so we add those in the same Pipeline. Also, we could use our special RecursivePipeline, which will tell SparkNLP's NER CRF approach to use the same pipeline for tagging external resources.

1. Call necessary imports and set the resource path to read local data files

library(dplyr)
library(sparklyr)
library(sparklyr.nested)
library(sparknlp)

2. Download training dataset if not already there

# Download CoNLL 2003 Dataset
library(pins)
library(readr)

trainingDataset <- pin("https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train")

3. Load Spark session

version <- Sys.getenv("SPARK_VERSION", unset = "2.4.3")

config <- sparklyr::spark_config()

options(sparklyr.sanitize.column.names.verbose = TRUE)
options(sparklyr.verbose = TRUE)
options(sparklyr.na.omit.verbose = TRUE)
options(sparklyr.na.action.verbose = TRUE)
sc <- sparklyr::spark_connect(master = "local", version = version, config = config)

cat("Apache Spark version: ", sc$home_version, "\n")
cat("Spark NLP version: ", nlp_version())

4. Create annotator components in the right order, with their training Params. Finisher will output only NER. Put all in pipeline

nerTagger <- nlp_ner_crf(sc, input_cols = c("sentence", "token", "pos", "embeddings"),
                         output_col = "ner",
                         label_col = "label",
                         min_epochs = 1,
                         max_epochs = 1,
                         loss_eps = 1e-3,
                         l2 = 1,
                         C0 = 1250000,
                         random_seed = 0,
                         verbose = 0)

6. Load a dataset for prediction. Training is not relevant from this dataset.

data <- nlp_conll_read_dataset(sc, trainingDataset[1])
head(data)

embeddings <- nlp_word_embeddings_pretrained(sc, output_col = "embeddings")

ready_data <- ml_transform(embeddings, data)
head(ready_data, 4)

7. Training the model. Training doesn't really do anything from the dataset itself.

system.time(
  ner_model <- ml_fit(nerTagger, ready_data)
)

8. Save NerCrfModel into disk after training

ml_save(ner_model, "./pip_wo_embedd/", overwrite = TRUE)


r-spark/sparknlp documentation built on Oct. 15, 2022, 10:50 a.m.