nlp_ner_dl: Spark NLP NerDLApproach Named Entity Recognition Deep...
In r-spark/sparknlp: R Interface to John Snow Labs Spark NLP

nlp_ner_dl

R Documentation

Spark NLP NerDLApproach Named Entity Recognition Deep Learning annotator

Description

This Named Entity Recognition annotator trains a generic NER model based on neural networks. Its training data (train_ner) is either a labeled Spark dataset or an external CoNLL 2003 IOB-based Spark dataset with Annotation columns. The user must also provide a word embeddings annotation column.

Usage

nlp_ner_dl(
  x,
  input_cols,
  output_col,
  label_col = NULL,
  max_epochs = NULL,
  lr = NULL,
  po = NULL,
  batch_size = NULL,
  dropout = NULL,
  verbose = NULL,
  include_confidence = NULL,
  include_all_confidence_scores = NULL,
  random_seed = NULL,
  graph_folder = NULL,
  validation_split = NULL,
  eval_log_extended = NULL,
  enable_output_logs = NULL,
  output_logs_path = NULL,
  enable_memory_optimizer = NULL,
  uid = random_string("ner_dl_")
)

Arguments

`x`	A `spark_connection`, `ml_pipeline`, or a `tbl_spark`.
`input_cols`	Input columns. String array.
`output_col`	Output column. String.
`label_col`	Name of the Annotation-type column that holds the label for each token. Used when DatasetPath is not provided (string)
`max_epochs`	Maximum number of epochs to train (integer)
`lr`	Initial learning rate (float)
`po`	Learning rate decay coefficient. The effective learning rate at a given epoch is lr / (1 + po * epoch) (float)
`batch_size`	Batch size for training (integer)
`dropout`	Dropout coefficient (float)
`verbose`	Verbosity level (integer)
`include_confidence`	Whether to include confidence values (boolean)
`include_all_confidence_scores`	Whether to include all confidence scores in the annotation metadata, or only the score of the predicted tag (boolean)
`random_seed`	Random seed (integer)
`graph_folder`	Path to the folder that contains external graph files
`validation_split`	Proportion of the data to use for validation (float)
`eval_log_extended`	Whether to extend the validation logs with the time and evaluation of each label (boolean)
`enable_output_logs`	Whether to enable the TensorFlow output logs (boolean)
`output_logs_path`	Path for the output logs
`enable_memory_optimizer`	Whether to allow training NerDLApproach on a dataset larger than the available memory
`uid`	A character string used to uniquely identify the ML estimator.
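
To illustrate how `lr` and `po` interact, the following minimal sketch evaluates the formula lr / (1 + po * epoch) given above in plain R. The helper name `effective_lr` and the values 0.003 and 0.005 are illustrative assumptions, not part of the package, and whether the annotator counts epochs from 0 or from 1 is not stated here.

```r
# Hypothetical helper: effective learning rate at a given epoch,
# following the documented formula lr / (1 + po * epoch).
effective_lr <- function(lr, po, epoch) {
  lr / (1 + po * epoch)
}

# Illustrative values only; epochs are assumed to start at 0.
epochs <- 0:4
rates <- effective_lr(lr = 0.003, po = 0.005, epoch = epochs)

# Show how the rate shrinks as training proceeds.
print(data.frame(epoch = epochs, learning_rate = rates))
```

A larger `po` makes the rate fall faster, while `po = 0` keeps it constant at `lr`.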

Details

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets. See https://nlp.johnsnowlabs.com/docs/en/annotators#ner-dl

Value

The object returned depends on the class of x.

spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.

When x is a spark_connection, the function returns a NerDLApproach estimator. When x is a ml_pipeline, it returns the pipeline with the NerDLApproach added. When x is a tbl_spark, it returns a transformed tbl_spark (note that the DataFrame passed in must contain the columns specified in input_cols).
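
Examples

The sketch below shows one way the estimator might be constructed from a spark_connection, using only arguments listed under Usage. The wrapper name `build_ner_stage`, the column names ("sentence", "token", "embeddings", "ner", "label"), and all hyperparameter values are assumptions chosen for illustration. It also assumes the sparklyr and sparknlp packages are installed and that earlier pipeline stages produce the input columns.

```r
# Hypothetical wrapper: build a NerDLApproach estimator from a Spark connection.
# Column names and hyperparameter values are illustrative assumptions.
build_ner_stage <- function(sc) {
  sparknlp::nlp_ner_dl(
    sc,
    input_cols = c("sentence", "token", "embeddings"),  # assumed upstream columns
    output_col = "ner",
    label_col = "label",      # assumed name of the per-token label column
    max_epochs = 1,
    lr = 0.003,
    po = 0.005,
    batch_size = 8,
    validation_split = 0.1,
    random_seed = 42
  )
}

# Usage (requires a working Spark installation, so it is left commented out):
# sc <- sparklyr::spark_connect(master = "local")
# ner_stage <- build_ner_stage(sc)
```

Because x is a spark_connection here, the result is an estimator that can be added to a pipeline rather than a fitted model.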