nlp_medical_ner: Spark NLP MedicalNerModel Named Entity Recognition Deep...

View source: R/medical-ner.R

nlp_medical_ner    R Documentation

Spark NLP MedicalNerModel Named Entity Recognition Deep Learning annotator

Description

This Named Entity Recognition annotator trains a generic NER model based on neural networks. Its training data (train_ner) is either a labeled Spark dataset or an external CoNLL 2003 IOB-based Spark dataset with Annotation columns. The user must also provide a word embeddings annotation column.

Usage

nlp_medical_ner(
  x,
  input_cols,
  output_col,
  label_col = NULL,
  max_epochs = NULL,
  lr = NULL,
  po = NULL,
  batch_size = NULL,
  dropout = NULL,
  verbose = NULL,
  include_confidence = NULL,
  random_seed = NULL,
  graph_folder = NULL,
  validation_split = NULL,
  eval_log_extended = NULL,
  enable_output_logs = NULL,
  output_logs_path = NULL,
  enable_memory_optimizer = NULL,
  pretrained_model_path = NULL,
  override_existing_tags = NULL,
  tags_mapping = NULL,
  test_dataset = NULL,
  use_contrib = NULL,
  log_prefix = NULL,
  include_all_confidence_scores = NULL,
  graph_file = NULL,
  uid = random_string("medical_ner_")
)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_cols

Input columns. String array.

output_col

Output column. String.

label_col

Name of the Annotation-type column that holds the labeled data, one label per token. Used when DatasetPath is not provided (string)

max_epochs

Maximum number of epochs to train (integer)

lr

Initial learning rate (float)

po

Learning rate decay coefficient. Real Learning Rate: lr / (1 + po * epoch) (float)
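The decay rule above can be checked with a few lines of base R. This is only a sketch of the documented formula: the helper name effective_lr, the sample values for lr and po, and the assumption that epochs are counted from 0 are illustrative and not taken from the package.

```r
# Effective learning rate under the documented decay rule:
#   lr / (1 + po * epoch)
# effective_lr is a hypothetical helper, not part of sparknlp.
effective_lr <- function(lr, po, epoch) {
  lr / (1 + po * epoch)
}

# Illustrative values only (not package defaults).
lr <- 0.003
po <- 0.005
epochs <- 0:4  # assumption: the first epoch is numbered 0

schedule <- data.frame(
  epoch = epochs,
  learning_rate = effective_lr(lr, po, epochs)
)
print(schedule)
```

With po = 0 the rate stays constant at lr; larger po values make the rate fall faster as training proceeds.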

batch_size

Batch size for training (integer)

dropout

Dropout coefficient (float)

verbose

Verbosity level (integer)

include_confidence

whether to include confidence values (boolean)

random_seed

Random seed (integer)

graph_folder

Folder path that contains external graph files

validation_split

proportion of the data to use for validation (float)

eval_log_extended

Whether to extend the validation logs so that they show the time and the evaluation of each label (boolean)

enable_output_logs

whether to enable the TensorFlow output logs (boolean)

output_logs_path

path for the output logs

enable_memory_optimizer

Whether to allow training NerDLApproach on a dataset larger than the available memory (boolean)

pretrained_model_path

Location of an already trained MedicalNerModel to use as the starting point for training the new model.

override_existing_tags

controls whether to override already learned tags when using a pretrained model to initialize the new model.

tags_mapping

a string list specifying how old tags are mapped to new ones. (e.g. c("B-PER,B-VIP", "I-PER,I-VIP"))
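Each element of tags_mapping is a single "old,new" string. As a sketch, the vector can be built from a named character vector so that the old and new tags stay paired; make_tags_mapping is a hypothetical helper written for this illustration, not a function of the package.

```r
# Hypothetical helper: turn a named vector (names = old tags, values = new tags)
# into the "old,new" strings that tags_mapping expects.
make_tags_mapping <- function(renames) {
  unname(paste(names(renames), renames, sep = ","))
}

tags_mapping <- make_tags_mapping(c("B-PER" = "B-VIP", "I-PER" = "I-VIP"))
print(tags_mapping)
```

The resulting vector is the same as the c("B-PER,B-VIP", "I-PER,I-VIP") example given above and can be passed as the tags_mapping argument.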

test_dataset

path to test dataset

use_contrib

whether to use contrib LSTM cells

log_prefix

a string prefix to be included in the logs

include_all_confidence_scores

whether to include confidence scores in annotation metadata

graph_file

Path to an external graph file

uid

A character string used to uniquely identify the ML estimator.

Details

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets. See https://nlp.johnsnowlabs.com/docs/en/annotators#ner-dl

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.

When x is a spark_connection, the function returns a NerDLApproach estimator. When x is a ml_pipeline, it returns the pipeline with the NerDLApproach added. When x is a tbl_spark, it returns a transformed tbl_spark (note that the DataFrame passed in must contain the columns named in input_cols).
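A minimal sketch of the pipeline case follows. It assumes the sparknlp package is installed with access to the licensed healthcare library, and that earlier stages already produce columns named "sentence", "token" and "embeddings" plus a "label" column. Those column names, the hyperparameter values, and the helper add_medical_ner_stage are illustrative assumptions, not package defaults.

```r
# Illustrative hyperparameters; every name here is an argument listed under Usage.
ner_params <- list(
  input_cols = c("sentence", "token", "embeddings"),  # assumed upstream column names
  output_col = "ner",
  label_col = "label",                                # assumed label column name
  max_epochs = 10L,
  lr = 0.003,
  po = 0.005,
  batch_size = 32L,
  validation_split = 0.2,
  random_seed = 42L
)

# Hypothetical helper: x may be a spark_connection, ml_pipeline, or tbl_spark.
# The sparknlp namespace is only looked up when the helper is actually called.
add_medical_ner_stage <- function(x, params = ner_params) {
  do.call(sparknlp::nlp_medical_ner, c(list(x), params))
}
```

Calling add_medical_ner_stage(pipeline) on an existing ml_pipeline would then return that pipeline with the NER training stage appended, as described under Value.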


r-spark/sparknlp documentation built on Oct. 15, 2022, 10:50 a.m.