In r-spark/sparknlp: R Interface to John Snow Labs Spark NLP

nlp_medical_ner

R Documentation

Spark NLP MedicalNerModel Named Entity Recognition Deep Learning annotator

Description

This Named Entity Recognition annotator trains a generic NER model based on neural networks. Its training data (train_ner) is either a labeled Spark dataset or an external CoNLL 2003 IOB-based Spark dataset, with Annotation columns. The user must also provide a word embeddings annotation column.

Usage

nlp_medical_ner(
  x,
  input_cols,
  output_col,
  label_col = NULL,
  max_epochs = NULL,
  lr = NULL,
  po = NULL,
  batch_size = NULL,
  dropout = NULL,
  verbose = NULL,
  include_confidence = NULL,
  random_seed = NULL,
  graph_folder = NULL,
  validation_split = NULL,
  eval_log_extended = NULL,
  enable_output_logs = NULL,
  output_logs_path = NULL,
  enable_memory_optimizer = NULL,
  pretrained_model_path = NULL,
  override_existing_tags = NULL,
  tags_mapping = NULL,
  test_dataset = NULL,
  use_contrib = NULL,
  log_prefix = NULL,
  include_all_confidence_scores = NULL,
  graph_file = NULL,
  uid = random_string("medical_ner_")
)

Arguments

`x`	A `spark_connection`, `ml_pipeline`, or a `tbl_spark`.
`input_cols`	Input columns. String array.
`output_col`	Output column. String.
`label_col`	If DatasetPath is not provided, this column (a sequence of Annotation type) should contain the labeled data per token (string)
`max_epochs`	Maximum number of epochs to train (integer)
`lr`	Initial learning rate (float)
`po`	Learning rate decay coefficient. Real Learning Rate: lr / (1 + po * epoch) (float)
`batch_size`	Batch size for training (integer)
`dropout`	Dropout coefficient (float)
`verbose`	Verbosity level (integer)
`include_confidence`	whether to include confidence values (boolean)
`random_seed`	Random seed (integer)
`graph_folder`	folder path that contains external graph files
`validation_split`	proportion of the data to use for validation (float)
`eval_log_extended`	whether to extend the validation logs so that they display the time and evaluation of each label (boolean)
`enable_output_logs`	whether to enable the TensorFlow output logs (boolean)
`output_logs_path`	path for the output logs
`enable_memory_optimizer`	whether to allow training NerDLApproach on a dataset larger than the available memory
`pretrained_model_path`	location of an already trained MedicalNerModel, which is used as the starting point for training the new model.
`override_existing_tags`	controls whether to override already learned tags when using a pretrained model to initialize the new model.
`tags_mapping`	a string list specifying how old tags are mapped to new ones. (e.g. c("B-PER,B-VIP", "I-PER,I-VIP"))
`test_dataset`	path to test dataset
`use_contrib`	whether to use contrib LSTM cells
`log_prefix`	a string prefix to be included in the logs
`include_all_confidence_scores`	whether to include confidence scores in annotation metadata
`graph_file`	path to an external graph file
`uid`	A character string used to uniquely identify the ML estimator.
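
The decay formula given for `po` can be made concrete with a few lines of plain R. This is only a sketch of the arithmetic: the values of `lr` and `po` are illustrative assumptions rather than package defaults, `effective_lr` is a hypothetical helper that is not part of sparknlp, and counting epochs from 0 is also an assumption, since the formula does not say where counting starts.

```r
# Effective learning rate per epoch, following the formula stated for `po`:
#   lr / (1 + po * epoch)
# `effective_lr` is a hypothetical helper, not a sparknlp function.
effective_lr <- function(lr, po, epoch) {
  lr / (1 + po * epoch)
}

lr <- 0.001      # illustrative initial learning rate (assumption)
po <- 0.005      # illustrative decay coefficient (assumption)
epochs <- 0:4    # assumes epochs are counted from 0

# One row per epoch, showing how the rate shrinks as training proceeds.
schedule <- data.frame(
  epoch = epochs,
  learning_rate = effective_lr(lr, po, epochs)
)
print(schedule)
```

With `po = 0` the rate stays at `lr` for every epoch; larger values of `po` make it fall faster.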

Details

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets. See https://nlp.johnsnowlabs.com/docs/en/annotators#ner-dl

Value

The object returned depends on the class of x.

spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.

When x is a spark_connection, the function returns a NerDLApproach estimator. When x is a ml_pipeline, it returns the pipeline with the NerDLApproach added. When x is a tbl_spark, it returns a transformed tbl_spark (note that the DataFrame passed in must have the input_cols specified).
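
The ml_pipeline case might be used roughly as follows. This is a hedged sketch rather than a tested recipe: it assumes sparklyr, the sparknlp R package, and a licensed Spark NLP for Healthcare installation are available. It also assumes that `nlp_word_embeddings_pretrained` can be called without a model name, and that the training data already contains `sentence`, `token`, and `label` annotation columns. The wrapper `build_medical_ner_pipeline`, the column names, and the hyperparameter values are illustrative and do not come from the package documentation.

```r
# Sketch only: nothing touches Spark until the function is called with a live
# connection. `build_medical_ner_pipeline` is a hypothetical wrapper.
build_medical_ner_pipeline <- function(sc) {
  # Word embeddings are required as an input annotation column.
  # Relying on the default pretrained model is an assumption.
  embeddings <- sparknlp::nlp_word_embeddings_pretrained(
    sc,
    input_cols = c("sentence", "token"),
    output_col = "embeddings"
  )

  # Start a pipeline from the embeddings stage, then append the NER estimator.
  # Because x is a ml_pipeline here, the result is again a ml_pipeline.
  sparknlp::nlp_medical_ner(
    sparklyr::ml_pipeline(embeddings),
    input_cols = c("sentence", "token", "embeddings"),
    output_col = "ner",
    label_col = "label",       # assumed name of the per-token label column
    max_epochs = 10,           # illustrative values, not defaults
    lr = 0.001,
    po = 0.005,
    batch_size = 8,
    validation_split = 0.2,
    random_seed = 0
  )
}

# Hypothetical usage, given a tbl_spark `training_data` with the columns above:
# sc <- sparklyr::spark_connect(master = "local")
# pipeline <- build_medical_ner_pipeline(sc)
# model <- sparklyr::ml_fit(pipeline, training_data)
```

Building the pipeline first and fitting it separately keeps the estimator reusable across several training sets, whereas passing a tbl_spark directly as `x` constructs and applies the estimator in one step.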