nlp_context_spell_checker: Spark NLP ContextSpellCheckerApproach

View source: R/context-spell-checker.R

Spark NLP ContextSpellCheckerApproach

Description

Spark ML estimator that implements the noisy channel model spelling-correction algorithm. Correction candidates are extracted by combining context information and word information. See https://nlp.johnsnowlabs.com/docs/en/annotators#context-spellchecker
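A minimal pipeline sketch follows. It assumes a local Spark connection and the sparknlp helpers nlp_document_assembler() and nlp_tokenizer(); the training table and column names are illustrative, not part of this function's API.

library(sparklyr)
library(sparknlp)

sc <- spark_connect(master = "local")

# Corpus of well-formed text; the approach fits its language model on it.
train_df <- data.frame(text = c(
  "The quick brown fox jumps over the lazy dog.",
  "She sells sea shells by the sea shore."
))
train_tbl <- copy_to(sc, train_df)

pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token") %>%
  nlp_context_spell_checker(input_cols = c("token"), output_col = "corrected")

model <- ml_fit(pipeline, train_tbl)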

Usage

nlp_context_spell_checker(
  x,
  input_cols,
  output_col,
  batch_size = NULL,
  compound_count = NULL,
  case_strategy = NULL,
  class_count = NULL,
  epochs = NULL,
  error_threshold = NULL,
  final_learning_rate = NULL,
  initial_learning_rate = NULL,
  lm_classes = NULL,
  lazy_annotator = NULL,
  max_candidates = NULL,
  max_window_len = NULL,
  min_count = NULL,
  tradeoff = NULL,
  validation_fraction = NULL,
  weighted_dist_path = NULL,
  word_max_dist = NULL,
  uid = random_string("context_spell_checker_")
)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_cols

Input columns. String array.

output_col

Output column. String.

batch_size

Batch size for training the neural language model (NLM). Defaults to 24.

compound_count

Min number of times a compound word should appear to be included in the vocabulary.

case_strategy

What case combinations to try when generating candidates. ALL_UPPER_CASE = 0, FIRST_LETTER_CAPITALIZED = 1, ALL = 2. Defaults to 2.

class_count

Min number of times a word must appear in the corpus so that it is not considered part of a special class.

epochs

Number of epochs to train the language model. Defaults to 2.

error_threshold

Threshold perplexity for a word to be considered an error. Defaults to 10.0.

final_learning_rate

Final learning rate for the LM. Defaults to 0.0005.

initial_learning_rate

Initial learning rate for the LM. Defaults to 0.7.

lm_classes

Number of classes to use during factorization of the softmax output in the LM. Defaults to 2000.

lazy_annotator

Boolean. Whether this annotator is lazy: a lazy annotator stands idle in the pipeline and runs only when invoked by another annotator (for example within a RecursivePipeline).

max_candidates

Maximum number of candidates for every word. Defaults to 6.

max_window_len

Maximum size for the window used to remember history prior to every correction. Defaults to 5.

min_count

Min number of times a token should appear to be included in vocab. Defaults to 3.0.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model. Defaults to 18.0.

validation_fraction

Percentage of datapoints to use for validation. Defaults to 0.1.

weighted_dist_path

The path to the file containing the weights for the Levenshtein distance.

word_max_dist

Maximum distance for the generated candidates for every word. Defaults to 3.

uid

A character string used to uniquely identify the ML estimator.
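As a sketch of setting the candidate-generation arguments documented above (the values are illustrative, not tuned recommendations; sc is an existing spark_connection):

spell_checker <- nlp_context_spell_checker(
  sc,
  input_cols = c("token"),
  output_col = "corrected",
  case_strategy = 1,     # FIRST_LETTER_CAPITALIZED
  max_candidates = 6,    # at most 6 candidates per word
  word_max_dist = 3,     # maximum edit distance for candidates
  tradeoff = 18.0        # word-error cost vs. LM transition cost
)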

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
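For example (a sketch; tokens_tbl is assumed to already carry the token column produced by an upstream tokenizer):

# spark_connection: returns an estimator for composing pipelines.
est <- nlp_context_spell_checker(sc, input_cols = c("token"), output_col = "corrected")

# ml_pipeline: returns the pipeline with the estimator appended.
pipeline <- ml_pipeline(sc) %>%
  nlp_context_spell_checker(input_cols = c("token"), output_col = "corrected")

# tbl_spark: fits immediately and returns the fitted NLP model.
model <- nlp_context_spell_checker(tokens_tbl, input_cols = c("token"), output_col = "corrected")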

