nlp_context_spell_checker: Spark NLP ContextSpellCheckerApproach

View source: R/context-spell-checker.R

Spark NLP ContextSpellCheckerApproach

Description

Spark ML estimator that implements the noisy channel model spelling-correction algorithm. Correction candidates are extracted by combining context information and word information. See https://nlp.johnsnowlabs.com/docs/en/annotators#context-spellchecker
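A minimal pipeline sketch follows. It assumes a local Spark connection and the sparknlp helpers nlp_document_assembler() and nlp_tokenizer(); the training table and column names are illustrative, not part of this function's API.

library(sparklyr)
library(sparknlp)

sc <- spark_connect(master = "local")

# Corpus of well-formed text; the approach fits its language model on it.
train_df <- data.frame(text = c(
  "The quick brown fox jumps over the lazy dog.",
  "She sells sea shells by the sea shore."
))
train_tbl <- copy_to(sc, train_df)

pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token") %>%
  nlp_context_spell_checker(input_cols = c("token"), output_col = "corrected")

model <- ml_fit(pipeline, train_tbl)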

Usage

nlp_context_spell_checker(
  x,
  input_cols,
  output_col,
  batch_size = NULL,
  compound_count = NULL,
  case_strategy = NULL,
  class_count = NULL,
  epochs = NULL,
  error_threshold = NULL,
  final_learning_rate = NULL,
  initial_learning_rate = NULL,
  lm_classes = NULL,
  lazy_annotator = NULL,
  max_candidates = NULL,
  max_window_len = NULL,
  min_count = NULL,
  tradeoff = NULL,
  validation_fraction = NULL,
  weighted_dist_path = NULL,
  word_max_dist = NULL,
  uid = random_string("context_spell_checker_")
)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_cols

Input columns. String array.

output_col

Output column. String.

batch_size

Batch size for training the neural language model (NLM). Defaults to 24.

compound_count

Min number of times a compound word should appear to be included in the vocabulary.

case_strategy

What case combinations to try when generating candidates. ALL_UPPER_CASE = 0, FIRST_LETTER_CAPITALIZED = 1, ALL = 2. Defaults to 2.

class_count

Min number of times a word must appear in the corpus so that it is not considered part of a special class.

epochs

Number of epochs to train the language model. Defaults to 2.

error_threshold

Threshold perplexity for a word to be considered an error. Defaults to 10.0.

final_learning_rate

Final learning rate for the LM. Defaults to 0.0005.

initial_learning_rate

Initial learning rate for the LM. Defaults to 0.7.

lm_classes

Number of classes to use during factorization of the softmax output in the LM. Defaults to 2000.

lazy_annotator

Boolean. Whether this annotator is lazy: a lazy annotator stands idle in the pipeline and runs only when invoked by another annotator (for example within a RecursivePipeline).

max_candidates

Maximum number of candidates for every word. Defaults to 6.

max_window_len

Maximum size for the window used to remember history prior to every correction. Defaults to 5.

min_count

Min number of times a token should appear to be included in vocab. Defaults to 3.0.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model. Defaults to 18.0.

validation_fraction

Percentage of datapoints to use for validation. Defaults to 0.1.

weighted_dist_path

The path to the file containing the weights for the Levenshtein distance.

word_max_dist

Maximum distance for the generated candidates for every word. Defaults to 3.

uid

A character string used to uniquely identify the ML estimator.
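As a sketch of setting the candidate-generation arguments documented above (the values are illustrative, not tuned recommendations; sc is an existing spark_connection):

spell_checker <- nlp_context_spell_checker(
  sc,
  input_cols = c("token"),
  output_col = "corrected",
  case_strategy = 1,     # FIRST_LETTER_CAPITALIZED
  max_candidates = 6,    # at most 6 candidates per word
  word_max_dist = 3,     # maximum edit distance for candidates
  tradeoff = 18.0        # word-error cost vs. LM transition cost
)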

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
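For example (a sketch; tokens_tbl is assumed to already carry the token column produced by an upstream tokenizer):

# spark_connection: returns an estimator for composing pipelines.
est <- nlp_context_spell_checker(sc, input_cols = c("token"), output_col = "corrected")

# ml_pipeline: returns the pipeline with the estimator appended.
pipeline <- ml_pipeline(sc) %>%
  nlp_context_spell_checker(input_cols = c("token"), output_col = "corrected")

# tbl_spark: fits immediately and returns the fitted NLP model.
model <- nlp_context_spell_checker(tokens_tbl, input_cols = c("token"), output_col = "corrected")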

