View source: R/context-spell-checker.R
| nlp_context_spell_checker | R Documentation |
Spark ML estimator that implements the Noisy Channel Model spell-checking algorithm. Correction candidates are extracted by combining context information and word information. See https://nlp.johnsnowlabs.com/docs/en/annotators#context-spellchecker
nlp_context_spell_checker(
x,
input_cols,
output_col,
batch_size = NULL,
compound_count = NULL,
case_strategy = NULL,
class_count = NULL,
epochs = NULL,
error_threshold = NULL,
final_learning_rate = NULL,
initial_learning_rate = NULL,
lm_classes = NULL,
lazy_annotator = NULL,
max_candidates = NULL,
max_window_len = NULL,
min_count = NULL,
tradeoff = NULL,
validation_fraction = NULL,
weighted_dist_path = NULL,
word_max_dist = NULL,
uid = random_string("context_spell_checker_")
)
x |
A spark_connection, ml_pipeline, or tbl_spark. |
input_cols |
Input columns. String array. |
output_col |
Output column. String. |
batch_size |
Batch size for training the neural language model. Defaults to 24. |
compound_count |
Min number of times a compound word should appear in the corpus to be included in the vocabulary. |
case_strategy |
What case combinations to try when generating candidates. ALL_UPPER_CASE = 0, FIRST_LETTER_CAPITALIZED = 1, ALL = 2. Defaults to 2. |
class_count |
Min number of times a word must appear in the corpus to not be considered part of a special word class. |
epochs |
Number of epochs to train the language model. Defaults to 2. |
error_threshold |
Threshold perplexity above which a word is considered an error. Defaults to 10.0. |
final_learning_rate |
Final learning rate for the LM. Defaults to 0.0005 |
initial_learning_rate |
Initial learning rate for the LM. Defaults to 0.7 |
lm_classes |
Number of classes to use during factorization of the softmax output in the LM. Defaults to 2000. |
lazy_annotator |
Boolean controlling whether the annotator is evaluated lazily (e.g. within a RecursivePipeline). |
max_candidates |
Maximum number of candidates for every word. Defaults to 6. |
max_window_len |
Maximum size for the window used to remember history prior to every correction. Defaults to 5. |
min_count |
Min number of times a token should appear in the corpus to be included in the vocabulary. Defaults to 3.0. |
tradeoff |
Tradeoff between the cost of a word error and a transition in the language model. Defaults to 18.0. |
validation_fraction |
Percentage of datapoints to use for validation. Defaults to 0.1. |
weighted_dist_path |
Path to the file containing the weights for the Levenshtein distance. |
word_max_dist |
Maximum distance for the generated candidates for every word. Defaults to 3. |
uid |
A character string used to uniquely identify the ML estimator. |
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of an ml_estimator object. The object contains a pointer to
a Spark Estimator object and can be used to compose
Pipeline objects.
ml_pipeline: When x is an ml_pipeline, the function returns an ml_pipeline with
the NLP estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed and then
immediately fit with the input tbl_spark, returning an NLP model.
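As a sketch of typical usage, the estimator can be composed into a pipeline alongside other sparknlp annotators. The column names, the training table, and the surrounding pipeline stages here are illustrative assumptions following sparklyr/sparknlp conventions, not part of this function's definition; fitting requires a live Spark connection.

```r
library(sparklyr)
library(sparknlp)

# Connect to a local Spark cluster (assumes Spark is installed locally)
sc <- spark_connect(master = "local")

# Build a pipeline: raw text -> document -> tokens -> spell-checked tokens.
# Column names ("text", "document", "token", "corrected") are illustrative.
pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token") %>%
  nlp_context_spell_checker(
    input_cols = c("token"),
    output_col = "corrected",
    epochs = 2,            # language-model training epochs
    word_max_dist = 3,     # max edit distance for generated candidates
    error_threshold = 10   # perplexity threshold for flagging errors
  )

# Fitting trains the language model on a corpus held in a Spark table,
# e.g. a hypothetical training_tbl loaded via spark_read_text():
# model <- ml_fit(pipeline, training_tbl)
```

Because x is a spark_connection via ml_pipeline here, the spell checker is appended as an estimator stage; passing a tbl_spark instead would fit it immediately, as described above.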