View source: R/context-spell-checker.R
nlp_context_spell_checker | R Documentation
Spark ML estimator that implements the noisy channel model spell-checking algorithm. Correction candidates are extracted by combining context information and word information. See https://nlp.johnsnowlabs.com/docs/en/annotators#context-spellchecker
nlp_context_spell_checker(
  x,
  input_cols,
  output_col,
  batch_size = NULL,
  compound_count = NULL,
  case_strategy = NULL,
  class_count = NULL,
  epochs = NULL,
  error_threshold = NULL,
  final_learning_rate = NULL,
  initial_learning_rate = NULL,
  lm_classes = NULL,
  lazy_annotator = NULL,
  max_candidates = NULL,
  max_window_len = NULL,
  min_count = NULL,
  tradeoff = NULL,
  validation_fraction = NULL,
  weighted_dist_path = NULL,
  word_max_dist = NULL,
  uid = random_string("context_spell_checker_")
)
x |
A spark_connection, ml_pipeline, or tbl_spark. |
input_cols |
Input columns. String array. |
output_col |
Output column. String. |
batch_size |
Batch size for training the neural language model. Defaults to 24.
compound_count |
Min number of times a compound word should appear in the corpus to be included in the vocabulary.
case_strategy |
What case combinations to try when generating candidates. ALL_UPPER_CASE = 0, FIRST_LETTER_CAPITALIZED = 1, ALL = 2. Defaults to 2. |
class_count |
Min number of times a word needs to appear in the corpus to not be considered part of a special class.
epochs |
Number of epochs to train the language model. Defaults to 2. |
error_threshold |
Threshold perplexity for a word to be considered as an error. Defaults to 10.0.
final_learning_rate |
Final learning rate for the LM. Defaults to 0.0005 |
initial_learning_rate |
Initial learning rate for the LM. Defaults to 0.7 |
lm_classes |
Number of classes to use during factorization of the softmax output in the LM. Defaults to 2000. |
lazy_annotator |
Boolean. Whether this annotator is lazy, i.e. not executed as part of the pipeline but only when invoked by another annotator (as in a recursive pipeline).
max_candidates |
Maximum number of candidates for every word. Defaults to 6. |
max_window_len |
Maximum size for the window used to remember history prior to every correction. Defaults to 5. |
min_count |
Min number of times a token should appear to be included in vocab. Defaults to 3.0.
tradeoff |
Tradeoff between the cost of a word error and a transition in the language model. Defaults to 18.0.
validation_fraction |
Fraction of data points to use for validation. Defaults to 0.1.
weighted_dist_path |
The path to a file containing the weights for the Levenshtein distance.
word_max_dist |
Maximum distance for the generated candidates for every word. Defaults to 3. |
uid |
A character string used to uniquely identify the ML estimator. |
The object returned depends on the class of x.

spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
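A minimal sketch of pipeline composition with this estimator. It assumes a working sparklyr connection with Spark NLP available, and that the package's companion annotators nlp_document_assembler and nlp_tokenizer are used to produce token annotations; the data frame, column names, and hyperparameter values are illustrative only.

```r
library(sparklyr)
library(sparknlp)

# Connect to a local Spark instance (assumes Spark NLP is on the classpath)
sc <- spark_connect(master = "local")

# Illustrative training corpus: one sentence per row in a 'text' column
train_tbl <- sdf_copy_to(sc, data.frame(text = c("the quick brown fox",
                                                 "jumps over the lazy dog")))

# Compose a pipeline: raw text -> document -> tokens -> spell checker
pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token") %>%
  nlp_context_spell_checker(
    input_cols = c("token"),
    output_col = "checked",
    epochs = 2,        # language-model training epochs
    word_max_dist = 3  # max edit distance for generated candidates
  )

# Fitting trains the noisy channel language model on the input corpus
model <- ml_fit(pipeline, train_tbl)
```

Because dispatch is on the class of x, the same call can instead take the spark_connection directly (returning a standalone estimator) or a tbl_spark (fitting immediately and returning a model).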