| nlp_tokenizer | R Documentation |
Spark ML estimator that identifies tokens using open tokenization standards. A few rules help customize it when the defaults do not fit your needs. See https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer
nlp_tokenizer(
x,
input_cols,
output_col,
exceptions = NULL,
exceptions_path = NULL,
exceptions_path_read_as = "LINE_BY_LINE",
exceptions_path_options = list(format = "text"),
case_sensitive_exceptions = NULL,
context_chars = NULL,
split_chars = NULL,
split_pattern = NULL,
target_pattern = NULL,
suffix_pattern = NULL,
prefix_pattern = NULL,
infix_patterns = NULL,
uid = random_string("tokenizer_")
)
x |
A spark_connection, ml_pipeline, or tbl_spark. |
input_cols |
Input columns. String array. |
output_col |
Output column. String. |
exceptions |
String array. List of tokens to not alter at all. Allows composite tokens like two worded tokens that the user may not want to split. |
exceptions_path |
NOTE: NOT IMPLEMENTED. String. Path to a text file containing the list of token exceptions. |
exceptions_path_read_as |
String. How to read the exceptions file: either LINE_BY_LINE or SPARK_DATASET. |
exceptions_path_options |
Named list. Options to pass to the Spark reader. Defaults to list(format = "text"). |
case_sensitive_exceptions |
Boolean. Whether matching of exceptions in the text is case sensitive. |
context_chars |
String array. List of 1-character strings to rip off from tokens, such as parentheses or question marks. Ignored if using prefix, infix or suffix patterns. |
split_chars |
String array. List of 1-character strings used to split tokens internally, such as hyphens. Ignored if using prefix, infix or suffix patterns. |
split_pattern |
String. Regex pattern used to split tokens from the inside. Takes priority over split_chars. |
target_pattern |
String. Basic regex rule to identify a candidate for tokenization. Defaults to \S+. |
suffix_pattern |
String. Regex to identify subtokens at the end of the token. The regex has to end with \z and must contain groups. |
prefix_pattern |
String. Regex to identify subtokens at the beginning of the token. The regex has to start with \A and must contain groups. |
infix_patterns |
String array. Extension pattern regexes with groups, added to the top of the rules; they are matched first, from the most specific to the most general. |
uid |
A character string used to uniquely identify the ML estimator. |
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of an ml_estimator object. The object contains a pointer to
a Spark Estimator object and can be used to compose
Pipeline objects.
ml_pipeline: When x is an ml_pipeline, the function returns an ml_pipeline with
the NLP estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed and then
immediately fit with the input tbl_spark, returning an NLP model.
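A minimal usage sketch (an assumption, not from this page: it presumes the sparknlp R package with its nlp_document_assembler() annotator, a local Spark connection via sparklyr, and illustrative column names; adapt to your setup):

```r
library(sparklyr)
library(sparknlp)

# Hypothetical local connection and toy data for illustration
sc <- spark_connect(master = "local")
text_tbl <- sdf_copy_to(
  sc,
  data.frame(text = "New York is large. Send an e-mail to Jane's boyfriend."),
  overwrite = TRUE
)

# Compose a pipeline: document assembler feeds the tokenizer
pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(
    input_cols = c("document"),
    output_col = "token",
    split_chars = c("-"),        # split tokens on hyphens
    exceptions = c("New York")   # composite token left unsplit
  )

model  <- ml_fit(pipeline, text_tbl)
result <- ml_transform(model, text_tbl)
```

Because x here is an ml_pipeline, nlp_tokenizer() returns the pipeline with the estimator appended; fitting and transforming then happen via the usual sparklyr ml_fit()/ml_transform() calls.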