nlp_tokenizer | R Documentation
Spark ML estimator that identifies tokens using open tokenization standards. A few rules help customize it when the defaults do not fit the user's needs. See https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer
nlp_tokenizer(
  x,
  input_cols,
  output_col,
  exceptions = NULL,
  exceptions_path = NULL,
  exceptions_path_read_as = "LINE_BY_LINE",
  exceptions_path_options = list(format = "text"),
  case_sensitive_exceptions = NULL,
  context_chars = NULL,
  split_chars = NULL,
  split_pattern = NULL,
  target_pattern = NULL,
  suffix_pattern = NULL,
  prefix_pattern = NULL,
  infix_patterns = NULL,
  uid = random_string("tokenizer_")
)
x |
A spark_connection, ml_pipeline, or tbl_spark. |
input_cols |
Input columns. String array. |
output_col |
Output column. String. |
exceptions |
String array. List of tokens not to alter at all. Allows composite tokens, such as two-word tokens, that the user may not want to split. |
exceptions_path |
NOTE: NOT IMPLEMENTED. String. Path to a txt file with a list of token exceptions. |
exceptions_path_read_as |
String. Either LINE_BY_LINE or SPARK_DATASET. |
exceptions_path_options |
Options to pass to the Spark reader. Defaults to list(format = "text"). |
case_sensitive_exceptions |
Boolean. Whether matching of exceptions in the text is case sensitive. |
context_chars |
String array. List of characters used to separate tokens from their surrounding context at token boundaries, such as punctuation. |
split_chars |
String array. List of single-character strings to strip off of tokens, such as parentheses or question marks. Ignored if using prefix, infix, or suffix patterns. |
split_pattern |
String. Pattern used to separate from the inside of tokens. Takes priority over split_chars. |
target_pattern |
String. Basic regex rule to identify a candidate for tokenization. Defaults to \S+, i.e. anything that is not whitespace. |
suffix_pattern |
String. Regex to identify subtokens at the end of the token. The regex has to end with \z and must contain groups. |
prefix_pattern |
String. Regex to identify subtokens at the beginning of the token. The regex has to start with \A and must contain groups. |
infix_patterns |
String array. Extension pattern regexes with groups, added to the top of the rules (targeted first, from the more specific to the more general). |
uid |
A character string used to uniquely identify the ML estimator. |
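As a sketch of how these arguments might be combined, assuming a live Spark connection `sc` created with sparklyr (the connection object and example values are illustrative; the argument names match the signature above):

```r
library(sparklyr)
library(sparknlp)

# Not run: requires a running Spark session.
sc <- spark_connect(master = "local")

tokenizer <- nlp_tokenizer(
  sc,
  input_cols = c("document"),
  output_col = "token",
  # composite tokens to keep intact
  exceptions = c("New York", "e-mail"),
  case_sensitive_exceptions = TRUE,
  # single characters to strip off of tokens
  split_chars = c("-", "/")
)
```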
The object returned depends on the class of x.

spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
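The dispatch cases above can be illustrated with a minimal pipeline sketch (not run; assumes sparklyr and the sparknlp package are installed, and that the package's `nlp_document_assembler()` is used to produce the `document` annotation column the tokenizer consumes):

```r
library(sparklyr)
library(sparknlp)

# Not run: requires a running Spark session.
sc <- spark_connect(master = "local")
text_tbl <- sdf_copy_to(sc, data.frame(text = "Tokenize this sentence, please!"))

# spark_connection dispatch: returns an unfit ml_estimator
est <- nlp_tokenizer(sc, input_cols = c("document"), output_col = "token")

# ml_pipeline dispatch: appends the estimator to a pipeline
pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token")

# fit the pipeline and apply the resulting model
model <- ml_fit(pipeline, text_tbl)
result <- ml_transform(model, text_tbl)
```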