nlp_tokenizer: Spark NLP Tokenizer approach

View source: R/tokenizer.R

Spark NLP Tokenizer approach

Description

Spark ML estimator that identifies tokens using open tokenization standards. A few rules allow customizing it when the defaults do not fit the user's needs. See https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer

Usage

nlp_tokenizer(
  x,
  input_cols,
  output_col,
  exceptions = NULL,
  exceptions_path = NULL,
  exceptions_path_read_as = "LINE_BY_LINE",
  exceptions_path_options = list(format = "text"),
  case_sensitive_exceptions = NULL,
  context_chars = NULL,
  split_chars = NULL,
  split_pattern = NULL,
  target_pattern = NULL,
  suffix_pattern = NULL,
  prefix_pattern = NULL,
  infix_patterns = NULL,
  uid = random_string("tokenizer_")
)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_cols

Input columns. String array.

output_col

Output column. String.

exceptions

String array. List of tokens not to alter at all. Allows composite tokens, such as two-word tokens, that the user may not want to split.

exceptions_path

NOTE: NOT IMPLEMENTED. String. Path to a txt file containing the list of token exceptions.

exceptions_path_read_as

String. How to read the exceptions file: LINE_BY_LINE or SPARK_DATASET.

exceptions_path_options

List. Options to pass to the Spark reader. Defaults to list(format = "text").

case_sensitive_exceptions

Boolean. Whether matching of exceptions in the text is case sensitive.

context_chars

String array. List of 1-character strings to rip off from tokens, such as parentheses or question marks. Ignored if using prefix, infix or suffix patterns.

split_chars

String array. List of 1-character strings used to split tokens from the inside, such as hyphens. Ignored if using prefix, infix or suffix patterns.

split_pattern

String. Pattern used to separate subtokens from within tokens. Takes priority over split_chars.

target_pattern

String. Basic regex rule to identify a candidate for tokenization. Defaults to \\S+, which matches anything that is not a space.

suffix_pattern

String. Regex to identify subtokens at the end of the token. The regex has to end with \\z and must contain groups (). Each group becomes a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses.

prefix_pattern

String. Regex to identify subtokens at the beginning of the token. The regex has to start with \\A and must contain groups (). Each group becomes a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses.

infix_patterns

String array. Regex patterns with groups that extend the built-in rules; they are applied first, ordered from the more specific to the more general.

uid

A character string used to uniquely identify the ML estimator.
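
Details

The exception and splitting arguments interact: split_chars is ignored once any of the prefix, infix or suffix patterns are supplied, so character-based and pattern-based splitting should not be mixed. As a minimal configuration sketch (the argument values below are illustrative, not package defaults), a tokenizer that keeps "e-mail" whole while splitting other hyphenated tokens could look like:

nlp_tokenizer(
  sc,
  input_cols = c("document"),
  output_col = "token",
  exceptions = c("e-mail"),  # keep this composite token whole
  split_chars = c("-")       # split other tokens on hyphens
)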

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
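
Examples

A minimal end-to-end sketch, assuming a local Spark connection and the package's nlp_document_assembler() to produce the document annotation column consumed by the tokenizer (the assembler's argument names here are assumptions, not verified against the package):

library(sparklyr)
library(sparknlp)

sc <- spark_connect(master = "local")
text_tbl <- sdf_copy_to(sc, data.frame(text = "Tokenize this, please!"))

# Compose a pipeline: raw text -> document annotation -> tokens
pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token")

# Fit the pipeline, then transform to add the "token" annotation column
model <- ml_fit(pipeline, text_tbl)
tokens <- ml_transform(model, text_tbl)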

