nlp_tokenizer: Spark NLP Tokenizer approach

View source: R/tokenizer.R

Spark NLP Tokenizer approach

Description

Spark ML estimator that identifies tokens using open tokenization standards. A few rules allow customizing it when the defaults do not fit the user's needs. See https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer

Usage

nlp_tokenizer(
  x,
  input_cols,
  output_col,
  exceptions = NULL,
  exceptions_path = NULL,
  exceptions_path_read_as = "LINE_BY_LINE",
  exceptions_path_options = list(format = "text"),
  case_sensitive_exceptions = NULL,
  context_chars = NULL,
  split_chars = NULL,
  split_pattern = NULL,
  target_pattern = NULL,
  suffix_pattern = NULL,
  prefix_pattern = NULL,
  infix_patterns = NULL,
  uid = random_string("tokenizer_")
)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_cols

Input columns. String array.

output_col

Output column. String.

exceptions

String array. List of tokens not to alter at all. Allows composite tokens, such as two-word tokens, that the user may not want to split.

exceptions_path

NOTE: NOT IMPLEMENTED. String. Path to a txt file containing the list of token exceptions.

exceptions_path_read_as

String. How to read the exceptions file: LINE_BY_LINE or SPARK_DATASET.

exceptions_path_options

List. Options to pass to the Spark reader. Defaults to list(format = "text").

case_sensitive_exceptions

Boolean. Whether matching of exceptions in the text is case sensitive.

context_chars

String array. List of 1-character strings to rip off from tokens, such as parentheses or question marks. Ignored if using prefix, infix or suffix patterns.

split_chars

String array. List of 1-character strings used to split tokens from the inside, such as hyphens. Ignored if using prefix, infix or suffix patterns.

split_pattern

String. Pattern used to separate subtokens from within tokens. Takes priority over split_chars.

target_pattern

String. Basic regex rule to identify a candidate for tokenization. Defaults to \\S+, which matches anything that is not a space.

suffix_pattern

String. Regex to identify subtokens at the end of the token. The regex has to end with \\z and must contain groups (). Each group becomes a separate token within the suffix. Defaults to non-letter characters, e.g. quotes or parentheses.

prefix_pattern

String. Regex to identify subtokens at the beginning of the token. The regex has to start with \\A and must contain groups (). Each group becomes a separate token within the prefix. Defaults to non-letter characters, e.g. quotes or parentheses.

infix_patterns

String array. Regex patterns with groups that extend the built-in rules; they are applied first, ordered from the more specific to the more general.

uid

A character string used to uniquely identify the ML estimator.
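
Details

The exception and splitting arguments interact: split_chars is ignored once any of the prefix, infix or suffix patterns are supplied, so character-based and pattern-based splitting should not be mixed. As a minimal configuration sketch (the argument values below are illustrative, not package defaults), a tokenizer that keeps "e-mail" whole while splitting other hyphenated tokens could look like:

nlp_tokenizer(
  sc,
  input_cols = c("document"),
  output_col = "token",
  exceptions = c("e-mail"),  # keep this composite token whole
  split_chars = c("-")       # split other tokens on hyphens
)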

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
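
Examples

A minimal end-to-end sketch, assuming a local Spark connection and the package's nlp_document_assembler() to produce the document annotation column consumed by the tokenizer (the assembler's argument names here are assumptions, not verified against the package):

library(sparklyr)
library(sparknlp)

sc <- spark_connect(master = "local")
text_tbl <- sdf_copy_to(sc, data.frame(text = "Tokenize this, please!"))

# Compose a pipeline: raw text -> document annotation -> tokens
pipeline <- ml_pipeline(sc) %>%
  nlp_document_assembler(input_col = "text", output_col = "document") %>%
  nlp_tokenizer(input_cols = c("document"), output_col = "token")

# Fit the pipeline, then transform to add the "token" annotation column
model <- ml_fit(pipeline, text_tbl)
tokens <- ml_transform(model, text_tbl)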

