nlp_document_normalizer: Spark NLP DocumentNormalizer

View source: R/document_normalizer.R

nlp_document_normalizer    R Documentation

Spark NLP DocumentNormalizer

Description

Spark ML transformer which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from the text that match one or more input regex patterns. Unwanted characters can be removed according to a specific policy, and the text can optionally be converted to lower case.

Usage

nlp_document_normalizer(
  x,
  input_cols,
  output_col,
  action = NULL,
  encoding = NULL,
  lower_case = NULL,
  patterns = NULL,
  policy = NULL,
  replacement = NULL,
  uid = random_string("document_normalizer_")
)

Arguments

x

A spark_connection, ml_pipeline, or a tbl_spark.

input_cols

Input columns. String array.

output_col

Output column. String.

action

Action to perform when applying regex patterns to the text (Default: "clean")

encoding

File encoding to apply on normalized documents (Default: "disable")

lower_case

Whether to convert strings to lowercase (Default: false)

patterns

Normalization regex patterns whose matches will be removed from the document (Default: Array("<[^>]*>"))

policy

Removal policy used when removing pattern matches from the text (Default: "pretty_all"). Possible values are "all", "pretty_all", "first", and "pretty_first"

replacement

Replacement string to apply when regexes match (Default: " ")

uid

A character string used to uniquely identify the ML estimator.
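
To make the interplay of patterns, replacement, policy and lower_case more concrete, here is a rough base-R approximation. It is only a sketch: normalize_sketch() is a hypothetical helper that is not part of sparknlp, the real normalization runs inside Spark NLP on the JVM, and the assumption that the "pretty" policies additionally collapse runs of whitespace is ours rather than something stated on this page.

```r
# Hypothetical helper, NOT part of sparknlp: a base-R approximation of the
# arguments described above, for illustration only.
normalize_sketch <- function(text,
                             patterns = c("<[^>]*>"),
                             replacement = " ",
                             policy = "pretty_all",
                             lower_case = FALSE) {
  first_only <- policy %in% c("first", "pretty_first")
  for (p in patterns) {
    if (first_only) {
      # "first"-style policies: replace only the first match of each pattern
      text <- sub(p, replacement, text, perl = TRUE)
    } else {
      # "all"-style policies: replace every match of each pattern
      text <- gsub(p, replacement, text, perl = TRUE)
    }
  }
  if (startsWith(policy, "pretty")) {
    # Assumption: "pretty" variants also collapse whitespace runs and trim
    text <- trimws(gsub("\\s+", " ", text, perl = TRUE))
  }
  if (lower_case) {
    text <- tolower(text)
  }
  text
}

html <- "<div><p>Hello  <b>World</b></p></div>"
normalize_sketch(html)                     # default pattern strips the tags
normalize_sketch(html, lower_case = TRUE)  # same, then lower-cased
normalize_sketch(html, policy = "first")   # only the first tag is replaced
```

With the default pattern every HTML/XML tag is matched, so the "all" policies leave only the text between tags, while the "first" policies leave all but the first tag in place.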

Details

See https://nlp.johnsnowlabs.com/docs/en/annotators#documentnormalizer-text-cleaning

Value

The object returned depends on the class of x.

  • spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.

  • ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.

  • tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
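
Examples

The following is a minimal sketch of how the normalizer might be placed in a pipeline, assuming a local Spark installation with the Spark NLP dependencies available. The wrapper function run_normalizer_example(), the table name "raw_html", and the column names "text", "document" and "normalized" are illustrative choices, not part of the package. The code is wrapped in a function so that nothing connects to Spark until it is called.

```r
# Hypothetical wrapper for illustration; call run_normalizer_example() to run it.
run_normalizer_example <- function(master = "local") {
  # Attach sparknlp before connecting so its Spark dependencies are registered
  library(sparklyr)
  library(sparknlp)

  sc <- spark_connect(master = master)
  on.exit(spark_disconnect(sc), add = TRUE)

  # A one-row table holding tagged text
  raw <- sdf_copy_to(
    sc,
    data.frame(text = "<div><p>Hello <b>World</b></p></div>",
               stringsAsFactors = FALSE),
    name = "raw_html",
    overwrite = TRUE
  )

  # x is an ml_pipeline here, so each call returns the pipeline with a stage appended
  pipeline <- ml_pipeline(sc)
  pipeline <- nlp_document_assembler(pipeline,
                                     input_col = "text",
                                     output_col = "document")
  pipeline <- nlp_document_normalizer(pipeline,
                                      input_cols = c("document"),
                                      output_col = "normalized",
                                      action = "clean",
                                      patterns = c("<[^>]*>"),
                                      replacement = " ",
                                      policy = "pretty_all",
                                      lower_case = TRUE)

  # Fit the pipeline and apply it to the same table
  fitted <- ml_fit(pipeline, raw)
  transformed <- ml_transform(fitted, raw)

  # Collect before the connection is closed by on.exit()
  sdf_collect(transformed)
}
```

Passing a spark_connection or a tbl_spark as x instead of an ml_pipeline follows the dispatch rules described under Value.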


r-spark/sparknlp documentation built on Oct. 15, 2022, 10:50 a.m.