View source: R/document_normalizer.R
nlp_document_normalizer | R Documentation |
Spark ML transformer that normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document-type columns into Sentence. It removes all dirty characters from the text that match one or more input regex patterns. Unwanted characters can be removed according to a specific policy, and the text can optionally be converted to lower case.
nlp_document_normalizer( x, input_cols, output_col, action = NULL, encoding = NULL, lower_case = NULL, patterns = NULL, policy = NULL, replacement = NULL, uid = random_string("document_normalizer_") )
x |
A spark_connection, ml_pipeline, or tbl_spark. |
input_cols |
Input columns. String array. |
output_col |
Output column. String. |
action |
Action to perform when applying the regex patterns to the text |
encoding |
File encoding to apply on normalized documents (Default: "disable") |
lower_case |
Whether to convert strings to lowercase (Default: false) |
patterns |
Normalization regex patterns; text matching them is removed from the document (Default: Array("<[^>]*>")) |
policy |
Removal policy that controls how pattern matches are removed from the text (Default: "pretty_all"). Possible values are "all", "pretty_all", "first", "pretty_first" |
replacement |
Replacement string to apply when regexes match (Default: " ") |
uid |
A character string used to uniquely identify the ML estimator. |
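As a rough illustration of how patterns, replacement, policy and lower_case interact, the following base R sketch mimics the documented behaviour on a plain character string. It is not the Spark NLP implementation: normalize_text is a hypothetical helper, only the "all" and "first" policies are modelled (the "pretty_" variants are not), and the HTML-tag regex is assumed here as a typical pattern.

```r
# Illustrative stand-in for the normalizer's options; NOT the Spark NLP code.
# `normalize_text` is a hypothetical helper that works on plain R strings.
normalize_text <- function(text,
                           patterns = "<[^>]*>",      # assumed tag-stripping regex
                           replacement = " ",
                           policy = c("all", "first"),
                           lower_case = FALSE) {
  policy <- match.arg(policy)
  for (p in patterns) {
    if (policy == "all") {
      # Replace every match of the pattern
      text <- gsub(p, replacement, text, perl = TRUE)
    } else {
      # Replace only the first match of the pattern
      text <- sub(p, replacement, text, perl = TRUE)
    }
  }
  if (lower_case) {
    text <- tolower(text)
  }
  text
}

html <- "<p>Hello <b>World</b></p>"
print(normalize_text(html))                     # every tag replaced
print(normalize_text(html, policy = "first"))   # only the first tag replaced
print(normalize_text(html, lower_case = TRUE))  # tags replaced, then lowercased
```

Each tag is swapped for the replacement string rather than deleted outright, so a single-space replacement leaves extra whitespace behind where tags used to be.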
See https://nlp.johnsnowlabs.com/docs/en/annotators#documentnormalizer-text-cleaning
The object returned depends on the class of x.
spark_connection: When x is a spark_connection, the function returns an instance of a ml_estimator object. The object contains a pointer to a Spark Estimator object and can be used to compose Pipeline objects.
ml_pipeline: When x is a ml_pipeline, the function returns a ml_pipeline with the NLP estimator appended to the pipeline.
tbl_spark: When x is a tbl_spark, an estimator is constructed then immediately fit with the input tbl_spark, returning an NLP model.
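The ml_pipeline case can be sketched as follows. This is a minimal example under stated assumptions, not a reference usage: build_normalizer_pipeline is a hypothetical helper, the column names "text", "document" and "normalized" are made up for illustration, the pattern is an assumed HTML-tag regex, and nlp_document_assembler is assumed to be the usual first stage that produces the document column.

```r
# Hypothetical helper: builds an unfitted pipeline that assembles a raw text
# column into documents and then normalizes them. Column names are assumptions.
build_normalizer_pipeline <- function(sc, text_col = "text") {
  sparklyr::ml_pipeline(sc) |>
    sparknlp::nlp_document_assembler(
      input_col = text_col,
      output_col = "document"
    ) |>
    sparknlp::nlp_document_normalizer(
      input_cols = c("document"),
      output_col = "normalized",
      patterns = c("<[^>]*>"),   # assumed tag-stripping regex
      replacement = " ",
      policy = "pretty_all",
      lower_case = TRUE
    )
}

# Usage, assuming a working local Spark installation with Spark NLP available:
# library(sparklyr)
# library(sparknlp)   # load before connecting so its dependencies are registered
# sc <- spark_connect(master = "local")
# raw <- sdf_copy_to(sc, data.frame(text = "<p>Hello <b>World</b></p>"), "raw_docs")
# model <- ml_fit(build_normalizer_pipeline(sc), raw)
# ml_transform(model, raw)
```

Because x is a ml_pipeline in both stage calls, each call returns the pipeline with the new stage appended, so the stages can be chained with the pipe.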