This notebook is adapted from the John Snow Labs workshop Jupyter/Python tutorial "extractor.ipynb" (https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/text-matcher-pipeline/extractor.ipynb)

In the following example, we walk through the straightforward Text Matcher annotator.

This annotator takes a list of phrases from a text file and looks them up in the given target dataset.

This annotator is an Annotator Model and hence does not require training.
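The matcher reads its phrase list from a plain text file, one phrase per line. Below is a minimal sketch of writing such a file from R; the phrases are placeholders for illustration only, not the contents of the entities.txt used in the original notebook.

# Hypothetical phrase file: one phrase per line, matched verbatim against the text.
writeLines(c("i feel great",
             "this is bad",
             "not so good"),
           con = "entities.txt")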

1. Load the necessary packages

library(sparklyr)
library(sparknlp)
library(dplyr)

2. Connect to Spark

version <- Sys.getenv("SPARK_VERSION", unset = "2.4.0")

config <- sparklyr::spark_config()
#config$`sparklyr.shell.driver-memory` <- "8g"

options(sparklyr.sanitize.column.names.verbose = TRUE)
options(sparklyr.verbose = TRUE)
options(sparklyr.na.omit.verbose = TRUE)
options(sparklyr.na.action.verbose = TRUE)
sc <- sparklyr::spark_connect(master = "local[*]", version = version, config = config)
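As a quick sanity check (not part of the original notebook), you can confirm the session is up and which Spark version it runs before building the pipeline:

# Optional: report the Spark version of the active connection.
sparklyr::spark_version(sc)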

3. Create the annotators. We use a sentence detector and a tokenizer, and the text matcher to extract the entities. The finisher will clean the annotations and exclude the metadata

document_assembler <- nlp_document_assembler(sc, input_col = "text", output_col = "document")
sentence_detector <- nlp_sentence_detector(sc, input_cols = c("document"), output_col = "sentence")
tokenizer <- nlp_tokenizer(sc, input_cols = c("document"), output_col = "token")
extractor <- nlp_text_matcher(sc, input_cols = c("token", "sentence"), output_col = "entities", path = "entities.txt")
finisher <- nlp_finisher(sc, input_cols = "entities", include_metadata = FALSE, clean_annotations = TRUE)

pipeline <- ml_pipeline(document_assembler,
                        sentence_detector,
                        tokenizer,
                        extractor,
                        finisher)
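Printing the pipeline object (a plain sparklyr ml_pipeline) lists its stages in the order they will run, which is a handy check before fitting:

# Stages should appear in order: document assembler, sentence detector,
# tokenizer, text matcher, finisher.
pipeline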

4. Load the input data to be annotated

tdir <- tempdir()
download.file("https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip", 
              paste0(tdir, "/sentiment.parquet.zip"))
unzip(paste0(tdir, "/sentiment.parquet.zip"), exdir = tdir)
data <- spark_read_parquet(sc, "sentiment", paste0(tdir, "/sentiment.parquet")) %>%
  head(1000)

head(data, n = 20)
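To get a feel for the input (a quick check, not in the original notebook), sparklyr's sdf_nrow() and sdf_schema() report the row count and column types of the Spark DataFrame; the pipeline expects a "text" column, matching the document assembler's input_col above.

# Confirm we kept 1000 rows and inspect the column layout.
sdf_nrow(data)
sdf_schema(data)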

5. Run the fit for sentence detection and tokenization

print("Start fitting")
model <- ml_fit(pipeline, data)
print("Fitting is ended")

6. Run the transform on the data to do the text matching. It will append a new column with the matched entities

extracted <- ml_transform(model, data)
extracted

extracted %>%
    filter(size(finished_entities) != 0)
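To bring a sample of the matches back into R for inspection (a small sketch; the column names follow the finisher output used above), select the text and the finished entities, then collect:

# Pull the first matched rows into a local data frame.
extracted %>%
  filter(size(finished_entities) != 0) %>%
  select(text, finished_entities) %>%
  head(10) %>%
  collect()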

7. The model can be saved locally and reloaded to run again

ml_save(model, "./extractor.model")
same_model <- ml_load(sc, "./extractor.model")
ml_transform(same_model, data) %>%
  filter(size(finished_entities) != 0)
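When you are done, disconnect from Spark (not part of the original notebook, but a tidy way to end the session):

# Close the Spark connection.
spark_disconnect(sc)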
