nlp_conll_read_dataset: Transform a CoNLL format text file to a Spark dataframe

nlp_conll_read_dataset    R Documentation

Transform a CoNLL format text file to a Spark dataframe

Description

In order to train a Named Entity Recognition DL annotator, the CoNLL format data must be available as a Spark dataframe. This function wraps the Spark NLP component that does this: it reads a plain text file in CoNLL format and transforms it into a Spark dataset. See https://nlp.johnsnowlabs.com/docs/en/annotators#conll-dataset. All of the function arguments have defaults; see https://nlp.johnsnowlabs.com/api/index.html#com.johnsnowlabs.nlp.training.CoNLL for the default values.

Usage

nlp_conll_read_dataset(
  sc,
  path,
  read_as = NULL,
  document_col = NULL,
  sentence_col = NULL,
  token_col = NULL,
  pos_col = NULL,
  conll_label_index = NULL,
  conll_pos_index = NULL,
  conll_text_col = NULL,
  label_col = NULL,
  explode_sentences = NULL,
  delimiter = NULL,
  parallelism = NULL,
  storage_level = NULL
)

Arguments

sc

a Spark connection

path

path to the file to read

read_as

Can be LINE_BY_LINE or SPARK_DATASET; additional options apply if the latter is used (default: LINE_BY_LINE)

document_col

name to use for the document column

sentence_col

name to use for the sentence column

token_col

name to use for the token column

pos_col

name to use for the part of speech column

conll_label_index

index position of the NER label in the file

conll_pos_index

index position of the part of speech label in the file

conll_text_col

name to use for the text column

label_col

name to use for the label column

explode_sentences

boolean indicating whether each sentence should be exploded into a separate row

delimiter

delimiter used to separate columns inside the CoNLL file

parallelism

integer value setting the level of parallelism used when reading the data

storage_level

specifies the storage level to use for the dataset. Must be a string value from org.apache.spark.storage.StorageLevel (e.g. "DISK_ONLY"). See https://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html

Value

Spark dataframe containing the imported data
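
Examples

A minimal sketch of typical usage, assuming a sparklyr connection configured with the Spark NLP dependencies; the path "eng.train" is a placeholder for a CoNLL 2003 formatted training file.

## Not run: 
library(sparklyr)
library(sparknlp)

# Connect to Spark. The connection must have the Spark NLP jars available;
# how it is configured depends on your setup.
sc <- spark_connect(master = "local")

# Read the CoNLL file with all defaults ("eng.train" is a placeholder path).
training_data <- nlp_conll_read_dataset(sc, path = "eng.train")

# The same read with a few non-default options.
training_data <- nlp_conll_read_dataset(
  sc,
  path = "eng.train",
  explode_sentences = TRUE,
  storage_level = "DISK_ONLY"
)

# The result is an ordinary Spark dataframe, so standard sparklyr verbs apply.
sdf_nrow(training_data)
head(training_data)

spark_disconnect(sc)

## End(Not run)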

