nlp_conll_read_dataset: Transform a CoNLL format text file to a Spark dataframe

nlp_conll_read_dataset    R Documentation

Transform a CoNLL format text file to a Spark dataframe

Description

In order to train a Named Entity Recognition DL annotator, the CoNLL format data must be available as a Spark dataframe. This function wraps the Spark NLP component that does this: it reads a plain text file in CoNLL format and transforms it into a Spark dataset. See https://nlp.johnsnowlabs.com/docs/en/annotators#conll-dataset. All of the function arguments have defaults; see https://nlp.johnsnowlabs.com/api/index.html#com.johnsnowlabs.nlp.training.CoNLL for the default values.

Usage

nlp_conll_read_dataset(
  sc,
  path,
  read_as = NULL,
  document_col = NULL,
  sentence_col = NULL,
  token_col = NULL,
  pos_col = NULL,
  conll_label_index = NULL,
  conll_pos_index = NULL,
  conll_text_col = NULL,
  label_col = NULL,
  explode_sentences = NULL,
  delimiter = NULL,
  parallelism = NULL,
  storage_level = NULL
)

Arguments

sc

a Spark connection

path

path to the file to read

read_as

Can be LINE_BY_LINE or SPARK_DATASET; additional options apply if the latter is used (default: LINE_BY_LINE)

document_col

name to use for the document column

sentence_col

name to use for the sentence column

token_col

name to use for the token column

pos_col

name to use for the part of speech column

conll_label_index

index position of the NER label in the file

conll_pos_index

index position of the part of speech label in the file

conll_text_col

name to use for the text column

label_col

name to use for the label column

explode_sentences

boolean indicating whether each sentence should be exploded into a separate row

delimiter

delimiter used to separate columns inside the CoNLL file

parallelism

integer value setting the level of parallelism used when reading the data

storage_level

specifies the storage level to use for the dataset. Must be a string value from org.apache.spark.storage.StorageLevel (e.g. "DISK_ONLY"). See https://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html

Value

Spark dataframe containing the imported data
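
Examples

A minimal sketch of typical usage, assuming a sparklyr connection configured with the Spark NLP dependencies; the path "eng.train" is a placeholder for a CoNLL 2003 formatted training file.

## Not run: 
library(sparklyr)
library(sparknlp)

# Connect to Spark. The connection must have the Spark NLP jars available;
# how it is configured depends on your setup.
sc <- spark_connect(master = "local")

# Read the CoNLL file with all defaults ("eng.train" is a placeholder path).
training_data <- nlp_conll_read_dataset(sc, path = "eng.train")

# The same read with a few non-default options.
training_data <- nlp_conll_read_dataset(
  sc,
  path = "eng.train",
  explode_sentences = TRUE,
  storage_level = "DISK_ONLY"
)

# The result is an ordinary Spark dataframe, so standard sparklyr verbs apply.
sdf_nrow(training_data)
head(training_data)

spark_disconnect(sc)

## End(Not run)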

