nlp_conll_read_dataset | R Documentation |
To train a Named Entity Recognition deep learning annotator, we need CoNLL-format data as a Spark DataFrame. There is a component that does this for us: it reads a plain-text CoNLL file and transforms it into a Spark dataset. See https://nlp.johnsnowlabs.com/docs/en/annotators#conll-dataset. All of the function arguments have defaults; see https://nlp.johnsnowlabs.com/api/index.html#com.johnsnowlabs.nlp.training.CoNLL for the default values.
nlp_conll_read_dataset(
  sc,
  path,
  read_as = NULL,
  document_col = NULL,
  sentence_col = NULL,
  token_col = NULL,
  pos_col = NULL,
  conll_label_index = NULL,
  conll_pos_index = NULL,
  conll_text_col = NULL,
  label_col = NULL,
  explode_sentences = NULL,
  delimiter = NULL,
  parallelism = NULL,
  storage_level = NULL
)
sc                 a Spark connection

path               path to the file to read

read_as            can be LINE_BY_LINE or SPARK_DATASET, with options if the
                   latter is used (default LINE_BY_LINE)

document_col       name to use for the document column

sentence_col       name to use for the sentence column

token_col          name to use for the token column

pos_col            name to use for the part-of-speech column

conll_label_index  index position in the file of the NER label

conll_pos_index    index position in the file of the part-of-speech label

conll_text_col     name to use for the text column

label_col          name to use for the label column

explode_sentences  logical; whether the sentences should be exploded
                   (one sentence per row) or not

delimiter          delimiter used to separate columns inside the CoNLL file

parallelism        integer value controlling the parallelism used when
                   reading the file

storage_level      the storage level to use for the dataset. Must be a string
                   value from org.apache.spark.storage.StorageLevel
                   (e.g. "DISK_ONLY"). See
                   https://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
A Spark DataFrame containing the imported data.
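A minimal usage sketch of reading a CoNLL file with the defaults. It assumes a local Spark connection made via sparklyr; the file name "eng.train" is a hypothetical path to a CoNLL-formatted training file, not one shipped with the package.

```r
library(sparklyr)
library(sparknlp)

# Connect to a local Spark session
sc <- spark_connect(master = "local")

# Read a CoNLL-formatted file into a Spark DataFrame using the
# default column names and settings ("eng.train" is a placeholder path)
training_data <- nlp_conll_read_dataset(sc, path = "eng.train")

# Inspect the imported columns (text, document, sentence, token, pos, label)
head(training_data)
```

The resulting DataFrame can be passed directly as training data to an NER deep learning annotator.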