corenlp_blocked: Runs Stanford CoreNLP on a collection of .txt files and...


View source: R/corenlp_blocked.R

Description

Runs Stanford CoreNLP on a collection of .txt files and processes them in blocks of a specified size, saving intermediate results to disk. Designed to function on very large corpora.

Usage

corenlp_blocked(output_directory, document_directory, file_list = NULL,
  block_size = 1000, syntactic_parsing = FALSE,
  coreference_resolution = FALSE, additional_options = "",
  return_raw_output = FALSE, version = "3.5.2", parallel = FALSE,
  cores = 1, first_block = NULL, last_block = NULL)
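A minimal sketch of a call, using only the argument names from the signature above. The directory paths are hypothetical placeholders, and the call is guarded so that it only runs when the SpeedReader package is installed and the input directory exists. It also assumes CoreNLP and Java are already set up on the machine.

```r
# Hypothetical paths; replace them with real directories.
document_directory <- "~/corpus/txt"
output_directory <- "~/corpus/corenlp_output"

# Collect the arguments in a list so they can be inspected before running.
args <- list(
  output_directory = output_directory,
  document_directory = document_directory,
  block_size = 500,            # process 500 documents per block
  syntactic_parsing = FALSE,
  coreference_resolution = FALSE,
  parallel = FALSE,
  cores = 1
)

# Only attempt the run if the package and the input directory are available.
if (requireNamespace("SpeedReader", quietly = TRUE) &&
    dir.exists(path.expand(document_directory))) {
  do.call(SpeedReader::corenlp_blocked, args)
}
```

Arguments left out of the list (for example file_list, first_block and last_block) keep the defaults shown in the signature.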

Arguments

output_directory

The path to the directory where CoreNLP output should be stored. Output is saved to this directory in .Rdata files named CoreNLP_Output_1.Rdata ... CoreNLP_Output_N.Rdata.

document_directory

The path to a directory containing .txt files (one per document) to be run through CoreNLP.

file_list

An optional list of .txt files to be used. Useful when the user only wants to process a subset of the documents in the directory, for example when the corpus is extremely large. Defaults to NULL.

block_size

The number of documents to be processed at a time. Defaults to 1000.

syntactic_parsing

Logical indicating whether syntactic parsing should be included as an option. Defaults to FALSE. Caution: enabling this argument may greatly increase runtime. If TRUE, output will automatically be returned in raw format.

coreference_resolution

Logical indicating whether coreference resolution should be included as an option. Defaults to FALSE. Caution: enabling this argument may greatly increase runtime. If TRUE, output will automatically be returned in raw format.

additional_options

An optional string specifying additional options for CoreNLP. May cause unexpected behavior; use at your own risk!

return_raw_output

Logical indicating whether CoreNLP output should be left unparsed. Defaults to FALSE. If TRUE, CoreNLP output is not parsed and raw list objects are returned.

version

The version of CoreNLP to download. Defaults to '3.5.2'. Newer versions of CoreNLP will be made available at a later date.

parallel

Logical indicating whether CoreNLP should be run in parallel. Defaults to FALSE.

cores

The number of cores to be used if CoreNLP is being run in parallel. Defaults to 1.

first_block

The first block to process. Used together with last_block to run CoreNLP on a specific range of blocks. Defaults to NULL.

last_block

The last block to process. Used together with first_block to run CoreNLP on a specific range of blocks. Defaults to NULL.
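To make the relationship between block_size, first_block and last_block concrete, the sketch below partitions a corpus into consecutive blocks and then selects a range of them. The helper block_indices is hypothetical and not part of SpeedReader; the package's internal partitioning is assumed to be similar but may differ in detail.

```r
# Hypothetical helper (not part of SpeedReader): split file positions
# 1..num_files into consecutive blocks of at most block_size files each.
block_indices <- function(num_files, block_size = 1000) {
  num_blocks <- ceiling(num_files / block_size)
  lapply(seq_len(num_blocks), function(b) {
    start <- (b - 1) * block_size + 1
    end <- min(b * block_size, num_files)
    start:end
  })
}

# A corpus of 2500 documents with the default block size gives three blocks;
# the last block holds the remaining 500 documents.
blocks <- block_indices(num_files = 2500, block_size = 1000)
num_blocks <- length(blocks)
last_block_range <- range(blocks[[num_blocks]])

# Assumed meaning of first_block / last_block: process only this range.
first_block <- 2
last_block <- 3
selected <- blocks[first_block:last_block]
num_selected_files <- sum(lengths(selected))
```

Under this assumption, rerunning with first_block = 2 and last_block = 3 would cover documents 1001 through 2500 and leave the output of block 1 untouched.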

Value

Does not return anything; all output is saved to disk.
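Because results only exist on disk, a later step has to read the block files back in. The sketch below lists the CoreNLP_Output_*.Rdata files in a directory, orders them by block number, and reports which objects each file contains. The helper list_block_contents is hypothetical, and the names of the objects stored inside each file are not documented here, which is why the sketch inspects them rather than assuming them. The demo at the end writes placeholder files to a temporary directory purely to exercise the helper.

```r
# Hypothetical helper (not part of SpeedReader): load each block file into its
# own environment and return the names of the objects it contains.
list_block_contents <- function(output_directory) {
  files <- list.files(output_directory,
                      pattern = "^CoreNLP_Output_[0-9]+\\.Rdata$",
                      full.names = TRUE)
  # Sort numerically by block index so that block 10 follows block 9.
  index <- as.integer(gsub("[^0-9]", "", basename(files)))
  files <- files[order(index)]
  contents <- lapply(files, function(f) {
    env <- new.env()
    load(f, envir = env)
    ls(env)
  })
  names(contents) <- basename(files)
  contents
}

# Demo with placeholder files standing in for real CoreNLP output.
demo_dir <- file.path(tempdir(), "corenlp_blocked_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in c(1, 2, 10)) {
  placeholder <- list(block = i)
  save(placeholder,
       file = file.path(demo_dir, paste0("CoreNLP_Output_", i, ".Rdata")))
}
block_contents <- list_block_contents(demo_dir)
```

Loading into a fresh environment per file avoids overwriting objects in the global workspace when every block file stores objects under the same names.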


matthewjdenny/SpeedReader documentation built on March 25, 2020, 5:32 p.m.