segment: Split table into directories with text segments.

View source: R/segment.R

Description

Multithreaded processing using the StanfordCoreNLP class requires splitting up input data into pieces of text ("chunks") available as files that are processed in parallel. The segment() function performs this split operation, i.e. it creates directories with chunks within a superdirectory.
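
The split described above can be pictured with a small base R sketch. This is an illustration under stated assumptions, not the bignlp implementation: the helper name `sketch_segment()`, the `chunk_NN` directory names and the `<doc_id>.txt` file names are all hypothetical and may differ from what `segment()` actually writes.

```r
# Illustrative sketch only: NOT the bignlp implementation of segment().
# Helper name, directory names and file names are hypothetical.
sketch_segment <- function(x, dir, chunksize = 10L) {
  # Assign rows to chunks: rows 1..chunksize -> 1, the next block -> 2, ...
  chunk_id <- as.integer(ceiling(seq_len(nrow(x)) / chunksize))
  chunk_name <- sprintf("chunk_%02d", chunk_id)
  chunk_dirs <- file.path(dir, unique(chunk_name))
  for (d in chunk_dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  # Write one file per document, named after its doc_id
  files <- file.path(dir, chunk_name, paste0(x[["doc_id"]], ".txt"))
  for (i in seq_len(nrow(x))) writeLines(x[["text"]][i], con = files[i])
  chunk_dirs
}

# Toy input with the two columns the documentation requires
docs <- data.frame(
  doc_id = 1:25,
  text = paste("Document", 1:25),
  stringsAsFactors = FALSE
)

# Start from a clean superdirectory inside the session's temporary directory
outdir <- file.path(tempdir(), "segment_sketch")
unlink(outdir, recursive = TRUE)

dirs <- sketch_segment(docs, dir = outdir, chunksize = 10L)
lengths(lapply(dirs, list.files))  # files per chunk directory: 10 10 5
```

With 25 documents and a chunk size of 10, the sketch yields `ceiling(25 / 10) = 3` chunk directories, the last of which holds the remaining 5 texts.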

Usage

segment(x, dir, chunksize = 10L, progress = interactive())

Arguments

x

A data.table with columns 'doc_id' (integer values) and 'text'. Further columns are ignored.

dir

Superdirectory for directories with segments that will be processed sequentially.

chunksize

An integer value, the number of strings that will reside in each chunk directory.

progress

A logical value, whether to show a progress bar.

Value

The function returns a character vector with the directories that contain files with text segments.

Examples

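library(data.table)
library(bignlp)
# Read the sample Reuters file shipped with bignlp; each line becomes one document
reuters_txt <- readLines(system.file(package = "bignlp", "extdata", "txt", "reuters.txt"))
# Build the input table with the required columns 'doc_id' and 'text'
dt <- data.table(doc_id = seq_along(reuters_txt), text = reuters_txt)
# Use the session's temporary directory as superdirectory for the chunk directories
segdir <- tempdir()
dirs <- segment(x = dt, dir = segdir, chunksize = 10L)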

PolMine/bignlp documentation built on Jan. 29, 2021, 1:14 a.m.