clean_corpus: Clean Corpus


Description

Cleans a corpus according to user-specified commands. Each file in ipath is processed in parallel by running clean_file on it, and the cleaned version is written to the specified output directory. Make sure the output directory either does not exist yet or contains nothing important, as this function deletes whatever is already in it. See the documentation for clean_file for the commands that can be passed to the cleaning script.

Usage

clean_corpus(ipath, odir, ncores, clean_commands_str)

Arguments

ipath

A string specifying the path to all the text files to handle.

odir

A string specifying the path to an output directory.

ncores

A number specifying the number of cores to use.

clean_commands_str

A string containing the combined commands for the cleaning script.

Examples

## Not run: 
clean_corpus("/path/to/corpus/", "./cleaned/", 20, "-lnprsd --maintain-newlines --min-size 2")

## End(Not run)
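
Because clean_corpus deletes whatever is already in the output directory, it may be worth checking that directory before calling it. The following is a minimal sketch using base R only. The helper odir_is_safe is hypothetical and not part of textprocessingDSI, and leaving one core free is an assumption about a sensible default rather than a recommendation from the package.

```r
# Hypothetical helper (not part of textprocessingDSI): returns TRUE when
# `odir` does not exist or is empty, i.e. nothing would be lost on deletion.
odir_is_safe <- function(odir) {
  if (!dir.exists(odir)) {
    return(TRUE)
  }
  # all.files = TRUE includes hidden files; no.. = TRUE drops "." and "..".
  length(list.files(odir, all.files = TRUE, no.. = TRUE)) == 0
}

# Leave one core free; detectCores() can return NA, so fall back to 1.
available <- parallel::detectCores()
ncores <- if (is.na(available)) 1 else max(1, available - 1)

if (FALSE) {  # not run: requires textprocessingDSI and a real corpus
  odir <- "./cleaned/"
  if (!odir_is_safe(odir)) {
    stop("Output directory is not empty: ", odir)
  }
  clean_corpus("/path/to/corpus/", odir, ncores,
               "-lnprsd --maintain-newlines --min-size 2")
}
```

The check stops before clean_corpus is reached, so an output directory that already holds files is left untouched.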

avkoehl/textprocessingDSI documentation built on June 5, 2019, 7:41 p.m.