Description Usage Arguments Value Examples
Runs the full pipeline for cleaning a corpus and preparing it for topic
modelling. Some arguments are mandatory and several are optional.
It is designed to be the simplest way to process a corpus. If you want more control,
run the individual pieces of the pipeline manually. The pipeline is essentially:
1) combine the corpus into a single file (one document per line) and split it into equal-sized chunks
2) clean those chunks in parallel
3) lemmatize those chunks in parallel
4) find and remove the 'sparse' and 'abundant' terms
5) recombine the now cleaned corpus into a single file (one document per line)
6) delete the intermediary directories that were created
7) save the parameters used for cleaning in an info file in opath
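The combine, split, process and recombine steps above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `doc_to_line`, `chunk_lines` and `toy_clean` are hypothetical helpers, and the cleaning step is a stand-in (lower-casing) rather than the real cleaning script. A plain `lapply` is used here; the actual pipeline distributes the chunks over `ncores` cores.

```r
# Hypothetical helper: collapse one document onto a single line by
# replacing newlines with spaces (in the spirit of delim = 1).
doc_to_line <- function(text) {
  gsub("\\s*\n\\s*", " ", trimws(text))
}

# Hypothetical helper: split a vector of lines into chunks of at most
# `size` lines each.
chunk_lines <- function(lines, size) {
  split(lines, ceiling(seq_along(lines) / size))
}

# Stand-in for the real cleaning step (assumption: just lower-case the text).
toy_clean <- function(lines) tolower(lines)

docs <- c("First doc\nwith two lines", "Second doc", "Third\ndoc\nhere")

# Step 1: one document per line, then split into chunks of two lines.
corpus <- vapply(docs, doc_to_line, character(1), USE.NAMES = FALSE)
chunks <- chunk_lines(corpus, size = 2)

# Step 2: process each chunk independently (sequentially in this sketch).
cleaned_chunks <- lapply(chunks, toy_clean)

# Step 5: recombine the chunks, still one document per line.
cleaned <- unlist(cleaned_chunks, use.names = FALSE)
print(cleaned)
```

Because each chunk is processed independently, the order of the documents is preserved when the chunks are recombined.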
ipath |
A string specifying the path to the raw text files containing the corpus. |
opath |
A string specifying the path to put all the outputs in. |
delim |
A number (1 or 0). 0 means files are concatenated as they are; 1 means newlines within each document are replaced with spaces and each document is delimited by a newline. |
ncores |
A number specifying the number of cores to use. |
clean_commands |
A string containing the combined parameters for running the cleaning script; refer to the clean_corpus and clean_file documentation. |
lemma |
**optional** A boolean; if TRUE, the corpus will be lemmatized. |
split |
**optional** A character ('c' or 'l', default = 'c') specifying whether the corpus should be split by memory size or by line count. |
size |
**optional** A number (default = 50,000) specifying the line count or the size in kilobytes of each chunk the corpus is split into. |
sparsity |
**optional** A number (default = 0.02) setting the threshold below which sparse words are removed; refer to the get_sparse documentation. |
abundance |
**optional** A number (default = 0.98) setting the threshold above which abundant words are removed; refer to the get_abundant documentation. |
verbose |
**optional** A boolean; if TRUE, more information is printed to the console. |
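To make the sparsity and abundance arguments more concrete, here is a minimal base R sketch of a document-frequency filter. It rests on an assumption that this page does not confirm: that both thresholds are fractions of documents containing a term, so terms appearing in fewer than `sparsity` or more than `abundance` of the documents are dropped. The exact definitions are in the get_sparse and get_abundant documentation. `doc_frequency` and `filter_terms` are hypothetical names, not package functions.

```r
# ASSUMPTION: thresholds are fractions of documents that contain a term.
# Hypothetical helper: fraction of documents in which each term occurs.
doc_frequency <- function(docs) {
  tokens <- lapply(strsplit(docs, "\\s+"), unique)
  counts <- table(unlist(tokens))
  setNames(as.vector(counts) / length(docs), names(counts))
}

# Hypothetical helper: drop terms that are too rare or too common.
filter_terms <- function(docs, sparsity = 0.02, abundance = 0.98) {
  freq <- doc_frequency(docs)
  drop <- names(freq)[freq < sparsity | freq > abundance]
  vapply(strsplit(docs, "\\s+"), function(tok) {
    paste(tok[!tok %in% drop], collapse = " ")
  }, character(1))
}

docs <- c("the cat sat", "the dog ran", "the cat ran", "the zebra sat")
filtered <- filter_terms(docs, sparsity = 0.3, abundance = 0.9)
print(filtered)
```

In this toy corpus "the" occurs in every document and is removed as abundant, while "dog" and "zebra" each occur in a quarter of the documents and are removed as sparse.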
A string giving the path to the cleaned corpus file, which contains one document on each line.
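Since the result is a path to a file with one document per line, it can be loaded with base R. The sketch below uses a temporary file as a stand-in for the cleaned corpus; in practice `cleaned_path` would be the string returned by the function. The whitespace tokenization is an illustrative assumption about what a downstream step might do.

```r
# A temporary file stands in for the cleaned corpus file here.
cleaned_path <- tempfile(fileext = ".txt")
writeLines(c("cat sat mat", "dog ran park"), cleaned_path)

# Each element of `docs` is one document.
docs <- readLines(cleaned_path)

# Split each document into tokens on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)

print(length(docs))
print(lengths(tokens))
```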
## Not run:
pipeline("/corpus/", "/opath/", 1, 20, "-lnprsd --maintain-newlines", split="l", size=2000)
pipeline("/corpus/", "/opath/", 1, 20, "-lnprsd --maintain-newlines", split="c", size=200000)
pipeline("/corpus/", "/opath/", 1, 20, "-lnprsd --maintain-newlines", lemma=FALSE, sparsity=0.04)
## End(Not run)