pipeline: Text Processing Pipeline
In avkoehl/textprocessingDSI: Clean an arbitrarily large corpus for topic modelling over many cores

Description

Runs the full pipeline for cleaning a corpus and preparing it for topic modelling. Some arguments are mandatory and several are optional. It is designed to be the simplest way to process a corpus; for more control, run the individual pieces of the pipeline manually. The pipeline consists of the following steps:
1) combine the corpus into a single file (one document per line) and split it into equal-sized chunks
2) clean those chunks in parallel
3) lemmatize those chunks in parallel (when lemma is TRUE)
4) find and remove the 'sparse' and 'abundant' terms
5) recombine the cleaned corpus into a single file (one document per line)
6) delete the intermediate directories that were created
7) save the parameters used for cleaning in an info file in opath
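
The combine, split, clean and recombine steps above can be sketched in base R. This is only an illustration of the general pattern, not the package's implementation: `combine_files`, `split_into_chunks` and `clean_chunk` are hypothetical helpers, the cleaning rule is a toy one, and the lemmatization and sparse/abundant steps are left out.

```r
library(parallel)

# Hypothetical helpers for illustration; none of these are package functions.

# Step 1a: collapse each file to one line (what delim = 1 is described as doing).
combine_files <- function(paths) {
  vapply(paths, function(p) paste(readLines(p), collapse = " "),
         character(1), USE.NAMES = FALSE)
}

# Step 1b: split the documents into chunks of at most `size` lines.
split_into_chunks <- function(docs, size) {
  unname(split(docs, ceiling(seq_along(docs) / size)))
}

# Step 2: a toy cleaner (lowercase, keep only letters and spaces).
clean_chunk <- function(chunk) {
  chunk <- gsub("[^a-z ]", " ", tolower(chunk))
  trimws(gsub(" +", " ", chunk))
}

# Build a tiny corpus on disk so the sketch is self-contained.
corpus_dir <- tempfile("corpus")
dir.create(corpus_dir)
writeLines(c("Hello,", "World!"), file.path(corpus_dir, "a.txt"))
writeLines("Topic MODELS are fun.", file.path(corpus_dir, "b.txt"))
writeLines("Cores: 4", file.path(corpus_dir, "c.txt"))

paths <- sort(list.files(corpus_dir, full.names = TRUE))
docs <- combine_files(paths)
chunks <- split_into_chunks(docs, size = 2)

# mc.cores = 1 keeps the sketch portable; raise it on systems that support forking.
cleaned <- mclapply(chunks, clean_chunk, mc.cores = 1)

# Step 5: recombine into a single file, one document per line.
out_file <- file.path(tempdir(), "cleaned_corpus.txt")
writeLines(unlist(cleaned), out_file)
result <- readLines(out_file)
```

Each chunk is processed independently, which is what makes the cleaning step easy to spread over many cores.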

Usage

pipeline(ipath, opath, delim, ncores, clean_commands, lemma = TRUE,
  split = "c", size = 50000, sparsity = 0.02, abundance = 0.98,
  verbose = FALSE)

Arguments

`ipath`	A string specifying the path to the raw text files containing the corpus.
`opath`	A string specifying the path where all outputs are written.
`delim`	A number (0 or 1). 0 concatenates the files as they are; 1 replaces newlines with spaces and delimits each document with a newline.
`ncores`	A number specifying the number of cores to use.
`clean_commands`	A string containing the combined parameters for running the cleaning script; refer to the clean_corpus and clean_file documentation.
`lemma`	Optional. A boolean (default TRUE); if TRUE, the corpus is lemmatized.
`split`	Optional. A character, 'c' or 'l' (default 'c'), specifying whether the corpus is split by size ('c') or by line count ('l').
`size`	Optional. A number (default 50,000) specifying the line count or the size in kilobytes of each segment of the corpus.
`sparsity`	Optional. A number (default 0.02) setting the threshold below which words are treated as sparse and removed; refer to the get_sparse documentation.
`abundance`	Optional. A number (default 0.98) setting the threshold above which words are treated as abundant and removed; refer to the get_abundant documentation.
`verbose`	Optional. A boolean (default FALSE); if TRUE, more information is printed to the console.
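
One plausible reading of the `sparsity` and `abundance` thresholds is as bounds on document frequency, the share of documents in which a term occurs. The sketch below assumes that reading; the get_sparse and get_abundant documentation is authoritative for what the package actually computes. The thresholds are toy values chosen for a four-document corpus, not the pipeline defaults.

```r
docs <- c("the cat sat", "the dog sat", "the bird flew", "the cat flew")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Document frequency: share of documents in which each term occurs at least once.
doc_freq <- table(unlist(lapply(tokens, unique))) / length(docs)

# Toy thresholds (assumed semantics: drop terms below `sparsity` or above `abundance`).
sparsity <- 0.3
abundance <- 0.9

sparse_terms <- names(doc_freq)[doc_freq < sparsity]
abundant_terms <- names(doc_freq)[doc_freq > abundance]
drop <- c(sparse_terms, abundant_terms)

# Remove the flagged terms from every document.
filtered <- vapply(tokens,
                   function(tok) paste(tok[!tok %in% drop], collapse = " "),
                   character(1))
```

Here "dog" and "bird" each occur in one document out of four and fall below the lower threshold, while "the" occurs in every document and exceeds the upper one.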

Value

A string giving the path to the cleaned corpus file, which contains one document per line.
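
Because the result is a plain text file with one document per line, it can be read back with `readLines`. The sketch below writes a small stand-in file instead of calling `pipeline`, so `cleaned_path` and its contents are assumptions made for illustration.

```r
# Stand-in for the path returned by pipeline(); written here so the sketch runs on its own.
cleaned_path <- tempfile(fileext = ".txt")
writeLines(c("cat sit mat", "dog chase cat"), cleaned_path)

# One element per document, then one token vector per document.
docs <- readLines(cleaned_path)
tokens <- strsplit(docs, " ", fixed = TRUE)
n_docs <- length(docs)
```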

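Examples

## Not run: 
# Split by line count, 2000 lines per chunk
pipeline("/corpus/", "/opath/", 1, 20, "-lnprsd --maintain-newlines", split = "l", size = 2000)
# Split by size, 200000 kilobytes per chunk
pipeline("/corpus/", "/opath/", 1, 20, "-lnprsd --maintain-newlines", split = "c", size = 200000)
# Skip lemmatization and raise the sparsity threshold
pipeline("/corpus/", "/opath/", 1, 20, "-lnprsd --maintain-newlines", lemma = FALSE, sparsity = 0.04)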

## End(Not run)

avkoehl/textprocessingDSI documentation built on June 5, 2019, 7:41 p.m.
