build.corpus: Build a corpus that can be used in the textreg call.
In textreg: n-Gram Text Regression, aka Concise Comparative Summarization


Pre-building a corpus lets you call textreg multiple times without repeating the initial data processing (e.g., if you want to explore different ban lists or regularization parameters).

build.corpus(corpus, labeling, banned = NULL, verbosity = 1, token.type = "word")

`corpus`	A list of strings or a corpus from the `tm` package.
`labeling`	A vector of +1/-1 or TRUE/FALSE indicating which documents are considered relevant and which are baseline. A +1/-1 vector can also contain 0, which means the document is dropped.
`banned`	List of words that should be dropped from consideration.
`verbosity`	Level of output. 0 is no printed output.
`token.type`	"word" or "character" as tokens.
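
A minimal sketch of preparing the arguments described above, using base R for the inputs. The toy documents, the labels, and the ban list are made up for illustration, and the sketch assumes a plain character vector is accepted as the "list of strings". The call itself is guarded so the block still runs where textreg is not installed.

```r
# Hypothetical toy data: any text pre-processing (here, lower-casing) is
# applied before the corpus is built.
docs <- tolower(c(
  "The cat sat on the mat",
  "A dog chased the cat",
  "Stock prices fell sharply",
  "Markets rallied on Friday",
  "Filler text with no label"
))

# Recode TRUE/FALSE/NA as +1 (relevant), -1 (baseline), 0 (drop the document).
relevant <- c(TRUE, TRUE, FALSE, FALSE, NA)
labeling <- ifelse(is.na(relevant), 0, ifelse(relevant, 1, -1))

# Words to exclude from consideration.
banned <- c("the", "a", "on")

# One label per document.
stopifnot(length(docs) == length(labeling))

# Build the corpus only if the package is available.
if (requireNamespace("textreg", quietly = TRUE)) {
  corp <- textreg::build.corpus(docs, labeling, banned = banned,
                                verbosity = 0, token.type = "word")
  print(class(corp))
}
```

Recoding unlabeled documents to 0 keeps the corpus and the labeling the same length while still excluding those documents.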

See the bathtub vignette for more complete discussion of this method and the options you might pass to it.

A textreg.corpus object is not a tm-style corpus. In particular, all text pre-processing should be applied to the data before building the textreg.corpus object.

Returns a textreg.corpus object.

Unfortunately, separating the textreg call from the build.corpus call is not quite as clean as one would hope. The build.corpus call moves the text into C++ memory, but because of the way the search tree is built for the regression, it is hard to salvage the tree across runs, so pre-building is of limited use. In particular, the labeling and banned words cannot be easily changed. Future versions of the package would ideally remedy this.
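
To illustrate the reuse pattern within the limits noted above, the sketch below builds the corpus once and varies only the regularization parameter. It rests on assumptions not stated on this page: that `textreg` accepts the pre-built object as its first argument with the labeling and ban list omitted, and that its regularization argument is named `C`. The data and the grid of values are hypothetical.

```r
# Hypothetical toy data, already pre-processed (lower-cased).
docs <- c("the cat sat on the mat", "a dog chased the cat",
          "stock prices fell sharply", "markets rallied on friday")
labeling <- c(1, 1, -1, -1)

# Hypothetical grid of regularization values to explore.
C_grid <- c(0.5, 1, 2)
fits <- vector("list", length(C_grid))

if (requireNamespace("textreg", quietly = TRUE)) {
  # Labeling and banned words are fixed here, at build time.
  corp <- textreg::build.corpus(docs, labeling, banned = NULL, verbosity = 0)

  # Assumed call form: pre-built corpus first, only C changes between runs.
  for (i in seq_along(C_grid)) {
    fits[[i]] <- textreg::textreg(corp, C = C_grid[i], verbosity = 0)
  }
}
```

Changing the labeling or the ban list would, per the note above, call for a fresh build.corpus call rather than reuse of `corp`.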