textProcessor: Process a vector of raw texts
In stm: Estimation of the Structural Topic Model

Description

Takes a vector of raw texts (in a variety of languages) and performs basic text-processing operations. This function is essentially a wrapper around the tm package, in which various user-specified options can be selected.

Usage

textProcessor(
  documents,
  metadata = NULL,
  lowercase = TRUE,
  removestopwords = TRUE,
  removenumbers = TRUE,
  removepunctuation = TRUE,
  ucp = FALSE,
  stem = TRUE,
  wordLengths = c(3, Inf),
  sparselevel = 1,
  language = "en",
  verbose = TRUE,
  onlycharacter = FALSE,
  striphtml = FALSE,
  customstopwords = NULL,
  custompunctuation = NULL,
  v1 = FALSE
)

Arguments

`documents`	The documents to be processed. A character vector where each entry is the full text of a document (if passed as a different type it will attempt to convert to a character vector).
`metadata`	Additional data about the documents. Specifically a `data.frame` or `matrix` object with number of rows equal to the number of documents and one column per meta-data type. The column names are used to label the metadata. The metadata do not affect the text processing, but providing the metadata object ensures that if documents are dropped the corresponding metadata rows are dropped as well.
`lowercase`	Whether all words should be converted to lower case. Defaults to TRUE.
`removestopwords`	Whether stop words should be removed using the SMART stopword list (in English) or the snowball stopword lists (for all other languages). Defaults to TRUE.
`removenumbers`	Whether numbers should be removed. Defaults to TRUE.
`removepunctuation`	Whether punctuation should be removed. Defaults to TRUE.
`ucp`	When TRUE, passes `ucp = TRUE` to `tm::removePunctuation`, which removes a more general set of punctuation (the Unicode general category P). Defaults to `FALSE`.
`stem`	Whether or not to stem words. Defaults to TRUE.
`wordLengths`	From the tm package. An integer vector of length 2. Words shorter than the minimum word length `wordLengths[1]` or longer than the maximum word length `wordLengths[2]` are discarded. Defaults to `c(3, Inf)`, i.e., a minimum word length of 3 characters.
`sparselevel`	Removes terms where at least sparselevel proportion of the entries are 0. Defaults to 1 which effectively turns the feature off.
`language`	Language used for processing. Defaults to English. `tm` uses the `SnowballC` stemmer, which as of version 0.5 supports "danish dutch english finnish french german hungarian italian norwegian portuguese romanian russian spanish swedish turkish". These can be specified as any one of the above strings or by the three-letter ISO-639 codes. You can also set language to "na" if you want to leave it deliberately unspecified (see the documentation in `tm`). Note that the languages listed here may not all have accompanying stopwords. However, if you have your own stopword list you can use `customstopwords` below.
`verbose`	If TRUE, prints information as it processes. Defaults to TRUE.
`onlycharacter`	When TRUE, runs a regular expression substitution to replace all non-alphanumeric characters. These characters can crash textProcessor on some operating systems. May remove foreign characters depending on encoding. Defaults to FALSE. Runs before the call to the tm package.
`striphtml`	When TRUE, runs a regular expression substitution to strip html contained within <>. Defaults to FALSE. Runs before the call to the tm package.
`customstopwords`	A character vector containing words to be removed. Defaults to NULL, which does not remove any additional words. This argument is primarily for easy removal of application-specific stopwords. Note that, as with standard stopwords, these are removed after converting everything to lower case but before removing numbers, punctuation or stemming. Thus words to be removed should be all lower case but otherwise complete.
`custompunctuation`	A character vector containing any characters to be removed immediately after standard punctuation removal. This argument exists primarily for easy removal of application-specific punctuation not caught by the punctuation filter (although see also the `ucp` argument to turn on a stronger punctuation filter). This can in theory be used to remove any characters you don't want in the text for some reason. In practice, the function collapses the character vector to one string and puts square brackets around it in order to make a pattern that can be matched and replaced with `gsub` at the punctuation removal stage. If the `custompunctuation` vector is length 1 and the first element is a left square bracket, the function assumes that you have passed a regular expression and passes that directly along to `gsub`.
`v1`	A logical which defaults to `FALSE`. If set to `TRUE` it will use the ordering of operations we used in Version 1.0 of the package.
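
The pattern construction described for `custompunctuation` can be sketched in base R. This is only an illustration of the documented behavior, not the package's source: the helper name `build_punct_pattern` is hypothetical, and the check for a leading left square bracket is an assumption about how the "already a regular expression" case might be detected.

```r
# Hypothetical helper mirroring the documented custompunctuation logic.
build_punct_pattern <- function(custompunctuation) {
  # A single element starting with "[" is treated as a ready-made regex.
  if (length(custompunctuation) == 1 && startsWith(custompunctuation, "[")) {
    return(custompunctuation)
  }
  # Otherwise collapse the characters and wrap them in a character class.
  paste0("[", paste(custompunctuation, collapse = ""), "]")
}

pattern <- build_punct_pattern(c(".", "?", "!"))  # "[.?!]"
cleaned <- gsub(pattern, "", "co.rr?ec!t")        # "correct"
```

Because the characters end up inside a regular-expression character class, characters with special meaning there (such as `]`, `^` or `-`) may not behave as literal characters; in that case passing a full bracketed expression yourself is probably the safer route.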

Details

This function is designed to provide a convenient and quick way to process a relatively small volume of texts for analysis with the package. It is designed to quickly ingest data in a simple form like a spreadsheet where each document sits in a single cell. If you have texts more complicated than a spreadsheet, we recommend you check out the excellent readtext package.

The processor always strips extra white space but all other processing options are optional. Stemming uses the snowball stemmers and supports a wide variety of languages. Words in the vocabulary can be dropped due to sparsity and stop word removal. If a document no longer contains any words it is dropped from the output. Specifying meta-data is a convenient way to make sure the appropriate rows are dropped from the corresponding metadata file.
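
As a rough illustration of how supplying metadata keeps the rows aligned when a document is dropped, consider the sketch below. The toy texts and the `covariates` data frame are hypothetical, and the example assumes that stm and its text-processing dependencies (tm, SnowballC) are installed. The second text consists only of English stop words, so it is expected to be empty after processing.

```r
library(stm)  # assumes stm, tm and SnowballC are installed

# Hypothetical toy corpus; the second entry holds only stop words.
texts <- c("Immigration policy dominated the televised debate",
           "the and of",
           "Economic growth slowed sharply this quarter")
covariates <- data.frame(respondent = c("r1", "r2", "r3"),
                         treatment  = c(1, 0, 1))

processed <- textProcessor(texts, metadata = covariates, verbose = FALSE)

# The retained documents and the returned metadata should have matching sizes.
length(processed$documents)
nrow(processed$meta)
```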

When the option `sparselevel` is set to a number other than 1, infrequently appearing words are removed. When a term is removed from the vocabulary a message will print to the screen (as long as `verbose` has not been set to FALSE). The message indicates the number of terms removed (that is, the number of vocabulary entries) as well as the number of tokens removed (appearances of individual words). The function prepDocuments provides additional methods to prune infrequent words. In general the functionality there should be preferred.
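
The difference between terms removed and tokens removed can be seen in a small base R sketch. The document-term matrix and the threshold below are hypothetical, and the code only mimics the documented rule (drop terms whose share of zero entries is at least the threshold); it is not the tm routine that textProcessor relies on.

```r
# Hypothetical document-term matrix: 4 documents (rows), 3 terms (columns).
dtm <- matrix(
  c(1, 0, 0, 0,   # counts for "rare"
    2, 1, 0, 1,   # counts for "common"
    0, 0, 3, 0),  # counts for "once"
  nrow = 4,
  dimnames = list(paste0("doc", 1:4), c("rare", "common", "once"))
)

sparselevel <- 0.75  # illustrative threshold

# Share of documents in which each term does not appear.
zero_share <- colMeans(dtm == 0)
is_sparse  <- zero_share >= sparselevel

terms_removed  <- sum(is_sparse)         # vocabulary entries dropped
tokens_removed <- sum(dtm[, is_sparse])  # individual word occurrences dropped
kept_terms     <- colnames(dtm)[!is_sparse]
```

Here two terms are dropped, but four tokens disappear, because "once" occurs three times in the single document that contains it.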

We emphasize that this function is a convenience wrapper around the excellent tm package functionality without which it wouldn't be possible.

Value

`documents`	A list containing the documents in the stm format.
`vocab`	Character vector of vocabulary.
`meta`	Data frame or matrix containing the user-supplied metadata for the retained documents.

References

Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-9.1.

Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software 25(5): 1-54.

Examples




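library(stm)

# Inspect the example data shipped with stm.
head(gadarian)

# Process the data for analysis.
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
meta <- temp$meta
vocab <- temp$vocab
docs <- temp$documents

# Prepare the documents for modeling; prepDocuments can also prune infrequent words.
out <- prepDocuments(docs, vocab, meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta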


# Example of custom punctuation removal.
docs <- c("co.rr?ec!t")
textProcessor(docs, custompunctuation = c(".", "?", "!"),
              removepunctuation = FALSE)$vocab
# Note that the vocabulary should now say "correct".

stm documentation built on June 24, 2024, 5:18 p.m.