Preprocesses a corpus of texts into a document-feature matrix (dfm) in 128 different ways.
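The figure of 128 is what a full factorial design over seven binary preprocessing choices produces (2^7 = 128). A minimal sketch in base R, assuming seven on/off steps. Only n-gram extraction and infrequent-term removal are named on this page, so the other five step names below are illustrative assumptions rather than the package's documented labels:

```r
# Hypothetical step names: only n-gram extraction and infrequent-term
# removal are documented on this page; the rest are assumed for illustration.
steps <- c("remove_punctuation", "remove_numbers", "lowercase", "stem",
           "remove_stopwords", "remove_infrequent_terms", "use_ngrams")

# One TRUE/FALSE column per step; expand.grid() enumerates every combination.
choices <- expand.grid(
  rep(list(c(TRUE, FALSE)), length(steps)),
  KEEP.OUT.ATTRS = FALSE
)
names(choices) <- steps

nrow(choices)  # number of combinations, 2^7
```

Each row of `choices` corresponds to one way of preprocessing the corpus, which is the sense in which the preprocessing is factorial.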
text
A vector of strings (one per document) or a quanteda corpus object from which to form a document-term matrix.
use_ngrams
Logical indicating whether to extract 1-, 2-, and 3-grams from the text as an additional preprocessing step. Defaults to TRUE.
infrequent_term_threshold
The proportion of documents below which a term is treated as infrequent and filtered out. Defaults to 0.01 (terms that appear in fewer than 1 percent of documents are removed).
parallel
Logical indicating whether factorial preprocessing should be performed in parallel. Defaults to FALSE.
cores
Number of cores to use for parallel preprocessing. Defaults to 1; can be set to any number less than or equal to the number of cores on one's computer.
intermediate_directory
Optional path to a directory where each dfm will be saved as an intermediate step. File names follow the convention intermediate_dfm_i.Rdata, where i is the index of the combination of preprocessing choices. If return_results = TRUE (the default), the function then attempts to read all of the dfms back into a list; if return_results = FALSE, the function call simply ends. This is useful when preprocessing a corpus whose full dfm list would be impractical to work with because of its size.
parameterization_range
Defaults to NULL, but can be set to a numeric vector of indexes identifying the combinations of preprocessing decisions to run. This can be used to restart a large analysis after an interruption such as a power failure.
return_results
Logical indicating whether the list of dfms should be returned. Defaults to TRUE; can be set to FALSE to prevent an overly large dfm list from being created.
verbose
Logical indicating whether more information should be printed to the screen to let the user know about progress in preprocessing. Defaults to TRUE.
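To make the meaning of infrequent_term_threshold concrete, the following base R sketch applies a proportion threshold to a toy document-term matrix. It illustrates the general idea of filtering by document proportion and is not taken from the package's internal code; the matrix, the term names, and the threshold of 0.3 are made up for the example:

```r
# Toy document-term matrix: 4 documents (rows) by 3 terms (columns).
dtm <- matrix(
  c(2, 0, 1,
    1, 0, 1,
    3, 1, 0,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("tax", "zebra", "health"))
)

# Deliberately large threshold so that the toy example filters something.
threshold <- 0.3

# Proportion of documents in which each term appears at least once.
doc_proportion <- colMeans(dtm > 0)

# Keep only terms that appear in at least `threshold` of the documents.
filtered <- dtm[, doc_proportion >= threshold, drop = FALSE]
colnames(filtered)
```

Here "zebra" appears in one of four documents (a proportion of 0.25) and is dropped, while "tax" and "health" are kept.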
Returns a list object containing the document-term matrices produced under each combination of preprocessing choices.
## Not run:
# load the package
library(preText)
# load in the data
data("UK_Manifestos")
# preprocess data
preprocessed_documents <- factorial_preprocessing(
    UK_Manifestos,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.02,
    verbose = TRUE)
## End(Not run)
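When intermediate_directory is used, the documented file-name convention makes it possible to work out which combinations still need to be run after an interruption. A sketch in base R, assuming 128 combinations in total and a hypothetical directory filled with placeholder files in place of real dfms:

```r
# Hypothetical directory standing in for intermediate_directory.
dir_path <- file.path(tempdir(), "pretext_intermediate")
dir.create(dir_path, showWarnings = FALSE)

# Simulate a run that was interrupted after combinations 1 to 5 were saved.
for (i in 1:5) {
  placeholder <- i  # stands in for a real dfm
  save(placeholder,
       file = file.path(dir_path, paste0("intermediate_dfm_", i, ".Rdata")))
}

# Recover the indexes already on disk from the file names.
files <- list.files(dir_path, pattern = "^intermediate_dfm_[0-9]+\\.Rdata$")
done <- as.integer(sub("^intermediate_dfm_([0-9]+)\\.Rdata$", "\\1", files))

# Assumed total of 128 combinations (the count given in the description).
remaining <- setdiff(1:128, done)
range(remaining)
```

The `remaining` vector could then be supplied as parameterization_range, together with the same intermediate_directory, so that only the unfinished combinations are run again.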