Description
This function performs part-of-speech tagging and keeps only common nouns, removing all parts of speech that are not non-name nouns. It also removes punctuation, numbers, words with fewer than three characters, stopwords, and unusual characters (characters not in ISO-8859-1, i.e. outside Latin-1/ASCII). It is intended for use with JSTOR's Data for Research datasets (http://dfr.jstor.org/).

This function uses the stoplist in the tm package. The location of tm's English stopword list can be found by entering this at the R prompt: paste0(.libPaths()[1], "/tm/stopwords/english.dat")

Note that the part-of-speech tagging can result in the removal of words of interest. Currently I'm not sure how to keep those words.
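The paste0() path above assumes tm is installed in the first library on .libPaths(). A more portable way to locate the same file is system.file(), a minimal sketch (guarded so it also works when tm is not installed):

```r
# Locate tm's English stopword list without hard-coding the library path.
# system.file() returns "" when the tm package is not installed.
stopword_file <- system.file("stopwords", "english.dat", package = "tm")
if (nzchar(stopword_file)) {
  head(readLines(stopword_file))  # peek at the first few stopwords
}
```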
Usage

JSTOR_dtmofnouns(unpack1grams, word = NULL, sparse = 1, POStag = TRUE)
Arguments

unpack1grams
  Object returned by the function JSTOR_unpack1grams.

word
  Optional word or vector of words to subset the documents by, i.e. build a document term matrix containing only documents in which the word (or words) appears at least once.

sparse
  A numeric giving the maximum allowed sparsity; the default is 1 (no sparsing applied). Removes sparse terms from the document term matrix; see help(removeSparseTerms) for details. Values close to 1 result in a sparse matrix, values close to zero result in a dense matrix. Reducing sparsity may be useful if the matrix is too big to manipulate in memory or if processing times are long.

POStag
  Logical. If TRUE (the default), do part-of-speech tagging to identify and remove non-nouns. Setting it to FALSE speeds things up when working interactively with large numbers of documents.
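The effect of the sparse argument can be illustrated with a small base-R sketch of the thresholding that removeSparseTerms applies; the matrix, document names, and term names here are made up:

```r
# Illustration (base R only) of how the `sparse` threshold prunes terms
# from a document-term matrix. Rows are documents, columns are terms.
dtm <- matrix(c(1, 0, 0,
                2, 1, 0,
                1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3),
                              c("common", "rarer", "rare")))
# "common" appears in 3 documents, "rarer" in 2, "rare" in 1.
doc_freq <- colSums(dtm > 0)          # documents containing each term
sparsity <- 1 - doc_freq / nrow(dtm)  # fraction of documents missing the term

# A term survives when its sparsity is strictly below the threshold.
keep_terms <- function(sparse) colnames(dtm)[sparsity < sparse]

keep_terms(1)    # the default: every term is kept (no sparsing)
keep_terms(0.4)  # drops "rare", which is absent from 2 of 3 documents
```

With sparse = 1 nothing is pruned, which matches the function's default of applying no sparsing.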
Value

Returns a Document Term Matrix, ready for more advanced text mining and topic modelling.
Examples

## nouns <- JSTOR_dtmofnouns(unpack1grams)
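A few hedged variants showing the optional arguments described above; the word value and sparse threshold here are illustrative, and the calls are commented out because they require a JSTOR DfR dataset unpacked by JSTOR_unpack1grams:

```r
## Keep only documents containing "gender" and prune very sparse terms
## (the word and the 0.75 threshold are made-up example values):
## nouns <- JSTOR_dtmofnouns(unpack1grams, word = "gender", sparse = 0.75)

## Skip part-of-speech tagging for a faster interactive run:
## nouns <- JSTOR_dtmofnouns(unpack1grams, POStag = FALSE)
```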