JSTOR_dtmofnouns: Make a Document Term Matrix containing only nouns


Description

This function does part-of-speech tagging and removes every part of speech except common nouns (names, i.e. proper nouns, are removed too). It also removes punctuation, numbers, words with fewer than three characters, stopwords and unusual characters (characters not in ISO-8859-1, i.e. outside the Latin-1 character set). For use with JSTOR's Data for Research datasets (http://dfr.jstor.org/).

This function uses the stoplist in the tm package. The location of tm's English stopwords list can be found by entering this at the R prompt: paste0(.libPaths()[1], "/tm/stopwords/english.dat")

Note that the part-of-speech tagging can result in the removal of words of interest. Currently I'm not sure how to keep those words.
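The non-tagging clean-up steps listed above can be illustrated with a small base R sketch. This is not the code JSTORr runs: clean_terms is a hypothetical helper, the three-word stoplist is a stand-in for tm's English list, the order of the steps is an assumption, and part-of-speech tagging is left out entirely.

```r
# Hypothetical helper, not part of JSTORr: applies the clean-up rules
# named in the Description to a character vector of terms.
clean_terms <- function(terms, stoplist) {
  # Drop terms containing characters outside ISO-8859-1 (code points > 255).
  is_latin1 <- vapply(terms, function(w) all(utf8ToInt(w) <= 255L),
                      logical(1), USE.NAMES = FALSE)
  terms <- tolower(terms[is_latin1])
  # Strip punctuation and digits.
  terms <- gsub("[[:punct:][:digit:]]", "", terms)
  # Drop words with fewer than three characters (including emptied ones).
  terms <- terms[nchar(terms) >= 3]
  # Drop stopwords.
  terms[!terms %in% stoplist]
}

# Illustrative stoplist only; tm's real list is much longer.
stoplist <- c("the", "and", "for")
terms <- c("The", "pottery", "1987", "of", "sites,", "and",
           "\u4e2d\u6587", "excavation")

clean_terms(terms, stoplist)
# [1] "pottery"    "sites"      "excavation"
```

Because the sketch does no tagging, a verb or adjective that survives these filters would still be kept here, whereas JSTOR_dtmofnouns with POStag = TRUE would remove it.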

Usage

JSTOR_dtmofnouns(unpack1grams, word = NULL, sparse = 1, POStag = TRUE)

Arguments

unpack1grams

object returned by the function JSTOR_unpack1grams.

word

Optional word or vector of words to subset the documents by, i.e. make a document term matrix containing only documents in which this word (or words) appears at least once.

sparse

A numeric for the maximal allowed sparsity; the default is 1 (i.e. no sparse terms are removed). Removes sparse terms from the document term matrix; see help(removeSparseTerms) for more details. Values close to 1 remove few terms and leave a sparse matrix; values close to zero remove many terms and leave a dense matrix. It may be useful to reduce sparseness if the matrix is too big to manipulate in memory or if processing times are long.

POStag

Logical. Whether to do part-of-speech tagging to identify and remove non-nouns. Default is TRUE; set it to FALSE to skip tagging and speed things up when working interactively with large numbers of documents.
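To make the word and sparse arguments more concrete, here is a base R sketch on a toy count matrix. Both helpers (subset_by_word, drop_sparse_terms) are hypothetical and operate on a plain matrix rather than a tm object. The sketch assumes that a vector of words keeps documents containing any of them, and that a term is kept when the proportion of documents lacking it is below sparse, which is how help(removeSparseTerms) describes the rule.

```r
# Toy document-term counts: rows are documents, columns are terms.
dtm <- matrix(
  c(2, 0, 1, 0,
    0, 1, 1, 0,
    1, 0, 3, 0,
    0, 0, 1, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("pottery", "bronze", "site", "tomb"))
)

# Hypothetical: keep documents in which any of `word` appears at least once.
subset_by_word <- function(m, word) {
  m[rowSums(m[, word, drop = FALSE] > 0) > 0, , drop = FALSE]
}

# Hypothetical: keep terms whose share of empty documents is below `sparse`.
drop_sparse_terms <- function(m, sparse) {
  sparsity <- colMeans(m == 0)
  m[, sparsity < sparse, drop = FALSE]
}

rownames(subset_by_word(dtm, "pottery"))
# [1] "doc1" "doc3"

colnames(drop_sparse_terms(dtm, 0.6))
# [1] "pottery" "site"
```

Here "bronze" and "tomb" are each missing from three of four documents (sparsity 0.75), so a threshold of 0.6 drops them, while a threshold close to 1 would keep all four terms.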

Value

Returns a Document Term Matrix of the documents, restricted to nouns when POStag = TRUE, ready for more advanced text mining and topic modelling.

Examples

## nouns <- JSTOR_dtmofnouns(unpack1grams) 

benmarwick/JSTORr documentation built on May 12, 2019, 12:59 p.m.