nonword | R Documentation |
This function removes everything that is not an alphanumeric word/token from a corpus of documents. An option controls whether numbers are removed as well. Both the regular expression pattern and the replacement string (default: a white space) can be overridden.
nonword(corpus, numbers = FALSE, ..., pattern = NULL, replacement = " ")

## S3 method for class 'list'
nonword(corpus, numbers = FALSE, ..., pattern = NULL, replacement = " ")

## S3 method for class 'VCorpus'
nonword(corpus, numbers = FALSE, ..., pattern = NULL, replacement = " ")

## S3 method for class 'character'
nonword(corpus, numbers = FALSE, ..., pattern = NULL, replacement = " ")

## Default S3 method:
nonword(corpus, numbers = FALSE, ..., pattern = NULL, replacement = " ")
corpus |
a compatible object storing documents. Methods are provided for lists (including corpus-lists of tokenized documents), character vectors, and VCorpus objects |
numbers |
(lgl) if TRUE, numbers are removed as well (default: FALSE) |
... |
Additional options |
pattern |
(chr) an alternative regular expression identifying what to remove (i.e., to substitute with replacement) |
replacement |
(chr) the string used to substitute the parts that are eliminated. Default is a white space (" ") |
An object of the same class as the input, with documents rewritten so that only "words" are retained.
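The "same class in, same class out" behaviour follows naturally from S3 dispatch. The sketch below is only an illustration of that idea, not the package's source: strip_nonword is a hypothetical generic, and the "[^[:alnum:]]+" pattern is an assumed stand-in for whatever default nonword() actually uses.

```r
# Hypothetical generic (not the package's nonword()).
strip_nonword <- function(x, ...) UseMethod("strip_nonword")

# Character vectors: replace each run of non-alphanumeric characters.
# The pattern is an assumption made for this sketch.
strip_nonword.character <- function(x, ..., replacement = " ") {
  gsub("[^[:alnum:]]+", replacement, x)
}

# Lists: re-dispatch on every element, so a list goes in and a list
# (with the same names) comes out.
strip_nonword.list <- function(x, ...) {
  lapply(x, strip_nonword, ...)
}

docs <- list(a = "anti-angiogenesis", b = c("hell0", "w.rld"))
out <- strip_nonword(docs)
str(out)
```

Because the list method simply re-dispatches on each element, nested lists of character vectors would be handled by the same two methods.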
data(liu_corpus)

nonword('hell0 w.rld')
nonword('hell0 w.rld', numbers = TRUE)          # remove also numbers
nonword('hell0 w.rld', replacement = '*')       # use "*" instead of white space
nonword('hell0 w.rld', pattern = 'w[^\\s]+')    # anything starting with "w"

nonword(liu_corpus)$content[[1]]$content        # "-" removed in "anti-angiogenesis"
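To see how the numbers, pattern and replacement arguments could interact without loading the package, here is a minimal base R sketch. nonword_sketch is a hypothetical helper, and both default patterns are assumptions chosen for illustration; the real nonword() may use different regular expressions and may return different results.

```r
# Hypothetical stand-in for the character method; not the package code.
nonword_sketch <- function(x, numbers = FALSE, pattern = NULL,
                           replacement = " ") {
  if (is.null(pattern)) {
    # Assumed defaults: keep letters and digits, or letters only
    # when numbers = TRUE.
    pattern <- if (numbers) "[^[:alpha:]]+" else "[^[:alnum:]]+"
  }
  # perl = TRUE so that escapes such as \\s work inside brackets.
  gsub(pattern, replacement, x, perl = TRUE)
}

nonword_sketch("hell0 w.rld")                     # "hell0 w rld"
nonword_sketch("hell0 w.rld", numbers = TRUE)     # "hell w rld"
nonword_sketch("hell0 w.rld", replacement = "*")  # "hell0*w*rld"
nonword_sketch("hell0 w.rld", pattern = "w[^\\s]+")
```

In this sketch a user-supplied pattern takes precedence over numbers, which is one plausible reading of the argument descriptions above rather than documented behaviour.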