nonword: Eliminate non-words

nonword R Documentation

Eliminate non-words

Description

This function removes everything that is not an alphanumeric word/token from a corpus of documents. An option controls whether numbers are removed too. It is also possible to override both the pattern used to identify what is removed and the replacement string (default is a white space).

Usage

nonword(corpus, numbers = FALSE, ..., pattern = NULL, replacement = " ")

## S3 method for class 'list'
nonword(corpus, numbers = FALSE, ..., pattern = NULL,
  replacement = " ")

## S3 method for class 'VCorpus'
nonword(corpus, numbers = FALSE, ..., pattern = NULL,
  replacement = " ")

## S3 method for class 'character'
nonword(corpus, numbers = FALSE, ..., pattern = NULL,
  replacement = " ")

## Default S3 method:
nonword(corpus, numbers = FALSE, ..., pattern = NULL,
  replacement = " ")

Arguments

corpus

a compatible object storing documents (currently: lists and corpus-lists of (tokenized) documents, character vectors, and VCorpus objects)

numbers

(lgl) if TRUE, numbers are removed as well (default FALSE)

...

Additional options

pattern

(chr) an alternative regular expression; everything that matches it is removed (i.e., substituted with replacement). Default is NULL. If not NULL, the numbers option is ignored.

replacement

(chr) the string used to substitute the parts that are eliminated. Default is ' '.
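
The interplay of numbers, pattern and replacement described above can be sketched in base R. This is only an illustration: sketch_nonword is a hypothetical helper, and the two default regular expressions are assumptions, since this page does not state which patterns the package uses internally.

```r
# Hypothetical helper, not part of the package: mimics the documented
# interplay of `numbers`, `pattern` and `replacement` with base R only.
sketch_nonword <- function(x, numbers = FALSE, pattern = NULL,
                           replacement = " ") {
  if (is.null(pattern)) {
    # Assumed defaults: the package's real regular expressions are not
    # given in this page.
    pattern <- if (numbers) "[^[:alpha:]]+" else "[^[:alnum:]]+"
  }
  # A non-NULL `pattern` is used as is, so `numbers` has no effect.
  gsub(pattern, replacement, x)
}

sketch_nonword("hell0 w.rld")                      # default behaviour
sketch_nonword("hell0 w.rld", numbers = TRUE)      # digits removed too
sketch_nonword("hell0 w.rld", replacement = "*")   # custom replacement
sketch_nonword("hell0 w.rld", pattern = "w[^ ]+",
               numbers = TRUE)                     # `numbers` is ignored
```

Resolving the pattern first and then making a single gsub call keeps the "pattern overrides numbers" rule in one place; the actual function may be organised differently.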

Value

an object of the same class as the input, with documents written with only "words" retained.
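
The class-preserving behaviour, together with the S3 methods listed under Usage, might be organised along the following lines. This is a minimal sketch under stated assumptions: strip_nonword is a hypothetical generic (named differently to avoid clashing with nonword), the cleaning pattern is assumed, the VCorpus method is left out because it would need the tm package, and the default method is assumed to stop with an error.

```r
# Hypothetical generic, not the package's own code.
strip_nonword <- function(corpus, ...) UseMethod("strip_nonword")

# Character vectors: clean every element, return a character vector.
strip_nonword.character <- function(corpus, ..., replacement = " ") {
  gsub("[^[:alnum:]]+", replacement, corpus)  # assumed pattern
}

# Lists of documents: clean each element, return a list.
strip_nonword.list <- function(corpus, ...) {
  lapply(corpus, strip_nonword, ...)
}

# Anything else: fail with an informative message (assumed behaviour).
strip_nonword.default <- function(corpus, ...) {
  stop("unsupported class: ", paste(class(corpus), collapse = "/"))
}

docs <- list(c("anti-angiogenesis", "in-vitro"), "p<0.05")
cleaned <- strip_nonword(docs)
class(cleaned)   # a list goes in, a list comes out
```

Because the list method delegates to the generic for each element, nested structures are handled by dispatch instead of explicit type checks.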

Examples

data(liu_corpus)

nonword('hell0 w.rld')
nonword('hell0 w.rld', numbers = TRUE)                  # also remove numbers
nonword('hell0 w.rld', replacement = '*')    # use "*" instead of white space
nonword('hell0 w.rld', pattern = 'w[^\\s]+')     # anything starting with "w"

nonword(liu_corpus)$content[[1]]$content # "-" removed in "anti-angiogenesis"

UBESP-DCTV/costumer documentation built on Feb. 1, 2023, 4:52 a.m.