prepare: Preprocess Document


View source: R/prepare.R

Description

Preprocess the document. Note that this modifies the object in place rather than returning a modified copy.
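In-place modification is unusual in R, where functions normally return a modified copy. The base-R sketch below is only an analogy for that behaviour, using an environment as a mutable container. The helpers new_doc() and lowercase_in_place() are hypothetical and are not part of textanalysis, and nothing here is meant to describe how the package stores documents internally.

```r
# Hypothetical document-like object backed by an environment.
# Environments have reference semantics, so changes made inside a
# function are visible to the caller.
new_doc <- function(text) {
  doc <- new.env()
  doc$text <- text
  doc
}

# Hypothetical in-place step: mutates doc instead of returning a copy.
lowercase_in_place <- function(doc) {
  doc$text <- tolower(doc$text)
  invisible(doc)
}

doc <- new_doc("Hello World")
lowercase_in_place(doc)  # result deliberately not assigned
doc$text                 # "hello world"
```

The practical consequence is the same as for prepare: the call is made for its side effect, so the original text is gone afterwards unless a copy was kept beforehand.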

Usage


Arguments

text

An object inheriting from class document or corpus.

...

Other special classes

remove_corrupt_utf8

Remove corrupt UTF-8 characters.

remove_case

Convert to lowercase.

strip_stopwords

Remove stopwords, e.g. "all", "almost", "alone".

strip_numbers

Remove numbers.

strip_html_tags

Remove HTML tags, including the style and script tags.

strip_punctuation

Remove punctuation.

remove_words

Remove the occurrences of the given words from the document.

strip_non_letters

Remove anything that is not a letter.

strip_sparse_terms

Remove sparse terms.

strip_frequent_terms

Remove frequent terms.

strip_articles

Remove articles: "a", "an", "the".

strip_indefinite_articles

Remove indefinite articles: "a", "an".

strip_definite_articles

Remove "the".

strip_preposition

Remove prepositions, e.g. "across", "around", "before".

strip_pronouns

Remove pronouns, e.g. "I", "you", "he", "she".

update_lexicon

Whether to update the lexicon of the corpus, see update_lexicon.

update_inverse_index

Whether to update the inverse index of the corpus, see update_inverse_index.
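To give a feel for what a few of these steps do to a piece of text, here is a rough base-R approximation of strip_html_tags, remove_case, strip_punctuation and strip_articles. It is a sketch only: sketch_prepare() is a hypothetical helper, the regular expressions are assumptions, and the package's own results may differ (for example, this version removes the tags themselves but not the contents of style or script elements).

```r
# Illustrative base-R approximations of a few preprocessing steps.
# sketch_prepare() is hypothetical and NOT part of textanalysis.
sketch_prepare <- function(x,
                           strip_html_tags = TRUE,
                           remove_case = TRUE,
                           strip_punctuation = TRUE,
                           strip_articles = FALSE) {
  if (strip_html_tags) {
    # drop anything that looks like <tag> or </tag>; must run before
    # punctuation removal, which would otherwise break the tags
    x <- gsub("<[^>]+>", "", x)
  }
  if (remove_case) {
    x <- tolower(x)
  }
  if (strip_punctuation) {
    x <- gsub("[[:punct:]]", "", x)
  }
  if (strip_articles) {
    # remove "a", "an", "the" as whole words
    x <- gsub("\\b(a|an|the)\\b", "", x, ignore.case = TRUE, perl = TRUE)
  }
  # collapse the whitespace left behind by the removals
  trimws(gsub("[[:space:]]+", " ", x))
}

txt <- "This <span>is</span> a very short document!.!"
sketch_prepare(txt)                         # "this is a very short document"
sketch_prepare(txt, strip_articles = TRUE)  # "this is very short document"
```

Unlike prepare, this helper returns a new string instead of modifying its input, which keeps the sketch easy to test in isolation.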

See Also

stem_words to stem your document.

Examples

## Not run: 
init_textanalysis()

# build document
doc <- string_document("This <span>is</span> a very short document!.!")

# modifies doc in place, so no assignment is needed
prepare(doc)
# inspect the preprocessed text
get_text(doc)

## End(Not run)

news-r/textanalysis documentation built on Nov. 4, 2019, 9:40 p.m.