text_clean: Function to clean your tweets text


View source: R/textMining.R

Description

Pre-processes tweet text in a number of standard ways.

Usage

text_clean(docvec, rmDuplicates = FALSE, cores = 6, stems = NULL,
  partial = FALSE)

Arguments

docvec

a character vector of tweet texts, e.g. the text retrieved from the tweet_corpus function

rmDuplicates

whether to remove duplicated tweets

cores

number of cores to use for parallel computing

stems

customized stems to be removed in addition to the standard cleaning

partial

if TRUE, perform partial cleaning only (steps 1 to 11 in the Details)

Details

1. Convert to basic ASCII text to avoid unusual characters
2. Make everything consistently lower case
3. Remove the "RT" (retweet) marker so that retweets become exact duplicates of the original tweet
4. Remove links
5. Remove punctuation
6. Remove tabs
7. "&" appears as "&amp;" in HTML, so strip the leftover "amp" once punctuation has been removed
8. Remove leading blanks
9. Remove trailing blanks
10. Collapse general (repeated) spaces
11. Remove duplicated tweets (if rmDuplicates is TRUE)
12. Convert to a tm corpus
13. Remove English stop words
14. Remove numbers
15. Stem the words
16. Remove the customized stems
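
The sketch below illustrates the steps above in simplified form. It is not the package's code (the actual implementation is in R/textMining.R and runs in parallel over cores); the function name clean_sketch and the exact regular expressions are assumptions for illustration. It needs the tm package, and SnowballC for the stemming step.

library(tm)

clean_sketch <- function(docvec, rmDuplicates = FALSE, stems = NULL) {
  x <- iconv(docvec, to = "ASCII", sub = " ")   # 1. basic ASCII only
  x <- tolower(x)                               # 2. lower case
  x <- gsub("\\brt\\b", " ", x)                 # 3. drop the "RT" marker
  x <- gsub("http\\S+", " ", x)                 # 4. drop links
  x <- gsub("[[:punct:]]", " ", x)              # 5. drop punctuation
  x <- gsub("[ \t]{2,}", " ", x)                # 6. drop tabs / runs of blanks
  x <- gsub("\\bamp\\b", " ", x)                # 7. leftover "amp" from "&amp;"
  x <- gsub("^ +", "", x)                       # 8. leading blanks
  x <- gsub(" +$", "", x)                       # 9. trailing blanks
  x <- gsub(" +", " ", x)                       # 10. collapse general spaces
  if (rmDuplicates) x <- unique(x)              # 11. drop duplicated tweets
  corp <- VCorpus(VectorSource(x))              # 12. convert to a tm corpus
  corp <- tm_map(corp, removeWords, stopwords("english"))  # 13. stop words
  corp <- tm_map(corp, removeNumbers)           # 14. numbers
  corp <- tm_map(corp, stemDocument)            # 15. stem the words (SnowballC)
  if (!is.null(stems))
    corp <- tm_map(corp, removeWords, stems)    # 16. customized stems
  corp
}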

See Also

tweet_corpus

Examples

setupTwitterConn()
tweets <- tweet_corpus(search = "audusd", n = 100, since = as.character(Sys.Date()-7), until = as.character(Sys.Date()))
tweets <- text_clean(tweets$v, rmDuplicates = FALSE, cores = 6, stems = c("audusd"))
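
The partial flag restricts the run to the string-level steps (1 to 11). The call below is an assumed usage pattern based on the argument list, not an example shipped with the package.

# Assumed usage: string-level cleaning only (steps 1-11), dropping duplicates
tweets_partial <- text_clean(tweets$v, rmDuplicates = TRUE, cores = 2, partial = TRUE)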
