text_clean: Function to clean your tweets text


View source: R/textMining.R

Description

Pre-processes tweet text in a number of standard ways.

Usage

text_clean(docvec, rmDuplicates = FALSE, cores = 6, stems = NULL,
  partial = FALSE)

Arguments

docvec

a character vector of tweet texts, e.g. the text retrieved from the tweet_corpus function

rmDuplicates

whether to remove duplicated tweets

cores

number of cores to use for parallel computing

stems

customized stems to be removed in addition to the standard cleaning

partial

if TRUE, perform partial cleaning only (steps 1 to 11 in the Details)

Details

1. Convert to basic ASCII text to avoid unusual characters
2. Make everything consistently lower case
3. Remove the "RT" (retweet) marker so that retweets become exact duplicates of the original tweet
4. Remove links
5. Remove punctuation
6. Remove tabs
7. "&" appears as "&amp;" in HTML, so strip the leftover "amp" once punctuation has been removed
8. Remove leading blanks
9. Remove trailing blanks
10. Collapse general (repeated) spaces
11. Remove duplicated tweets (if rmDuplicates is TRUE)
12. Convert to a tm corpus
13. Remove English stop words
14. Remove numbers
15. Stem the words
16. Remove the customized stems
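
The sketch below illustrates the steps above in simplified form. It is not the package's code (the actual implementation is in R/textMining.R and runs in parallel over cores); the function name clean_sketch and the exact regular expressions are assumptions for illustration. It needs the tm package, and SnowballC for the stemming step.

library(tm)

clean_sketch <- function(docvec, rmDuplicates = FALSE, stems = NULL) {
  x <- iconv(docvec, to = "ASCII", sub = " ")   # 1. basic ASCII only
  x <- tolower(x)                               # 2. lower case
  x <- gsub("\\brt\\b", " ", x)                 # 3. drop the "RT" marker
  x <- gsub("http\\S+", " ", x)                 # 4. drop links
  x <- gsub("[[:punct:]]", " ", x)              # 5. drop punctuation
  x <- gsub("[ \t]{2,}", " ", x)                # 6. drop tabs / runs of blanks
  x <- gsub("\\bamp\\b", " ", x)                # 7. leftover "amp" from "&amp;"
  x <- gsub("^ +", "", x)                       # 8. leading blanks
  x <- gsub(" +$", "", x)                       # 9. trailing blanks
  x <- gsub(" +", " ", x)                       # 10. collapse general spaces
  if (rmDuplicates) x <- unique(x)              # 11. drop duplicated tweets
  corp <- VCorpus(VectorSource(x))              # 12. convert to a tm corpus
  corp <- tm_map(corp, removeWords, stopwords("english"))  # 13. stop words
  corp <- tm_map(corp, removeNumbers)           # 14. numbers
  corp <- tm_map(corp, stemDocument)            # 15. stem the words (SnowballC)
  if (!is.null(stems))
    corp <- tm_map(corp, removeWords, stems)    # 16. customized stems
  corp
}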

See Also

tweet_corpus

Examples

setupTwitterConn()
tweets <- tweet_corpus(search = "audusd", n = 100, since = as.character(Sys.Date()-7), until = as.character(Sys.Date()))
tweets <- text_clean(tweets$v, rmDuplicates = FALSE, cores = 6, stems = c("audusd"))
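
The partial flag restricts the run to the string-level steps (1 to 11). The call below is an assumed usage pattern based on the argument list, not an example shipped with the package.

# Assumed usage: string-level cleaning only (steps 1-11), dropping duplicates
tweets_partial <- text_clean(tweets$v, rmDuplicates = TRUE, cores = 2, partial = TRUE)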
