pool_tweets: Prepare Tweets for topic modeling by pooling

View source: R/pool_tweets.R

Description

This function pools a data frame of parsed tweets into document pools.

Usage

pool_tweets(
  data,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_emojis = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1L
)

Arguments

data

Data frame of parsed tweets. Obtained either with load_tweets, or with stream_in in conjunction with tweets_with_users.

remove_numbers

Logical. If TRUE, remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day. See tokens.

remove_punct

Logical. If TRUE, remove all characters in the Unicode "Punctuation" [P] class, with exceptions for those used as prefixes for valid social media tags if preserve_tags = TRUE. Note that preserve_tags is not an argument of pool_tweets; see tokens.

remove_symbols

Logical. If TRUE, remove all characters in the Unicode "Symbol" [S] class.

remove_url

Logical. If TRUE, find and eliminate URLs beginning with http(s).

remove_emojis

Logical. If TRUE, remove all emojis from tweets.

remove_users

Logical. If TRUE, remove all mentions of user names from documents.

remove_hashtags

Logical. If TRUE, remove hashtags (the hashtagged word itself, not only the # symbol) from documents.

cosine_threshold

Double. Value between 0 and 1 specifying the cosine similarity threshold used for document pooling. Tweets without a hashtag are assigned to document (hashtag) pools based on this metric. Lower thresholds reduce topic coherence by admitting a large number of tweets without a hashtag into the document pools. Higher thresholds lead to more coherent topics but smaller documents.

stopwords

A character vector, list of character vectors, dictionary or collocations object. See pattern for details. Defaults to "en" (see Usage), i.e. the English stopword list, stopwords("english").

n_grams

Integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines an n in the n-gram(s) that are produced. See tokens_ngrams.
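
To illustrate what an integer vector passed to n_grams means, here is a minimal base R sketch. It is not Twitmo's or quanteda's code: the helper make_ngrams and the sample tokens are hypothetical, and joining tokens with "_" is an assumption modelled on the default concatenator of tokens_ngrams.

```r
# Hypothetical helper: build all n-grams of size n from a token vector.
# Tokens are joined with "_" (assumed here to mirror tokens_ngrams).
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

tokens <- c("heavy", "rain", "tonight")

# Analogous to n_grams = 1:2: one pass per element of the vector,
# so the result holds unigrams followed by bigrams.
grams <- unlist(lapply(1:2, make_ngrams, tokens = tokens))
print(grams)
```

With n_grams = 1L (the default) only the single tokens would be kept; each additional element in the vector adds the n-grams of that size.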

Details

Pools tweets by hashtag, using cosine similarity, to create longer pseudo-documents for better LDA estimation, and creates n-gram tokens. The method implements the pooling algorithm from Mehrotra et al. (2013).
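
The idea behind hashtag pooling with a cosine threshold can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the toy tweets, the helper names (tokenize, cosine_sim, count_vec) and the use of raw term counts are all assumptions, and the sketch simply leaves unmatched tweets out of the pools.

```r
# Toy data (hypothetical): three hashtagged tweets, two without a hashtag.
tweets <- data.frame(
  text = c("storm rain flood", "rain flood warning", "coffee morning latte",
           "storm rain flood", "latte storm"),
  hashtag = c("weather", "weather", "coffee", NA, NA),
  stringsAsFactors = FALSE
)

tokenize <- function(x) strsplit(tolower(x), "\\s+")[[1]]
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Step 1: concatenate hashtagged tweets into one pseudo-document per hashtag.
tagged <- tweets[!is.na(tweets$hashtag), ]
pools <- sapply(split(tagged$text, tagged$hashtag), paste, collapse = " ")

# Shared vocabulary and a term-count vector for any text.
vocab <- unique(unlist(lapply(tweets$text, tokenize)))
count_vec <- function(x) as.numeric(table(factor(tokenize(x), levels = vocab)))
pool_vecs <- lapply(pools, count_vec)  # fixed vectors of the hashtag pools

# Step 2: attach each untagged tweet to its most similar pool,
# but only if the similarity reaches the threshold.
cosine_threshold <- 0.9
assigned <- character(0)
for (i in which(is.na(tweets$hashtag))) {
  sims <- sapply(pool_vecs, cosine_sim, b = count_vec(tweets$text[i]))
  best <- names(sims)[which.max(sims)]
  if (max(sims) >= cosine_threshold) {
    pools[best] <- paste(pools[best], tweets$text[i])
    assigned[as.character(i)] <- best
  }
}

print(pools)
print(assigned)
```

In this toy run the fourth tweet clears the 0.9 threshold against the "weather" pool and is merged into it, while the fifth tweet matches no pool well enough and stays unpooled. Lowering cosine_threshold would admit more untagged tweets at the cost of less homogeneous pools.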

Value

List with a corpus object and a dfm object of pooled tweets.

References

Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. pp. 889-892. doi:10.1145/2484028.2484166.

See Also

tokens, dfm

Examples

## Not run: 

library(Twitmo)

# load tweets (included in package)
mytweets <- load_tweets(system.file("extdata", "tweets_20191027-141233.json", package = "Twitmo"))

pool <- pool_tweets(data = mytweets,
                    remove_numbers = TRUE,
                    remove_punct = TRUE,
                    remove_symbols = TRUE,
                    remove_url = TRUE,
                    remove_users = TRUE,
                    remove_hashtags = TRUE,
                    remove_emojis = TRUE,
                    cosine_threshold = 0.9,
                    stopwords = "en",
                    n_grams = 1)

## End(Not run)

Twitmo documentation built on Dec. 11, 2021, 10:01 a.m.