pool_tweets: Prepare Tweets for topic modeling by pooling
In abuchmueller/Twitmo: Twitter Topic Modeling and Visualization for R

pool_tweets

R Documentation

Prepare Tweets for topic modeling by pooling

Description

This function pools a data frame of parsed tweets into document pools.

Usage

pool_tweets(
  data,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_emojis = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1L
)

Arguments

`data`	Data frame containing tweets and hashtags. Works with any data frame, as long as it has a "text" column of type character and a "hashtags" column with comma-separated character vectors. Can be obtained either by using `load_tweets` on a JSON object returned by Twitter's API v1.1 or by using `stream_in` on any JSON file, as long as it has a "text" and a "hashtags" field. If you are unsure about the requirements, you can load the sample data contained in the package by following the example in the Examples section of this help page.
`remove_numbers`	Logical. If `TRUE` remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day. See tokens.
`remove_punct`	Logical. If `TRUE` remove all characters in the Unicode "Punctuation" [P] class, with exceptions for those used as prefixes for valid social media tags if `preserve_tags = TRUE`. See tokens.
`remove_symbols`	Logical. If `TRUE` remove all characters in the Unicode "Symbol" [S] class.
`remove_url`	Logical. If `TRUE` find and eliminate URLs beginning with http(s).
`remove_emojis`	Logical. If `TRUE` all emojis will be removed from tweets.
`remove_users`	Logical. If `TRUE` will remove all mentions of user names from documents.
`remove_hashtags`	Logical. If `TRUE` will remove hashtags (not only the symbol but the hashtagged word itself) from documents.
`cosine_threshold`	Double. Value between 0 and 1 specifying the cosine similarity threshold to be used for document pooling. Tweets without a hashtag are assigned to document (hashtag) pools based on this metric. Lower thresholds reduce topic coherence because they admit a large number of tweets without a hashtag into the document pools. Higher thresholds lead to more coherent topics but reduce document sizes.
`stopwords`	A character vector, list of character vectors, dictionary or collocations object. See pattern for details. Defaults to "en" (English stopwords).
`n_grams`	Integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines an n for the n-gram(s) that are produced. See tokens_ngrams.
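
To illustrate what an integer vector passed to `n_grams` means, here is a minimal base R sketch. It is not the package's code: `make_ngrams` is a hypothetical helper written for this illustration, and the `"_"` concatenator is an assumption chosen to mirror the default of `tokens_ngrams`.

```r
# Hypothetical helper (not part of Twitmo): build the n-grams of size `n`
# from a character vector of tokens by joining adjacent tokens.
make_ngrams <- function(tokens, n, concatenator = "_") {
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = concatenator),
    character(1)
  )
}

# Toy tokens from one cleaned tweet (made up for this example).
tokens <- c("new", "espresso", "machine")

# Each element of the size vector contributes its own set of n-grams,
# so 1:2 yields the unigrams followed by the bigrams.
ngrams <- unlist(lapply(1:2, make_ngrams, tokens = tokens))
print(ngrams)
```

With sizes `1:2`, the result contains the three single tokens plus the two adjacent pairs, which is the sense in which each element of `n_grams` defines one n.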

Details

Pools tweets by hashtag into longer pseudo-documents for better LDA estimation, and creates n-gram tokens. Tweets without a hashtag are assigned to a hashtag pool based on cosine similarity. The method implements the pooling algorithm from Mehrotra et al. (2013).
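
The role of `cosine_threshold` can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the vocabulary, pool counts and the helpers `cosine_similarity` and `assign_pool` are hypothetical, and raw term counts are used for simplicity, whereas the package's actual term weighting may differ.

```r
# Cosine similarity between two numeric term vectors.
cosine_similarity <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) {
    return(0)
  }
  sum(a * b) / denom
}

# Hypothetical term counts for two hashtag pools over a toy vocabulary.
vocab <- c("coffee", "espresso", "traffic", "morning")
pools <- rbind(
  coffee  = c(4, 2, 0, 1),
  commute = c(0, 0, 5, 2)
)
colnames(pools) <- vocab

# A tweet without a hashtag, expressed over the same vocabulary.
tweet <- c(coffee = 2, espresso = 1, traffic = 0, morning = 0)

# Similarity of the tweet to every pool (one value per row of `pools`).
sims <- apply(pools, 1, cosine_similarity, b = tweet)

# Hypothetical helper: join the most similar pool only if the similarity
# reaches the threshold; otherwise the tweet stays unpooled (NA).
assign_pool <- function(sims, threshold) {
  best <- which.max(sims)
  if (sims[[best]] >= threshold) names(sims)[best] else NA_character_
}

print(round(sims, 3))
print(assign_pool(sims, threshold = 0.9))
print(assign_pool(sims, threshold = 0.99))
```

In this toy case the tweet joins the `coffee` pool at a threshold of 0.9 but is left out at 0.99, which reflects the trade-off described for `cosine_threshold`: stricter thresholds keep pools cleaner but smaller.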

Value

List with corpus object and dfm object of pooled tweets.

References

Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. pp. 889-892. doi:10.1145/2484028.2484166.

Examples

## Not run: 

library(Twitmo)

# load tweets (included in package)
mytweets <- load_tweets(system.file("extdata", "tweets_20191027-141233.json", package = "Twitmo"))

pool <- pool_tweets(
  data = mytweets,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  remove_emojis = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1L
)

## End(Not run)

abuchmueller/Twitmo documentation built on Sept. 14, 2022, 8:06 p.m.