pool_tweets: Prepare Tweets for topic modeling by pooling

pool_tweets R Documentation

Prepare Tweets for topic modeling by pooling

Description

This function pools a data frame of parsed tweets into document pools.

Usage

pool_tweets(
  data,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_emojis = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1L
)

Arguments

data

Data frame containing tweets and hashtags. Works with any data frame that has a "text" column of type character and a "hashtags" column with comma-separated character vectors. Can be obtained either by using load_tweets on a JSON object returned by Twitter's API v1.1 or by using stream_in on any JSON file that has a "text" and a "hashtags" field. If you are unsure about the requirements, you may load the sample data contained in the package by following the example in the Examples section of this help page.

remove_numbers

Logical. If TRUE remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day. See tokens.

remove_punct

Logical. If TRUE remove all characters in the Unicode "Punctuation" [P] class, with exceptions for those used as prefixes for valid social media tags if preserve_tags = TRUE. See tokens.

remove_symbols

Logical. If TRUE remove all characters in the Unicode "Symbol" [S] class.

remove_url

Logical. If TRUE find and eliminate URLs beginning with http(s).

remove_emojis

Logical. If TRUE all emojis will be removed from tweets.

remove_users

Logical. If TRUE will remove all mentions of user names from documents.

remove_hashtags

Logical. If TRUE will remove hashtags (not only the symbol but the hashtagged word itself) from documents.

cosine_threshold

Double. Value between 0 and 1 specifying the cosine similarity threshold to be used for document pooling. Tweets without a hashtag are assigned to document (hashtag) pools based on this metric. Lower thresholds reduce topic coherence because more tweets without a hashtag are included in the document pools. Higher thresholds lead to more coherent topics but smaller documents.

stopwords

A character vector, list of character vectors, dictionary or collocations object. See pattern for details. Defaults to stopwords("english").

n_grams

Integer vector specifying the number of elements to be concatenated in each n-gram. Each element of this vector defines an n in the n-gram(s) that are produced. See tokens_ngrams.
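
As a minimal sketch of the input shape that the data argument describes, the following builds a small data frame by hand in base R. The tweet texts are invented for illustration, and storing the hashtags as one comma-separated string per tweet (with NA for tweets without a hashtag) is an assumption about how the "hashtags" column is read; the output of load_tweets may represent it differently.

```r
# Hypothetical toy input: one row per tweet
toy_tweets <- data.frame(
  text = c(
    "Great turnout at the climate march today #climate #protest",
    "New phone arrived, battery life is amazing #tech",
    "Rain all weekend again"
  ),
  # Assumption: one comma-separated string of hashtags per tweet,
  # NA for tweets that carry no hashtag
  hashtags = c("climate,protest", "tech", NA),
  stringsAsFactors = FALSE
)

# Check the two requirements stated for `data`
stopifnot(is.character(toy_tweets$text), "hashtags" %in% names(toy_tweets))

# Split the comma-separated hashtags into one character vector per tweet
hashtag_list <- strsplit(toy_tweets$hashtags, ",", fixed = TRUE)
```

A data frame of this shape could then be passed as data; whether pool_tweets accepts this exact hashtag encoding should be checked against the sample data shipped with the package.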

Details

Pools tweets by hashtag, using cosine similarity to assign tweets without a hashtag, in order to create longer pseudo-documents for better LDA estimation. Also creates n-gram tokens. The method is an implementation of the pooling algorithm from Mehrotra et al. (2013).
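
The pooling idea described above can be sketched in base R. This is an illustration under simplifying assumptions, not Twitmo's implementation: term vectors are raw word counts, the helper names (tokenize, term_counts, cosine_sim) and the toy texts are hypothetical, and a tweet without a hashtag joins its most similar pool only if the similarity reaches the threshold.

```r
# Hypothetical hashtag pools: tweets sharing a hashtag, concatenated
# into one pseudo-document each
pools <- c(
  climate = "climate march protest city climate action",
  tech    = "new phone battery life phone review"
)
# Tweets without a hashtag that we try to assign to a pool
unlabeled <- c("climate action march", "rain all weekend")

# Lowercase a single string and split it on whitespace
tokenize <- function(x) strsplit(tolower(x), "\\s+")[[1]]

# Shared vocabulary across pools and unlabeled tweets
vocab <- sort(unique(unlist(lapply(c(pools, unlabeled), tokenize))))

# Raw term-count vector over the shared vocabulary
term_counts <- function(x) {
  as.numeric(table(factor(tokenize(x), levels = vocab)))
}

# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Illustrative value, lower than the documented default of 0.9
cosine_threshold <- 0.5

# For each unlabeled tweet: find the most similar pool and keep the
# assignment only if the similarity reaches the threshold
assignment <- vapply(unlabeled, function(tw) {
  sims <- vapply(
    pools,
    function(p) cosine_sim(term_counts(tw), term_counts(p)),
    numeric(1)
  )
  best <- which.max(sims)
  if (sims[best] >= cosine_threshold) names(pools)[best] else NA_character_
}, character(1))
```

In this toy data the first tweet has a similarity of about 0.82 to the "climate" pool and is assigned to it, while the second shares no terms with either pool and stays unassigned. With a threshold of 0.9 the first tweet would stay unassigned as well, which reflects the trade-off described for cosine_threshold.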

Value

List with a corpus object and a dfm object of the pooled tweets.

References

Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. pp. 889-892. doi:10.1145/2484028.2484166.

See Also

tokens, dfm

Examples

## Not run: 

library(Twitmo)

# load tweets (included in package)
mytweets <- load_tweets(system.file("extdata", "tweets_20191027-141233.json", package = "Twitmo"))

pool <- pool_tweets(
  data = mytweets,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  remove_users = TRUE,
  remove_hashtags = TRUE,
  remove_emojis = TRUE,
  cosine_threshold = 0.9,
  stopwords = "en",
  n_grams = 1
)

## End(Not run)



abuchmueller/Twitmo documentation built on Sept. 14, 2022, 8:06 p.m.