rtweetclean is a R package built to act as a processor of data generated by the existing rtweet package that can produce clean data frames, summarize data, and generate new features.
Our package aims to add additional resources for users of the already existing rtweet package. rtweet is a package built around Twitter’s API and is used to scrape tweet information from their servers. Our package creates functionality which enables users to process the raw data from rtweet into a more understandable format by extracting and organizing the contents of tweets for a user. rtweet is specifically built to be used in analysis of a specific user’s timeline (generated using tweepy’s api.user_timeline function). Users can easily visualize average engagement based on time of day posted, see basic summary statistics of word contents and sentiment analysis of tweets and have a processed dataset that can be used in a wide variety of machine learning models.
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("UBC-MDS/rtweetclean")
clean_df(raw_tweets_df, handle = "", text_only = TRUE, word_count = TRUE, emojis = TRUE, hashtags = TRUE, sentiment = TRUE, flesch_readability = TRUE, proportion_of_avg_retweets = TRUE, proportion_of_avg_favorites = TRUE)
:
Cleans up and creates new feature columns to the dataframe obtained
by using the tweepy package, including: a link/hashtag/emoji free
version of the text column, a wordcount, a sentimentality score, a
flesch readability score, a user entered handle, a column containing
all emojis used, and a column containing all hashtags used.
tweet_words(clean_tweets_df)
: Generates a list of the most
frequently used words by the twitter user. Uses the dataframe
created by the clean_df()
function as an input.
sentiment_total(clean_tweets_df)
: Returns a summary count of the
number of words that are associated with a list of
emotions/sentiments. Uses the text_only
column of the dataframe
created by the clean_df()
function as an input.
engagement_by_hour(clean_tweets_df)
: Plots a line chart showing
popular times of the day to tweet for the user, by displaying
average engagement (favorites + retweets) received by hour of day.
Uses the dataframe created by the clean_df()
function as an input.
| R Package | |:----------------------:| | magrittr >= 2.0.1 | | dplyr >= 1.0.4 | | lubridate >= 1.7.10 | | ggplot2 >= 3.3.3 | | tidyr >= 1.1.3 | | tidytext >= 0.3.0 | | rtweet >= 0.7.0 | | stringr >= 1.4.0 | | knitr >= 1.31 | | testthat >= 3.0.2 | | rmarkdown >= 2.7 |
Vignette for this package can be found here
Below function calls are based on below example dataframe:
library(rtweetclean)
created_at = c("2021-03-06 16:03:31",
"2021-03-05 21:57:47",
'2021-03-05 05:50:50',
'2021-03-05 7:32:33')
text <- c("example tweet text 1 @user2 @user",
"#example #tweet 2 ",
"example tweet 3 https://t.co/G4ziCaPond",
"example tweet 4")
retweet_count <- c(43, 12, 24, 29)
favorite_count <- c(85, 41, 65, 54)
timeline_rtweet_toy <- data.frame(text, retweet_count, favorite_count, created_at)
timeline_rtweet_toy
## text retweet_count favorite_count
## 1 example tweet text 1 @user2 @user 43 85
## 2 #example #tweet 2 12 41
## 3 example tweet 3 https://t.co/G4ziCaPond 24 65
## 4 example tweet 4 29 54
## created_at
## 1 2021-03-06 16:03:31
## 2 2021-03-05 21:57:47
## 3 2021-03-05 05:50:50
## 4 2021-03-05 7:32:33
Function calls in action:
cleaned_timeline <- clean_df(timeline_rtweet_toy)
cleaned_timeline
## text retweet_count favorite_count
## 1 example tweet text 1 @user2 @user 43 85
## 2 #example #tweet 2 12 41
## 3 example tweet 3 https://t.co/G4ziCaPond 24 65
## 4 example tweet 4 29 54
## created_at text_only word_count emojis prptn_rts_vs_avg
## 1 2021-03-06 16:03:31 example tweet text 1 4 1.5925926
## 2 2021-03-05 21:57:47 2 1 0.4444444
## 3 2021-03-05 05:50:50 example tweet 3 3 0.8888889
## 4 2021-03-05 7:32:33 example tweet 4 3 1.0740741
## prptn_favorites_vs_avg
## 1 1.3877551
## 2 0.6693878
## 3 1.0612245
## 4 0.8816327
tweet_words(cleaned_timeline, top_n=3)
## words count
## 1 tweet 3
## 2 example 3
## 3 text 1
sentiment_total(cleaned_timeline, drop_sentiment = FALSE)
## # A tibble: 10 x 3
## sentiment word_count total_words
## <chr> <dbl> <int>
## 1 anger 0 11
## 2 anticipation 0 11
## 3 disgust 0 11
## 4 fear 0 11
## 5 joy 0 11
## 6 negative 0 11
## 7 positive 0 11
## 8 sadness 0 11
## 9 surprise 0 11
## 10 trust 0 11
engagement_by_hour(cleaned_timeline)
rtweetclean provides additional functionality to the existing rtweet package by generating commonly desired data features relevant with twitter API data. This is combined with streamlined summary statistics methods that can quickly and effortlessly produce figures and tables of various different factors in your rtweet data. This allows users to easily understand and analyze information about a twitter user’s timeline. Specifically, examining an accounts engagement, most common words, and emotional sentiment can each be done with the various functions in the package.
Documentation for this package can be found here
We welcome and recognize all contributions. Please see contributing guidelines in the Contributing document. This repository is currently maintained by nashmakh, calsvein, MattTPin, syadk.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.