```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```r
library(dplyr)
library(ggplot2)
library(knitr)
```
It's Thursday, September 26, 2019. Over the last several days, President Trump's attempt to pressure the Ukrainian president to investigate his political rival, former Vice President Joe Biden, has come to light, and Congress has begun formal impeachment proceedings.
The investigation was prompted by Wall Street Journal reports of a whistleblower complaint regarding a July 25th call between President Trump and Ukrainian President Volodymyr Zelensky. Yesterday, the White House released a rough transcript of this call, and this morning, the House Intelligence Committee released the unclassified portion of the whistleblower complaint.
As you might expect, Democratic and Republican party members have very different reactions to these developments. Recent polls suggest that Americans generally do not support impeachment proceedings.
I've been looking for a chance to put my data skills to work, and this seems like the perfect opportunity. For this project, I will examine a data set of tweets collected from Twitter.
My goal is to explore and understand perceptions of the whistleblower and answer some basic questions:
To answer these questions, I used Twitter to collect source data for the analysis. Why Twitter? Four reasons:
Each morning in a five-day window, I used the API to collect a small sample, targeting approximately 10,000 status updates without retweets. I then compiled the daily data sets into a single file, performed a series of data cleaning steps, and followed up with an exploratory analysis and a sentiment analysis.
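To make the collection step concrete, here is a minimal sketch using the rtweet package; the search query and file naming are placeholder assumptions, not the exact parameters I used.

```r
library(rtweet)
library(dplyr)
library(readr)

# One morning's pull: roughly 10,000 recent statuses, excluding retweets.
# The query term is a placeholder for illustration.
daily_sample <- search_tweets(
  q = "whistleblower",
  n = 10000,
  include_rts = FALSE
)

# Save each day's sample to its own file.
write_rds(daily_sample, paste0("tweets_", Sys.Date(), ".rds"))

# After the five-day window, combine the daily files into a single data set.
all_tweets <- list.files(pattern = "^tweets_.*\\.rds$") %>%
  lapply(read_rds) %>%
  bind_rows()
```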
I used the Twitter Search API to gather source data for this analysis. The API returns a data set with 90 variables. For this project, I focused on the following subset.
Users can choose to assign a place to their status updates, but are not required to do so. Places are specifically named locations with corresponding geographical coordinates. Status updates with places are not necessarily issued from that location, but they can be about the location.
Date       | Time            | Observations (Status Updates)
-----------|-----------------|------------------------------
2019-09-26 | 9:35am, Eastern | 9,946
2019-09-27 | 7:02am, Eastern | 9,957
2019-09-28 | 7:11am, Eastern | 9,911
2019-09-29 | 6:48am, Eastern | 9,863
2019-09-30 |                 |
My initial review of the data surfaced several data issues:
I documented my strategy to address these issues below.
While URLs might provide some useful information about the external content users are sharing, I decided to remove them from the status update text and focus on what I can learn from the status update text itself.
Hashtags are already parsed and stored in the data set returned by the API, so I decided to remove them from status update text.
Emoticons provide useful clues to a user's sentiment, so I decided to build a function to identify them in the status update text and keep them in a separate column in the cleaned data set.
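The sketch below shows one way these three cleaning steps could be implemented with stringr; the regular expressions and the emoticon set are illustrative assumptions, not the original cleaning functions.

```r
library(dplyr)
library(stringr)

# Illustrative patterns; the original cleaning functions are not shown.
url_pattern      <- "https?://\\S+"
hashtag_pattern  <- "#\\S+"
emoticon_pattern <- "[;:=8][-o\\*']?[\\)\\]\\(\\[dDpP/\\\\]"

clean_tweets <- function(tweets) {
  tweets %>%
    mutate(
      # Keep emoticons in their own column before stripping anything.
      emoticons = str_extract_all(text, emoticon_pattern),
      # Remove URLs and hashtags from the status update text.
      text_clean = text %>%
        str_remove_all(url_pattern) %>%
        str_remove_all(hashtag_pattern) %>%
        str_squish()
    )
}
```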
Perceptions are associated with users, and those perceptions will be expressed in status updates. While it's possible for a user to change their perception, they probably will not do so over a period of minutes or hours, so I think it is safe to avoid the complexity of combining or accounting for multiple tweets. That said, I will explore how many users tweet more than once over the five-day period and adjust this strategy if the data suggests this assumption is not correct.
I will perform analysis to compare the two sets of locations to assess differences and potentially perform exploratory analysis with both sets of locations.
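As a first pass at that comparison, something like the following could measure how much coverage each location source provides; the column names `location` (profile location) and `place_full_name` (assigned place) follow rtweet's conventions and are assumptions about this data set.

```r
library(dplyr)

# Share of status updates with each location source populated.
location_coverage <- all_tweets %>%
  summarise(
    with_profile_location = mean(!is.na(location) & location != ""),
    with_assigned_place   = mean(!is.na(place_full_name)),
    with_both             = mean(!is.na(location) & location != "" &
                                   !is.na(place_full_name))
  )
```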
I will build functions to clean and normalize the data to make it as consistent as possible.
If democratic party politics within the United States is a significant driver of perception, it would make sense to account for international users. I will keep these users in the data set for additional analysis, but identify them so they can be filtered when assessing party politics as a driver.
When performing sentiment analysis, I will use standard dictionaries to identify words. This may reduce the accuracy of the analysis, but...
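A minimal sketch of that approach with tidytext and the Bing lexicon follows; the lexicon choice, along with the `clean_tweets()` function and `all_tweets` object from the sketches above, are assumptions rather than the original analysis.

```r
library(dplyr)
library(tidytext)

# Tokenize the cleaned text and score words against a standard lexicon.
# The Bing lexicon is one of several options (e.g. AFINN, NRC).
tweet_sentiment <- clean_tweets(all_tweets) %>%
  select(status_id, text_clean) %>%
  unnest_tokens(word, text_clean) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(status_id, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment,
                     values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)
```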
The average number of tweets per user over the time frame is 2.67. What is the range of tweets per user? A frequency table and histogram will help answer that.
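A quick sketch of that check, assuming the compiled data set and rtweet's `user_id` column (not the original code):

```r
library(dplyr)
library(ggplot2)

# Tweets per user over the five-day window.
tweets_per_user <- all_tweets %>%
  count(user_id, name = "tweet_count")

mean(tweets_per_user$tweet_count)
range(tweets_per_user$tweet_count)

ggplot(tweets_per_user, aes(tweet_count)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Tweets Per User", x = "Tweets", y = "Users")
```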
When looking at the number of status updates per user, a few stood out for the large number of updates.
What criteria should be used to identify a suspected bot account? The key question is whether the account's activity resembles that of a person.
I also want to check whether Twitter has since suspended these accounts.
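One way to check, assuming the rtweet package and a vector of suspect user IDs (the IDs below are placeholders), is to look the accounts up again; suspended or deleted accounts will not be returned.

```r
library(rtweet)

# Hypothetical vector of user IDs flagged as possible bots.
suspect_ids <- c("1234567890", "9876543210")

# Accounts Twitter has suspended or removed will be absent from the result.
# The `user_id` column follows classic rtweet naming.
still_active <- lookup_users(suspect_ids)
setdiff(suspect_ids, still_active$user_id)
```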
Load users data
```r
data(users)
```
Examine users by age of accounts
```r
# Summarise accounts by age in years, with cumulative counts and frequencies.
summary_by_age <- users %>%
  group_by(account_age_in_years) %>%
  summarise(count = n()) %>%
  arrange(account_age_in_years) %>%
  mutate(
    cumulative_count = cumsum(count),
    frequency = round(count / sum(count), 2),
    cumulative_frequency = round(cumulative_count / sum(count), 2)
  )

kable(
  summary_by_age,
  caption = "User Accounts Age By Year",
  col.names = c("Age In Years", "Accounts", "Cumulative Accounts",
                "Frequency", "Cumulative Frequency"),
  align = "lrrrr",
  format.args = list(big.mark = ",")
)

ggplot(summary_by_age, aes(account_age_in_years, count)) +
  geom_col() +
  scale_x_continuous(breaks = c(0:13)) +
  labs(title = "User Accounts Age By Year", x = "Age in Years", y = "Count")
```
Mean account age is `r mean(summary_by_age$account_age_in_years)` years.
Twitter was founded on March 21, 2006, and some of the platform's early adopters are captured in the data: 193 users have accounts dating back to 2006 or 2007.
The frequency distribution shows that each of the other years accounts for between 5 and 10 percent of overall users, with the exception of 2010. Users whose accounts date back to 2010 make up 18 percent of the accounts captured. This jump is prominent in the plot of accounts by age.
I did some research to verify this pattern and came across a Business Insider article that documents a 44 percent increase in the Twitter account population in the first half of 2010. So, for now, I'll assume this is a legitimate pattern in the data.
Article link:
https://www.businessinsider.com/chart-of-the-day-new-twitter-accounts-2010-12
Examine users by tweet volume.
```r
# Summarise accounts by the number of tweets each contributed to the sample.
summary_by_tweet_count <- users %>%
  group_by(tweet_count) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cumulative_count = cumsum(count),
    frequency = round(count / sum(count), 4),
    cumulative_frequency = round(cumulative_count / sum(count), 4)
  )

kable(
  summary_by_tweet_count,
  caption = "User Accounts By Tweet Count",
  col.names = c("Tweet Count", "Accounts", "Cumulative Accounts",
                "Frequency", "Cumulative Frequency"),
  align = "lrrrr",
  format.args = list(big.mark = ",")
)
```
91 percent of the users captured in this sample tweeted a status update fewer than five times and nearly 67 percent tweeted only once.
Some accounts stand out based on the large volume of status updates. There are 11 accounts with 100 or more tweets in the sample, and one account with over 1,000 tweets. I suspect these might be Twitter bot accounts and will need to verify them.
What is a reasonable number of tweets per day per user? Is there any published data on this?
One article suggests 4.4 tweets per day. This sample has 42 days of tweets, so using the 4.4 metric would place the threshold at 193 tweets. All but the last three groups in the sample fall below that threshold (one user has a total of 194 tweets, but we'll treat that person as being below the threshold).
https://blog.hubspot.com/blog/tabid/6307/bid/4594/ls-22-Tweets-Per-Day-the-Optimum
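As a rough sketch of how that heuristic could be applied to flag accounts for review (the date-based calculation of the sample span and the column names are my assumptions):

```r
library(dplyr)

# Span of the sample in days, derived from the status timestamps
# (the `created_at` column follows rtweet's naming).
n_days <- as.integer(max(as.Date(all_tweets$created_at)) -
                       min(as.Date(all_tweets$created_at))) + 1

# Benchmark of 4.4 tweets per day per user, from the article cited above.
threshold <- 4.4 * n_days

# Flag users whose in-sample volume exceeds the benchmark for a closer look.
suspected_bots <- users %>%
  filter(tweet_count > threshold)
```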
Examine users by location. Start by showing the count of users with a missing value in the state name variable versus those that have a valid value.
```r
# Count users with and without a valid state name.
summary_by_state_value <- users %>%
  mutate(has_state_name = !is.na(state_name)) %>%
  group_by(has_state_name) %>%
  summarise(count = n()) %>%
  mutate(percent = round(count / sum(count), 2))

kable(
  summary_by_state_value,
  caption = "Summary By State Value",
  col.names = c("Has State Name", "Count", "Percent"),
  align = "lrr",
  format.args = list(big.mark = ",")
)
```
So we have good state names for 42 percent of the users in the dataset. This group was captured by searching for valid state names or codes in the location variable.
There are opportunities to capture additional users by enhancing the search to include cities within the United States and variations in spelling, e.g. "Penn" and "Penna" for Pennsylvania. There is also an opportunity to capture users who have indicated that they reside in the United States but did not include a specific state in the location variable.
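A sketch of what that enhanced matching could look like, with a deliberately small lookup of patterns (the variations and cities shown are illustrative assumptions, not the matching rules actually used):

```r
library(dplyr)
library(stringr)

# Illustrative variations that map to a state; a fuller version would
# cover all states, common abbreviations, and major cities.
state_patterns <- c(
  Pennsylvania = "\\b(pennsylvania|penna?|pa)\\b",
  California   = "\\b(california|calif|ca|los angeles|san francisco)\\b"
)

match_state <- function(location) {
  loc <- str_to_lower(location)
  matched <- names(state_patterns)[
    vapply(state_patterns, function(p) str_detect(loc, p), logical(1))
  ]
  if (length(matched) == 0) NA_character_ else matched[1]
}

# Apply the enhanced matching to the free-text location field.
users <- users %>%
  mutate(state_name_enhanced = vapply(location, match_state, character(1)))
```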
Plot the number of users by state.
```r
# Summarise users by state, for those with a valid state name.
summary_by_state <- users %>%
  filter(!is.na(state_name)) %>%
  group_by(state_name) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(
    cumulative_count = cumsum(count),
    frequency = round(count / sum(count), 4),
    cumulative_frequency = round(cumulative_count / sum(count), 4)
  )

kable(
  summary_by_state,
  caption = "User Accounts By State",
  col.names = c("State", "Users", "Cumulative Users",
                "Frequency", "Cumulative Frequency"),
  align = "lrrrr",
  format.args = list(big.mark = ",")
)

ggplot(summary_by_state, aes(reorder(state_name, count), count)) +
  geom_col() +
  coord_flip() +
  labs(title = "User Accounts By State", x = "State", y = "Count")
```
Generally speaking, the order of states appears to be consistent with the population of states measured by 2019 census estimates. California, Texas, Florida, New York, Pennsylvania, Illinois, Ohio and Georgia have the highest estimated populations as of July 1, 2019. With the exception of Ohio and Georgia, these states top the list of states in the dataset. There is a similar consistency with states that have the lowest populations as well.
Census estimate source data:
https://simple.m.wikipedia.org/wiki/list/List_of_U.S._states_by_population
Load tweets data.
```r
data(tweets)
```
```r
summary_by_date <- tweets %>%
  group_by(created_at_date) %>%
  summarise(count = n())

ggplot(summary_by_date, aes(created_at_date, count)) +
  geom_col() +
  scale_x_date(date_minor_breaks = "1 day") +
  labs(title = "Tweets By Date", x = "Date", y = "Count")
```
```r
summary_by_weekday <- tweets %>%
  group_by(created_at_weekday) %>%
  summarise(count = n())

ggplot(summary_by_weekday, aes(created_at_weekday, count)) +
  geom_col() +
  labs(title = "Tweets By Weekday", x = "Weekday", y = "Count")
```
```r
summary_by_hour <- tweets %>%
  group_by(created_at_hour) %>%
  summarise(count = n())

ggplot(summary_by_hour, aes(created_at_hour, count)) +
  geom_col() +
  labs(title = "Tweets By Hour", x = "Hour", y = "Count")
```
```r
summary_by_day_and_hour <- tweets %>%
  group_by(created_at_date, created_at_hour) %>%
  summarise(count = n())

# One line per day so hourly patterns can be compared across dates.
ggplot(summary_by_day_and_hour,
       aes(created_at_hour, count, colour = factor(created_at_date))) +
  geom_line() +
  labs(title = "Tweets By Date And Hour",
       x = "Hour", y = "Count", colour = "Date")
```