knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(dplyr)
library(ggplot2)
library(knitr)

Executive Summary

Context

It's Thursday, September 26, 2019. Over the last several days, President Trump's attempt to pressure the Ukrainian president to investigate his political rival, former Vice President Joe Biden, has come to light, and Congress has begun formal impeachment proceedings.

The investigation was prompted by Wall Street Journal reports of a whistleblower complaint regarding a July 25th call between President Trump and Ukrainian President Volodymyr Zelensky. Yesterday, the White House released a rough transcript of this call, and this morning, the House Intelligence Committee released the unclassified portion of the whistleblower complaint.

As you might expect, Democratic and Republican party members have very different reactions to these developments. Recent polls suggest that Americans generally do not support impeachment proceedings.

I've been looking for a chance to put my data skills to work, and this seems like a perfect opportunity. For this project, I will examine a data set of tweets to see what I can learn about perceptions of the whistleblower.

Objective

My goal is to explore and understand perceptions of the whistleblower and answer some basic questions:

  1. What is the perception of the whistleblower?
  2. Is this perception driven primarily by democratic party politics? What other forces might be at play?
  3. How do developments in the impeachment proceedings and media reporting influence this perception over time?

Approach

To answer these questions, I used Twitter to collect source data for the analysis. Why Twitter? Four reasons:

  1. Twitter data is relatively easy to collect and share,
  2. Twitter data is accessible in real-time and can be collected as events unfold,
  3. The Twitter Search API can be used to generate unique data sets; in this case, I focused on tweets containing the #whistleblower hashtag, and
  4. From a personal development perspective, I wanted to gain experience and insight related to performing data analysis with Twitter data.

Each morning in a five-day window, I used the API to collect a small sample, targeting approximately 10,000 status updates without re-tweets. I then compiled the data sets into a single file and performed a series of data cleaning steps, followed by exploratory and sentiment analyses.
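As a rough sketch of one morning's pull (assuming the rtweet package as the client, which is not specified above), the collection call would look something like this:

library(rtweet)

# Sketch only: one morning's collection of roughly 10,000 non-retweet
# status updates containing the hashtag, saved to a dated CSV file.
raw <- search_tweets("#whistleblower",
                     n = 10000,
                     include_rts = FALSE,
                     retryonratelimit = TRUE)

write_as_csv(raw, paste0("whistleblower_", Sys.Date(), ".csv"))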

Findings

Analysis

Source Data

I used the Twitter Search API to gather source data for this analysis. The API returns a data set with 90 variables. For this project, I focused on the following subset.

Users can choose to assign a place to their status updates, but are not required to do so. Places are specific, named locations with corresponding geographical coordinates. Status updates with places are not necessarily issued from that location, but they can be about the location.

Source Files Generated

Date       | Time            | Observations (Status Updates)
-----------|-----------------|------------------------------
2019-09-26 | 9:35am, Eastern | 9,946
2019-09-27 | 7:02am, Eastern | 9,957
2019-09-28 | 7:11am, Eastern | 9,911
2019-09-29 | 6:48am, Eastern | 9,863
2019-09-30 |                 |

Initial Data Review

My initial review of the data surfaced several data issues:

  1. Status updates have URLs, hashtags and emoticons embedded in the text. In some instances, a status update consists entirely of these elements and contains no other text.
  2. It's obvious that users are posting multiple tweets, sometimes in fast succession. I suspect this is an effect of the character limit on status updates. In any case, I need to account for this user behavior.
  3. There are essentially two locations in the data set, the location associated with the profile and the location associated with the status update. I'm interested in exploring how geography comes into play and will need to decide how to use the two locations.
  4. There is variability in how locations are captured, e.g. one tweet might capture a specific location as "West Pittston, PA" and another might capture it as "West Pittston." I need to find a way to normalize this data.
  5. There are users in the data set that, based on their profile location, reside outside the United States. I need to consider whether I need to account for or remove these users from the data set.
  6. Text quality varies across status updates as a result of spelling errors, contractions, slang and sarcasm, the last of which is likely to be present in at least some of the updates.

I documented my strategy to address these issues below.

URLs, Hashtags and Emoticons

While URLs might provide some useful information about the external content users are sharing, I decided to remove them from the status update text and focus on what I can learn from the status update text itself.

Hashtags are already parsed and stored in the data set returned by the API, so I decided to remove them from status update text.

Emoticons provide useful clues to a user's sentiment, so I decided to build a function to identify them in the status update text and keep them in a separate column in the cleaned data set.
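The cleaning function itself isn't shown here, but a minimal sketch of the idea, assuming a small set of common ASCII emoticons and the stringr package, looks like this:

library(stringr)

# Hypothetical pattern covering a handful of common ASCII emoticons.
emoticon_pattern <- ":-?[)(DPp]|;-?\\)|<3"

# Return the emoticons found in each status update as a single
# space-separated string, suitable for storing in its own column.
extract_emoticons <- function(text) {
  vapply(str_extract_all(text, emoticon_pattern),
         paste, character(1), collapse = " ")
}

extract_emoticons("Finally some good news :) <3")
#> [1] ":) <3"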

Multiple Tweets

Perceptions are associated with users, and those perceptions will be expressed in status updates. While it's possible for a user to change their perception, they probably will not do so over a period of minutes or hours, so I think it is safe to avoid the complexity of somehow combining or accounting for multiple tweets. That said, I will explore how many users tweet more than once over the five-day period and adjust this strategy if the data suggests this assumption is not correct.
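As a quick way to test that assumption once the data is compiled (a sketch, assuming the combined tweets table carries a screen_name column):

library(dplyr)

# Share of users who tweeted more than once over the five-day window.
tweets %>%
  count(screen_name, name = "tweets_per_user") %>%
  summarise(users = n(),
            multi_tweet_users = sum(tweets_per_user > 1),
            share_multi = round(mean(tweets_per_user > 1), 2))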

Locations

I will perform analysis to compare the two sets of locations to assess differences and potentially perform exploratory analysis with both sets of locations.

Location Variability

I will build functions to clean and normalize the data to make it as consistent as possible.
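To give a flavor of what those functions will do, here is a sketch with two hypothetical state patterns; the real functions will need to cover all states, abbreviations and spelling variations:

library(dplyr)
library(stringr)

# Map free-text profile locations to a normalized state name.
normalize_state <- function(location) {
  case_when(
    str_detect(location, regex("pennsylvania|,\\s*pa\\b|\\bpenna?\\b",
                               ignore_case = TRUE)) ~ "Pennsylvania",
    str_detect(location, regex("california|,\\s*ca\\b",
                               ignore_case = TRUE)) ~ "California",
    TRUE ~ NA_character_
  )
}

normalize_state(c("West Pittston, PA", "Penna", "Los Angeles, CA"))
#> [1] "Pennsylvania" "Pennsylvania" "California"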

Users Outside the United States

If democratic party politics within the United States is a significant driver of perception, it would make sense to account for international users. I will keep these users in the data set for additional analysis, but identify them so they can be filtered when assessing party politics as a driver.
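A simple way to carry that flag through the cleaned data (a sketch only, assuming a hypothetical country value derived from the profile location during cleaning):

library(dplyr)

# Flag users with a recognizably non-US profile location; keep them
# in the data set but make them easy to filter out.
users <- users %>%
  mutate(international = !is.na(country) & country != "United States")

# Example: restrict the party-politics analysis to US users.
us_users <- users %>%
  filter(!international)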

Text Quality

When performing sentiment analysis, I will use standard dictionaries to identify words. This may reduce the accuracy of the analysis, but it is a practical starting point.
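The dictionary approach will follow the usual tidytext pattern; here is a sketch using the Bing lexicon (one of several standard options) and an assumed status_id column from the API data:

library(dplyr)
library(tidytext)

# Tokenize the cleaned status text and score words against a standard
# sentiment lexicon; misspellings and slang simply will not match.
tweet_sentiment <- tweets %>%
  select(status_id, text) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(status_id, sentiment)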

User Analysis

Unique Users

The average number of tweets per user over the time frame is 2.67. I also want to look at the range of tweets per user and the frequency distribution (histogram).
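A quick sketch of those summaries, using the tweet_count column that appears later in this analysis:

library(dplyr)
library(ggplot2)

# Average and range of tweets per user over the five-day window.
users %>%
  summarise(mean_tweets = mean(tweet_count),
            min_tweets = min(tweet_count),
            max_tweets = max(tweet_count))

# Frequency distribution of tweets per user.
ggplot(users, aes(tweet_count)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Tweets Per User",
       x = "Tweets",
       y = "Users")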

Twitter Bots

When looking at the number of status updates per user, a few stood out for the large number of updates.

What criteria identify a suspected bot account? The central question is whether the account's activity resembles that of a person:

  1. Does the account converse with friends, or does it say things to users who don't interact with it?
  2. Is there diversity in its posts, or does it stick to one topic?
  3. Is the Twitter handle nonsensical or unrelated to the name of the account holder?
  4. Is the account holder's name misspelled?
  5. Does the website name offer a clue?
  6. Is the ratio of following to followers high, e.g. 10:1? That indicates the bot owner is following people at random to get them to follow back.
  7. Do the tweets lack actual content?

Check to see whether Twitter has suspended the accounts.
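One way to run that check is to look the suspected handles up again; accounts Twitter has suspended or removed simply drop out of the results. A sketch, assuming the rtweet package and placeholder handles:

library(rtweet)

# Hypothetical handles stand in for the suspected bot accounts.
suspected_handles <- c("example_handle_1", "example_handle_2")

still_active <- lookup_users(suspected_handles)

# Handles missing from the lookup have been suspended or deleted.
setdiff(suspected_handles, still_active$screen_name)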

Analysis

Load users data

data(users)

Examine users by age of accounts

summary_by_age <- users %>%
  group_by(account_age_in_years) %>%
  summarise(count = n()) %>%
  arrange(account_age_in_years) %>%
  mutate(cumulative_count = cumsum(count),
         frequency = round(count / sum(count), 2),
         cumulative_frequency = round(cumulative_count / sum(count), 2))

kable(summary_by_age,
      caption = "User Accounts Age By Year",
      col.names = c("Age In Years",
                    "Accounts",
                    "Cumulative Accounts",
                    "Frequency",
                    "Cumulative Frequency"),
      align = "lrrrr",
      format.args = list(big.mark = ","))

ggplot(summary_by_age, aes(account_age_in_years, count)) + 
  geom_col() +
  scale_x_continuous(breaks = c(0:13)) +
  labs(title = "User Accounts Age By Year",
       x = "Age in Years",
       y = "Count")

Mean account age is `r round(mean(users$account_age_in_years), 1)` years.

Twitter was founded on March 21, 2006, and some of the platform's early adopters are captured in the data: 193 users have accounts that date back to 2006 or 2007.

The frequency distribution shows that each account age makes up between 5 and 10 percent of overall users, with the exception of 2010: users whose accounts date back to 2010 make up 18 percent of the accounts captured. This jump is prominent in the plot above.

I did some research to verify this pattern and came across a Business Insider article that documents a 44 percent increase in the Twitter account population in the first half of 2010. So, for now, I'll assume this is a legitimate pattern in the data.

Article link:

https://www.businessinsider.com/chart-of-the-day-new-twitter-accounts-2010-12

Examine users by tweet volume.

summary_by_tweet_count <- users %>%
  group_by(tweet_count) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(cumulative_count = cumsum(count),
         frequency = round(count / sum(count), 4),
         cumulative_frequency = round(cumulative_count / sum(count), 4))

kable(summary_by_tweet_count,
      caption = "User Accounts By Tweet Count",
      col.names = c("Tweet Count",
                    "Accounts",
                    "Cumulative Accounts",
                    "Frequency",
                    "Cumulative Frequency"),
      align = "lrrrr",
      format.args = list(big.mark = ","))

91 percent of the users captured in this sample posted fewer than five status updates, and nearly 67 percent tweeted only once.

Some accounts stand out based on the large volume of status updates. There are 11 accounts with 100 or more tweets in the sample, and one account with over 1,000 tweets. I suspect these might be Twitter bot accounts and will need to verify them.
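Pulling those accounts out for a closer look (a sketch, assuming a screen_name column in the users table):

library(dplyr)

# Accounts with unusually high tweet volume in the sample.
users %>%
  filter(tweet_count >= 100) %>%
  arrange(desc(tweet_count)) %>%
  select(screen_name, tweet_count, account_age_in_years)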

What's a reasonable number of tweets per day per user? Is there any published data?

One article suggests 4.4 tweets per day. This sample has 42 days of tweets, so using the 4.4 metric would place the threshold at 193 tweets. All but the last three groups in the sample fall below that threshold (one user has a total of 194 tweets, but we'll treat that account as falling below the threshold).

https://blog.hubspot.com/blog/tabid/6307/bid/4594/ls-22-Tweets-Per-Day-the-Optimum

Examine users by location. Start by showing the count of users with a missing value in the state name variable versus those that have a valid value.

summary_by_state_value <- users %>%
  mutate(has_state_name = !is.na(state_name)) %>%
  group_by(has_state_name) %>%
  summarise(count = n()) %>%
  mutate(percent = round(count / sum(count), 2))

kable(summary_by_state_value,
      caption = "Summary By State Value",
      col.names = c("Has State Name",
                    "Count",
                    "Percent"),
      align = "lrr",
      format.args = list(big.mark = ","))

So we have good state names for 42 percent of the users in the dataset. This group was captured by searching for valid state names or codes in the location variable.

There are opportunities to capture additional users by enhancing the search to include cities within the United States and variations in spelling, e.g. "Penn" and "Penna" both refer to Pennsylvania. There is also an opportunity to capture users who indicate that they reside in the United States but did not include a specific state in the location variable.

Plot the number of users by state.

summary_by_state <- users %>%
  filter(!is.na(state_name)) %>%
  group_by(state_name) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  mutate(cumulative_count = cumsum(count),
         frequency = round(count / sum(count), 4),
         cumulative_frequency = round(cumulative_count / sum(count), 4))

kable(summary_by_state,
      caption = "User Accounts By State",
      col.names = c("State",
                    "Users",
                    "Cumulative Users",
                    "Frequency",
                    "Cumulative Frequency"),
      align = "lrrrr",
      format.args = list(big.mark = ","))

ggplot(summary_by_state, aes(reorder(state_name, count), count)) + 
  geom_col() +
  coord_flip() +
  labs(title = "User Accounts By State",
       x = "Count",
       y = "State")

Generally speaking, the order of states appears to be consistent with the population of states measured by 2019 census estimates. California, Texas, Florida, New York, Pennsylvania, Illinois, Ohio and Georgia have the highest estimated populations as of July 1, 2019. With the exception of Ohio and Georgia, these states top the list of states in the dataset. There is a similar consistency with states that have the lowest populations as well.
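One way to make that comparison explicit is to rank the states both ways and correlate the ranks; a sketch, assuming a hypothetical state_populations table keyed by state_name with a population column:

library(dplyr)

# Compare each state's rank in the sample with its rank by estimated
# 2019 population.
rank_check <- summary_by_state %>%
  mutate(sample_rank = row_number()) %>%
  inner_join(state_populations, by = "state_name") %>%
  mutate(population_rank = min_rank(desc(population)))

cor(rank_check$sample_rank, rank_check$population_rank,
    method = "spearman")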

Census estimate source data:

https://simple.m.wikipedia.org/wiki/list/List_of_U.S._states_by_population

Load tweets data.

data(tweets)

summary_by_date <- tweets %>%
  group_by(created_at_date) %>%
  summarise(count = n())

ggplot(summary_by_date, aes(created_at_date, count)) +
  geom_col() +
  scale_x_date(date_minor_breaks = "1 day") +
  labs(title = "Tweets By Date",
       x = "Date",
       y = "Count")

summary_by_weekday <- tweets %>%
  group_by(created_at_weekday) %>%
  summarise(count = n())

ggplot(summary_by_weekday, aes(created_at_weekday, count)) +
  geom_col() +
  labs(title = "Tweets By Weekday",
       x = "Weekday",
       y = "Count")

summary_by_hour <- tweets %>%
  group_by(created_at_hour) %>%
  summarise(count = n())

ggplot(summary_by_hour, aes(created_at_hour, count)) +
  geom_col() +
  labs(title = "Tweets By Hour",
       x = "Hour",
       y = "Count")

summary_by_day_and_hour <- tweets %>%
  group_by(created_at_date, created_at_hour) %>%
  summarise(count = n())

ggplot(summary_by_day_and_hour, aes(created_at_hour, count, group = created_at_date)) +
  geom_line() +
  labs(title = "Tweets By Date And Hour",
       x = "Hour",
       y = "Count")

