ISMB Twitter Analysis

The goal of this work is to analyze two collections of archived tweets from the ISMB conferences of 2012 and 2014, and to compare and contrast them. The tweets were collected by Neil Saunders (his initial analysis and data) and Stephen Turner (initial analysis and data).

Before we use this data, we will do some munging on it so that the two datasets are comparable and the same functions can be applied to both.

baseLoc <- system.file(package="ismbTweetAnalysis")
extPath <- file.path(baseLoc, "extdata")
load(file.path(extPath, "ismb2012.RData"))
ismb12 <- ismb[, c("text", "created", "id", "screenName")]
ismb12$hashSearch <- "ismb"
save(ismb12, file="data/ismb2012.RData")

This is especially important for the 2014 data, because it is not in RData format, and three different hashtags were searched for.

ismb <- readTweetData(file.path(extPath, "ismb.txt"), "ismb")
ismb2014 <- readTweetData(file.path(extPath, "ismb2014.txt"), "ismb2014")
ismb14 <- readTweetData(file.path(extPath, "ismb14.txt"), "ismb14")
ismb14 <- rbind(ismb, ismb14, ismb2014)
save(ismb14, file="data/ismb2014.RData")
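
Because a single tweet can carry more than one of the three hashtags, the same tweet could show up in more than one search file. The sketch below is not part of the package: it uses made-up data frames with the columns kept for 2012 plus `hashSearch`, and it assumes that `id` uniquely identifies a tweet, which the source does not confirm. It shows one way to check for such duplicates after `rbind` and drop them.

```r
# Toy stand-ins for two hashtag searches (hypothetical data, not the real tweets)
searchA <- data.frame(
  text = c("Great keynote #ismb", "Poster session #ismb #ismb14"),
  created = as.POSIXct(c("2014-07-11 09:00", "2014-07-11 10:00"), tz = "UTC"),
  id = c("1", "2"),
  screenName = c("userA", "userB"),
  hashSearch = "ismb",
  stringsAsFactors = FALSE
)
searchB <- data.frame(
  text = "Poster session #ismb #ismb14",
  created = as.POSIXct("2014-07-11 10:00", tz = "UTC"),
  id = "2",
  screenName = "userB",
  hashSearch = "ismb14",
  stringsAsFactors = FALSE
)

combined <- rbind(searchA, searchB)

# How many rows repeat a tweet id already seen in an earlier search?
nDuplicated <- sum(duplicated(combined$id))

# Keep only the first occurrence of each tweet id (assumes id is unique per tweet)
deduped <- combined[!duplicated(combined$id), ]
```

Whether to drop such rows depends on the question being asked: per-hashtag comparisons need every row, while per-user counts would otherwise count a multi-hashtag tweet more than once.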

Analysis

Let's do some simple things in each case. For example, we can look at the distribution of tweets over time, who had the most retweets, and who was the most prolific tweeter (as a percentage of total tweets), and even look for changes between the two years. Note that this is not the most comprehensive analysis we could do; it is mostly an example of how to write up an analysis as a package vignette, but I need to do something, right?

2012

library(ismbTweetAnalysis)
data(ismb2012)
head(ismb12)
library(ggplot2)
ggplot(ismb12, aes(x=created)) + geom_bar()
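
`geom_bar()` counts rows per distinct x value, so on raw timestamps each bar corresponds to one exact time. As a rough illustration of what a coarser view looks like, the sketch below bins a few made-up timestamps by hour using base R only. The timestamps and the names `hourBin` and `perHour` are illustrative and not from the package.

```r
# Made-up tweet times (hypothetical, not taken from the ISMB data)
created <- as.POSIXct(
  c("2012-07-13 08:35:10", "2012-07-13 08:50:00",
    "2012-07-13 09:05:30", "2012-07-13 11:15:00"),
  tz = "UTC"
)

# Truncate each timestamp to the start of its hour
hourBin <- format(created, "%Y-%m-%d %H:00", tz = "UTC")

# Count tweets per hour; hours with no tweets do not appear in the table
perHour <- as.data.frame(table(hour = hourBin),
                         responseName = "nTweet",
                         stringsAsFactors = FALSE)
perHour
```

In ggplot2 a similar effect can be had with `geom_histogram(binwidth = 3600)`, since the bin width for a date-time variable is given in seconds.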

Who does the most tweeting, counting both original tweets and retweets?

counts2012 <- tweetCounts(ismb12)
head(counts2012)
head(counts2012[order(counts2012$total, decreasing = TRUE),])
head(counts2012[order(counts2012$original, decreasing = TRUE), ])
head(counts2012[order(counts2012$retweet, decreasing = TRUE), ])
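
The internals of `tweetCounts()` are not shown here, so the following is only a guess at the kind of table it produces. The helper `countTweetsSketch()` is hypothetical. It assumes a retweet can be recognized by text starting with `RT @`, and it adds a `percTotal` column for the "percentage of total tweets" idea mentioned earlier.

```r
# Hypothetical counting helper; the package's tweetCounts() may work differently
countTweetsSketch <- function(tweets) {
  # Assumption: retweets are marked by a leading "RT @"
  isRT <- grepl("^RT @", tweets$text)
  users <- sort(unique(tweets$screenName))
  total <- as.vector(table(factor(tweets$screenName, levels = users)))
  retweet <- as.vector(table(factor(tweets$screenName[isRT], levels = users)))
  out <- data.frame(
    screenName = users,
    total = total,
    original = total - retweet,
    retweet = retweet,
    stringsAsFactors = FALSE
  )
  # Share of all tweets contributed by each user, in percent
  out$percTotal <- 100 * out$total / sum(out$total)
  rownames(out) <- users
  out
}

# Made-up tweets for illustration
toy <- data.frame(
  text = c("Talk starting now", "RT @userA: Talk starting now",
           "Nice poster", "RT @userB: slides are up"),
  screenName = c("userA", "userB", "userA", "userA"),
  stringsAsFactors = FALSE
)
toyCounts <- countTweetsSketch(toy)
toyCounts[order(toyCounts$total, decreasing = TRUE), ]
```

Setting the row names to the screen names makes it possible to look users up by name, which is the indexing style used later in the comparison.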

2014

Let's do a similar analysis for 2014.

data(ismb2014)

Simple visualization of the tweets by time.

ggplot(ismb14, aes(x = created)) + geom_bar()

Again, who does the most tweeting and retweeting?

counts2014 <- tweetCounts(ismb14)
head(counts2014)
head(counts2014[order(counts2014$total, decreasing=TRUE), ])
head(counts2014[order(counts2014$retweet, decreasing=TRUE), ])

Comparison

Density of Tweets Compared to Start Time

Now we want to compare the two datasets. First, we will compare the frequency of tweets over time relative to the start of each conference. In 2012, the special interest groups started at 8:30 on July 13, 2012. In 2014, the start was at 8:30 on July 11, 2014. For each tweet, we will calculate the time difference from the start time, in hours.

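# The zone is passed via tz; a "PST" suffix inside the string is not parsed as a time zone.
start2012 <- as.POSIXlt("2012-07-13 08:30", tz = "America/Los_Angeles")
start2014 <- as.POSIXlt("2014-07-11 08:30", tz = "America/Los_Angeles")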

diff12 <- as.numeric(difftime(ismb12$created, start2012, units = "hours"))
diff14 <- as.numeric(difftime(ismb14$created, start2014, units = "hours"))

diffAll <- data.frame(time = c(diff12, diff14), year = rep(c("12", "14"), times = c(length(diff12), length(diff14))))
ggplot(diffAll, aes(x = time, fill = year)) + geom_density(alpha = 0.5)
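
As a small worked example of the offset calculation, the sketch below applies `difftime()` to three made-up timestamps around a start time. The timestamps and the explicit `tz` value are assumptions for illustration only. Tweets sent before the start get negative values.

```r
# Hypothetical start time and tweet times, all in the same explicit time zone
start <- as.POSIXct("2012-07-13 08:30", tz = "America/Los_Angeles")
created <- as.POSIXct(
  c("2012-07-12 20:30", "2012-07-13 08:30", "2012-07-13 11:00"),
  tz = "America/Los_Angeles"
)

# Hours elapsed since the start; negative means before the start
hoursFromStart <- as.numeric(difftime(created, start, units = "hours"))
hoursFromStart
```

Giving both the tweet times and the start time an explicit time zone keeps the offsets from depending on the time zone of the machine that runs the code.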

Ranking of Repeat Tweeters

Assuming that there are repeat tweeters between the years, let's compare their ranks across the two years. We will use two metrics: the total number of tweets an individual posted, and how many retweets an individual received (a measure of the popularity of their tweets).

bothYears <- intersect(ismb12$screenName, ismb14$screenName)
counts2012$rank <- tweetRank(counts2012$total)
counts2014$rank <- tweetRank(counts2014$total)

countDiff1214 <- abs(counts2012[bothYears, "rank"] - counts2014[bothYears, "rank"])
countDiff1214 <- data.frame(screenName = bothYears, diff = countDiff1214, stringsAsFactors = FALSE)
countDiff1214 <- countDiff1214[order(countDiff1214$diff),]
head(countDiff1214)
tail(countDiff1214)

rtCount12 <- retweetCount(ismb12)
rtCount14 <- retweetCount(ismb14)
rtTot12 <- totalRT(rtCount12, "countRT")
rtTot14 <- totalRT(rtCount14, "countRT")

rtTot12$rank <- tweetRank(rtTot12$sumRT)
rtTot14$rank <- tweetRank(rtTot14$sumRT)

bothRT <- intersect(rtTot12$screenName, rtTot14$screenName)

rtDiff1214 <- abs(rtTot12[bothRT, "rank"] - rtTot14[bothRT, "rank"])
rtDiff1214 <- data.frame(screenName = bothRT, diff = rtDiff1214, stringsAsFactors = FALSE)
rtDiff1214 <- rtDiff1214[order(rtDiff1214$diff), ]
head(rtDiff1214)
tail(rtDiff1214)
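
The rank comparison above can be illustrated on toy data. In this sketch `rankSketch()` is a hypothetical stand-in for `tweetRank()`, whose actual tie handling is not shown in this document. The counts are made up, and the screen names are assumed to be the row names of each table, so that indexing by name works.

```r
# Hypothetical ranking helper: rank 1 is the highest count, ties share the lowest rank
rankSketch <- function(counts) {
  rank(-counts, ties.method = "min")
}

# Made-up per-user totals for two years, with screen names as row names
y12 <- data.frame(total = c(50, 20, 20, 5),
                  row.names = c("userA", "userB", "userC", "userD"))
y14 <- data.frame(total = c(5, 40, 30, 20),
                  row.names = c("userA", "userC", "userE", "userF"))
y12$rank <- rankSketch(y12$total)
y14$rank <- rankSketch(y14$total)

# Users present in both years, and the absolute change in their rank
both <- intersect(rownames(y12), rownames(y14))
rankDiff <- data.frame(
  screenName = both,
  diff = abs(y12[both, "rank"] - y14[both, "rank"]),
  stringsAsFactors = FALSE
)
rankDiff[order(rankDiff$diff), ]
```

Note that each rank is computed over all tweeters in a year, not only the repeat tweeters, so a year with many more participants can shift ranks even for someone whose tweet count did not change.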


rmflight/ismbTweetAnalysis documentation built on May 27, 2019, 9:32 a.m.