The goal of this work is to analyze two different collections of archived tweets from the ISMB conferences from 2012 and 2014, and compare and contrast them. The collections of tweets are from Neil Saunders (his initial analysis and data) and Stephen Turner (initial analysis and data).
Before we use this data, we will do some munging so that the two datasets are comparable and the same functions can be applied to both.
baseLoc <- system.file(package = "ismbTweetAnalysis")
extPath <- file.path(baseLoc, "extdata")
load(file.path(extPath, "ismb2012.RData"))
# keep only the columns shared with the 2014 data, and record the search term
ismb12 <- ismb[, c("text", "created", "id", "screenName")]
ismb12$hashSearch <- "ismb"
save(ismb12, file = "data/ismb2012.RData")
This is especially important for the 2014 data, because it is not in RData format, and three different hashtags were searched for.
ismb <- readTweetData(file.path(extPath, "ismb.txt"), "ismb")
ismb2014 <- readTweetData(file.path(extPath, "ismb2014.txt"), "ismb2014")
ismb14 <- readTweetData(file.path(extPath, "ismb14.txt"), "ismb14")
# combine the three hashtag searches into one 2014 dataset
ismb14 <- rbind(ismb, ismb14, ismb2014)
save(ismb14, file = "data/ismb2014.RData")
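Because three hashtag searches are stacked with rbind, a tweet that carries more than one of the hashtags could appear more than once in the combined data. Whether that happens in the real files is not checked here. The following is only a minimal sketch of one way to look for and drop such repeats, assuming the combined data has an `id` column like the 2012 subset; the toy data frame and its values are invented for illustration.

```r
# Toy stand-in for the combined 2014 data (invented values; column names
# follow the 2012 subset, which is an assumption about readTweetData output).
combined <- data.frame(
  text       = c("Keynote starting #ismb #ismb14",
                 "Keynote starting #ismb #ismb14",
                 "Poster session now #ismb2014"),
  id         = c("101", "101", "102"),
  screenName = c("userA", "userA", "userB"),
  hashSearch = c("ismb", "ismb14", "ismb2014"),
  stringsAsFactors = FALSE
)

# How many rows repeat an id that was already seen?
nRepeats <- sum(duplicated(combined$id))

# Keep only the first occurrence of each tweet id.
deduped <- combined[!duplicated(combined$id), ]
nrow(deduped)
```

Note that this keeps the `hashSearch` value of the first match only, so it would not be suitable if the per-hashtag breakdown matters.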
Let's do some simple things in each case. For example, we can look at the distribution of tweets by time, who had the most retweets, who was the most prolific tweeter (as a percentage of total tweets), etc., and even look for changes between the two years. Note that this is not the most comprehensive analysis we could do, because it is mostly an example of how to write an analysis as a package vignette, but I need to do something, right?
library(ismbTweetAnalysis)
data(ismb2012)
head(ismb12)
library(ggplot2)
ggplot(ismb12, aes(x = created)) + geom_bar()
What about who does the most tweeting, whether direct or retweets?
counts2012 <- tweetCounts(ismb12)
head(counts2012)
head(counts2012[order(counts2012$total, decreasing = TRUE), ])
head(counts2012[order(counts2012$original, decreasing = TRUE), ])
head(counts2012[order(counts2012$retweet, decreasing = TRUE), ])
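The internals of tweetCounts are not shown in this vignette. As a rough sketch of the kind of tabulation involved, the hypothetical helper below (`countTweetsSketch`, not part of the package) counts original tweets and retweets per user, assuming that a retweet is any tweet whose text begins with "RT @". The toy data is invented.

```r
# Hypothetical helper, NOT the package's tweetCounts: tabulates original
# tweets and retweets per screen name. Assumes retweets start with "RT @".
countTweetsSketch <- function(tweets) {
  isRetweet <- grepl("^RT @", tweets$text)
  users <- sort(unique(tweets$screenName))
  # factor() with fixed levels keeps users that have zero counts
  original <- as.vector(table(factor(tweets$screenName[!isRetweet], levels = users)))
  retweet  <- as.vector(table(factor(tweets$screenName[isRetweet], levels = users)))
  data.frame(original = original, retweet = retweet,
             total = original + retweet, row.names = users)
}

# Invented example tweets
toy <- data.frame(
  text = c("Nice keynote", "RT @userB: slides are up", "Slides are up",
           "RT @userA: Nice keynote", "Another thought"),
  screenName = c("userA", "userA", "userB", "userC", "userA"),
  stringsAsFactors = FALSE
)

counts <- countTweetsSketch(toy)
counts[order(counts$total, decreasing = TRUE), ]
```

Using screen names as row names mirrors how the count tables are indexed by name later in the analysis.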
Let's do a similar analysis for 2014.
data(ismb2014)
Simple visualization of the tweets by time.
ggplot(ismb14, aes(x = created)) + geom_bar()
Again, who does the most tweeting and retweeting?
counts2014 <- tweetCounts(ismb14)
head(counts2014)
head(counts2014[order(counts2014$total, decreasing=TRUE), ])
head(counts2014[order(counts2014$retweet, decreasing=TRUE), ])
Now we want to compare the two datasets. Initially, we will compare the frequency of tweets over time relative to the starting date of each conference. In 2012, the special interest groups started at 8:30 on July 13, 2012. In 2014, the start was at 8:30 on July 11, 2014. For each tweet, we will calculate the time elapsed since the start time, in hours.
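# Conference start times. The zone is passed via tz because a zone
# abbreviation inside the string is not parsed by the date-time parser.
start2012 <- as.POSIXct("2012-07-13 08:30", tz = "America/Los_Angeles")
start2014 <- as.POSIXct("2014-07-11 08:30", tz = "America/Los_Angeles")
# hours elapsed between the conference start and each tweet
diff12 <- as.numeric(difftime(ismb12$created, start2012, units = "hours"))
diff14 <- as.numeric(difftime(ismb14$created, start2014, units = "hours"))
diffAll <- data.frame(time = c(diff12, diff14),
                      year = rep(c("12", "14"), times = c(length(diff12), length(diff14))))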
ggplot(diffAll, aes(x = time, fill = year)) + geom_density(alpha = 0.5)
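To make the offset calculation concrete, here is a small worked example. It assumes the tweet timestamps are stored in UTC, which is not confirmed for these datasets, and uses invented timestamps. In July, 08:30 in Los Angeles corresponds to 15:30 UTC.

```r
# Start time with an explicit time zone
start <- as.POSIXct("2012-07-13 08:30", tz = "America/Los_Angeles")

# Invented tweet timestamps, assumed to be recorded in UTC
created <- as.POSIXct(c("2012-07-13 15:30:00",
                        "2012-07-13 17:00:00",
                        "2012-07-14 15:30:00"), tz = "UTC")

# difftime compares absolute instants, so mixed time zones are handled
hoursFromStart <- as.numeric(difftime(created, start, units = "hours"))
hoursFromStart
```

Negative values would indicate tweets sent before the conference started.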
Assuming that there are repeat tweeters between the years, let's compare their ranks between the two years. We will use two metrics: the total number of original tweets (not retweets), and how many retweets an individual got (a measure of the popularity of a tweet).
bothYears <- intersect(ismb12$screenName, ismb14$screenName)
counts2012$rank <- tweetRank(counts2012$total)
counts2014$rank <- tweetRank(counts2014$total)
# absolute change in rank for users present in both years
countDiff1214 <- abs(counts2012[bothYears, "rank"] - counts2014[bothYears, "rank"])
countDiff1214 <- data.frame(screenName = bothYears, diff = countDiff1214, stringsAsFactors = FALSE)
countDiff1214 <- countDiff1214[order(countDiff1214$diff), ]
head(countDiff1214)
tail(countDiff1214)
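The rank comparison can be illustrated on toy data. This sketch uses base R's rank() as a stand-in for tweetRank, whose exact behavior (for example, how ties are handled) is not shown here, and assumes the count tables use screen names as row names. All values are invented.

```r
# Invented per-user totals for two years; row names are screen names
counts12 <- data.frame(total = c(50, 20, 5),
                       row.names = c("userA", "userB", "userC"))
counts14 <- data.frame(total = c(10, 40, 30),
                       row.names = c("userA", "userB", "userD"))

# Stand-in for tweetRank: rank 1 = most tweets, ties share the smallest rank
counts12$rank <- rank(-counts12$total, ties.method = "min")
counts14$rank <- rank(-counts14$total, ties.method = "min")

# Users present in both years
both <- intersect(rownames(counts12), rownames(counts14))

# Absolute change in rank, smallest change first
rankDiff <- data.frame(
  screenName = both,
  diff = abs(counts12[both, "rank"] - counts14[both, "rank"]),
  stringsAsFactors = FALSE
)
rankDiff[order(rankDiff$diff), ]
```

Because each year's ranks are computed over that year's full set of users, a change in rank reflects both the user's own activity and how many other people were tweeting.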
rtCount12 <- retweetCount(ismb12)
rtCount14 <- retweetCount(ismb14)
rtTot12 <- totalRT(rtCount12, "countRT")
rtTot14 <- totalRT(rtCount14, "countRT")
rtTot12$rank <- tweetRank(rtTot12$sumRT)
rtTot14$rank <- tweetRank(rtTot14$sumRT)
bothRT <- intersect(rtTot12$screenName, rtTot14$screenName)
# absolute change in retweet rank for users present in both years
rtDiff1214 <- abs(rtTot12[bothRT, "rank"] - rtTot14[bothRT, "rank"])
rtDiff1214 <- data.frame(screenName = bothRT, diff = rtDiff1214, stringsAsFactors = FALSE)
rtDiff1214 <- rtDiff1214[order(rtDiff1214$diff), ]
head(rtDiff1214)
tail(rtDiff1214)