In Heipihanhan/declass: Query Data From Declassification Engine API

`declass` R Package for History Lab Declassification Engine API

Declassification Engine API is provided by Columbia University History Lab.

library(httr)
library(jsonlite)
library(rvest)

declass_welcome <- function(return_type = "string"){
  url <- "http://api.declassification-engine.org/declass"
  message <- httr::content(httr::GET(url))
  if (return_type == "html") {
    return(message)
  } else if (return_type == "string") {
    text <- rvest::html_text(message)
    return(text)
  } else {
    notice <- "Return_type not supported"
    return(notice)
  }
}

declass_welcome()

What collections are available ?

declass_collection <- function(return_type = "list"){
  url <- "http://api.declassification-engine.org/declass/v0.4/collections"
  parsed_json <- jsonlite::fromJSON(url)
  collections <- parsed_json$results$collections
  if (return_type == "list") {
    return(collections)
  } else if (return_type == "raw") {
    raw <- content(GET(url))
    return(raw)
  } else {
    notice <- "Return_type not supported"
    return(notice)
  }
}

declass_collection()

What's in the `clinton` email collection ?

declass_collection_entity <- function(collection) {
  if (collection %in% c("cpdoc", "clinton", "kissinger", "statedeptcables", "frus", 
                        "ddrs", "cabinet", "pdb")) {
    url <- paste("http://api.declassification-engine.org/declass/v0.4/entity_info/?collection=", 
                 collection, sep = "")
    parsed_json <- jsonlite::fromJSON(url)
    collection <- parsed_json$results$entity
    return(collection)
  } else {
    notice <- "Collection name and entities not found"
    return(notice)
  }
}

declass_collection_entity("clinton")

What countries are mentioned in the `clinton` emails ?

declass_entity_data <- function(collection, entity, page_size = 25, page = 1, num_pages = 10) {
  url <- paste("http://api.declassification-engine.org/declass/v0.4/entity_info/?collection=",
               collection, "&entity=", entity, "&page_size=", page_size, "&page=", page, sep = "")

   if (collection %in% c("cpdoc",  "kissinger", "cabinet") & 
       entity %in% c("countries", "persons", "topics")) {
    }
   else if (collection %in% c("statedeptcables", "frus", "ddrs", "pdb") & 
            entity %in% c("countries", "persons", "topics", "classifications")) {
    }
   else if (collection == "clinton" & entity %in% c("countries", "persons", "classifications")) {
      } 
  else {
    notice <- "Collection or entity name not found"
    return(notice)

  }
  print(url)
  parsed_json <- jsonlite::fromJSON(url)
  entity_data <- parsed_json$results
  num_pages <- num_pages - 1

  while (is.null(parsed_json$next_page) == FALSE & num_pages > 0) {
    print(parsed_json$next_page)
    parsed_json <- jsonlite::fromJSON(parsed_json$next_page)
    entity_data <- rbind(entity_data, parsed_json$results)
    num_pages <- num_pages - 1
  }
  return(entity_data)
}

By default, the function declass_entity_data shows 25 records per page and up to 10 pages, which means 250 records in total. However, you can customize it based on what you need. Here in clinton collection, we see that there's a total of 200 countries. We don't need more.

We could do some exploratory analysis on top of this.

clinton_countries <- declass_entity_data("clinton", "countries")

summary(clinton_countries)

It looks like the dataset is highly skewed. We could only look at countries who are mentioned above the average count.

library(dplyr)
d <- clinton_countries %>% 
  select(name, count) %>% 
  filter(count > 288) %>% 
  arrange(desc(count))
knitr::kable(d)

The country United States is what skewed the data. It's not surprising to see in clinton's email, as being the U.S. Secretary of State, she mentionted her own country the most. I will filter out the country United States, and see what other countries did Clinton mention the most in her emails.

d <- d %>% 
  filter(name != "United States")

library(ggplot2)
gg <- ggplot(d, aes(x = name, y = count))
gg + geom_bar(stat = "identity") + theme_gray()

We could also map out the top countries mentioned in Clinton's emails.

Here's a small tutorial.

#install.packages("ggmap")
library(ggmap)

map_countries <- map_data("world")
as.factor(d$name) %>% levels()
d$name <- recode(d$name, "United States" = "USA", "United Kingdome" = "UK")

map_countries_joined <- left_join(map_countries, d, by = c("region" = "name"))

map_countries_joined <- map_countries_joined %>% 
  mutate(fill = ifelse(is.na(count), F, T))
head(map_countries_joined)

ggplot() + geom_polygon(data = map_countries_joined, aes(x = long, y = lat, group = group, fill = fill)) + 
  scale_fill_manual(values = c("#CCCCCC", "#e60000")) + 
  labs(title = "Top 30 Countries mentionted in Clinton Emails (exclude USA)", 
       subtitle = "Data Source: Columbia University History Lab") + 
  theme(text = element_text(family = "Gill Sans", color = "#FFFFFF"), 
        panel.background = element_rect(fill = "#444444"),
        plot.background = element_rect(fill = "#444444"),
        panel.grid = element_blank(),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 10),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none")