United Nations Voting Correlations
In widyr: Widen, Process, then Re-Tidy Data

library(knitr)

options(width = 102)
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

library(ggplot2)
theme_set(theme_bw())

Here we'll walk through an example application of the widyr package, focusing on the pairwise_cor function. We'll use the data on United Nations General Assembly voting from the unvotes package:

if (!requireNamespace("unvotes", quietly = TRUE)) {
  print("This vignette requires the unvotes package to be installed. Exiting...")
  knitr::knit_exit()
}

library(dplyr)
library(unvotes)

un_votes

This dataset has one row for each country for each roll call vote. We're interested in finding pairs of countries that tended to vote similarly.

Pairwise correlations

Notice that the vote column is a factor, with levels (in order) "yes", "abstain", and "no":

levels(un_votes$vote)

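To measure country-to-country agreement across roll calls, we can convert vote to a number (1 for "yes", 2 for "abstain", 3 for "no") and compute the correlation between every pair of countries with the pairwise_cor function. The use = "pairwise.complete.obs" argument is passed on to cor(), so each correlation is computed from only the roll calls on which both countries have a recorded vote.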

library(widyr)

cors <- un_votes %>%
  mutate(vote = as.numeric(vote)) %>%
  pairwise_cor(country, rcid, vote, use = "pairwise.complete.obs", sort = TRUE)

cors
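
To make the computation concrete, here is a minimal sketch on a made-up dataset. The countries A, B, and C and their votes are hypothetical, not taken from unvotes. The sketch assumes that pairwise_cor gives the same values as calling cor() on a matrix with one row per roll call and one column per country, which should hold here because the toy data has no missing votes.

```r
library(dplyr)
library(widyr)

# Hypothetical toy data: 3 countries x 4 roll calls (not from unvotes)
toy_votes <- tibble(
  country = rep(c("A", "B", "C"), each = 4),
  rcid = rep(1:4, times = 3),
  vote = factor(
    c("yes", "no", "abstain", "yes",    # country A
      "yes", "no", "no", "yes",         # country B
      "no", "yes", "abstain", "no"),    # country C
    levels = c("yes", "abstain", "no")
  )
)

# as.numeric() on a factor returns the level index: yes = 1, abstain = 2, no = 3
toy_numeric <- toy_votes %>%
  mutate(vote = as.numeric(vote))

# Tidy in, tidy out: one row per ordered pair of countries
toy_cors <- toy_numeric %>%
  pairwise_cor(country, rcid, vote, use = "pairwise.complete.obs", sort = TRUE)

# The same correlations from a wide matrix (rows = roll calls, columns = countries)
wide <- matrix(toy_numeric$vote, nrow = 4,
               dimnames = list(1:4, c("A", "B", "C")))
cor(wide)

toy_cors
```

In this toy data A and C vote on opposite sides whenever either votes "yes" or "no", so their correlation is -1, while A and B differ on a single roll call and are strongly positively correlated.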

We could, for example, find the countries that the US is most and least in agreement with:

US_cors <- cors %>%
  filter(item1 == "United States")

# Most in agreement (cors was sorted by descending correlation via sort = TRUE)
US_cors

# Least in agreement
US_cors %>%
  arrange(correlation)
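
If only the top and bottom few rows are of interest, dplyr's slice_max and slice_min are one alternative to printing the whole table. The sketch below uses a small invented stand-in for US_cors (the partner labels and correlation values are hypothetical) so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for US_cors; values are invented for illustration
toy_us_cors <- tibble(
  item1 = "United States",
  item2 = c("P", "Q", "R", "S", "T"),
  correlation = c(0.8, 0.5, 0.1, -0.2, -0.6)
)

# Two partners with the highest correlation
most_agree <- toy_us_cors %>%
  slice_max(correlation, n = 2)

# Two partners with the lowest correlation
least_agree <- toy_us_cors %>%
  slice_min(correlation, n = 2)

most_agree
least_agree
```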

This can be particularly useful when visualized on a map.

if (require("maps", quietly = TRUE) &&
    require("fuzzyjoin", quietly = TRUE) &&
    require("countrycode", quietly = TRUE) &&
    require("ggplot2", quietly = TRUE)) {
  world_data <- map_data("world") %>%
    regex_full_join(iso3166, by = c("region" = "mapname")) %>%
    filter(region != "Antarctica")

  US_cors %>%
    mutate(a2 = countrycode(item2, "country.name", "iso2c")) %>%
    full_join(world_data, by = "a2") %>%
    ggplot(aes(long, lat, group = group, fill = correlation)) +
    # linewidth replaces size for line widths in ggplot2 >= 3.4.0
    geom_polygon(color = "gray", linewidth = .1) +
    scale_fill_gradient2() +
    coord_quickmap() +
    theme_void() +
    labs(title = "Correlation of each country's UN votes with the United States",
         subtitle = "Blue indicates agreement, red indicates disagreement",
         fill = "Correlation w/ US")
}

Visualizing clusters in a network

Another useful kind of visualization is a network plot, which can be created with Thomas Lin Pedersen's ggraph package. To keep the graph readable, we keep only pairs of countries whose correlation exceeds a chosen threshold (here 0.6).

if (require("ggraph", quietly = TRUE) &&
    require("igraph", quietly = TRUE) &&
    require("countrycode", quietly = TRUE)) {
  cors_filtered <- cors %>%
    filter(correlation > .6)

  continents <- tibble(country = unique(un_votes$country)) %>%
    filter(country %in% cors_filtered$item1 |
             country %in% cors_filtered$item2) %>%
    mutate(continent = countrycode(country, "country.name", "continent"))

  set.seed(2017)

  cors_filtered %>%
    graph_from_data_frame(vertices = continents) %>%
    ggraph() +
    geom_edge_link(aes(edge_alpha = correlation)) +
    geom_node_point(aes(color = continent), size = 3) +
    geom_node_text(aes(label = name), check_overlap = TRUE, vjust = 1, hjust = 1) +
    theme_void() +
    labs(title = "Network of countries with correlated United Nations votes")
}

Choosing the threshold for filtering correlations (or other measures of similarity) typically requires some trial and error: too high a threshold leaves the graph too sparse to show any structure, while too low a threshold makes it too crowded to read.
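
One way to make that trial and error less blind is to count how many edges and nodes survive at several candidate thresholds before plotting. The sketch below is only an illustration: toy_cors is an invented stand-in for cors, and summarise_threshold is a hypothetical helper, not part of widyr.

```r
# Hypothetical stand-in for cors; values are invented for illustration
toy_cors <- data.frame(
  item1 = c("A", "A", "A", "B", "B", "C"),
  item2 = c("B", "C", "D", "C", "D", "D"),
  correlation = c(0.9, 0.7, 0.2, 0.65, 0.4, -0.3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: edges and nodes remaining above a given threshold
summarise_threshold <- function(cors, threshold) {
  kept <- cors[cors$correlation > threshold, ]
  data.frame(
    threshold = threshold,
    n_edges = nrow(kept),
    n_nodes = length(unique(c(kept$item1, kept$item2)))
  )
}

# Compare a few candidate thresholds side by side
threshold_sweep <- do.call(
  rbind,
  lapply(c(0.3, 0.6, 0.8), summarise_threshold, cors = toy_cors)
)

threshold_sweep
```

When applying the same idea to the real cors, keep in mind that each pair of countries appears twice there (once in each direction), so the edge count is double the number of distinct pairs.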