In akrmenec/efficient-R: Becoming an Efficient R Programmer

This vignetted extends Chapter 6 of Efficient R Programming to discuss merging.

The starting point is map data and a mapping package:

library("efficient")
library("dplyr")
library("ggmap")
world = map_data("world")
names(world)

Visually compare this new dataset of the world with ghg_ems (e.g. via View(world); View(ghg_ems)). It is clear that the column region in the former contains the same information as Country in the latter. This will be the joining variable; renaming it in world will make the join more efficient.

data(ghg_ems, package = "efficient")
world = rename(world, Country = region)
ghg_ems$All = rowSums(ghg_ems[3:7])

``{block same-class, type = "rmdtip"} Ensure that both joining variables have the same class (combiningcharacterandfactor` columns can cause havoc).

How large is the overlap between `ghg_ems$Country` and `world$Country`? We can find out using the `%in%` operator, which finds out how many elements in one vector match those in another vector. Specifically, we will find out how many *unique* country names from `ghg_ems` are present in the `world` dataset:

```r
unique_countries_ghg_ems = unique(ghg_ems$Country)
unique_countries_world = unique(world$Country)
matched = unique_countries_ghg_ems %in% unique_countries_world
table(matched)

This comparison exercise has been fruitful: most of the countries in the co2 dataset exist in the world dataset. But what about the 20 country names that do not match? We can identify these as follows:

(unmatched_countries_ghg_ems <- unique_countries_ghg_ems[!matched])

It is clear from the output that some of the non-matches (e.g. the European Union) are not countries at all. However, others, such as 'Gambia, The' and the United States clearly should have matches. Fuzzy matching can help find which countries do match, as illustrated the first non-matching country below:

(unmatched_country = unmatched_countries_ghg_ems[1])
unmatched_world_selection = agrep(pattern = unmatched_country, unique_countries_world, max.distance = 10)
unmatched_world_countries = unique_countries_world[unmatched_world_selection]

What just happened? We verified that first unmatching country in the ghg_ems dataset was not in the world country names. So we used the more powerful agrep to search for fuzzy matches (with the max.distance argument set to 10. The results show that the country Antigua & Barbuda from the ghg_ems data matches two countries in the world dataset. We can update the names in the dataset we are joining to accordingly:

world$Country[world$Country %in% unmatched_world_countries] =
  unmatched_countries_ghg_ems[1]

The above code reduces the number of country names in the world dataset by replacing both "Antigua" and "Barbuda" to "Antigua & Barbuda". This would not work other way around: how would one know whether to change "Antigua & Barbuda" to "Antigua" or to "Barbuda".

Thus fuzzy matching is still a laborious process that must be complemented by human judgement. It takes a human to know for sure that United States is represented as USA in the world dataset, without risking false matches via agrep.

To fix the remaining issues, we simply continued with the same method, using a for loop and verifying the results instead of doing all by hand. The code used to match the remaining unmatched countries can be seen on the book's GitHub page.

i = n[1]
match_df = tibble(co2_name = n, w_name = NA)
for(i in n){
  (fm = agrep(i, w_u, max.distance = 10))
  (w_um = w_u[fm])
  match_df$w_name[match_df$co2_name == i] = paste(w_um, collapse = "|")
  # world$Country[world$Country %in% w_u1] = i
}

# View(match_df) # check the results: 1, 3 , 14, 16, 17 are right
i = 3
for(i in c(1, 3 , 14, 16, 17)){
  world$Country[grep(match_df$w_name[i], world$Country) ] =
    match_df$co2_name[i]
}
match_df = match_df[-c(1, 3 , 14, 16, 17),]

# manually fix countries with multiple matches
world$Country[grep("Baham", world$Country)] =
  c_u[grep("Baham", c_u)]
world$Country[grep("Democratic Republic of the Congo", world$Country)] =
  c_u[grep("Congo, Dem. Rep.", c_u)]
world$Country[grep("Republic of C", world$Country)] =
  c_u[grep("Congo, R", c_u)]
world$Country[grep("Ivo", world$Country)] =
  c_u[grep("Ivo", c_u)]
world$Country[grep("Gambia", world$Country)] =
  c_u[grep("Gambia", c_u)]
world$Country[grep("Macedonia", world$Country)] =
  c_u[grep("Macedonia", c_u)]
world$Country[grep("USA", world$Country)] =
  c_u[grep("United States", c_u)]
world$Country[grep("UK", world$Country)] =
  c_u[grep("United Kingdom", c_u)]
world$Country[grep("North Korea", world$Country)] =
  c_u[grep("Korea, Dem. Rep. \\(N", c_u)]
world$Country[grep("South Korea", world$Country)] =
  c_u[grep("Korea, Rep", c_u)]
world$Country[grep("Russia", world$Country)] =
  c_u[grep("Russia", c_u)]
world$Country[grep("Vincent", world$Country)] =
  c_u[grep("Vincent", c_u)]

# ghg_ems = ghg_ems[!ghg_ems$Country == "World",]

# save the result as 'm', for match
c_u = unique(ghg_ems$Country)
w_u = unique(world$Country)
m = c_u %in% w_u
# summary(m)
n = c_u[!m]

There is one more stage that is needed before global CO^2^ emissions can be mapped for any year: the data must be joined. The base function merge can do this but we strongly recommend using one of the join functions from dplyr, such as left_join (which keeps all rows in the original dataset) and inner_join (which keeps only rows with matches in both datasets). This is a very clear case of when dplyr is advantageous: merge()'s interface is complicated, the code is less readable, and *_join functions are faster. inner_ (which keeps all rows in both datasets for which there are matches) and left_ (which keeps only) join() methods are illustrated below:

nrow(world)
nrow({world_co2 = left_join(world, ghg_ems)})
nrow(inner_join(world, ghg_ems))

Note that inner_join removes rows from the world dataset which have no match in ghg_ems: if we were to plot the resulting dataset, the continent of Antarctica and a number of countries not represented in the ghg_ems dataset would be absent. Figure \@ref(fig:6-1) shows the results of this data carpentry, produced using a modified version of the ggplot2 code below, were worth the effort.

world_co2_2012 = filter(world_co2, Year == 2012 | is.na(Year))
ggplot(world_co2_2012, aes(long, lat)) +
  geom_polygon(aes(fill = All, group = group))

library("scales")
world_co2_2012 = filter(world_co2, Year == 2012 | is.na(Year))
ggplot(world_co2_2012, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = All)) +
  geom_path(size = 0.2) +
  scale_fill_gradient(
    low = "blue", high = "red", trans = "log",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)),
    name = expression(MtCO[2])) +
  coord_equal() +
  theme_nothing(legend = TRUE)
# ggsave("figures/world_co2.png")