This vignette extends Chapter 6 of *Efficient R Programming* to discuss merging.
The starting point is map data and a mapping package:
```r
library("efficient")
library("dplyr")
library("ggmap")
world = map_data("world")
names(world)
```
Visually compare this new dataset of the world with `ghg_ems` (e.g. via `View(world); View(ghg_ems)`). It is clear that the column `region` in the former contains the same information as `Country` in the latter. This will be the joining variable; renaming it in `world` will make the join more efficient.
```r
data(ghg_ems, package = "efficient")
world = rename(world, Country = region)
ghg_ems$All = rowSums(ghg_ems[3:7])
```
```{block same-class, type = "rmdtip"}
Ensure that both joining variables have the same class (combining `character` and `factor` columns can cause havoc).
```
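A minimal sketch of this check, using hypothetical toy data frames (`emissions` and `shapes` are illustrative names, not objects from this vignette): if the key is a `factor` in one table and a `character` in the other, convert it before joining.

```r
# Hypothetical toy data: the key is a factor in one table, character in the other
emissions = data.frame(Country = factor(c("France", "Spain")), All = c(1, 2))
shapes = data.frame(Country = c("France", "Spain", "Chile"),
                    long = c(2, -4, -71),
                    stringsAsFactors = FALSE)

class(emissions$Country)
class(shapes$Country)

# Convert the factor key to character so both keys share a class
emissions$Country = as.character(emissions$Country)
same_class = identical(class(emissions$Country), class(shapes$Country))
same_class
```

Converting to `character` rather than to `factor` avoids having to reconcile two different sets of factor levels.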
How large is the overlap between `ghg_ems$Country` and `world$Country`? We can find out using the `%in%` operator, which tests whether each element of one vector is present in another. Specifically, we will find out how many *unique* country names from `ghg_ems` are present in the `world` dataset:

```r
unique_countries_ghg_ems = unique(ghg_ems$Country)
unique_countries_world = unique(world$Country)
matched = unique_countries_ghg_ems %in% unique_countries_world
table(matched)
```
This comparison exercise has been fruitful: most of the countries in the `ghg_ems` dataset exist in the `world` dataset. But what about the 20 country names that do not match? We can identify these as follows:
```r
(unmatched_countries_ghg_ems <- unique_countries_ghg_ems[!matched])
```
It is clear from the output that some of the non-matches (e.g. the European Union) are not countries at all. However, others, such as 'Gambia, The' and the United States, clearly should have matches. Fuzzy matching can help find which countries do match, as illustrated for the first non-matching country below:
```r
(unmatched_country = unmatched_countries_ghg_ems[1])
unmatched_world_selection = agrep(pattern = unmatched_country,
                                  unique_countries_world,
                                  max.distance = 10)
unmatched_world_countries = unique_countries_world[unmatched_world_selection]
```
What just happened? We verified that the first unmatched country in the `ghg_ems` dataset was not in the `world` country names. So we used the more powerful `agrep` to search for fuzzy matches (with the `max.distance` argument set to `10`). The results show that the country `Antigua & Barbuda` from the `ghg_ems` data matches two countries in the `world` dataset. We can update the names in the dataset we are joining to accordingly:
```r
world$Country[world$Country %in% unmatched_world_countries] =
  unmatched_countries_ghg_ems[1]
```
The above code reduces the number of country names in the `world` dataset by replacing both "Antigua" and "Barbuda" with "Antigua & Barbuda". This would not work the other way around: how would one know whether to change "Antigua & Barbuda" to "Antigua" or to "Barbuda"?
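To get a feel for how `max.distance` changes what `agrep` returns, the sketch below runs the same pattern at three thresholds. The vector `world_names` is a hypothetical excerpt, not the real `world` data, and the thresholds are chosen only for illustration; the point is that a larger threshold can only add candidates, so permissive searches need checking.

```r
# Hypothetical excerpt of map-data country names (illustration only)
world_names = c("Gambia", "Zambia", "Germany", "Georgia")
pattern = "Gambia, The"

# Strict: at most 2 edits allowed between the pattern and a substring of the name
strict = agrep(pattern, world_names, max.distance = 2, value = TRUE)
# Looser: up to 5 edits, enough to drop the 5 characters of ", The"
loose = agrep(pattern, world_names, max.distance = 5, value = TRUE)
# Very permissive: the threshold used in the text
very_loose = agrep(pattern, world_names, max.distance = 10, value = TRUE)

# Compare how many candidates each threshold admits
list(strict = strict, loose = loose, very_loose = very_loose)
```

Inspecting the three results side by side is a quick way to pick a threshold before trusting any automatic replacement.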
Thus fuzzy matching is still a laborious process that must be complemented by human judgement. It takes a human to know for sure that `United States` is represented as `USA` in the `world` dataset, without risking false matches via `agrep`.
To fix the remaining issues, we simply continued with the same method, using a `for` loop and verifying the results instead of doing it all by hand. The code used to match the remaining unmatched countries can be seen on the book's GitHub page.
```r
# Short aliases used below. Assumption: the text uses these names before
# defining them, so they are taken here to mean the objects created earlier.
c_u = unique_countries_ghg_ems   # unique country names in ghg_ems
w_u = unique_countries_world     # unique country names in world
n = unmatched_countries_ghg_ems  # ghg_ems names without an exact match

i = n[1]
match_df = tibble(co2_name = n, w_name = NA)
# collect the fuzzy matches found for every unmatched name
for(i in n){
  (fm = agrep(i, w_u, max.distance = 10))
  (w_um = w_u[fm])
  match_df$w_name[match_df$co2_name == i] = paste(w_um, collapse = "|")
  # world$Country[world$Country %in% w_u1] = i
}
# View(match_df) # check the results: 1, 3 , 14, 16, 17 are right
i = 3
# apply only the fuzzy matches that were verified by eye
for(i in c(1, 3 , 14, 16, 17)){
  world$Country[grep(match_df$w_name[i], world$Country)] = match_df$co2_name[i]
}
match_df = match_df[-c(1, 3 , 14, 16, 17),]
# manually fix countries with multiple matches
world$Country[grep("Baham", world$Country)] = c_u[grep("Baham", c_u)]
world$Country[grep("Democratic Republic of the Congo", world$Country)] = c_u[grep("Congo, Dem. Rep.", c_u)]
world$Country[grep("Republic of C", world$Country)] = c_u[grep("Congo, R", c_u)]
world$Country[grep("Ivo", world$Country)] = c_u[grep("Ivo", c_u)]
world$Country[grep("Gambia", world$Country)] = c_u[grep("Gambia", c_u)]
world$Country[grep("Macedonia", world$Country)] = c_u[grep("Macedonia", c_u)]
world$Country[grep("USA", world$Country)] = c_u[grep("United States", c_u)]
world$Country[grep("UK", world$Country)] = c_u[grep("United Kingdom", c_u)]
world$Country[grep("North Korea", world$Country)] = c_u[grep("Korea, Dem. Rep. \\(N", c_u)]
world$Country[grep("South Korea", world$Country)] = c_u[grep("Korea, Rep", c_u)]
world$Country[grep("Russia", world$Country)] = c_u[grep("Russia", c_u)]
world$Country[grep("Vincent", world$Country)] = c_u[grep("Vincent", c_u)]
# ghg_ems = ghg_ems[!ghg_ems$Country == "World",]
# save the result as 'm', for match
c_u = unique(ghg_ems$Country)
w_u = unique(world$Country)
m = c_u %in% w_u
# summary(m)
n = c_u[!m]
```
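The one-by-one replacements can also be expressed as a hand-checked lookup table, which keeps the human decisions in one place. The sketch below is an illustration on a hypothetical vector `world_names`, not the code used for the book; the two pairs come from the text, and the exact spellings on the right-hand side are assumed.

```r
# Hypothetical excerpt of country names as they might appear in the map data
world_names = c("USA", "UK", "France", "USA", "Russia")

# Hand-checked lookup: names are the map-data spellings,
# values are the (assumed) spellings in the emissions data
lookup = c("USA" = "United States", "UK" = "United Kingdom")

# Replace only the names that appear in the lookup; leave the rest untouched
to_fix = world_names %in% names(lookup)
world_names[to_fix] = unname(lookup[world_names[to_fix]])
world_names
```

Because every replacement is listed explicitly, there is no risk of a regular expression or fuzzy match catching an unintended country.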
There is one more stage that is needed before global CO~2~ emissions can be mapped for any year: the data must be joined. The base function `merge` can do this, but we strongly recommend using one of the join functions from dplyr, such as `left_join` (which keeps all rows in the original dataset) and `inner_join` (which keeps only rows with matches in both datasets). This is a very clear case of when dplyr is advantageous: `merge()`'s interface is complicated, the code is less readable, and the `*_join` functions are faster. Both `left_join()` and `inner_join()` are illustrated below:
```r
nrow(world)
nrow({world_co2 = left_join(world, ghg_ems)})
nrow(inner_join(world, ghg_ems))
```
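The difference in row counts follows from how the two joins treat unmatched keys. A minimal sketch on hypothetical toy tables (`shapes` and `emissions` are illustrative stand-ins, not the real `world` and `ghg_ems` data):

```r
library("dplyr")

# Hypothetical toy tables: several polygon points per country, one emissions row per country
shapes = tibble(Country = c("France", "France", "Antarctica"),
                long = c(2.0, 2.5, 0.0))
emissions = tibble(Country = c("France", "Spain"), All = c(500, 350))

# left_join keeps every row of shapes; unmatched rows get NA in the new columns
left = left_join(shapes, emissions, by = "Country")
# inner_join keeps only rows whose Country appears in both tables
inner = inner_join(shapes, emissions, by = "Country")

nrow(shapes)
nrow(left)
nrow(inner)
left
```

Passing `by = "Country"` explicitly makes the joining variable visible in the code and suppresses the message dplyr prints when it has to guess the key.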
Note that `inner_join` removes rows from the `world` dataset which have no match in `ghg_ems`: if we were to plot the resulting dataset, the continent of Antarctica and a number of countries not represented in the `ghg_ems` dataset would be absent. Figure \@ref(fig:6-1), produced using a modified version of the ggplot2 code below, shows that the results of this data carpentry were worth the effort.
```r
world_co2_2012 = filter(world_co2, Year == 2012 | is.na(Year))
ggplot(world_co2_2012, aes(long, lat)) +
  geom_polygon(aes(fill = All, group = group))
```
```r
library("scales")
world_co2_2012 = filter(world_co2, Year == 2012 | is.na(Year))
ggplot(world_co2_2012, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = All)) +
  geom_path(size = 0.2) +
  scale_fill_gradient(
    low = "blue", high = "red", trans = "log",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)),
    name = expression(MtCO[2])) +
  coord_equal() +
  theme_nothing(legend = TRUE)
# ggsave("figures/world_co2.png")
```