knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(tidyverse)
library(peacesciencer)
library(kableExtra)
Dyad-year models, whether directed or non-directed, seek to explain variation in conflict onset by reference to some covariates of interest. The unit of analysis in these models is the dyad-year (e.g. USA-Canada, 1920; USA-Canada, 1921) and not the dispute-year. A researcher who is not careful about the difference will end up with duplicate dyad-year observations for dyads in which there were multiple confrontations ongoing in a calendar year.
Here is my favorite case in point: the Italy-France dyad in 1860. This dyad not only had three unique disputes occurring in 1860, it also had three unique onsets that year. Two were even wars, though France was a passive participant in those. Heck, they all started at effectively the same time, but concerned different components of the wars of Italian unification happening at that time. A researcher should be mindful about this: their unit of analysis is supposed to be the dyad-year, not the dispute-year. Not knowing the difference is the difference between having three Italy-France observations for 1860 and having just one. The researcher who has a dyad-year design wants the latter, not the former.
haven::read_dta("~/Dropbox/data/cow/mid/5/MIDB 5.0.dta") %>%
  filter(dispnum %in% c(112, 113, 306)) %>%
  select(dispnum:sidea, fatality, hiact, hostlev)
This means a researcher must make careful design decisions about which cases to exclude from their data. There is no correct answer here, per se. There is good reason theoretically to employ certain case exclusion rules before others, which is what {peacesciencer} will do by default in the add_cow_mids() and add_gml_mids() functions. This vignette will explain what {peacesciencer} does by default. Users who want to employ their own case exclusion rules are free to use the "whittle" class of functions (e.g. whittle_conflicts_onsets(), whittle_conflicts_fatality()) on the dyadic dispute-year data included in this package. I will start with version 5.0 of the CoW-MID data and the dyadic dispute-year data I created from it.
First, let's identify where there are dyad-year duplicates in the data.
cow_mid_dirdisps %>%
  # make it non-directed for ease of presentation
  filter(ccode2 > ccode1) %>%
  group_by(ccode1, ccode2, year) %>%
  summarize(n = n(),
            mids = paste0(dispnum, collapse = ", ")) %>%
  arrange(-n) %>%
  filter(n > 1) %>%
  ungroup()
The absolute most in the data is the United Kingdom-Soviet Union dyad, which had six conflicts ongoing and/or initiated in 1920. Next most is a tie between the United States-Soviet Union dyad in 1958, the Egypt-Israel dyad (1959, 1960), and the Syria-Israel dyad (1955). All told, there are 498 dyad-years that duplicate in the dyadic dispute-year data. We need to whittle those down to where there is no more than one observation per dyad-year in these data.
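The duplicate check above can be tried on a small made-up example before running it on the full data. The sketch below is only an illustration: the `toy` data frame and every value in it are hypothetical, not CoW-MID records. It assumes nothing beyond {dplyr}.

```r
library(dplyr)

# Hypothetical dispute-year rows (made-up codes, not real CoW-MID data).
toy <- tibble(
  ccode1  = c(1, 1, 1, 1, 2),
  ccode2  = c(2, 2, 2, 3, 3),
  year    = c(1900, 1900, 1901, 1900, 1900),
  dispnum = c(10, 11, 10, 12, 13)
)

# Count rows per dyad-year and keep only the dyad-years that repeat.
dups <- toy %>%
  count(ccode1, ccode2, year, name = "n") %>%
  filter(n > 1)

dups
```

Only the 1-2 dyad in 1900 appears twice here, so it is the only dyad-year that would need a case exclusion rule.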
The primary aim is to preserve the unique onsets. The case of the United States-United Kingdom dyad in 1903 will illustrate what's at stake here. The United States and United Kingdom had three MIDs ongoing in 1903. Two (MID#0002 and MID#0254) began in 1902. The third, MID#3301, is a new onset. In this case, we want to remove the observations for MID#0002 and MID#0254 and keep the observation for MID#3301.
cow_mid_dirdisps %>%
  filter(ccode1 == 2 & ccode2 == 200 & year == 1903) %>%
  select(dispnum:disponset) %>%
  kbl(., caption = "United States-United Kingdom Dyadic Dispute-Years in 1903",
      booktabs = TRUE, longtable = TRUE) %>%
  kable_styling(position = "center", full_width = F,
                bootstrap_options = "striped")
Here's how {peacesciencer} does this first cut. Grouping by dyad-year (i.e. group_by(ccode1, ccode2, year)), it creates a new variable that equals 1 if the number of rows by dyad-year is more than 1. Maintaining the same grouped structure, it calculates the standard deviation of the disponset variable. Cases where no standard deviation could be calculated are cases where the dyad-year does not duplicate, and these are assigned a 0. Next, it creates a simple removeme column that equals 1 if 1) it's a duplicated dyad-year, 2) it's not a unique onset, and 3) the standard deviation is greater than 0 (i.e. the dyad-year mixes at least one onset with at least one ongoing dispute that began earlier). It then removes cases where removeme == 1.
cow_mid_dirdisps %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  # Remove anything that's not a unique MID onset
  mutate(sd = sd(disponset),
         sd = ifelse(is.na(sd), 0, sd)) %>%
  mutate(removeme = ifelse(duplicated == 1 & disponset == 0 & sd > 0, 1, 0)) %>%
  filter(removeme != 1) %>%
  # remove detritus
  select(-removeme, -sd) %>%
  # practice safe group_by()
  ungroup() -> hold_this

# ^ The `hold_this` naming convention is my favorite for intermediate objects.
# It's also a bad idea to overwrite data objects that come in this package.
Observe how it fixed that USA-United Kingdom observation in 1903.
hold_this %>%
  filter(ccode1 == 2 & ccode2 == 200 & year == 1903) %>%
  select(dispnum:disponset)
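To see what the standard deviation condition is doing in this first cut, here is a minimal sketch on hypothetical data. The `toy` rows, the `dup` and `onset_sd` column names, and the dispute numbers are all made up for illustration. The three dyads cover the three situations the rule can meet: a mix of ongoing disputes and a new onset, all onsets, and no onsets at all.

```r
library(dplyr)

# Hypothetical dyad-years, all in the same year.
toy <- tibble(
  ccode1    = c(1, 1, 1, 1, 1, 1, 1),
  ccode2    = c(2, 2, 2, 3, 3, 4, 4),
  year      = 1900,
  dispnum   = c(10, 11, 12, 20, 21, 30, 31),
  disponset = c(0, 0, 1, 1, 1, 0, 0)
)

first_cut <- toy %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(dup = as.integer(n() > 1),
         onset_sd = sd(disponset),
         # sd() is NA for a single row; treat that as no variation
         onset_sd = ifelse(is.na(onset_sd), 0, onset_sd)) %>%
  # drop ongoing disputes only where the dyad-year also has an onset
  filter(!(dup == 1 & disponset == 0 & onset_sd > 0)) %>%
  ungroup() %>%
  select(-dup, -onset_sd)

first_cut
```

In this toy case, the 1-2 dyad keeps only its new onset (dispute 12). The 1-3 dyad keeps both rows because both are onsets, and the 1-4 dyad keeps both rows because neither is an onset, so there is nothing to prefer yet. Those leftovers are what the later rules have to handle.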
It did not fix the Italy-France problem from 1860, but that's because all three dispute-years were onsets that year.
hold_this %>%
  filter(ccode1 == 220 & ccode2 == 325 & year == 1860) %>%
  select(dispnum:disponset) %>%
  kbl(., caption = "France-Italy Dyadic Dispute-Years in 1860",
      booktabs = TRUE, longtable = TRUE) %>%
  kable_styling(position = "center", full_width = F,
                bootstrap_options = "striped")
This just tells us we're not done, but we knew we wouldn't be. We need more exclusion rules to whittle down the data.
If presented the opportunity to keep one dispute and drop another where two appear in a year, researchers will likely prefer the more "serious" one rather than the one that might have been a simple threat to use force or a show of force. Consider this Russia-Ottoman Empire (Turkey) dyad-year in 1853. There are two unique onsets between the two that year. One (MID#0057) became the Crimean War, an important conflict! The other (MID#0126) was an apparent show of force with no fatalities. Under those conditions, it's an easy call to keep the one with more fatalities.
hold_this %>%
  filter(ccode1 == 365 & ccode2 == 640 & year == 1853) %>%
  select(dispnum:disponset, fatality1:fatality2, hiact1, hiact2) %>%
  kbl(., caption = "Russia-Ottoman Empire Dyadic Dispute-Years in 1853",
      booktabs = TRUE, longtable = TRUE) %>%
  kable_styling(position = "center", full_width = F,
                bootstrap_options = "striped")
There is one limitation with CoW-MID data toward this end. We obviously know CoW-MID only assigns fatalities at the end of the dispute to the participants, so we'd have no way of knowing a priori how many fatalities in that Russia-Turkey dyad were in 1853. We could have a situation like Belgium-Germany in 1939-1940. In that case, the highest action in which Belgium engaged against Germany in 1939 was a mobilization and the war that momentarily eliminated Belgium from the international system happened the next year. We also don't know to what extent Turkey was responsible for Russia's fatalities. The Crimean War was a multilateral war pitting the Russians against the United Kingdom, Austria-Hungary, Italy, Turkey, and France.
Thus, what follows is crude, but still useful. We'll use the dispute-level fatality information as a stand-in here and keep the duplicate dyad-year observation with the highest fatality score. We'll also need to take inventory of how to handle the cases where fatality == -9. In a forthcoming data release, we find that more than half of the cases with missing fatalities in the CoW-MID data did in fact involve fatalities. Some were even wars! However, we'd have no way of knowing this from CoW-MID. We'll be safe and recode -9 to be .5, indicating more than 0 fatalities but "less" than the fatality level of 1 (1-25 deaths), since CoW-MID can at least confidently say the latter happened.
hold_this %>%
  left_join(., cow_mid_disps %>% select(dispnum, fatality)) %>%
  mutate(fatality = ifelse(fatality == -9, .5, fatality)) %>%
  arrange(ccode1, ccode2, year) %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  group_by(ccode1, ccode2, year, duplicated) %>%
  # Keep the highest fatality
  filter(fatality == max(fatality)) %>%
  mutate(fatality = ifelse(fatality == .5, -9, fatality)) %>%
  arrange(ccode1, ccode2, year) %>%
  # practice safe group_by()
  ungroup() -> hold_this
This will fix the Russia-Turkey-1853 problem.
hold_this %>% filter(ccode1 == 365 & ccode2 == 640 & year == 1853)
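The effect of treating -9 as .5 is easiest to see on a couple of hypothetical dyad-years. The sketch below is a slight variation on the approach above, not the package's own code: instead of overwriting fatality and restoring it afterward, it ranks on a temporary column (named `fat_rank` here, an assumption for illustration). The `toy` values are made up.

```r
library(dplyr)

# Hypothetical duplicate dyad-years with dispute-level fatality codes.
toy <- tibble(
  ccode1   = 1,
  ccode2   = 2,
  year     = c(1900, 1900, 1901, 1901),
  dispnum  = c(10, 11, 20, 21),
  fatality = c(0, -9, -9, 1)
)

kept <- toy %>%
  # missing (-9) ranks above "no fatalities" (0) but below 1-25 deaths (1)
  mutate(fat_rank = ifelse(fatality == -9, 0.5, fatality)) %>%
  group_by(ccode1, ccode2, year) %>%
  filter(fat_rank == max(fat_rank)) %>%
  ungroup() %>%
  select(-fat_rank)

kept
```

In 1900, the dispute with missing fatalities (11) is kept over the one with none. In 1901, the dispute with a known fatality level of 1 (21) is kept over the one with missing fatalities.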
It won't fix cases where there were multiple disputes initiated in the same year in the dyad, but no one died. There are a lot of these. So, we'll need more case exclusion rules.
The next case exclusion rule will want to continue isolating those serious MIDs from MIDs of lesser severity. Consider this case of India and Pakistan in 1963.
hold_this %>%
  filter(ccode1 == 750 & ccode2 == 770 & year == 1963) %>%
  select(dispnum:year, disponset, fatality1, fatality2, hiact1, hiact2) %>%
  kbl(., caption = "India-Pakistan Dyadic Dispute-Years in 1963",
      booktabs = TRUE, longtable = TRUE) %>%
  kable_styling(position = "center", full_width = F,
                bootstrap_options = "striped")
These are two unique MID onsets in 1963 and neither was fatal, meaning this duplicate dyad-year is still here. However, MID#2630 was just a threat to use force whereas MID#1317 had an occupation of territory (by Pakistan against India). The former is a threat. The latter is a use of force. MID#1317 has the higher hostility level and that is the MID we'll want to keep. The same caveat applies as it did with fatalities, so we'll have to use the dispute-level hostility variable as a plug-in here.
hold_this %>%
  left_join(., cow_mid_disps %>% select(dispnum, hostlev)) %>%
  arrange(ccode1, ccode2, year) %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  group_by(ccode1, ccode2, year, duplicated) %>%
  # Keep the highest hostlev
  filter(hostlev == max(hostlev)) %>%
  arrange(ccode1, ccode2, year) %>%
  # practice safe group_by()
  ungroup() -> hold_this
This will at least fix that India-Pakistan observation in 1963, and others like it.
hold_this %>%
  filter(ccode1 == 750 & ccode2 == 770 & year == 1963) %>%
  select(dispnum:year, disponset, fatality1, fatality2, hiact1, hiact2)
At this point, we still have duplicate dyad-years remaining in these data, but we've selected on cases that are fairly similar to each other (at least given the dispute- and participant-level data that are available). The duplicates that remain will be unique onsets with the same fatality levels and hostility levels. The next available measure that approximates dispute severity is duration. Consider this duplicate observation of Colombia-Peru in 1852 and the corresponding MIDs (MID#1506 and MID#1523).
haven::read_dta("~/Dropbox/data/cow/mid/5/MIDB 5.0.dta") %>%
  filter(dispnum %in% c(1506, 1523)) %>%
  select(dispnum:sidea, fatality, hiact, hostlev)
These MIDs look fairly similar. They both started the same year. They both have the same level of fatalities (none). They both have the same hostility level (a show of force). It would be tough to read tea leaves to argue that an alert (hiact: 8) is "greater" than a show of force (hiact: 7) even as 8 > 7 (i.e. CoW-MID action codes have never been truly ordinal). Further, they're both multilateral MIDs. MID#1506 pitted Venezuela and Colombia against Chile and Peru whereas MID#1523 pitted Chile and Colombia against Peru. Both even unhelpfully have some unknown duration to them. There are -9s in start days in both.
However, MID#1523 has the highest minimum duration. It lasted at least 110 days (and as many as 140) whereas MID#1506 has a minimum duration of 63 days (and a maximum duration of 122 days). Under those conditions, we will keep the one with the highest minimum duration and then, where duplicates still remain, keep the one with the highest maximum duration.
hold_this %>%
  left_join(., cow_mid_disps %>% select(dispnum, mindur, maxdur)) %>%
  arrange(ccode1, ccode2, year) %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  group_by(ccode1, ccode2, year, duplicated) %>%
  # Keep the highest mindur
  filter(mindur == max(mindur)) %>%
  arrange(ccode1, ccode2, year) %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  group_by(ccode1, ccode2, year, duplicated) %>%
  # Keep the highest maxdur
  filter(maxdur == max(maxdur)) %>%
  # practice safe group_by()
  ungroup() -> hold_this
This will fix that Colombia-Peru problem in 1852.
hold_this %>%
  filter(ccode1 == 135 & ccode2 == 100 & year == 1852) %>%
  select(dispnum:year, disponset, fatality1, fatality2, hiact1, hiact2)
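The two-stage duration rule (highest minimum duration first, then highest maximum duration among whatever ties remain) can be sketched on a small hypothetical dyad-year. The first two rows borrow the durations quoted above for MID#1506 (63/122) and MID#1523 (110/140); the third row, the dispute numbers, and the country codes are made up so that the second stage has a tie to break.

```r
library(dplyr)

# One hypothetical dyad-year with three candidate disputes.
toy <- tibble(
  ccode1  = 1,
  ccode2  = 2,
  year    = 1900,
  dispnum = c(10, 11, 12),
  mindur  = c(63, 110, 110),
  maxdur  = c(122, 140, 150)
)

kept <- toy %>%
  group_by(ccode1, ccode2, year) %>%
  # stage 1: keep the highest minimum duration (drops dispute 10)
  filter(mindur == max(mindur)) %>%
  # stage 2: among the remaining ties, keep the highest maximum duration
  filter(maxdur == max(maxdur)) %>%
  ungroup()

kept
```

Because the data stay grouped between the two filter() calls, the second max() is computed only over the rows that survived the first stage.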
We had started with 498 duplicate directed dyad-years in the dyadic dispute-year data. We're now down to just 24 directed (12 non-directed) dyad-years. A glance at these remaining observations suggests the substance here is very similar. For example, MID#4428 and MID#4430 are both one-day border fortifications between Kyrgyzstan and Uzbekistan in 2005. MID#2171 and MID#2172 are both one-day threats to use force between Cyprus and Turkey in 1965.
hold_this %>%
  group_by(ccode1, ccode2, year) %>%
  filter(n() > 1) %>%
  filter(ccode2 > ccode1) %>%
  select(dispnum:disponset, hiact1:hiact2, fatality:maxdur) %>%
  kbl(., caption = "Duplicate Non-Directed Dyad-Years Still Remaining",
      booktabs = TRUE, longtable = TRUE) %>%
  kable_styling(position = "center", full_width = F,
                bootstrap_options = "striped")
The final case exclusion rules will round us home. First, a few of these duplicate dyad-years feature a case where one dispute was reciprocated and the other was not. For example, MID#4428 was a mutual border fortification while MID#4430 was just one border fortification directed by Kyrgyzstan against Uzbekistan. Thus, we should keep the one that involved at least two codable incidents rather than the MID in which there was just one codable incident.
A reader may object here that reciprocation should feature higher in the proverbial chain, given its prominence in the audience cost literature. I caution that we should not do this. Gibler and Miller (also with Little) have driven home that the reciprocation variable is an information-poor variable. It only minimally tells you that Side B in a MID initiated a militarized incident or was involved in an attack in which there was no clear initiator. In our review of the conflict data, we find that attacks or ambushes initiated by Side A are countered more than half the time they happen. Further, inferences made from the reciprocation variable are among the most sensitive to the errors we report in the CoW-MID data. For that reason, we discourage researchers from using this variable for their analyses and, for this application, it's why {peacesciencer} uses the dispute-level reciprocation variable near the bottom of the ladder in its case exclusions.
Still, here's how to do that.
hold_this %>%
  left_join(., cow_mid_disps %>% select(dispnum, recip)) %>%
  arrange(ccode1, ccode2, year) %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  group_by(ccode1, ccode2, year, duplicated) %>%
  # Keep the reciprocated ones, where non-reciprocated ones exist
  filter(recip == max(recip)) %>%
  arrange(ccode1, ccode2, year) %>%
  # practice safe group_by()
  ungroup() -> hold_this
We're down to just three duplicate dyad-years now. The only reason MID#4428 and MID#4430 are both still there is that CoW-MID has MID#4428 as unreciprocated at the dispute level while it also has a militarized incident for Side B in the dispute. This is a CoW-MID issue and not a {peacesciencer} issue.
hold_this %>%
  group_by(ccode1, ccode2, year) %>%
  filter(n() > 1) %>%
  filter(ccode2 > ccode1) %>%
  select(dispnum:disponset, hiact1:hiact2, fatality:maxdur) %>%
  kbl(., caption = "Duplicate Non-Directed Dyad-Years Still Remaining",
      booktabs = TRUE, longtable = TRUE) %>%
  kable_styling(position = "center", full_width = F,
                bootstrap_options = "striped")
All three involve effectively identical MIDs. They start the same year. They have the same fatality level, hostility level, and duration, and the MIDs in each pair are either both reciprocated or both unreciprocated (that MID#4428/MID#4430 issue notwithstanding). Thus, we will select the one that has the lowest start month.
hold_this %>%
  left_join(., cow_mid_disps %>% select(dispnum, stmon)) %>%
  arrange(ccode1, ccode2, year) %>%
  group_by(ccode1, ccode2, year) %>%
  mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
  group_by(ccode1, ccode2, year, duplicated) %>%
  # Keep the one with the earliest start month
  filter(stmon == min(stmon)) %>%
  arrange(ccode1, ccode2, year) %>%
  # practice safe group_by()
  ungroup() -> hold_this

# And we're done
And this is enough to eliminate duplicate dyad-years.
hold_this %>%
  group_by(ccode1, ccode2, year) %>%
  filter(n() > 1)
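Most of the rules in this vignette share one shape: within each dyad-year, keep the rows that have the highest value of some severity measure, then move to the next measure if ties remain. A minimal sketch of that pattern follows, for readers who want to experiment with their own ordering of rules. The `keep_max()` helper is hypothetical and is not a {peacesciencer} function; the `toy` data are made up. The start-month rule would need the same idea with min() rather than max().

```r
library(dplyr)

# Hypothetical helper (not part of {peacesciencer}): within each dyad-year,
# keep the rows whose value of `var` equals the dyad-year maximum.
keep_max <- function(data, var) {
  data %>%
    group_by(ccode1, ccode2, year) %>%
    filter(.data[[var]] == max(.data[[var]])) %>%
    ungroup()
}

# Made-up dyad-years: 1900 is duplicated, 1901 is not.
toy <- tibble(
  ccode1   = 1,
  ccode2   = 2,
  year     = c(1900, 1900, 1901),
  dispnum  = c(10, 11, 12),
  fatality = c(0, 0, 1),
  hostlev  = c(2, 4, 3)
)

out <- toy %>%
  keep_max("fatality") %>%  # tie in 1900, both rows survive
  keep_max("hostlev")       # breaks the 1900 tie in favor of dispute 11

# Count any dyad-years that still repeat; zero means the rules were enough.
n_dups <- out %>%
  count(ccode1, ccode2, year) %>%
  filter(n > 1) %>%
  nrow()

n_dups
```

Dyad-years with a single row pass through each step unchanged, since that row is trivially the maximum of its own group.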