knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(readr)
One of the central problems in ensembled data packages is setting equivalences between observations from different datasets. This is a central problem because it is what enables us to relate datasets to one another.
What is required is some classificatory scheme so that same is recognised as same and different observations are recognised as different. In many cases, it is useful to have some kind of registry of unique units that existing and, importantly, new data can be read against.
There are several important features of such a registry that, while not required are certainly very helpful:
For some databases, unique identifiers are not too much of a problem because
there exists a relatively comprehensive, authoritative, succinct, and intelligible registry.
An example of this is the central states
database in the {manystates}
package.
Here we can rely on ISO-3166 alpha-3 codes for most modern states,
complementing them with COW or other codes for more historical states.
This combined registry is made available through the manystates::code_states()
function.
But we are not aware of comprehensive, authoritative, succinct, and intelligible registry of treaties. To solve this problem, we need to think both about how observations might be coded and what this codes might represent.
To test different solutions to the problem,
we construct a sample of treaty titles from three different datasets in the manyenviron::agreements
database.
Let's take only titles of treaties established in a particular decade, the 1980s,
so that we can increase the probability of matches.
library(manydata) manydata::get_packages("manyenviron") samples <- lapply(manyenviron::agreements, function(x) x[x$Beg > "1990-01-01" & x$Beg < "2000-01-01", ]) titles <- unlist(purrr::map(samples, "Title")) dates <- unlist(purrr::map(samples, "Beg")) sample <- cbind.data.frame(titles, dates)
The sample has been hand coded so that the exact number of agreements that are duplicates or linked to one another is known. We hand coded 400 agreements from the sample to compare duplicates identification.
agreements <- read_csv("hand-coded-sample.csv")
If we use the hand coded duplicates to correctly identify all duplicates in the sample, we get a perfect match! That is, we identify all true positives and no false positives. The rates of true positives and false positives are represented the receiver operating characteristics (ROC) graph below. The area under the curve (AUC score), in dark grey, in the graph is maximized (equals 1).
library(WVPlots) WVPlots::ROCPlot(agreements, "duplicate", "duplicate", 1, title = "Duplicates Match") # WVPlots::PRPlot(agreements, "duplicate", "duplicate", 1, title = "Duplicates Match") # Precision recall (PR) plots are also available for comparing solutions.
In reality, we often do not get a perfect score as we rarely know all the true duplicates across several large datasets.
First we could simply offer a unique code for each title, no duplicates. Obviously this would generate errors on the side of false negatives for many titles that are quite obviously (for humans) equivalent. That is, we would find zero false positives as well as zero true positives, as the ROC plot illustrates.
agreements$unique <- 0 WVPlots::ROCPlot(agreements, "unique", "duplicate", 1, title = "Unique") # WVPlots::PRPlot(agreements, "unique", "duplicate", 1, title = "Unique")
If we randomly create a variable with 200 values and use it to find duplicates, the AUC scores marginally. The ROC plot illustrates how some true positives as well as a few false positives are matched.
agreements$random <- runif(400) WVPlots::ROCPlot(agreements, "random", "duplicate", 1, title = "Random") # WVPlots::PRPlot(agreements, "random", "duplicate", 1, title = "Random") # PR results also illustrate well how Unique duplicate matches work.
Here we only consider equivalent those titles that are exactly the same as one another.
agreements$strict <- ifelse(duplicated(agreements$titles) | duplicated(agreements$titles, fromLast = TRUE), 1, 0) WVPlots::ROCPlot(agreements, "strict", "duplicate", 1, title = "Strict") # WVPlots::PRPlot(agreements, "strict", "duplicate", 1, title = "Strict") # PR plot is consistent with ROC plot and shows little improvement from a random match.
Note that the AUC probably boosted by our use of manypkgs::standardise_titles()
,
but misses a number of more problematic cases.
For example a "Treaty Of..." and "Treaty For..." would be considered distinct,
as would "Antarctic Treaty" and "Antarctic Treaty 1959",
even though they ought to be rendered equivalent.
Strict match is very specify, but not very sensible, as it generates no false positives
but it only captures a so many of the true positives.
Fuzzy matching relies on stringdist
to match observations that are mostly similar but not identical.
library(stringdist) fuzzy <- stringdist::stringsimmatrix(agreements$titles, agreements$titles) diag(fuzzy) <- 0 agreements$fuzzy <- apply(fuzzy, 1, max) agreements$fuzzy <- ifelse(agreements$fuzzy > 0.6, 1, 0) WVPlots::ROCPlot(agreements, "fuzzy", "duplicate", 1, title = "Fuzzy") # WVPlots::PRPlot(agreements, "fuzzy", "duplicate", 1, title = "Fuzzy") # Improvements are also visible in the PR plot.
The risk of finding incorrect matches using fuzzy matching increases in relation to strict matches. That is, although we might do a better job at finding true matches with fuzzy, we also might get more false positives. Moreover, this option is computationally expensive/takes a long time
Another option is to extract key identifying information from the titles and dates, rendering as equivalent any titles that have the same key identifying information.
The question here is which information to extract from the titles. It needs to be enough information to render as distinct those titles that are indeed distinct, but that ignores both less salient information (e.g. "Of" versus "For") and addenda (e.g. the year at the end of the treaty title).
The less information we require for a match, the higher the false positive rate, and the more information we require, the higher the false negative rate.
Using our current version of {manypkgs}
we obtain:
agreements$treatyID <- manypkgs::code_agreements(agreements, agreements$titles, agreements$dates) agreements$idinfo <- ifelse(duplicated(agreements$treatyID) | duplicated(agreements$treatyID, fromLast = TRUE), 1, 0) WVPlots::ROCPlot(agreements, "idinfo", "duplicate", 1, title = "treatyID Strict") # WVPlots::PRPlot(agreements, "idinfo", "duplicate", 1, title = "treatyID Strict") # PR plot illustrate similarities betwwen a perfect match and treatyID match.
The AUC score with code_agreements()
is higher than what we get with fuzzy
and with strict matches.
In relations to other options, code_agreements()
improves the number of true
observations matched while not generating many false positives.
Still, this is not perfect and some true matches are still missing.
This happens when the titles differ a great deal,
even if they are refering to the same agreement.
In those cases, we prefer to stay on the side of caution and refrain from
trying to code these titles as duplicates,
especially as such changes would likely lead to higher rates of false positives.
We can also improve the AUC scores by fuzzy matching the new treatyIDs generated by
code_agreements()
from different datasets with the condense_agreements()
.
qsim <- manypkgs::condense_agreements(idvar = agreements$treatyID) agreements <- dplyr::left_join(agreements, qsim, by = "treatyID") agreements$qsim <- ifelse(duplicated(agreements$manyID) | duplicated(agreements$manyID, fromLast = TRUE), 1, 0) WVPlots::ROCPlot(agreements, "qsim", "duplicate", 1, title = "treatyID Sim") # qsim <- stringdist::stringsimmatrix(agreements$treatyID, agreements$treatyID) # diag(qsim) <- 0 # agreements$qsim <- apply(qsim, 1, max) # WVPlots::ROCPlot(agreements, "qsim", "duplicate", 1, title = "treatyID Sim") # WVPlots::PRPlot(agreements, "qsim", "duplicate", 1, title = "treatyID Sim") # Better than above!
We can compare the specificity versus sensitivity for each of the possible solutions discussed above in the same ROC plot.
WVPlots::ROCPlotList(frame = agreements, xvar_names = c("strict", "fuzzy", "idinfo", "qsim"), truthVar = "duplicate", truthTarget = TRUE, title = "Comparing the different solutions")
A treatyID is a meaningful shorthand ID created from a combination of elements extracted from the agreement title and date. The treatyID allows users to identify date, type and linkage.
Type | Pasting | Pasting --- | --- |--- Bilateral | FRA-TON[DEZ]_1980A | parties[activity]_uID(type) Bilateral + Protocol | RUS-USA[FKC]_1967E:FKC_1965A | parties[activity]_uID(type)[action]:linkage
A bilateral treaty that is an agreement will have the following treatyID: "FRA-TON[DEZ]_1980A". This is a combination of the parties to the agreement (FRA-TON), the activity described in the treaty, which is here "Delimitation of Economic Zone (DEZ)", with the year of signature (1980), followed by the type (A = Agreement).
A bilateral treaty that is any other type than an agreement (e.g. protocol, amendments) will have the treatyID under this format: "RUS-USA[FKC]_1967E:FKC_1965A". It is composed of the parties (RUS-USA), the activity abbreviation from treaty title "Fishing for King Crab (FKC)", the year of signature of the amendment (1967), the type (E) which refers to Amendment. The linkage portion links the agreement to the treatyID its "mother" treaty, but without repeating the parties ([FKC]_1965A).
Type | Pasting | Pasting --- | --- |--- Multilateral | HSPDF_2005A | acronym_uID(type) Multilateral + Protocol | SFDP_2007E2:H08F_1992A | acronym, uID,type(number):linkage
A multilateral treaty that is an agreement will have the following treatyID: "HSPDF_2005A". The treatyID indicates the acronym (HSPDF), the signature year of the agreement (2005) and the type (A).
A multilateral treaty that is not an agreement will have this treatyID format: "SFDP_1992E2:H08F_2007". This represents the acronym (SFDP), the signature year of the amendment (2007), the type (E = Amendment), the number of amendment (2) and the linkage number (H08F_2007) of the "mother" treaty.
Type | Pasting | Pasting --- | --- |--- Known treaties | UNCLOS1982A | abbreviation, uID, type Amendment of known treaties | J09H_1990E2:MARPOL1973A | acronym_uID,type(number):linkage
Famous multilateral treaties have a simplified treatyID with a known abbreviation. For example, the United Nations Convention On The Law Of The Sea will have the following treatyID: "UNCLOC1982A". This is the known abbreviation (UNCLOS) and the signature year (1982) with the type (A).
Protocols or amendments of the known treaties will have this treatyID format: "J09H_1990E2:MARPOL1973A". It indicates the acronym (J09H), the signature year of this specific amendment (1990), the type of treaty (E), its number (2) and the treatyID of the "mother" treaty (MARPOL197A).
code_agreements()
workThe function extracts important information from treaty titles and dates at each step of the way to make sure duplicates are actual duplicates.
The first element extracted from the title are the parties to a treaty.
The function code_states()
returns the 3 digit ISO codes for states when there are
two parties identified (i.e. bilateral treaty).
The three letters abbreviation in brackets pasted next to the parties represent the activity of the title.
The activity is generated with code_activity()
, which uses core words at the end of
trety titles to generate 3 letter acronyms for activity.
Parties and activity are only returned for bilateral treaties.
manypkgs::code_states(agreements$titles)
Multilateral treaties start with the acronyms generated by the code_acronym()
function.
code_acronym()
extracts letters from the title once specific words, numbers and special characters
are deleted from the title.
if the title exceed a certain length (6 letters), some of the acronyms contain the numbers of the ltters ommited
when generating the acronym.
manypkgs::code_acronym(agreements$titles)
The year of the treaty's signature is extracted with code_dates
.
manypkgs::code_dates(agreements$dates)
code_type()
detects the type of the treaty and assigns a letter for each agreement type.
manypkgs::code_type(agreements$titles)
To see which key words are used to identify the treaty type, please run code_type()
.
For amendments or protocol, their ordering numbers, if present, are also extracted in code_type()
Are these a famous agreements for which abbreviations are known? Some treaties already have well-known abbreviations.
manypkgs::code_known_agreements(agreements$titles)
Know Agreements | Abbreviation --- | --- "United Nations Convention On The Law Of The Sea" | UNCLOS1982 "Convention On Biological Diversity" | CBD1992 "Convention On The Conservation Of Antarctic Marine Living Resources" | CCAMLR1980 "Convention On International Trade In Endangered Species Of Wild Fauna And Flora" | CITES1973 "International Convention On Civil Liability For Oil Pollution Damage" | CLC1969 "International Convention For The Prevention Of Pollution From Ships" | MARPOL 1973 "Constitutional Agreement Of The Latin American Organization For Fisheries Development" | OLDEPESCA1982 "Paris Agreement Under The United Nations Framework Convention On Climate Change" | PARIS2015 "Convention On Wetlands Of International Importance Especially As Waterfowl Habitat" | RAMSA1971 "United Nations Framework Convention On Climate Change" | UNFCCC1992
For the complete list of known agreements coded, please run code_known_agreements()
.
Detects the family a treaty might belong to. Treaties from the same family can be detected by removing predictable words that are added to treaty titles (e.g. amendment, protocol, meeting) and identifying duplicates based on the core words used to refer to a main agreement.
manypkgs::code_linkage(agreements$titles, agreements$dates)
Once added together, the information extracted from treaty title and date forms a treatyID which is more authoritative, comprehensible, succinct and intelligible than other conventions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.