RA: Anonymized binarized diagnosis codes from RA study.
In ludic: Linkage Using Diagnosis Codes

Description Usage Format Details References Examples

An anonymized version of the binarized diagnosis code data from the RA1 and RA2 datasets, over both 6-year and 11-year time span.

data(RA)

5 objects

RA1_6y: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 26,681 patients from the dataset RA1 recorded over a 6-year time span.
RA2_6y: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 5,707 patients from the dataset RA2 recorded over a 6-year time span.
RA1_11y: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 26,687 patients from the dataset RA1 recorded over a 11-year time span.
RA2_11y: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 6,394 patients from the dataset RA2 recorded over a 11-year time span.
silverstandard_truematches: a character matrix with two columns containing the identifiers of the 3,831 pairs of silver-standard matches.

The ICD-9 diagnosis codes have also been masked and randomly reordered, replaced by meaningless names. Finally, the silver-standard matching pairs are also provided to allow the benchmarking of methods for probabilistic record linkage using diagnosis codes.

Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi: 10.1038/sdata.2018.298.

Liao, K. P. et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care & Research 62, 1120-1127 (2010). doi: 10.1002/acr.20184

Liao, K. P. et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS ONE 10, e0136651 (2015). doi: 10.1371/journal.pone.0136651

if(interactive()){
rm(list=ls())
library(ludic)
data(RA)
res_match_6y <- recordLink(data1 = RA1_6y, data2 = RA2_6y, 
                          eps_plus = 0.01, eps_minus = 0.01,
                          aggreg_2ways ="mean",
                          min_prev = 0,
                          use_diff = FALSE)

res_match_11y <- recordLink(data1 = RA1_11y, data2 = RA2_11y, 
                           eps_plus = 0.01, eps_minus = 0.01,
                           aggreg_2ways ="mean",
                           min_prev = 0,
                           use_diff = FALSE)


print.res_matching <- function(res, threshold=0.9, ref=silverstandard_truematches){
 have_match_row <- rowSums(res>threshold)
 have_match_col <- colSums(res>threshold)
 bestmatched_pairs_all <- cbind.data.frame(
   "D1"=rownames(res)[apply(res[,which(have_match_col>0), drop=FALSE], 2, which.max)],
   "D2"=names(have_match_col)[which(have_match_col>0)]
 )
 nTM_all <- nrow(ref)
 nP_all <- nrow(bestmatched_pairs_all)
 TPR_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="") 
                %in% apply(ref, 1, paste0, collapse=""))/nTM_all
 PPV_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="") 
                %in% apply(ref, 1, paste0, collapse=""))/nP_all
 cat("threshold: ", threshold, 
     "\nnb matched: ", nP_all,"; nb true matches: ", nTM_all, 
     "\nTPR: ", TPR_all, ";   PPV: ", PPV_all, "\n\n", sep="")
}
print.res_matching(res_match_6y)
print.res_matching(res_match_11y)

}