# https://www.jstatsoft.org/pages/view/authors
# https://www.jstatsoft.org/pages/view/style
# generate replication code (done in Makefile for JSS submission, also manually
# for vignette index)
# knitr::purl("Efficient_calculation_of_comorbidities_from_medical_codes.Rmd")
# build vignette for package (not done by CRAN, devtools, etc.), to avoid a
# pandoc 2.x dependency which r-hub, CRAN, etc. did not have as of April 2019.
# rmarkdown::render("vignettes/efficiency.Rmd", output_dir = "inst/doc")
# # or, to juggle pkgbuild::build trashing inst/doc..., using .install_extras
# rmarkdown::render("vignettes/efficiency-prebuilt.Rmd", output_dir = "vignettes")
library("icd")
requireNamespace("knitr")
# using this here and not compiling for CRAN etc saves a lot of dependencies
requireNamespace("rticles")

fig.width = 6.5
fig.height = fig.width / ((1 + sqrt(5)) / 2)
bitmap_dpi = 600
knitr::opts_chunk$set(fig.width = fig.width)
knitr::opts_chunk$set(fig.height = fig.height)
knitr::opts_chunk$set(cache = FALSE)
knitr::opts_knit$set(concordance = TRUE) # rstudio setting broken
# the following doesn't seem to insert preamble lines into included PDFs generated by chunks?
# knitr::opts_knit$set(header = "\\pdftrailerid{}\n\\pdfsuppressptexinfo-1\n\\pdfinfoomitdate1")

kable_caption_bottom <- function(x) {
  y <- unlist(strsplit(as.character(x), '\\n'))
  cap_line <- grep("caption", y)
  cap <- y[cap_line]
  last <- y[length(y)]
  y <- y[-c(cap_line, length(y))]
  y <- c(y, cap, last)
  y <- paste(y, collapse = "\n")
  class(y) <- "knitr_kable"
  attr(y, "format") <- "latex"
  y
}

my_kable <- function(x, col.names, caption, ...)
  kable_caption_bottom(
    knitr::kable(x, 
                 col.names = sprintf("%s", col.names),
                 escape = FALSE,
                 format = "latex",
                 longtable = TRUE,
                 caption = caption, 
                 ...))

Introduction

ICD diagnostic codes are used to define tens of thousands of diseases. Originally from the World Health Organization (WHO), their purpose was to allow epidemiologists to study global health [@WHO_Internationalclassificationdiseases_2018]. In the USA, and elsewhere, the WHO ICD codes have been extended for administrative purposes by the government Centers for Medicare and Medicaid Services (CMS), notably for billing [@CMS_CentersMedicareMedicaid_2018]. All over the world, ICD codes are the primary method by which diseases are recorded, often using national variants. This makes ICD codes important in many kinds of medical research, e.g., for defining national health metrics [@Lee_Predictingmortalitypatients_2003], and in many clinical studies, especially those in which groups of patients are compared [e.g., @frank_risk-adjusted_2014]. \pkg{icd} was created to meet the need for ICD code interpretation in \proglang{R}, and, in particular, it is designed to compute comorbidities quickly, enabling a reproducible workflow when used with big or small data.

What is a comorbidity?

Given there are tens of thousands of diagnostic codes, it is extremely difficult to use them directly in most statistical models: there are a few relatively common ICD codes, but there is a long tail to the frequency distribution (see Figure \ref{fig:longtail} for an example) since even common diagnoses can be finely sub-divided. The almost universal solution is to group these codes in a standardized way [e.g., @quan_coding_2005; @elixhauser_comorbidity_1998], and to use the presence and absence of any disease code in the comorbidity groups as covariates in models. The term comorbidity describes a disease which is present alongside a primary medical problem.

For example, a diabetic patient presents to hospital with a stroke: diabetes is the comorbidity and stroke is the presenting complaint. During or after a hospital admission, medical coders review the records and assign specific ICD (and other) codes. In this example, the patient might get the ICD-10 code E11.2 meaning 'Type 2 diabetes mellitus with renal complications,' and I63.0 meaning 'Cerebral infarction due to thrombosis of precerebral arteries.' In this case, the comorbidities might be regarded as 'Diabetes' and 'Renal'. The term comorbidity generally refers to classes of diseases, not specific diagnoses, and that is how the term is used in the rest of this paper.
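
As a preview of the workflow described in the rest of this paper, such a record can be converted to comorbidity flags with \pkg{icd}. A minimal sketch, with an illustrative identifier and column names; \pkg{icd} guesses the code type, and the comorbidity names returned depend on the map used:

pt <- data.frame(id = "example", code = c("E112", "I630"))
comorbid_charlson(pt, return_df = TRUE)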

h <- inverse.rle(structure(list(
  lengths = c(
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,
    2, 1, 1, 1, 1, 1, 3, 1, 2, 1, 1, 1, 4, 2, 1, 1, 2, 2, 4, 4, 2, 1,
    1, 2, 1, 1, 1, 3, 1, 2, 1, 4, 1, 2, 4, 4, 3, 2, 3, 7, 5, 2, 10, 4,
    4, 4, 5, 6, 6, 1, 8, 13, 9, 11, 10, 13, 13, 14, 10, 16, 23, 18, 
    24, 27, 32, 39, 45, 33, 51, 87, 86, 76, 126, 166, 240, 365, 632,
    1401),
  values = c(
    525, 405, 296, 283, 281, 268, 227, 209, 193, 180, 168, 166, 156, 
    146, 144, 135, 134, 131, 120, 116, 115, 113, 110, 106, 103, 90, 88,
    86, 85, 82, 81, 80, 78, 77, 76, 75, 73, 72, 71, 70, 69, 68, 67, 64,
    63, 62, 61, 56, 55, 54, 53, 51, 50, 49, 48, 46, 45, 44, 43, 41, 40, 
    39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23,
    22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5,
    4, 3, 2, 1)), class = "rle"))
plot(h, type = "h",
     xlab = "ICD code, by descending frequency",
     ylab = "count",
     xlim = c(0, 4000),
     log = "y")
termdf <- as.data.frame(
  matrix(byrow = TRUE, ncol = 2,
         data = c(
           "comorbidity", "broad category of disease, e.g., Cancer",
           "comorbidity map", 
           "set of comorbidities, each defined by diagnostic codes",
           "patient-visit",
           paste("record identifier, unique for each", 
                 "encounter with a patient, but could",
                 "represent a patient at one moment, or a",
                 "summation of all conditions a patient has",
                 "ever had"))))
my_kable(termdf, col.names = c("Term", "Description"),
         align = c("l", "p{10 cm}"),
         caption = "Terminology")

\newpage

Uses of comorbidities

Comparing groups

Almost every report of a medical investigation includes a table showing the characteristics of the groups comprising a study population. (For example, see table 2 in @frank_risk-adjusted_2014.) One purpose of this table is to show any differences between the groups under investigation. Characteristics such as age and weight can be compared numerically, whereas the tens of thousands of distinct diagnostic codes may only be compared by grouping into comorbidities. It is of central importance to retrospective studies that confounding is minimized by understanding differences in comorbidities between groups, then by risk adjustment. In randomized studies, this is achieved by recruiting enough patients and randomly assigning them to treatment: comorbidities of interest are still needed to demonstrate that recruitment was free from bias and that the sample size was large enough. It is helpful if different studies use the same criteria, and ideally the same methods, for establishing these comorbidities. One main purpose of \pkg{icd} is to enable standardized calculations of extremely well validated comorbidity definitions in \proglang{R}; for example, @quan_updating_2011 and @AgencyforHealthcareResearchandQuality_Elixhausercomorbiditysoftware_2018 only offer \proglang{SAS} code for this task.

Risk adjustment

Identifying differences between groups in retrospective medical data is not hard: they almost always exist. The eternal challenge of retrospective research is to account for those differences to make causal inferences. A common, somewhat historic strategy is to use scoring systems to encapsulate the severity of the comorbidities in one number, for example the Charlson Score [@charlson_new_1987]. Another approach is to use propensity scores or propensity matching: the inputs to propensity models include comorbidity information, either as a simple score like the Charlson Score, or as a binary representation of the presence or absence of comorbidities for each patient or patient-visit (see Terminology). More sophisticated matching also relies on comorbidities for better risk adjustment \citep[e.g.,][]{diamond_genetic_2012}.
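
As an illustration of the scoring approach, \pkg{icd} includes functions such as \code{charlson} for computing Charlson scores directly from codes. A minimal sketch, assuming \code{charlson} accepts the same kind of long-format input (identifier and code columns) as the \code{comorbid_*} functions used later; the data frame and column names here are illustrative:

pts <- data.frame(id = c("a", "a", "b"),
                  code = c("E112", "I630", "I10"))
charlson(pts)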

Mapping ICD codes onto comorbidities

The ubiquity of ICD codes means there is a long history of grouping them to form comorbidities, and these standardized groups have been used for decades as the basis for medical studies. As revisions of the ICD codes have appeared, and various diseases have waxed and waned in significance, people have worked to categorize them into these standardized comorbidity groups. The two main systems were developed by Charlson [@charlson_new_1987] and Elixhauser [@elixhauser_comorbidity_1998]. Such mappings between comorbidity categories and sets of diagnostic codes will henceforth be called comorbidity maps. The benefits of using existing comorbidity maps are:

Charlson did not initially use ICD codes to define the comorbidities, but various authors \citep[e.g.,][]{quan_coding_2005} have since classified ICD codes into those 17 comorbidity categories. Elixhauser [-@elixhauser_comorbidity_1998] developed ICD-based comorbidities for 31 diseases, and the US-based Agency for Healthcare Research and Quality (AHRQ) used this as a foundation for its own comorbidity groups [@AgencyforHealthcareResearchandQuality_Elixhausercomorbiditysoftware_2018]. Elixhauser and Charlson comorbidities have undergone refinement over the years, especially by \citet{quan_updating_2011} whose meticulous work on ICD-9 and ICD-10 codes has been included in \pkg{icd}, alongside the AHRQ mappings.

Medical codes

Medical coding has a complicated history beginning with epidemiology, and snowballing to include medical billing and research. There are several major coding schemes, including the WHO ICD family, various national modifications and extensions of the WHO ICD codes, the extensive USA ICD-CM (Clinical Modification), and several unrelated schemes with similar goals. There are codes for diagnoses, procedures, medical equipment, and causes and circumstances of disease or injury. Codes sometimes have detail beyond what seems useful for routine clinical care, for example:

library("icd")
explain_code("V97.33XD")

Because the WHO ICD is a subset of ICD Clinical Modification, \pkg{icd} is driven by ICD-9-CM and ICD-10-CM, both of which are available in the public domain, allowing analysis of WHO codes and more detailed USA or other national sets of codes. Unfortunately, the WHO exercises copyright over the international ICD-10 scheme, so it cannot be included. This does not affect comorbidity computations.

ICD-9 codes are primarily numeric, have fewer codes defined, and have a more variable format, sometimes with a prefix of V or E. ICD-10 codes have a top-level code consisting of a letter then two numbers, with a longer section of mixed numbers and letters after the decimal divider. Both ICD-9 and ICD-10 codes can be presented with or without a decimal divider, so there is potential ambiguity between ICD-9 and ICD-10 (e.g., V10 is in both schemes), and between ICD-9 codes without separators (e.g., 0100 and 100).

explain_code(as.icd9("V10"))
explain_code(as.icd10("V10"))
explain_code("0100")
explain_code("100")

Breakdown of an ICD-10 code

This is a breakdown of an ICD-10-CM code, chosen to illustrate the hierarchical nature of the codes, and difference between WHO ICD-10 and ICD-10-CM. The following two codes are shared by the WHO definitions and ICD-10-CM.[^libraryicd]

explain_code(c("S62", "S62.6"))

These three are refinements offered only by ICD-10-CM:

explain_code(c("S62.60", "S62.607", "S62.607S"))

These codes are all contained in the sub-chapter Injuries To The Wrist, Hand And Fingers (S60 to S69), and the chapter Injury, poisoning and certain other consequences of external causes (S00 to T88).

[^libraryicd]: Here the \pkg{icd} function explain_code is used. More detail on this can be found in Section \ref{other-functionality}.

Quirks of ICD codes

There are multiple possible notations of the same ICD codes, and ICD-9-CM codes are particularly variable:

In real data, ICD codes may not be constrained to a definitive list, so data may contain frankly invalid codes, or codes which are valid in one edition of ICD but not another, yet which fall clearly into one comorbidity.
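
For example, the same code can be written with or without the decimal divider, and \pkg{icd} provides converters between the two notations; a brief sketch using \pkg{icd}'s \code{decimal_to_short} and \code{short_to_decimal} functions:

decimal_to_short("441.01")
short_to_decimal("44101")
decimal_to_short("V10.1")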

Motivation and goals

The original motivation for \pkg{icd} was the lack of any \proglang{R} software in the Comprehensive \proglang{R} Archive Network (CRAN) repository, or elsewhere, to compute comorbidities from ICD codes, and the lack of data easily parsable by \proglang{R} for interpreting ICD codes, for representing the Charlson or Elixhauser schemes, or for the official lists of the codes themselves.

The requirements at the outset were:

Handling big data well aligns with another goal: enabling analysis of moderately large health care datasets with modest computing power. WHO ICD codes are used internationally, and it is important to enable their use without expensive computing resources or reliance on the Internet for cloud computing. An open source licence \citep[GNU General Public License v3.0;][]{gplv3} was chosen for this reason. This goal also helps an analyst or researcher use a laptop for all but the very biggest datasets, avoiding the complexity of cloud computing and the risks of moving protected health information across the internet.

During development it became clear that the following were also important:

Main computational problems

@. String matching

The central computational problem is one of using string matching to map ICD codes from patient data to the comorbidities, and recording the results. String matching is a slow and memory-intensive operation; the innovation described here is the use of matrix algebra to solve the problem. This requires a brief pre-processing step with string matching, whereas other solutions use string matching throughout.

@. Inexact comorbidity definitions

The comorbidity definitions in published literature do not precisely specify each individual code. This is partly a function of the various annual and international revisions of ICD codes, and also the need for brevity in publications, so ranges of codes are specified. For example, the Valvular Heart Disease comorbidity in the ICD-9 Elixhauser scheme contains the range 394.0 -- 397.1. In the ICD-10 version of the Elixhauser map, there are many top-level codes, such as I34, I35 and I36, which themselves are never or rarely used for diagnostic coding. Therefore, there are few or no exact matches between candidate ICD codes and the codes or ranges in the comorbidities, so some kind of string matching must be done, or a determination of whether a code falls in a specified range. This is dealt with in detail in Section \ref{derivation-of-the-comorbidity-maps}.

@. Big data

Bulk health care data, even for one hospital, often has hundreds of thousands or millions of encounters of different types. National databases and multi-hospital patient registries often have many more. Initial work using pure \proglang{R} code, with some optimization, resulted in computations taking minutes for about ten thousand patients. Although the problem is parallelizable, performance analysis on early versions using string matching showed that there was a lot of pressure on the CPU caches, so massive parallelization -- when it was even an option -- would have had diminishing returns with scaling. This is demonstrated by the benchmarks in the Results.

The problem of determining which patients have which disease classes can be expressed in pseudocode as nested loops:

\begin{Code}
for each patient-visit, get the ICD codes
  for each ICD code
    for each comorbidity
      search the list of codes for that comorbidity
      if a matching diagnostic code is found,
        then record that comorbidity for the current patient-visit
\end{Code}\label{pseudocode}

This is an $\mathcal{O}(n)$ computation in the number of patient-visits, because the inner search is over the fixed comorbidity map, which does not grow with the data and is therefore asymptotically unimportant.
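
For concreteness, a direct and deliberately naive \proglang{R} translation of this pseudocode might look like the following sketch; the function and argument names are illustrative, and this exact-matching, string-based strategy is what \pkg{icd} replaces with matrix multiplication:

naive_comorbid <- function(visit_codes, map) {
  # visit_codes: named list, one character vector of ICD codes per patient-visit
  # map: named list, one character vector of ICD codes per comorbidity
  out <- matrix(FALSE, nrow = length(visit_codes), ncol = length(map),
                dimnames = list(names(visit_codes), names(map)))
  for (v in seq_along(visit_codes)) {
    for (code in visit_codes[[v]]) {
      for (cmb in seq_along(map)) {
        if (code %in% map[[cmb]])
          out[v, cmb] <- TRUE
      }
    }
  }
  out
}
naive_comorbid(
  list(one = "K401", two = c("I0981", "C450"), three = "I10"),
  list(Rheumatic = "I0981", Hypertension = c("I10", "I11")))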

The benchmarks in the Results section show \pkg{icd} is much more efficient than string matching approaches.

Big data problems often cannot be solved simply by increasing hardware capacity. The availability of many computing cores in the cloud does not help an algorithm which is limited to a single thread, or an algorithm which is ignorant of how CPU memory caching works. \pkg{icd} was designed to work efficiently on big health care data by using parallel processing and by minimizing the memory requirements of the problem.

Main features

User-facing

[^crosswalk]: \pkg{icd} does not yet convert (also known as 'cross-walk') between ICD-9-CM and ICD-10-CM.

Internal

Included comorbidity maps

This package contains mappings to convert ICD codes to comorbidities using methods from several sources, based on the AHRQ, Charlson or Elixhauser systems. Updated versions of these lists from @AgencyforHealthcareResearchandQuality_Elixhausercomorbiditysoftware_2018 and @quan_updating_2011 are included, along with the original @elixhauser_comorbidity_1998 mapping. Since some data is provided in \proglang{SAS} source code format, this package has internal functions to parse this \proglang{SAS} source code and generate \proglang{R} data structures. Some lists are transcribed directly from the published articles, but interpretation of the \proglang{SAS} code used for the original publications is preferable.

For example, here are the names of the comorbidities in the Charlson map:

names(icd10_map_charlson)

and the ICD-10 codes from the first two comorbidities:

icd10_map_charlson[1:2]

Note that these codes are of icd10 and character class, and that they carry the attribute icd_short_diag which is set to TRUE, since there is no decimal divider.
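
The class and attribute mentioned here can be inspected directly:

class(icd10_map_charlson[[1]])
attr(icd10_map_charlson[[1]], "icd_short_diag")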

Methods

The main internal feature of this software is the algorithm for assigning the diagnostic codes of a patient into comorbidities. The core of this implementation is a matrix multiplication, but this is only possible with some pre-processing.

Algorithm for comorbidity computation

  1. \label{itm:dataprep} ICD-9 and ICD-10 data preparation differs
    • ICD-9: all possible ICD-9 codes are pre-calculated in ICD-9 maps.
    • \label{itm:partialstring} ICD-10: partial matching to create a map containing only the relevant codes:[^notregex] $\mathcal{O}(n)$
  2. Reduce the problem to a matrix multiplication
     a. \label{itm:intersect} Compute the intersection of ICD codes from the patient data and the comorbidity map of interest: $\mathcal{O}(\log{}n)$ for time with a tree-based data structure. For ICD-10, this is done at the same time as Item \ref{itm:partialstring}, for no additional time complexity.
     b. \label{itm:encode} Encode the ICD codes as consecutive integers: $\mathcal{O}(n)$, but could be optimized to $\mathcal{O}(1)$ by using the knowledge that there are now no duplicates.
     c. \label{itm:gensparse} Generate a sparse patient-visit matrix, with one row per patient, and a column for each ICD code defined by the consecutive integers from the previous step: $\mathcal{O}(n)$
     d. \label{itm:gendense} Generate a dense comorbidity matrix, with one column for each comorbidity, and one row for each ICD code: $\mathcal{O}(1)$ in relation to $n$ patient-visits.
  3. Perform the matrix multiplication: $\mathcal{O}(R_m\cdot n)$ time complexity, where $R_m$ is the ratio of non-zero to zero elements per row. For typical patient data, the mean is around five to ten codes per patient-visit, with many administrative datasets being limited to thirty. This low number compares favorably to the matrix width of tens of thousands.[^crs] A minimal sketch of this multiplication step follows the footnotes below.

[^notregex]: This partial matching is not based on regular expressions, but simply identifies whether the more significant characters of a longer code match a known parent. For example, if S62.6 appeared in a comorbidity map, then S62.607S would be matched.

[^crs]: Sparse matrix multiplication with dense matrices is already $\mathcal{O}(n)$ using the widespread Compressed Row -- or Column -- Storage (CRS or CCS) scheme. Even if dense matrix multiplication were used, it would still be $\mathcal{O}(n)$, despite being $\mathcal{O}(n^3)$ in the general case: in this problem, only one dimension of one matrix grows with increasing patient-visits. These are asymptotic limits for time, not for memory or bandwidth.
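
To make step 3 concrete, the following is a minimal sketch of the multiplication using the \pkg{Matrix} package, with tiny illustrative matrices (an illustration of the idea only, not \pkg{icd}'s internal \proglang{C++} implementation):

library("Matrix")
# Sparse patient-visit matrix A: 4 patient-visits (rows) by 3 encoded codes (columns)
A <- sparseMatrix(i = c(2, 3, 4, 4), j = c(1, 2, 2, 3), x = 1, dims = c(4, 3))
# Dense comorbidity matrix B: 3 codes (rows) by 2 comorbidities (columns)
B <- matrix(c(1, 0,
              0, 1,
              0, 1), nrow = 3, byrow = TRUE)
C <- A %*% B # counts of matching codes per patient-visit and comorbidity
C != 0       # logical presence or absence of each comorbidity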

Data preparation

ICD-9

Step \ref{itm:dataprep} of the algorithm is only applicable to ICD-9 codes. The comorbidity maps for ICD-9 codes already contain all the syntactically valid permutations of child codes according to the ICD-9 specification. This means exact matching can be done in subsequent steps. The pre-calculation of the ICD-9 maps is done at package creation time, using internal functions which are included for reproducibility.
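
Because every child code is pre-computed, membership can be tested by exact matching. For example, we would expect all defined children of 491 (chronic bronchitis, which appears in the pulmonary definitions discussed later) to be present in the Quan-Deyo pulmonary map:

head(children("491"))
all(children("491") %in% icd9_map_quan_deyo[["Pulmonary"]])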

ICD-10

ICD-10 codes are more complex and have many more possible permutations: the same ICD-9 technique could be used, but the ICD-10 maps would be huge, and vulnerable to unanticipated variations; for example, a national ICD-10 variant using a different letter suffix would not be captured. The solution is to have ICD-10 comorbidity maps which only include the level of detail specified by the original author, often just the top-level three-digit codes, and to use partial string matching,[^notregex] combining steps \ref{itm:partialstring} and \ref{itm:intersect} during one scan of the data.
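
The partial matching itself is conceptually simple; the following base \proglang{R} sketch illustrates the idea (it is not the \pkg{icd} internal code), using short-format codes from the earlier S62 example:

map_codes <- c("I10", "S626") # short-format codes as they might appear in a map
candidate <- "S62607S"        # a more detailed code from patient data
# the candidate matches if its leading characters equal a code in the map
any(startsWith(candidate, map_codes))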

Problem reduction

There are some major simplifications:

  1. Only a fraction of possible ICD codes typically appear in health data, in step \ref{itm:intersect}
  2. ICD codes are codified as integers in step \ref{itm:encode}
  3. Sparse matrix representation can be used to dramatically reduce the memory requirement of the patient-visit matrix, in step \ref{itm:gensparse}

Reduce number of codes

Reducing the total number of codes in both patient-visit and comorbidity data is performed in step \ref{itm:intersect} of the algorithm. For matrix multiplication, the number of columns $n$ of the patient-visit matrix must equal the number of rows $p$ of the comorbidity map. This means a common vocabulary for the codes in the patient-visits and the comorbidity maps must be established. This is accomplished in step \ref{itm:encode} by using a factor,[^factor] where the levels represent the intersection of codes from the patient-visit data and the whole comorbidity map. Thus the integer factor indices become column or row indices in matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, respectively. That is, we only need to represent codes which are in both the patient-visit data and the comorbidity map, making matrix $\boldsymbol{A}$ narrower and matrix $\boldsymbol{B}$ shorter. String searching strategies miss this optimization.

Referring back to the pseudocode in Section \ref{pseudocode}, integer representation would allow much faster binary or linear search, but this is obviated by a matrix multiplication. Although the matrix multiplication probably performs more computations than needed, it has simpler flow control and is a problem for which highly-optimized solutions exist.

[^factor]: \proglang{R} uses a data structure called a factor in which each element is stored as an integer index into an array of strings. The strings are unique, whereas the integers may be repeated.
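
The intersection and integer encoding of steps \ref{itm:intersect} and \ref{itm:encode} can be sketched in base \proglang{R} with illustrative data (the variable names here are hypothetical):

patient_codes <- c("I10", "I110", "K401", "I10")        # codes seen in patient data
map_codes     <- c("I098", "I10", "I11", "I110", "I50") # codes in the comorbidity map
common <- intersect(unique(patient_codes), map_codes)   # the shared vocabulary
f <- factor(patient_codes, levels = common)             # codes absent from the map become NA
as.integer(f)                                           # integer column indices for matrix A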

\begin{figure}[ht] \centering \setlength{\fboxsep}{1em} \setlength{\fboxrule}{1pt} \fbox{% \begin{minipage}{12 cm} \begin{enumerate} \def\labelenumi{\arabic{enumi}.} \item Prepare \begin{itemize} \itemsep0em \def\labelenumii{\alph{enumii}.} \item ICD-9: All permutations already in maps \item ICD-10: String match to reduce map to relevant codes \end{itemize} \item Reduce the problem \begin{enumerate} \itemsep0em \def\labelenumii{\alph{enumii}.} \item Intersection of patient and map codes \item Encode as integer indices \item Generate sparse patient-visit matrix \item Generate dense comorbidity matrix \end{enumerate} \item Perform matrix multiplication \end{enumerate} \end{minipage} }% \caption{Recap of algorithm} \label{algorecap} \end{figure}

Sparse matrix representation

In a large data set, less common codes are more likely to appear, resulting in more nearly empty columns in the patient-visit matrix. Since this representation is used primarily to facilitate calculation of huge datasets, we ignore the fact that sparse format may be less efficient than dense for small datasets. Row-major representation also makes sense, with each row corresponding to a patient-visit. This is done in step \ref{itm:gensparse} of the algorithm.

The comorbidity matrix is more sparse than the patient-visit matrix, but is implemented as a dense data structure in step \ref{itm:gendense} of the algorithm for two reasons: firstly, linear algebra software is often optimized for row-major sparse multiplications with dense matrices; secondly, this matrix has a maximum size limited by the comorbidity definitions, so does not need to scale with very large numbers of patients. For example, ignoring the problem reduction step described above, the maximum size of the ICD-9 AHRQ comorbidity matrix is $14678 \times 30 \times 4 = 1.8 \times 10^6$ bytes, i.e., less than two megabytes,[^wordagain] which compares favorably to the eight megabyte CPU cache in the modest workstation used in the benchmarks presented in Section \ref{results}.

[^wordagain]: Again the calculation is done using a 32-bit word for each flag.

There is a detailed example below in Section \ref{workedex}, but a better idea of the bulk structure of this matrix can be seen below for a fictitious comorbidity map with five categories, and also in Figure \ref{fig:comorbidmatrix}. Note that the code represented by the last row appears in the first and third comorbidities, whereas the others are all unique to one comorbidity. $$\boldsymbol{B}_{p,q} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \end{pmatrix}$$\label{eq:bulk}

cmbimage <- function(map, xlab, ...) {
  code <- unique(unlist(map))
  x <- comorbid(map = map, x = data.frame(ptid = seq_along(code), code))
  mode(x) <- "integer"
  image(z = t(x), ylim = c(1, 0),
        col = c("#FFFFFF", "#000000"),
        xaxt = 'n', yaxt = 'n',
        useRaster = TRUE, ...)
  mtext(text = xlab, side = 1, line = 1.5, outer = FALSE)
}
{
  par(mfcol = c(1, 3), cex = 1,
      mar = c(2, 2, 2, 1) + 0.1, oma = rep(0.5, 4))
  cmbimage(icd9_map_ahrq, xlab = "ICD-9 AHRQ\ncomorbidities")
  mtext(text = "ICD codes", side = 2, line = 0.5, outer = FALSE)
  cmbimage(icd10_map_ahrq, xlab = "ICD-10 AHRQ\ncomorbidities")
  cmbimage(icd9_map_multi_ccs[["lvl1"]], xlab = "ICD-9 CCS\ncategories")
}

Matrix multiplication

This section first describes steps \ref{itm:gensparse} and \ref{itm:gendense} of the algorithm in Section \ref{algorithm-for-comorbidity-computation}. Let $\boldsymbol{A}$ represent the matrix of diagnostic codes associated with each patient-visit, where each row, $m$, is a patient-visit, and each column, $n$, represents a different code. Each cell of the matrix is therefore either unity or zero: unity indicates that the patient-visit on row $m$ is associated with the code in column $n$, and zero that it is not.

$$\boldsymbol{A}_{m,n} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,n} \end{pmatrix}$$\label{eq:A}

Let matrix $\boldsymbol{B}$ be the comorbidity map, where each row, $p$, represents a different code, and each column, $q$, represents a comorbidity.

$$\boldsymbol{B}_{p,q} = \begin{pmatrix} b_{1,1} & b_{1,2} & \cdots & b_{1,q} \\ b_{2,1} & b_{2,2} & \cdots & b_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ b_{p,1} & b_{p,2} & \cdots & b_{p,q} \end{pmatrix}$$\label{eq:B}

Given there are tens or hundreds of thousands of possible ICD-9 or ICD-10 codes, the possible width of $\boldsymbol{A}$ is large ($n$ columns). There are also many ICD codes for each comorbidity, so the height of $\boldsymbol{B}$ is large ($p$ rows), although typical comorbidity maps only cover a subset of possible codes. Many data sets have tens of millions of patient-visits, each with typically up to 30 diagnostic codes, the mean being around five to ten, depending on the dataset. The memory requirement, using the lower estimates and four bytes per flag,[^word] is therefore $10^7 \times 10^4 \times 4 = 4 \times 10^{11}$ bytes, i.e., 400 gigabytes, simply to represent the patient-visit to disease relationships.

[^word]: A \proglang{C++} \code{int} is used for each comorbidity flag, which is typically, in 2018, a four-byte word. However, the matrices used by \pkg{icd} hold only true or false values, represented as 1 or 0. It appears inefficient to use an entire \code{int}. An alternative is bit-packing of 32 bool bits into each \code{int}, which has a stormy history in the \proglang{C++} Standard Template Library [@Jarvi_AlgorithmSpecializationGeneric_2006], and, more importantly, is not supported by any major linear algebra library.
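
The arithmetic behind this estimate can be checked directly:

10^7 * 10^4 * 4        # bytes for a dense patient-visit by ICD code matrix
10^7 * 10^4 * 4 / 10^9 # i.e., about 400 gigabytes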

Since the integer levels of the ICD code factor are common to both the patient-visit and comorbidity matrices, $n = p$, and the comorbidities for each patient-visit are given by their matrix product.

$$\boldsymbol{C}_{m,q} = \boldsymbol{A}\boldsymbol{B} = \begin{pmatrix} c_{1,1} & c_{1,2} & \cdots & c_{1,q} \\ c_{2,1} & c_{2,2} & \cdots & c_{2,q} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m,1} & c_{m,2} & \cdots & c_{m,q} \end{pmatrix}$$\label{eq:C}

Worked example using ICD-10 codes {#workedex}

Take four patient-visits with the following ICD-10 codes in wide format, seen in Table \ref{tab:workedexinput}, and this simple comorbidity map:

list(
  "Rheumatic Heart Disease" = "I098",
  "Hypertension" = c("I10", "I11"),
  "Heart failure" = c("I50", "I110")
)
workedexinput <- matrix(
  ncol = 4,
  data = c(
    paste("Encounter", c("one", "two", "three", "four")),
    "K401", "I0981", "M352", "I110",
    "", "C450", "I10", "H40001",
    "", "", "", "I10"))
my_kable(
  workedexinput, 
  col.names = c("Patient-Visit", sprintf("Code %i", 1:3)),
  caption = paste("Four patient-visits with some ICD-10 codes in",
                  "`wide' format for worked example"))

There are several things to note, which represent common features in real health care data:

Let $\boldsymbol{A}$ be a simplified set of patient-visits, where the columns represent the ICD-10 codes I0981 (rheumatic heart failure), I10 (essential hypertension), and I110 (hypertensive heart disease with heart failure). Again, each row is a different patient-visit.

Let $\boldsymbol{B}$ be a simplified comorbidity map, where the columns represent rheumatic heart disease, hypertension, and heart failure, in that order. Note that I110 falls in both the hypertension and heart failure comorbidities.

$$\boldsymbol{A} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix} \qquad \boldsymbol{B} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix} \qquad \boldsymbol{C} = \boldsymbol{A}\boldsymbol{B} = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 2 & 1 \end{pmatrix}$$\label{eq:ABC}

Note that cell $\boldsymbol{C}_{4,2} = 2$ because encounter four has two codes, I10 and I110, which both fall into the hypertension comorbidity. The counts themselves are not needed, so the final result is the logical matrix $\boldsymbol{C} \neq 0$:

$$(\boldsymbol{C} \neq 0) = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}$$\label{eq:Clogical}
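
The same arithmetic can be reproduced in base \proglang{R}:

A <- matrix(c(0, 0, 0,
              1, 0, 0,
              0, 1, 0,
              0, 1, 1), nrow = 4, byrow = TRUE)
B <- matrix(c(1, 0, 0,
              0, 1, 0,
              0, 1, 1), nrow = 3, byrow = TRUE)
A %*% B        # the counts in C
(A %*% B) != 0 # the logical result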

Table \ref{tab:cmbresult} shows how this can be represented to the user.

workedexinput <- matrix(
  ncol = 4,
  data = c(
    paste("Encounter", c("one", "two", "three", "four")),
    "", "yes", "", "",
    "", "", "yes", "yes",
    "", "", "", "yes"
  ))
my_kable(
  workedexinput, 
  col.names = c("Patient-Visit", "Rheum", "HTN", "CHF"),
  caption = paste("Output of the worked example using ICD-10 codes.",
                  "`Rheum' is Rheumatic Disease, `HTN' is",
                  "hypertension, `CHF' is Congestive Heart Failure."))

Worked example with anonymous patient data

The US State of Vermont offers anonymized public hospital discharge data [@VermontDepartmentofHealth_VermontHospitalDischarge_2016]. A sample is included in \pkg{icd} and is used here to illustrate a real-world comorbidity calculation.[^disclaimer]

head(vermont_dx[1:10])

The data is in 'wide' format. ICD codes have the 'short' structure, without the decimal divider. Diagnoses are spread across the DX columns.

v <- vermont_dx[-c(2:5)]
v[1:5, 1:5]
v_cmb <- comorbid_charlson(v, return_df = TRUE)
print(head(v_cmb), row.names = FALSE)
comorbid_matrix <- comorbid_ahrq(v)
{
  image(comorbid_matrix, col = c("#FFFFFF", "#000000"),
        xaxt = 'n', yaxt = 'n', useRaster = FALSE)
  mtext(text = "Comorbidity", side = 2, line = 1)
  mtext(text = "Patient-visit", side = 1, line = 1)
}

[^disclaimer]: A condition of use of this data requires the following text be included: "Hospital discharge data for use in this study were supplied by the Vermont Association of Hospitals and Health Systems-Network Services Organization (VAHHS-NSO) and the Vermont Green Mountain Care Board (GMCB). All analyses, interpretations or conclusions based on these data are solely that of [the requestor]. VAHHS-NSO and GMCB disclaim responsibility for any such analyses, interpretations or conclusions. In addition, as the data have been edited and processed by VAHHS-NSO, GMCB assumes no responsibility for errors in the data due to coding or processing by hospitals, VAHHS-NSO or any other organization, including [the requestor]."

Results

There are now three active CRAN packages which calculate comorbidities: \pkg{icd}, \pkg{medicalrisk} [@medicalriskpkg], and \pkg{comorbidity} [@comorbiditypkg]. The latter two work using the strategy described in the pseudocode in Section \ref{pseudocode}.

The following is a limited performance comparison using synthetic data. This synthetic data shares some characteristics of real data: firstly, the more patients there are, the more coverage there is of the entire code space of the ICD scheme; secondly, there are both valid and invalid ICD codes; thirdly, each patient is assigned twenty codes. From the author's experience, the mean number of codes per patient is around five to ten, so twenty per patient represents a mild stress-test of the algorithms.

Figure \ref{fig:versus} shows the time to compute comorbidities for increasing numbers of rows of data; \pkg{icd} is dramatically faster than the alternatives. Lines are fitted from where the relationship becomes linear, at 10,000 rows of data, and these are used to extrapolate to estimates for the much larger computations seen in Figure \ref{fig:bigdata}. Figure \ref{fig:speedup} shows the relative speed-up \pkg{icd} offers in comparison to the alternatives.

fac <- 1e-3
res <- structure(list(
  datarows = c(10, 100, 1000, 10000, 1e+05, 1e+06, 1e+07), 
  icd = c(
    0.0010278929839842,
    0.00120355357648805,
    0.00180120998993516,
    0.008042294124607, 
    0.062866612104699, 
    1.573,
    15.609
    ),
  comorbidity = c(
    0.0132215245394036, 
    0.0231163185089827,
    0.111484928464051, 
    1.00265117513482,
    5.36533053999301, 
    87.162, 
    10986.539
    ),
  medicalrisk = c(
    0.00351565598975867, 
    0.0102955044712871, 
    0.0715644441079348, 
    0.756538210087456,
    7.0441670359578, 
    72.286, 
    854.759
    )), 
  row.names = c(NA, -7L), 
  class = "data.frame")
fit_res <- log10(res[4:7, ])
icd_model <- lm(icd ~ datarows, data = fit_res)
cmb_model <- lm(comorbidity ~ datarows, data = fit_res)
mdr_model <- lm(medicalrisk ~ datarows, data = fit_res)
pred_hours <- function(model)
  10 ^ predict(model, data.frame(datarows = log10(20 * 1e8))) / 3600
preds <- vapply(list(icd_model, cmb_model, mdr_model),
                FUN = pred_hours, FUN.VALUE = numeric(1))
names(preds) <- c("icd", "comorbidity", "medicalrisk")
xseq = seq(0, 7)
yseq = seq(-3, 3, 3)
ysrr = 0:4
logxaxis <- sapply(paste0("expression(10^", xseq, ")"),
                   function(x) eval(parse(text = x)))
logyaxis <- sapply(paste0("expression(10^", yseq, ")"),
                   function(x) eval(parse(text = x)))
logyaxrr <- sapply(paste0("expression(", 10^ysrr, ")"),
                   function(x) eval(parse(text = x)))
colours <- c(comorbidity = 'darkred',
             icd = 'black',
             medicalrisk = 'darkblue')
rr <- data.frame(datarows = res$datarows,
                       icd = 1,
                       comorbidity = res$comorbidity / res$icd,
                       medicalrisk = res$medicalrisk / res$icd)

# Figure caption (fragment from the original chunk options): "... was run with
# and without parallel option, and the best strategy was chosen for each
# number of iterations."
plt_inset <- 0.05
xmin <- 10
{
  plot(NA, NA, log = "xy", type = "l", col = 'darkred',
       xlab = "rows of data", ylab = "seconds",
       xlim = c(xmin, max(res$datarows)),
       ylim = c(fac, max(c(res$medicalrisk, res$comorbidity))),
       xaxt = "n", yaxt = "n")
  axis(1, 10^xseq, logxaxis)
  axis(2, 10^yseq, logyaxis)
  lines(res$datarows, res$comorbidity, col = colours["comorbidity"])
  lines(res$datarows, res$icd, col = colours["icd"])
  lines(res$datarows, res$medicalrisk, col = colours["medicalrisk"])
  abline(icd_model, col = colours["icd"], lty = 3)
  abline(cmb_model, col = colours["comorbidity"], lty = 3)
  abline(mdr_model, col = colours["medicalrisk"], lty = 3)
  legend("topleft", inset = plt_inset, legend = names(res[-1]),
         fill = colours[names(res[-1])])
}

{
  ymaxrr <- 10^ceiling(
    log10(
      max(c(rr$medicalrisk, rr$comorbidity))
    )
  )
  plot(NA, NA, log = "xy",
       type = "l", col = 'darkred',
       xlab = "rows of data",
       ylab = "time / time using 'icd'",
       xlim = c(xmin, max(rr$datarows)),
       ylim = c(1, ymaxrr),
       xaxt = "n", yaxt = "n"
  )
  axis(1, 10^xseq, logxaxis)
  axis(2, 10^ysrr, logyaxrr)
  lines(rr$datarows, rr$icd, col = colours["icd"])
  lines(rr$datarows, rr$comorbidity, col = colours["comorbidity"])
  lines(rr$datarows, rr$medicalrisk, col = colours["medicalrisk"])
  legend("topleft", inset = plt_inset, 
         legend = names(rr[-1]), 
         fill = colours[names(rr[-1])])
}
{
  barplot(preds,
          col = colours[c("icd", "comorbidity", "medicalrisk")],
          ylab = "hours", log = "y", ylim = c(1, 100000))
  legend("topleft", inset = plt_inset, legend = names(preds), fill = colours[names(preds)])
}

Implementation Details

Derivation of the comorbidity maps

It is important to have a reproducible audit trail for foundational work, so \pkg{icd} contains code which parses \proglang{SAS} source code in order to derive the original intent of the author in how the comorbidity maps were implemented. The AHRQ and \citeauthor{quan_updating_2011} provide such \proglang{SAS} source code, whereas other maps are only available in the form of tabulated data in journal articles.

Parsing original SAS code

For example, \citeauthor{quan_updating_2011} offer the following code for pulmonary disease in the ICD-9 Charlson map. The lines are split here for clarity.

%LET DC6=%STR('4168','4169',
'490','491','492','493','494','495','496',
'500','501','502','503','504','505',
'5064','5081','5088');

Note that 497 -- 499 are undefined in ICD-9. What should be done if 497 appears in a data set? Here, an argument is made that it should be treated as invalid. Firstly, see the sub-chapter definitions:

sc <- c("Chronic Obstructive Pulmonary Disease And Allied Conditions",
        "Pneumoconioses And Other Lung Diseases Due To External Agents")
icd9_sub_chapters[sc]

497 is not a valid code itself. It could be a typo for any of several permutations of its digits, of which four are completely different, yet valid, codes:

explain_code(c("497", "479", "947", "974", "749", "794"), warn = FALSE)

There are also many ways the coder could have erred other than permuting the digits of the intended code. Given the wide range of disease processes, and no guarantee at all that the coder made a mistake only in the last digit, \pkg{icd} discards the code to avoid giving a false positive comorbidity flag.

"497" %in% icd9_map_charlson

The comorbidity maps are therefore constructed by generously including all children (valid or invalid) of the explicitly defined three-digit codes, but they do not extrapolate to other three-digit codes. Suppose additional digits were defined by a country's extension of ICD-9 which do not appear in the WHO or ICD-9-CM definitions: \pkg{icd} already includes all possible structurally valid ICD-9 child codes in the map. This can be seen here, where two fictitious decimal places are seen in the map, but not three:

"49699" %in% icd9_map_quan_deyo[["Pulmonary"]]
"496999" %in% icd9_map_charlson

More generosity is offered when calculating the comorbidities: even if an invalid code appears, its intent to fall within the pulmonary comorbidity hierarchy is clear enough:

alice <- data.frame(id = "alice", icd9 = "49699")
comorbid_charlson(alice, return_df = TRUE)[["Pulmonary"]]

Parsing range definitions

For some mappings, no source code was available, but the comorbidities are described in journal articles using ranges, e.g., in the ICD-9 Elixhauser mapping, we find 243 -- 244.2 in the thyroid disease definition. This is a subset of the entire range of thyroid diseases in ICD-9: 244.3, 244.8 and 244.9 also exist. \pkg{icd} takes great care to avoid false positives here, by excluding parent codes which might have been captured by a pattern matching approach. In this case, 244 should not be considered a match because it includes codes which were clearly excluded. A number of ICD code range operators are defined to facilitate this:

head("243" %i9da% "244.2")
"244" %in% ("243" %i9da% "244.2")

Note that 244 is too broad to fit the original Elixhauser description, because it implies all its children, which would go beyond 244.2.

Other Functionality

Validation

\pkg{icd} allows checking whether codes are valid, and, in the USA, whether they are 'billable', i.e., leaf nodes, or merely intermediate members of the hierarchy. Research and clinical data may also contain relevant non-billable codes, and this is accounted for by the comorbidity calculations.

is_valid(c("441", "441.0", "441.01", "XXX"))
is_billable(c("441", "441.0", "441.01", "XXX"))
head(
  data.frame(code = children("441"),
             billable = is_billable(children("441"))))

Hierarchy and explanation

Functions are provided to navigate the ICD-9 and ICD-10 hierarchies. This example first takes the children of the ICD-9 code 441:

children("441")

Several five-digit codes begin with 4410. explain_code condenses all these children to their common parent, before interpreting the ICD code:

explain_code(children("4410"))

What does each one mean?

explain_code(children("4410"), condense = FALSE)

Software libraries

This package is built on the strong foundations of \proglang{R} [@R_2018], \pkg{Rcpp} [@Eddelbuettel_RcppSeamlessIntegration_2011], \pkg{RcppEigen} [@Eddelbeuttel_FastElegantNumerical_2013], and \pkg{Eigen} \citep*{Guennebaud_Eigen_2017}, the highly optimized \proglang{C++} linear algebra library. Eigen was chosen because of its performance-oriented approach to matrix multiplication, using advanced x86 instructions when possible, and, in the case of row-major sparse multiplication with a dense matrix, a multi-threaded solution. Once this was done, the bottleneck moved to the data preparation before the matrix multiplication, which was then optimized using simple \proglang{R} techniques such as vectorization.

Extensions

\proglang{R}'s S3 class system is used for extensibility, so it is straightforward to include additional ICD schemes, e.g., for different national systems. Likewise, it is also easy to add new comorbidity maps, or use user-defined ones. Several authors have contributed code which extends \pkg{icd} to solve other problems:

Limitations and future work

In general, computation of comorbidities as used in medical research ignores the fact that both the comorbidity maps and the ICD schemes change from year to year. \pkg{icd} takes an inclusive approach where appropriate, so extra or missing codes result in the same or very similar comorbidity results. An extension could compute comorbidities using the correct annual revision according to the year of the patient encounter.

Many countries have their own variations on the WHO ICD codes, and these are not yet included in \pkg{icd}; they may affect comorbidity calculations. Also missing are the WHO versions of the ICD codes, which may be included in the future, depending on licensing restrictions.

The synthetic data used for benchmarking contained real and invalid codes. More benchmarking could be done using real data with different proportions of invalid codes, and different distributions of codes.

Conclusions

\pkg{icd} gives \proglang{R} the ability to perform a common medical research task by providing fast and accurate conversion of ICD-9 and ICD-10 codes into comorbidities. The key innovations are the reduction of the problem to an equivalent smaller task, and the use of sparse matrix multiplication to compute the comorbidities. This solution offers an efficient and elegant method that reduces time and memory complexity. The benchmarks show that this technique scales to the biggest health care data sets, and answers the goal of making this possible with modest computing power in a reproducible workflow.

\section*{Acknowledgments}

I am deeply grateful to Steve Frank and Mohamed Rehman for their support. Thanks to my talented colleagues at the Children's Hospital of Philadelphia, notably Aaron Masino, Allan Simpao, Jorge Galvez and Jonathan Tan. Thanks also to several people who have contributed code, including the work on Van Walraven scores (William Murphy), HCC (Anobel Odisho), CCS (Vitaly Druker) and an alternative ICD decoding format (Ed Lee). A full list of contributors may be seen on the \pkg{icd} CRAN page, and on the project page.

References


