Quick Introduction to icd

    loadNamespace("knitr") # for opts_chunk only

  collapse = TRUE,
  comment = "#>"

patients_icd9 <- data.frame(
  visit_id = c(1000, 1000, 1000, 1000, 1001, 1001, 1002),
  icd9 = as.icd9(c("40201", "2258", "7208", "25001", "34400", "4011", "4011")),
  poa = c("Y", NA, "N", "Y", "X", "Y", "E"),
  stringsAsFactors = FALSE

Quick Start

head(uranium_pathology, 10)
comorbid_charlson(uranium_pathology, return_df = TRUE)[1:5, 1:5]



When calculating which patients have which comorbidities, the data are typically structured in long or wide formats:

# long format ICD-9-CM codes, with present-on-arrival flags

# long format ICD-10 codes, real mortality data
uranium_pathology[1:5, ]

# wide format, real ICD-9 discharge diagnoses
vermont_dx[1:5, c(1, 6:15)]

In real life, there are often problems with the data, such as NA values, out-of-order visit_ids, non-existent or invalid ICD codes, etc.. Although standard R tools or the tidyverse can be used to clean the data, knowing the specific validation rules for ICD-9 and ICD-10 codes, as well as the standardized structure of healthcare data enables faster and more accurate data cleaning.

# use AHRQ revision of Elixhauser comorbidities, show only first eight columns
comorbid_ahrq(patients_icd9)[, 1:8]

Things work beautifully using magrittr %>% to chain functions together. magrittr is useful for chains of commands, such as the following:

# find Elixhauser comorbidities which were present-on-arrival
patients_icd9 %>% filter_poa %>% comorbid_elix

# same as above, then summarize five columns:
patients_icd9 %>%
  filter_poa %>%
  comorbid_elix %>%
  extract(, 5:10) %>%

# convert vermont discharge data to wide format, 
# find comorbidities, convert TRUE to 1 and show first few
vermont_cmb <- vermont_dx %>% wide_to_long %>% 
  icd9_comorbid_quan_deyo %>%
  apply(2, as.integer) # convert logical to integer


        main = "Histogram of Elixhauser Comorbidities in Vermont data")

The above can be rewritten in classic R with many parentheses:

head(apply(icd9_comorbid_quan_deyo(wide_to_long(vermont_dx)), 2, as.integer))

Specifying data types

icd will guess the type and form of input data when possible, but there are sometimes ambiguities when ICD-9 and ICD-10 codes are mixed.

is_valid("100") # valid ICD-9 code
is_valid("A1001") # valid ICD-10 code
is_valid(c("100", "A1001")) # they can't both be valid

You can let icd guess types, or specify the type of your data explicitly:

# decimal format ICD-10 codes
codes <- c("A10.01", "L40.50", "Z77.098")
# set class to be icd10cm (and implicitly icd10)
# indicate decimal code and icd10 (not necessarily icd10cm)
codes %>% as.decimal_diag %>% as.icd10

Doing this avoids mistakes in guessing type. For example code V10 is valid in both ICD-9 and ICD-10.

Converting ICD codes between types

ICD codes are usually presented in decimal format (beware, for this is not a number), e.g., the ICD-9 code 003.21, or ICD-10 code T81.10XD, whereas most electronic records seem to use the short form without a decimal place. These are not interchangeable simply by removing the decimal place, and great care is taken to do this correctly. Most ICD-9 codes do not have a letter prefix, so there is possible ambiguity here. icd was designed to deal with the common problem of incorrectly formatted ICD codes. The assumption is made that short codes of three or fewer characters are describing only the 'major' part: there is no other reasonable interpretation. For example, 020 must be taken to mean 20, not 2.0 or even 0.20. In most cases, when icd works on ICD-9 codes, it will convert any codes of fewer than three characters into zero-padded three-digit codes.

decimal_to_short(c("1", "10.20", "100", "123.45"))
short_to_decimal(c("1", "22", "2244", "1005"))

# similar operations with magrittr, also showing invalid codes
codes <- as.icd9(c("87.65", "9999", "Aesop", -100, "", NA))

# ICD-10

Validation of ICD-9 codes

# guess both ICD version (9, but could be 10?), and decimal vs short form

# state we are using short or decimal codes:
is_valid(c("099.17", "-1"), short_code = TRUE)
is_valid(c("099.17", "-1.1"), short_code = FALSE)
is_valid(c("1", "001", "100", "123456", "003.21"), short_code = TRUE)

Decoding ICD-9 codes to descriptions

There are various ways of extracting the description of the condition described by an ICD-9 code. the explain group of functions return a data frame with a column for the ICD-9 code, a column for the full length Diagnosis, and a column for the short Description.

explain("1.0") # 'decimal' format code inferred
explain("0019") # 'short' format code inferred
# we can be explicit about short vs decimal
explain("434.00", short_code = FALSE)
explain(c("43410", "43491"), short_code = TRUE)
#explain top level code with children
"391" %>% explain # single three-digit code
"391" %>% children # let's see the child codes
"391" %>% children %>% explain # children condensed to parent code
"391" %>% children %>% explain(condense = FALSE) # prevent condense

Arbitrary named list(s) of codes:

explain(list(somecodes = as.icd9(c("001", "391")),
                 morecodes = as.icd9cm(c("001.1", "001.9"))))

001 (Cholera) isn't itself a diagnostic code, i.e. leaf node in the hierarchy, but 390 (Rheumatic fever without heart involvement) is. Both are explained correctly:

explain(list(cholera = "001", rheumatic_heart = "390"))

Now try to explain on a non-existent (but 'valid') ICD-9 code:

s <- explain("001.5") # gives warning

As we have just seen, explain can convert lists of ICD-9 or ICD-10 codes to a human-readable format. Let's apply the explain to a list of comorbidity ICD-9 codes in one of the commonly-used mappings. This makes comprehending a complicated list much easier. Taking the list for dementia:

length(icd9_map_quan_deyo[["Dementia"]]) # 133 possible ICD-9 codes
length(icd10_map_quan_deyo[["Dementia"]]) # the ICD-10 map is different
# explain summarizes these to just two groups:
icd9_map_quan_deyo[["Dementia"]] %>% explain(warn = FALSE)
# contrast with:
icd9_map_quan_deyo[["Dementia"]] %>% explain(condense = TRUE, warn = FALSE)

Use a range with more than two hundred ICD-9 codes (most of them not real):

length("390" %i9da% "392.1")
"390" %i9da% "392.1" %>% explain(warn = FALSE)

The warnings here are irrelevant because we know that `%i9da% produces codes which do not correspond to diagnoses. However, in other usage, the user would typically expect the ICD-9 codes he or she is using to be diagnostic, hence the default to warn.

Filtering by Present-on-Arrival

This flag is recorded with each ICD-9 code, indicating whether that diagnosis was present on admission. With some caution, codes flagged specifically not POA can be treated as new diseases during an admission.

Present-on-arrival (POA) is typically a factor, or vector of values such as "Y", "N", "X", "E", or NA. Intermediate codes, such as "exempt", "unknown" and NA mean that "yes" is not the same as "not no." This requires four functions to cover the possibilities stored in poa_choices:


Filter for present-on-arrival being "Y"

patients_icd9 %>% filter_poa_yes

Show that yes is not equal to not no (e.g. due to NA in poa field)

patients_icd9 %>% filter_poa_not_no


The comorbidities from different sources are provided as lists. At present only the most recent mapping of ICD-9 codes to comorbidities is provided. See these github issues.

This package contains ICD-9-CM to comorbidity mappings from several sources, based on either the Charlson or Elixhauser lists of comorbidities. Updated versions of these lists from AHRQ and Quan et al are included, along with the original Elixhauser mapping . Since some data is provided in SAS source code format, this package has internal functions to parse this SAS source code and generate R data structures. This processing is limited to what is needed for this purpose, although may be generalizable and useful in other contexts. Other lists are transcribed directly from the published articles, but interpretation of SAS code used for the original publications is preferable.

AHRQ comorbidity classification

The AHRQ keeps an updated version of the Elixhauser classification of ICD-9-CM codes into comorbidities, useful for research. They provide the data in the form of SAS code. The names of the comorbidities derived from ICD-9 and ICD-10 codes are the same. Maps contain the ICD code to comorbidity mappings; the functions that apply those mappings are called things like icd10_comorbid_ahrq.

#icd9_map_ahrq <- icd:::sas_parse_ahrq() # user doesn't need to do this

Elixhauser comorbidities

Elixhauser originally devleoped this set of comorbidities to predict long term mortality based on hospital ICD-9-CM coding records. The AHRQ comorbidities are an updated version of this, however the original Elixhauser have been used in many publications. The ICD-9-CM codes have changed slightly over the years.

# the names of the comorbidities in each map are available as named lists:
# The map contents have ICD codes with the class set


Quan's paper looked at indices using both ICD-10 and ICD-9-CM. Quan generated updated ICD-9-CM codes for all 30 of Elixhauser and all 17 of Charlson/Deyo's comorbidities. Thus there are two 'Quan' comorbidity mappings.



Filter patients and create comorbidities

Take my patients, find the ones where there definitely or maybe was a diagnosis present on admission, then generate comorbidities based on the AHRQ mapping. N.b. NotNo is not the same as Yes because of some exempt, unclassifiable conditions, or NA values for the present-on-admission flag.

patients_icd9 %>%
  filter_poa_not_no %>%
  icd9_comorbid_ahrq %>%

Compare two comorbidity definitions

We will find the differences between some categories of the original Elixhauser and the updated version by Quan. Just taking the select few comorbidity groups for brevity:

difference <- diff_comorbid(icd9_map_elix, icd9_map_quan_elix,
                            all_names = c("CHF", "PHTN", "HTN", "Valvular"))
# reuslts also returned as data

Which pulmonary hypertension codes are only in Quan's version?

difference$PHTN$only.y %>% get_defined %>% explain

(Passing through get_defined stops explain complaining that some of the input codes don't exist. This is because the comorbidity mappings have every possible numerical ICD-9 code, not just the official ones. Could also use warn = FALSE option in explain)

Find cardiac ICD-9 codes

  grepl(pattern = "(heart)|(cardiac)",
        x = c(icd9cm_hierarchy$long_desc, icd9cm_hierarchy$short_desc),
        ignore.case = TRUE),
  "code"] %>% unique -> cardiac

then explain the list, just showing the first ten:

as.icd9(cardiac) %>% explain(warn = FALSE) %>% head(10)

Find comorbidities for a large number of patients

I understand that comorbiditity assignment using SAS is a lengthy business. Let's generate 100,000 patients with a random selection of comorbidities:

# codes selected from AHRQ mapping
many_patients <- icd:::generate_random_pts(1e7)

This takes about five seconds on a 2016 8-core workstation. More detailed benchmarks can be seen comparing icd to other packages in the vignette article in this package.

Use an arbitrary ICD-9 mapping

The user can provide any ICD-9, ICD-10 or other code mapping to comorbidities they wish. Submissions of other peer-reviewed published mappings could be included in this package, if their license permits. Create an issue in github or email me at [email protected]) Included in this package is a small data set called icd9_chapters, which lists the ICD-9-CM (and indeed ICD-9) Chapters. These can easily be expanded out and used as a mapping, so instead of a comorbidity, we see which patients have codes in each chapter of the ICD-9 defintion.

names(icd9_chapters)[c(1:5, 14)]
my_map <- icd:::icd9_chapters_to_map(icd9_chapters[c(2, 5, 14)])
icd9_comorbid(patients_icd9, my_map) # no positive 

Reduce comorbidity mapping from possible values to defined diagnostic codes

Suppose we want to exact match only real ICD-9 codes when looking up comorbdities for some patients. E.g. if the coder accidentally omitted a trailing zero, e.g. code 003.20 (Localized salmonella infection, unspecified) might have been written as 003.2 which has a heading (Localized salmonella infections) but is not itself billable. Use of ICD-9 codes for comorbidities generally assumes the codes are either right or wrong. How do we match only real codes, for a strict interpretation of comorbidities? It's one line or R code:

ahrq_strict <- lapply(icd9_map_ahrq, get_defined)
str(icd9_map_ahrq[1:5]) # first five of the original:
str(ahrq_strict[1:5]) # and first five of the result:

Note the much smaller numbers of codes in each group, now we have discarded all the ones which are not defined as diagnoses.

Which three-character ICD-9 codes have no child codes

The ICD-9-CM scheme is structured as follows: - Chapter - Sub-chapter - Major part (three-digit codes) - sub-division (first decimal place) - sub-sub-division (second decimal place)

For most combinations of zero to nine, nothing is defined. Sometimes, nodes at one level in the hierarchy are descriptive only of their children (branch nodes), whereas some are themselves billable. For this example, let's find those numeric-only codes which have no children, and by implication are themselves directly billable codes. Here are the first ten:

icd9cm_hierarchy$code %>% get_defined -> all_real
# select the non-V and non-E codes
three_digit_real <- all_real[icd9_is_n(all_real)]
# display
three_digit_df <- data.frame(code = three_digit_real, description = explain(three_digit_real, condense = FALSE))
print(three_digit_df[1:10, ], row.names = FALSE)

Which ICD-9 codes have changed between versions of ICD-9

new_since_27 <- setdiff(icd9cm_billable[["32"]][["code"]],
                         icd9cm_billable[["27"]][["code"]]) %>% head
lost_since_27 <- setdiff(icd9cm_billable[["27"]][["code"]],
                         icd9cm_billable[["32"]][["code"]]) %>% tail
# we know this is an ICD-9-CM code, so declare this using nice magrittr motif:
lost_since_27 %<>% as.icd9cm
lost_since_27 %<>% as.icd9cm

# these are a few which were gained since v27
data.frame(code = new_since_27, desc = new_since_27 %>% explain)
# these are a few which were lost since v27
data.frame(code = lost_since_27, desc = lost_since_27 %>% explain)


This package allows fluid, fast and accurate manipulation of ICD-9 and ICD-10 codes, especially when combined with magrittr. Suggestions, contributions and comments are welcome via github.

Try the icd package in your browser

Any scripts or data that you put into this service are public.

icd documentation built on May 9, 2018, 9:04 a.m.