Introducing icd9: working with ICD-9 codes and comorbidities in R

    #library(knitr, warn.conflicts = FALSE) # for opts_chunk only

  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"

patientData <- data.frame(
  visitId = c(1000, 1000, 1000, 1000, 1001, 1001, 1002),
  icd9 = c("40201", "2258", "7208", "25001", "34400", "4011", "4011"),
  poa = c("Y", NA, "N", "Y", "X", "Y", "E"),
  stringsAsFactors = FALSE


Calculate comorbidities, and perform fast and accurate validation, conversion, manipulation, filtering and comparison of ICD-9-CM (clinical modification) codes. ICD-9 codes appear numeric but leading and trailing zeroes, and both decimal and non-decimal "short" format codes exist. The package enables a work flow from raw lists of ICD-9 codes from hospital billing databases to comorbidities. ICD-9 to comorbidity mappings from Quan (Deyo and Elixhauser versions), Elixhauser and AHRQ included.

When calcuating which patients have which comorbidities, the input data is typically structured as follows:



head(vermont_dx[c(1, 6:15)])

Of course, in real life, there are many problems with the data, such is NA entries, out-of-order visitIds, non-existent or invalid ICD-9 codes, etc.. Tools are provided to clean up the mess, and discover the comorbidities before admission or after discharge for these patients (or simply for any list of patients and ICD-9 codes: there is no requirement to work in the hospital admission paradigm.) The optional poa field indicates whether the code was determined to be present on arrival. The implicit default is to ignore it, and give ICD-9 code regardless of POA status, but filtering functions are provided and demonstrated later in this vignette.

The comorbidities can be determined as follows (showing the first few columns for brevity):

icd9ComorbidAhrq(patientData)[, 1:8]


icd9ComorbidQuanDeyo(patientData)[, 1:8]

and things work beautifully using magrittr %>% to chain functions together. This is not a dependency for this package, but is recommended because of the frequent need to chain together icd9 commands, and greater clarity.

patientData %>% icd9FilterPoaYes %>% icd9ComorbidAhrq %>% extract(1:8)

Converting ICD-9 codes between types

ICD-9 codes are usually presented in decimal format (beware, for this is not a number), e.g. 003.21, whereas most electronic records seem to use the short form without a decimal place. These are not interchangeable simply by removing the decimal place, and great care is taken to do this correctly. The functions were also designed to deal with the common problem of incorrectly formatted ICD-9 codes. The assumption is made that short codes of three or fewer characters are describing only the 'major' part: there is no other reasonable interpretation. For example, 020 must be taken to mean 20, not 2.0 or even 0.20. In most cases, when icd9 works on ICD-9 codes, it will convert any codes of fewer than three characters into zero-padded three-digit codes.

icd9DecimalToShort(c("1", "10.20", "100", "123.45"))
icd9ShortToDecimal(c("1", "22", "2244", "1005"))

# similar operations with magrittr, also showing invalid codes
codes <- c("87.65", "9999", "Aesop", -100, "", NA)

Validation of ICD-9 codes

icd9IsValidShort(c("099.17", "-1"))
icd9IsValidDecimal(c("099.17", "-1.1"))
icd9IsValidShort(c("1", "001", "100", "123456", "003.21"))

Validation forces the package user to provide character format ICD-9 codes. If great care is taken, passing some integers could be valid, but given the high chance of mistakes, and the simplicity of dealing entirely with character input, character or factor is enforced. Furthermore, integer values could never include V or E codes (and all ICD-10 codes have non-numeric characters).

#icd9IsValidShort(100) # gives an error

Ranges of ICD-9 codes

These functions generate syntactically valid ICD-9 codes, without including parent codes when the range limit would subset the parent. E.g. "100.99" %i9da% "101.01" does not include 100 or 100.0, both of which imply larger subsets than requested by the range command (i.e. every code up to 100.99). The shorter forms %i9s% and %i9d% return only real codes (i.e. listed in the CMS definitions as either three-digit codes or diagnoses), whereas %i9sa% and %i9da% return all possible syntactically valid ICD-9 codes:

# get all possible codes
"003" %i9sa% "0033" %>% head(9) # show first 9 of 111 values
# just get the ones which correspond to diagnoses (keeping the 3-digit chapters)
"494" %i9s% "4941"

"10099" %i9sa% "10101"
"V10" %i9da% "V10.02"
"E987" %i9da% "E988.1"

# can't range between different types:
# "V10" %i9s% "E800" # throws an error

This is used internally to interpret ranges of ICD-9 codes specified in the literature. Sometimes it is not clear exactly what an ICD-9 range presented in a paper means, but at least we can explicitly decide what should be included in our interpretation, and the ranges can be reused even when the underlying codes may be different, as codes are added and removed from time-to-time, and although the original papers would have been based on their ICD-9 ranges resolving to a specific set of codes, they are likely to be valid for new diagnoses in the given subgroups. Ultimately, this needs detailed attention, but the strategy in \code{icd9} is to give a good best guess, given these limitations.

Another way of specifying ranges are to use function calls. These are exactly equivalent to the %i9s% and %i9d% range operators. This example shows the result when the user specifies a range which would include parents but not all their children:

icd9ExpandRangeShort("4820", "4823") # default, equivalent to %i9s%
icd9ExpandRangeShort("4820", "4823", onlyReal = FALSE)
# see the first few differences (which are by definition not 'real' codes):
setdiff(icd9ExpandRangeShort("4820", "4823", onlyReal = FALSE),
        icd9ExpandRangeShort("4820", "4823")) %>% head

It is easy to find the children of a higher-level ICD-9 code:

# mid-level code
# leaf node has no children
# pneumococcal pneumonia is a three-digit code with no descendants

By adding onlyReal = TRUE, all syntactically valid ICD-9 codes are returned, even if not defined by CMS as diagnoses. This is relevant because of minor coding errors, or coding in a different year to the master list. A planned feature is to allow testing of an ICD-9 code against the valid codes for the year it was entered, but at present only the 2014 master list is used. This means that some older valid codes may no longer be on the list. However, there have been very few changes to ICD-9-CM in the last five years with ICD-10-CM in the wings.

# first ten possible ICD-9 child codes from 391
icd9Children("391", onlyReal = FALSE)[1:10]

Decoding ICD-9 codes to descriptions

There are various ways of extracting the description of the condition described by an ICD-9 code. the icd9Explain group of functions return a data frame with a column for the ICD-9 code, a column for the full length Diagnosis, and a column for the short Description.

icd9Explain("1.0") # 'decimal' format code inferred
icd9Explain("0019") # 'short' format code inferred
# we can be explicit about short vs decimal
icd9Explain("434.00", isShort = FALSE) 
icd9Explain(c("43410","43491"), isShort = TRUE)
#explain top level code with children
"391" %>% icd9Explain # single three-digit code
"391" %>% icd9Children # let's see the child codes
"391" %>% icd9Children %>% icd9Explain # children condensed to parent code
"391" %>% icd9Children %>% icd9Explain(doCondense = FALSE) # prevent condense

Arbitrary named list(s) of codes:

icd9Explain(list(somecodes = c("001", "391.0"), 
                 morecodes = c("001.1", "001.9")))

001 (Cholera) isn't itself a diagnostic code, i.e. leaf node in the hierarchy, but 390 (Rheumatic fever without heart involvement) is. Both are explained correctly:

icd9Explain(list(cholera = "001", rheumatic_heart = "390"))

Now try to explain on a non-existent (but 'valid') ICD-9 code:

s <- icd9ExplainDecimal("001.5") # gives warning

As we have just seen, icd9Explain can convert lists of ICD-9 codes to a human-readable format. Let's apply the icd9Explain to a list of comorbidity ICD-9 codes in one of the commonly-used mappings. This makes comprehending a complicated list much easier. Taking the list for dementia:

length(quanDeyoComorbid[["Dementia"]]) # 133 possible ICD-9 codes
# icd9Explain summarizes these to just two groups:
quanDeyoComorbid[["Dementia"]] %>% icd9Explain(warn = FALSE)
# contrast with:
quanDeyoComorbid[["Dementia"]] %>% icd9Explain(doCondense = FALSE, warn = FALSE)

Use a range with more than two hundred ICD-9 codes (most of them not real):

length("390" %i9da% "392.1")
"390" %i9da% "392.1" %>% icd9Explain(warn = FALSE)

The warnings here are irrelevant because we know that `%i9da% produces codes which do not correspond to diagnoses. However, in other usage, the user would typically expect the ICD-9 codes he or she is using to be diagnostic, hence the default to warn.

Filtering by Present-on-Arrival

This flag is recorded with each ICD-9 code, indicating whether that diagnosis was present on admission. With some caution, codes flagged specifically not POA can be treated as new diseases during an admission.

Present-on-arrival (POA) is typically a factor, or vector of values such as "Y", "N", "X", "E", or NA. Intermediate codes, such as "exempt", "unknown" and NA mean that "yes" is not the same as "not no." This requires four functions to cover the possibilities stored in icd9PoaChoices:


Filter for present-on-arrival being "Y"

patientData %>% icd9FilterPoaYes

Show that yes is not equal to not no (e.g. due to NA in poa field)

patientData %>% icd9FilterPoaNotNo


The comorbidities from different sources are provided as lists. At present only the most recent mapping of ICD-9 codes to comorbidities is provided. See these github issues.

This package contains ICD-9-CM to co-morbidity mappings from several sources, based on either the Charlson or Elixhauser lists of co-morbidities. Updated versions of these lists from AHRQ and Quan et al are included, along with the original Elixhauser mapping . Since some data is provided in SAS source code format, this package has internalfunctions to parse this SAS source code and generate R data structures. This processing is limited to what is needed for this purpose, although may be generalizable and useful in other contexts. Other lists are transcribed directly from the published articles, but interpretation of SAS code used for the original publications is prefererable.

AHRQ comorbidity classification

The AHRQ keeps an updated version of the Elixhauser classification of ICD-9-CM codes into comorbidities, useful for research. They provide the data in the form of SAS code.

#ahrqComorbid <- icd9:::parseAhrqSas() # user doesn't need to do this

Elixhauser co-morbidities

Elixhauser originally devleoped this set of co-morbidities to predict long term mortality based on hospital ICD-9-CM coding records. The AHRQ comorbidities are an updated version of this, however the original Elixhauser have been used in many publications. The ICD-9-CM codes have changed slightly over the years.



Quan's paper looked at indices using both ICD-10 and ICD-9-CM. Quan generated updated ICD-9-CM codes for all 30 of Elixhauser and all 17 of Charlson/Deyo's co-morbidities. Thus there are two 'Quan' comorbidity mappings.



Filter patients and create comorbidities

Take my patients, find the ones where there definitely or maybe was a diagnosis present on admission, then generate comorbidities based on the AHRQ mapping. N.b. NotNo is not the same as Yes because of some exempt, unclassifiable conditions, or NA values for poa.

patientData %>%
  icd9FilterPoaNotNo %>%
  icd9ComorbidAhrq %>%

Compare two comorbidity defintions

We will find the differences between some categories of the original Elixhauser and the updated version by Quan. Just taking the select few comorbidity groups for brevity:

difference <- icd9DiffComorbid(elixComorbid, quanElixComorbid, 
                 names = c("CHF", "PHTN", "HTN", "Valvular"))
# reuslts also returned as data

Which pulmonary hypertension codes are only in Quan's version?

difference$PHTN$only.y %>% icd9GetReal %>% icd9Explain

(Passing through icd9GetReal stops icd9Explain complaining that some of the input codes don't exist. This is because the comorbidity mappings have every possible numerical ICD-9 code, not just the official ones. Could also use warn = FALSE option in icd9Explain)

Find cardiac-related ICD-9 codes:

  grepl(pattern = "(heart)|(cardiac)",
        x = c(icd9Hierarchy$descLong, icd9Hierarchy$descShort), = TRUE),
  "icd9"] %>% unique -> cardiac

then explain the list, just showing the first ten:

cardiac %>% icd9Explain(warn = FALSE) %>% head(10)

Find comorbidities for a large number of patients.

I understand that comorbiditity assignment using SAS is a lengthy business. Let's generate 100,000 patients with a random selection of comorbidities:

# codes selected from AHRQ mapping
many_patients <- icd9:::randomPatients(100000) 

# result in seconds (platform and load dependent, of course)

Don't want to stress CRAN, so no larger numbers demonstrated, but I do consistently get about 5 seconds for 10 million rows of comorbidities on a moderately powerful workstation.

Arbitrary ICD-9 mapping

The user can provide any ICD-9, ICD-10 or other code mapping to comorbidities they wish. Submissions of other peer-reviewed published mappings could be included in this package, if their license permits. Create an issue in github or email me at [email protected]) Included in this package is a small data set called icd9Chapters, which lists the ICD-9-CM (and indeed ICD-9) Chapters. These can easily be expanded out and used as a mapping, so instead of a comorbidity, we see which patients have codes in each chapter of the ICD-9 defintion.

names(icd9Chapters)[c(1:5, 14)]
myMap <- icd9:::icd9ChaptersToMap(icd9Chapters[c(2, 5, 14)])
icd9Comorbid(patientData, myMap) # no +ve 

Reduce comorbidity mapping from possible values to defined diagnostic codes.

Suppose we want to exact match only real ICD-9 codes when looking up comorbdities for some patients. E.g. if the coder accidentally omited a trailing zero, e.g. code 003.20 (Localized salmonella infection, unspecified) might have been written as 003.2 which has a heading (Localized salmonella infections) but is not itself billable. Use of ICD-9 codes for comorbidities generally assumes the codes are either right or wrong. How do we match only real codes, for a strict interpretation of comorbidities? It's one line or R code:

ahrqStrict <- lapply(ahrqComorbid, icd9GetReal)
str(ahrqComorbid[1:5]) # first five of the original:
str(ahrqStrict[1:5]) # and first five of the result:

Note the much smaller numbers of codes in each group, now we have discarded all the ones which are not defined as diagnoses.

Which three-character ICD-9 codes have no subcodes

The ICD-9-CM scheme is structured as follows: - Chapter - Sub-chapter - Major part (3 digit codes) - sub-division (1st decimal place) - sub-sub-division (2nd decimal place)

For most combinations of 0 to 9, nothing is defined. Sometimes, nodes at one level in the hierarchy are descriptive only of their children (branch nodes), whereas some are themselves billable. For this example, let's find those numeric-only codes which have no children, and by implication are themselves directly billable codes. Here are the first ten:

allreal <- icd9::icd9Hierarchy[["icd9"]]
# select the non-V and non-E codes with three characters (zeroes already prefixed)
threedigitreal <- allreal[nchar(allreal) == 3 & icd9IsN(allreal)]
# display
threedigitdf <- data.frame(icd9 = threedigitreal,  description = icd9Explain(threedigitreal))
print(threedigitdf[1:10, ], row.names = FALSE)

Which ICD-9 codes have changed between versions of ICD-9?

new_since_27 <- setdiff(icd9Billable[["32"]][["icd9"]],
                         icd9Billable[["27"]][["icd9"]]) %>% head
lost_since_27 <- setdiff(icd9Billable[["27"]][["icd9"]],
                         icd9Billable[["32"]][["icd9"]]) %>% tail

# these are a few which were gained since v27
data.frame(icd9 = new_since_27, desc = new_since_27 %>% icd9Explain)
# these are a few which were lost since v27
data.frame(icd9 = lost_since_27, desc = lost_since_27 %>% icd9Explain)


This package allows fluid, fast and accurate manipulation of ICD-9 codes, especially when combined with magrittr. Suggestions, contributions and comments are welcome via github.

Try the icd9 package in your browser

Any scripts or data that you put into this service are public.

icd9 documentation built on May 30, 2017, 2:25 a.m.