Introduction to CCHIC critical care data

CCHIC data strongly shaped the design of cleanEHR. EHR analysis is trending towards complex, heterogeneous, high-resolution longitudinal data, taking advantage of increasingly sophisticated statistical techniques and growing computational capability. CCHIC is a representative example of such a database. It records 263 fields, including 154 time-varying fields, for patients during their stay in intensive care units across five NHS trusts in England. The recorded variables include patient demographics, times of admission and discharge, survival status, diagnosis, physiology, laboratory results, nursing and drugs. The latest database contains 22,628 admissions from 2014 to 2016, with about 119 million data points (~6k per patient).

An anonymised subset of the data can be obtained here. A selected number of researchers can get access to the identifiable data through UCL IDHS.

Episode and ICU stay timeline

We introduced the concept of an 'episode' as the fundamental entity of EHRs; it comprises all the data recorded during an ICU stay. Each episode also contains the patient's demographic information, the origin and destination of ward transfers within a hospital, and diagnosis information. This information allows us to link the episodes of a single patient across the entire multi-centre database. The key dates and times of the stay are also recorded in each episode.


library(cleanEHR)

# Extract all non-longitudinal data (demographic, time, survival status, diagnosis)
dt <- ccd_demographic_table(ccd, dtype = TRUE)

The ccd_demographic_table function returns a table of all non-time-varying fields alongside several derived fields, that is, fields that are not directly recorded in the original data. Each row of the table is a unique admission, and each column is a non-time-varying data field.

print(dt[1:3, ])

Data can be missing for many reasons. In a database of this size, data quality inevitably varies. In the anonymised dataset, data can also be missing for security reasons. Missing data are either NULL or "NA", depending on the field data type.
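Before analysis it often helps to see how much of each field is missing. The following is a minimal sketch in base R that uses a small made-up table in place of the real demographic table. The column names, the values, and the assumption that missing entries appear either as a real NA or as the strings "NULL" or "NA" are illustrative only.

```r
# Toy stand-in for the demographic table; columns and values are hypothetical.
demo <- data.frame(
  episode_id = c("1", "2", "3", "4"),
  age        = c(63, NA, 71, 58),
  sex        = c("F", "NULL", "M", "NA"),
  stringsAsFactors = FALSE
)

# Assumption: a value is missing if it is a real NA or the string "NULL" or "NA".
is_missing <- function(x) is.na(x) | x %in% c("NULL", "NA")

# Fraction of missing entries in each column.
missing_rate <- vapply(demo, function(col) mean(is_missing(col)), numeric(1))
print(missing_rate)
```

The same helper could be applied to the table returned by ccd_demographic_table, once the actual encoding of missing values in each column has been checked.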

Demographic data

Every patient in England has a unique NHS number and a PAS (Patient Administration System) number. These can be used to identify a unique patient. Other demographic information, such as age, sex, GP code and postcode, can also be found in the CCHIC dataset. Most of the demographic data are removed, pseudonymised, or modified in the anonymised dataset to protect patient privacy.


Ward transfers within or beyond ICUs are each counted as a separate episode. Some research may be interested in the development of the illness and the treatment history beyond a single ICU stay. A spell comprises a number of episodes from a unique patient that occurred within a user-defined period. One can link episodes by spell ID.

head(ccd_unique_spell(ccd, duration=1)[, c("episode_id", "spell")])
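To illustrate the idea of a spell, here is a minimal base R sketch that groups episodes of the same patient into one spell when the gap between a discharge and the next admission is at most a given number of days. It is not the implementation of ccd_unique_spell. The gap-based rule, the function assign_spells, and the toy data are assumptions made for illustration.

```r
# Toy episodes; patient IDs and timestamps are hypothetical.
episodes <- data.frame(
  episode_id = 1:5,
  patient    = c("A", "A", "A", "B", "B"),
  admit      = as.POSIXct(c("2015-01-01 08:00", "2015-01-03 12:00",
                            "2015-02-10 09:00", "2015-01-05 10:00",
                            "2015-01-20 10:00"), tz = "UTC"),
  discharge  = as.POSIXct(c("2015-01-03 10:00", "2015-01-06 15:00",
                            "2015-02-12 09:00", "2015-01-07 10:00",
                            "2015-01-22 10:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assumes a spell breaks when the gap exceeds max_gap_days.
assign_spells <- function(ep, max_gap_days = 1) {
  ep <- ep[order(ep$patient, ep$admit), ]
  n <- nrow(ep)
  # Gap in days between each admission and the previous row's discharge.
  gap <- c(NA, as.numeric(difftime(ep$admit[-1], ep$discharge[-n],
                                   units = "days")))
  # TRUE where the row belongs to a different patient than the previous row.
  new_patient <- c(TRUE, ep$patient[-1] != ep$patient[-n])
  # A new spell starts with a new patient or after a gap that is too long.
  starts <- new_patient | gap > max_gap_days
  ep$spell <- cumsum(starts)
  ep
}

linked <- assign_spells(episodes, max_gap_days = 1)
print(linked[, c("episode_id", "patient", "spell")])
```

In this toy data the first two episodes of patient A are two hours apart and share a spell, while every other episode starts a new one.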

Diagnosis data

Instead of using free text, we adopted the ICNARC diagnosis coding system to record the diagnosis. The full ICNARC code consists of five numeric components separated by dots. Reading from left to right, each component narrows the diagnosis from a broad category to a more specific one. Due to privacy concerns, the last two components are removed in the anonymised dataset. One may use the function icnarc2diagnosis to look up the diagnosis code.
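The truncation described above can be sketched in base R as follows. The helper truncate_icnarc is hypothetical and not part of cleanEHR, and the example codes are made up to show the format only.

```r
# Hypothetical helper: keep the first `keep` dot-separated components of a code.
truncate_icnarc <- function(code, keep = 3) {
  parts <- strsplit(code, ".", fixed = TRUE)
  vapply(parts, function(p) paste(head(p, keep), collapse = "."), character(1))
}

# Made-up codes in the five-component format, for illustration only.
codes <- c("2.1.4.27.1", "1.2.3.4.5")
short_codes <- truncate_icnarc(codes)
print(short_codes)
```

Dropping the last two components keeps the broader diagnostic category while discarding the most specific detail.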


Longitudinal data

CCHIC records 154 longitudinal fields. The full list of longitudinal fields is shown below.

We can also easily plot the data from a single selected admission.

plot_episode(ccd@episodes[[7]], c("h_rate", "bilirubin", "fluid_balance_d"))


cleanEHR documentation built on May 2, 2019, 6:40 a.m.