data_env: ICU datasets


The Laboratory for Computational Physiology (LCP) at MIT hosts several large-scale databases of hospital intensive care units (ICUs), two of which can be downloaded either in full (MIMIC-III and eICU) or as demo subsets (MIMIC-III demo and eICU demo), while a third dataset is available only in full (HiRID). While the demo datasets are freely available, downloading the full datasets requires credentialed access, which can be gained by applying for an account with PhysioNet. Even though registration is required, the described datasets are all publicly available. With AmsterdamUMCdb, a data source not hosted on PhysioNet is available as well. As with the PhysioNet datasets, access is public but has to be granted by the data collectors.




The exported data environment contains all datasets that have been made available to ricu. For datasets that are attached during package loading (see attach_src()), shortcuts to the datasets are set up in the package namespace, allowing the object ricu::data::mimic_demo to be accessed as ricu::mimic_demo (or, in case the package namespace has been attached, simply as mimic_demo). Datasets that are made available after the package namespace has been sealed will have their proxy object located in .GlobalEnv by default. Datasets are represented by src_env objects, while individual tables are src_tbl objects, which do not hold in-memory data but refer to data stored on disk, subsets of which can be loaded into memory.


Setting up a dataset for use with ricu requires a configuration object. For the included datasets, configuration can be loaded from

system.file("extdata", "config", "data-sources.json", package = "ricu")

by calling load_src_cfg(), and for datasets that are external to ricu, additional configuration can be made available by setting the environment variable RICU_CONFIG_PATH (for more information, refer to load_src_cfg()). Using the dataset configuration object, data can be downloaded (download_src()), imported (import_src()) and attached (attach_src()). While downloading and importing are one-time procedures, attaching the dataset is repeated every time the package is loaded. Briefly, downloading fetches the raw dataset from the internet (most likely in .csv format), importing consists of some preprocessing to make the data available more efficiently (by converting it to .fst format), and attaching sets up the data for use by the package. For more information on the individual steps, refer to the respective documentation pages.
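The three steps described above can be sketched as a small helper. This is only an illustration under assumptions: it presumes that ricu is installed, that an internet connection is available, and that download_src(), import_src() and attach_src() accept a source name passed as a character string. The helper name setup_demo_src is hypothetical and not part of ricu.

```r
# Sketch only: one-time setup of a data source, followed by attaching it.
# `setup_demo_src` is a hypothetical helper, not a ricu function.
setup_demo_src <- function(src = "mimic_demo") {
  if (!requireNamespace("ricu", quietly = TRUE)) {
    stop("Package 'ricu' is required for this sketch")
  }
  ricu::download_src(src)  # one-time: fetch the raw files (typically .csv)
  ricu::import_src(src)    # one-time: convert the raw files to .fst
  ricu::attach_src(src)    # per session: set the dataset up for use
  invisible(src)
}

# Not run automatically, since it downloads data:
# setup_demo_src("mimic_demo")
```

Wrapping the calls in a function keeps the download from being triggered when the code is merely sourced; the respective documentation pages remain the reference for the actual arguments of each step.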

A dataset that has been successfully made available can be explored interactively by typing its name into the console, and individual tables can be inspected using the $ function. For example, for the MIMIC-III demo dataset and the icustays table, this gives

mimic_demo
#> <mimic_demo_env[25]>
#>         admissions            callout         caregivers        chartevents 
#>         [129 x 19]          [77 x 24]        [7,567 x 4]     [758,355 x 15] 
#>          cptevents              d_cpt    d_icd_diagnoses   d_icd_procedures 
#>       [1,579 x 12]          [134 x 9]       [14,567 x 4]        [3,882 x 4] 
#>            d_items         d_labitems     datetimeevents      diagnoses_icd 
#>      [12,487 x 10]          [753 x 6]      [15,551 x 14]        [1,761 x 5] 
#>           drgcodes           icustays     inputevents_cv     inputevents_mv 
#>          [297 x 8]         [136 x 12]      [34,799 x 22]      [13,224 x 31] 
#>          labevents microbiologyevents       outputevents           patients 
#>       [76,074 x 9]       [2,003 x 16]      [11,320 x 13]          [100 x 8] 
#>      prescriptions procedureevents_mv     procedures_icd           services 
#>      [10,398 x 19]         [753 x 25]          [506 x 5]          [163 x 6] 
#>          transfers 
#>         [524 x 13]
mimic_demo$icustays
#> # <mimic_tbl>: [136 x 12]
#> # ID options:  subject_id (patient) < hadm_id (hadm) < icustay_id (icustay)
#> # Defaults:    `intime` (index), `last_careunit` (val)
#> # Time vars:   `intime`, `outtime`
#>     row_id subjec~ hadm_id icusta~ dbsour~ first_~ last_c~ first_~ last_w~
#>      <int>   <int>   <int>   <int> <chr>   <chr>   <chr>     <int>   <int>
#>   1  12742   10006  142345  206504 carevue MICU    MICU         52      52
#>   2  12747   10011  105331  232110 carevue MICU    MICU         15      15
#>   3  12749   10013  165520  264446 carevue MICU    MICU         15      15
#>   4  12754   10017  199207  204881 carevue CCU     CCU           7       7
#>   5  12755   10019  177759  228977 carevue MICU    MICU         15      15
#> ...
#> 132  42676   44083  198330  286428 metavi~ CCU     CCU           7       7
#> 133  42691   44154  174245  217724 metavi~ MICU    MICU         50      50
#> 134  42709   44212  163189  239396 metavi~ MICU    MICU         50      50
#> 135  42712   44222  192189  238186 metavi~ CCU     CCU           7       7
#> 136  42714   44228  103379  217992 metavi~ SICU    SICU         57      57
#> # ... with 126 more rows, and 3 more variables: intime <dttm>, outtime <dttm>,
#> #   los <dbl>

Table subsets can be loaded into memory, for example using the base::subset() function, which uses non-standard evaluation (NSE) to determine the row subset. This design choice stems from the fact that some tables can have on the order of 10^8 rows, which makes loading full tables into memory an expensive operation. Table subsets loaded into memory are represented as data.table objects. Extending the above example, if only ICU stays corresponding to the patient with subject_id == 10124 are of interest, the respective data can be loaded as

subset(mimic_demo$icustays, subject_id == 10124)
#>    row_id subject_id hadm_id icustay_id dbsource first_careunit last_careunit
#> 1:  12863      10124  182664     261764  carevue           MICU          MICU
#> 2:  12864      10124  170883     222779  carevue           MICU          MICU
#> 3:  12865      10124  170883     295043  carevue            CCU           CCU
#> 4:  12866      10124  170883     237528  carevue           MICU          MICU
#>    first_wardid last_wardid              intime             outtime     los
#> 1:           23          23 2192-03-29 10:46:51 2192-04-01 06:36:00  2.8258
#> 2:           50          50 2192-04-16 20:58:32 2192-04-20 08:51:28  3.4951
#> 3:            7           7 2192-04-24 02:29:49 2192-04-26 23:59:45  2.8958
#> 4:           23          23 2192-04-30 14:50:44 2192-05-15 23:34:21 15.3636
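To make the role of non-standard evaluation more concrete, the following minimal sketch applies base::subset() to an ordinary data.frame. The table stays and its rows are made up for illustration and are not taken from any dataset; a plain data.frame stands in for a src_tbl here, so nothing is read from disk.

```r
# Toy in-memory stand-in for a table; the rows are invented for illustration
stays <- data.frame(
  subject_id = c(10006L, 10124L, 10124L),
  icustay_id = c(1L, 2L, 3L),
  los        = c(1.5, 2.8, 3.5)
)

# NSE: the condition is captured unevaluated and then evaluated with the
# columns of `stays` in scope, so `subject_id` needs no `stays$` prefix
nse <- subset(stays, subject_id == 10124)

# The same row selection written with standard evaluation
se <- stays[stays$subject_id == 10124, ]

print(nse)
```

Because the condition is passed as an unevaluated expression, a method for on-disk tables is in a position to inspect it before any data is read, which is what makes loading only the matching rows possible.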

Much care has been taken to make ricu extensible to new datasets. For example, the publicly available ICU database AmsterdamUMCdb, provided by the Amsterdam University Medical Center, is currently not part of the core datasets of ricu, but code for integrating this dataset is available on GitHub.


The Medical Information Mart for Intensive Care (MIMIC) database holds detailed clinical data from roughly 60,000 patient stays in Beth Israel Deaconess Medical Center (BIDMC) intensive care units between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). For further information, please refer to the MIMIC-III documentation.

The corresponding demo dataset contains the full data of a randomly selected subset of 100 patients from the patient cohort with confirmed in-hospital mortality. The only notable data omission is the noteevents table, which contains unstructured text reports on patients.


More recently, Philips Healthcare and LCP began assembling the eICU Collaborative Research Database as a multi-center resource for ICU data. Combining data of several critical care units throughout the continental United States from the years 2014 and 2015, this database contains de-identified health data associated with over 200,000 admissions, including vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, and treatment information. For further information, please refer to the eICU documentation.

For the demo subset, data associated with over 2,500 unit stays selected from 20 of the larger hospitals is included. An important caveat that applies to the eICU-based datasets is the considerable variability among the large number of hospitals in terms of data availability.


Moving to higher time resolution, HiRID is a freely accessible critical care dataset containing data relating to almost 34,000 patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland. The dataset contains de-identified demographic information and a total of 681 routinely collected physiological variables, diagnostic test results and treatment parameters, collected during the period from January 2008 to June 2016. Depending on the type of measurement, time resolution can be on the order of 2 minutes.


With similar time resolution (for vital sign measurements) as HiRID, AmsterdamUMCdb contains data from 23,000 admissions of adult patients between 2003 and 2016 to the Department of Intensive Care of Amsterdam University Medical Center. In total, nearly 10^9 individual observations consisting of vital signs, clinical scoring systems, device data and lab results, as well as nearly 5*10^6 medication entries, alongside de-identified demographic information corresponding to the 20,000 individual patients, are spread over 7 tables.


With the recent v1.0 release of MIMIC-IV, experimental support has been added in ricu. Building on the success of MIMIC-III, this next iteration contains data on patients admitted to an ICU or the emergency department at BIDMC between 2008 and 2019. Therefore, relative to MIMIC-III, data on patients admitted prior to 2008 (which is stored in a CareVue-based system) has been removed, while data from 2012 onward has been added. This simplifies data queries considerably, as the CareVue/MetaVision data split in MIMIC-III no longer applies. While addition of ED data is planned, this is not part of the initial v1.0 release and is currently not supported by ricu. For further information, please refer to the MIMIC-IV documentation.


Johnson, A., Pollard, T., & Mark, R. (2016). MIMIC-III Clinical Database (version 1.4). PhysioNet.

MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). DOI: 10.1038/sdata.2016.35.

Johnson, A., Pollard, T., Badawi, O., & Raffa, J. (2019). eICU Collaborative Research Database Demo (version 2.0). PhysioNet.

The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG and Badawi O. Scientific Data (2018). DOI:

Faltys, M., Zimmermann, M., Lyu, X., Hüser, M., Hyland, S., Rätsch, G., & Merz, T. (2020). HiRID, a high time-resolution ICU dataset (version 1.0). PhysioNet.

Hyland, S.L., Faltys, M., Hüser, M. et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med 26, 364–373 (2020).

Thoral PJ, Peppink JM, Driessen RH, et al (2020) AmsterdamUMCdb: The First Freely Accessible European Intensive Care Database from the ESICM Data Sharing Initiative.

Elbers, Dr. P.W.G. (Amsterdam UMC) (2019): AmsterdamUMCdb v1.0.2. DANS.

Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2021). MIMIC-IV (version 1.0). PhysioNet.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation (Online). 101 (23), pp. e215–e220.

ricu documentation built on Oct. 31, 2022, 1:08 a.m.