README.md

inspectEHR

Overview

inspectEHR is a data wrangling, cleaning and reporting tool for CC-HIC. It is designed to run against the CC-HIC EAV table structure (which at present exists in PostgreSQL and SQLite flavours). We are about to undertake a major rewrite to target the OHDSI CDM version 6, so this package will be in flux. Once these functions have been ported across and tested, we will aim to submit to CRAN. Please see the R vignettes for further details on how to use the package to perform the most common tasks.

Installation

# Install directly from GitHub with
remotes::install_github("cc-hic/inspectEHR")

A copy should already be installed into the group library for the CC-HIC team inside the UCL safe haven. If you are having problems with this, please contact me directly.

Usage

A synthetic database with 1000 patients ships with inspectEHR for you to explore. The actual values are garbage, but everything is logically consistent (e.g. patients are discharged after they arrive). The code to produce a more comprehensive synthetic test database is found in data-raw/write_synthetic_data.R. We have embedded only the first 1000 patients so that the synthetic database does not become cumbersome.

library(inspectEHR)

# Synthetic database ships with inspectEHR
db_pth <- system.file("testdata/synthetic_db.sqlite3", package = "inspectEHR")
ctn <- connect(sqlite_file = db_pth)

# Extract static variables. Rename on the fly.
dtb <- extract_demographics(
  connection = ctn,
  episode_ids = 13639:13643,
  code_names = c("NIHR_HIC_ICU_0017", "NIHR_HIC_ICU_0019"),
  rename = c("height", "weight")
)

head(dtb)
#> # A tibble: 5 x 2
#>   episode_id height
#>        <int>  <dbl>
#> 1      13639   150.
#> 2      13640   173.
#> 3      13641   143.
#> 4      13642   156.
#> 5      13643   168.

# Extract time varying variables. Rename on the fly.
ltb <- extract_timevarying(
  ctn,
  episode_ids = 13639:13643,
  code_names = "NIHR_HIC_ICU_0108",
  rename = "hr")

head(ltb)
#> # A tibble: 6 x 3
#>    time    hr episode_id
#>   <dbl> <int>      <int>
#> 1     0    99      13639
#> 2     1    84      13639
#> 3     2   102      13639
#> 4     3    95      13639
#> 5     4    69      13639
#> # … with 1 more row

# Extract at any temporal resolution and define how to handle values
# recorded at a higher resolution than you are sampling.
# Here we only extract the first 24 hours of data.

ltb_2 <- extract_timevarying(
  ctn,
  episode_ids = 13639:13643,
  code_names = "NIHR_HIC_ICU_0108",
  rename = "hr",
  cadance = 2, # 1 row every 2 hours
  coalesce_rows = mean, # use mean to downsample to our 2 hour cadence
  time_boundaries = c(0, 24)
  )

head(ltb_2)
#> # A tibble: 6 x 3
#>    time    hr episode_id
#>   <dbl> <int>      <int>
#> 1     0    99      13639
#> 2     2   102      13639
#> 3     4    95      13639
#> 4     6    90      13639
#> 5     8    89      13639
#> # … with 1 more row
DBI::dbDisconnect(ctn)
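
The downsampling step above can be illustrated without a database. The following is a minimal sketch in base R, not the package's implementation: `downsample()` is a hypothetical helper, the data are made up, and it assumes that each observation is assigned to the window it falls in (by rounding its time down to a multiple of the cadence) before the rows in each window are collapsed with the supplied function.

```r
# Toy hourly heart-rate series for one episode (values are made up).
obs <- data.frame(
  episode_id = 1L,
  time = 0:5,                        # hours since the start of the episode
  hr = c(90, 100, 80, 84, 70, 74)
)

# Hypothetical helper, NOT part of inspectEHR.
# Assumption: an observation belongs to the window starting at the largest
# multiple of `cadence` that is <= its time; rows sharing a window are then
# collapsed with `coalesce_fun`.
downsample <- function(df, cadence, coalesce_fun = mean) {
  df$time <- floor(df$time / cadence) * cadence
  out <- aggregate(hr ~ episode_id + time, data = df, FUN = coalesce_fun)
  out[order(out$episode_id, out$time), ]
}

# One row every 2 hours, using the mean of the values in each window.
ds <- downsample(obs, cadence = 2)
print(ds)
```

With these toy values, the windows starting at 0, 2 and 4 hours hold the means 95, 82 and 72. Swapping `mean` for another summary such as `median` or `max` changes how values inside a window are combined, which is the role the `coalesce_rows` argument plays in the call above.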

Getting help

If you find a bug, please file an issue with a minimal reproducible example on GitHub.

Reporting Data Quality Issues

Please submit an issue and tag it with “data quality”. Data quality issues are often related to a specific site. If this is the case, please also tag the site.

Data Quality Rules

The data quality rules are largely based upon standards set by OHDSI [1], and the consensus guidelines by Kahn et al. [2]. The CC-HIC currently uses an episode centric model. As such, many of the data quality checks are based around this way of thinking. As we move to OMOP (a patient centric model) many of these will change accordingly. General conventions follow that:

Episode Verification

Event/Value Verification

Event Validation

The same principles listed above can be retested under a validation framework, that is, by seeking an external resource to validate the values in question. One aim is to check the data against the ICNARC reports, which will allow for some external validation against a gold standard resource. At present, the only validation performed is to compare all sites against each other. In this sense, each site is used as a validation check against the others. Discrepancies should be due either to systemic errors or to differences in casemix.
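
As a rough illustration of using each site as a check against the others, the sketch below compares the distribution of one variable between every pair of sites. It is an assumption-laden example rather than the method inspectEHR uses: the site names and values are simulated, and the two-sample Kolmogorov-Smirnov test (`ks.test()` from the stats package) is chosen here only as one plausible way to flag sites whose distributions differ.

```r
set.seed(42)

# Simulated values for three hypothetical sites. Site "C" is deliberately
# shifted to mimic a systemic recording difference.
hr <- data.frame(
  site = rep(c("A", "B", "C"), each = 200),
  value = c(rnorm(200, 80, 10), rnorm(200, 80, 10), rnorm(200, 120, 10)),
  stringsAsFactors = FALSE
)

# Build every pair of sites, then run a two-sample KS test on each pair.
site_pairs <- combn(unique(hr$site), 2, simplify = FALSE)

cmp <- do.call(rbind, lapply(site_pairs, function(p) {
  x <- hr$value[hr$site == p[1]]
  y <- hr$value[hr$site == p[2]]
  res <- ks.test(x, y)
  data.frame(
    site_1 = p[1],
    site_2 = p[2],
    statistic = unname(res$statistic),  # maximum distance between the ECDFs
    p_value = res$p.value,
    stringsAsFactors = FALSE
  )
}))

print(cmp)
```

In this simulation the pairs involving site "C" show a large test statistic and a very small p-value. A flagged pair does not say which explanation applies: a difference in casemix can produce the same signal as a systemic error, so flagged sites still need manual review.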

Synthetic Database

There is a copy of the CC-HIC database located in data-raw/synthetic_db.sqlite3. This is a structurally sound copy of the real database, but it is entirely hand-crafted (so there is no patient data here). It is quite sparse at present, but I will add more variables as time goes on. For now, it acts as a test database to make sure inspectEHR works as it should.

  1. https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization/
  2. Kahn, Michael G.; Callahan, Tiffany J.; Barnard, Juliana; Bauck, Alan E.; Brown, Jeff; Davidson, Bruce N.; Estiri, Hossein; Goerg, Carsten; Holve, Erin; Johnson, Steven G.; Liaw, Siaw-Teng; Hamilton-Lopez, Marianne; Meeker, Daniella; Ong, Toan C.; Ryan, Patrick; Shang, Ning; Weiskopf, Nicole G.; Weng, Chunhua; Zozus, Meredith N.; and Schilling, Lisa (2016) “A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data,” eGEMs (Generating Evidence & Methods to improve patient outcomes): Vol. 4: Iss. 1, Article 18.


CC-HIC/inspectEHR documentation built on Jan. 16, 2020, 11:24 p.m.