extract_timevarying: Extract & Reshape Timevarying Dataitems
In DocEd/wranglEHR: Standardised Data Extraction and Wrangling

extract_timevarying

R Documentation

Extract & Reshape Timevarying Dataitems

Description

This is the workhorse function of wranglEHR. It transcribes 2d (time-varying) data from CC-HIC to a table with 1 column per dataitem (plus any metadata, where relevant) and 1 row per time unit per patient.

Usage

extract_timevarying(
  connection = NULL,
  episode_ids = NA_integer_,
  code_names = NA_character_,
  rename = NA_character_,
  coalesce_rows = dplyr::first,
  chunk_size = 5000,
  cadence = 1,
  time_boundaries = c(-Inf, Inf)
)

Arguments

`connection`	a CC-HIC database connection.
`episode_ids`	an integer vector of episode_ids. If not supplied (the default, `NA_integer_`), all episodes are extracted.
`code_names`	a character vector of CC-HIC code names to be extracted.
`rename`	a character vector, of the same length and in the same order as `code_names`, giving new names for the extracted CC-HIC dataitems. If not supplied (the default, `NA_character_`), the original code names are retained.
`coalesce_rows`	a vector of summary functions used to summarise data that is recorded at a higher frequency than the set `cadence`. Must be the same length as, and in the same order as, `code_names`.
`chunk_size`	an integer scalar. Chunks the extraction process by this many episodes to help manage memory constraints. The default (5000) works well for most desktop computers. If RAM is not a major limitation, setting this to `Inf` may improve performance.
`cadence`	a numerical scalar >= 0 or the string "timestamp". If a numerical scalar is used, it sets the base time unit, in hours, for each row of the extracted table. For example: 1 = hourly, 0.5 = every 30 minutes, 2 = every 2 hours. If cadence = "timestamp", the precise datetime is used to generate the time column. This is likely to generate a large table, so use it cautiously.
`time_boundaries`	a numeric vector of length 2 containing the start and end times (in hours) relative to the ICU admission time, for which the data extraction should occur. For example, `c(0, 24)` will return the first 24 hours of data after admission. The default `c(-Inf, Inf)` will return all data.
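
To illustrate what chunking by `chunk_size` means, the following is a minimal base R sketch that splits a vector of episode ids into groups of at most 5000. It is only an illustration of the idea, not the package's internal implementation; the ids and the `chunks` object are hypothetical.

```r
# Hypothetical episode ids; in practice these would come from the database
episode_ids <- 1:12000
chunk_size <- 5000

# Assign each id to a chunk number, then split the ids by that number
chunks <- split(episode_ids, ceiling(seq_along(episode_ids) / chunk_size))

# Number of episodes in each chunk
lengths(chunks)
```

Each chunk could then be extracted in turn, so that only one chunk's worth of data needs to be held in memory at a time.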

Details

The time unit is user definable, and is set by the cadence argument. The default behaviour is to produce a table with 1 row per hour per patient. If there are duplicates/conflicts (e.g. more than 1 event for a given hour), then only the first result for that hour is returned. If extracting at a lower frequency than the data are recorded in the database, one can supply a vector of summary functions to the coalesce_rows argument. These summary functions must *always* return a vector of length 1, of the same data type as their input, and must be able to handle vectors consisting entirely of NAs.
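
As a rough illustration of these three requirements, here is a sketch of a summary function that could be considered for `coalesce_rows`. The name `safe_max` is hypothetical and not part of wranglEHR; it simply wraps `max()` so that an all-NA (or empty) input yields a single typed NA instead of `-Inf` with a warning.

```r
# Hypothetical helper (not part of wranglEHR): a maximum that always returns
# a length-1 value of the same type as its input, even when every value is NA
safe_max <- function(x) {
  if (all(is.na(x))) {
    # x[1] is an NA of the same type as x (also true for zero-length x)
    return(x[1])
  }
  max(x, na.rm = TRUE)
}

safe_max(c(1, 5, NA))       # largest non-missing value
safe_max(c(NA_real_, NA))   # a single numeric NA
```

By contrast, a bare `max(x, na.rm = TRUE)` returns `-Inf` and emits a warning when every value is NA, which is why the all-NA case is handled separately here.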

Many events inside CC-HIC occur more often than hourly. Depending upon the chosen analysis, one may wish to modify the cadence. A cadence of 0.5, for example, will produce a table with 1 row per 30 minutes per patient.
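
The effect of a 0.5 cadence can be sketched on toy data in base R. This is an assumption-laden illustration: the data frame and its values are made up, and flooring each observation to the start of its time bin is assumed for the purpose of the example; the package's own rounding rule may differ.

```r
# Toy observations (hypothetical): hours since ICU admission and a measurement
obs <- data.frame(
  hours = c(0.1, 0.4, 0.7, 1.2, 1.9),
  heart_rate = c(80, 82, 85, 90, 88)
)
cadence <- 0.5

# Assumption: each observation is assigned to the start of its time bin
obs$time <- floor(obs$hours / cadence) * cadence

# Keep the first observation in each bin, mirroring the default dplyr::first
binned <- obs[!duplicated(obs$time), c("time", "heart_rate")]
binned
```

Here the observations at 0.1 and 0.4 hours fall into the same 30 minute bin, so only the first of them is kept.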

Choose the variables to extract wisely. This function is quite efficient considering what it needs to do, but it can take a very long time if extracting lots of data. It is strongly recommended that the database is optimised with indexes prior to using this function. It is sensible to test the extraction with 100 or so patients before committing to a full extraction.

It is possible for this function to produce negative time rows (i.e. rows for events that occurred prior to ICU admission). If, for example, a patient had a measure taken in the hours before they were admitted, then this would be added to the table with a negative time value. As a concrete example, if a patient had a sodium measured at 08:00, and they were admitted to the ICU at 20:00 the same day, then the sodium would be displayed at time = -12. This is normal behaviour, and it is left to the end user to determine how best to account for it.
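
The sodium example can be reproduced with base R date arithmetic. The dates below are arbitrary, and the toy table with a `time` column is a hypothetical stand-in for an extracted table; dropping pre-admission rows is shown only as one possible way of handling them.

```r
# Arbitrary example date; only the 12 hour gap matters
admission   <- as.POSIXct("2020-01-01 20:00:00", tz = "UTC")
sodium_time <- as.POSIXct("2020-01-01 08:00:00", tz = "UTC")

# Time of the measurement relative to ICU admission, in hours
rel_hours <- as.numeric(difftime(sodium_time, admission, units = "hours"))
rel_hours

# Hypothetical extracted table with one pre-admission row
toy <- data.frame(
  time = c(rel_hours, 0, 1, 2),
  sodium = c(138, NA, 140, NA)
)

# One option: keep only rows at or after admission
post_admission <- toy[toy$time >= 0, ]
```

Whether to drop, keep, or carry forward pre-admission values depends on the analysis.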

Value

A sparse tibble with 1 row per time unit (hourly by default) per patient, and unique data items as columns. Data items that contain metadata have that metadata placed in separate columns.

Examples

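# Set up a connection to a dummy database
con <- setup_dummy_db()

# Extract one dataitem for episodes 1 to 10 at the default hourly cadence
df <- extract_timevarying(
  connection = con,
  episode_ids = 1:10,
  code_names = "NIHR_HIC_ICU_0108"
)
head(df)

# Close the connection when finished
DBI::dbDisconnect(con)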

DocEd/wranglEHR documentation built on May 28, 2022, 1:50 p.m.