View source: R/extract_timevarying.R
extract_timevarying | R Documentation |
This is the workhorse function of wranglEHR. It transcribes 2d data from CC-HIC to a table with 1 column per data item (plus any metadata, if relevant) and 1 row per time point per patient.
extract_timevarying(
  connection = NULL,
  episode_ids = NA_integer_,
  code_names = NA_character_,
  rename = NA_character_,
  coalesce_rows = dplyr::first,
  chunk_size = 5000,
  cadence = 1,
  time_boundaries = c(-Inf, Inf)
)
connection |
a CC-HIC database connection. |
episode_ids |
an integer vector of episode_ids or NULL. If NULL (the default) then all episodes are extracted. |
code_names |
a string vector of CC-HIC code names to be extracted. |
rename |
a character vector, of the same length as |
coalesce_rows |
a function vector of summary functions to summarise data
that is contributed at a higher frequency than the set |
chunk_size |
an integer scalar. Chunks the extraction process by this
many episodes to help manage memory constraints. The default (5000) works
well for most desktop computers. If RAM is not a major limitation, setting
this to |
cadence |
a numerical scalar >= 0 or the string "timestamp". If a numerical scalar is used, it will describe the base time unit to build each row of the extracted table, in divisions of an hour. For example: 1 = 1 hour, 0.5 = 30 mins, 2 = 2 hourly. If cadence = "timestamp", then the precise datetime will be used to generate the time column. This is likely to generate a large table, so use cautiously. |
time_boundaries |
a numeric vector of length 2 containing the start and
end times (in hours) relative to the ICU admission time, for which the data
extraction should occur. For example, |
The time unit is user definable, and set by the cadence argument. The default behaviour is to produce a table with 1 row per hour per patient. If there are duplicates/conflicts (e.g. more than 1 event for a given hour), then only the first result for that hour is returned. If extracting at a lower cadence than is naturally recorded in the database, one can supply a vector of summary functions to the coalesce_rows argument. These summary functions must *always* return a vector of length 1, in the same data type as their input, and must be able to handle vectors consisting entirely of NAs.
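As an illustration of the requirements above, here is a minimal sketch of a summary function that could be supplied to coalesce_rows. The name last_non_missing is hypothetical and not part of wranglEHR; it simply shows one way to always return a length 1 vector of the input's own type, including when every value is NA.

```r
# Hypothetical helper (not part of wranglEHR): return the last observed
# value, or a typed NA when nothing was observed.
last_non_missing <- function(x) {
  observed <- x[!is.na(x)]
  if (length(observed) == 0L) {
    # x[1L] is an NA of the same type as x (also true for zero-length x)
    return(x[1L])
  }
  observed[length(observed)]
}

last_non_missing(c(1L, NA, 3L, NA))   # integer in, integer out
last_non_missing(c(NA_real_, NA_real_)) # all NA in, typed NA out

# Assumed usage, following the signature documented above (not run here):
# extract_timevarying(connection = con, code_names = "NIHR_HIC_ICU_0108",
#                     cadence = 2, coalesce_rows = last_non_missing)
```

Functions such as mean() can change the data type (an integer input returns a double), so a type-preserving choice like the one sketched here is the safer starting point.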
Many events inside CC-HIC are recorded more often than hourly. Depending upon the chosen analysis, one may wish to modify the cadence. For example, 0.5 will produce a table with 1 row per 30 minutes per patient.
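To make the effect of cadence concrete, the following sketch bins a handful of made-up observations into 30 minute rows and keeps the first value in each bin, mirroring the default dplyr::first behaviour. It is an illustration only: the data are invented, and rounding down to the start of each bin is an assumption, since the exact rounding rule used inside extract_timevarying is not documented here.

```r
# Invented observations: hours since ICU admission and a measured value
obs <- data.frame(
  hours = c(0.1, 0.4, 0.7, 1.2, 1.3),
  value = c(140, 141, 139, 138, 137)
)

cadence <- 0.5  # 30 minute rows

# Assumption: each observation is assigned to the start of its bin
obs$time <- floor(obs$hours / cadence) * cadence

# Keep the first value recorded in each bin
binned <- aggregate(value ~ time, data = obs, FUN = function(v) v[1])
binned
```

With these inputs the five observations collapse to three rows, at time 0, 0.5 and 1.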
Choose the variables to extract wisely. This function is quite efficient considering what it needs to do, but it can take a very long time if extracting lots of data. It is strongly recommended that the database is optimised with indexes prior to using this function. It is sensible to test the extraction with 100 or so patients before committing to a full extraction.
It is possible for this function to produce negative time rows (i.e. rows for events that occurred prior to ICU admission). If, for example, a patient had a measure taken in the hours before they were admitted, then this would be added to the table with a negative time value. As a concrete example, if a patient had a sodium measured at 08:00, and they were admitted to the ICU at 20:00 the same day, then the sodium would be displayed at time = -12. This is normal behaviour and it is left to the end user to determine how best they wish to account for this.
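The arithmetic behind the sodium example, and one possible way of dropping pre-admission rows afterwards, can be sketched as follows. The dates, the extracted data frame and its sodium column are invented for illustration; only the time column is mentioned in this documentation, and its exact name in the real output is assumed.

```r
# Sodium measured at 08:00, ICU admission at 20:00 the same day (invented date)
admission_time <- as.POSIXct("2020-01-01 20:00:00", tz = "UTC")
sodium_time    <- as.POSIXct("2020-01-01 08:00:00", tz = "UTC")

# Time of the measurement relative to admission, in hours
hours_from_admission <- as.numeric(
  difftime(sodium_time, admission_time, units = "hours")
)
hours_from_admission

# Hypothetical stand-in for an extracted table
extracted <- data.frame(
  time   = c(-12, 0, 1, 2),
  sodium = c(138, NA, 140, 141)
)

# One option: keep only rows at or after admission
post_admission <- extracted[extracted$time >= 0, ]
```

Whether to drop, keep or carry forward pre-admission values depends on the analysis, so this filter is only one of several reasonable choices.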
A sparse tibble with one row per time step per patient (hourly at the default cadence) and unique data items as columns. Data items that contain metadata are reallocated to their own columns.
con <- setup_dummy_db()
df <- extract_timevarying(
  connection = con,
  episode_ids = 1:10,
  code_names = "NIHR_HIC_ICU_0108"
)
head(df)
DBI::dbDisconnect(con)