Work with INCA data

Background

Working with data from INCA or Rockan can be a pain! Some formats are strange (Booleans and dates, for example), and the format of a variable sometimes differs between INCA's internal representation and the exported file. The incadata package aims to streamline the process of reading and using RCC data (from INCA and Rockan).

Example data

This vignette uses the example data set ex_data included in the package:

library(dplyr)
library(incadata)
dim(ex_data)

It's a data set with many columns covering all types of synthetic INCA data. It is based on real data, but everything is randomized and scrambled so that it reveals no details about real patients, doctors, hospitals, et cetera.

Let's choose a subset of columns for illustrative purposes:

x <- 
  ex_data %>% 
  dplyr::select(
    a_lkf,
    a_inrappdatum,
    a_inrappsjh,
    a_inrappklk, 
    a_kompl,
    a_rappSjHemSj_Beskrivning
  )

Now, how are these variables stored?

dplyr::glimpse(x)

We can see that:

We now want to change these formats to get something more natural to work with.

Function as.incadata

as.incadata is one of the main functions of the package. It takes either a single vector or a data frame and converts it to a format more relevant for RCC data.

The output message is quite verbose. This is intended since it is probably a good idea to check that all columns are coerced to reasonable formats.

x2 <- as.incadata(x)

Let's have a closer look at the result:

dplyr::glimpse(x2)

Some things have happened:

Function use_incadata

Another important function in the package is use_incadata. It can be thought of as read.incadata, but it is constructed to also work on INCA (where the data is already available in a data frame named "df" and is therefore not read from disk).
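To illustrate the idea of one script serving both environments, here is a minimal base R sketch. It is not the package's implementation: the helper name get_data and its arguments are hypothetical, and it simply assumes that INCA exposes the data as a data frame called df while a local session has only an exported csv2 file.

```r
# Hypothetical helper (NOT part of incadata): return the data frame `df`
# when it exists, as assumed on INCA, and otherwise read an exported file.
get_data <- function(file = NULL, envir = globalenv()) {
  # inherits = FALSE avoids picking up stats::df (the F density function)
  if (exists("df", envir = envir, inherits = FALSE) &&
      is.data.frame(get("df", envir = envir, inherits = FALSE))) {
    return(get("df", envir = envir, inherits = FALSE))  # data already in memory
  }
  utils::read.csv2(file, stringsAsFactors = FALSE)      # local: read from disk
}
```

The explicit is.data.frame check matters because the name df is otherwise taken by a function in the stats package.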

This function has three main advantages:

  1. It can (in contrast to read.csv2 or similar) be used both locally and in INCA, so there is no need to have different scripts for development and production.
  2. It uses a cache mechanism to increase speed. If the data set is big, as.incadata might be slow. use_incadata performs this coercion only once and then uses the cache automatically. If the original data file is changed (a new export from INCA), the cache is updated automatically after a comparison of MD5 checksums. (The cache mechanism is intentionally ignored when the function is called from INCA, where the data should always be fresh.)
  3. As noted above, the output from as.incadata is quite verbose (for good reason). When the same data is used over and over again, it might not be meaningful to report these messages every time, so use_incadata does not repeat them.
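The caching idea in point 2 can be sketched in base R as follows. This is only an illustration under stated assumptions, not how use_incadata is actually implemented: the function name read_with_cache, the .rds cache file name and the coerce argument (a stand-in for an expensive step such as as.incadata) are all hypothetical.

```r
# Hypothetical helper (NOT part of incadata): read a csv2 file and cache the
# processed result, re-using the cache while the file's MD5 sum is unchanged.
read_with_cache <- function(file, coerce = identity) {
  cache_file <- paste0(file, ".rds")         # assumed cache location, next to the file
  checksum   <- unname(tools::md5sum(file))  # fingerprint of the current file

  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    # Re-use the cached data only if the source file has not changed
    if (identical(cached$checksum, checksum)) {
      return(cached$data)
    }
  }

  # Cache missing or stale: read, coerce and store the result again
  data <- coerce(utils::read.csv2(file, stringsAsFactors = FALSE))
  saveRDS(list(checksum = checksum, data = data), cache_file)
  data
}
```

Storing the checksum together with the data means a new export is detected on the next call without any manual cache invalidation.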

Example

Let's use the same data as above. We save the data to disk as a csv2 file to simulate an exported INCA file.

# Save data as csv2 in temp file
fl <- tempfile("ex_data", fileext = ".csv2")
write.csv2(incadata::ex_data, fl, row.names = FALSE)

Let us now use the data for the "first time". The process will be verbose (but we omit the output here to save space). When working locally, the cache is saved next to the original file (from where it can be copied or removed like a regular file). We time the process to compare its speed with a later attempt:

system.time(
  x <- use_incadata(fl)
)

Now, let's assume that we for some reason have to restart the process all over again (and let's time it again for the sake of comparison):

system.time(
  x <- use_incadata(fl)
)

Voilà! The data is already in a good format, and the process was faster than before!




incadata documentation built on April 14, 2020, 6:08 p.m.