Work with INCA data"
In incadata: Recognize and Handle Data in Formats Used by Swedish Cancer Centers

Background

Working with data from INCA or Rockan can be a pain! Not only are some formats strange (such as Boolean and dates), sometimes the formats also differ internally in INCA compared to after exportation. The incadata package is aimed to streamline the process of reading and using RCC data (from INCA and Rockan).

Example data

This vignette will use some example data ex_data found in the package:

library(dplyr)
library(incadata)
dim(ex_data)

It's a data set with many columns with all types of synthetic INCA-data (it is based on real data but everything is randomized and scrambled not to give any details about real patients, doctors, hospitals et cetera).

Le's chose a subset of columns for illustrative purpose:

x <- 
  ex_data %>% 
  dplyr::select(
    a_lkf,
    a_inrappdatum,
    a_inrappsjh,
    a_inrappklk, 
    a_kompl,
    a_rappSjHemSj_Beskrivning
  )

Now, how are these variables stored?

dplyr::glimpse(x)

We can see that:

a_inrappdatum looks like a date but is treated as character
a_lkf, a_inrappsjh and a_inrappklk look like numerics but are treated as characters.
a_kompl looks like a Boolean but is a factor
a_rappSjHemSj_Beskrivning looks like a factor and is ... a factor :-)

We now want to change these formats to get something more natural to work with.

Function `as.incadata`

as.incadata is one of the main functions of the package. It takes either a single vector or a data frame and converts it to a format more relevant for RCC data.

The output message is quite verbose. This is intended since it is probably a good idea to check that all columns are coerced to reasonable formats.

x2 <- as.incadata(x)

Let's have a closer look at the result:

dplyr::glimpse(x2)

Some things have happened:

All variable names are transformed to lower case since these are generally easier to work with (note especially a_rappSjHemSj_Beskrivning -> a_rappsjhemsj_beskrivning). If two (or more) variable names differ only with regard to case, this will be handled adequately.
a_inrappdatum is now a date! To recognize dates, especially from Rockan, but sometimes also from INCA has a vignette of its own.
Some numeric variables are now numeric but some are (still) treated as character. The problem with some INCA unit codes is that a leading zero might bear meaning. To treat such a variable as numeric would drop the zero and the codes would be messed up. as.incadata therefore only treat numbers with non-leading zeroes as numeric (it also distinguish between integers and decimal numbers and it translates the Swedish decimal coma to an English decimal point.
a_kompl is now Boolean and this will happen regardless if we work on INCA (where Booleans are stored as 0/1 or locally where the same values are transformed to "True" or blanks).
There is a new id column pointing to individual patients. This variable will be based on either personal identification number, patient id or a simple row number. The idea is that this variable have different names depending on the source (INCA/Rockan) and it is easier to always have an id column with the same name. Also if a personal identification number is included in the data, this will be checked (by sweidnumbr), while the id column will not.
There are also some new columns names like a_lkf_xxx_beskrivning. These are all based on the fact that a_lkf is a code variable recognized by the decoder package. It was only included as a numeric (coded) variable in the original data. It has now been supplemented with descriptive names of different regions based on the LKF code.
There are no factor variables! Factor variables can sometimes be useful but often not. We chose to avoid them by default.

Function `use_incadata`

Another important function from the package is use_incadata. It could be thought of as read.incadata but it is constructed to work also on INCA (where the data is already available in a data frame named "df" and therefore not read from disk).

This function has three main advantages:

It can (in contrast to read.csv2 or similar) be used both locally and in INCA so there is no need to have different scripts for development and production.
It uses a cache mechanism to increase speed. If the data set is big, the use of as.incadata might be slow. use_incadata only perform this coercion once, and then use a cache mechanism automatically. If the original data file is changed (a new export from INCA), the cache will be updated automatically after comparison of MD5 check sums. (The cache mechanism is intentionally ignored if calling the function from INCA, where the data should always be fresh).
Also, as noted above, the output from as.incadata is quite verbose (for good reason) but if using the same data over and over again, it might not be meaningful to report these messages every time, which use_incadata does not.

Example

Let's use the same data as above. We save the data to disk as a csv2-file to simulate an exported INCA file.

# Save data as csv2 in temp file
fl <- tempfile("ex_data", fileext = ".csv2")
write.csv2(incadata::ex_data, fl, row.names = FALSE)

Let us now use the data for the "first time". The process will be verbose (but we omit it here just to save space). When working locally, the cache will be saved next to the original file (from where it can be copied or removed as a regular file). We time the process to compare the speed with later attempt: