Working with data from INCA or Rockan can be a pain!
Not only are some formats strange (such as Booleans and dates), the formats sometimes also differ between INCA itself and the exported data. The incadata package aims to streamline the process of reading and using RCC data (from INCA and Rockan).
This vignette will use some example data ex_data found in the package:
library(dplyr)
library(incadata)
dim(ex_data)
It's a data set with many columns covering all types of synthetic INCA data (it is based on real data, but everything is randomized and scrambled so as not to reveal any details about real patients, doctors, hospitals et cetera).
Let's choose a subset of columns for illustrative purposes:
x <- ex_data %>%
  dplyr::select(
    a_lkf, a_inrappdatum, a_inrappsjh, a_inrappklk,
    a_kompl, a_rappSjHemSj_Beskrivning
  )
Now, how are these variables stored?
dplyr::glimpse(x)
We can see that:
- a_inrappdatum looks like a date but is treated as character.
- a_lkf, a_inrappsjh and a_inrappklk look like numerics but are treated as characters.
- a_kompl looks like a Boolean but is a factor.
- a_rappSjHemSj_Beskrivning looks like a factor and is ... a factor :-)

We now want to change these formats to get something more natural to work with.
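Done by hand in base R, such conversions might look like the sketch below. The toy data frame, the assumed date format (YYYY-MM-DD) and the "True"/blank coding of the factor are illustrative assumptions, not taken from ex_data itself.

```r
# Hypothetical toy data mimicking the column types described above
toy <- data.frame(
  a_inrappdatum = c("2015-03-01", "2016-11-24"),
  a_kompl       = factor(c("True", "")),
  stringsAsFactors = FALSE
)

# Character -> Date (assumes ISO formatted dates, "YYYY-MM-DD")
toy$a_inrappdatum <- as.Date(toy$a_inrappdatum, format = "%Y-%m-%d")

# Factor with "True"/blank levels -> logical
toy$a_kompl <- as.character(toy$a_kompl) == "True"

str(toy)
```

Repeating this for every column and every export quickly becomes tedious, which is the kind of work the package is meant to take over.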
as.incadata

as.incadata is one of the main functions of the package. It takes either a single vector or a data frame and converts it to a format more relevant for RCC data.
The output message is quite verbose. This is intended since it is probably a good idea to check that all columns are coerced to reasonable formats.
x2 <- as.incadata(x)
Let's have a closer look at the result:
dplyr::glimpse(x2)
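One rule that helps when reading this output is that number-like text is only treated as numeric when it has no leading zeroes, with the Swedish decimal comma translated to a decimal point. The sketch below illustrates that idea in base R. The helper coerce_number_like() is hypothetical and is not how the package implements it; how it handles edge cases (a lone "0", blanks, very large integers) is an assumption.

```r
# Hypothetical helper, NOT part of incadata: illustrates the rule only
coerce_number_like <- function(x) {
  x <- trimws(x)
  blank <- is.na(x) | x == ""
  y <- sub(",", ".", x, fixed = TRUE)   # Swedish decimal comma -> point
  y[blank] <- NA
  # Integers and decimals without leading zeroes ("0" and "0.5" are allowed)
  is_int <- grepl("^-?(0|[1-9][0-9]*)$", y)
  is_dec <- grepl("^-?(0|[1-9][0-9]*)\\.[0-9]+$", y)
  if (!all(blank | is_int | is_dec)) {
    return(x)                           # e.g. "0123": keep as character
  }
  if (all(blank | is_int)) as.integer(y) else as.numeric(y)
}

coerce_number_like(c("0123", "45"))  # codes with leading zeroes stay character
coerce_number_like(c("1", "20"))     # integer
coerce_number_like(c("1,5", "2"))    # numeric, decimal comma translated
```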
Some things have happened:
- Variable names are converted to lower case (a_rappSjHemSj_Beskrivning -> a_rappsjhemsj_beskrivning). If two (or more) variable names differ only with regard to case, this will be handled adequately.
- a_inrappdatum is now a date! Recognizing dates, especially from Rockan but sometimes also from INCA, has a vignette of its own.
- as.incadata only treats numbers without leading zeroes as numeric (it also distinguishes between integers and decimal numbers, and it translates the Swedish decimal comma to an English decimal point).
- a_kompl is now Boolean. This happens regardless of whether we work in INCA (where Booleans are stored as 0/1) or locally (where the same values are transformed to "True" or blanks).
- There is now an id column pointing to individual patients. This variable will be based on either personal identification number, patient id or a simple row number. The underlying identifier has different names depending on the source (INCA/Rockan), so it is easier to always have an id column with the same name. Also, if a personal identification number is included in the data, it will be checked (by sweidnumbr), while the id column will not.
- There are new columns named a_lkf_xxx_beskrivning. These are all based on the fact that a_lkf is a code variable recognized by the decoder package. It was only included as a numeric (coded) variable in the original data. It has now been supplemented with descriptive names of different regions based on the LKF code.

use_incadata

Another important function from the package is use_incadata. It could be thought of as read.incadata, but it is constructed to work in INCA as well (where the data is already available in a data frame named "df" and is therefore not read from disk).
This function has three main advantages:
- It can (unlike read.csv2 or similar) be used both locally and in INCA, so there is no need to have different scripts for development and production.
- as.incadata might be slow. use_incadata only performs this coercion once and then uses a cache mechanism automatically. If the original data file is changed (a new export from INCA), the cache will be updated automatically after comparison of MD5 checksums. (The cache mechanism is intentionally ignored if calling the function from INCA, where the data should always be fresh.)
- as.incadata is quite verbose (for good reason), but when using the same data over and over again it might not be meaningful to report these messages every time. use_incadata does not repeat them.

Let's use the same data as above. We save the data to disk as a csv2-file to simulate an exported INCA file.
# Save data as csv2 in temp file
fl <- tempfile("ex_data", fileext = ".csv2")
write.csv2(incadata::ex_data, fl, row.names = FALSE)
Let us now use the data for the "first time". The process will be verbose (but we omit the output here just to save space). When working locally, the cache will be saved next to the original file (from where it can be copied or removed like a regular file). We time the process to compare the speed with a later attempt:
system.time( x <- use_incadata(fl) )
Now, let's assume that we for some reason have to restart the process all over again (and let's time it again for the sake of comparison):
system.time( x <- use_incadata(fl) )
Voila! The data is already in a good format and the process was faster than before!
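To make the idea of a checksum-based cache concrete, here is a minimal sketch in base R. It is not the package's implementation: cached_read() is a hypothetical helper, and the ".rds" cache file name and the use of read.csv2 as the slow step are assumptions chosen for illustration.

```r
# Hypothetical helper, NOT an incadata function
cached_read <- function(file, reader = utils::read.csv2) {
  cache_file <- paste0(file, ".rds")      # cache stored next to the original
  md5 <- unname(tools::md5sum(file))      # checksum of the current file

  if (file.exists(cache_file)) {
    cache <- readRDS(cache_file)
    # File unchanged since the cache was written: reuse the stored data
    if (identical(cache$md5, md5)) return(cache$data)
  }

  data <- reader(file)                    # the (potentially slow) step
  saveRDS(list(md5 = md5, data = data), cache_file)
  data
}

# Usage with a small temporary file
tmp <- tempfile(fileext = ".csv2")
write.csv2(data.frame(a = 1:3), tmp, row.names = FALSE)
first  <- cached_read(tmp)  # reads the file and writes the cache
second <- cached_read(tmp)  # served from the cache
```

Because the checksum is recomputed on every call, a new export written to the same path would invalidate the cache without any manual cleanup.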