Working with data from INCA or Rockan can be a pain!
Not only are some formats strange (such as Boolean and dates), sometimes the formats also differ internally in INCA compared to after exportation. The incadata
package is aimed to streamline the process of reading and using RCC data (from INCA and Rockan).
This vignette will use some example data ex_data
found in the package:
library(dplyr) library(incadata) dim(ex_data)
It's a data set with many columns with all types of synthetic INCA-data (it is based on real data but everything is randomized and scrambled not to give any details about real patients, doctors, hospitals et cetera).
Le's chose a subset of columns for illustrative purpose:
x <- ex_data %>% dplyr::select( a_lkf, a_inrappdatum, a_inrappsjh, a_inrappklk, a_kompl, a_rappSjHemSj_Beskrivning )
Now, how are these variables stored?
dplyr::glimpse(x)
We can see that:
a_inrappdatum
looks like a date but is treated as charactera_lkf
, a_inrappsjh
and a_inrappklk
look like numerics but are treated as characters.a_kompl
looks like a Boolean but is a factora_rappSjHemSj_Beskrivning
looks like a factor and is ... a factor :-)We now want to change these formats to get something more natural to work with.
as.incadata
as.incadata
is one of the main functions of the package. It takes either a single vector or a data frame and converts it to a format more relevant for RCC data.
The output message is quite verbose. This is intended since it is probably a good idea to check that all columns are coerced to reasonable formats.
x2 <- as.incadata(x)
Let's have a closer look at the result:
dplyr::glimpse(x2)
Some things have happened:
a_rappSjHemSj_Beskrivning
-> a_rappsjhemsj_beskrivning
). If two (or more) variable names differ only with regard to case, this will be handled adequately.a_inrappdatum
is now a date! To recognize dates, especially from Rockan, but sometimes also from INCA has a vignette of its own.as.incadata
therefore only treat numbers with non-leading zeroes as numeric (it also distinguish between integers and decimal numbers and it translates the Swedish decimal coma to an English decimal point.a_kompl
is now Boolean and this will happen regardless if we work on INCA (where Booleans are stored as 0/1 or locally where the same values are transformed to "True" or blanks).id
column pointing to individual patients. This variable will be based on either personal identification number, patient id or a simple row number. The idea is that this variable have different names depending on the source (INCA/Rockan) and it is easier to always have an id column with the same name. Also if a personal identification number is included in the data, this will be checked (by sweidnumbr), while the id column will not.a_lkf_xxx_beskrivning
. These are all based on the fact that a_lkf
is a code variable recognized by the decoder
package. It was only included as a numeric (coded) variable in the original data. It has now been supplemented with descriptive names of different regions based on the LKF code.use_incadata
Another important function from the package is use_incadata
. It could be thought of as read.incadata
but it is constructed to work also on INCA (where the data is already available in a data frame named "df" and therefore not read from disk).
This function has three main advantages:
read.csv2
or similar) be used both locally and in INCA so there is no need to have different scripts for development and production. as.incadata
might be slow. use_incadata
only perform this coercion once, and then use a cache mechanism automatically. If the original data file is changed (a new export from INCA), the cache will be updated automatically after comparison of MD5 check sums. (The cache mechanism is intentionally ignored if calling the function from INCA, where the data should always be fresh).as.incadata
is quite verbose (for good reason) but if using the same data over and over again, it might not be meaningful to report these messages every time, which use_incadata
does not.Let's use the same data as above. We save the data to disk as a csv2-file to simulate an exported INCA file.
# Save data as csv2 in temp file fl <- tempfile("ex_data", fileext = ".csv2") write.csv2(incadata::ex_data, fl, row.names = FALSE)
Let us now use the data for the "first time". The process will be verbose (but we omit it here just to save space). When working locally, the cache will be saved next to the original file (from where it can be copied or removed as a regular file). We time the process to compare the speed with later attempt:
system.time( x <- use_incadata(fl) )
Now, let's assume that we for some reason has to restart the process all over again (and let´s time it again for the sake of comparison):
system.time( x <- use_incadata(fl) )
Voila! Data is already in a good format and process was faster than before!
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.