In acobos/ox: Reads OpenClinica xml export files

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = ">"
)

The ox package provides functions to get OpenClinica data into R. OpenClinica offers several export file formats, the most complete ones being the CDISC ODM formats.

The ox package has been developped and tested with the three availabe ODM 1.3 export formats in OpenClinica version 3.3:

CDISC ODM XML 1.3 Clinical Data
CDISC ODM XML 1.3 Clinical Data with OpenClinica extensions
CDISC ODM XML 1.3 Full with OpenClinica extensions

To use the ox package, you will need to dowload on of these file formats (.zip file), and unzip the downloaded file.

In the following, we assume some familiarity with OpenClinica OIDs. In an attempt to follow the tidyverse style guide, we named objects using snake_case (lower case letters, and underscores to separate words). OpenClinica uses CamelCase style instead. So, StudyOID in OpenClinica becomes study_oid in ox, and so on. Less obvious equivalences between OpenClinica OIDs (and other terms used in ODM XML 1.3 export files) and ox object names are:

library(dplyr)
data.frame(OpenClinica = c("StudySubjectID", "Status", "StudyEventOID", "StudyEventRepeatKey", "Status.1",
  "Version", "ItemGroupOID", "ItemGroupRepeatKey"),
  ox = c("subject_id", "subject_status", "event_oid", "event_repeat_key", "form_status",
  "form_version", "group_oid", "group_repeat_key")) %>% knitr::kable()

Quick start

First, load the XML package and parse the xml file with function xmlParse() from this package.

# your (unziped) xml export file
xml_file <- system.file("extdata",
                        "odm1.3_full_example.xml",
                        package = "ox",
                        mustWork = TRUE)


library(XML)                    # load the XML package 
doc <- xmlParse(xml_file)       # parse the xml file

Then, load the ox package, and use function ox_all() (passing it the parsed xml file as argument) to create an ox_all object. This is a VERY slow process, even for small studies. A progress bar, and some messages, will be displayed in the console (not shown in the example below).

library(ox)                     # load the ox package
d <- ox_all(doc)                # create an ox_all object
class(d)                        # see its class

The result is an object of class ox_all, which is nothing but a specialized list, containing all the relevant data and metadata (for details on its contents, see section on ox_all objects below).

This might be enough to meet the needs of some users, but many others will prefer to have the data in the more familiar format of tidy dataframes of related items. In OpenClinica, related items are organised in item groups, each of which is identified in the ox_all object by a group_oid. Let's see what are the `group_oid``s in the data:

unique(d$data$group_oid)      # to access the group_oid's

Note that one of the group_oid's is "IG_DEMO_DEMOGRAPHICDATA". The following code uses function ox_xtract_group() to get a tidy dataframe with all items in the specified group. The first seven variables in this dataframe are keys identifying the site (study_oid), subject (subject_key), event (event_oidand event_repeat_key), form (form_oid), item group (group_oid), and repetition number (group_repeat_key). Because demographic data was (as usually is) recorded just once for each patient, all keys but the study_oid and the subject_key are constant in this dataframe. We do not show these constant keys, just to keep the output more readable.

demo <- ox_xtract_group(d, group = "IG_DEMO_DEMOGRAPHICDATA")

class(demo)

names(demo)

# a subset of demo (variables)constant keys not shown)
head(demo[, c(1,3,9,10)])

The last two variables, I_DEMO_DEMO_AGE and I_DEMO_DEMO_MENSTRUAL, are the actual items in this item group, and have been named using the item_oid. This is default behaviour of the ox_xtract_group() function, but we can use their item_name instead:

demo <- ox_xtract_group(d, group = "IG_DEMO_DEMOGRAPHICDATA",
                        use_item_names = TRUE)

head(demo[,c(1,3,9,10)])

Last, we may want to have factors defined for items having an associated codelist, as is the case of the I_DEMO_DEMO_MENSTRUAL item.

demo <- ox_xtract_group(d, group = "IG_DEMO_DEMOGRAPHICDATA",
                        use_item_names = TRUE,
                        define_factors = TRUE)

head(demo[,c(1,3,9,10)])

class(demo$demo_menstrual)

levels(demo$demo_menstrual)

Tidy dataframes for all

Very likely, you want a tidy dataframe for each and every item group. The following code does the job.

# create a vector with all the group_oid's in the data 
grps <- unique(d$data$group_oid)

# create an empty list of same length as grps to collect results 
res <- vector("list", length(grps)) 

# loop over grps and extract a dataframe for each
for (i in 1:length(grps)) {

  # get dataframe for the i-th grps, using item names, defining factors, 
  # and save as i-th res element 
  res[[i]] <- ox_xtract_group(d, grps[i], TRUE, TRUE)

  # name the i-th res as the i-th grps 
  names(res)[i] <- grps[i]
}

Note we got the warnings on Unknown columns: ... . This is because some items have no data at all, and are excluded from the dataframe of the corresponding group.

You can now access any dataframe in the res list. Let's see, for instance, a bit of the dataframe containing demo data. If you dislike having all dataframes in a list, and prefer to have them all in the Global Environment, use list2env().

# a bit of the demo data 
head(res$IG_DEMO_DEMOGRAPHICDATA)[,c(1,3,9,10)]

# res list elements to the Global Environment (and removing res)
list2env(res, envir=.GlobalEnv)
rm(res)

# removing other unneded objects
rm(demo, doc, grps, i, xml_file)

# see what's in the Global Environment
ls()

Contents of `ox_all` objects

ox_all objects are lists with two elements, named data and metadata.

length(d)
names(d)

The data element is a dataframe containing all the data collected on study subjects, in vertical (normalized) form. The number of columns in this dataframe depends on the 1.3 export format but, in all of them, item data values are found in a valuecolumn, as well as all necessary keys to identidy the vallue (study_oid, subject_key, event_oid, etc.).

the metadata element is a list, most of whose elements are dataframes describing events, forms, item groups, items, and codelists (the first two elements however, are lists containing file information and global variables). It is then easy to access any of those dataframes.

class(d$metadata)
names(d$metadata)

# a look to the study event definitions (cols 4 to 7)
d$metadata$event_def

Synthetic (and limited) information of an ox_all object can be obtained with function ox_info().

ox_info(d)

Functions in `ox`

The following functions are provided in the ox package:

To create ox_all objects and work with them:
- ox_all
- ox_xtract_group
- ox_info
To get data from a parsed xml file:
- ox_data
- ox_audit_log
To get metadata from a parsed xml file:
- ox_metadata
- ox_event_def
- ox_event_ref
- ox_file_info
- ox_form_def
- ox_form_ref
- ox_global_vars
- ox_group_def
- ox_group_ref
- ox_group_repeat
- ox_item_def
- ox_item_ref
- ox_codelist
- ox_codelist_item
- ox_codelist_ref
- ox_units
- ox_sites
- ox_subjects