In biostatUniBS/RDataXMan: Data Extraction Management

knitr::opts_chunk$set(eval = FALSE)

Folder Organisation

RDataXMan requires a specific folder structure for the functions to work:

Working directory
  |-- public_data
  |-- public_data_template
  `-- research
        |-- research folder1 (e.g., requestnum001)
              |-- private_data
              |-- private_data_template
              |-- request_input
              `-- request_output
        |-- research folder2
        `-- ...

Workflow with `RDataXMan` package

The workflow using RDataXMan is illustrated using an arbitrary example.

Step 1: Set working directory and research folder

This step creates the folder structure required by the package, as illustrated in the diagram above. Users may create the required subfolders manually, yet are highly recommended to use functions initWkdir and initResearchFolder for convenience.

User may select an existing folder as the working directory, or specify the directory to a new folder, in which case a new working directory will be created.

library(RDataXMan)
initWkdir(wkdir = "Working directory")
initResearchFolder(wkdir = "Working directory", research.folder = "requestnum001")

After setting working directory and research folder, all public datasets that are saved as flat files should be stored in subfolder public_data of working directory, and all flat tables that are only meant to be accessible by this particular research project should be stored in subfolder private_data of selected research folder.

Step 2: Generate the inclusion criteria and variable list

genInclusion(wkdir = "Working directory", research.folder = "requestnum001", 
             table_name, data.type, database, 
             key.var, key.desc, identifier.var, 
             count, overwrite, 
             username, password)
genVariable(wkdir = "Working directory", research.folder = "requestnum001",  
            table_name, data.type, database, 
            identifier.var, omit.var, overwrite,
            username, password)

Parameter data.type should take value public or private if using flat tables, to indicate whether database is saved in public_data or private_data in the specified research folder. If linking to an actual database, data.type should be the type of database to link to, i.e., SQL.

The Excel files generated by genInclusion and genVariable will be saved into
private_data_template folder if data.type is private, and public_data_template otherwise.

After generating the request forms, users need to indicate their selection in each of the forms and save the modified copies in request_input folder within the research folder. See sections below for how to make selection in request forms. Note that the package requires users to leave the original copy of request forms in the template folders.

Step 3: Extract data

After filling in all request forms and saving them in request_input folder, data extraction can be done using function extract_data:

extract_data(wkdir = "Working directory", research.folder = "requestnum001", 
             inclusion.xls.file, dataLogic, variable.xls.file, 
             select.output, overwrite, 
             username, password)

inclusion.xls.file and variable.xls.file can be vectors if multiple inclusion criteria and/or variable lists are to be used for extraction. Users must specify dataLogic to be either intersection or union for multiple inclusion criteria.

select.output indicates output files to be generated by the function, where multiple selections can be indicated by a vector:

1: generate lists of identifier variables from merged inclusion criteria.
2: generate an Excel file with summary statistics for both inclusion criteria and variable lists.
3: generate csv files with data extracted based on each request form.
4: generate a single csv file for the final merged data.

Selected output files will be written into the request_output folder within the research folder, and meanwhile returned as a list of data.frames in the R environment except for 2.

In the final merged data, if there are multiple columns with the same name, each column name will be appended with the index of its corresponding variable list in the variable.xls.file argument.

Things to note when filling in request forms

Do not modify the first row and the first column in "overall list" sheet, otherwise this will cause error in data extraction.
Do not modify the name of sheet "overall list".
Names of request forms should not be changed when saved to request_input folder, and their file type should be "Excel 97-2003 Workbook" (i.e. with xls extension). Users are hence highly recommended to modify request forms and then use Save As... option of Excel to save the modified versions in request_input folder for convenience.

Specifying Inclusion Criteria Using "overall list"

In this section we use an example to illustrate how to specify inclusion criteria by filling up the sheet "overall list" in long or short format.

Suppose we want to select all patients in a dataset as the starting cohort and eventually reduce the sample with other inclusion criteria. In this case, patient Id (denoted as patient NRIC in the example data) should be used as both key variable and identifier variable. There are two ways to specify such selection, one via the selection column of the request form, the other via the logic column.

Using `selection` column

The "overall list" sheet contains all ICD records in the current database. The remarks column (i.e. Column C) is for notes and comments. To select all patients, we put an x for each row in column selection of sheet overall list:

Using `logic` column

Instead of going through each and every row to select all patients, users may also use R logical expression as a shortcut by writing !is.na(PATIENT_NRIC) in the first row of logic column, meaning that any row with a valid patient NRIC, effectively every row, should be selected:

Users mush make sure that the logical statement specified is following valid R syntax, and need to pay attention to data type of key variable. For instance, as.numeric(DIAGNOSIS_CD) >= 174 & as.numeric(DIAGNOSIS_CD) < 175 is advised instead of DIAGNOSIS_CD >= 174 & DIAGNOSIS_CD < 175 if the inclusion criteria is to select any ICD code from 174.0 to 174.9, and users are not sure whether the column for ICD code is numeric. Note that only the first row of this column will be interpreted.

Selecting Variable Using "overall list"

In this section we use an example to illustrate how to select variables for the final output by filling up the sheet "overall list" of variable list in long or short format.

To select specific variables listed in variable (Column B), put an x in the corresponding row in Column D, and leave Column D blank if a variable is not wanted.

Suppose we are interested in race, birth year and death indicator of patients, the sheet "overall list" of the variable list should be filled in the following way: