load_all_data: Loads all RPDR text outputs into R.

load_all_data    R Documentation

Description

Loads all RPDR text outputs into R and returns a list of the processed data tables. If a data source is split across multiple text files (which happens when the query covers more than 25000 patients), append a "_" and a number to each file name. Files of the same data source are then merged into a single output in the order given by those numbers.
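
The numbered-suffix convention can be illustrated with a minimal base R sketch. The file names below are hypothetical, and the code only shows why numeric (rather than alphabetical) ordering of the suffix matters; it is not the package's internal implementation.

```r
# Hypothetical file names: one data source ("Dia") split into numbered parts.
files <- c("query_Dia_10.txt", "query_Dia_2.txt", "query_Dia_1.txt")

# Extract the number that follows the final "_" and precedes ".txt".
part <- as.integer(sub("^.*_([0-9]+)\\.txt$", "\\1", files))

# Sort numerically so that part 10 comes after part 2, not after part 1.
ordered_files <- files[order(part)]
print(ordered_files)
```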

Usage

load_all_data(
  folder,
  which_data = c("mrn", "con", "dem", "all", "bib", "dia", "enc", "lab", "lno", "mcm",
    "med", "mic", "phy", "prc", "prv", "ptd", "rdt", "rfv", "trn", "car", "dis", "end",
    "hnp", "opn", "pat", "prg", "pul", "rad", "vis"),
  old_dem = FALSE,
  merge_id = "EMPI",
  sep = ":",
  id_length = "standard",
  perc = 0.6,
  na = TRUE,
  identical = TRUE,
  nThread = parallel::detectCores() - 1,
  many_sources = TRUE,
  load_report = TRUE,
  format_orig = FALSE
)

Arguments

folder

string, full folder path to RPDR text files.

which_data

string vector, abbreviations of the datasources to load. Defaults to all supported datasources.

old_dem

boolean, should the old load_dem function be used for loading demographic data. Defaults to FALSE; set to TRUE for Dem.txt datasets from before 2022.

merge_id

string, name of the column used to create the ID_MERGE column, which is used to merge different datasets. Defaults to EMPI, as it is the preferred MRN in the RPDR system. For the mrn dataset, leave at EMPI, as it is automatically converted to "Enterprise_Master_Patient_Index".

sep

string, divider between hospital ID and MRN. Defaults to ":".
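
As a rough illustration of what the divider separates, the base R sketch below splits ID strings into a hospital part and an MRN part. The ID values are made up, and the code is an assumption about the general shape of such strings, not the package's parsing routine.

```r
# Hypothetical ID strings of the form "<hospital><sep><MRN>".
ids <- c("MGH:1234567", "BWH:00123456")
sep <- ":"

# fixed = TRUE treats sep as a literal string rather than a regular expression.
parts <- strsplit(ids, sep, fixed = TRUE)
hospital <- vapply(parts, `[`, character(1), 1)
mrn <- vapply(parts, `[`, character(1), 2)

print(data.frame(hospital = hospital, mrn = mrn))
```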

id_length

string, whether to adjust MRN lengths to the required values (id_length = standard) or to keep them as they are (id_length = asis). With id_length = standard, the MGH, BWH, MCL, EMPI and PMRN values are corrected by adding leading zeros or by removing numerals from the beginning. In all other cases the lengths are unchanged. Defaults to standard.
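
The kind of correction described for id_length = standard can be sketched in base R as follows. The helper standardize_id and the target width of 7 are hypothetical; the lengths that the package actually requires for each ID type are not stated here.

```r
# Hypothetical helper, not part of parseRPDR.
# Pads with leading zeros or drops leading characters so every ID has `width` characters.
standardize_id <- function(id, width) {
  id <- as.character(id)
  n <- nchar(id)
  ifelse(
    n >= width,
    substring(id, n - width + 1, n),              # keep the last `width` characters
    paste0(strrep("0", pmax(width - n, 0)), id)   # add leading zeros
  )
}

fixed_ids <- standardize_id(c("12345", "1234567", "991234567"), width = 7)
print(fixed_ids)
```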

perc

numeric, a number between 0 and 1 indicating which parsed ID columns to keep. Columns with data present in at least perc x 100% of patients are kept. Defaults to 0.6.
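
A minimal sketch of a coverage filter of this kind, in base R, is given below. The toy table and the "at least perc" comparison are assumptions made for illustration; the package may compute coverage differently.

```r
# Toy table: columns stand in for parsed ID columns, rows for patients.
ids <- data.frame(
  MGH = c("001", "002", "003", "004", "005"),
  BWH = c("101", NA, "103", "104", "105"),
  MCL = c(NA, NA, NA, "204", NA),
  stringsAsFactors = FALSE
)
perc <- 0.6

# Share of patients with a non-missing value in each column.
coverage <- colMeans(!is.na(ids))

# Assumption: a column is kept when its coverage is at least perc.
kept <- ids[, coverage >= perc, drop = FALSE]
print(names(kept))
```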

na

boolean, whether to remove columns with only NA values. Defaults to TRUE.

identical

boolean, whether to remove columns with identical values. Defaults to TRUE.
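
The effect of the na and identical options can be approximated with the base R sketch below. It assumes that "identical values" means a column whose non-missing entries are all the same; that reading, like the toy data, is an assumption rather than a statement about the package's code.

```r
dt <- data.frame(
  id = c("a", "b", "c"),
  empty = c(NA, NA, NA),
  constant = c("x", "x", "x"),
  value = c(1, 2, 2),
  stringsAsFactors = FALSE
)

# Columns in which every value is NA.
all_na <- vapply(dt, function(col) all(is.na(col)), logical(1))

# Assumption: "identical values" means every non-missing entry in the column is the same.
constant <- vapply(dt, function(col) length(unique(col[!is.na(col)])) == 1, logical(1))

cleaned <- dt[, !(all_na | constant), drop = FALSE]
print(names(cleaned))
```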

nThread

integer, number of threads to use for parallelization. Defaults to parallel::detectCores() - 1.
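
When passing nThread explicitly, it can be worth guarding against edge cases: parallel::detectCores() may return NA on some platforms, and subtracting 1 gives 0 on a single-core machine. The sketch below is one possible way to compute a safe value; the variable name n_thread is only illustrative.

```r
# detectCores() may return NA, and subtracting 1 gives 0 on a single-core machine.
cores <- parallel::detectCores()

# Fall back to a single thread when the core count is unknown or too small.
n_thread <- if (is.na(cores)) 1L else max(1L, as.integer(cores) - 1L)
print(n_thread)
```

The resulting value could then be supplied as nThread = n_thread.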

many_sources

boolean, the level at which parallelization is done. If TRUE, the datasources are processed in parallel with each other. If FALSE, the files within each datasource are processed in parallel. Set to TRUE when loading many datasources, or only a few datasources with one file each. Set to FALSE when loading only a few datasources that each have many files (the result of large queries), as parallelizing within each datasource may then be faster. Defaults to TRUE.
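
The guidance above can be written down as a small decision helper. The function suggest_many_sources and its cut-offs (few_sources, many_files) are hypothetical placeholders chosen for illustration; they are not part of parseRPDR, and sensible thresholds depend on the data and hardware.

```r
# Hypothetical helper, not part of parseRPDR; the cut-offs are arbitrary placeholders.
suggest_many_sources <- function(n_sources, max_files_per_source,
                                 few_sources = 3, many_files = 4) {
  is_few_sources <- n_sources <= few_sources
  has_many_files <- max_files_per_source >= many_files
  # Parallelize within datasources only when few sources each have many files.
  !(is_few_sources && has_many_files)
}

print(suggest_many_sources(n_sources = 20, max_files_per_source = 1))
print(suggest_many_sources(n_sources = 2, max_files_per_source = 8))
print(suggest_many_sources(n_sources = 2, max_files_per_source = 1))
```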

load_report

boolean, should the report text be returned for notes. Defaults to TRUE.

format_orig

boolean, should report be returned in its original formatting or should white spaces used for formatting be removed. Defaults to FALSE.
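
To give a sense of the difference between the two settings, the base R sketch below collapses the white space in a made-up report string. It assumes that formatting white space means runs of spaces, tabs and line breaks; the package's own cleaning rules may differ.

```r
# Made-up report text with indentation and line breaks used for layout.
report <- "Findings:\n    No acute   abnormality.\n\n  Impression:   normal."

# Assumption: formatting white space means runs of spaces, tabs and line breaks.
compact <- trimws(gsub("[[:space:]]+", " ", report))
print(compact)
```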

Value

list of the parsed data tables, one for each datasource loaded.

Examples

## Not run: 
# folder_rpdr is a placeholder for the full path to the folder with the RPDR text files

# Load all Con, Dem and Mrn datasets, processing the files within each datasource in parallel
load_all_data(folder = folder_rpdr, which_data = c("con", "dem", "mrn"),
              nThread = 2, many_sources = FALSE)

# Load all supported file types, parallelizing on the level of datasources
load_all_data(folder = folder_rpdr, nThread = 2, many_sources = TRUE,
              format_orig = TRUE)

## End(Not run)

parseRPDR documentation built on June 24, 2024, 5:16 p.m.