Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However, the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.
The Discharge Abstract Database (DAD) is large, and the flat SPSS .sav format is not well suited to fast processing or data mining for clinical insights. dadR uses Apache Spark to parallelize search and extraction. Most functions return a Spark data frame. The package also includes clustering and other machine learning functions.
devtools::install_github("E-Health/dadR")
library(SparkR)
library(data.table)
library(foreign)
library(dadR)
# Set the Spark master URL here, e.g. "local[*]" for a local session
# or "spark://host:7077" for a standalone cluster
sparkR.session(
  master = "local[*]",
  sparkConfig = list(
    spark.driver.memory = "3g",
    spark.executor.memory = "3g")
)
DADSparkInit(savFile = "path/to/dad_sample_2015.sav")
# A CSV file named dadr is created automatically on the first run.
# It can be loaded directly in later sessions instead of the .sav file:
DADSparkInit(csvFile = "path/to/dadr.csv")
spark_df <- DADSameDisease("J08")
r_df <- collect(spark_df)
# All records with the diagnosis J08
(r_dt <- as.data.table(r_df))
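Once the result has been collected into a data.table, further filtering can be done locally. The sketch below only illustrates the idea of selecting records by diagnosis code and is not dadR's implementation: the toy table, the column names (`record_id`, `diag_1`, `diag_2`) and the helper `filter_by_diagnosis` are hypothetical, and exact matching on the code is an assumption.

```r
library(data.table)

# Toy stand-in for a collected extract. Column names and values are
# hypothetical and do not reflect the real DAD layout.
toy_dt <- data.table(
  record_id = 1:4,
  diag_1 = c("J08", "I10", "E119", "K35"),
  diag_2 = c(NA, "J08", "I10", NA)
)

# Hypothetical helper: keep rows where any listed diagnosis column
# equals `code` exactly. %in% returns FALSE for NA, so missing
# diagnoses never match.
filter_by_diagnosis <- function(dt, code, diag_cols) {
  hits <- Reduce(`|`, lapply(diag_cols, function(col) dt[[col]] %in% code))
  dt[hits]
}

j08_dt <- filter_by_diagnosis(toy_dt, "J08", c("diag_1", "diag_2"))
print(j08_dt)
```

Keeping the diagnosis columns as an argument means the same helper works whether the code appears as the primary diagnosis or in a secondary field.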
devtools::load_all() # Repeat on error
devtools::test()
Please cite dadR in your publications if it helped your research. Here is an example BibTeX entry:
@misc{eapenbr2018,
title={dadR - Spark enabled R package for analyzing discharge abstract database.},
author={Eapen, Bell Raj and contributors},
year={2018},
publisher={GitHub},
journal = {GitHub repository},
howpublished={\url{https://github.com/E-Health/dadR}}
}