This vignette covers the data cleaning process from the raw data provided by North West London Hospitals to a clean dataset ready for analysis.
The raw data is contained in the file /data/admission_data.rda
. This file
contains a single dataframe, admission_data, which is created from the original
raw data files in /data-raw
by the script /data-raw/load-raw-data.R
which
simply stitches the separate files together.
The functions required for cleaning the data are all contained within the
data_cleaning.R
script.
All dates in the raw data are expressed as character strings starting
yyyy-mm-dd. We therefore use the regex "^[0-9]{4}-[0-9]{2}-[0-9]{2}"
to extract the date part of strings in the following columns of the raw data:
AdmissionDate
, DischargeDate
, EpisodeStartDate
, EpisodeEndDate
.
As part of the data extract and pseudonymisation process, missing data has been
represented as both "NULL"
and "#N/A"
in the raw data. The clean_data
function resolves both of these to NA
, resulting in the following distribution
of missing data across columns.
admission_data_clean <- clahrcnwlhf::clean_data(clahrcnwlhf::admission_data, restrict_disch_date = FALSE) library(knitr) library(scales) nac <- clahrcnwlhf::na_count(admission_data_clean) n <- nrow(admission_data_clean) nact <- data.frame(nac,nac/n*100) colnames(nact) <- c("Number_of_NAs", "Percent_NAs") knitr::kable(nact[order(nact[,2]),])
For some fields, it is not important that there is a significant amount of missing data; e.g. the fact that many of the secondary diagnosis fields are missing may be entirely natural - not all episodes of care involve nine different diagnoses!
To check for any patterns in missing data over time, let's look at this by year.
#admission_data_clean <- clahrcnwlhf::clean_data(clahrcnwlhf::admission_data, # restrict_disch_date = FALSE) library(knitr) library(scales) md <- clahrcnwlhf::missing_data_table(admission_data_clean) knitr::kable(t(md))
The percentage of missing data for the following fields looks higher for 2016
compared with previous years: CSPAdmissionTime
, CSPDischargeTime
,
CSPLastWard
, PrimaryDiagnosis
. Let's look at these percentages at a more
granular timescale.
#admission_data_clean <- clahrcnwlhf::clean_data(clahrcnwlhf::admission_data, # restrict_disch_date = FALSE) p <- clahrcnwlhf::missing_data_table(admission_data_clean, split_by = '%Y-%m', result = 'percent_no_total') pmd_monthly <- p[,colnames(p) %in% c("CSPAdmissionTime","CSPDischargeTime", "PrimaryDiagnosis","CSPLastWard")] matplot(pmd_monthly, type = c("b"),pch=1:4,col = 1:4, xaxt="n") axis(side=1,at=1:nrow(pmd_monthly),labels=rownames(pmd_monthly), las = 2, cex.axis =0.6) legend("topleft", legend = colnames(pmd_monthly), col=1:4, pch=1:4)
This shows two things:
PrimaryDiagnosis
field goes up dramatically for episodes with DischargeDate in the last two months of data (October and November 2016)CSPAdmissionTime
, CSPDischargeTime
, CSPLastWard
, go up by an order of magnitude for episodes with DischargeDate in the last seven months of data (May - November 2016).The increase in missing PrimaryDiagnosis
values in the last two months (1) is
likely to be a result of the usual delay in coding. Therefore these months
will not be used in the analysis.
The missing data in the fields CSPAdmissionTime
, CSPDischargeTime
,
CSPLastWard
is not so easily explained, and Steve Hiles from North West
London Hospitals is investigating this at present (25th November 2016).
For now, let's look at the new missing data summary excluding those last two months:
#admission_data_clean <- clahrcnwlhf::clean_data(clahrcnwlhf::admission_data, # restrict_disch_date = TRUE) library(knitr) library(scales) nac <- clahrcnwlhf::na_count(admission_data_clean) n <- nrow(admission_data_clean) nact <- data.frame(nac,nac/n*100) colnames(nact) <- c("Number_of_NAs", "Percent_NAs") knitr::kable(nact[order(nact[,2]),])
Vignettes are long form documentation commonly included in packages. Because they are part of the distribution of the package, they need to be as compact as possible. The html_vignette
output type provides a custom style sheet (and tweaks some options) to ensure that the resulting html is as small as possible. The html_vignette
format:
Note the various macros within the vignette
section of the metadata block above. These are required in order to instruct R how to build the vignette. Note that you should change the title
field and the \VignetteIndexEntry
to match the title of your vignette.
The html_vignette
template includes a basic CSS theme. To override this theme you can specify your own CSS in the document metadata as follows:
output: rmarkdown::html_vignette: css: mystyles.css
The figure sizes have been customised so that you can easily put two images side-by-side.
plot(1:10) plot(10:1)
You can enable figure captions by fig_caption: yes
in YAML:
output: rmarkdown::html_vignette: fig_caption: yes
Then you can use the chunk option fig.cap = "Your figure caption."
in knitr.
You can write math expressions, e.g. $Y = X\beta + \epsilon$, footnotes^[A footnote here.], and tables, e.g. using knitr::kable()
.
knitr::kable(head(mtcars, 10))
Also a quote using >
:
"He who gives up [code] safety for [code] speed deserves neither." (via)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.