To compile this report with a specific sub-coordination:
double-click on the open.Rproj file at the root of the reportfactory to open Rstudio
in Rstudio, run the command:
reportfactory::compile_report("compare_vhf_2019-10-13", params = list(sc = "beni"))
where you can replace beni with the sub-coordination you want. Make sure the data to compare are stored in data/data_comparison/.
What is the source of your data? Does it need running some previous reports like preliminary data cleaning? If so, list these reports here.
The data preparation involves the following steps, detailed in the following tabs:
Load scripts: loads libraries and useful scripts used in the analyses; all .R files contained in scripts at the root of the factory are automatically loaded
Load data: imports datasets, and may contain some ad hoc changes to the data such as specific data cleaning (not used in other reports), new variables used in the analyses, etc.
Clean data: this section contains ad hoc data cleaning, i.e. which is not used in other reports (otherwise cleaning should be done in a dedicated report); this section is also used to create new variables used in the analyses
These scripts will load:
.R files inside /scripts/
../scripts/
x
## read scripts
path_to_scripts <- here::here("scripts")
## keep files ending in ".R"; the dot is escaped so it is matched literally
scripts_files <- dir(path_to_scripts, pattern = "\\.R$", full.names = TRUE)
for (file in scripts_files) source(file, local = TRUE)
In this section, we find the two most recent files matching the chosen sub-coordination in the data/data_comparison folder, and import them as x_old and x_new.
## load the data
data_comparison_folder <- here("data", "data_comparison")

## find order of the files
all_files <- list.files(data_comparison_folder, pattern = params$sc, ignore.case = TRUE)
file_order <- all_files %>%
  guess_dates() %>%
  order(decreasing = TRUE)
new_file <- here("data", "data_comparison", all_files[file_order[1]])
old_file <- here("data", "data_comparison", all_files[file_order[2]])
x_new <- custom_import(new_file) %>% as_tibble()
x_old <- custom_import(old_file) %>% as_tibble()

## extract database date from the file name
new_file_date <- gsub("^[^.]+/", "", new_file) %>% guess_dates()
old_file_date <- gsub("^[^.]+/", "", old_file) %>% guess_dates()
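The ordering step above can be illustrated on toy file names. This is a minimal base R sketch, not the report's own code: it swaps guess_dates() for a plain regular expression, and it assumes (hypothetically) that each file name contains an ISO date of the form YYYY-MM-DD.

```r
## toy file names; the naming scheme is an assumption made for illustration
all_files <- c("beni_2019-10-01.xlsx",
               "beni_2019-10-13.xlsx",
               "beni_2019-09-20.xlsx")

## extract the first YYYY-MM-DD pattern from each name and convert to Date
date_strings <- regmatches(all_files,
                           regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", all_files))
file_dates <- as.Date(date_strings)

## sort from most recent to oldest, then keep the first two
file_order <- order(file_dates, decreasing = TRUE)
new_file <- all_files[file_order[1]]
old_file <- all_files[file_order[2]]
```

If fewer than two files match the sub-coordination, the second index returns NA, so a check on the number of matching files before importing may be worth adding.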
The completion dates of the databases are:
old dataset: r format(old_file_date, format = "%A %d %b %Y")
new dataset: r format(new_file_date, format = "%A %d %b %Y")
To keep the original data unchanged as much as possible, we only clean the dates of date_report, needed for further subsetting of the data:
x_new <- x_new %>%
  mutate(date_report = guess_dates(DateReport))
x_old <- x_old %>%
  mutate(date_report = guess_dates(DateReport))
Here we retain the last 42 days of data, using the date of the most recent database as reference. These data will be used for entry comparison, i.e. seeing which cases changed between versions, as this is a very resource-intensive task. Other comparisons will be made on all data.
start_at <- new_file_date - 42
x_new_recent <- x_new %>%
  filter(date_report >= start_at)
x_old_recent <- x_old %>%
  filter(date_report >= start_at)
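As a small worked example of the 42-day window, the sketch below uses base R on a toy data frame. The reference date and the toy records are made up for illustration; only the subtraction and the >= comparison mirror the code above.

```r
## hypothetical reference date standing in for new_file_date
reference_date <- as.Date("2019-10-13")
start_at <- reference_date - 42  # subtracting a number from a Date counts days

## toy records: one just before the window, one on its first day, one inside
toy <- data.frame(
  ID = c("a", "b", "c"),
  date_report = as.Date(c("2019-08-31", "2019-09-01", "2019-10-10")),
  stringsAsFactors = FALSE
)

## keep records reported on or after the start of the window
toy_recent <- toy[toy$date_report >= start_at, ]
```

Note that the comparison is inclusive, so a case reported exactly on start_at is retained.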
Cases retained in the recent datasets were reported from the r format(start_at, "%d %B %Y").
Again, to make the investigation of changes in cases feasible, we retain a smaller subset of variables to enable the comparison of entries:
x_new_recent <- x_new_recent %>%
  select(1:60, DateOutcomeComp, FinalStatus, date_report)
x_old_recent <- x_old_recent %>%
  select(1:60, DateOutcomeComp, FinalStatus, date_report)
We define custom colors for the output of compare_df.
comp_colors <- c(addition = "#1FC46F",
                 removal = "#F94444",
                 unchanged_cell = "#7C8AA4",
                 unchanged_row = "#7C8AA4")
The data comparison includes the following:
structure comparison: look for differences in dimensions, and names, ordering and types of the variables
duplicates: look for duplicated IDs, and identify the corresponding entries; this includes a comparison of the duplicates between old and new datasets
changes in cases: look for changes in cases, highlighting the changes with colors in a separate html table
In this part we use linelist's function compare_data to compare the structures of the two datasets: which variables are present, what their types are, whether some variables disappeared, etc.
compare_data(x_old, x_new, use_values = FALSE)
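To make the idea of a structure comparison concrete, here is a minimal base R sketch on two toy data frames. It is not how compare_data is implemented; it only illustrates the kind of differences being looked for (missing variables, new variables, changed types). All object names are hypothetical.

```r
## toy "old" and "new" versions of a dataset
old <- data.frame(ID = c("a", "b"), age = c(10, 20), sex = c("f", "m"),
                  stringsAsFactors = FALSE)
new <- data.frame(ID = c("a", "b"), age = c("10", "20"),
                  outcome = c("alive", "dead"),
                  stringsAsFactors = FALSE)

## variables which disappeared or appeared between versions
missing_in_new <- setdiff(names(old), names(new))
added_in_new <- setdiff(names(new), names(old))

## among shared variables, find those whose class changed
shared <- intersect(names(old), names(new))
old_classes <- vapply(old[shared], function(col) class(col)[1], character(1))
new_classes <- vapply(new[shared], function(col) class(col)[1], character(1))
changed_type <- shared[old_classes != new_classes]
```

In this toy case, sex is dropped, outcome is added, and age changes from numeric to character.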
In this section, we look for duplicated identifiers in the old dataset, and output a table of the corresponding individuals.
to_keep <- x_old %>%
  filter(duplicated(ID)) %>%
  pull(ID)
duplicates_old <- x_old %>%
  filter(ID %in% to_keep)
if (nrow(duplicates_old) > 0) {
  duplicates_old %>%
    show_table()
}
In this section, we look for duplicated identifiers in the new dataset, and output a table of the corresponding individuals.
to_keep <- x_new %>%
  filter(duplicated(ID)) %>%
  pull(ID)
duplicates_new <- x_new %>%
  filter(ID %in% to_keep)
if (nrow(duplicates_new) > 0) {
  duplicates_new %>%
    show_table()
}
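The two-step approach used for the duplicates (first collect the duplicated IDs, then keep every row carrying one of them) matters because duplicated() does not flag the first occurrence of a value. A base R sketch on toy data, with hypothetical object names:

```r
## toy dataset: "a" appears three times, "b" twice, "c" once
toy <- data.frame(ID = c("a", "b", "a", "c", "b", "a"),
                  value = 1:6,
                  stringsAsFactors = FALSE)

## duplicated() is FALSE for the first occurrence of each ID
flagged <- duplicated(toy$ID)

## step 1: IDs which occur more than once
dup_ids <- unique(toy$ID[flagged])

## step 2: all rows with such an ID, including the first occurrences
toy_duplicates <- toy[toy$ID %in% dup_ids, ]
```

Filtering on duplicated() alone would return only three rows here, whereas the two-step version returns all five rows involved in a duplication.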
In this part we compare duplicated entries between the old and the new dataset. Note that the table display in this document is sub-optimal. Click on the link below to open the table.
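Color code (as defined in comp_colors): additions in #1FC46F, removals in #F94444, unchanged cells and rows in #7C8AA4.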
Important: the file will open in your web browser by default; for even better visualisation, we recommend going to the tables/ folder and opening comparison_duplicates_table.html with Excel
duplicates_new_select <- duplicates_new %>%
  select(1:60, DateOutcomeComp, FinalStatus, date_report)
duplicates_old_select <- duplicates_old %>%
  select(1:60, DateOutcomeComp, FinalStatus, date_report)
comparison_duplicates <- compareDF::compare_df(
  duplicates_new_select,
  duplicates_old_select,
  group_col = "ID",
  limit_html = 1000,
  color_scheme = comp_colors,
  keep_unchanged_rows = TRUE,
  stop_on_error = FALSE)
In this section we look for changes in cases between the two versions of the dataset, using the compareDF package to check which entries of the data have changed. This is computer-intensive work. To make it run under reasonable time, we restrict the comparisons to cases reported within the last 42 days (Filter section).
Note that the table display in this document is sub-optimal. Click on the link below to open the table.
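Color code (as defined in comp_colors): additions in #1FC46F, removals in #F94444, unchanged cells and rows in #7C8AA4.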
Important: the file will open in your web browser by default; for even better visualisation, we recommend going to the tables/ folder and opening comparison_table.html with Excel
comparison <- compareDF::compare_df(
  x_new_recent,
  x_old_recent,
  group_col = "ID",
  limit_html = 1000,
  color_scheme = comp_colors,
  stop_on_error = FALSE)
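For readers unfamiliar with this kind of comparison, the sketch below shows, in base R and on toy data, what "changes in cases" means when entries are matched by ID: some IDs are removed, some are added, and some keep their ID but change value. It is an illustration only and does not reproduce how compareDF works internally; all object names are hypothetical.

```r
## toy old and new versions, matched on ID
old <- data.frame(ID = c("a", "b", "c"),
                  status = c("suspected", "confirmed", "confirmed"),
                  stringsAsFactors = FALSE)
new <- data.frame(ID = c("a", "b", "d"),
                  status = c("confirmed", "confirmed", "suspected"),
                  stringsAsFactors = FALSE)

## IDs present in only one of the two versions
removed_ids <- setdiff(old$ID, new$ID)
added_ids <- setdiff(new$ID, old$ID)

## IDs present in both: join and compare values side by side
both <- merge(old, new, by = "ID", suffixes = c("_old", "_new"))
changed_ids <- both$ID[both$status_old != both$status_new]
```

Here case c was removed, case d was added, and case a changed status; case b is unchanged.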
We create a directory called tables/ to store output files, if it does not exist.
if (!dir.exists("tables")) {
  dir.create("tables")
}
The following items are exported:
writeLines(comparison_duplicates$html_output,
           con = file.path("tables", "comparison_duplicates_table.html"),
           useBytes = TRUE)
writeLines(comparison$html_output,
           con = file.path("tables", "comparison_table.html"),
           useBytes = TRUE)
if (nrow(duplicates_old) > 0) {
  rio::export(duplicates_old,
              file = file.path("tables", "duplicates_old.xlsx"))
}
if (nrow(duplicates_new) > 0) {
  rio::export(duplicates_new,
              file = file.path("tables", "duplicates_new.xlsx"))
}
Click on the links below to open items:
Changes in cases:
comparison_duplicates_table: html table of comparisons amongst old and new duplicates
comparison_table: html table of comparisons amongst old and new cases
Duplicates:
duplicates_old: Excel file containing duplicated entries (old dataset)
duplicates_new: Excel file containing duplicated entries (new dataset)
The following information documents the system on which the document was compiled.
This provides information on the operating system.
Sys.info()
This provides information on the version of R used:
R.version
This provides information on the packages used:
sessionInfo()
This provides information on the parameters (passed through params) used for compiling this document:
params