**Notice**: this is a **stable, routine report**. **Do not touch it unless it is broken.** To make a contribution, **carefully read the [README](../../../../../README.html) file**.

**Maintainer:** Thibaut Jombart (thibautjombart@gmail.com)

**Code contributors:** Thibaut Jombart

**Data contributors:** Surveillance teams from the sub-coordination.

**Idea contributors:** Xavier de Radiguès, Mathias Mossoko, Olivier le Polain

**Version:** 1.0.0 (only required for routine reports)

**Reviewed by:** recommended, but only required for routine reports


To compile this report with a specific sub-coordination:

  1. double-click on the open.Rproj file at the root of the reportfactory to open RStudio

  2. in RStudio, run the command:

`reportfactory::compile_report("compare_vhf_2019-10-13", params = list(sc = "beni"))`

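where you can replace beni with the sub-coordination you want. Make sure the data to compare are stored in data/data_comparison/. Files are matched on the sub-coordination name appearing in the file name (case-insensitive), and the two most recent files are compared.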

Data preparation {.tabset .tabset-fade .tabset-pills}

Outline

Data used

What is the source of your data? Does it need running some previous reports like preliminary data cleaning? If so, list these reports here.

Method

The data preparation involves the following steps, detailed in the following tabs:

Load scripts

These scripts will load:

## read scripts
path_to_scripts <- here::here("scripts")
scripts_files <- dir(path_to_scripts, pattern = "\\.R$",
                     full.names = TRUE)
for (file in scripts_files) source(file, local = TRUE)

Load data

In this section, we:

## load the data
data_comparison_folder <- here("data", "data_comparison")

## order the files from most recent to oldest, using dates in file names
all_files <- list.files(data_comparison_folder,
                        pattern = params$sc,
                        ignore.case = TRUE)
## the comparison needs at least two versions of the database
stopifnot(length(all_files) >= 2)
file_order <- all_files %>%
  guess_dates() %>%
  order(decreasing = TRUE)

new_file <- here("data", "data_comparison", all_files[file_order[1]])
old_file <- here("data", "data_comparison", all_files[file_order[2]])


x_new <- custom_import(new_file) %>% as_tibble()
x_old <- custom_import(old_file) %>% as_tibble()


## extract database date from the file name
new_file_date <- basename(new_file) %>% guess_dates()
old_file_date <- basename(old_file) %>% guess_dates()
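
The file selection above can be illustrated on toy file names with base R only. This is a minimal sketch, not the report's actual code: the file names and the helper `extract_date()` are hypothetical, and the helper is a simplified stand-in for `guess_dates()` that only recognises dates written as yyyy-mm-dd.

```r
## toy file names; the naming pattern is an assumption for illustration
all_files <- c("beni_2019-10-06.xlsx",
               "beni_2019-10-13.xlsx",
               "beni_2019-09-29.xlsx")

## hypothetical helper: extract the first yyyy-mm-dd string of each name
## (a simplified stand-in for guess_dates); names without a date give NA
extract_date <- function(x) {
  out <- rep(NA_character_, length(x))
  m <- regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", x)
  out[m > 0] <- regmatches(x, m)
  as.Date(out)
}

## order files from most recent to oldest
file_dates <- extract_date(all_files)
file_order <- order(file_dates, decreasing = TRUE)

## the two most recent versions are the ones compared
new_file <- all_files[file_order[1]]
old_file <- all_files[file_order[2]]
```

With these toy names, `new_file` is the file dated 13 October 2019 and `old_file` the one dated 6 October 2019; the oldest file is ignored.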

The completion dates of the databases are:

Data cleaning

To keep the original data unchanged as much as possible, we only clean the reporting dates (DateReport), stored as date_report, which are needed for further subsetting of the data:

x_new <- x_new %>%
  mutate(date_report = guess_dates(DateReport))

x_old <- x_old %>%
  mutate(date_report = guess_dates(DateReport))

Filter

Here we retain the last 42 days of data, counted back from the date of the most recent database. These data are used for entry comparison, i.e. seeing which cases changed between versions, because this comparison is very resource-intensive. Other comparisons are made on all data.

start_at <- new_file_date - 42

x_new_recent <- x_new %>%
  filter(date_report >= start_at)

x_old_recent <- x_old %>%
  filter(date_report >= start_at)

Cases retained in the recent datasets were reported on or after `r format(start_at, "%d %B %Y")`.
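
As an illustration of the 42-day window, here is a small sketch on made-up data, written in base R rather than dplyr so that it runs without extra packages. The data frame `x` and its values are hypothetical; only the date arithmetic mirrors the filter above.

```r
## hypothetical database date and toy linelist
new_file_date <- as.Date("2019-10-13")
x <- data.frame(ID = c("a", "b", "c"),
                date_report = as.Date(c("2019-08-31",
                                        "2019-09-01",
                                        "2019-10-10")),
                stringsAsFactors = FALSE)

## subtracting a number from a Date moves back by that many days
start_at <- new_file_date - 42

## keep cases reported on or after start_at; as with dplyr::filter,
## rows with a missing date are dropped
keep <- !is.na(x$date_report) & x$date_report >= start_at
x_recent <- x[keep, ]
```

Here `start_at` is 1 September 2019, so case "a" (reported the day before) is excluded while "b" and "c" are retained.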

Subset variables

Again, to make the investigation of changes in cases feasible, we retain a smaller subset of variables to enable the comparison of entries:

x_new_recent <- x_new_recent %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)

x_old_recent <- x_old_recent %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)

Scales and colors

We define custom colors for the output of compare_df.

comp_colors <- c(addition = "#1FC46F",
                 removal = "#F94444",
                 unchanged_cell = "#7C8AA4",
                 unchanged_row = "#7C8AA4")

Data comparison {.tabset .tabset-fade .tabset-pills}

Explanations

The data comparison includes the following:

Structure comparison

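In this part we use the function compare_data from the linelist package to compare the structures of the two datasets: which variables are present, what their types are, and whether any variable has disappeared. The values themselves are not compared here (use_values = FALSE).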

compare_data(x_old, x_new, use_values = FALSE)
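
To make the idea of a structure comparison concrete, the sketch below reproduces the kind of checks involved using base R on toy data. It is an illustration only and does not claim to reflect how compare_data is implemented; the data frames and their columns are made up.

```r
## toy old and new datasets (hypothetical content)
x_old <- data.frame(ID = c("a", "b"),
                    age = c(10L, 20L),
                    DateReport = c("2019-10-01", "2019-10-02"),
                    stringsAsFactors = FALSE)
x_new <- data.frame(ID = c("a", "b"),
                    age = c("10", "20"),
                    FinalStatus = c("alive", NA),
                    stringsAsFactors = FALSE)

## variables present in one dataset only
vars_removed <- setdiff(names(x_old), names(x_new))
vars_added <- setdiff(names(x_new), names(x_old))

## variables present in both, but stored with a different class
shared <- intersect(names(x_old), names(x_new))
class_old <- vapply(x_old[shared], function(v) class(v)[1], character(1))
class_new <- vapply(x_new[shared], function(v) class(v)[1], character(1))
type_changed <- shared[class_old != class_new]
```

In this toy case DateReport was removed, FinalStatus was added, and age changed from integer to character.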

Looking for duplicates: old data

In this section, we look for duplicated identifiers, and output a table of the corresponding individuals.

to_keep <- x_old %>%
  filter(duplicated(ID)) %>%
  pull(ID)

duplicates_old <- x_old %>%
  filter(ID %in% to_keep)

if (nrow(duplicates_old) > 0) {
  duplicates_old %>%
    show_table()
}
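
The two-step approach used above matters because `duplicated()` only flags the second and later occurrences of a value, not the first. A small base R sketch on made-up identifiers shows why the identifiers are extracted first and then matched with `%in%`:

```r
## toy data with repeated identifiers (hypothetical)
x <- data.frame(ID = c("a", "b", "a", "c", "b", "a"),
                visit = 1:6,
                stringsAsFactors = FALSE)

## TRUE only from the second occurrence onwards
is_repeat <- duplicated(x$ID)

## identifiers that appear more than once
dup_ids <- unique(x$ID[is_repeat])

## all rows sharing a duplicated identifier, first occurrences included
duplicates <- x[x$ID %in% dup_ids, ]
```

Filtering on `is_repeat` alone would return three rows and miss the first entry of "a" and "b"; matching on `dup_ids` returns all five rows involved.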

Looking for duplicates: new data

In this section, we look for duplicated identifiers, and output a table of the corresponding individuals.

to_keep <- x_new %>%
  filter(duplicated(ID)) %>%
  pull(ID)

duplicates_new <- x_new %>%
  filter(ID %in% to_keep)

if (nrow(duplicates_new) > 0) {
  duplicates_new %>%
    show_table()
}

Changes in duplicates

In this part we compare duplicated entries between the old and the new dataset. Note that the table display in this document is sub-optimal. Click on the link below to open the table.

Color code:

Open comparison table

Important: the file will open in your web browser by default; for even better visualisation, we recommend going to the tables/ folder and opening comparison_duplicates_table.html with Excel

duplicates_new_select <- duplicates_new %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)


duplicates_old_select <- duplicates_old %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)


comparison_duplicates <- compareDF::compare_df(
                                        duplicates_new_select,
                                        duplicates_old_select,
                                        group_col = "ID",
                                        limit_html = 1000,
                                        color_scheme = comp_colors,
                                        keep_unchanged_rows = TRUE,
                                        stop_on_error = FALSE)

Changes in cases

In this section we look for changes in cases between the two versions of the dataset, using the compareDF package to check which entries of the data have changed. This is computationally intensive. To keep the running time reasonable, we restrict the comparisons to cases reported within the last 42 days (see the Filter section).

Note that the table display in this document is sub-optimal. Click on the link below to open the table.

Color code:

Open comparison table

Important: the file will open in your web browser by default; for even better visualisation, we recommend going to the tables/ folder and opening comparison_table.html with Excel

comparison <- compareDF::compare_df(
                             x_new_recent,
                             x_old_recent,
                             group_col = "ID",
                             limit_html = 1000,
                             color_scheme = comp_colors,
                             stop_on_error = FALSE)
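
For readers unfamiliar with entry comparison, the following base R sketch shows the underlying idea on toy data: entries are matched on ID, then classified as added, removed or changed. It is a deliberately simplified illustration with made-up values, tracks a single variable, and is not a description of how compareDF works internally.

```r
## toy old and new versions of a linelist (hypothetical values)
x_old <- data.frame(ID = c("a", "b", "c"),
                    FinalStatus = c("suspect", "confirmed", "suspect"),
                    stringsAsFactors = FALSE)
x_new <- data.frame(ID = c("a", "b", "d"),
                    FinalStatus = c("suspect", "not a case", "confirmed"),
                    stringsAsFactors = FALSE)

## entries present in one version only
ids_added <- setdiff(x_new$ID, x_old$ID)
ids_removed <- setdiff(x_old$ID, x_new$ID)

## entries present in both versions: compare values side by side
both <- merge(x_old, x_new, by = "ID", suffixes = c("_old", "_new"))
ids_changed <- both$ID[both$FinalStatus_old != both$FinalStatus_new]
```

Here "d" was added, "c" was removed, and "b" changed status. A real comparison has to repeat the last step for every retained variable and handle missing values, which is part of why the report restricts it to recent cases.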

Export tables {.tabset .tabset-fade .tabset-pills}

Create output directory

We create a directory called tables/ to store output files, if it does not exist.

if (!dir.exists("tables")) {
  dir.create("tables")
}

Create output files

The following items are exported:

writeLines(comparison_duplicates$html_output,
           con = file.path("tables",
                           "comparison_duplicates_table.html"),
           useBytes = TRUE)

writeLines(comparison$html_output,
           con = file.path("tables",
                           "comparison_table.html"),
           useBytes = TRUE)

if (nrow(duplicates_old) > 0) {
  rio::export(duplicates_old, file = file.path("tables",
                                               "duplicates_old.xlsx"))
}

if (nrow(duplicates_new) > 0) {
  rio::export(duplicates_new, file = file.path("tables",
                                               "duplicates_new.xlsx"))
}

Links to output files

Click on the links below to open items:

Changes in cases:

Duplicates:

System information {.tabset .tabset-fade .tabset-pills}

Outline

The following information documents the system on which the document was compiled.

System

This provides information on the operating system.

Sys.info()

R environment

This provides information on the version of R used:

R.version

R packages

This provides information on the packages used:

sessionInfo()

Compilation parameters

This provides information on the parameters (passed through params) used for compiling this document:

params


reconhub/datacomparator documentation built on Nov. 5, 2019, 3:05 a.m.