In reconhub/datacomparator: datacomparator

**Notice**: this is a **stable, routine report**. **Do not touch it unless it is broken.** To make a contribution, **carefully read the [README](../../../../../README.html) file**. **Maintainer:** Thibaut Jombart (thibautjombart@gmail.com) **Code contributors:** Thibaut Jombart **Data contributors:** Surveillance teams from the sub-coordination. **Idea contributors:** Xavier de Radiguès, Mathias Mossoko, Olivier le Polain **Version:** 1.0.0 (only required for routine reports) **Reviewed by:** recommended, but only required for routine reports **Notice**: this is a **stable, routine report**. **Do not touch it unless it is broken.** To make a contribution, carefully read the [README](../../../../../README.html) file.

To compile this report with a specific sub-coordination:

double-click on the open.Rproj files at the root of the reportfactory to open Rstudio
in Rstudio, run the command:

reportfactory::compile_report("compare_vhf_2019-10-13", params = list(sc = "beni"))`

where you can replace beni by the sub-coordination you want. Make sure the data to compare are stored in data/data_comparison/.

Data preparation {.tabset .tabset-fade .tabset-pills}

Outline

Data used

What is the source of your data? Does it need running some previous reports like preliminary data cleaning? If so, list these reports here.

Method

The data preparation involves the following steps, detailed in the following tabs:

Load scripts: loads libraries and useful scripts used in the analyses; all .R files contained in scripts at the root of the factory are automatically loaded
Load data: imports datasets, and may contain some ad hoc changes to the data such as specific data cleaning (not used in other reports), new variables used in the analyses, etc.
Clean data: this section contains ad hoc data cleaning, i.e. which is not used in other reports (otherwise cleaning should be done in a dedicated report); this section is also used to create new variables used in the analyses

Load scripts

These scripts will load:

all local scripts, stored as .R filesinside /scripts/
all global scripts, i.e. stored outside the factory in ../scripts/
the path to the cleaned VHF data stored as x

## read scripts
path_to_scripts <- here::here("scripts")
scripts_files <- dir(path_to_scripts, pattern = ".R$",
                     full.names = TRUE)
for (file in scripts_files) source(file, local = TRUE)

Load data

In this section, we:

look for data files in the data/data_comparison folder
find the two latest datasets
read them into x_old and x_new

## load the data
data_comparison_folder <- here("data", "data_comparison")

## find order of the files
all_files <- list.files(data_comparison_folder,
                        pattern = params$sc,
                        ignore.case = TRUE)
file_order <- all_files %>%
  guess_dates() %>%
  order(decreasing = TRUE)

new_file <- here("data", "data_comparison", all_files[file_order[1]])
old_file <- here("data", "data_comparison", all_files[file_order[2]])


x_new <- custom_import(new_file) %>% as_tibble()
x_old <- custom_import(old_file) %>% as_tibble()


## extract database date from the file name
new_file_date <- gsub("^[^.]+/", "", new_file) %>% guess_dates()
old_file_date <- gsub("^[^.]+/", "", old_file) %>% guess_dates()

The completion dates of the databases are:

old data: r format(old_file_date, format = "%A %d %b %Y").
new data: r format(new_file_date, format = "%A %d %b %Y").

Data cleaning

To keep the original data unchanged as much as possible, we only clean the dates of date_report, needed for further subsetting of the data:

x_new <- x_new %>%
  mutate(date_report = guess_dates(DateReport))

x_old <- x_old %>%
  mutate(date_report = guess_dates(DateReport))

Filter

Here we retain the last 42 days of data, using the most recent data as reference. This data will be used for entry comparison, i.e. seeing what cases changed between versions, as this is a very resource-intensive work. Other comparisons will be made on all data.

start_at <- new_file_date - 42

x_new_recent <- x_new %>%
  filter(date_report >= start_at)

x_old_recent <- x_old %>%
  filter(date_report >= start_at)

Cases retained in the recent datasets were reported from the r format(start_at, "%d %B %Y").

Subset variables

Again to make the investigation of changes in cases faisible, we retain a smaller subset of variables to enable the comparison of entries:

x_new_recent <- x_new_recent %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)

x_old_recent <- x_old_recent %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)

Scales and colors

We define custom colors for the output of compare_df.

comp_colors <- c(addition = "#1FC46F",
                 removal = "#F94444",
                 unchanged_cell = "#7C8AA4",
                 unchanged_row = "#7C8AA4")

Data comparison {.tabset .tabset-fade .tabset-pills}

Explanations

The data comparison includes the following:

structure comparison: look for differences in dimensions, and names, ordering and types of the variables
duplicates: look for duplicated IDs, and identify the corresponding entries; this includes a comparison of the duplicates between old and new datasets
changes in cases: look for changes in cases, showing changes in colors into a separate html table

Struture comparison

In this part we use linelist's function compare_data to compare the structures of the two datasets: which variables, of which types, did some variable disappear etc.

compare_data(x_old, x_new, use_values = FALSE)

Looking for duplicates: old data

In this section, we look for duplicated identifiers, and output a table of the corresponding individuals.

to_keep <- x_old %>%
  filter(duplicated(ID)) %>%
  pull(ID)

duplicates_old <- x_old %>%
  filter(ID %in% to_keep)

if (nrow(duplicates_old) > 0) {
  duplicates_old %>%
    show_table()
}

Looking for duplicates: new data

In this section, we look for duplicated identifiers, and output a table of the corresponding individuals.

to_keep <- x_new %>%
  filter(duplicated(ID)) %>%
  pull(ID)

duplicates_new <- x_new %>%
  filter(ID %in% to_keep)

if (nrow(duplicates_new) > 0) {
  duplicates_new %>%
    show_table()
}

Changes in duplicates

In this part we compare duplicated entries between the old and the new dataset. Note that the table display in this document is sub-optimal. Click on the link below to open the table.

Color code:

No change
Removal (in old file, not in the new one)
Addition (in new file, not in old one)

Open comparison table

Important:the file will open in your web browser by default; for even better visualisation, we recommend going to the tables/ folder and opening comparison_duplicates_table.html with Excel

duplicates_new_select <- duplicates_new %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)


duplicates_old_select <- duplicates_old %>%
  select(1:60,
         DateOutcomeComp,
         FinalStatus,
         date_report)


comparison_duplicates <- compareDF::compare_df(
                                        duplicates_new_select,
                                        duplicates_old_select,
                                        group_col = "ID",
                                        limit_html = 1000,
                                        color_scheme = comp_colors,
                                        keep_unchanged_rows = TRUE,
                                        stop_on_error = FALSE)

Changes in cases

In this section we look for changes in cases between the two versions of the dataset, using the compareDF package to check which entries of the data have changed. This is computer-intensive work. To make it run under reasonable time, we restrict the comparisons to cases reported within the last 42 days (Filter section).

Note that the table display in this document is sub-optimal. Click on the link below to open the table.

Color code:

No change
Removal (in old file, not in the new one)
Addition (in new file, not in old one)

Open comparison table

Important: the file will open in your web browser by default; for even better visualisation, we recommend going to the tables/ folder and opening comparison_duplicates_table.html with Excel

comparison <- compareDF::compare_df(
                             x_new_recent,
                             x_old_recent,
                             group_col = "ID",
                             limit_html = 1000,
                             color_scheme = comp_colors,
                             stop_on_error = FALSE)

Export tables {.tabset .tabset-fade .tabset-pills}

Create output directory

We create a directory called tables/ to store output files, if it does not exist.

if (!dir.exists("tables")) {
  dir.create("tables")
}

Create output files

The following items are exported:

writeLines(comparison_duplicates$html_output,
           con = file.path("tables",
                           "comparison_duplicates_table.html"),
    useBytes = TRUE)

writeLines(comparison$html_output,
           con = file.path("tables",
                            "comparison_table.html"),
           useBytes = TRUE)

if (nrow(duplicates_old) > 0) {
  rio::export(duplicates_old, file = file.path("tables",
                                               "duplicates_old.xlsx"))
}

if (nrow(duplicates_new) > 0) {
  rio::export(duplicates_new, file = file.path("tables",
                                               "duplicates_new.xlsx"))
}

Links to output files

Click on the links below to open items:

Changes in cases:

comparison_duplicates_table: html table of comparisons amongst old and new duplicates
comparison_table: html table of comparisons amongst old and new cases

Duplicates:

duplicates_old: Excel file containing duplicated entries (old dataset)
duplicates_new: Excel file containing duplicated entries (new dataset)

System information {.tabset .tabset-fade .tabset-pills}

Outline

The following information documents the system on which the document was compiled.

System

This provides information on the operating system.

Sys.info()

R environment

This provides information on the version of R used:

R.version

R packages

This provides information on the packages used:

sessionInfo()

Compilation parameters

This provides information on the parameters (passed through params) used for compiling this document:

params

reconhub/datacomparator documentation built on Nov. 5, 2019, 3:05 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

reconhub/datacomparator
datacomparator

In reconhub/datacomparator: datacomparator

Data preparation {.tabset .tabset-fade .tabset-pills}

Outline

Data used

Method

Load scripts

Load data

Data cleaning

Filter

Subset variables

Scales and colors

Data comparison {.tabset .tabset-fade .tabset-pills}

Explanations

Struture comparison

Looking for duplicates: old data

Looking for duplicates: new data

Changes in duplicates

Changes in cases

Export tables {.tabset .tabset-fade .tabset-pills}

Create output directory

Create output files

Links to output files

System information {.tabset .tabset-fade .tabset-pills}

Outline

System

R environment

R packages

Compilation parameters

R Package Documentation

Browse R Packages

We want your feedback!

reconhub/datacomparator datacomparator

In reconhub/datacomparator: datacomparator

Data preparation {.tabset .tabset-fade .tabset-pills}

Outline

Data used

Method

Load scripts

Load data

Data cleaning

Filter

Subset variables

Scales and colors

Data comparison {.tabset .tabset-fade .tabset-pills}

Explanations

Struture comparison

Looking for duplicates: old data

Looking for duplicates: new data

Changes in duplicates

Changes in cases

Export tables {.tabset .tabset-fade .tabset-pills}

Create output directory

Create output files

Links to output files

System information {.tabset .tabset-fade .tabset-pills}

Outline

System

R environment

R packages

Compilation parameters

R Package Documentation

Browse R Packages

We want your feedback!

reconhub/datacomparator
datacomparator