test_two_datasets: Regression testing for dataset migration


Description

This package provides regression testing for dataset migration. Given a scenario where a legacy dataset will be replaced by a target dataset, it analyzes the differences between them using the following tests:

1. Distribution test: Kolmogorov-Smirnov test.
2. Correlation tests: Pearson correlation coefficient and Spearman's correlation.
3. Different variables and records.
4. Magnitude comparison.
5. Mean relative errors.
6. The difference between two hierarchical pairs in Spearman's test.
7. Features that have NA values.
8. Hybrid tests, which show features that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests.
9. Ranking, which shows the ranking of variables that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests.

The final report is written to a user-specified xlsx file and/or to an object stored in an RData file. Users can choose which test results are included in the final report.
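To make the statistics named above concrete, the following is a minimal base-R sketch of the distribution test, the two correlation tests, and a mean relative error computed on one pair of vectors. It is an illustration only and not the package's internal implementation: the vectors `legacy` and `target` are simulated, and the mean relative error is assumed to be taken relative to the legacy values, which may differ from the definition used by the package.

```r
# Illustrative only: base-R versions of the statistics named in the
# description, not the package's internal implementation.
set.seed(42)
legacy <- rnorm(200, mean = 10, sd = 2)             # hypothetical legacy variable
target <- legacy + rnorm(200, mean = 0, sd = 0.1)   # hypothetical migrated variable

# Distribution test: two-sample Kolmogorov-Smirnov test
ks <- ks.test(legacy, target)

# Correlation tests: Pearson and Spearman
pearson  <- cor(legacy, target, method = "pearson")
spearman <- cor(legacy, target, method = "spearman")

# Mean relative error, here taken relative to the legacy values
# (assumption: the package may define it differently)
mre <- mean(abs(target - legacy) / abs(legacy))

print(c(ks_p_value = ks$p.value, pearson = pearson,
        spearman = spearman, mre = mre))
```

A small Kolmogorov-Smirnov p-value, a low correlation, or a large mean relative error for a variable would each suggest that the migration changed that variable; the cut-offs that count as "small" or "large" are what the thresholds input is for.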

Usage

test_two_datasets(legacy_file = NULL, legacy_df = NULL,
  target_file = NULL, target_df = NULL, hier = NULL,
  hier_df = NULL, hier_pair = NULL, hier_pair_df = NULL,
  thresholds = NULL, thresholds_df = NULL, final_report = NULL,
  final_data = NULL, key_col, hier_col = NULL, report_join = TRUE,
  report_hybrid = TRUE, report_magnitude = TRUE, report_mre = TRUE,
  report_spearman = TRUE, report_pearson = TRUE,
  report_distribution = TRUE, report_rank = TRUE,
  report_spearman_diff = TRUE, report_na = TRUE,
  report_var_attr = TRUE)

Arguments

legacy_file

Full path of the input legacy dataset (csv)

legacy_df

Data frame containing the input legacy dataset

target_file

Full path of the input target dataset (csv)

target_df

Data frame containing the input target dataset

hier

Full path of the hierarchy file (csv)

hier_df

Data frame containing the hierarchy

hier_pair

Full path of the hierarchical pair file (csv)

hier_pair_df

Data frame containing the hierarchical pair

thresholds

Full path of the file containing the thresholds

thresholds_df

Data frame containing the thresholds

final_report

Full path of the output file (xlsx)

final_data

Full path of the output file (RData)

key_col

Key column in the two datasets

hier_col

Column name that contains hierarchies

report_join

Boolean variable to control the report of joined metrics. TRUE - generate the report; FALSE - the report will not be generated.

report_hybrid

Boolean variable to control the report of hybrid metrics. TRUE - generate the report; FALSE - the report will not be generated.

report_magnitude

Boolean variable to control the report of magnitude metrics. TRUE - generate the report; FALSE - the report will not be generated.

report_mre

Boolean variable to control the report based on mean relative errors. TRUE - generate the report; FALSE - the report will not be generated.

report_spearman

Boolean variable to control the report based on Spearman's test. TRUE - generate the report; FALSE - the report will not be generated.

report_pearson

Boolean variable to control the report based on the Pearson correlation test. TRUE - generate the report; FALSE - the report will not be generated.

report_distribution

Boolean variable to control the report based on the distribution test (Kolmogorov-Smirnov). TRUE - generate the report; FALSE - the report will not be generated.

report_rank

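Boolean variable to control the ranking report, which ranks variables by their appearances in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests. TRUE - generate the report; FALSE - the report will not be generated.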

report_spearman_diff

Boolean variable to control the report of Spearman's test difference on different hierarchies. TRUE - generate the report; FALSE - the report will not be generated.

report_na

Boolean variable to control the report of features with NA values. TRUE - generate the report; FALSE - the report will not be generated.

report_var_attr

Boolean variable to control the report of variables' attributes. TRUE - generate the report; FALSE - the report will not be generated.
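As a rough illustration of what the key column is used for, the sketch below compares two toy data frames on a shared key, lists the variables and records that exist in only one of them, and finds features with NA values. This is base R on hypothetical data (the columns `sales`, `cost`, and `price` are invented for the example) and is not the package's own code; only the key column name `CODE` is borrowed from the Examples section.

```r
# Hypothetical toy data; "CODE" plays the role of key_col.
legacy <- data.frame(CODE  = c("A", "B", "C"),
                     sales = c(10, NA, 30),
                     cost  = c(1, 2, 3),
                     stringsAsFactors = FALSE)
target <- data.frame(CODE  = c("B", "C", "D"),
                     sales = c(20, 31, 40),
                     price = c(5, 6, 7),
                     stringsAsFactors = FALSE)
key <- "CODE"

# Variables present in only one of the two datasets
vars_only_legacy <- setdiff(names(legacy), names(target))
vars_only_target <- setdiff(names(target), names(legacy))

# Records (key values) present in only one of the two datasets
keys_only_legacy <- setdiff(legacy[[key]], target[[key]])
keys_only_target <- setdiff(target[[key]], legacy[[key]])

# Inner join on the key: only shared records can be compared value by value
joined <- merge(legacy, target, by = key,
                suffixes = c("_legacy", "_target"))

# Features of the legacy dataset that contain at least one NA value
na_legacy <- names(legacy)[colSums(is.na(legacy)) > 0]
```

Here the join keeps only the records with keys present in both data frames, so any value-level comparison between the legacy and target versions of `sales` is restricted to those shared records.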

Examples

library("rio")

# Let us first look at the hierarchical case

old_file <- '../data/restore_old.RData'
new_file <- '../data/restore_new.RData'
geo_hier <- '../data/restore_geo_hierarchies.RData'
geo_pair <- '../data/restore_geo_pairs.RData'
thresholds <- '../data/restore_thresholds.RData'
final_report <- '../inst/extdata/analysis_results_hierarchy.xlsx'
key <- 'CODE'
hierarchy <- 'GEO'

old_file <- import(old_file)
new_file <- import(new_file)
geo_hier <- import(geo_hier)
geo_pair <- import(geo_pair)
thresholds <- import(thresholds)

test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  hier_df = geo_hier,
                  hier_pair_df = geo_pair,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key,
                  hier_col = hierarchy)
                  
# Now let us consider the flat hierarchy case
# To save space, we will reuse old_file, new_file, thresholds, and key variables.

# Remove the hierarchy columns:
old_file$GEO <- NULL
new_file$GEO <- NULL

final_report <- '../inst/extdata/analysis_results_flat_hierarchy.xlsx'

# Note that a dummy column GEO will be generated in the report

test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key)

miranska/restore documentation built on May 8, 2019, 1:21 p.m.