test_two_datasets: Regression testing for dataset migration
In miranska/restore: Regression testing for dataset migration

This package provides regression testing for dataset migration. Given a scenario where a legacy dataset will be replaced by a target dataset, it analyzes the differences between the two based on the following tests:
1. Distribution test: Kolmogorov-Smirnov test;
2. Correlation tests: Pearson correlation coefficient and Spearman's correlation;
3. Different variables and records;
4. Magnitude comparison;
5. Mean relative errors;
6. The difference between two hierarchical pairs in Spearman's test;
7. Features that have NA values;
8. Hybrid tests, which show features that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests;
9. Ranking, which shows the ranking of variables that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests.
The final report is written to a user-specified xlsx file and/or to an object stored in an RData file. Users can choose which test results are included in the final report.
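To make the per-variable tests more concrete, here is a minimal sketch in base R of how a distribution test, the two correlation tests, and a mean relative error could be computed for two datasets joined on a key, and how the hybrid and ranking views might be derived from them. It is not the package's implementation: the toy data, the helper `compare_var()`, the definition of mean relative error as mean(|new - old| / |old|), the cut-offs 0.05, 0.10, and 0.90, and the reading of "hybrid" as "flagged by all three tests" are all assumptions made for illustration.

```r
# Illustrative sketch using base R only; NOT the package's own code.
set.seed(42)

# Hypothetical legacy and target datasets that share the key column CODE.
n <- 200
legacy <- data.frame(CODE   = seq_len(n),
                     income = rnorm(n, mean = 50, sd = 10),
                     age    = runif(n, min = 18, max = 90))
target <- data.frame(CODE   = seq_len(n),
                     income = legacy$income * 1.5 + rnorm(n, sd = 5),  # shifted on purpose
                     age    = legacy$age + rnorm(n, sd = 0.1))         # nearly unchanged

# Join on the key so that legacy and target values line up record by record.
joined <- merge(legacy, target, by = "CODE", suffixes = c("_old", "_new"))

# Hypothetical helper: compute the four metrics for one variable.
compare_var <- function(v) {
  old <- joined[[paste0(v, "_old")]]
  new <- joined[[paste0(v, "_new")]]
  data.frame(
    variable = v,
    ks_p     = ks.test(old, new)$p.value,              # distribution test
    pearson  = cor(old, new, method = "pearson"),      # linear correlation
    spearman = cor(old, new, method = "spearman"),     # rank correlation
    mre      = mean(abs(new - old) / abs(old)),        # assumed MRE definition
    stringsAsFactors = FALSE
  )
}

metrics <- do.call(rbind, lapply(c("income", "age"), compare_var))

# Flag variables against hypothetical thresholds (assumed values).
flags <- data.frame(
  variable = metrics$variable,
  ks       = metrics$ks_p < 0.05,
  mre      = metrics$mre > 0.10,
  cor      = metrics$pearson < 0.90 | metrics$spearman < 0.90,
  stringsAsFactors = FALSE
)

# Ranking: count in how many tests each variable is flagged.
flags$appearances <- rowSums(flags[, c("ks", "mre", "cor")])
ranking <- flags[order(-flags$appearances), c("variable", "appearances")]

# Hybrid (assumed meaning): variables flagged by all three tests.
hybrid <- flags$variable[flags$appearances == 3]

print(metrics)
print(ranking)
```

In this toy setup the deliberately shifted variable ranks above the nearly unchanged one, which is the kind of signal a migration report is meant to surface. The thresholds a real run uses come from the `thresholds` or `thresholds_df` argument rather than from hard-coded constants.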

test_two_datasets(legacy_file = NULL, legacy_df = NULL,
  target_file = NULL, target_df = NULL, hier = NULL,
  hier_df = NULL, hier_pair = NULL, hier_pair_df = NULL,
  thresholds = NULL, thresholds_df = NULL, final_report = NULL,
  final_data = NULL, key_col, hier_col = NULL, report_join = TRUE,
  report_hybrid = TRUE, report_magnitude = TRUE, report_mre = TRUE,
  report_spearman = TRUE, report_pearson = TRUE,
  report_distribution = TRUE, report_rank = TRUE,
  report_spearman_diff = TRUE, report_na = TRUE,
  report_var_attr = TRUE)

`legacy_file`	Full path of the input legacy dataset (csv)
`legacy_df`	Data frame containing the input legacy dataset
`target_file`	Full path of the input target dataset (csv)
`target_df`	Data frame containing the input target dataset
`hier`	Full path of the hierarchy file (csv)
`hier_df`	Data frame containing the hierarchy
`hier_pair`	Full path of the hierarchical pair file (csv)
`hier_pair_df`	Data frame containing the hierarchical pairs
`thresholds`	Full path of the file containing the thresholds
`thresholds_df`	Data frame containing the thresholds
`final_report`	Full path of the output file (xlsx)
`final_data`	Full path of the output file (RData)
`key_col`	Key column in the two datasets
`hier_col`	Column name that contains hierarchies
`report_join`	Boolean variable to control the report of joined metrics. TRUE - generate the report; FALSE - the report will not be generated.
`report_hybrid`	Boolean variable to control the report of hybrid metrics. TRUE - generate the report; FALSE - the report will not be generated.
`report_magnitude`	Boolean variable to control the report of magnitude metrics. TRUE - generate the report; FALSE - the report will not be generated.
`report_mre`	Boolean variable to control the report based on mean relative errors. TRUE - generate the report; FALSE - the report will not be generated.
`report_spearman`	Boolean variable to control the report based on Spearman's test. TRUE - generate the report; FALSE - the report will not be generated.
`report_pearson`	Boolean variable to control the report based on Pearson test. TRUE - generate the report; FALSE - the report will not be generated.
`report_distribution`	Boolean variable to control the report based on distribution test. TRUE - generate the report; FALSE - the report will not be generated.
`report_rank`	Boolean variable to control the ranking report (appearances of variables across the tests). TRUE - generate the report; FALSE - the report will not be generated.
`report_spearman_diff`	Boolean variable to control the report of Spearman's test difference on different hierarchies. TRUE - generate the report; FALSE - the report will not be generated.
`report_na`	Boolean variable to control the report of features with NA values. TRUE - generate the report; FALSE - the report will not be generated.
`report_var_attr`	Boolean variable to control the report of variables' attributes. TRUE - generate the report; FALSE - the report will not be generated.

library("rio")      # provides import()
library("restore")  # provides test_two_datasets()

# Let us first look at the hierarchical case

old_file <- '../data/restore_old.RData'
new_file <- '../data/restore_new.RData'
geo_hier <- '../data/restore_geo_hierarchies.RData'
geo_pair <- '../data/restore_geo_pairs.RData'
thresholds <- '../data/restore_thresholds.RData'
final_report <- '../inst/extdata/analysis_results_hierarchy.xlsx'
key <- 'CODE'
hierarchy <- 'GEO'

# Load each file; the path variables are overwritten with the imported data frames
old_file <- import(old_file)
new_file <- import(new_file)
geo_hier <- import(geo_hier)
geo_pair <- import(geo_pair)
thresholds <- import(thresholds)

test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  hier_df = geo_hier,
                  hier_pair_df = geo_pair,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key,
                  hier_col = hierarchy)
                  
# Now let us consider the flat hierarchy case
# To save space, we will reuse old_file, new_file, thresholds, and key variables.

# Remove the hierarchy columns:
old_file$GEO <- NULL
new_file$GEO <- NULL

final_report <- '../inst/extdata/analysis_results_flat_hierarchy.xlsx'

# Note that a dummy column GEO will be generated in the report

test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key)
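
# A further sketch: restrict the report to selected tests and also save the
# results as an R object. The report_* flags and final_data are taken from the
# usage above; the two output paths below are hypothetical, and which reports
# to switch off is an arbitrary choice for illustration.

final_report <- '../inst/extdata/analysis_results_selected.xlsx'  # hypothetical path
final_data <- '../inst/extdata/analysis_results_selected.RData'   # hypothetical path

test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  final_data = final_data,
                  key_col = key,
                  report_pearson = FALSE,
                  report_spearman = FALSE,
                  report_spearman_diff = FALSE)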