This package provides regression tests for dataset migration. When a legacy dataset is to be replaced by a target dataset, it analyzes the differences between the two using the following tests:
1. Distribution test: Kolmogorov-Smirnov test;
2. Correlation tests: Pearson correlation coefficient and Spearman's correlation;
3. Different variables and records;
4. Magnitude comparison;
5. Mean relative errors;
6. The difference between two hierarchical pairs in Spearman's test;
7. Features that have NA values;
8. Hybrid tests, which show the features that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests;
9. Ranking, which shows the ranking of the variables that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests.
The final report is written to a user-specified xlsx file and/or to an object stored in an RData file. Users can choose which test results are included in the final report.
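To make the statistical tests above concrete, here is a minimal base-R sketch of the distribution test, the two correlation tests, and a mean relative error on a single toy variable. It does not call this package; the vectors `legacy` and `target` are made up, and the mean relative error formula (absolute difference divided by the absolute legacy value, averaged) is an assumption about how such a metric is commonly defined, not necessarily the package's own definition.

```r
# Toy data: one numeric variable as it appears in the legacy and target datasets.
# All names and values are illustrative and not part of the package.
set.seed(42)
legacy <- rnorm(200, mean = 10, sd = 2)
target <- legacy + rnorm(200, mean = 0, sd = 0.1)  # slightly perturbed copy

# Distribution test: two-sample Kolmogorov-Smirnov test
ks <- ks.test(legacy, target)

# Correlation tests: Pearson and Spearman
pearson  <- cor(legacy, target, method = "pearson")
spearman <- cor(legacy, target, method = "spearman")

# Mean relative error, taking the legacy values as the reference
# (assumed definition; the package may compute it differently)
mre <- mean(abs(target - legacy) / abs(legacy))

results <- c(ks_p_value = ks$p.value,
             pearson    = pearson,
             spearman   = spearman,
             mre        = mre)
print(round(results, 4))
```

A large Kolmogorov-Smirnov p-value, correlations close to 1, and a small mean relative error together suggest that the variable survived the migration largely unchanged.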
test_two_datasets(legacy_file = NULL, legacy_df = NULL,
  target_file = NULL, target_df = NULL, hier = NULL,
  hier_df = NULL, hier_pair = NULL, hier_pair_df = NULL,
  thresholds = NULL, thresholds_df = NULL, final_report = NULL,
  final_data = NULL, key_col, hier_col = NULL, report_join = TRUE,
  report_hybrid = TRUE, report_magnitude = TRUE, report_mre = TRUE,
  report_spearman = TRUE, report_pearson = TRUE,
  report_distribution = TRUE, report_rank = TRUE,
  report_spearman_diff = TRUE, report_na = TRUE,
  report_var_attr = TRUE)
| legacy_file | Full path of the input legacy dataset (csv) | 
| legacy_df | Data frame containing the input legacy dataset | 
| target_file | Full path of the input target dataset (csv) | 
| target_df | Data frame containing the input target dataset | 
| hier | Full path of the hierarchy file (csv) | 
| hier_df | Data frame containing the hierarchy | 
| hier_pair | Full path of the hierarchical pair file (csv) | 
| hier_pair_df | Data frame containing the hierarchical pairs | 
| thresholds | Full path of the file containing the thresholds | 
| thresholds_df | Data frame containing the thresholds | 
| final_report | Full path of the output file (xlsx) | 
| final_data | Full path of the output file (RData) | 
| key_col | Name of the key column in the two datasets | 
| hier_col | Name of the column that contains the hierarchies | 
| report_join | Boolean variable to control the report of joined metrics. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_hybrid | Boolean variable to control the report of hybrid metrics. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_magnitude | Boolean variable to control the report of magnitude metrics. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_mre | Boolean variable to control the report based on mean relative errors. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_spearman | Boolean variable to control the report based on Spearman's test. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_pearson | Boolean variable to control the report based on Pearson test. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_distribution | Boolean variable to control the report based on distribution test. TRUE - generate the report; FALSE - the report will not be generated. | 
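| report_rank | Boolean variable to control the ranking report, which ranks variables by their appearances in the Kolmogorov-Smirnov, mean relative error, and correlation tests. TRUE - generate the report; FALSE - the report will not be generated. | 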
| report_spearman_diff | Boolean variable to control the report of Spearman's test difference on different hierarchies. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_na | Boolean variable to control the report of features with NA values. TRUE - generate the report; FALSE - the report will not be generated. | 
| report_var_attr | Boolean variable to control the report of variables' attributes. TRUE - generate the report; FALSE - the report will not be generated. | 
library("rio")
# Let us first look at the hierarchical case
old_file <- '../data/restore_old.RData'
new_file <- '../data/restore_new.RData'
geo_hier <- '../data/restore_geo_hierarchies.RData'
geo_pair <- '../data/restore_geo_pairs.RData'
thresholds <- '../data/restore_thresholds.RData'
final_report <- '../inst/extdata/analysis_results_hierarchy.xlsx'
key <- 'CODE'
hierarchy <- 'GEO'
# Load the inputs; from here on each variable holds the loaded data, not a path
old_file <- import(old_file)
new_file <- import(new_file)
geo_hier <- import(geo_hier)
geo_pair <- import(geo_pair)
thresholds <- import(thresholds)
test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  hier_df = geo_hier,
                  hier_pair_df = geo_pair,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key,
                  hier_col = hierarchy)

# Now let us consider the flat hierarchy case
# To save space, we will reuse old_file, new_file, thresholds, and key variables.
# Remove the hierarchy columns:
old_file$GEO <- NULL
new_file$GEO <- NULL
final_report <- '../inst/extdata/analysis_results_flat_hierarchy.xlsx'
# Note that a dummy column GEO will be generated in the report
test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key)
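Two of the checks listed in the description, differing variables and records and features with NA values, can be approximated in base R as a quick sanity check on the inputs before running the full comparison. The sketch below is illustrative only: the toy data frames and every variable name except the key column name `CODE` (taken from the example above) are assumptions, and it does not reproduce the package's internal logic.

```r
# Toy data frames keyed by CODE; the contents are made up for illustration.
legacy_df <- data.frame(CODE = c("A", "B", "C", "D"),
                        income = c(10, 20, NA, 40),
                        age = c(30, 41, 52, 63),
                        stringsAsFactors = FALSE)
target_df <- data.frame(CODE = c("A", "B", "C", "E"),
                        income = c(11, 19, 31, 45),
                        households = c(3, 5, 8, 13),
                        stringsAsFactors = FALSE)
key <- "CODE"

# Variables (columns) present in only one of the two datasets
vars_only_legacy <- setdiff(names(legacy_df), names(target_df))
vars_only_target <- setdiff(names(target_df), names(legacy_df))

# Records (key values) present in only one of the two datasets
keys_only_legacy <- setdiff(legacy_df[[key]], target_df[[key]])
keys_only_target <- setdiff(target_df[[key]], legacy_df[[key]])

# Features that contain at least one NA value
na_legacy <- names(legacy_df)[colSums(is.na(legacy_df)) > 0]
na_target <- names(target_df)[colSums(is.na(target_df)) > 0]

print(list(vars_only_legacy = vars_only_legacy,
           vars_only_target = vars_only_target,
           keys_only_legacy = keys_only_legacy,
           keys_only_target = keys_only_target,
           na_legacy = na_legacy,
           na_target = na_target))
```

Variables or records that show up in only one dataset cannot be compared pairwise, so it is worth reviewing them before interpreting the correlation and mean relative error results.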