This package provides regression testing for dataset migration. When a legacy dataset is to be replaced by a target dataset, it analyzes the differences between the two using the following tests:
1. Distribution test: Kolmogorov-Smirnov test;
2. Correlation tests: Pearson correlation coefficient and Spearman's correlation;
3. Different variables and records;
4. Magnitude comparison;
5. Mean relative errors;
6. The difference between two hierarchical pairs in Spearman's test;
7. Features that have NA values;
8. Hybrid tests, which show the features that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests;
9. Ranking, which shows the ranking of the variables that appear in the Kolmogorov-Smirnov test, the mean relative error test, and the correlation tests.
The final report is written to a user-specified xlsx file and/or to an object stored in an RData file. Users can choose which test results are included in the final report.
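To make tests 1, 2, and 5 concrete, the following is a minimal sketch in base R on hypothetical toy data. It is not the package's implementation: the variable names, the simulated values, and the definition of mean relative error used here (mean of |target - legacy| / |legacy|) are assumptions made for illustration only.

```r
# Hypothetical toy data: one numeric variable measured in both datasets,
# already aligned record by record.
set.seed(42)
legacy <- rnorm(200, mean = 10, sd = 2)
target <- legacy + rnorm(200, mean = 0, sd = 0.1)

# Test 1: two-sample Kolmogorov-Smirnov test on the two distributions
ks <- ks.test(legacy, target)

# Test 2: Pearson and Spearman correlation between legacy and target values
pearson  <- cor(legacy, target, method = "pearson")
spearman <- cor(legacy, target, method = "spearman")

# Test 5: mean relative error, assumed here to be
# mean(|target - legacy| / |legacy|); the package may define it differently
mre <- mean(abs(target - legacy) / abs(legacy))

print(c(ks_p_value = ks$p.value, pearson = pearson,
        spearman = spearman, mre = mre))
```

A variable whose target values track the legacy values closely, as in this toy case, should show a high correlation, a small mean relative error, and a Kolmogorov-Smirnov p-value that does not indicate a distribution shift.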
test_two_datasets(legacy_file = NULL, legacy_df = NULL,
                  target_file = NULL, target_df = NULL, hier = NULL,
                  hier_df = NULL, hier_pair = NULL, hier_pair_df = NULL,
                  thresholds = NULL, thresholds_df = NULL, final_report = NULL,
                  final_data = NULL, key_col, hier_col = NULL, report_join = TRUE,
                  report_hybrid = TRUE, report_magnitude = TRUE, report_mre = TRUE,
                  report_spearman = TRUE, report_pearson = TRUE,
                  report_distribution = TRUE, report_rank = TRUE,
                  report_spearman_diff = TRUE, report_na = TRUE,
                  report_var_attr = TRUE)
legacy_file | Full path of the input legacy dataset (csv)
legacy_df | Data frame containing the input legacy dataset
target_file | Full path of the input target dataset (csv)
target_df | Data frame containing the input target dataset
hier | Full path of the hierarchy file (csv)
hier_df | Data frame containing the hierarchy
hier_pair | Full path of the hierarchical pair file (csv)
hier_pair_df | Data frame containing the hierarchical pairs
thresholds | Full path of the file containing the thresholds
thresholds_df | Data frame containing the thresholds
final_report | Full path of the output file (xlsx)
final_data | Full path of the output file (RData)
key_col | Key column in the two datasets
hier_col | Name of the column that contains the hierarchies
report_join | Boolean controlling the report of joined metrics. TRUE - generate the report; FALSE - do not generate it.
report_hybrid | Boolean controlling the report of hybrid metrics. TRUE - generate the report; FALSE - do not generate it.
report_magnitude | Boolean controlling the report of magnitude metrics. TRUE - generate the report; FALSE - do not generate it.
report_mre | Boolean controlling the report based on mean relative errors. TRUE - generate the report; FALSE - do not generate it.
report_spearman | Boolean controlling the report based on Spearman's test. TRUE - generate the report; FALSE - do not generate it.
report_pearson | Boolean controlling the report based on the Pearson test. TRUE - generate the report; FALSE - do not generate it.
report_distribution | Boolean controlling the report based on the distribution test. TRUE - generate the report; FALSE - do not generate it.
report_rank | Boolean controlling the report of appearances. TRUE - generate the report; FALSE - do not generate it.
report_spearman_diff | Boolean controlling the report of the difference in Spearman's test between hierarchies. TRUE - generate the report; FALSE - do not generate it.
report_na | Boolean controlling the report of features with NA values. TRUE - generate the report; FALSE - do not generate it.
report_var_attr | Boolean controlling the report of the variables' attributes. TRUE - generate the report; FALSE - do not generate it.
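The role of key_col, together with the "different variables and records" and "features with NA values" checks, can be pictured with a small base R sketch. This is an illustration under stated assumptions, not the package's code: the two toy data frames, their column names, and the helper variable names are all hypothetical.

```r
# Hypothetical toy datasets that share a key column named CODE
legacy <- data.frame(CODE = c("A", "B", "C"),
                     pop = c(10, 20, NA),
                     income = c(1, 2, 3))
target <- data.frame(CODE = c("B", "C", "D"),
                     pop = c(21, 30, 40),
                     age = c(5, 6, 7))
key <- "CODE"

# Variables (columns) that exist in only one of the two datasets
vars_only_legacy <- setdiff(names(legacy), names(target))
vars_only_target <- setdiff(names(target), names(legacy))

# Records (key values) that exist in only one of the two datasets
recs_only_legacy <- setdiff(legacy[[key]], target[[key]])
recs_only_target <- setdiff(target[[key]], legacy[[key]])

# Inner join on the key so that shared variables can be compared per record
joined <- merge(legacy, target, by = key,
                suffixes = c("_legacy", "_target"))

# Features of the legacy dataset that contain at least one NA value
na_legacy <- names(legacy)[colSums(is.na(legacy)) > 0]

print(joined)
```

Joining on the key first means that every later comparison (correlation, relative error, magnitude) is made between values that belong to the same record, which is why key_col is the only argument without a default.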
library("rio")

# Let us first look at the hierarchical case
old_file <- '../data/restore_old.RData'
new_file <- '../data/restore_new.RData'
geo_hier <- '../data/restore_geo_hierarchies.RData'
geo_pair <- '../data/restore_geo_pairs.RData'
thresholds <- '../data/restore_thresholds.RData'
final_report <- '../inst/extdata/analysis_results_hierarchy.xlsx'
key <- 'CODE'
hierarchy <- 'GEO'

# Replace each path with the data frame loaded from it
old_file <- import(old_file)
new_file <- import(new_file)
geo_hier <- import(geo_hier)
geo_pair <- import(geo_pair)
thresholds <- import(thresholds)

test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  hier_df = geo_hier,
                  hier_pair_df = geo_pair,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key,
                  hier_col = hierarchy)

# Now let us consider the flat hierarchy case
# To save space, we will reuse old_file, new_file, thresholds, and key variables.
# Remove the hierarchy columns:
old_file$GEO <- NULL
new_file$GEO <- NULL
final_report <- '../inst/extdata/analysis_results_flat_hierarchy.xlsx'

# Note that a dummy column GEO will be generated in the report
test_two_datasets(legacy_df = old_file,
                  target_df = new_file,
                  thresholds_df = thresholds,
                  final_report = final_report,
                  key_col = key)