compare_diag | R Documentation |
Diagnose the similarity between the train set and the test set contained in an object of class "split_df".
compare_diag( .data, add_character = FALSE, uniq_thres = 0.01, miss_msg = TRUE, verbose = TRUE )
.data |
an object of class "split_df", usually the result of a call to split_by(). |
add_character |
logical. Whether to include character (text) variables when comparing categorical data. The default is FALSE, which excludes character variables. |
uniq_thres |
numeric. Threshold used to remove a variable when its ratio of unique values (number of unique values / number of observations) is greater than this value. |
miss_msg |
logical. Whether to output a message when diagnosing missing values. |
verbose |
logical. Set whether to echo information to the console at runtime. |
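The unique-value ratio that uniq_thres is compared against can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data frame `train`, the threshold value 0.5, and the helper names `uniq_rate` and `flagged` are all assumptions made for illustration.

```r
# Hypothetical toy data (not from the package): one low-cardinality
# variable and one ID-like variable with a distinct value per row.
train <- data.frame(
  grade = c("a", "b", "a", "b", "a", "b", "a", "b"),
  code  = c("k1", "k2", "k3", "k4", "k5", "k6", "k7", "k8"),
  stringsAsFactors = TRUE
)

# Ratio of unique values: number of unique values / number of observations.
uniq_rate <- vapply(
  train,
  function(x) length(unique(x)) / length(x),
  numeric(1)
)

# Illustrative threshold; the function's own default is 0.01.
uniq_thres <- 0.5

# Variables whose ratio is greater than the threshold.
flagged <- names(uniq_rate)[uniq_rate > uniq_thres]
print(uniq_rate)
print(flagged)
```

In this sketch an ID-like variable such as `code` has a ratio of 1 and exceeds any threshold below 1, while `grade` has a ratio of 0.25.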
Three kinds of variables are diagnosed across the two split datasets: variables with a single unique value, variables with a level that is missing from one of the datasets, and variables with a high ratio of unique values to the number of observations.
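The single-value and missing-level checks can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code compare_diag() runs: the toy `train` and `test` data frames and the helpers `uniq_type()` and `n_missing()` are hypothetical, and the output columns simply mirror the names listed in the Value section.

```r
# Hypothetical train/test sets sharing the same factor levels.
train <- data.frame(
  region = factor(c("n", "s", "n", "s"), levels = c("n", "s", "e")),
  flag   = factor(c("y", "y", "y", "y"), levels = c("y", "n"))
)
test <- data.frame(
  region = factor(c("n", "e"), levels = c("n", "s", "e")),
  flag   = factor(c("y", "n"), levels = c("y", "n"))
)

# "single" if a variable has exactly one distinct non-missing value.
uniq_type <- function(x) {
  if (length(unique(x[!is.na(x)])) == 1) "single" else "multi"
}

single_value <- data.frame(
  variables  = names(train),
  train_uniq = vapply(train, uniq_type, character(1)),
  test_uniq  = vapply(test, uniq_type, character(1)),
  row.names  = NULL
)

# Number of declared factor levels that never occur in the data.
n_missing <- function(x) sum(!levels(x) %in% unique(as.character(x)))

missing_level <- data.frame(
  variables            = names(train),
  n_levels             = vapply(train, nlevels, integer(1)),
  train_missing_nlevel = vapply(train, n_missing, integer(1)),
  test_missing_nlevel  = vapply(test, n_missing, integer(1)),
  row.names            = NULL
)

print(single_value)
print(missing_level)
```

Here `flag` has a single value in the train set only, and the level "e" of `region` never appears in the train set, which is the kind of mismatch the diagnosis is meant to surface.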
list. Variables of the tbl_df in the first component, named "single_value":
variables : character. variable name
train_uniq : character. the type of unique value in the train set, either "single" or "multi"
test_uniq : character. the type of unique value in the test set, either "single" or "multi"
Variables of the tbl_df in the second component, named "uniq_rate":
variables : character. categorical variable name
train_uniqcount : numeric. the number of unique values in the train set
train_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the train set
test_uniqcount : numeric. the number of unique values in the test set
test_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the test set
Variables of the tbl_df in the third component, named "missing_level":
variables : character. variable name
n_levels : integer. the number of levels of the categorical variable
train_missing_nlevel : integer. the number of levels that do not exist in the train set
test_missing_nlevel : integer. the number of levels that do not exist in the test set
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))

set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA

sb <- defaults %>%
  split_by(default)

sb %>%
  compare_diag()

sb %>%
  compare_diag(add_character = TRUE)

sb %>%
  compare_diag(uniq_thres = 0.0005)