compare_diag: Diagnosis of train set and test set of split_df object

View source: R/split.R

compare_diag    R Documentation

Diagnosis of train set and test set of split_df object

Description

Diagnoses the similarity between the train set and the test set contained in an object of class "split_df".

Usage

compare_diag(
  .data,
  add_character = FALSE,
  uniq_thres = 0.01,
  miss_msg = TRUE,
  verbose = TRUE
)

Arguments

.data

an object of class "split_df", usually the result of a call to split_by().

add_character

logical. Whether to include character (text) variables in the comparison of categorical data. The default is FALSE, which excludes character variables.

uniq_thres

numeric. Threshold for identifying variables to remove: a variable is selected when its ratio of unique values (number of unique values / number of observations) is greater than this value.

miss_msg

logical. Whether to output a message when diagnosing missing values.

verbose

logical. Whether to echo information to the console at runtime.

Details

Three kinds of variables are diagnosed across the two split datasets: variables with a single value, variables with a level that does not appear in one of the datasets, and variables with a high ratio of unique values to the number of observations.
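The three checks described above can be illustrated with a minimal base R sketch. This is not the alookr implementation: the toy train and test data and the helper functions n_unique(), uniq_type(), and n_missing_levels() are hypothetical, and ignoring NA when counting unique values is an assumption. Only the column names are taken from the Value section.

```r
# Toy train/test sets (hypothetical data, not produced by alookr)
train <- data.frame(
  const = rep(1, 6),
  grp   = factor(c("a", "a", "b", "b", "a", "b"), levels = c("a", "b", "c")),
  code  = factor(c("u", "v", "w", "x", "y", "z"))
)
test <- data.frame(
  const = c(1, 2, 1),
  grp   = factor(c("a", "c", "c"), levels = c("a", "b", "c")),
  code  = factor(c("u", "v", "u"), levels = levels(train$code))
)

# Number of distinct non-missing values (assumption: NA is not counted)
n_unique <- function(x) length(unique(x[!is.na(x)]))

# 1. Single-value check: does a variable hold only one distinct value?
uniq_type <- function(x) if (n_unique(x) == 1L) "single" else "multi"
single_value <- data.frame(
  variables  = names(train),
  train_uniq = vapply(train, uniq_type, character(1)),
  test_uniq  = vapply(test, uniq_type, character(1)),
  row.names  = NULL
)

# 2. Unique rate of categorical variables:
#    number of unique values / number of observations
cat_vars  <- names(train)[vapply(train, is.factor, logical(1))]
rate      <- function(x) n_unique(x) / length(x)
uniq_rate <- data.frame(
  variables      = cat_vars,
  train_uniqrate = vapply(train[cat_vars], rate, numeric(1)),
  test_uniqrate  = vapply(test[cat_vars], rate, numeric(1)),
  row.names      = NULL
)

# 3. Missing levels: declared factor levels that never occur in a set
n_missing_levels <- function(x) sum(!levels(x) %in% as.character(x))
missing_level <- data.frame(
  variables            = cat_vars,
  n_levels             = vapply(train[cat_vars], nlevels, integer(1)),
  train_missing_nlevel = vapply(train[cat_vars], n_missing_levels, integer(1)),
  test_missing_nlevel  = vapply(test[cat_vars], n_missing_levels, integer(1)),
  row.names            = NULL
)

print(single_value)
print(uniq_rate)
print(missing_level)
```

In this toy data, const is constant only in the train set, grp lacks a different level in each set, and code has one distinct value per training row, so each table flags a different kind of problem.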

Value

A list of three components. The first component, "single_value", is a tbl_df with the following variables:

  • variables : character. variable name

  • train_uniq : character. the type of unique values in the train set, either "single" or "multi".

  • test_uniq : character. the type of unique values in the test set, either "single" or "multi".

The second component, "uniq_rate", is a tbl_df with the following variables:

  • variables : character. categorical variable name

  • train_uniqcount : numeric. the number of unique values in the train set

  • train_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the train set

  • test_uniqcount : numeric. the number of unique values in the test set

  • test_uniqrate : numeric. the ratio of unique values (number of unique values / number of observations) in the test set

The third component, "missing_level", is a tbl_df with the following variables:

  • variables : character. variable name

  • n_levels : integer. the number of levels of the categorical variable

  • train_missing_nlevel : integer. the number of levels that do not appear in the train set

  • test_missing_nlevel : integer. the number of levels that do not appear in the test set

Examples

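library(alookr)
library(dplyr)

# Credit Card Default Data
head(ISLR::Default)

defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))

# Inject missing values into "student" and "balance"
set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA

# Split into train and test sets, with "default" as the target variable
sb <- defaults %>%
  split_by(default)

# Diagnose with the default settings
sb %>%
  compare_diag()

# Also include character variables in the comparison
sb %>%
  compare_diag(add_character = TRUE)

# Use a lower threshold for the ratio of unique values
sb %>%
  compare_diag(uniq_thres = 0.0005)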


alookr documentation built on May 29, 2024, 10:38 a.m.