View source: R/data_integrity.R
data_integrity | R Documentation |
Returns several vectors of variable names, grouped by type, so you can quickly filter variables with NA values, categorical variables (factor or character), numerical variables, and other types (boolean, date, POSIX). It also reports which variables have high cardinality. The function returns an 'integrity' object containing 'status_now' (the output of the status function) and a 'results' list with the following elements:
vars_cat: Vector containing the categorical variables names (factor or character)
vars_num: Vector containing the numerical variables names
vars_char: Vector containing the character variables names
vars_factor: Vector containing the factor variables names
vars_other: Vector containing the other variables names (date time, posix and boolean)
vars_num_with_NA: Summary table for numerical variables with NA
vars_cat_with_NA: Summary table for categorical variables with NA
vars_cat_high_card: Summary table for high cardinality variables (where threshold = MAX_UNIQUE parameter)
vars_one_value: Vector containing the names of variables with only one unique value
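To make the groups above concrete, here is a minimal base R sketch of how such a classification could be derived. It is not the package's implementation: the toy data frame, the lowered `max_unique` value, the use of `>` for the threshold comparison, and the exclusion of NA from the unique counts are all assumptions made for illustration. Where the real function returns summary tables (the `*_with_NA` and high cardinality elements), the sketch returns only the variable names.

```r
# Hypothetical toy data; not taken from funModeling.
df <- data.frame(
  age     = c(34, NA, 51, 29),
  city    = c("Lima", "Oslo", NA, "Lima"),
  grade   = factor(c("a", "b", "a", "c")),
  active  = c(TRUE, FALSE, TRUE, TRUE),
  country = "PE",
  stringsAsFactors = FALSE
)

# Stand-in for MAX_UNIQUE, lowered so the toy data triggers the flag.
max_unique <- 2

# Per-column type flags.
is_num    <- vapply(df, is.numeric, logical(1))
is_char   <- vapply(df, is.character, logical(1))
is_factor <- vapply(df, is.factor, logical(1))
is_cat    <- is_char | is_factor

# Per-column NA flag and count of distinct non-NA values (assumption: NA is not counted).
has_na   <- vapply(df, anyNA, logical(1))
n_unique <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))

vars_num           <- names(df)[is_num]
vars_char          <- names(df)[is_char]
vars_factor        <- names(df)[is_factor]
vars_cat           <- names(df)[is_cat]
vars_other         <- names(df)[!(is_num | is_cat)]
vars_num_with_NA   <- names(df)[is_num & has_na]
vars_cat_with_NA   <- names(df)[is_cat & has_na]
vars_cat_high_card <- names(df)[is_cat & n_unique > max_unique]
vars_one_value     <- names(df)[n_unique == 1]
```

In this toy data, `active` lands in `vars_other` because it is logical, and `country` lands in `vars_one_value` because every row holds the same value.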
Explore the NA and high cardinality variables with summary(integrity_object), or get a full summary with print(integrity_object).
data_integrity(data, MAX_UNIQUE = 35)
data: data frame or a single vector
MAX_UNIQUE: maximum number of unique values a categorical variable may have before it is flagged as high cardinality (default 35). Variables with more than about 35 distinct values usually need their number of categories reduced.
An 'integrity' object.
# Example 1:
data_integrity(heart_disease)
# Example 2:
# changing the default threshold used to flag a variable as high cardinality
data_integrity(data=data_country, MAX_UNIQUE=50)
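The examples above print the object directly. The sketch below shows how the returned object might be stored and inspected instead. It assumes the structure described on this page ('results' and 'status_now' are accessible with `$`) and is guarded so that it only runs when funModeling is installed; the variable name `di` is arbitrary.

```r
di <- NULL

# Only run if the package is available; otherwise di stays NULL.
if (requireNamespace("funModeling", quietly = TRUE)) {
  di <- funModeling::data_integrity(funModeling::heart_disease)

  # Names of the numerical and categorical variables.
  num_vars <- di$results$vars_num
  cat_vars <- di$results$vars_cat

  # Keep only the numerical columns of the original data.
  heart_num <- funModeling::heart_disease[, num_vars, drop = FALSE]

  # NA and high cardinality overview, as described above.
  summary(di)
}
```

Storing the result this way lets the name vectors drive later steps, such as subsetting columns by type, without recomputing the checks.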