data_integrity: Data integrity


View source: R/data_integrity.R

Description

A handy function that returns vectors of variable names, grouped so that you can quickly filter variables with NA values, categorical variables (factor / character), numerical variables and variables of other types (boolean, date, POSIX). It also reports the variables that have high cardinality. The function returns an 'integrity' object, which contains 'status_now' (the output of the status function) and a 'results' list with the following elements:

vars_cat: Vector containing the categorical variables names (factor or character)

vars_num: Vector containing the numerical variables names

vars_char: Vector containing the character variables names

vars_factor: Vector containing the factor variables names

vars_other: Vector containing the other variables names (date time, posix and boolean)

vars_num_with_NA: Summary table for numerical variables with NA

vars_cat_with_NA: Summary table for categorical variables with NA

vars_cat_high_card: Summary table for high cardinality variables (where threshold = MAX_UNIQUE parameter)

vars_one_value: Vector containing the names of the variables that have only one unique value
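To make the grouping above concrete, here is a minimal base-R sketch of how variable names could be split by type. The helper `classify_vars` is hypothetical and is not funModeling's implementation; it only mirrors the element names listed above and assumes that anything that is not numeric, factor or character counts as "other".

```r
# Hypothetical helper (not part of funModeling): group column names by type.
classify_vars <- function(data) {
  # Names of the columns for which the predicate f returns TRUE
  names_where <- function(f) names(data)[vapply(data, f, logical(1))]

  vars_factor <- names_where(is.factor)
  vars_char   <- names_where(is.character)
  vars_num    <- names_where(is.numeric)  # FALSE for factors, dates and logicals

  list(
    # Categorical = factor or character, kept in column order
    vars_cat    = names(data)[names(data) %in% c(vars_factor, vars_char)],
    vars_num    = vars_num,
    vars_char   = vars_char,
    vars_factor = vars_factor,
    # Assumption: everything else (logical, Date, POSIXct, ...) is "other"
    vars_other  = setdiff(names(data), c(vars_num, vars_factor, vars_char))
  )
}

# Small made-up data frame, for illustration only
toy <- data.frame(
  n = c(1.5, 2),
  f = factor(c("a", "b")),
  s = c("x", "y"),
  b = c(TRUE, FALSE),
  d = as.Date(c("2020-01-01", "2020-01-02")),
  stringsAsFactors = FALSE
)
groups <- classify_vars(toy)
```

In this sketch `groups$vars_cat` holds the factor and character columns and `groups$vars_other` holds the logical and date columns.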

Explore the NA and high cardinality variables by running summary(integrity_object), or get a full summary by running print(integrity_object).
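The NA summary tables report, per variable, the number of missing values (q_na) and the proportion of missing values (p_na), as can be seen in the example output further down. The following base-R sketch shows what those two columns represent. The helper `na_summary` is hypothetical and is not the package's own code.

```r
# Hypothetical helper (not part of funModeling): count and proportion of NA
# per column, keeping only the columns that contain at least one NA.
na_summary <- function(data) {
  q_na <- vapply(data, function(x) sum(is.na(x)), integer(1))
  out <- data.frame(
    variable = names(data),
    q_na     = unname(q_na),
    p_na     = unname(q_na) / nrow(data),
    stringsAsFactors = FALSE
  )
  out[out$q_na > 0, , drop = FALSE]
}

# Small made-up data frame, for illustration only
toy_na <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4,
  stringsAsFactors = FALSE
)
na_tab <- na_summary(toy_na)
```

Here column `c` has no missing values, so it is dropped from `na_tab`.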

Usage

data_integrity(data, MAX_UNIQUE = 35)

Arguments

data

data frame or a single vector

MAX_UNIQUE

maximum number of unique values a categorical variable may have before it is flagged as high cardinality. As a rule of thumb, a categorical variable with more than 35 different values usually needs its number of distinct values reduced.
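As an illustration of the MAX_UNIQUE threshold, the base-R sketch below flags categorical columns whose number of distinct values exceeds a limit. The helper `flag_high_card` is hypothetical, and it assumes a strict "greater than" comparison and that NA counts as a distinct value; the package may differ on both points.

```r
# Hypothetical helper (not part of funModeling): flag factor/character
# columns with more than `max_unique` distinct values.
flag_high_card <- function(data, max_unique = 35) {
  is_cat <- vapply(data, function(x) is.factor(x) || is.character(x), logical(1))
  # Assumption: NA is counted as one of the distinct values
  n_unique <- vapply(data[is_cat], function(x) length(unique(x)), integer(1))
  out <- data.frame(
    variable = as.character(names(n_unique)),
    unique   = unname(n_unique),
    stringsAsFactors = FALSE
  )
  # Assumption: strictly greater than the threshold
  out[out$unique > max_unique, , drop = FALSE]
}

# Small made-up data frame, for illustration only
toy_card <- data.frame(
  id   = 1:40,
  code = paste0("c", 1:40),        # 40 distinct values
  grp  = rep(c("a", "b"), 20),     # 2 distinct values
  stringsAsFactors = FALSE
)
high_card <- flag_high_card(toy_card, max_unique = 35)
```

Only `code` is flagged here: `grp` is below the threshold and `id` is numeric, so it is never considered.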

Value

An 'integrity' object.

Examples

# Example 1:
data_integrity(heart_disease)
# Example 2:
# changing the default threshold used to flag a variable as high cardinality
data_integrity(data=data_country, MAX_UNIQUE=50)

Example output

Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

funModeling v.1.9.4 :)
Examples and tutorials at livebook.datascienceheroes.com
 / Now in Spanish: librovivodecienciadedatos.ai
$vars_num_with_NA
           variable q_na       p_na
1 num_vessels_flour    4 0.01320132

$vars_cat_with_NA
  variable q_na       p_na
1     thal    2 0.00660066

$vars_cat_high_card
[1] variable unique  
<0 rows> (or 0-length row.names)

$MAX_UNIQUE
[1] 35

$vars_one_value
character(0)

$vars_cat
[1] "gender"              "chest_pain"          "fasting_blood_sugar"
[4] "resting_electro"     "thal"                "exter_angina"       
[7] "has_heart_disease"  

$vars_num
[1] "age"                    "resting_blood_pressure" "serum_cholestoral"     
[4] "max_heart_rate"         "exer_angina"            "oldpeak"               
[7] "slope"                  "num_vessels_flour"      "heart_disease_severity"

$vars_char
character(0)

$vars_factor
[1] "gender"              "chest_pain"          "fasting_blood_sugar"
[4] "resting_electro"     "thal"                "exter_angina"       
[7] "has_heart_disease"  

$vars_other
character(0)

$vars_num_with_NA
[1] variable q_na     p_na    
<0 rows> (or 0-length row.names)

$vars_cat_with_NA
[1] variable q_na     p_na    
<0 rows> (or 0-length row.names)

$vars_cat_high_card
  variable unique
1  country     70

$MAX_UNIQUE
[1] 50

$vars_one_value
character(0)

$vars_cat
[1] "country" "has_flu"

$vars_num
[1] "person"

$vars_char
[1] "country" "has_flu"

$vars_factor
character(0)

$vars_other
character(0)

funModeling documentation built on July 1, 2020, 5:40 p.m.