knitr::opts_chunk$set(collapse = TRUE, comment = "", out.width = "600px", dpi = 70)
options(tibble.print_min = 4L, tibble.print_max = 4L)

library(dlookr)
library(dplyr)
library(ggplot2)

Preface

After you have acquired the data, you should do the following:

The dlookr package makes these steps fast and easy:

dlookr increases synergy with dplyr. Particularly in data exploration and data wrangling, it increases the efficiency of the tidyverse package group.

Supported data structures

Data diagnosis supports the following data structures.

List of supported tasks of data analytics

Diagnose Data

Overall Diagnose Data

Tasks | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: describe overview of data | Inquire basic information to understand the data in general | overview() | summary overview object | summary described overview of data | summary.overview() | plot overview object | plot described overview of data | plot.overview() | diagnose data quality of variables | The scope of data quality diagnosis is information on missing values and unique value information | diagnose() | x diagnose data quality of categorical variables | frequency, ratio, rank by levels of each variables | diagnose_category() | x diagnose data quality of numerical variables | descriptive statistics, number of zero, minus, outliers | diagnose_numeric() | x diagnose data quality for outlier | number of outliers, ratio, mean of outliers, mean with outliers, mean without outliers | diagnose_outlier() | x plot outliers information of numerical data | box plot and histogram whith outliers, without outliers | plot_outlier.data.frame() | x plot outliers information of numerical data by target variable | box plot and density plot whith outliers, without outliers | plot_outlier.target_df() | x diagnose combination of categorical variables | Check for sparse cases of level combinations of categorical variables | diagnose_sparese() |

Visualize Missing Values

Tasks | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: pareto chart for missing value | visualize the Pareto chart for variables with a missing value. | plot_na_pareto() | combination chart for missing value | visualize the distribution of missing value by combining variables. | plot_na_hclust() | plot the combination variables that is include missing value | visualize the combinations of missing value across cases | plot_na_intersect() |

Reporting

Types | Descriptions | Functions | Support DBI :-----|:-------|:---|:---: report the information of data diagnosis into a PDF file | report the information for diagnosing the data quality | diagnose_report() | x reporting the information of data diagnosis into HTML file | report the information for diagnosing the quality of the data | diagnose_report() | x reporting the information of data diagnosis into HTML file | dynamic report the information for diagnosing the quality of the data | diagnose_web_report() | x reporting the information of data diagnosis into PDF and HTML files | paged report the information for diagnosing the quality of the data | diagnose_paged_report() | x

EDA

Univariate EDA

Types | Tasks | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: categorical | summaries | frequency tables | univar_category() | categorical | summaries | chi-squared test | summary.univar_category() | categorical | visualize | bar charts | plot.univar_category() | categorical | visualize | bar charts | plot_bar_category() | numerical | summaries | descriptive statistics | describe() | x numerical | summaries | descriptive statistics | univar_numeric() | numerical | summaries | descriptive statistics of standardized variable | summary.univar_numeric() | numerical | visualize | histogram, box plot | plot.univar_numeric() | numerical | visualize | Q-Q plots | plot_qq_numeric() | numerical | visualize | box plot | plot_box_numeric() | numerical | visualize | histogram | plot_hist_numeric() |

Bivariate EDA

Types | Tasks | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: categorical | summaries | frequency tables cross cases | compare_category() | categorical | summaries | contingency tables, chi-squared test | summary.compare_category() | categorical | visualize | mosaics plot | plot.compare_category() | numerical | summaries | correlation coefficient, linear model summaries | compare_numeric() | numerical | summaries | correlation coefficient, linear model summaries with threshold | summary.compare_numeric() | numerical | visualize | scatter plot with marginal box plot | plot.compare_numeric() | numerical | Correlate | correlation coefficient | correlate() | x numerical | Correlate | summaries with correlation matrix | summary.correlate() | x numerical | Correlate | visualization of a correlation matrix | plot.correlate() | x both | PPS | PPS(Predictive Power Score) | pps() | x both | PPS | summaries with PPS | summary.pps() | x both | PPS | visualization of a PPS matrix | plot.pps() | x

Normality Test

Types | Tasks | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: numerical | summaries | Shapiro-Wilk normality test | normality() | x numerical | summaries | normality diagnosis plot (histogram, Q-Q plots) | plot_normality() | x

Relationship between target variable and predictors

Target Variable | Predictor | Descriptions | Functions | Support DBI :---|:---|:-------|:---|:---: categorical | categorical | contingency tables | relate() | x categorical | categorical | mosaics plot | plot.relate() | x categorical | numerical | descriptive statistic for each levels and total observation | relate() | x categorical | numerical | density plot | plot.relate() | x categorical | categorical | bar charts | plot_bar_category() | numerical | categorical | ANOVA test | relate() | x numerical | categorical | scatter plot | plot.relate() | x numerical | numerical | simple linear model | relate() | x numerical | numerical | box plot | plot.relate() | x categorical | numerical | Q-Q plots | plot_qq_numeric() | categorical | numerical | box plot | plot_box_numeric() | categorical | numerical | histogram | plot_hist_numeric() |

Reporting

Types | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: reporting the information of EDA into PDF file | reporting the information of EDA | eda_report() | x reporting the information of EDA into HTML file | reporting the information of EDA | eda_report() | x reporting the information of EDA into PDF file | dynamic reporting the information of EDA | eda_web_report() | x reporting the information of EDA into HTML file | paged reporting the information of EDA | eda_paged_report() | x

Transform Data

Find Variables

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: missing values | find the variable that contains the missing value in the object that inherits the data.frame | find_na() | outliers | find the numerical variable that contains outliers in the object that inherits the data.frame | find_outliers() | skewed variable | find the numerical variable that is the skewed variable that inherits the data.frame | find_skewness() |

Imputation

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: missing values | missing values are imputed with some representative values and statistical methods. | imputate_na() | outliers | outliers are imputed with some representative values and statistical methods. | imputate_outlier() | summaries | calculate descriptive statistics of the original and imputed values. | summary.imputation() | visualize | the imputation of a numerical variable is a density plot, and the imputation of a categorical variable is a bar plot. | plot.imputation() |

Binning

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: binning | converts a numeric variable to a categorization variable | binning() | summaries | calculate frequency and relative frequency for each levels(bins) | summary.bins() | visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | plot.bins() | optimal binning | categorizes a numeric characteristic into bins for ulterior usage in scoring modeling | binning_by() | summaries | summary metrics to evaluate the performance of binomial classification model | summary.optimal_bins() | visualize | generates plots for understand distribution, bad rate, and weight of evidence after running binning_by() | plot.optimal_bins() | infogain binning | categorizes a numeric characteristic into bins for multi-class variables using recursive information gain ratio maximization | binning_rgr() | visualize | generates plots for understanding distribution and distribution by target variable after running binning_rgr() | plot.infogain_bins() | evaluate | calculates metrics to evaluate the performance of binned variable for binomial classification model | performance_bin() | summaries | summary metrics to evaluate the performance of binomial classification model after performance_bin() | summary.performance_bin() | visualize | It generates plots to understand frequency, WoE by bins using performance_bin after running binning_by() | plot.performance_bin() | visualize | extract bins from "bins" and "optimal_bins" objects | extract.bins() |

Diagnose Binned Variable

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: diagnosis | performs diagnose performance that calculates metrics to evaluate the performance of binned variable for binomial classification model | performance_bin() | summaries | summary method for "performance_bin". summary metrics to evaluate the performance of the binomial classification model | summary.performance_bin() | visualize | visualize for understanding frequency, WoE by bins using performance_bin and something else | plot.performance_bin() |

Transformation

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: transformation | performs variable transformation for standardization and resolving skewness of numerical variables | transform() | summaries | compares the distribution of data before and after data transformation | summary.transform() | visualize | visualize two kinds of a plot by attribute of the 'transform' class. The transformation of a numerical variable is a density plot | plot.transform() |

Reporting

Types | Descriptions | Functions | Support DBI :-----|:--------|:---|:---: reporting the information of transformation into PDF | reporting the information of transformation | transformation_report() | reporting the information of transformation into HTML | reporting the information of transformation | transformation_report() | reporting the transformation information into PDF | dynamic reporting the transformation information | transformation_web_report() | reporting the information of transformation into HTML | paged reporting the information of transformation | transformation_paged_report() |

Miscellaneous

Statistics

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: statistics | calculate the entropy | entropy() | statistics | calculate the skewness of the data | skewness() | statistics | calculate the kurtosis of the data | kurtosis() | statistics | calculate the Jensen-Shannon divergence between two probability distributions | jsd() | statistics | calculate the Kullback-Leibler divergence between two probability distributions | kld() | statistics | calculate the Cramer's V statistic between two categorical(discrete) variables | cramer() | statistics | calculate the Theil's U statistic between two categorical(discrete) variables | theil() | statistics | finding percentile of a numerical variable. | get_percentile() | statistics | transform a numeric vector using several methods like "log", "sqrt", "log+1", "log+a", "1/x", "x^2", "x^3", "Box-Cox", "Yeo-Johnson"| get_transform() | statistics | calculate the Cramer's V statistic | cramer() | statistics | calculate the Theil's U statistic | theil() |

Programming

Types | Descriptions | Functions | Support DBI :---|:-------|:---|:---: programming | extracts variable information having a certain class from an object inheriting data.frame | find_class() | programming | gets class of variables in data.frame or tbl_df | get_class() | programming | retrieves the column information of the DBMS table through the tbl_bdi object of dplyr | get_column_info() | programming | finding the user machine's OS. | get_os() | programming | import Google fonts | import_google_font() |



choonghyunryu/dlookr documentation built on June 11, 2024, 9:12 a.m.