knitr::opts_chunk$set(collapse = TRUE, comment = "", out.width = "600px", dpi = 70)
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(dlookr)
library(dplyr)
library(ggplot2)
After you have acquired the data, you should do the following:
The dlookr package makes these steps fast and easy:
dlookr increases synergy with dplyr. Particularly in data exploration and data wrangling, it increases the efficiency of the tidyverse package group.
Data diagnosis supports the following data structures.
Tasks | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
describe overview of data | inquire basic information to understand the data in general | overview() |
summary overview object | summarize the described overview of data | summary.overview() |
plot overview object | plot the described overview of data | plot.overview() |
diagnose data quality of variables | information on missing values and unique values | diagnose() | x
diagnose data quality of categorical variables | frequency, ratio, and rank by level of each variable | diagnose_category() | x
diagnose data quality of numerical variables | descriptive statistics; number of zeros, negative values, and outliers | diagnose_numeric() | x
diagnose data quality for outlier | number of outliers, ratio, mean of outliers, mean with outliers, mean without outliers | diagnose_outlier() | x
plot outliers information of numerical data | box plot and histogram with outliers, without outliers | plot_outlier.data.frame() | x
plot outliers information of numerical data by target variable | box plot and density plot with outliers, without outliers | plot_outlier.target_df() | x
diagnose combination of categorical variables | check for sparse cases of level combinations of categorical variables | diagnose_sparese() |
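
As a quick illustration of the diagnosis functions listed above, the sketch below runs diagnose() on airquality. That data set ships with base R and is used here only as a stand-in; it is not part of the original text. The column names used to filter the result (variables, missing_count, missing_percent) are those returned by recent dlookr releases and may differ in other versions.

```r
library(dlookr)

# airquality (base R) is an assumed example data set; Ozone and Solar.R contain NAs
diag <- diagnose(airquality)

# One row per variable: type, missing-value counts, unique-value counts
print(diag)

# Keep only the variables that have at least one missing value
na_vars <- diag[diag$missing_count > 0,
                c("variables", "missing_count", "missing_percent")]
print(na_vars)
```

The same call pattern should apply to diagnose_numeric(), diagnose_category(), and diagnose_outlier(), each of which returns its own set of columns.
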

Tasks | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
pareto chart for missing value | visualize the Pareto chart for variables with missing values | plot_na_pareto() |
combination chart for missing value | visualize the distribution of missing values by combining variables | plot_na_hclust() |
plot the combinations of variables that include missing values | visualize the combinations of missing values across cases | plot_na_intersect() |

Types | Descriptions | Functions | Support DBI
:-----|:-------|:---|:---:
report the information of data diagnosis into a PDF file | report the information for diagnosing the quality of the data | diagnose_report() | x
report the information of data diagnosis into an HTML file | report the information for diagnosing the quality of the data | diagnose_report() | x
report the information of data diagnosis into an HTML file | dynamic report of the information for diagnosing the quality of the data | diagnose_web_report() | x
report the information of data diagnosis into PDF and HTML files | paged report of the information for diagnosing the quality of the data | diagnose_paged_report() | x

Types | Tasks | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
categorical | summaries | frequency tables | univar_category() |
categorical | summaries | chi-squared test | summary.univar_category() |
categorical | visualize | bar charts | plot.univar_category() |
categorical | visualize | bar charts | plot_bar_category() |
numerical | summaries | descriptive statistics | describe() | x
numerical | summaries | descriptive statistics | univar_numeric() |
numerical | summaries | descriptive statistics of standardized variable | summary.univar_numeric() |
numerical | visualize | histogram, box plot | plot.univar_numeric() |
numerical | visualize | Q-Q plots | plot_qq_numeric() |
numerical | visualize | box plot | plot_box_numeric() |
numerical | visualize | histogram | plot_hist_numeric() |

Types | Tasks | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
categorical | summaries | frequency tables cross cases | compare_category() |
categorical | summaries | contingency tables, chi-squared test | summary.compare_category() |
categorical | visualize | mosaic plot | plot.compare_category() |
numerical | summaries | correlation coefficient, linear model summaries | compare_numeric() |
numerical | summaries | correlation coefficient, linear model summaries with threshold | summary.compare_numeric() |
numerical | visualize | scatter plot with marginal box plot | plot.compare_numeric() |
numerical | correlate | correlation coefficient | correlate() | x
numerical | correlate | summaries with correlation matrix | summary.correlate() | x
numerical | correlate | visualization of a correlation matrix | plot.correlate() | x
both | PPS | PPS (Predictive Power Score) | pps() | x
both | PPS | summaries with PPS | summary.pps() | x
both | PPS | visualization of a PPS matrix | plot.pps() | x

Types | Tasks | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
numerical | summaries | Shapiro-Wilk normality test | normality() | x
numerical | visualize | normality diagnosis plot (histogram, Q-Q plots) | plot_normality() | x
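
A minimal sketch of the normality test listed above, assuming the four numeric columns of the base R iris data set as example input (not from the original text). The result columns used below (vars, p_value) are the names returned by recent dlookr releases; treat them as an assumption for other versions.

```r
library(dlookr)

# The four numeric columns of iris (base R), an assumed example data set
norm_tbl <- normality(iris[, 1:4])

# One row per variable with the test statistic and p-value
print(norm_tbl)

# Variables for which the Shapiro-Wilk test rejects normality at the 5% level
rejected <- norm_tbl[norm_tbl$p_value < 0.05, ]
print(rejected$vars)
```
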

Target Variable | Predictor | Descriptions | Functions | Support DBI
:---|:---|:-------|:---|:---:
categorical | categorical | contingency tables | relate() | x
categorical | categorical | mosaic plot | plot.relate() | x
categorical | numerical | descriptive statistics for each level and for all observations | relate() | x
categorical | numerical | density plot | plot.relate() | x
categorical | categorical | bar charts | plot_bar_category() |
numerical | categorical | ANOVA test | relate() | x
numerical | categorical | box plot | plot.relate() | x
numerical | numerical | simple linear model | relate() | x
numerical | numerical | scatter plot | plot.relate() | x
categorical | numerical | Q-Q plots | plot_qq_numeric() |
categorical | numerical | box plot | plot_box_numeric() |
categorical | numerical | histogram | plot_hist_numeric() |

Types | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
report the information of EDA into a PDF file | report the information of EDA | eda_report() | x
report the information of EDA into an HTML file | report the information of EDA | eda_report() | x
report the information of EDA into an HTML file | dynamic report of the information of EDA | eda_web_report() | x
report the information of EDA into PDF and HTML files | paged report of the information of EDA | eda_paged_report() | x

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
missing values | find the variables that contain missing values in an object that inherits from data.frame | find_na() |
outliers | find the numerical variables that contain outliers in an object that inherits from data.frame | find_outliers() |
skewed variable | find the numerical variables that are skewed in an object that inherits from data.frame | find_skewness() |
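
The finder functions above can be sketched as follows, again assuming airquality (base R) as example data. The index = FALSE argument, which returns variable names instead of column positions, is taken from the dlookr function signatures as understood here; check the package help if your version differs.

```r
library(dlookr)

# Positions of the columns that contain missing values
na_idx <- find_na(airquality)

# The same columns by name instead of by position
na_names <- find_na(airquality, index = FALSE)
print(na_names)

# Numerical columns that contain outliers, by name
out_names <- find_outliers(airquality, index = FALSE)
print(out_names)
```

Returning positions by default makes the result convenient for column subsetting, for example airquality[, na_idx].
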

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
missing values | missing values are imputed with representative values or statistical methods | imputate_na() |
outliers | outliers are imputed with representative values or statistical methods | imputate_outlier() |
summaries | calculate descriptive statistics of the original and imputed values | summary.imputation() |
visualize | the imputation of a numerical variable is shown as a density plot, and the imputation of a categorical variable as a bar plot | plot.imputation() |
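
A minimal imputation sketch, assuming the Ozone column of airquality (base R) as the variable to impute and "median" as the method; both are illustrative choices, not taken from the original text.

```r
library(dlookr)

# Replace missing Ozone values with the median of the observed values
ozone_imp <- imputate_na(airquality, Ozone, method = "median")

# Compare descriptive statistics of the original and imputed values
summary(ozone_imp)

# The result can be used as an ordinary column
airquality_imp <- airquality
airquality_imp$Ozone <- as.numeric(ozone_imp)
```

imputate_outlier() follows the same calling pattern with its own set of methods.
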

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
binning | converts a numeric variable to a categorical variable | binning() |
summaries | calculate frequency and relative frequency for each level (bin) | summary.bins() |
visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | plot.bins() |
optimal binning | categorizes a numeric characteristic into bins for later use in scoring modeling | binning_by() |
summaries | summary metrics to evaluate the performance of a binomial classification model | summary.optimal_bins() |
visualize | generates plots to understand distribution, bad rate, and weight of evidence after running binning_by() | plot.optimal_bins() |
infogain binning | categorizes a numeric characteristic into bins for multi-class variables using recursive information gain ratio maximization | binning_rgr() |
visualize | generates plots to understand distribution and distribution by target variable after running binning_rgr() | plot.infogain_bins() |
evaluate | calculates metrics to evaluate the performance of a binned variable for a binomial classification model | performance_bin() |
summaries | summary metrics to evaluate the performance of a binomial classification model after performance_bin() | summary.performance_bin() |
visualize | generates plots to understand frequency and WoE by bins using performance_bin after running binning_by() | plot.performance_bin() |
extract | extract bins from "bins" and "optimal_bins" objects | extract.bins() |
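
A short binning sketch, assuming iris$Sepal.Length (base R) as the numeric input and four quantile-based bins; both choices are illustrative and not from the original text.

```r
library(dlookr)

# Split Sepal.Length into bins at its quantiles
bin <- binning(iris$Sepal.Length, nbins = 4, type = "quantile")

# Frequency and relative frequency of each bin
summary(bin)

# The binned variable has one value per input observation
length(bin)
```

Other values of type (for example "equal") change how the break points are chosen; see the binning() help page for the full list supported by your version.
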

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
diagnosis | calculates metrics to evaluate the performance of a binned variable for a binomial classification model | performance_bin() |
summaries | summary method for "performance_bin". Summary metrics to evaluate the performance of the binomial classification model | summary.performance_bin() |
visualize | visualize frequency, WoE by bins, and related information using performance_bin | plot.performance_bin() |

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
transformation | performs variable transformation for standardization and resolving skewness of numerical variables | transform() |
summaries | compares the distribution of data before and after data transformation | summary.transform() |
visualize | visualize two kinds of plot by attribute of the 'transform' class. The transformation of a numerical variable is shown as a density plot | plot.transform() |
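
A minimal transformation sketch, assuming airquality$Wind (base R) as the input and "zscore" standardization as the method; both are illustrative choices. The function is called through the dlookr namespace so that base R's transform() is not picked up by mistake.

```r
library(dlookr)

# Standardize Wind to zero mean and unit standard deviation
wind_z <- dlookr::transform(airquality$Wind, method = "zscore")

# Compare the distribution before and after the transformation
summary(wind_z)

# Plain numeric values for further computation
wind_num <- as.numeric(wind_z)
mean(wind_num)
```
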

Types | Descriptions | Functions | Support DBI
:-----|:--------|:---|:---:
report the information of transformation into a PDF file | report the information of transformation | transformation_report() |
report the information of transformation into an HTML file | report the information of transformation | transformation_report() |
report the information of transformation into an HTML file | dynamic report of the information of transformation | transformation_web_report() |
report the information of transformation into PDF and HTML files | paged report of the information of transformation | transformation_paged_report() |

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
statistics | calculate the entropy | entropy() |
statistics | calculate the skewness of the data | skewness() |
statistics | calculate the kurtosis of the data | kurtosis() |
statistics | calculate the Jensen-Shannon divergence between two probability distributions | jsd() |
statistics | calculate the Kullback-Leibler divergence between two probability distributions | kld() |
statistics | calculate the Cramer's V statistic between two categorical (discrete) variables | cramer() |
statistics | calculate the Theil's U statistic between two categorical (discrete) variables | theil() |
statistics | find percentiles of a numerical variable | get_percentile() |
statistics | transform a numeric vector using several methods like "log", "sqrt", "log+1", "log+a", "1/x", "x^2", "x^3", "Box-Cox", "Yeo-Johnson" | get_transform() |
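
The shape statistics above can be tried on simulated data. The sketch below assumes an exponential sample as a right-skewed example; the data and seed are illustrative and not from the original text.

```r
library(dlookr)

set.seed(1)
# Simulated right-skewed data (assumed example)
x <- rexp(1000)

# Skewness is positive for a right-skewed sample
sk <- skewness(x)
print(sk)

# Kurtosis of the same sample
print(kurtosis(x))

# A symmetric sample has zero skewness
sk_sym <- skewness(c(1, 2, 3, 4, 5))
print(sk_sym)
```
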

Types | Descriptions | Functions | Support DBI
:---|:-------|:---|:---:
programming | extracts variable information having a certain class from an object inheriting data.frame | find_class() |
programming | gets class of variables in data.frame or tbl_df | get_class() |
programming | retrieves the column information of the DBMS table through the tbl_dbi object of dplyr | get_column_info() |
programming | finds the user machine's OS | get_os() |
programming | import Google fonts | import_google_font() |