In choonghyunryu/dlookr: Tools for Data Diagnosis, Exploration, Transformation

knitr::opts_chunk$set(collapse = TRUE, comment = "", out.width = "600px", dpi = 70, collapse = TRUE)
options(tibble.print_min = 4L, tibble.print_max = 4L, crayon.enabed = FALSE)

library(dlookr)
library(dplyr)
library(ggplot2)

Preface

After you have acquired the data, you should do the following:

Diagnose data quality.
- If there is a problem with data quality,
- The data must be corrected or re-acquired.
Explore data to understand the data and find scenarios for performing the analysis.
Derive new variables or perform variable transformations.

The dlookr package makes these steps fast and easy:

Performs a data diagnosis or automatically generates a data diagnosis report.
Discover data in various ways and automatically generate EDA(exploratory data analysis) reports.
Impute missing values and outliers, resolve skewed data, and categorize continuous variables into categorical variables. And generates an automated report to support it.

This document introduces data transformation methods provided by the dlookr package. You will learn how to transform tbl_df data that inherits from data.frame and data.frame with functions provided by dlookr

dlookr increases synergy with dplyr. Particularly in data transformation and data wrangle, it increases the efficiency of the tidyverse package group.

datasets

To illustrate the primary use of data transformation in the dlookr package, I use a Carseats dataset. Carseats in the ISLR package is simulation dataset that sells children's car seats at 400 stores. This data is a data.frame created for the purpose of predicting sales volume.

str(Carseats)

The contents of individual variables are as follows. (Refer to ISLR::Carseats Man page)

Sales
- Unit sales (in thousands) at each location
CompPrice
- Price charged by a competitor at each location
Income
- Community income level (in thousands of dollars)
Advertising
- Local advertising budget for company at each location (in thousands of dollars)
Population
- Population size in region (in thousands)
Price
- Price company charges for car seats at each site
ShelveLoc
- A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age
- Average age of the local population
Education
- Education level at each location
Urban
- A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US
- A factor with levels No and Yes to indicate whether the store is in the US or not

When data analysis is performed, data containing missing values is often encountered. However, Carseats is complete data without missing. Therefore, the missing values are generated as follows. And I created a data.frame object named carseat.

carseats <- Carseats

suppressWarnings(RNGversion("3.5.0"))
set.seed(123)
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA

suppressWarnings(RNGversion("3.5.0"))
set.seed(456)
carseats[sample(seq(NROW(carseats)), 10), "Urban"] <- NA

Data Transformation

dlookr imputes missing values and outliers and resolves skewed data. It also provides the ability to bin continuous variables as categorical variables.

Here is a list of the data conversion functions and functions provided by dlookr:

find_na() finds a variable that contains the missing values variable, and imputate_na() imputes the missing values.
find_outliers() finds a variable that contains the outliers, and imputate_outlier() imputes the outlier.
summary.imputation() and plot.imputation() provide information and visualization of the imputed variables.
find_skewness() finds the variables of the skewed data, and transform() resolves the skewed data.
transform() also performs standardization of numeric variables.
summary.transform() and plot.transform() provide information and visualization of transformed variables.
binning() and binning_by() convert binational data into categorical data.
print.bins() and summary.bins() show and summarize the binning results.
plot.bins() and plot.optimal_bins() provide visualization of the binning result.
transformation_report() performs the data transform and reports the result.

Imputation of missing values

imputes the missing value with `imputate_na()`

imputate_na() imputes the missing value contained in the variable. The predictor with missing values supports numeric and categorical variables and the following method.

predictor is numerical variable
- "mean": arithmetic mean
- "median": median
- "mode": mode
- "knn": K-nearest neighbors
  - target variable must be specified
- "rpart": Recursive Partitioning and Regression Trees
  - target variable must be specified
- "mice": Multivariate Imputation by Chained Equations
  - target variable must be specified
  - random seed must be set
predictor is categorical variable
- "mode": mode
- "rpart": Recursive Partitioning and Regression Trees
  - target variable must be specified
- "mice": Multivariate Imputation by Chained Equations
  - target variable must be specified
  - random seed must be set

In the following example, imputate_na() imputes the missing value of Income, a numeric variable of carseats, using the "rpart" method. summary() summarizes missing value imputation information, and plot() visualizes missing information.

if (requireNamespace("rpart", quietly = TRUE)) {
  income <- imputate_na(carseats, Income, US, method = "rpart")

  # result of imputation
  income

  # summary of imputation
  summary(income)

  # viz of imputation
  plot(income)
} else {
  cat("If you want to use this feature, you need to install the rpart package.\n")
}

The following imputes the categorical variable urban by the "mice" method.

The "mice" method must require the mice, ranger package. If you want to use this feature, you need to install the mice, and ranger package.

library(mice)

urban <- imputate_na(carseats, Urban, US, method = "mice")

# result of imputation
urban

# summary of imputation
summary(urban)

# viz of imputation
plot(urban)

Collaboration with dplyr

The following example imputes the missing value of the Income variable and then calculates the arithmetic mean for each level of US. In this case, dplyr is used and is easily interpreted logically using pipes.

# The mean before and after the imputation of the Income variable
carseats %>%
  mutate(Income_imp = imputate_na(carseats, Income, US, method = "knn")) %>%
  group_by(US) %>%
  summarise(orig = mean(Income, na.rm = TRUE),
            imputation = mean(Income_imp))

Impute outliers

imputes the outliers with `imputate_outlier()`

imputate_outlier() imputes the outlier value. The predictor with outliers supports only numeric variables and supports the following methods.

predictor is numerical variable
- "mean": arithmetic mean
- "median": median
- "mode": mode
- "capping": Impute the upper outliers with the 95 percentile and the bottom outliers with the fifth percentile.

imputate_outlier() imputes the outliers with the numeric variable Price as the "capping" method, as follows. summary() summarizes outliers imputation information, and plot() visualizes imputation information.

price <- imputate_outlier(carseats, Price, method = "capping")

# result of imputation
price

# summary of imputation
summary(price)

# viz of imputation
plot(price)

Collaboration with dplyr

The following example imputes the outliers of the Price variable and then calculates the arithmetic mean for each level of US. In this case, dplyr is used and is easily interpreted logically using pipes.

# The mean before and after the imputation of the Price variable
carseats %>%
  mutate(Price_imp = imputate_outlier(carseats, Price, method = "capping")) %>%
  group_by(US) %>%
  summarise(orig = mean(Price, na.rm = TRUE),
    imputation = mean(Price_imp, na.rm = TRUE))

Standardization and Resolving Skewness

Introduction to the use of `transform()`

transform() performs data transformation. Only numeric variables are supported, and the following methods are provided.

Standardization
- "zscore": z-score transformation. (x - mu) / sigma
- "minmax": minmax transformation. (x - min) / (max - min)
Resolving Skewness
- "log": log transformation. log(x)
- "log+1": log transformation. log(x + 1). Used for values that contain 0.
- "sqrt": square root transformation.
- "1/x": 1 / x transformation
- "x^2": x square transformation
- "x^3": x^3 square transformation

Standardization with `transform()`

Use the methods "zscore" and "minmax" to perform standardization.

carseats %>% 
  mutate(Income_minmax = transform(carseats$Income, method = "minmax"),
    Sales_minmax = transform(carseats$Sales, method = "minmax")) %>% 
  select(Income_minmax, Sales_minmax) %>% 
  boxplot()

Resolving Skewness data with `transform()`

find_skewness() searches for variables with skewed data. This function finds data skewed by search conditions and calculates skewness.

# find index of skewed variables
find_skewness(carseats)

# find names of skewed variables
find_skewness(carseats, index = FALSE)

# compute the skewness
find_skewness(carseats, value = TRUE)

# compute the skewness & filtering with threshold
find_skewness(carseats, value = TRUE, thres = 0.1)

The skewness of Advertising is 0.637. This means that the distribution of data is somewhat inclined to the left. So, for normal distribution, use transform() to convert to the "log" method as follows. summary() summarizes transformation information, and plot() visualizes transformation information.

Advertising_log <- transform(carseats$Advertising, method = "log")

# result of transformation
head(Advertising_log)
# summary of transformation
summary(Advertising_log)
# viz of transformation
plot(Advertising_log)

The raw data seems to contain 0, as there is a -Inf in the log converted value. So this time, convert it to "log+1".

Advertising_log <- transform(carseats$Advertising, method = "log+1")

# result of transformation
head(Advertising_log)
# summary of transformation
summary(Advertising_log)
# viz of transformation
# plot(Advertising_log)

Binning

Binning of individual variables using `binning()`

binning() transforms a numeric variable into a categorical variable by binning it. The following types of binning are supported.

"quantile": categorize using quantile to include the same frequencies
"equal": categorize to have equal length segments
"pretty": categorized into moderately good segments
"kmeans": categorization using K-means clustering
"bclust": categorization using bagged clustering technique

Here are some examples of bin Income using binning().

# Binning the carat variable. the default type argument is "quantile"
bin <- binning(carseats$Income)
# Print bins class object
bin
# Summarize bins class object
summary(bin)
# Plot bins class object
plot(bin)
# Using labels argument
bin <- binning(carseats$Income, nbins = 4,
              labels = c("LQ1", "UQ1", "LQ3", "UQ3"))
bin
# Using another type argument
binning(carseats$Income, nbins = 5, type = "equal")
binning(carseats$Income, nbins = 5, type = "pretty")

if (requireNamespace("classInt", quietly = TRUE)) {
  binning(carseats$Income, nbins = 5, type = "kmeans")
  binning(carseats$Income, nbins = 5, type = "bclust")
} else {
  cat("If you want to use this feature, you need to install the classInt package.\n")
}

# Extract the binned results
extract(bin)

# -------------------------
# Using pipes & dplyr
# -------------------------
library(dplyr)

carseats %>%
 mutate(Income_bin = binning(carseats$Income) %>% 
                     extract()) %>%
 group_by(ShelveLoc, Income_bin) %>%
 summarise(freq = n()) %>%
 arrange(desc(freq)) %>%
 head(10)

Optimal Binning with `binning_by()`

binning_by() transforms a numeric variable into a categorical variable by optimal binning. This method is often used when developing a scorecard model.

The following binning_by() example optimally binning Advertising considering the target variable US with a binary class.

library(dplyr)

if (requireNamespace("partykit", quietly = TRUE)) {
  # optimal binning using character
  bin <- binning_by(carseats, "US", "Advertising")

  # optimal binning using name
  bin <- binning_by(carseats, US, Advertising)
  bin

  # summary optimal_bins class
  summary(bin)

  # performance table
  attr(bin, "performance")

  # visualize optimal_bins class
  plot(bin)

  # extract binned results
  extract(bin) %>% 
    head(20)
} else {
  cat("If you want to use this feature, you need to install the partykit package.\n")
}

Automated report

dlookr provides two automated data transformation reports:

Web page-based dynamic reports can perform in-depth analysis through visualization and statistical tables.
Static reports generated as pdf files or html files can be archived as output of data analysis.

Create a dynamic report using `transformation_web_report()`

transformation_web_report() creates a dynamic report for objects inherited from data.frame(tbl_df, tbl, etc) or data.frame.

Contents of dynamic web report

The contents of the report are as follows.:

Overview
- Data Structures
- Data Types
- Job Information
Imputation
- Missing Values
- Outliers
Resolving Skewness
Binning
Optimal Binning

Some arguments for dynamic web report

transformation_web_report() generates various reports with the following arguments.

target
- target variable
output_file
- name of the generated file.
output_dir
- name of the directory to generate report file.
title
- title of the report.
subtitle
- subtitle of the report.
author
- author of the report.
title_color
- color of title.
logo_img
- The name of the logo image file is on the top left.
create_date
- The date on which the report is generated.
theme
- name of the theme for the report. Support "orange" and "blue".
sample_percent
- Sample percent of data for performing data transformation.

The following script creates a data transformation report for the tbl_df class object, heartfailure.

heartfailure %>%
  transformation_web_report(target = "death_event", subtitle = "heartfailure",
                            output_dir = "./", output_file = "transformation.html", 
                            theme = "blue")

Screenshot of dynamic report

The dynamic contents of the report are shown in the following figure.:

knitr::include_graphics('img/transformation_web_title.jpg')

Create a static report using `transformation_paged_report()`

transformation_paged_report() create static report for object inherited from data.frame(tbl_df, tbl, etc) or data.frame.

Contents of static paged report

The contents of the report are as follows.:

Overview
- Data Structures
- Job Information
Imputation
- Missing Values
- Outliers
Resolving Skewness
Binning
Optimal Binning

Some arguments for static paged report

transformation_paged_report() generates various reports with the following arguments.

target
- target variable
output_format
- report output type. Choose either "pdf" or "html".
output_file
- name of the generated file.
output_dir
- name of the directory to generate the report file.
title
- title of the report.
subtitle
- subtitle of the report.
abstract_title
- abstract of the report
author
- author of the report.
title_color
- color of title.
subtitle_color
- color of subtitle.
logo_img
- The name of the logo image file is on the top left.
cover_img
- name of cover image file on center.
create_date
- The date on which the report is generated.
theme
- name of the theme for the report. Support "orange" and "blue".
sample_percent
- Sample percent of data for performing data transformation.

The following script creates a data transformation report for the data.frame class object, heartfailure.

heartfailure %>%
  transformation_paged_report(target = "death_event", subtitle = "heartfailure",
                              output_dir = "./", output_file = "transformation.pdf", 
                              theme = "blue")

Screenshot of static report

The cover of the report is shown in the following figure.:

knitr::include_graphics('img/transformation_paged_cover.jpg')

The contents of the report are shown in the following figure.:

knitr::include_graphics('img/transformation_paged_content.jpg')

choonghyunryu/dlookr documentation built on June 11, 2024, 9:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

choonghyunryu/dlookr
Tools for Data Diagnosis, Exploration, Transformation

In choonghyunryu/dlookr: Tools for Data Diagnosis, Exploration, Transformation

Preface

datasets

Data Transformation

Imputation of missing values

imputes the missing value with `imputate_na()`

Collaboration with dplyr

Impute outliers

imputes the outliers with `imputate_outlier()`

Collaboration with dplyr

Standardization and Resolving Skewness

Introduction to the use of `transform()`

Standardization with `transform()`

Resolving Skewness data with `transform()`

Binning

Binning of individual variables using `binning()`

Optimal Binning with `binning_by()`

Automated report

Create a dynamic report using `transformation_web_report()`

Contents of dynamic web report

Some arguments for dynamic web report

Screenshot of dynamic report

Create a static report using `transformation_paged_report()`

Contents of static paged report

Some arguments for static paged report

Screenshot of static report

R Package Documentation

Browse R Packages

We want your feedback!

choonghyunryu/dlookr Tools for Data Diagnosis, Exploration, Transformation

In choonghyunryu/dlookr: Tools for Data Diagnosis, Exploration, Transformation

Preface

datasets

Data Transformation

Imputation of missing values

imputes the missing value with imputate_na()

Collaboration with dplyr

Impute outliers

imputes the outliers with imputate_outlier()

Collaboration with dplyr

Standardization and Resolving Skewness

Introduction to the use of transform()

Standardization with transform()

Resolving Skewness data with transform()

Binning

Binning of individual variables using binning()

Optimal Binning with binning_by()

Automated report

Create a dynamic report using transformation_web_report()

Contents of dynamic web report

Some arguments for dynamic web report

Screenshot of dynamic report

Create a static report using transformation_paged_report()

Contents of static paged report

Some arguments for static paged report

Screenshot of static report

R Package Documentation

Browse R Packages

We want your feedback!

choonghyunryu/dlookr
Tools for Data Diagnosis, Exploration, Transformation

imputes the missing value with `imputate_na()`

imputes the outliers with `imputate_outlier()`

Introduction to the use of `transform()`

Standardization with `transform()`

Resolving Skewness data with `transform()`

Binning of individual variables using `binning()`

Optimal Binning with `binning_by()`

Create a dynamic report using `transformation_web_report()`

Create a static report using `transformation_paged_report()`