knitr::opts_chunk$set(collapse = TRUE, comment = "", out.width = "600px", dpi = 70)
options(tibble.print_min = 4L, tibble.print_max = 4L, crayon.enabled = FALSE)
library(dlookr)
library(dplyr)
library(ggplot2)
After you have acquired the data, you should do the following:
The dlookr package makes these steps fast and easy:
This document introduces data transformation methods provided by the dlookr package. You will learn how to transform data.frame objects, and tbl_df objects that inherit from data.frame, with functions provided by dlookr.
dlookr works well together with dplyr. In data transformation and data wrangling in particular, it makes the tidyverse family of packages more efficient to use.
To illustrate the primary use of data transformation in the dlookr package, I use the Carseats dataset.
Carseats, from the ISLR package, is a simulated dataset of children's car seat sales at 400 stores. It is a data.frame created for the purpose of predicting sales volume.
str(Carseats)
The contents of individual variables are as follows. (Refer to ISLR::Carseats Man page)
Data analysis often has to deal with missing values. However, Carseats is complete data with no missing values. Therefore, missing values are generated as follows, in a data.frame object named carseats.
carseats <- Carseats

suppressWarnings(RNGversion("3.5.0"))
set.seed(123)
carseats[sample(seq(NROW(carseats)), 20), "Income"] <- NA

suppressWarnings(RNGversion("3.5.0"))
set.seed(456)
carseats[sample(seq(NROW(carseats)), 10), "Urban"] <- NA
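As a quick sanity check on this kind of step, the following base R sketch applies the same injection pattern to a small synthetic data frame and counts the missing values per column. The `toy` object and its values are made up for illustration and are not part of Carseats.

```r
# Synthetic stand-in for carseats; the values are illustrative only
set.seed(123)
toy <- data.frame(
  Income = round(runif(400, min = 20, max = 120)),
  Urban  = factor(sample(c("Yes", "No"), 400, replace = TRUE))
)

# Same pattern as in the text: blank out randomly chosen rows
toy[sample(seq(NROW(toy)), 20), "Income"] <- NA
toy[sample(seq(NROW(toy)), 10), "Urban"] <- NA

# Number of missing values per column
na_count <- colSums(is.na(toy))
na_count
```

Because `sample()` draws row indices without replacement by default, each column ends up with exactly as many missing values as rows were requested.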
dlookr imputes missing values and outliers and resolves skewed data. It also provides the ability to bin continuous variables as categorical variables.
Here is a list of the data transformation functions provided by dlookr:
- find_na() finds the variables that contain missing values, and imputate_na() imputes the missing values.
- find_outliers() finds the variables that contain outliers, and imputate_outlier() imputes the outliers.
- summary.imputation() and plot.imputation() provide information and visualization of the imputed variables.
- find_skewness() finds the variables with skewed data, and transform() resolves the skewness.
- transform() also performs standardization of numeric variables.
- summary.transform() and plot.transform() provide information and visualization of transformed variables.
- binning() and binning_by() convert numerical data into categorical data.
- print.bins() and summary.bins() show and summarize the binning results.
- plot.bins() and plot.optimal_bins() provide visualization of the binning result.
- transformation_report() performs the data transformation and reports the result.

imputate_na()
imputate_na() imputes the missing values contained in a variable. Both numeric and categorical variables are supported, with the following methods.
In the following example, imputate_na() imputes the missing values of Income, a numeric variable of carseats, using the "rpart" method. summary() summarizes the missing value imputation, and plot() visualizes it.
if (requireNamespace("rpart", quietly = TRUE)) {
  income <- imputate_na(carseats, Income, US, method = "rpart")

  # result of imputation
  income

  # summary of imputation
  summary(income)

  # viz of imputation
  plot(income)
} else {
  cat("If you want to use this feature, you need to install the rpart package.\n")
}
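To give a feel for what tree-based imputation does, here is a minimal sketch that calls the rpart package directly on the built-in `airquality` data. It is an illustration of the general idea only; the internals of imputate_na() are not shown in this document and may differ, and the choice of `airquality` and of all other columns as predictors is an assumption made for the example.

```r
library(rpart)

# airquality ships with R and has missing values in Ozone
dat  <- airquality
miss <- is.na(dat$Ozone)

# Fit a regression tree on the rows where Ozone is observed
fit <- rpart(Ozone ~ ., data = dat[!miss, ], method = "anova")

# Replace the missing values with the tree's predictions
dat$Ozone[miss] <- predict(fit, newdata = dat[miss, ])

# No missing values should remain in the imputed column
anyNA(dat$Ozone)
```

The model is fitted only on observed rows, so the imputed values are predictions from the other variables rather than a single constant such as the mean.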
The following imputes the categorical variable Urban by the "mice" method.
The "mice" method requires the mice and ranger packages. If you want to use this feature, you need to install both.
library(mice)

urban <- imputate_na(carseats, Urban, US, method = "mice")

# result of imputation
urban

# summary of imputation
summary(urban)

# viz of imputation
plot(urban)
The following example imputes the missing values of the Income variable and then calculates the arithmetic mean for each level of US. Because dplyr is used, the logic is easy to follow as a pipeline.
# The mean before and after the imputation of the Income variable
carseats %>%
  mutate(Income_imp = imputate_na(carseats, Income, US, method = "knn")) %>%
  group_by(US) %>%
  summarise(orig = mean(Income, na.rm = TRUE),
            imputation = mean(Income_imp))
imputate_outlier()
imputate_outlier() imputes outliers. Only numeric variables are supported, with the following methods.
imputate_outlier() imputes the outliers of the numeric variable Price with the "capping" method, as follows. summary() summarizes the outlier imputation, and plot() visualizes it.
price <- imputate_outlier(carseats, Price, method = "capping")

# result of imputation
price

# summary of imputation
summary(price)

# viz of imputation
plot(price)
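The idea behind capping can be sketched in base R. The helper below, `cap_outliers()`, is hypothetical and not part of dlookr. It assumes that outliers are values beyond 1.5 times the IQR from the quartiles, and that they are replaced by the 5th and 95th percentiles; the exact rule used by imputate_outlier() should be checked in the package documentation.

```r
# Hypothetical helper: cap values outside the 1.5 * IQR fences
# at the 5th / 95th percentiles (assumed rule, for illustration)
cap_outliers <- function(x, coef = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr   <- q[[2]] - q[[1]]
  lower <- q[[1]] - coef * iqr
  upper <- q[[2]] + coef * iqr
  caps  <- quantile(x, c(0.05, 0.95), na.rm = TRUE)

  x[!is.na(x) & x < lower] <- caps[[1]]
  x[!is.na(x) & x > upper] <- caps[[2]]
  x
}

x      <- c(10, 12, 11, 13, 12, 11, 10, 14, 13, 100)
capped <- cap_outliers(x)

# The extreme value is pulled in; the other values are unchanged
rbind(original = x, capped = capped)
```

Capping keeps every observation in the data while limiting the influence of extreme values, which is why the group means in a before/after comparison can shift only slightly.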
The following example imputes the outliers of the Price variable and then calculates the arithmetic mean for each level of US. Because dplyr is used, the logic is easy to follow as a pipeline.
# The mean before and after the imputation of the Price variable
carseats %>%
  mutate(Price_imp = imputate_outlier(carseats, Price, method = "capping")) %>%
  group_by(US) %>%
  summarise(orig = mean(Price, na.rm = TRUE),
            imputation = mean(Price_imp, na.rm = TRUE))
transform()
transform() performs data transformation. Only numeric variables are supported, and the following methods are provided.
transform()
Use the methods "zscore" and "minmax" to perform standardization.
carseats %>%
  mutate(Income_minmax = transform(carseats$Income, method = "minmax"),
         Sales_minmax = transform(carseats$Sales, method = "minmax")) %>%
  select(Income_minmax, Sales_minmax) %>%
  boxplot()
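For reference, the two standardizations can be written by hand in a few lines of base R. The `zscore()` and `minmax()` helpers below are illustrative names using the textbook definitions; they are not dlookr functions, and dlookr's handling of edge cases may differ.

```r
# z-score: center on the mean, scale by the standard deviation
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# min-max: rescale linearly so the values span [0, 1]
minmax <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(2, 4, 4, 6, 9, NA)
z <- zscore(x)
m <- minmax(x)

# Missing values stay missing in both results
data.frame(x = x, zscore = z, minmax = m)
```

Both versions keep the shape of the distribution; only the location and scale change, which makes variables measured in different units comparable in a single boxplot.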
transform()
find_skewness() searches for variables with skewed data. It finds the variables that meet the search conditions and calculates their skewness.
# find index of skewed variables
find_skewness(carseats)

# find names of skewed variables
find_skewness(carseats, index = FALSE)

# compute the skewness
find_skewness(carseats, value = TRUE)

# compute the skewness & filtering with threshold
find_skewness(carseats, value = TRUE, thres = 0.1)
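To make the skewness statistic more concrete, the sketch below computes the moment-based coefficient of skewness in base R. The `skew()` helper is a hypothetical name, and the estimator used by find_skewness() may differ in detail (for example in bias correction), so treat the numbers as indicative only.

```r
# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 3/2
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)
right_tailed <- c(1, 1, 2, 2, 3, 10)

skew(symmetric)     # zero for a symmetric sample
skew(right_tailed)  # positive: the long tail is on the right
```

A positive value indicates a longer right tail and a negative value a longer left tail.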
The skewness of Advertising is 0.637. A positive value means the distribution is right-skewed: most values sit on the left, with a longer tail to the right. To bring it closer to a normal distribution, use transform() with the "log" method as follows.
summary() summarizes transformation information, and plot() visualizes transformation information.
Advertising_log <- transform(carseats$Advertising, method = "log")

# result of transformation
head(Advertising_log)

# summary of transformation
summary(Advertising_log)

# viz of transformation
plot(Advertising_log)
The raw data seems to contain 0, because the log-transformed values include -Inf. So this time, use the "log+1" method.
Advertising_log <- transform(carseats$Advertising, method = "log+1")

# result of transformation
head(Advertising_log)

# summary of transformation
summary(Advertising_log)

# viz of transformation
# plot(Advertising_log)
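The difference between the two methods is easy to see on a tiny vector in base R. This sketch assumes that "log+1" means taking the logarithm of the value plus one, as the method name suggests.

```r
x <- c(0, 1, 5, 10)

# Plain log: the zero maps to -Inf
plain <- log(x)

# Shifted log: the zero maps to 0 and every value stays finite
shifted <- log(x + 1)

data.frame(x = x, log = plain, log_plus_1 = shifted)
```

The shift matters for any non-negative variable that contains zeros, since a single -Inf can distort summaries and plots of the transformed values.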
binning()
binning() transforms a numeric variable into a categorical variable by binning it. The following types of binning are supported.
Here are some examples of binning Income using binning().
# Binning the Income variable. The default type argument is "quantile"
bin <- binning(carseats$Income)

# Print bins class object
bin

# Summarize bins class object
summary(bin)

# Plot bins class object
plot(bin)

# Using labels argument
bin <- binning(carseats$Income, nbins = 4,
               labels = c("LQ1", "UQ1", "LQ3", "UQ3"))
bin

# Using another type argument
binning(carseats$Income, nbins = 5, type = "equal")
binning(carseats$Income, nbins = 5, type = "pretty")

if (requireNamespace("classInt", quietly = TRUE)) {
  binning(carseats$Income, nbins = 5, type = "kmeans")
  binning(carseats$Income, nbins = 5, type = "bclust")
} else {
  cat("If you want to use this feature, you need to install the classInt package.\n")
}

# Extract the binned results
extract(bin)

# -------------------------
# Using pipes & dplyr
# -------------------------
library(dplyr)

carseats %>%
  mutate(Income_bin = binning(carseats$Income) %>% extract()) %>%
  group_by(ShelveLoc, Income_bin) %>%
  summarise(freq = n()) %>%
  arrange(desc(freq)) %>%
  head(10)
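To see how the "quantile" and "equal" types differ in principle, the sketch below reproduces both ideas with base R's cut() on made-up, right-skewed data. It is only an approximation of the concepts; binning() may choose breaks and labels differently.

```r
# Made-up right-skewed data: 90 small values and 10 large ones
x <- c(1:90, seq(100, 1000, by = 100))

# Quantile binning: breaks at the quartiles, so counts are balanced
q_breaks <- quantile(x, probs = seq(0, 1, length.out = 5))
q_bins   <- cut(x, breaks = q_breaks, include.lowest = TRUE)
q_counts <- table(q_bins)

# Equal-width binning: 4 intervals of equal length over the range
e_bins   <- cut(x, breaks = 4)
e_counts <- table(e_bins)

q_counts
e_counts
```

With skewed data, quantile bins hold similar numbers of observations, while equal-width bins put most observations into the first interval.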
binning_by()
binning_by() transforms a numeric variable into a categorical variable by optimal binning. This method is often used when developing a scorecard model.
The following binning_by() example optimally bins Advertising, taking the binary target variable US into account.
library(dplyr)

if (requireNamespace("partykit", quietly = TRUE)) {
  # optimal binning using character
  bin <- binning_by(carseats, "US", "Advertising")

  # optimal binning using name
  bin <- binning_by(carseats, US, Advertising)
  bin

  # summary optimal_bins class
  summary(bin)

  # performance table
  attr(bin, "performance")

  # visualize optimal_bins class
  plot(bin)

  # extract binned results
  extract(bin) %>%
    head(20)
} else {
  cat("If you want to use this feature, you need to install the partykit package.\n")
}
dlookr provides two automated data transformation reports:
transformation_web_report()
transformation_web_report() creates a dynamic report for a data.frame or an object that inherits from data.frame (tbl_df, tbl, etc.).
The contents of the report are as follows:
transformation_web_report() generates various reports with the following arguments.
The following script creates a data transformation report for the tbl_df class object, heartfailure.
heartfailure %>%
  transformation_web_report(target = "death_event", subtitle = "heartfailure",
                            output_dir = "./", output_file = "transformation.html",
                            theme = "blue")
knitr::include_graphics('img/transformation_web_title.jpg')
transformation_paged_report()
transformation_paged_report() creates a static report for a data.frame or an object that inherits from data.frame (tbl_df, tbl, etc.).
The contents of the report are as follows:
transformation_paged_report() generates various reports with the following arguments.
The following script creates a data transformation report for the data.frame class object, heartfailure.
heartfailure %>%
  transformation_paged_report(target = "death_event", subtitle = "heartfailure",
                              output_dir = "./", output_file = "transformation.pdf",
                              theme = "blue")
knitr::include_graphics('img/transformation_paged_cover.jpg')
knitr::include_graphics('img/transformation_paged_content.jpg')