Summarize and Explore the Data

library(rmarkdown)
library(SmartEDA)
library(knitr)
library(ISLR)
library(scales)
library(gridExtra)
library(ggplot2)

1. Introduction

This document introduces the SmartEDA package and shows how it can help you carry out exploratory data analysis (EDA).

SmartEDA includes multiple custom functions to perform initial exploratory analysis on any input data describing the structure and the relationships present in the data. The generated output can be obtained in both summary and graphical form. The graphical form or charts can also be exported as reports.

सर्वस्य लोचनं शास्त्रं
Science is the only eye

अनेकसंशयोच्छेदि, परोक्षार्थस्य दर्शक|
सर्वस्य लोचनं शास्त्रं, यस्य नास्त्यन्ध एव सः ||

It dispels many doubts and foresees what is not obvious |
Science is the eye of everyone; one who hasn't got it is like a blind person ||

The SmartEDA package helps you build a solid understanding of your data. Its capabilities and functionalities are listed below:

  1. SmartEDA lets you apply different types of EDA without having to

    • remember the different R package names
    • write lengthy R scripts
    • spend manual effort preparing the EDA report
  2. There is no need to categorize the variables into Character, Numeric, Factor etc. SmartEDA functions automatically categorize all the features into the right data type (Character, Numeric, Factor etc.) based on the input data.

  3. ggplot2 functions are used for graphical presentation of data

  4. rmarkdown and knitr functions are used to build HTML reports

To summarize, the SmartEDA package gives you a complete exploratory data analysis just by running a function instead of writing lengthy R code.

Journal of Open Source Software Article

An article describing the SmartEDA package for exploratory data analysis has been published on arXiv and in the Journal of Open Source Software (JOSS). Please cite the paper if you use SmartEDA in your work!

2. Data

In this vignette, we will be using a simulated data set containing sales of child car seats at 400 different stores.

Data source: ISLR package.

Install the package "ISLR" to get the example data set.

#install.packages("ISLR")
library("ISLR")
#install.packages("SmartEDA")
library("SmartEDA")
## Load sample dataset from ISLR package
Carseats <- ISLR::Carseats

2.1 Overview of the data

This step covers the dimensions of the dataset, the variable names, an overall summary of missing values and the data type of each variable.

# Overview of the data - Type = 1
ExpData(data=Carseats,type=1)

# Structure of the data - Type = 2
ExpData(data=Carseats,type=2)
ovw_tabl <- ExpData(data=Carseats,type=1)
ovw_tab2 <- ExpData(data=Carseats,type=2)
kable(ovw_tabl, "html")
kable(ovw_tab2, "html")
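
For readers who want to see what such an overview contains, here is a minimal base-R sketch. It is not SmartEDA's implementation: the helper name `overview` is hypothetical, and the built-in `airquality` data set is used (instead of Carseats) only so the example runs without extra packages.

```r
# Hypothetical helper (not part of SmartEDA): per-variable overview
# with data type, number of missing values and number of distinct values.
overview <- function(data) {
  data.frame(
    Variable  = names(data),
    Type      = vapply(data, function(col) class(col)[1], character(1)),
    Missing   = vapply(data, function(col) sum(is.na(col)), integer(1)),
    Unique    = vapply(data, function(col) length(unique(col)), integer(1)),
    row.names = NULL
  )
}

dim(airquality)               # number of rows and columns
ov <- overview(airquality)    # one row per variable
ov
```

The ExpData output above reports comparable information in a single call.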

2.2 Add summary statistics into Metadata output

ovw_tabl_du <- ExpData(data=Carseats,type=2, fun = c("mean", "median", "var"))
# Metadata Information with additional statistics like mean, median and variance
ExpData(data=Carseats,type=2, fun = c("mean", "median", "var"))
kable(ovw_tabl_du, "html")
# Derive quantiles: user-defined functions for the 10th and 90th percentiles
quantile_10 <- function(x) {
  quantile(x, probs = 0.1, na.rm = TRUE)
}

quantile_90 <- function(x) {
  quantile(x, probs = 0.9, na.rm = TRUE)
}

output_e1 <- ExpData(data=Carseats, type=2, fun=c("quantile_10", "quantile_90"))
kable(output_e1, "html")
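
To make clear what such user-defined statistics compute, the following base-R sketch looks up each function by name and applies it to every numeric column. The helper `apply_stats` is hypothetical and the built-in `mtcars` data is an assumption made so the example is self-contained; how ExpData dispatches the function names internally may differ.

```r
# User-defined statistics, returning a plain number
quantile_10 <- function(x) unname(quantile(x, probs = 0.1, na.rm = TRUE))
quantile_90 <- function(x) unname(quantile(x, probs = 0.9, na.rm = TRUE))

# Hypothetical helper (not part of SmartEDA): apply functions, given by
# name, to every numeric column of a data frame.
apply_stats <- function(data, fun) {
  num_cols <- data[vapply(data, is.numeric, logical(1))]
  # Rows: variables, columns: statistics
  out <- sapply(fun, function(f) {
    vapply(num_cols, get(f, mode = "function"), numeric(1))
  })
  data.frame(Variable = rownames(out), out, row.names = NULL)
}

stats_tbl <- apply_stats(mtcars, c("quantile_10", "quantile_90"))
head(stats_tbl)
```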

3. Exploratory data analysis (EDA)

This section shows the EDA output for three different cases:

  1. Target variable is not defined
  2. Target variable is continuous
  3. Target variable is categorical

3.1 Example for case 1: Target variable is not defined

3.1.1 Summary of numerical variables

Summary of all numerical variables

ec1 = ExpNumStat(Carseats,by="A",gp=NULL,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2,Nlim=3)
rownames(ec1)<-NULL
ExpNumStat(Carseats,by="A",gp=NULL,Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2,Nlim=10)
paged_table(ec1)

Compute weighted summary statistics for numerical variables

carseat = ISLR::Carseats
## Compute random weight
carseat$wt = stats::runif( nrow(carseat), 0.5, 1.5 )
wt_summary = ExpNumStat(carseat,by="A",gp=NULL,round=2,Nlim=10, weight = "wt")
wt_summary[,c("Vname","TN","W_count","mean", "W_Mean", "SD","W_Sd")]
## With group by statement
wt_summary = ExpNumStat(carseat,by="GA",gp="ShelveLoc",round=2,Nlim=10, weight = "wt")
wt_summary[,c("Vname","Group","TN","W_count","mean", "W_Mean", "SD","W_Sd")]
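
As a rough illustration of the weighted columns, the sketch below computes a weighted mean and a weighted standard deviation in base R. The function `weighted_sd` is hypothetical, and normalising the variance by the sum of the weights is an assumption; SmartEDA's W_Sd may use a different normalisation. The built-in `mtcars` data stands in for Carseats.

```r
# Hypothetical helper: weighted standard deviation.
# Assumption: the weighted variance is normalised by the sum of weights.
weighted_sd <- function(x, w) {
  mu <- weighted.mean(x, w)
  sqrt(sum(w * (x - mu)^2) / sum(w))
}

set.seed(1)                          # reproducible random weights
x <- mtcars$mpg
w <- runif(length(x), 0.5, 1.5)      # same weight range as in the example above

c(mean   = mean(x),
  W_Mean = weighted.mean(x, w),
  SD     = sd(x),
  W_Sd   = weighted_sd(x, w))
```

With equal weights, the weighted mean reduces to the ordinary mean and this weighted SD reduces to the population (divide-by-n) standard deviation.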

3.1.2 Distributions of numerical variables

Graphical representation of all numeric features

# Note: variables with 10 or fewer unique values are excluded (nlim=10)
plot1 <- ExpNumViz(Carseats,target=NULL,nlim=10,Page=c(2,2),sample=4)
plot1[[1]]

3.1.3. Summary of categorical variables

et1 <- ExpCTable(Carseats,Target=NULL,margin=1,clim=10,nlim=5,round=2,bin=NULL,per=T)
rownames(et1)<-NULL
ExpCTable(Carseats,Target=NULL,margin=1,clim=10,nlim=3,round=2,bin=NULL,per=T)
kable(et1,"html")

NA is Not Applicable

3.1.4. Distributions of categorical variables

plot2 <- ExpCatViz(Carseats,target=NULL,col ="slateblue4",clim=10,margin=2,Page = c(2,2),sample=4)
plot2[[1]]

3.2 Example for case 2: Target variable is continuous

3.2.1. Target variable

Summary of continuous dependent variable

  1. Variable name - Price
  2. Variable description - Price company charges for car seats at each site
summary(Carseats[,"Price"])

3.2.2 Summary of numerical variables

Summary statistics when the dependent variable is the continuous variable Price.

cpp = ExpNumStat(Carseats,by="A",gp="Price",Qnt=seq(0,1,0.1),MesofShape=1,Outlier=TRUE,round=2)
rownames(cpp)<-NULL
ExpNumStat(Carseats,by="A",gp="Price",Qnt=seq(0,1,0.1),MesofShape=1,Outlier=TRUE,round=2)
paged_table(cpp)

If the target variable is continuous, the summary statistics include an additional correlation column giving the correlation between the target variable and each independent variable.
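
The correlation column can be reproduced in spirit with base R. This is a sketch only: it uses Pearson correlation via `cor`, and it assumes the built-in `mtcars` data with `mpg` as the continuous target so that it runs without the ISLR package.

```r
# Correlation of every other numeric variable with a continuous target.
target <- "mpg"   # assumed target column in the built-in mtcars data
num_cols <- names(mtcars)[vapply(mtcars, is.numeric, logical(1))]
predictors <- setdiff(num_cols, target)

cor_with_target <- vapply(
  predictors,
  function(v) cor(mtcars[[v]], mtcars[[target]], use = "complete.obs"),
  numeric(1)
)
round(sort(cor_with_target), 2)
```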

3.2.3 Distributions of numerical variables

Graphical representation of all numeric variables

Scatter plots between all numeric variables and the target variable Price. These plots help to examine how well the target variable is correlated with the independent variables.

Dependent variable is Price (continuous).

# Note: sample=8 means 8 plots are selected at random
# Note: nlim=4 means only numeric variables with more than 4 unique values are included
plot3 <- ExpNumViz(Carseats,target="Price",nlim=4,scatter=FALSE,fname=NULL,col="green",Page=c(2,2),sample=8)
plot3[[1]]
# Note: sample=4 means 4 plots are selected at random
# Note: nlim=4 means only numeric variables with more than 4 unique values are included
plot31 <- ExpNumViz(Carseats,target="US",nlim=4,scatter=TRUE,fname=NULL,Page=c(2,1),sample=4)
plot31[[1]]

3.2.4. Summary of categorical variables

Summary of categorical variables

et11 <- ExpCTable(Carseats,Target="Price",margin=1,clim=10,round=2,bin=4,per=F)
rownames(et11)<-NULL
## bin=4: the continuous target is discretized into 4 categories based on quantiles
ExpCTable(Carseats,Target="Price",margin=1,clim=10,round=2,bin=4,per=F)
paged_table(et11)

Compute weighted summary statistics for categorical variables

carseat = ISLR::Carseats
## Compute random weight
carseat$wt = stats::runif( nrow(carseat), 0.5, 1.5 )
wt_summary = ExpCTable(carseat,margin=1,clim=10,round=2,bin=4,per=F, weight = "wt")
wt_summary
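
The quantile-based discretization requested with bin=4 above can be sketched in base R with `quantile` and `cut`. The price values below are made up for illustration, and the exact break rules used by ExpCTable may differ.

```r
# Discretise a continuous variable into 4 quantile-based bins.
# Illustrative, made-up prices (not taken from Carseats).
price <- c(24, 49, 53, 77, 90, 97, 100, 104,
           110, 117, 120, 128, 131, 140, 159, 191)

# Five break points (0%, 25%, 50%, 75%, 100%) define four bins
breaks <- quantile(price, probs = seq(0, 1, length.out = 5), na.rm = TRUE)
price_bin <- cut(price, breaks = breaks, include.lowest = TRUE)

table(price_bin)   # number of observations per bin
```

The binned variable can then be cross-tabulated against any categorical variable with `table`.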

3.3 Example for case 3: Target variable is categorical

3.3.1. Summary of categorical dependent variable

  1. Variable name - Urban
  2. Variable description - Whether the store is in an urban or rural location
tab_tar <- data.frame(table(Carseats[,"Urban"]))
tab_tar$Descriptions <- "Store location"
names(tab_tar) <- c("Urban","Frequency","Descriptions")
rownames(tab_tar)<-NULL
kable(tab_tar, "html")

3.3.2 Summary of numerical variables

Summary of all numeric variables

snc = ExpNumStat(Carseats,by="GA",gp="Urban",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
rownames(snc)<-NULL
ExpNumStat(Carseats,by="GA",gp="Urban",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
paged_table(snc)

3.3.3 Distributions of Numerical variables

Boxplot for all the numeric attributes by each category of Urban

plot4 <- ExpNumViz(Carseats,target="Urban",type=1,nlim=3,fname=NULL,col=c("darkgreen","springgreen3","springgreen1"),Page=c(2,2),sample=8)
plot4[[1]]

3.3.4 Summary of categorical variables

et100 <- ExpCTable(Carseats,Target="Urban",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F)
rownames(et100)<-NULL

et4 <- ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=3,nlim=3,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2)
rownames(et4)<-NULL


et5 <- ExpCatStat(Carseats,Target="Urban",result = "IV",clim=10,nlim=5,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2)
rownames(et5)<-NULL
et5 <- et5[1:15,]

Cross tabulation with target variable

ExpCTable(Carseats,Target="Urban",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F)
kable(et100,"html")

Information Value

ExpCatStat(Carseats,Target="Urban",result = "IV",clim=10,nlim=5,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2)
kable(et5,"html")
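
For orientation, information value (IV) for a binary target is commonly defined as the sum over categories of (share of events minus share of non-events) times the weight of evidence, where the weight of evidence is the log of the ratio of the two shares. The sketch below implements that textbook definition in base R. The function `info_value` is hypothetical, the built-in `mtcars` data is an assumption, and SmartEDA's binning and handling of empty cells may differ.

```r
# Hypothetical helper: information value of a categorical predictor x
# for a binary target y, where `event` marks the class of interest.
# Assumption: every category has at least one event and one non-event;
# real implementations usually smooth or merge empty cells.
info_value <- function(x, y, event) {
  is_event <- factor(y == event, levels = c(FALSE, TRUE))
  tab <- table(x, is_event)                  # rows: categories
  pct_event    <- tab[, "TRUE"]  / sum(tab[, "TRUE"])
  pct_nonevent <- tab[, "FALSE"] / sum(tab[, "FALSE"])
  woe <- log(pct_event / pct_nonevent)       # weight of evidence per category
  sum((pct_event - pct_nonevent) * woe)
}

# How well does the cylinder count separate manual (am = 1) cars?
iv_cyl <- info_value(factor(mtcars$cyl), mtcars$am, event = 1)
iv_cyl
```

A predictor whose categories have identical event and non-event shares gets an IV of zero; larger values indicate stronger separation.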

Statistical test

et4 <- ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=10,nlim=5,bins=10,Pclass="Yes",plot=FALSE,top=20,Round=2)
kable(et4,"html")
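
A typical association test between a categorical predictor and a categorical target is Pearson's chi-squared test, often reported together with Cramér's V as an effect size. The base-R sketch below shows both on the built-in `mtcars` data; it is an illustration under that assumption, not a description of how ExpCatStat computes its columns.

```r
# Chi-squared test of independence and Cramér's V for two categorical variables.
tab <- table(mtcars$cyl, mtcars$am)          # 3 x 2 contingency table

# Small expected counts trigger a warning in this tiny data set
test <- suppressWarnings(chisq.test(tab))

n <- sum(tab)
cramers_v <- sqrt(unname(test$statistic) / (n * (min(dim(tab)) - 1)))

c(chi_squared = unname(test$statistic),
  df          = unname(test$parameter),
  p_value     = test$p.value,
  cramers_v   = cramers_v)
```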

Variable importance based on Information value

varimp <- ExpCatStat(Carseats,Target="Urban",result = "Stat",clim=10,nlim=5,bins=10,Pclass="Yes",plot=TRUE,top=10,Round=2)

3.3.5. Distributions of categorical variables

Stacked bar plot with vertical or horizontal bars for all categorical variables

plot5 <- ExpCatViz(Carseats,target="Urban",fname=NULL,clim=5,col=c("slateblue4","slateblue1"),margin=2,Page = c(2,1),sample=2)
plot5[[1]]

4. Quantile-quantile plot for numeric variables

Function definition:

ExpOutQQ (data,nlim=3,fname=NULL,Page=NULL,sample=NULL)
data    : Input dataframe or data.table
nlim    : numeric variable limit
fname   : output file name (Output will be in PDF format)
Page    : output pattern; if Page=c(3,2), it will generate 6 plots with 3 rows and 2 columns
sample  : random number of plots

Carseats data from ISLR package:

options(width = 150)
CData = ISLR::Carseats
qqp <- ExpOutQQ(CData,nlim=10,fname=NULL,Page=c(2,2),sample=4)
qqp[[1]]
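
What a normal quantile-quantile plot shows can be sketched in base R: the sorted sample values are plotted against the quantiles expected under a standard normal distribution. The built-in `mtcars` data is an assumption for self-containment, and ExpOutQQ's own plotting details may differ.

```r
# Normal Q-Q plot built by hand for one numeric variable.
x <- mtcars$mpg
n <- length(x)

theoretical <- qnorm(ppoints(n))   # expected standard-normal quantiles
sample_q    <- sort(x)             # observed (sample) quantiles

plot(theoretical, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "Normal Q-Q plot: mpg")
qqline(x)                          # reference line through the quartiles
```

Points that follow the reference line suggest an approximately normal distribution; systematic curvature suggests skewness or heavy tails. The same plot is produced directly by `qqnorm(x)`.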

5. Parallel Co-ordinate plots

Function definition:

ExpParcoord (data,Group=NULL,Stsize=NULL,Nvar=NULL,Cvar=NULL,scale=NULL)
data    : Input dataframe or data.table
Group   : stratification variables
Stsize  : vector of stratum sample sizes
Nvar    : vector of numeric variables; by default, all numeric variables in the data are used
Cvar    : vector of categorical variables; by default, all categorical variables in the data are used
scale   : scale the variables in the parallel coordinate plot [default: normalized so that the minimum of each variable is zero and the maximum is one]

5.1 Default ExpParcoord function

ExpParcoord(CData,Group=NULL,Stsize=NULL,Nvar=c("Price","Income","Advertising","Population","Age","Education"))

5.2 With Stratified rows and selected columns only

ExpParcoord(CData,Group="ShelveLoc",Stsize=c(10,15,20),Nvar=c("Price","Income"),Cvar=c("Urban","US"))

5.3 Without stratification

ExpParcoord(CData,Group="ShelveLoc",Nvar=c("Price","Income"),Cvar=c("Urban","US"),scale=NULL)

5.4 Scale change

scale="std": each variable is standardized on its own by subtracting its mean and dividing by its standard deviation

ExpParcoord(CData,Group="US",Nvar=c("Price","Income"),Cvar=c("ShelveLoc"),scale="std")
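
The two scalings mentioned in this section, the default min-max normalization and the "std" option, can be written out in base R as follows. The helper names `scale_minmax` and `scale_std` are hypothetical, and the built-in `mtcars` data is used only to keep the sketch self-contained.

```r
# Hypothetical helpers illustrating the two scaling options.
scale_minmax <- function(x) {
  # Minimum maps to 0, maximum maps to 1
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

scale_std <- function(x) {
  # Subtract the mean, divide by the standard deviation
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

hp_minmax <- scale_minmax(mtcars$hp)
hp_std    <- scale_std(mtcars$hp)

range(hp_minmax)                    # spans 0 to 1
c(mean = mean(hp_std), sd = sd(hp_std))
```

Min-max scaling keeps every axis on the same 0 to 1 range, while standardization expresses values in standard deviations from the mean, which makes unusually large or small values easier to compare across variables.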

5.5 Selected numeric variables

ExpParcoord(CData,Group="ShelveLoc",Stsize=c(10,15,20),Nvar=c("Price","Income","Advertising","Population","Age","Education"))

5.6 Selected categorical variables

ExpParcoord(CData,Group="US",Stsize=c(15,50),Cvar=c("ShelveLoc","Urban"))

6. Customized Summary Statistics

This function uses 'data.table' package functions.

Function definition:

ExpCustomStat(data,Cvar=NULL,Nvar=NULL,stat=NULL,gpby=TRUE,filt=NULL,dcast=FALSE)

ExpCustomStat examples

e1du <- ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=F)
rownames(e1du)<-NULL

e1du1 <- ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=T)
rownames(e1du1)<-NULL

e1du2 <- ExpCustomStat(Carseats,Cvar=c("Urban","ShelveLoc"),Nvar=c("Age","Price","Advertising","Sales"),stat=c("mean"),gpby=FALSE,dcast=T)
rownames(e1du2)<-NULL
ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=F)
kable(e1du,"html")
ExpCustomStat(Carseats,Cvar="Urban",Nvar=c("Age","Price"),stat=c("mean","count"),gpby=TRUE,dcast=T)
kable(e1du1,"html")
ExpCustomStat(Carseats,Cvar=c("Urban","ShelveLoc"),Nvar=c("Age","Price","Advertising","Sales"),stat=c("mean"),gpby=FALSE,dcast=T)
kable(e1du2,"html")
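
For comparison, grouped statistics of this kind can also be obtained in base R with `aggregate`. The sketch assumes the built-in `mtcars` data, with `am` as the grouping variable and `mpg` and `hp` as the numeric variables; it is not how ExpCustomStat is implemented.

```r
# Group means of two numeric variables by one categorical variable
group_means <- aggregate(cbind(mpg, hp) ~ am, data = mtcars, FUN = mean)

# Number of observations per group
group_counts <- aggregate(mpg ~ am, data = mtcars, FUN = length)
names(group_counts)[2] <- "count"

# Combine both into one summary table
group_summary <- merge(group_means, group_counts, by = "am")
group_summary
```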

7. Univariate outlier analysis

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

The function ExpOutliers runs univariate outlier analysis based on the boxplot or the standard deviation (SD) method. It returns a summary of outliers for the selected numeric features and adds new features to the data if any outliers are found.

Identifying outliers: There are several methods we can use to identify outliers. ExpOutliers uses two of them: (1) Boxplot and (2) Standard Deviation.
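
The two rules can be sketched in base R as follows. The helper names are hypothetical, the 1.5 x IQR fences are the conventional boxplot rule and are assumed here, and the built-in `mtcars` data is used so the example runs on its own; ExpOutliers may differ in its details.

```r
# Hypothetical helpers: flag outliers with the two rules.
is_outlier_boxplot <- function(x) {
  q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  # Outside the conventional 1.5 x IQR fences
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

is_outlier_3sd <- function(x) {
  # More than 3 standard deviations away from the mean
  abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE)
}

hp <- mtcars$hp
box_flag <- is_outlier_boxplot(hp)
sd_flag  <- is_outlier_3sd(hp)

hp[box_flag]                     # values flagged by the boxplot rule
c(boxplot = sum(box_flag), three_sd = sum(sd_flag))
```

On this variable the two rules disagree: the boxplot rule flags the largest value, while the 3 SD rule flags nothing, because the extreme value itself inflates the standard deviation.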

ana1 <- ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "boxplot",  treatment = "mean", capping = c(0.1, 0.9))
outlier_summ <- ana1[[1]]
outlier_data <- ana1[[2]]

ana2 <- ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev",  treatment = "median", capping = c(0.1, 0.9))
outlier_summ1 <- ana2[[1]]
outlier_data1 <- ana2[[2]]

7.1 Identifying outliers using Boxplot method

ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "boxplot",  treatment = "mean", capping = c(0.1, 0.9))

Summary

kable(outlier_summ,"html")

Output data head view

kable(head(outlier_data),"html")

7.2 Identifying outliers using 3 Standard Deviation method

ExpOutliers(Carseats, varlist = c("Sales","CompPrice","Income"), method = "3xStDev",  treatment = "median", capping = c(0.1, 0.9))

Summary

kable(outlier_summ1,"html")

Output data head view

kable(head(outlier_data1),"html")


SmartEDA documentation built on Dec. 4, 2022, 1:15 a.m.