knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  cache = FALSE
)
devtools::load_all()
require(knitr)

AEDA - Automated Exploratory Data Analysis

This vignette gives you an introductory glance at the key features of AEDA.
It shows you how to use this package to simplify several tasks throughout the data exploration process.
More detailed, continuously updated examples and a wiki can be found on the GitHub project page.

Purpose

Writing exploratory data analysis (EDA) scripts helps in extracting valuable information from the data but can be very time consuming. Often people produce the same tables and figures again and again, which could be automated with EDA scripts.

With this package we want to automate the process of creating an EDA report by providing the functions one would normally script for each data type of a dataset. Therefore we provide the following functionalities:

  1. Basic Data Summary
  2. Categorical Data Summary
  3. Numeric Data Summary
  4. Correlation Analysis
  5. Cluster Analysis
  6. Principal Component Analysis
  7. Multidimensional Scaling Analysis
  8. Exploratory Factor Analysis

In order to enable the user to quickly create an EDA report, we set default methods/arguments for most functions, such as k-means clustering as the method in the unsupervised clustering task.
Apart from the default methods/arguments for each EDA step, the user is able to modify each analysis step by choosing other analysis algorithms. To come back to the last example, the user might also be interested in performing a hierarchical cluster analysis with complete linkage. We will refer to this again in subsection 5.

The remainder of this documentation is organized in accordance with the steps listed above. As the package evolves, more content will be added.

General AEDA Pipeline

For all analysis steps (except the basic data summary) the following steps should be executed in order to get an analysis into the final report:

  1. my.task = make*Task()
  2. my.analysis = make*Analysis()
  3. my.report = makeReport(my.analysis)
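
For example, for the cluster analysis (covered in detail in subsection 5) this pipeline looks as follows:

#create the task
cluster.task = makeClusterTask(id = "iris", data = iris)
#compute the analysis
cluster.analysis = makeClusterAnalysis(cluster.task)
#create the report
cluster.report = makeReport(cluster.analysis)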

Since these multiple function calls do not provide much additional functionality if the user does not modify the parameters, there is also a shortcut to get a report in one step:
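
  my.report = create*Report()

For instance, createCorrReport(data = survey) (used in subsection 4) wraps task creation, analysis computation, and report creation into a single call.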

After creating several report objects with makeReport() (e.g. cluster.report, numeric.report, mds.report) you can pass them over to the function finishReport(cluster.report, numeric.report, mds.report) to create a final rmd-file, which can be converted to HTML with rmarkdown::render().
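
A minimal sketch of this last step (the rmd file name passed to rmarkdown::render() is an assumption for illustration; use the path of the file that finishReport() actually writes):

#combine several report objects into one rmd-file
finishReport(cluster.report, numeric.report, mds.report)
#render the generated rmd-file to HTML (file name assumed)
rmarkdown::render("MainReport.rmd")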

Further shortcuts are:

fastReport(): Provided with a data.frame, this function will create a finished rmd-file which is ready to be converted to HTML. By default this function will use all available report types, provided the needed data types are present. The user can still pass arguments to each report function via m.par.vals, but since every report needs different arguments this parameter can get unwieldy.
openMLReport(): This function has the same functionality as fastReport() but instead of a data.frame it takes the ID of an OpenML dataset.
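
A hedged sketch of both shortcuts (the argument name and the OpenML dataset ID below are assumptions based on the descriptions above):

#create a full rmd-report directly from a data.frame
fastReport(data = iris)
#same, but fetching the dataset from OpenML by its ID (example ID, assumed)
openMLReport(61L)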


In the following, a detailed summary for each analysis step is presented:

1) Basic Data Summary

In the basic data summary, column types like integer, numeric, or character will be identified and the columns aggregated according to each type. In addition to that, the basic data summary contains a missing-value summary, listing the columns which contain NAs with absolute and percentage amounts. An example is displayed with this code snippet:

#load Data
data("airquality")
#create the task
basic.report.task = makeBasicReportTask(id = "airquality", data = airquality, target = "Wind")
#create the report
basic.report = makeReport(basic.report.task)

In order to have a look into the basic.report object one can either print() the object or just call basic.report on the R command line.

print(basic.report)

#To gain more insights you can look into the object with the $ operator:
basic.report$basic.data.summary$basic.summary.list$num
basic.report$basic.data.summary$basic.summary.list$int
basic.report$basic.data.summary$basic.summary.list$fact

Since the airquality dataset contains NAs, a missing-value analysis is calculated as well by makeBasicReportTask. In order to look into that, simply use the $-operator again:

basic.report$na.summary$na.df

We provide two missing-value plots. The first one shows, for each column, the indices of the observations in which the value is missing. The second one shows the overall percentage of missing values for each column. To call the plots you can execute the following lines:
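
#index plot: which observations are missing per column
basic.report$na.summary$image()
#percentage of missing values per column
basic.report$na.summary$ggplot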

include_graphics(c("images/NAindex.png", "images/NApct.png"))

2) Categorical Data Summary

In the categorical data summary we create 1-D frequency and 2-D contingency tables for categorical variables. In addition to that, the internal function plotBar() from our package computes a ggplot for all categorical variables. If desired, the user can modify the geom_bar() arguments within the task initialized by makeCatSumTask(). For more information look at ?makeCatSumTask.
In this example we use the default arguments.

#load data
data("Arthritis", package = "vcd")
#create the task
cat.sum.task = makeCatSumTask(id = "Arthritis.Task", data = Arthritis, target = "Improved")
#compute the analysis
cat.sum.analysis = makeCatSum(cat.sum.task)
#create the report
cat.sum.report = makeReport(cat.sum.analysis)

#have a look at the 1-D absolute frequency tables
cat.sum.report$cat.sum$freq

#have a look at the 2-D absolute contingency tables
cat.sum.report$cat.sum$contg.list

To access the bar plots we can use the $-operator again. Note that our package contains the internal function multiplot(), which can plot a list of ggplot objects in a grid. Simply run the following line, which leads to the plot below:
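
multiplot(plotlist = cat.sum.report$cat.sum$plot.list, cols = 2)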

include_graphics("images/categoricalBar1.png") 

Since we can add geom_bar arguments to makeCatSumTask, the user can modify the barplot. For example, we stack the frequencies for a categorical variable instead of dodging them. This can be done by the same procedure as in the example above, with just one additional parameter, and then calling multiplot() again, leading to the following image:
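
#same task as above, but with stacked instead of dodged bars
cat.sum.task = makeCatSumTask(id = "Arthritis.Task", data = Arthritis,
  target = "Improved", position = "stack")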

include_graphics("images/categoricalBar2.png") 

3) Numeric Data Summary

In the numeric data summary we create a numeric summary data.frame which provides several characteristics for each numeric column. Furthermore, for each numeric column a boxplot, histogram and kernel density plot will be created. In order to get the numeric report the user must run the following functions:

#load data
data("Aids2", package = "MASS")
#create the task
num.sum.task = makeNumSumTask(id = "AidsTask", data = Aids2, target = "sex")
#compute the analysis
num.sum = makeNumSum(num.sum.task)
#create the report
num.sum.report = makeReport(num.sum)

In order to view the numeric summary data.frame, execute num.sum.report$num.sum.df.
The characteristics we calculate are:

colnames(num.sum.report$num.sum.df)

After selecting specific columns, we receive the following result:

kable(num.sum.report$num.sum.df[, c(1,2,4,5,6,7,8,10,12,13,14,16,21,22)])

The ggplot objects for a variable, e.g. age, can be shown in one grid with multiplot(). The following call leads to the plot below:
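
multiplot(plotlist = num.sum.report$num.sum.var$age[2:4], cols = 2)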

include_graphics("images/plotNumeric1.png") 

Just like in makeCatSumTask, in the numeric summary the user can change input parameters for geom_density(), geom_boxplot() or geom_histogram(), for example:
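
#numeric summary task with modified histogram arguments
task = makeNumSumTask(id = "AidsTask", data = Aids2, target = "sex",
  geom.hist.args = list(bins = 20, alpha = 0.5))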

4) Correlation Analysis

Correlation describes the relationship between two features. Depending on the chosen method, numeric and ordered features can be used in the analysis. The default option is to use all suitable features, but the user can specify which columns should be included.
The resulting correlation matrix is calculated with base::cor().

data("survey", package = "MASS")
# Since Exer and Smoke are ordinal, encode them as ordered
survey$Exer = factor(survey$Exer, levels(survey$Exer)[c(2, 3, 1)], ordered = TRUE)
survey$Smoke = factor(survey$Smoke, levels(survey$Smoke)[c(2, 3, 4, 1)], ordered = TRUE)

#create the task
corr.task = makeCorrTask(id = "survey", data = survey)
#compute the analysis
corr.result = makeCorr(corr.task)
#create the report
corr.report = makeReport(corr.result)

#default method is "pearson"
corr.report$method
#print the correlation matrix
corr.report$corr.matrix

The correlation report will create a circle-shaped correlation plot:

library(GGally)
ggcorr(data = NULL, cor_matrix = corr.report$corr.matrix,
  geom = "circle", nbreaks = 10)

To include the ordered features, use the method "spearman" or "kendall":

# As mentioned, there is a shortcut to get a report
s.corr.report = createCorrReport(data = survey, method = "spearman")

#method is now ...
s.corr.report$method
#correlation matrix includes ordered features
s.corr.report$corr.matrix

The resulting plot:

ggcorr(data = NULL, cor_matrix = s.corr.report$corr.matrix,
  geom = "circle", nbreaks = 10)

5) Cluster Analysis

Cluster analysis is an unsupervised learning task which focuses on grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Inspired by mlr's listed clustering algorithms, our package provides the following clustering methods:

  1. For Hierarchical Clustering:
     - Classical hierarchical clustering
     - Agglomerative nesting
     - Divisive analysis

  2. For Partitioning Clustering:
     - K-means
     - Kernel K-means
     - Partitioning around medoids

  3. For Model-Based Clustering:
     - Density-based spatial clustering
     - Model-based clustering

By default, k-means clustering will be computed on the numeric columns of a dataset. In case the method is not hierarchical, the returned ClusterAnalysisObj contains, among other components, a cluster analysis on all numeric columns (cluster.all) and a list of pairwise cluster analyses (comb.cluster.list).

The purpose of comb.cluster.list is to show clusters and grouping relationships between two numeric variables at a time. The list is restricted to at most 10 cluster analyses. If a dataset contains 4 numeric columns, there are $\binom{4}{2} = 6$ combinations. If the number of pairs out of the $\binom{k}{2}$ combinations exceeds 10, our function randomly selects 10 of them.
In case of a hierarchical clustering algorithm the comb.cluster.list is an empty list, since the calculation will be done with respect to all numeric columns only.
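
The pair count can be checked directly in R:

#number of 2-column combinations for 4 numeric columns
choose(4, 2)
#> [1] 6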

Following code snippets are needed in order to create a cluster analysis report:

#create the task
cluster.task = makeClusterTask(id = "iris", data = iris)
#compute the analysis
cluster.analysis = makeClusterAnalysis(cluster.task)
#create the report
cluster.report = makeReport(cluster.analysis)

Plotting the diagnosis results for the cluster.all section can be done via the line below, leading to the diagnosis plot:
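
multiplot(plotlist = cluster.report$cluster.analysis$cluster.all$cluster.diag, cols = 2)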

include_graphics("images/clusterDiagnosisAll1.png") 

Since the iris dataset contains 4 numeric columns, the 2 cluster centers $c_1$ and $c_2$ lie in $\mathbb{R}^4$.

#show centers
cluster.report$cluster.analysis$cluster.all$cluster.res$centers

#show plot: note that beforehand a principal component analysis was conducted in order to map the centers and points into 2-dimensional space
cluster.report$cluster.analysis$cluster.all$cluster.plot

In case the user wants to perform hierarchical clustering, he can simply choose one of the methods our function provides:

##check available methods
#?makeClusterTask
#create the task: hierarchical clustering with average linkage.
cluster.task.2 = makeClusterTask(id = "iris", data = iris, method = "cluster.h",
  par.vals = list(method = "average"))
#compute the analysis
cluster.analysis.2 = makeClusterAnalysis(cluster.task.2)
#we do not need to create the report, as cluster.analysis.2 already contains the analysis information. If we wanted to include cluster.analysis.2 in our final report, we would have to create the report.

#multiplot one diagnosis plot and the resulting dendrogram
multiplot(plotlist = list(cluster.analysis.2$cluster.analysis$cluster.all$cluster.plot,
  cluster.analysis.2$cluster.analysis$cluster.all$cluster.diag[[2]]), cols = 2)

In order to gain more insights into the cluster analysis for all numeric columns you can access cluster.report$cluster.analysis$cluster.all$cluster.res. Since the user can specify the algorithm, the returned values differ for each method. The links for the clustering algorithm choices forward you to the respective methods so you can have a look at the returned values.

6) Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction method that uses an orthogonal transformation to reduce a large set of (numeric) variables to a small set of (numeric) variables, called principal components, that still contain most of the information of the large set. The computed principal components are linearly independent, and hence uncorrelated.

The first principal component accounts for as much of the variability/variance in the data as possible, and each succeeding component accounts for as much of the remaining variability/variance as possible.

Following code snippets are needed in order to create a principal component analysis report:

#create the task
pca.task = makePCATask(id = "iris", data = iris, target = NULL)
#compute the analysis
pca.result = makePCA(pca.task)
#create the report
pca.report = makeReport(pca.result)

#show scree- and scatterplot:
multiplot(plotlist = list(scree = pca.report$pca.result$plotlist$pca.scree,
  scatter = pca.report$pca.result$plotlist$pca.scatter.1), cols = 2L)

In order to gain more insights into the principal component analysis you can access either pca.result or pca.report. For the sake of consistency, one can extract the standard deviations of the principal components and the (weighting) loadings as shown below. For more information have a look at the values returned by stats::prcomp().
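
#standard deviations of the principal components
pca.report$pca.result$pca.result$sdev
#(weighting) loadings / rotation matrix
pca.report$pca.result$pca.result$rotation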

7) Multidimensional Scaling Analysis

Multidimensional scaling (MDS) is a visual representation of distances or dissimilarities between sets of objects. Objects can be colors, faces, map coordinates, political persuasions, or any kind of real or conceptual stimuli (Kruskal and Wish, 1978). Objects that are more similar (or have shorter distances) are closer together on the graph than objects that are less similar (or have longer distances). As well as interpreting dissimilarities as distances on a graph, MDS can also serve as a dimension reduction technique for high-dimensional data. An MDS algorithm aims to place each object in $K$-dimensional space such that the between-object distances are preserved as well as possible. In our approach we set $K = 2$ in order to create a two-dimensional scatterplot to represent the objects.

Our MDS report provides the following MDS algorithms:

  1. For Metric Scaling:
     - Classical multidimensional scaling
     - Weighted classical multidimensional scaling
     - Multidimensional scaling on a symmetric dissimilarity matrix using SMACOF

  2. For Non-Metric Scaling:
     - Kruskal's non-metric multidimensional scaling
     - Sammon's non-linear mapping

Following code snippets are needed in order to create a multidimensional scaling report:

#create the task
mds.task = makeMDSTask(id = "mtcars", data = mtcars)
#compute the analysis
mds.analysis = makeMDSAnalysis(mds.task)
#create the report
mds.report = makeReport(mds.analysis)

#after computing the distances from the following data.frame
head(mtcars)

#we map the objects (cars) based on dist(mtcars) back into 2-dimensional space
head(mds.report$mds.analysis$mds.result)

#Plot the MDS results
mds.report$mds.analysis$mds.plot

In order to gain more insights into the multidimensional scaling analysis you can access mds.report$mds.analysis. Since the user can specify the algorithm, the returned values differ for each method. The links for the metric and non-metric methods forward you to the respective functions so you can have a look at the returned values.

8) Exploratory Factor Analysis

Factor analysis is a way to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved (latent) variables, called factors.
Like principal component analysis and multidimensional scaling, its purpose is dimension reduction. It is a way to find hidden patterns, show how those patterns overlap, and show which characteristics are seen in multiple patterns. Each factor embodies a set of observed variables that have similar response patterns. Our makeFAReport computes an exploratory factor analysis, since at the very beginning of the analysis we assume no prior idea of the structure of the data or of how many dimensions underlie a set of variables.

If we do not specify the number of factors, by default our function computes the optimal number of factors according to parallel analysis, a technique that compares the scree of factors of the observed data with that of a random data matrix of the same size as the original. If the user sets the number of factors, the computation will be done with the assigned number of factors.
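
To see what parallel analysis does, it can be run directly via psych::fa.parallel(). This sketch illustrates the underlying technique; that our function internally makes exactly this call is an assumption:

library(psych)
data(bfi)
#compare the scree of the observed data with that of random data of the same size
fa.parallel(bfi)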

Following code snippets are needed in order to create a factor analysis report:

#load data
library(psych)
data(bfi)
#take small sample of size 200L:
bfi.small = bfi[sample(seq_len(nrow(bfi)), size = 200L), ]
#create the task
fa.task = makeFATask(id = "bfi", data = bfi.small)
#compute the analysis
fa.result = makeFA(fa.task)
#create the report
fa.report = makeReport(fa.result)
#Plot the factor analysis results 
fa.diagram(fa.report$fa.result)

In order to gain more insights into the exploratory factor analysis you can access either the fa.result or the fa.report object. For the sake of consistency we will show an example with the report object: one can extract the loadings of the factor analysis as shown below. For more information have a look at the values returned by psych::fa().
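
#structure (loadings) matrix of the factor analysis
fa.report$fa.result$Structure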


