```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", cache = FALSE)
devtools::load_all()
require(knitr)
```
This vignette is supposed to give you an introductory glance at the key features of AEDA. It shows you how to use this package to simplify several tasks throughout the data exploration process. More detailed, in-depth and continuously updated examples and a wiki can be found on the GitHub project page.
Writing exploratory data analysis (EDA) scripts helps in extracting valuable information from the data, but can be very time consuming. Often people produce the same tables and figures again and again, which could be automated with EDA scripts.

With this package we want to automate the process of creating an EDA report by providing the functions one would normally script for each data type of a dataset. Therefore we provide the following functionalities:
In order to enable the user to create a fast EDA report, we set default methods/arguments for most functions, such as k-means clustering as the method in the unsupervised clustering task.

Apart from the default methods/arguments for each EDA step, the user is able to modify the analysis step by choosing other analysis algorithms. To come back to the last example, the user might also be interested in doing a hierarchical cluster analysis with complete linkage, as sketched below. In the later subsection 5 we will refer to this again.
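As a preview, a hedged sketch of such a modification; `makeClusterTask()` and its arguments are shown in the cluster analysis section below, only the linkage method is swapped to complete linkage here:

```r
# Sketch: hierarchical clustering with complete linkage instead of the k-means default.
# The same makeClusterTask() call appears in the cluster analysis section below.
cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.h", par.vals = list(method = "complete"))
```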
The remainder of this documentation is organized in accordance with the steps listed above. As the package evolves, more content will be added.
AEDA Pipeline

For all analysis steps (except the basic data summary) the following steps should be executed in order to get an analysis into the final report:
```r
my.task = make*Task()
my.analysis = make*Analysis()
my.report = makeReport(my.analysis)
```
Since these multiple function calls do not provide much additional functionality if the user does not modify the parameters, there is a shortcut to get a report:
```r
my.report = create*Report()
```
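For example, the correlation shortcut follows this pattern (a minimal sketch; `createCorrReport()` with a `data` argument is demonstrated in the correlation section below):

```r
# Sketch of the shortcut pattern using the correlation report
corr.report = createCorrReport(data = iris)
```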
After creating several report objects with `makeReport()` (e.g. `cluster.report`, `numeric.report`, `mds.report`) you can pass them over to the function `finishReport(cluster.report, numeric.report, mds.report)` to create a final rmd-file, which can be converted to HTML with `rmarkdown::render`.
Further shortcuts are `fastReport()` and `openMLReport()`:

- `fastReport()`: Provided with a `data.frame`, this function will create a finished rmd-file, which is ready to be converted to HTML. This function will use all available report types by default if the needed data types are available. The user can still define arguments for each report function via `m.par.vals`, but since every report needs different arguments, this parameter can get unwieldy (see the usage sketch after this list).
- `openMLReport()`: This function has the same functionality as `fastReport()`, but instead of a `data.frame` it takes the ID of an OpenML dataset.
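A minimal, hedged usage sketch; the exact argument names of `fastReport()` and `openMLReport()` are assumptions here and should be checked against `?fastReport` and `?openMLReport`:

```r
# Hedged sketch -- the argument names ("data", "data.id") are assumptions,
# not confirmed signatures; see ?fastReport and ?openMLReport.
fastReport(data = iris)
# openMLReport(data.id = 61)   # hypothetical OpenML dataset ID
```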
After creating several report objects with `makeReport()` (e.g. `cluster.report`, `numeric.report`, `mds.report`) you can pass them over to the function `finishReport(cluster.report, numeric.report, mds.report)` to create a final HTML report, which will be rendered with `knitr`.
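Put together, a hedged sketch of this final step; the name of the generated rmd-file is an assumption, so check which file is actually written to your working directory:

```r
# Hedged sketch: combine several report objects into one final report.
# The output file name "MainReport.rmd" is an assumption, not a documented default.
finishReport(cluster.report, numeric.report, mds.report)
rmarkdown::render("MainReport.rmd")
```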
In the following, a detailed summary for each analysis step is displayed:
In the basic data summary, column types like `integer`, `numeric`, or `character` will be identified and the columns aggregated according to their type. In addition, the basic data summary contains a missing value summary, e.g. listing the columns which contain `NAs` with their absolute and percentage amounts. An example is shown with this code snippet:
#load Data data("airquality") #create the task basic.report.task = makeBasicReportTask(id = "airquality", data = airquality, target = "Wind") #create the report basic.report = makeReport(basic.report.task)
In order to have a look into the `basic.report` object, one can either `print()` the object or just call `basic.report` on the R command line.
```r
print(basic.report)
#To gain more insights you can look into the object with the $ operator:
basic.report$basic.data.summary$basic.summary.list$num
basic.report$basic.data.summary$basic.summary.list$int
basic.report$basic.data.summary$basic.summary.list$fact
```
Since the `airquality` dataset contains `NAs`, a missing value analysis is calculated as well by `makeBasicReportTask`. In order to look into it, simply use the `$` operator again:
```r
basic.report$na.summary$na.df
```
We provide two missing value plots. The first one shows, for each column, the indices of the observations in which the value is missing. The second one shows the overall percentage of missing values for each column. To call the plots you can execute the following lines:

```r
basic.report$na.summary$image()
basic.report$na.summary$ggplot
```
include_graphics(c("images/NAindex.png", "images/NApct.png"))
In the categorical data summary we create 1-D frequency and 2-D contingency tables for the categorical variables. In addition, the internal function `plotBar()` from our package computes a ggplot for all categorical variables. The user is able to modify the `geom_bar()` arguments, if desired, within the task initialized by `makeCatSumTask()`. For more information look at `?makeCatSumTask`. In this example we use the default arguments.
#load data data("Arthritis", package = "vcd") #create the task cat.sum.task = makeCatSumTask(id = "Arthritis.Task", data = Arthritis, target = "Improved") #compute the analysis cat.sum.analysis = makeCatSum(cat.sum.task) #create the report cat.sum.report = makeReport(cat.sum.analysis) #have a look at the 1-D absolute frequency tables cat.sum.report$cat.sum$freq #have a look at the 2-D absolute contingecy tables cat.sum.report$cat.sum$contg.list
To access the bar plots we can use the `$` operator again. Note that our package contains the internal function `multiplot()`, which can plot a `list` of ggplot objects in a grid. To call the `multiplot()` function simply run:

```r
multiplot(plotlist = cat.sum.report$cat.sum$plot.list, cols = 2)
```

which leads to the following plot:
include_graphics("images/categoricalBar1.png")
Since we can add `geom_bar` arguments to `makeCatSumTask`, the user can modify the bar plot. For example, we stack the frequencies for a categorical variable instead of dodging them. This can be done easily by the same procedure as the example above, with just one additional parameter:

```r
cat.sum.task = makeCatSumTask(id = "Arthritis.Task", data = Arthritis, target = "Improved", position = "stack")
```

leading to the following image after using `multiplot()`:
include_graphics("images/categoricalBar2.png")
In the numeric data summary we create a numeric summary data.frame which provides several characteristics for each numeric column. Furthermore, for each numeric column a boxplot, a histogram and a kernel density plot will be created. In order to get the numeric report, the user must run the following functions:
#load data data("Aids2", package = "MASS") #create the task num.sum.task = makeNumSumTask(id = "AidsTask", data = Aids2, target = "sex") #compute the analysis num.sum = makeNumSum(num.sum.task) #create the report num.sum.report = makeReport(num.sum)
In order to see the numeric summary data.frame, execute `num.sum.report$num.sum.df`. The characteristics we calculate are:

```r
colnames(num.sum.report$num.sum.df)
```

Showing the result after selecting specific columns, we receive:

```r
kable(num.sum.report$num.sum.df[, c(1,2,4,5,6,7,8,10,12,13,14,16,21,22)])
```
`l.bound` is defined as the 25% quantile minus `1.5*IQR` and `u.bound` as the 75% quantile plus `1.5*IQR`, where IQR is the interquartile range (75% quantile minus 25% quantile).
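As a minimal illustration (plain base R, not the AEDA internals), these bounds for the `age` column of `Aids2` can be computed as:

```r
# Minimal illustration of l.bound and u.bound (not AEDA internals)
q <- quantile(Aids2$age, probs = c(0.25, 0.75))
iqr <- q[2] - q[1]            # interquartile range
l.bound <- q[1] - 1.5 * iqr
u.bound <- q[2] + 1.5 * iqr
c(l.bound, u.bound)
```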
To show the ggplot objects for a variable, e.g. `age`, with `multiplot()`, it can be done in one grid:

```r
multiplot(plotlist = num.sum.report$num.sum.var$age[2:4], cols = 2)
```

leading to the following plot:
include_graphics("images/plotNumeric1.png")
Just like for `makeCatSumTask`, in the numeric summary the user can change the input parameters for `geom_density()`, `geom_boxplot()` or `geom_histogram()`, e.g. with

```r
task = makeNumSumTask(id = "AidsTask", data = Aids2, target = "sex", geom.hist.args = list(bins = 20, alpha = 0.5))
```
Correlation describes the relationship between two features. Depending on the method used, numeric and ordered features can be included in the analysis. The default option is to use all suitable features, but the user can specify which columns should be included. The resulting correlation matrix is calculated with `stats::cor`:
data("survey", package = "MASS") # Since Exer and Smoke are ordinal, encode them as ordered survey$Exer = factor(survey$Exer,levels(survey$Exer)[c(2,3,1)], ordered = TRUE) survey$Smoke = factor(survey$Smoke,levels(survey$Smoke)[c(2,3,4,1)], ordered = TRUE) #create the task corr.task = makeCorrTask(id = "survey", data = survey) #compute the analysis corr.result = makeCorr(corr.task) #create the report corr.report = makeReport(corr.result) #default method is "pearson" corr.report$method #print the correlation matrix corr.report$corr.matrix
The correlation report will create a circle-shaped correlation plot:

```r
library(GGally)
ggcorr(data = NULL, cor_matrix = corr.report$corr.matrix, geom = "circle", nbreaks = 10)
```
To include the ordered features, use the method `spearman` or `kendall`:

```r
#as mentioned, there is a shortcut to get a report
s.corr.report = createCorrReport(data = survey, method = "spearman")
#method is now ...
s.corr.report$method
#correlation matrix includes ordered features
s.corr.report$corr.matrix
```
The resulting plot:

```r
ggcorr(data = NULL, cor_matrix = s.corr.report$corr.matrix, geom = "circle", nbreaks = 10)
```
Cluster analysis is an unsupervised learning task which mainly focuses on grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find them efficiently. Inspired by mlr's listed clustering algorithms, our package provides the following clustering methods:
For Partitioning Clustering:
For Model-Based Clustering:
By default in the cluster analysis k-means clustering will be computed on the numeric columns of a dataset.
The components of the returned `ClusterAnalysisObj` are the following (in case the method is not hierarchical):
The `comb.cluster.list` has the purpose of creating clusters and grouping relationships between two numeric variables. The list is restricted to at most 10 cluster analyses. If a dataset contains 4 numeric columns, there are $\binom{4}{2} = 6$ combinations. If the number of pairwise combinations exceeds 10, our function randomly selects 10 out of the $\binom{k}{2}$ combinations.
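A quick illustration of this counting argument in plain R (not AEDA internals):

```r
# Number of pairwise combinations of numeric columns
choose(4, 2)   # 6  -> all pairs are analysed
choose(8, 2)   # 28 -> exceeds 10, so 10 pairs would be sampled at random, e.g.:
sample(seq_len(choose(8, 2)), size = 10)
```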
In case of a hierarchical clustering algorithm, the `comb.cluster.list` is an empty list, since the calculation is done with respect to all numeric columns only.
The following code snippets are needed in order to create a cluster analysis report:
```r
#create the task
cluster.task = makeClusterTask(id = "iris", data = iris)
#compute the analysis
cluster.analysis = makeClusterAnalysis(cluster.task)
#create the report
cluster.report = makeReport(cluster.analysis)
```
Plotting the diagnosis results for the `cluster.all` section can be done via:

```r
multiplot(plotlist = cluster.report$cluster.analysis$cluster.all$cluster.diag, cols = 2)
```

leading to the diagnosis plot:
include_graphics("images/clusterDiagnosisAll1.png")
Since the iris dataset contains 4 numeric columns, the 2 cluster centers $c_1$ and $c_2$ are $\in \mathbb{R}^4$.
```r
#show centers
cluster.report$cluster.analysis$cluster.all$cluster.res$centers
#show plot: note, a principal component analysis was conducted beforehand
#in order to map the centers and points into 2-dimensional space
cluster.report$cluster.analysis$cluster.all$cluster.plot
```
In case the user wants to perform hierarchical clustering, they can simply choose one of the methods our function provides:
```r
##check available methods
#?makeClusterTask
#create the task: hierarchical clustering with average linkage
cluster.task.2 = makeClusterTask(id = "iris", data = iris, method = "cluster.h",
  par.vals = list(method = "average"))
#compute the analysis
cluster.analysis.2 = makeClusterAnalysis(cluster.task.2)
#we do not need to create the report, as cluster.analysis.2 already contains the
#analysis information. If we were to include cluster.analysis.2 in our final report,
#we would have to create the report.
#multiplot one diagnosis plot and the resulting dendrogram
multiplot(plotlist = list(cluster.analysis.2$cluster.analysis$cluster.all$cluster.plot,
  cluster.analysis.2$cluster.analysis$cluster.all$cluster.diag[[2]]), cols = 2)
```
In order to gain more insights into the cluster analysis for all numeric columns, you can access `cluster.report$cluster.analysis$cluster.all$cluster.res`. Since the user can specify the algorithm, the returned values differ for each method. The links for the clustering algorithm choices forward you to the respective methods so you can have a look at the returned values.
Principal component analysis (PCA) is a dimensionality reduction method that uses an orthogonal transformation to reduce a large set of (numeric) variables to a small set of (numeric) variables, called principal components, that still contain most of the information of the large set. Those computed principal components are linearly independent and hence uncorrelated.

The first principal component accounts for as much of the variability/variance in the data as possible, and each succeeding component accounts for as much of the remaining variability/variance as possible.
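For intuition, a minimal sketch with plain `stats::prcomp` (independent of the AEDA functions) on the numeric iris columns:

```r
# Minimal sketch with stats::prcomp, not AEDA internals
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)        # proportion of variance explained by each principal component
head(pca$x[, 1:2])  # scores on the first two components
```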
The following code snippets are needed in order to create a principal component analysis report:
```r
#create the task
pca.task = makePCATask(id = "iris", data = iris, target = NULL)
#compute the analysis
pca.result = makePCA(pca.task)
#create the report
pca.report = makeReport(pca.result)
#show scree- and scatterplot:
multiplot(plotlist = list(scree = pca.report$pca.result$plotlist$pca.scree,
  scatter = pca.report$pca.result$plotlist$pca.scatter.1), cols = 2L)
```
In order to gain more insights into the principal component analysis, you can access either `pca.result` or `pca.report`. For the sake of consistency, one can extract the standard deviations of the principal components via `pca.report$pca.result$pca.result$sdev` or the (weighting) loadings via `pca.report$pca.result$pca.result$rotation`.
For more information, have a look at the values returned by `stats::prcomp()`.
Multidimensional scaling (MDS) is a visual representation of distances or dissimilarities between sets of objects. Objects can be colors, faces, map coordinates, political persuasion, or any kind of real or conceptual stimuli (Kruskal and Wish, 1978). Objects that are more similar (or have shorter distances) are closer together on the graph than objects that are less similar (or have longer distances). As well as interpreting dissimilarities as distances on a graph, MDS can also serve as a dimension reduction technique for high-dimensional data. An MDS algorithm aims to place each object in $K$-dimensional space such that the between-object distances are preserved as well as possible. In our approach we set $K = 2$ in order to create a two-dimensional scatterplot to represent the objects.
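As a standalone illustration of this idea (plain base R with `stats::dist` and `stats::cmdscale`, not the AEDA functions), a classical metric MDS of the `mtcars` objects into two dimensions looks like this:

```r
# Minimal sketch of classical (metric) MDS, independent of AEDA
d <- dist(scale(mtcars))      # pairwise distances on the scaled numeric data
fit <- cmdscale(d, k = 2)     # place each car in 2-dimensional space
plot(fit, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(fit, labels = rownames(mtcars), cex = 0.7)
```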
Our MDS report provides the following MDS algorithms:
Multidimensional scaling on a symmetric dissimilarity matrix using SMACOF
For Non-Metric Scaling:
The following code snippets are needed in order to create a multidimensional scaling report:
```r
#create the task
mds.task = makeMDSTask(id = "mtcars", data = mtcars)
#compute the analysis
mds.analysis = makeMDSAnalysis(mds.task)
#create the report
mds.report = makeReport(mds.analysis)
#after computing the distances from the following data frame
head(mtcars)
#we map the objects (cars) based on dist(mtcars) back into 2-dimensional space
head(mds.report$mds.analysis$mds.result)
#plot the MDS results
mds.report$mds.analysis$mds.plot
```
In order to gain more insights into the multidimensional scaling analysis, you can access `mds.report$mds.analysis`. Since the user can specify the algorithm, the returned values differ for each method. The links for the metric and non-metric methods forward you to the respective functions so you can have a look at the returned values.
Factor analysis is a way to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved (latent) variables, called factors. Like principal component analysis and multidimensional scaling, its purpose is dimension reduction. It is a way to find hidden patterns, show how those patterns overlap, and show which characteristics are seen in multiple patterns. Each factor embodies a set of observed variables that have similar response patterns. Our `makeFAReport` computes an exploratory factor analysis, since at the very beginning of the analysis we assume to have no idea about the structure of the data or how many dimensions there are in a set of variables.
Since we do not specify the number of factors, by default our function computes the optimal number of factors according to parallel analysis, a technique that compares the scree of factors of the observed data with that of a random data matrix of the same size as the original. If the user selects the number of factors, the computation will be done with the assigned number of factors.
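As an illustration of parallel analysis itself (calling `psych::fa.parallel` directly; this is not necessarily how AEDA invokes it internally):

```r
# Parallel analysis with the psych package, shown for illustration only
library(psych)
data(bfi)
#compare the observed scree with that of random data to suggest a number of factors
fa.parallel(bfi[, 1:25], fa = "fa")
```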
The following code snippets are needed in order to create a factor analysis report:
```r
#load data
library(psych)
data(bfi)
#take a small sample of size 200L:
bfi.small = bfi[sample(seq_len(nrow(bfi)), size = 200L), ]
#create the task
fa.task = makeFATask(id = "bfi", data = bfi.small)
#compute the analysis
fa.result = makeFA(fa.task)
#create the report
fa.report = makeReport(fa.result)
#plot the factor analysis results
fa.diagram(fa.report$fa.result)
```
In order to gain more insights into the exploratory factor analysis, you can access either the `fa.result` or the `fa.report` object. For the sake of consistency, we will show an example with the report object. One can extract the loadings of the factor analysis via `fa.report$fa.result$Structure`.
For more information, have a look at the values returned by `psych::fa()`.