```{r setup}
knitr::opts_chunk$set(echo = TRUE)
require("summarytools")
require("tpotr")
require("mlr")
require("stringr")
require("knitr")
require("iml")
require("lime")
require("ggplot2")
require("kableExtra")
require("magrittr")
require("dplyr")
```
This report analyses the dataset `r params$title`.
The dataset analyzed in this report is described hereafter. The passed dataset has `r ncol(params$data)` features (columns) and `r nrow(params$data)` observations (rows). The feature of interest, the target column, is `r params$target` and has `r length(levels(params$data[,params$target]))` classes (`r levels(params$data[,params$target])`). The following table summarizes all features of the dataset and provides a descriptive overview of each of them. It provides the following information:

- No: The position of the feature, i.e. the order in which it appears in the dataset.
- Variable: The name of the feature and its class.
- Stats/Values: An insight into the feature's values.
- Freqs: The frequencies, proportions or number of distinct values.
- Graph: A histogram or barplot of the feature's values.
- Valid/Missing: The number and proportion of valid and missing values in the feature.

You can expand/collapse the table.
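A minimal sketch of how such an overview table can be produced with `summarytools` (assuming the report has direct access to `params$data`):

```{r, eval=FALSE}
# Sketch: descriptive summary table of all features. The dfSummary()
# output renders as an expandable HTML table inside the report.
library(summarytools)
print(
  dfSummary(params$data, graph.magnif = 0.75),
  method = "render"
)
```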
The Automated Statistician has fitted a machine learning pipeline that can predict the target variable `r params$target` in the dataset `r params$title` with a classification accuracy of `r max(generations)`. The best machine learning pipeline is:
`r pipeline`
To fit a machine learning pipeline, the Automated Statistician tried different combinations of pipeline operators over `r length(generations)` generations. The best pipeline is the one with the highest classification accuracy on the input training data. The plot below shows how the accuracy increased while the pipeline was fitted.
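A minimal sketch of how this plot can be drawn with `ggplot2`, assuming `generations` is a numeric vector holding the best accuracy of each generation:

```{r, eval=FALSE}
# Sketch: best classification accuracy per generation.
library(ggplot2)
gen.df <- data.frame(
  generation = seq_along(generations),
  accuracy   = generations
)
ggplot(gen.df, aes(x = generation, y = accuracy)) +
  geom_line() +
  geom_point() +
  labs(x = "Generation", y = "Classification accuracy")
```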
To explain the fitted machine learning model, different model-agnostic methods are used in the following. They are based on the book Interpretable Machine Learning by Christoph Molnar.
To assess the quality of the fitted machine learning model, first the importance of each feature for the prediction is shown.
The importance of a feature is measured by calculating the increase in the model’s prediction error after permuting the feature. A feature is important if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction.
The following plot shows the importance of each feature. The increase in classification error is calculated as the classification error after permutation minus the original classification error. Consequently, if a feature has a classification error increase of 0, it is not important for the prediction. The dot is the median over several permutations:
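A minimal sketch of how this permutation importance can be computed with the `iml` package; the fitted mlr model object `model` is an assumption:

```{r, eval=FALSE}
# Sketch: permutation feature importance with iml. `model` is the
# fitted mlr model (assumption); the data comes from the report params.
library(iml)
X <- params$data[, setdiff(names(params$data), params$target)]
predictor <- Predictor$new(model, data = X, y = params$data[, params$target])
imp <- FeatureImp$new(
  predictor,
  loss    = "ce",          # classification error
  compare = "difference",  # permutation error minus original error
  n.repetitions = 5
)
plot(imp)     # dot = median increase over the repetitions
imp$results   # table underlying `imp.df` and `imp.features`
```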
The plot shows that from the `r length(data) - 1` features of the dataset the following `r length(imp.features)` `r if (length(imp.features) == 1) {"feature is identified as the most important feature, being at least as important as all remaining features:"} else {"features are identified as the most important features, being at least as important as all remaining features:"}` `r paste(imp.features, collapse = ", ")`. Randomly shuffling the feature `r imp.features[1]` increases the prediction error the most, from `r 1 - max(generations)` to `r imp.df[1,"permutation.error"]`, i.e. by a factor of `r imp.df[1,"permutation.error"] / (1 - max(generations))`. Therefore `r imp.features[1]` can be seen as the most important feature for predicting `r target`.
The goal of the first step was to analyse which features influence the prediction of the fitted model the most. As a next step, it is analysed how individual features influence the prediction of the model. This is done with accumulated local effects.
Accumulated local effects (ALE) describe how features influence the prediction of a machine learning model on average.
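A minimal sketch of how an ALE plot can be computed with `iml`, reusing the `predictor` object from the importance sketch above:

```{r, eval=FALSE}
# Sketch: ALE of the most important feature; `predictor` and `imp.df`
# come from the feature importance step.
library(iml)
ale <- FeatureEffect$new(
  predictor,
  feature = imp.df[1, "feature"],
  method  = "ale"
)
plot(ale)     # average change in the predicted class probabilities
ale$results   # values underlying `ale.example`
```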
The following two ALE plots show how the two most important features (`r paste(imp.df[1:2,"feature"], collapse = ", ")`) influence the prediction of the model:
For the given classification problem the most important feature according to the feature importance analysis is `r imp.df[1,"feature"]`. The ALE plot provides an analysis of how this feature influences the target `r target` with its `r length(levels(data[,target]))` classes. `r paste(ale.desc, collapse="")`
As an example, if `r imp.df[1,"feature"]` has the value `r ale.example[1,4]`, the average prediction for `r ale.example[1,1]` changes by `r round(ale.example[1,2], digits = 4)`; in other words, the probability for `r ale.example[1,1]` `r if (ale.example[1,2] > 0) {"increases"} else {"decreases"}` by `r abs(round(ale.example[1,2], digits = 4) * 100)`%.

The second most important feature according to the feature importance analysis is `r imp.df[2,"feature"]`. The plot illustrates how `r imp.df[2,"feature"]` influences the target `r target`. `r paste(ale.desc, collapse="")`
As an example, if `r imp.df[2,"feature"]` has the value `r ale.example[1,4]`, the average prediction for `r ale.example[1,1]` changes by `r round(ale.example[1,2], digits = 4)`; in other words, the probability for `r ale.example[1,1]` `r if (ale.example[1,2] > 0) {"increases"} else {"decreases"}` by `r abs(round(ale.example[1,2], digits = 4) * 100)`%.
The last step focuses on individual observations to explain why the model makes a certain classification in a given case. This is especially useful for explaining the reason for individual predictions on the predicted data.
Local surrogate models like LIME are interpretable models that are used to explain individual predictions of black box machine learning models.
For the predicted data, LIME provides the explanations below. The colour represents the feature weight and gives an indication of which features and feature values are most responsible for the prediction.
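A minimal sketch of how these explanations can be produced with the `lime` package; the fitted mlr model `model` and the data frame of observations to explain, `newdata`, are assumptions:

```{r, eval=FALSE}
# Sketch: LIME explanations for new observations. `model` and
# `newdata` are assumptions; lime supports mlr models directly.
library(lime)
features  <- setdiff(names(params$data), params$target)
explainer <- lime(params$data[, features], model)
explanation <- explain(
  newdata[, features],
  explainer,
  n_labels   = 1,  # explain the predicted class only
  n_features = 4   # features shown per explanation
)
plot_features(explanation)   # colour encodes the feature weight
```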
To explain the inner workings of the fitted model even further, 5 observations are randomly selected for each class and their membership in each class is explained with LIME.
In the following plots, for each class 5 selected observations are explained by LIME.
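A minimal sketch of how the 5 observations per class can be sampled and explained, reusing the `explainer` and `features` from the sketch above (the use of `dplyr::slice_sample` is an assumption):

```{r, eval=FALSE}
# Sketch: sample 5 observations per class and explain their class
# membership with LIME; `explainer` and `features` as defined above.
library(dplyr)
library(lime)
sampled <- params$data %>%
  group_by(.data[[params$target]]) %>%
  slice_sample(n = 5) %>%
  ungroup() %>%
  as.data.frame()
explanation <- explain(
  sampled[, features],
  explainer,
  n_labels   = length(levels(params$data[, params$target])),
  n_features = 4
)
plot_features(explanation)
```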