visualizeR: visualizeR - Automated exploratory data analysis for...


Description

visualizeR automates exploratory data analysis for classification problems in machine learning. The problem can be two-class or multi-class. It is recommended that all ID and date features be removed, and that the data be cleaned, before running this function. visualizeR has some data cleaning built in, but it cannot account for cleaning that requires domain knowledge.

Usage

visualizeR(df, Outcome,
           nrBins = 30,
           sample = 0.3,
           clipOutliers = TRUE,
           handleMissing = TRUE,
           CatChartType = "stackedHist",
           NumChartType = "boxPlot",
           summaryStats = FALSE,
           seed = 1234,
           maxLevels = 25,
           nrUniques = 20,
           outputPath = "",
           outputFileName = "outputPlots")

Arguments

df

A data.frame object containing plotting features and target/outcome feature. Cannot be left blank.

Outcome

The name of the outcome feature as a character string, e.g. 'Target'. Cannot be left blank.

nrBins

The number of bins to use in histograms of numerical features when 'NumChartType' is set to 'stackedHist'.

sample

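Controls whether a random sample of the data is taken in order to speed up plotting. The default is 0.3 and the examples pass 1, so the value appears to be the proportion of rows to sample.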

clipOutliers

Whether outliers in the data should be fixed using a median approach. Possible values: TRUE, FALSE

handleMissing

Whether missing values should be replaced: with the value 'Missing' for categorical variables and by median imputation for continuous variables. Possible values: TRUE, FALSE. If this is set to FALSE, missing observations are removed from the plots.

CatChartType

Indicates the type of chart to use when plotting categorical/factor features. Possible values: 'stackedHist', 'Confusion'

NumChartType

Indicates the type of chart to use when plotting numerical/continuous features. Possible values: 'stackedHist', 'densityLine', 'densityFill', 'boxPlot'

summaryStats

Whether summary statistics should be printed for the predictors in the dataset: summary statistics for continuous variables and frequency tables for categorical variables. Possible values: TRUE, FALSE

seed

The random seed. Used only when sampling the data, so that the plots can be reproduced.

maxLevels

The maximum number of levels allowed for factor features. A feature with more levels than this threshold will not be plotted.

nrUniques

The number of unique values below which a feature is automatically treated as categorical. A feature with fewer unique values than this threshold is changed to a categorical feature.

outputFileName

The name of the file containing all the plots.

outputPath

A file path where the plots should be saved as a PDF document. If left blank, all plots are displayed in R instead.
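
The preprocessing described by 'sample', 'seed', 'nrUniques' and 'handleMissing' can be illustrated with a minimal base R sketch. This is not the package's own code: prepare_sketch() and the toy data frame are hypothetical, and the sketch assumes that 'sample' is a proportion of rows. It does not cover outlier clipping, the 'maxLevels' filter, or the removal of missing observations when 'handleMissing' is FALSE.

```r
# Hypothetical helper, NOT part of visualizeR. It only mimics the behaviour
# documented for sample, seed, nrUniques and handleMissing.
prepare_sketch <- function(df, outcome, sample = 0.3, seed = 1234,
                           nrUniques = 20, handleMissing = TRUE) {
  # Reproducible random sample of rows (assumption: `sample` is a proportion).
  set.seed(seed)
  keep <- sort(sample.int(nrow(df), size = ceiling(sample * nrow(df))))
  df <- df[keep, , drop = FALSE]

  for (col in setdiff(names(df), outcome)) {
    x <- df[[col]]

    # Numeric features with fewer than nrUniques distinct values
    # are treated as categorical.
    if (is.numeric(x) && length(unique(x[!is.na(x)])) < nrUniques) {
      x <- as.character(x)
    }

    if (handleMissing) {
      if (is.numeric(x)) {
        # Median imputation for continuous features.
        x[is.na(x)] <- median(x, na.rm = TRUE)
      } else {
        # Explicit 'Missing' level for categorical features.
        x <- as.character(x)
        x[is.na(x)] <- "Missing"
      }
    }

    # Store categorical features as factors, continuous ones as numeric.
    df[[col]] <- if (is.numeric(x)) x else factor(x)
  }
  df
}

# Toy data, invented for illustration only.
toy <- data.frame(
  Target = c("a", "b", "a", "b", "a", "b"),
  grade  = c(1, 2, NA, 1, 2, 1),              # only two distinct values
  score  = c(0.5, NA, 2.5, 3.5, 4.5, 5.5),    # continuous
  colour = c("red", NA, "blue", "red", "blue", "red"),
  stringsAsFactors = FALSE
)

prepared <- prepare_sketch(toy, "Target", sample = 1, nrUniques = 3)
str(prepared)
```

In this sketch 'grade' becomes a factor because it has fewer than three distinct values, the missing 'score' is replaced by the median of the observed scores, and the missing 'colour' becomes the level 'Missing'.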

Author(s)

Xander Horn

Examples

# EXAMPLE 1: plots are displayed in R because outputPath is left blank
library(visualizeR)
library(datasets)
train <- data.frame(iris)
visualizeR(df = train,
           Outcome = 'Species',
           nrBins = 30,
           sample = 1,
           clipOutliers = TRUE,
           CatChartType = 'stackedHist',
           NumChartType = 'boxPlot')

# EXAMPLE 2: summary statistics are printed and plots are saved to a PDF
visualizeR(df = train,
           Outcome = 'Species',
           nrBins = 30,
           sample = 1,
           clipOutliers = TRUE,
           CatChartType = 'Confusion',
           NumChartType = 'stackedHist',
           summaryStats = TRUE,
           outputPath = 'C:/Users/User/Documents',
           outputFileName = 'IrisExploratoryDataAnalysis')

XanderHorn/visualizeR documentation built on May 9, 2019, 11:05 p.m.