autoEDA: Automated visual exploratory data analysis


View source: R/autoEDA.R


Performs automated visual exploratory data analysis in a univariate or bivariate manner, calling the other functions in the package where specified. Plots are produced with the ggplot2 library, and the themes are partly inspired by the RColorBrewer library. Plots can be customized. Data cleaning options are available, since cleaning is essential before plotting. When data cleaning is not used, the plots can instead help identify areas of the data that need attention. All plots can optionally be written to a PDF file.


autoEDA(x, y = NULL, IDFeats = NULL, sampleRate = 1,
  outcomeType = "automatic", maxUniques = 15, maxLevels = 25,
  removeConstant = TRUE, removeZeroSpread = TRUE,
  removeMajorityMissing = TRUE, imputeMissing = TRUE, clipOutliers = TRUE,
  minLevelPercentage = 0.025, predictivePower = TRUE,
  outlierMethod = "tukey", lowPercentile = 0.01, upPercentile = 0.99,
  plotCategorical = "stackedBar", plotContinuous = "histogram", bins = 20,
  rotateLabels = FALSE, colorTheme = 1, theme = 2, color = "#26A69A",
  transparency = 1, outputPath = NULL, filename = "ExploratoryPlots",
  verbose = TRUE)



x: [data.frame | Required] Dataset, which should contain the outcome feature if one is specified. If x is not a data.frame object, it will be converted to one.


y: [character | Optional] The name of the outcome feature contained in the dataset specified in x. If y is NULL, a univariate analysis is performed; otherwise a bivariate analysis is performed according to the type of y. A continuous outcome is treated as a regression outcome; any other outcome is treated as a classification outcome. Defaults to NULL.


IDFeats: [character | Optional] A name or vector of names of ID features contained in the dataset. These features are removed before plotting. Defaults to NULL.


sampleRate: [numeric | Optional] A value between 0 and 1 specifying the fraction of the data to draw as a simple random sample in order to improve plotting speed. A value of 1 means that the entire dataset is used. Defaults to 1.


outcomeType: [character | Optional] The outcome type of the outcome feature specified in y. Available options are: automatic, binary (binary classification), multi (multi-class classification) and regression. For high-cardinality categorical outcomes (>= 15 levels), it is recommended to specify the outcome type manually. Defaults to automatic.


maxUniques: [integer | Optional] The maximum number of unique values a feature may have and still be treated as categorical. Features with a number of unique values <= the specified value are converted to factors; features with more unique values are converted to a numeric type. Defaults to 15.


maxLevels: [integer | Optional] The maximum number of levels a categorical feature may have before it is removed from plotting. High-cardinality categorical features can be difficult to visualize. Defaults to 25.


removeConstant: [logical | Optional] If TRUE, features containing a single unique (constant) value are removed. Defaults to TRUE.


removeZeroSpread: [logical | Optional] If TRUE, features that exhibit zero spread are removed. Zero spread is determined from the IQR of a feature and applies only to continuous and discrete features. Defaults to TRUE.


removeMajorityMissing: [logical | Optional] If TRUE, features where more than half of the observations are missing are removed. Defaults to TRUE.


imputeMissing: [logical | Optional] If TRUE, missing values are imputed. Continuous and discrete features are imputed with the median value; missing values in categorical features are replaced by a level called MISSING. Defaults to TRUE.


clipOutliers: [logical | Optional] If TRUE, outliers flagged by the specified outlier method are clipped using median replacement. Applies only to continuous and discrete features. Defaults to TRUE.


minLevelPercentage: [numeric | Optional] The minimum proportion of the data that each level of a categorical feature should represent. Categorical features should ideally have levels with adequate data proportions, and levels with low proportions call for data cleaning. Levels that fall below the specified proportion are handled as follows: if their cumulative proportion is less than the specified minimum, they are replaced by the mode of the feature; otherwise they are combined into a new level called ALL_OTHER. Defaults to 0.025.


predictivePower: [logical | Optional] Whether the predictive power of each feature should be calculated using the predictivePower function. Defaults to TRUE.


outlierMethod: [character | Optional] Determines how outliers are identified. Two methods are available: tukey and percentile. When using percentile-based outlier detection, it is recommended to set the lower and upper percentile values manually. Defaults to tukey.


lowPercentile: [numeric | Optional] The lower percentile. Values less than the calculated percentile are flagged as lower outliers. Values between 0.01 and 0.05 are recommended. Defaults to 0.01.


upPercentile: [numeric | Optional] The upper percentile. Values greater than the calculated percentile are flagged as upper outliers. Values between 0.95 and 0.99 are recommended. Defaults to 0.99.


plotCategorical: [character | Optional] Specifies the type of plot to use for categorical features. Available plot types are: bar, stackedBar, groupedBar. When using groupedBar, it is recommended to set rotateLabels to TRUE. All bar plots display relative frequencies. Only applies when a univariate analysis is performed and categorical features are present, or when a categorical outcome is specified and categorical features are present. Defaults to stackedBar.


plotContinuous: [character | Optional] Specifies the type of plot to use for continuous/discrete features. Available plot types are: boxplot, qqplot, density, histogram. When density is selected, transparency is automatically reduced. For continuous/discrete outcomes, the continuous plot type is also used when a categorical feature is encountered. Defaults to histogram.


bins: [integer | Optional] The number of bins to use when histograms are the chosen plot type. Defaults to 20.


rotateLabels: [logical | Optional] Whether x-axis labels should be rotated by 90 degrees. Defaults to FALSE.


colorTheme: [integer | Optional] Specifies the color theme to use for plots when an outcome feature has been provided. Available values range from 1 to 4. Alternatively, a vector of color names or hex codes can be provided to create a custom theme. Only applicable to bivariate analyses. Defaults to 1.


theme: [integer | Optional] Specifies the plot theme to use. Available options are 1 and 2. Defaults to 2.


color: [character | Optional] Specifies the color to use when performing univariate analyses. Defaults to "#26A69A".


transparency: [numeric | Optional] Specifies the color transparency for plots. Lower values mean more transparency; a value of 1 means no transparency. Defaults to 1.


outputPath: [character | Optional] The destination path for the PDF file containing the output plots. If the path is left as NULL, all plots are drawn in R; otherwise a valid path must be provided, and a PDF document containing all plots is created there. Defaults to NULL.


filename: [character | Optional] The filename of the PDF file that will contain the plots when outputPath is specified. Defaults to ExploratoryPlots.


verbose: [logical | Optional] Whether the function should print feedback while it runs. Defaults to TRUE.
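
The outlier options described above (tukey versus percentile detection, followed by median replacement) can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper name clip_outliers is hypothetical, the 1.5 x IQR fence multiplier is the conventional Tukey rule and is assumed here, and the median is taken over all non-missing values.

```r
# Hypothetical helper, not part of the package: flags outliers with either
# Tukey's fences or fixed percentiles, then replaces them with the median.
clip_outliers <- function(v, method = c("tukey", "percentile"),
                          low = 0.01, up = 0.99) {
  method <- match.arg(method)
  if (method == "tukey") {
    # Assumed fences: 1.5 * IQR below the first and above the third quartile.
    q <- stats::quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    spread <- q[2] - q[1]
    lower <- q[1] - 1.5 * spread
    upper <- q[2] + 1.5 * spread
  } else {
    # Percentile bounds, mirroring lowPercentile and upPercentile.
    bounds <- stats::quantile(v, c(low, up), na.rm = TRUE, names = FALSE)
    lower <- bounds[1]
    upper <- bounds[2]
  }
  flagged <- !is.na(v) & (v < lower | v > upper)
  # Median replacement of every flagged value.
  v[flagged] <- stats::median(v, na.rm = TRUE)
  v
}

values <- c(10, 11, 12, 13, 14, 15, 16, 17, 18, 500)
clipped_tukey <- clip_outliers(values, method = "tukey")
clipped_pct <- clip_outliers(values, method = "percentile")
```

With the default percentiles, the percentile method flags the extreme values at both ends of a small sample, which is one reason for setting lowPercentile and upPercentile deliberately.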

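The handling of sparse categorical levels under minLevelPercentage can be sketched in the same way. The code below reflects one reading of the argument description (rare levels are replaced by the mode when their combined share is below the threshold, and merged into ALL_OTHER otherwise); the helper name collapse_rare_levels is hypothetical and the package may differ in detail.

```r
# Hypothetical helper illustrating one reading of minLevelPercentage.
collapse_rare_levels <- function(f, min_pct = 0.025) {
  f <- as.character(f)
  share <- prop.table(table(f))
  # Levels whose individual share falls below the threshold.
  rare <- names(share)[share < min_pct]
  if (length(rare) == 0) {
    return(factor(f))
  }
  if (sum(share[rare]) < min_pct) {
    # Even combined the rare levels are negligible: use the mode.
    replacement <- names(share)[which.max(share)]
  } else {
    # Combined they are large enough to form their own level.
    replacement <- "ALL_OTHER"
  }
  f[f %in% rare] <- replacement
  factor(f)
}

# Three rare levels with a combined share of 0.04: merged into ALL_OTHER.
colours <- c(rep("red", 60), rep("blue", 36), "green", "green", "teal", "pink")
merged <- collapse_rare_levels(colours)

# One rare level with a share of 0.01: replaced by the mode ("red").
colours2 <- c(rep("red", 60), rep("blue", 39), "green")
moded <- collapse_rare_levels(colours2)
```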

A data.frame containing exploratory information and, if specified, the predictive power per feature. The output is the same as that generated by dataOverview.


Xander Horn


# Bivariate classification example:
overview <- autoEDA(x = iris,
                    y = "Species")

# Bivariate regression example:
overview <- autoEDA(x = iris,
                    y = "Sepal.Length")

# Univariate example:
overview <- autoEDA(x = iris)
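
A further example with non-default options may be useful. The sketch below uses only arguments and values listed in the usage and argument descriptions; it assumes the package is installed under the name autoEDA and skips the call otherwise.

```r
# Customized bivariate example (assumes the package is named autoEDA;
# the call is skipped when the package is not installed).
if (requireNamespace("autoEDA", quietly = TRUE)) {
  overview <- autoEDA::autoEDA(
    x = iris,
    y = "Species",
    outlierMethod = "percentile",  # percentile-based outlier detection
    lowPercentile = 0.05,          # within the recommended 0.01 to 0.05
    upPercentile = 0.95,           # within the recommended 0.95 to 0.99
    plotContinuous = "boxplot",
    rotateLabels = TRUE,
    verbose = FALSE
  )
  head(overview)
}
```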

souravbose1991/Auto-EDA documentation built on May 17, 2019, 8:21 a.m.