dataOverview: Automated tabular exploratory data analysis


Description

Performs automated tabular exploratory data analysis. Summary statistics are calculated per feature, common data issues are flagged, and an imputation value is calculated for each feature.

Usage

dataOverview(x, outlierMethod = "tukey", lowPercentile = 0.01,
  upPercentile = 0.99, minLevelPercentage = 0.025)

Arguments

x

[data.frame | Required] Dataset which should contain all relevant features. If x is not a data.frame object it will be converted to one.

outlierMethod

[character | Optional] Determines how outliers are identified. Two methods are available: "tukey" and "percentile". When using percentile-based outlier detection, it is recommended to set the lower and upper percentile values manually. Defaults to "tukey".

lowPercentile

[numeric | Optional] The lower percentile value that will be used to flag any values less than the calculated percentile as lower outliers. Recommended to set values between 0.01 and 0.05. Defaults to 0.01.

upPercentile

[numeric | Optional] The upper percentile value that will be used to flag any values greater than the calculated percentile as upper outliers. Recommended to set values between 0.95 and 0.99. Defaults to 0.99.

minLevelPercentage

[numeric | Optional] The minimum proportion of the data that each level of a categorical feature should represent. Categorical features should ideally have levels that each contain an adequate share of the data; levels with low proportions usually call for data cleaning. If a categorical feature has levels whose proportions fall below the specified value, these levels are used to determine the imputation value. If the cumulative proportion of these low-frequency levels is less than the specified minimum, the imputation value is simply the mode of the feature; otherwise all low-frequency levels are combined into a new level called ALL_OTHER. Defaults to 0.025.
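The minLevelPercentage rule described above can be sketched in base R. This is only an illustration of one reading of that description, not the package's implementation: the helper name suggestLevelImputation is hypothetical, and falling back to the mode when no level is below the threshold is an assumption.

```r
# Hypothetical helper (not part of autoEDA): suggests an imputation value for a
# categorical feature, following one reading of the minLevelPercentage rule.
suggestLevelImputation <- function(f, minLevelPercentage = 0.025) {
  # Proportion of non-missing observations per level (table() drops NA).
  props <- prop.table(table(f))
  modeLevel <- names(props)[which.max(props)]

  # Levels whose share of the data falls below the threshold.
  rare <- props[props < minLevelPercentage]

  # Assumption: with no rare levels, or when the rare levels together still
  # fall below the threshold, the mode is used.
  if (length(rare) == 0 || sum(rare) < minLevelPercentage) {
    return(modeLevel)
  }

  # Otherwise the rare levels would be pooled into a single new level.
  "ALL_OTHER"
}

# Rare levels "c" and "d" together cover 2% of rows, below 2.5%.
fewRare <- c(rep("a", 90), rep("b", 8), "c", "d")
suggestLevelImputation(fewRare)

# Rare levels "c", "d" and "e" together cover 4% of rows, above 2.5%.
manyRare <- c(rep("a", 90), rep("b", 6), "c", "c", "d", "e")
suggestLevelImputation(manyRare)
```

Under these assumptions the first call suggests the mode and the second suggests ALL_OTHER; the values actually reported by dataOverview may be derived differently.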

Value

Object of type data.frame containing exploratory information for all features passed in x.

Author(s)

Xander Horn

Examples

# Tukey outlier detection example:
overview <- dataOverview(x = iris,
                         outlierMethod = "tukey",
                         minLevelPercentage = 0.025)

# Percentile outlier detection example:
overview <- dataOverview(x = iris,
                         outlierMethod = "percentile",
                         lowPercentile = 0.025,
                         upPercentile = 0.975,
                         minLevelPercentage = 0.025)
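
To make the difference between the two outlierMethod settings concrete, the following base R sketch computes both kinds of bounds for a single numeric vector. It is an illustration under stated assumptions rather than the code dataOverview runs: the helper name flagOutliers is hypothetical, and the 1.5 * IQR multiplier is the conventional Tukey fence, which this page does not confirm.

```r
# Hypothetical helper (not part of autoEDA): flags lower and upper outliers in
# a numeric vector using either Tukey fences or percentile bounds.
flagOutliers <- function(v, method = c("tukey", "percentile"),
                         lowPercentile = 0.01, upPercentile = 0.99) {
  method <- match.arg(method)

  if (method == "tukey") {
    # Assumed convention: fences lie 1.5 * IQR beyond the quartiles.
    q <- quantile(v, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    lower <- q[1] - 1.5 * iqr
    upper <- q[2] + 1.5 * iqr
  } else {
    # Percentile bounds are read directly from the empirical quantiles.
    bounds <- quantile(v, probs = c(lowPercentile, upPercentile),
                       na.rm = TRUE, names = FALSE)
    lower <- bounds[1]
    upper <- bounds[2]
  }

  list(lower = lower,
       upper = upper,
       isLowerOutlier = !is.na(v) & v < lower,
       isUpperOutlier = !is.na(v) & v > upper)
}

# Compare both methods on one iris feature.
tukeyFlags <- flagOutliers(iris$Sepal.Width, method = "tukey")
percentileFlags <- flagOutliers(iris$Sepal.Width, method = "percentile",
                                lowPercentile = 0.025, upPercentile = 0.975)
```

Percentile bounds flag roughly a fixed share of observations by construction, whereas Tukey fences flag only values that are far from the quartiles, which is one reason to set lowPercentile and upPercentile deliberately when using the percentile method.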

XanderHorn/autoEDA documentation built on June 21, 2019, 9:40 a.m.