hugo_clean_data: Clean your data


Description

This function fills missing values with the median for numeric variables and with the mode for categorical variables (factors). In addition, outliers in numeric variables are replaced according to the IQR rule, and rare levels of factors are merged into an 'Other' level.
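The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper names `impute_median`, `impute_mode` and `cap_outliers_iqr` are hypothetical, and the order of the steps (imputation first, then capping) and the treatment of low outliers (capped at Q1 - 1.5 * IQR) are assumptions that the description does not confirm.

```r
# Hypothetical helpers for illustration; they are not part of the hugo package.

# Replace missing values in a numeric vector with its median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Replace missing values in a factor with its most frequent level.
impute_mode <- function(f) {
  counts <- table(f)                        # table() ignores NA by default
  f[is.na(f)] <- names(counts)[which.max(counts)]
  f
}

# Pull values outside the IQR fences back to the nearest fence.
cap_outliers_iqr <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

x <- c(1, 2, 3, 4, NA, 100)
f <- factor(c("a", "b", "a", NA))

x_filled <- impute_median(x)           # NA becomes the median, 3
x_capped <- cap_outliers_iqr(x_filled) # 100 is pulled back to the upper fence
f_filled <- impute_mode(f)             # NA becomes the most frequent level, "a"
```

The quartiles here come from `quantile()` with its default method; whether the package uses the same quantile definition is also an assumption.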

Usage

hugo_clean_data(data, prop = 0.01, num_to_fac_amount = 5)

Arguments

data

data.frame to clean

prop

proportion of occurrences of a level in a categorical variable that decides which levels are rare

num_to_fac_amount

numeric columns with fewer than num_to_fac_amount unique values are treated as factors. You can disable this option by setting this parameter to 0.
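To make the roles of `prop` and `num_to_fac_amount` concrete, here is a rough base R sketch. The helpers `merge_rare_levels` and `num_to_factor` are hypothetical names, and the strict "less than `prop`" comparison is an assumption; the package may handle the boundary case differently.

```r
# Hypothetical helpers for illustration; they are not the hugo implementation.

# Merge levels whose share of observations is below `prop` into "Other".
merge_rare_levels <- function(f, prop = 0.01) {
  share <- prop.table(table(f))        # relative frequency of each level
  rare <- names(share)[share < prop]   # assumption: strict comparison
  out <- as.character(f)
  out[out %in% rare] <- "Other"
  factor(out)
}

# Treat a numeric vector with few distinct values as a factor.
num_to_factor <- function(x, num_to_fac_amount = 5) {
  n_unique <- length(unique(x[!is.na(x)]))
  if (is.numeric(x) && n_unique < num_to_fac_amount) factor(x) else x
}

f <- factor(c(rep("a", 60), rep("b", 39), "c"))
merged <- merge_rare_levels(f, prop = 0.05)   # "c" (1% of rows) becomes "Other"

gear <- c(3, 4, 5, 4, 3)                      # only 3 distinct values
as_factor <- num_to_factor(gear)              # converted to a factor
unchanged <- num_to_factor(gear, 0)           # 0 disables the conversion
```

With `num_to_fac_amount = 0` no column can have fewer than zero unique values, which is why that setting switches the conversion off.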

Value

data.frame that has been cleaned

Author(s)

Eliza Kaczorek

Examples

## Not run: 
# Dataset in base R: airquality
# There are 44 missing values
sum(is.na(airquality))

hugo_clean_data(airquality)
# The data was cleaned.

# Two original rows from data:

# Ozone Solar.R  Wind Temp Month Day
#     8      19  20.1   61     5   9
#    NA      NA  14.3   56     5   5

# After cleaning:

# Ozone Solar.R  Wind Temp Month Day
#     8      19 17.65   61     5   9
#  31.5     205 14.30   56     5   5

# We can see that the outlier in 'Wind' was
# replaced by the value Q3 + 1.5 * IQR for this column.
# Missing values were replaced with medians.

## End(Not run)
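The replacement values shown in the example can be checked by hand with base R alone. The snippet below does not call `hugo_clean_data()`; it only recomputes the upper IQR fence for 'Wind' and the medians of the two columns with missing values, assuming the default quantile method of `quantile()`.

```r
# Base R only; this reproduces the arithmetic, not the package's code path.
q <- quantile(airquality$Wind, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])   # Q3 + 1.5 * IQR
upper_fence                                 # 17.65

ozone_median <- median(airquality$Ozone, na.rm = TRUE)
solar_median <- median(airquality$Solar.R, na.rm = TRUE)
ozone_median                                # 31.5
solar_median                                # 205
```

These match the cleaned values 17.65, 31.5 and 205 in the rows shown in the example.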

pbiecek/hugo documentation built on May 12, 2019, 6:24 p.m.