prep_outliers: Outliers Data Preparation


View source: R/outliers.R

Description

Deal with outliers by setting them to NA or by 'stopping' them at a certain threshold. Three methods are supported for flagging values as outliers: "bottom_top", "tukey" and "hampel". The parameters 'top_percent' and/or 'bottom_percent' are used only when method="bottom_top".

For a full reference, please check the official documentation at: https://livebook.datascienceheroes.com/data-preparation.html#treatment_outliers

Setting NA is recommended when doing statistical analysis (parameter type='set_na'). Stopping is recommended when creating a predictive model, so that outliers do not bias the result (parameter type='stop').

The function can take a data frame, in which case it returns the same data with the transformations specified in the input parameters applied. It can also take a single vector (through the same 'data' parameter), in which case it returns a vector.
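The difference between the two treatment types can be illustrated with a minimal base R sketch. The helper `treat_sketch` and its `bottom`/`top` arguments are hypothetical names introduced here for illustration; this is not the package's implementation, only a picture of what 'stop' and 'set_na' mean once thresholds are known.

```r
# Hypothetical helper (not part of funModeling): applies one of the two
# treatment types to a numeric vector, given already-computed thresholds.
treat_sketch <- function(x, bottom, top, type = c("stop", "set_na")) {
  type <- match.arg(type)
  low  <- !is.na(x) & x < bottom   # values below the bottom threshold
  high <- !is.na(x) & x > top      # values above the top threshold
  if (type == "stop") {
    x[low]  <- bottom              # cap at the bottom threshold
    x[high] <- top                 # cap at the top threshold
  } else {
    x[low | high] <- NA            # discard flagged values
  }
  x
}

x <- c(-50, 1, 2, 3, 4, 5, 200)
treat_sketch(x, bottom = 0, top = 10, type = "stop")    # 0 1 2 3 4 5 10
treat_sketch(x, bottom = 0, top = 10, type = "set_na")  # NA 1 2 3 4 5 NA
```

With 'stop' the vector keeps its length and has no missing values, which suits model training; with 'set_na' the flagged values drop out of summaries such as the mean.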

Usage

prep_outliers(
  data,
  input = NA,
  type = NA,
  method = NA,
  bottom_percent = NA,
  top_percent = NA,
  k_mad_value = NA
)

Arguments

data

a data frame or a single vector. If it's a data frame, the function returns a data frame, otherwise it returns a vector.

input

string with the name of the input variable, or a vector of variable names (if empty, it runs for all numeric variables).

type

can be 'stop' or 'set_na'. With 'stop', all values falling outside the threshold are converted to the threshold; with 'set_na', those values are set to NA.

method

indicates the method used to flag the outliers, it can be: "bottom_top", "tukey" or "hampel".

bottom_percent

value from 0 to 1, represents the lowest X percentage of values to treat. Valid only when method="bottom_top".

top_percent

value from 0 to 1, represents the highest X percentage of values to treat. Valid only when method="bottom_top".

k_mad_value

only used when method='hampel'. It is 3 by default, which might seem quite restrictive. Set a higher number to flag fewer outliers.
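To make the three 'method' options more concrete, the following base R sketch computes a bottom and top threshold for each one. It rests on assumptions rather than on the package source: that "bottom_top" uses plain quantiles, that "tukey" uses fences at 3 times the interquartile range, and that "hampel" uses the median plus or minus k_mad_value times the scaled MAD from stats::mad. The last two assumptions are consistent with the thresholds printed in the example output of this page. The helper name `thresholds_sketch` is hypothetical.

```r
# Hypothetical helper illustrating threshold rules; NOT funModeling's source.
thresholds_sketch <- function(x, method = c("bottom_top", "tukey", "hampel"),
                              bottom_percent = NA, top_percent = NA,
                              k_mad_value = 3) {
  method <- match.arg(method)
  x <- x[!is.na(x)]
  if (method == "bottom_top") {
    # Assumption: a missing percentage means that side is left untouched.
    bottom <- if (is.na(bottom_percent)) -Inf else
      quantile(x, bottom_percent, names = FALSE)
    top <- if (is.na(top_percent)) Inf else
      quantile(x, 1 - top_percent, names = FALSE)
  } else if (method == "tukey") {
    # Assumption: fences at 3 * IQR beyond the quartiles.
    q   <- quantile(x, c(0.25, 0.75), names = FALSE)
    iqr <- q[2] - q[1]
    bottom <- q[1] - 3 * iqr
    top    <- q[2] + 3 * iqr
  } else {
    # Assumption: median +/- k * MAD, with mad() scaled by its default constant.
    bottom <- median(x) - k_mad_value * mad(x)
    top    <- median(x) + k_mad_value * mad(x)
  }
  c(bottom_threshold = bottom, top_threshold = top)
}

x <- 1:9
thresholds_sketch(x, "tukey")    # quartiles 3 and 7, IQR 4: thresholds -9 and 19
thresholds_sketch(x, "hampel")   # median 5, scaled MAD about 2.97
thresholds_sketch(1:101, "bottom_top", top_percent = 0.01)  # top threshold 100
```

Raising k_mad_value widens the Hampel interval, which is why a higher number flags fewer values.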

Value

A data frame with the desired outlier transformation, or a vector when 'data' is a single vector.

Examples

## Not run: 
library(funModeling) # provides prep_outliers, heart_disease and the *_outlier helpers
# Creating data frame with outliers
set.seed(10)
df=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000))
df=rbind(df, 1135, 2432) # forcing outliers
df$id=as.character(seq_len(1002))

# for var1: mean is ~ 4.57, and max 2432
summary(df)

########################################################
### PREPARING OUTLIERS FOR DESCRIPTIVE STATISTICS
########################################################

#### EXAMPLE 1: Removing top 1% for a single variable
# checking the value for the top 1% of highest values (percentile 0.99), which is ~ 7.05
quantile(df$var1, 0.99)

# Setting type='set_na' sets to NA the highest values, as specified by top_percent.
# In this case the 'data' parameter is a single vector, thus it returns a single vector as well.
var1_treated=prep_outliers(data = df$var1, type='set_na', top_percent  = 0.01,method = "bottom_top")

# now the mean (~ 0.94) is more representative, and note that the 1st quartile,
#  median and 3rd quartile remain very similar to the original variable.
summary(var1_treated)

#### EXAMPLE 2: Removing top and bottom 1% for the specified input variables.
vars_to_process=c('var1', 'var2')
df_treated3=prep_outliers(data = df, input = vars_to_process, type='set_na',
 bottom_percent = 0.01, top_percent  = 0.01, method = "bottom_top")
summary(df_treated3)

########################################################
### PREPARING OUTLIERS FOR PREDICTIVE MODELING
########################################################

data_prep_h=funModeling::prep_outliers(data = heart_disease,
input = c('age','resting_blood_pressure'),
 method = "hampel",  type='stop')

# Using Hampel method to flag outliers:
summary(heart_disease$age);summary(data_prep_h$age)
# the min changed from 29 to 29.31, and the max remains the same at 77
hampel_outlier(heart_disease$age) # checking the thresholds

data_prep_a=funModeling::prep_outliers(data = heart_disease,
input = c('age','resting_blood_pressure'),
 method = "tukey",  type='stop')

max(heart_disease$age);max(data_prep_a$age)
# remains the same (77) because the max threshold for age is 100
tukey_outlier(heart_disease$age)


## End(Not run)

Example output

Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: 'Hmisc'

The following objects are masked from 'package:base':

    format.pval, units

funModeling v.1.7 :)
Examples and tutorials at livebook.datascienceheroes.com

      var1                var2                id           
 Min.   :   0.0000   Min.   :  -3.2282   Length:1002       
 1st Qu.:   0.0989   1st Qu.:  -0.6304   Class :character  
 Median :   0.4455   Median :  -0.0352   Mode  :character  
 Mean   :   4.5666   Mean   :   3.5512                     
 3rd Qu.:   1.3853   3rd Qu.:   0.6242                     
 Max.   :2432.0000   Max.   :2432.0000                     
     99% 
7.052883 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
0.000003 0.095676 0.438830 0.940909 1.326450 6.794558       11 
      var1               var2                id           
 Min.   :0.000135   Min.   :-2.323164   Length:1002       
 1st Qu.:0.103132   1st Qu.:-0.620509   Class :character  
 Median :0.445110   Median :-0.035221   Mode  :character  
 Mean   :0.950500   Mean   :-0.004297                     
 3rd Qu.:1.344497   3rd Qu.: 0.605961                     
 Max.   :6.794558   Max.   : 2.280367                     
 NA's   :21         NA's   :21                            
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  29.00   48.00   56.00   54.44   61.00   77.00 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  29.31   48.00   56.00   54.44   61.00   77.00 
bottom_threshold    top_threshold 
         29.3132          82.6868 
[1] 77
[1] 77
bottom_threshold    top_threshold 
               9              100 
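The thresholds printed above for heart_disease$age can be cross-checked by hand against the printed summary. The sketch below only does arithmetic on numbers copied from this output. Reading them as "quartiles plus or minus 3 times the IQR" and "median plus or minus 3 times a scaled MAD" is an interpretation that fits the numbers, not a statement taken from the package source.

```r
# Values copied from the printed summary of heart_disease$age
q1  <- 48
med <- 56
q3  <- 61

# Tukey-style fences at 3 * IQR reproduce the printed thresholds 9 and 100
iqr <- q3 - q1
tukey_check <- c(bottom = q1 - 3 * iqr, top = q3 + 3 * iqr)
tukey_check

# The printed Hampel thresholds are symmetric around the median
hampel_printed <- c(bottom = 29.3132, top = 82.6868)
centre     <- mean(hampel_printed)                  # approximately 56
half_width <- unname(diff(hampel_printed)) / 2      # approximately 26.69
half_width / 3                                      # approximately 8.90
```

Because the minimum age (29) lies just below the printed bottom threshold, type='stop' moves it up to 29.31, while the maximum (77) is inside both intervals and stays unchanged.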

funModeling documentation built on July 1, 2020, 5:40 p.m.