Deal with outliers by setting them to NA or by 'stopping' them at a certain threshold value. Three methods are supported for flagging values as outliers: "bottom_top", "tukey" and "hampel". The parameters 'top_percent' and/or 'bottom_percent' are used only when method="bottom_top".
For a full reference, please check the official documentation at: https://livebook.datascienceheroes.com/data-preparation.html#treatment_outliers
Setting NA is recommended when doing statistical analysis (parameter: type='set_na'). Stopping is recommended when creating a predictive model, so that outliers do not bias the result (parameter: type='stop').
The function can take a data frame, in which case it returns the same data with the specified transformations applied to the variables given in the 'input' parameter. It can also take a single vector (through the same 'data' parameter), in which case it returns a vector.
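The difference between the two treatment types can be illustrated with a minimal base R sketch. The helper `treat_sketch` below is hypothetical and not part of funModeling; it assumes the thresholds are already known and that only values strictly outside them are treated.

```r
# Hypothetical helper (not part of funModeling): applies one of the two
# treatment types to a numeric vector, given thresholds chosen beforehand.
treat_sketch <- function(x, bottom, top, type = c("stop", "set_na")) {
  type <- match.arg(type)
  is_low  <- !is.na(x) & x < bottom   # assumption: strict comparison
  is_high <- !is.na(x) & x > top
  if (type == "stop") {
    # 'stop': pull outlying values back to the nearest threshold
    x[is_low]  <- bottom
    x[is_high] <- top
  } else {
    # 'set_na': discard outlying values by marking them as missing
    x[is_low | is_high] <- NA
  }
  x
}

x <- c(1, 2, 3, 4, 100)
stopped <- treat_sketch(x, bottom = 2, top = 4, type = "stop")
with_na <- treat_sketch(x, bottom = 2, top = 4, type = "set_na")
print(stopped)
print(with_na)
```

With 'stop' the vector keeps its length and has no missing values, which suits models that cannot handle NA; with 'set_na' the extreme values simply drop out of summaries such as the mean.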
data
a data frame or a single vector. If it is a data frame, the function returns a data frame; otherwise it returns a vector.
input
string with the name(s) of the input variable(s). If empty, the function runs on all numeric variables.
type
either 'stop' or 'set_na'. With 'stop', all values falling outside the threshold are converted to the threshold; with 'set_na', those values are set to NA.
method
indicates the method used to flag the outliers. It can be "bottom_top", "tukey" or "hampel".
bottom_percent
value from 0 to 1, representing the lowest X percentage of values to treat. Valid only when method="bottom_top".
top_percent
value from 0 to 1, representing the highest X percentage of values to treat. Valid only when method="bottom_top".
k_mad_value
used only when method='hampel'. The default is 3, which may be quite restrictive; set a higher number to flag fewer outliers.
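As a rough illustration of how the three methods could derive their thresholds, here is a base R sketch. The helper `thresholds_sketch` is hypothetical, not funModeling code. The 3 x IQR fences for "tukey" and the use of `mad()` with its default scaling for "hampel" are assumptions chosen to agree with the threshold values printed in the example output on this page; the package itself may differ in details.

```r
# Hypothetical helper (not part of funModeling): returns the lower and upper
# thresholds that each flagging method would use under the stated assumptions.
thresholds_sketch <- function(x, method = c("bottom_top", "tukey", "hampel"),
                              bottom_percent = NULL, top_percent = NULL,
                              k_mad_value = 3) {
  method <- match.arg(method)
  x <- x[!is.na(x)]
  if (method == "bottom_top") {
    # Percentile cut-offs; a side that is not requested stays unbounded.
    bottom <- if (is.null(bottom_percent)) -Inf else unname(quantile(x, bottom_percent))
    top    <- if (is.null(top_percent)) Inf else unname(quantile(x, 1 - top_percent))
  } else if (method == "tukey") {
    # Assumption: fences at 3 * IQR beyond the quartiles.
    q   <- unname(quantile(x, c(0.25, 0.75)))
    iqr <- q[2] - q[1]
    bottom <- q[1] - 3 * iqr
    top    <- q[2] + 3 * iqr
  } else {
    # Assumption: median +/- k_mad_value * MAD (stats::mad default scaling).
    bottom <- median(x) - k_mad_value * mad(x)
    top    <- median(x) + k_mad_value * mad(x)
  }
  c(bottom_threshold = bottom, top_threshold = top)
}

x <- c(1:9, 500)  # nine ordinary values and one extreme value
print(thresholds_sketch(x, "tukey"))
print(thresholds_sketch(x, "hampel"))
print(thresholds_sketch(x, "hampel", k_mad_value = 5))  # wider band, fewer outliers
print(thresholds_sketch(x, "bottom_top", top_percent = 0.1))
```

Raising `k_mad_value` widens the Hampel band symmetrically around the median, which is why a higher number flags fewer values.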
A data frame with the desired outlier transformation, or a vector when 'data' is a single vector.
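The "data frame in, data frame out; vector in, vector out" behaviour can be sketched as follows. `apply_to_columns_sketch` and `cap_at_10` are hypothetical names used only for illustration; the sketch assumes that an empty 'input' means "all numeric columns", as described in the arguments.

```r
# Hypothetical helper (not part of funModeling): applies a treatment function
# either to a single vector or to selected columns of a data frame.
apply_to_columns_sketch <- function(data, input = NULL, treat) {
  # A plain vector is treated directly and returned as a vector.
  if (!is.data.frame(data)) return(treat(data))
  # With no 'input', fall back to every numeric column.
  if (is.null(input)) {
    input <- names(data)[vapply(data, is.numeric, logical(1))]
  }
  data[input] <- lapply(data[input], treat)
  data  # same rows and columns, only the selected ones transformed
}

cap_at_10 <- function(x) pmin(x, 10)  # stand-in for a real outlier treatment

df <- data.frame(a = c(1, 50), b = c(20, 2), id = c("x", "y"))
out_df  <- apply_to_columns_sketch(df, treat = cap_at_10)
out_vec <- apply_to_columns_sketch(c(5, 30), treat = cap_at_10)
print(out_df)
print(out_vec)
```

Non-numeric columns such as `id` pass through untouched, so the result can replace the original data frame directly.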
## Not run:
# Creating data frame with outliers
set.seed(10)
df=data.frame(var1=rchisq(1000,df = 1), var2=rnorm(1000))
df=rbind(df, 1135, 2432) # forcing outliers
df$id=as.character(seq_len(nrow(df)))
# for var1: mean is ~ 4.56, and max 2432
summary(df)
########################################################
### PREPARING OUTLIERS FOR DESCRIPTIVE STATISTICS
########################################################
#### EXAMPLE 1: Removing top 1% for a single variable
# checking the value for the top 1% of highest values (percentile 0.99), which is ~ 7.05
quantile(df$var1, 0.99)
# Setting type='set_na' sets to NA the values above the threshold given by top_percent.
# In this case the 'data' parameter is a single vector, thus it returns a single vector as well.
var1_treated=prep_outliers(data = df$var1, type='set_na', top_percent = 0.01,method = "bottom_top")
# now the mean (~ 0.94) is more representative, and note that the 1st quartile,
# median and 3rd quartile remain very similar to those of the original variable.
summary(var1_treated)
#### EXAMPLE 2: Removing top and bottom 1% for the specified input variables.
vars_to_process=c('var1', 'var2')
df_treated3=prep_outliers(data = df, input = vars_to_process, type='set_na',
bottom_percent = 0.01, top_percent = 0.01, method = "bottom_top")
summary(df_treated3)
########################################################
### PREPARING OUTLIERS FOR PREDICTIVE MODELING
########################################################
data_prep_h=funModeling::prep_outliers(data = heart_disease,
input = c('age','resting_blood_pressure'),
method = "hampel", type='stop')
# Using Hampel method to flag outliers:
summary(heart_disease$age);summary(data_prep_h$age)
# the min changed from 29 to 29.31, and the max remains the same at 77
hampel_outlier(heart_disease$age) # checking the thresholds
data_prep_a=funModeling::prep_outliers(data = heart_disease,
input = c('age','resting_blood_pressure'),
method = "tukey", type='stop')
max(heart_disease$age);max(data_prep_a$age)
# remains the same (77) because the top threshold for age is 100
tukey_outlier(heart_disease$age)
## End(Not run)
      var1                 var2                 id
 Min.   :   0.0000   Min.   :  -3.2282   Length:1002
 1st Qu.:   0.0989   1st Qu.:  -0.6304   Class :character
 Median :   0.4455   Median :  -0.0352   Mode  :character
 Mean   :   4.5666   Mean   :   3.5512
 3rd Qu.:   1.3853   3rd Qu.:   0.6242
 Max.   :2432.0000   Max.   :2432.0000
99%
7.052883
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
0.000003 0.095676 0.438830 0.940909 1.326450 6.794558       11
      var1                var2                 id
 Min.   :0.000135   Min.   :-2.323164   Length:1002
 1st Qu.:0.103132   1st Qu.:-0.620509   Class :character
 Median :0.445110   Median :-0.035221   Mode  :character
 Mean   :0.950500   Mean   :-0.004297
 3rd Qu.:1.344497   3rd Qu.: 0.605961
 Max.   :6.794558   Max.   : 2.280367
 NA's   :21         NA's   :21
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  29.00   48.00   56.00   54.44   61.00   77.00
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  29.31   48.00   56.00   54.44   61.00   77.00
bottom_threshold    top_threshold
         29.3132          82.6868
[1] 77
[1] 77
bottom_threshold    top_threshold
               9              100
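The printed thresholds can be cross-checked against the summary statistics of `heart_disease$age` shown in the output. The short base R check below is only a sketch: it assumes Tukey fences at 3 x IQR and Hampel thresholds at median +/- 3 x MAD, assumptions that happen to reproduce the printed numbers but are not confirmed by the package source here.

```r
# Values copied from the printed summary of heart_disease$age.
q1 <- 48; q3 <- 61; med <- 56
iqr <- q3 - q1

# Assumption: Tukey fences at 3 * IQR beyond the quartiles.
tukey <- c(bottom_threshold = q1 - 3 * iqr, top_threshold = q3 + 3 * iqr)

# Assumption: Hampel thresholds are median +/- 3 * MAD, so the MAD implied by
# the printed top threshold is (82.6868 - 56) / 3.
implied_mad <- (82.6868 - med) / 3

# type = 'stop' acts like clamping to the thresholds: the minimum age (29) lies
# below 29.3132 and is raised to it, while the maximum (77) is inside the band.
hampel  <- c(bottom_threshold = 29.3132, top_threshold = 82.6868)
clamped <- pmin(pmax(c(29, 77), hampel[["bottom_threshold"]]),
                hampel[["top_threshold"]])

print(tukey)
print(implied_mad)
print(clamped)
```

Under these assumptions the Tukey fences come out as 9 and 100, matching the output, and the clamped minimum matches the 29.31 reported by `summary(data_prep_h$age)`.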