cleanNumericFeatures: Clean Numeric Features

Description

Cleans the numeric fields of a data set. It currently supports centering, scaling and imputation of missing values (by mean, median or a fixed value).

Usage

cleanNumericFeatures(d, features = NULL, exclude = NULL,
  num.transform = c("SCALE", "CENTER", "IMPUTE"),
  num.impute.method = "VALUE", num.impute = -1, verbose = FALSE)

Arguments

d

A data frame or data table containing the data set.

features

A character vector naming the features to process. If left NULL, all numeric fields in the data set are selected. To select features by regular expression instead, prefix the pattern with a ~ (refer to Examples). Default: NULL.
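The selection rules above can be illustrated with a minimal base R sketch. The helper `resolve_features` is hypothetical and is not part of the package; how cleanNumericFeatures resolves names internally is not documented here, so treat this only as one plausible reading of the ~ convention.

```r
# Hypothetical helper (not part of RQuant): turn a feature spec into column names.
resolve_features <- function(d, features = NULL) {
  if (is.null(features)) {
    # No spec given: fall back to every numeric column.
    return(names(d)[vapply(d, is.numeric, logical(1))])
  }
  # Entries starting with "~" are treated as regular expressions (assumption).
  is_pattern <- startsWith(features, "~")
  exact <- features[!is_pattern]
  patterns <- substring(features[is_pattern], 2)
  matched <- unlist(lapply(patterns, function(p) grep(p, names(d), value = TRUE)))
  unique(c(exact, matched))
}

df <- data.frame(ID = 1:3, NUM_FEAT_1 = c(1.5, NA, 3), SEG_CHR = c("1", "2", "3"))
all_numeric <- resolve_features(df)           # NULL spec: numeric columns only
by_regex    <- resolve_features(df, "~CHR$")  # columns whose name ends in CHR
```

Exact names and ~ patterns can be mixed in the same vector in this sketch; whether the package allows that is not stated in the documentation.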

num.transform

A character vector naming the numeric transformations to perform. Can be any combination of SCALE, CENTER and IMPUTE. Default: c("SCALE", "CENTER", "IMPUTE").
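As a rough illustration of what combining the three transformations means, the sketch below imputes first and then centers and scales, which is the order the Examples section describes. It uses base R's `scale()`, which subtracts the mean and divides by the sample standard deviation. That the package computes SCALE and CENTER the same way is an assumption.

```r
x <- c(10, 20, NA, 30)

# IMPUTE: fill the missing entry with the median of the observed values.
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# CENTER + SCALE: subtract the mean, then divide by the standard deviation.
# scale() returns a one-column matrix, so flatten it back to a plain vector.
x_scaled <- as.numeric(scale(x_imputed, center = TRUE, scale = TRUE))
```

Imputing before scaling means the fill value takes part in the mean and standard deviation, so the result has no missing values and is on a standardized scale.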

num.impute.method

A character string naming the imputation method. One of VALUE, MEAN or MEDIAN. Default: VALUE.

num.impute

A numeric value substituted for missing values (used only for value-based imputation). Default: -1.
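The three imputation methods can be sketched in base R as follows. The function `impute_numeric` is hypothetical and only mirrors the documented options; it is not the package's implementation.

```r
# Hypothetical helper (not part of RQuant): impute NAs in a numeric vector.
impute_numeric <- function(x, method = c("VALUE", "MEAN", "MEDIAN"), value = -1) {
  method <- match.arg(method)
  fill <- switch(method,
    VALUE  = value,                      # fixed replacement value
    MEAN   = mean(x, na.rm = TRUE),      # mean of the observed values
    MEDIAN = median(x, na.rm = TRUE)     # median of the observed values
  )
  x[is.na(x)] <- fill
  x
}

x <- c(1, 2, NA, 9)
by_value  <- impute_numeric(x, "VALUE", value = -1)
by_mean   <- impute_numeric(x, "MEAN")
by_median <- impute_numeric(x, "MEDIAN")
```

On this skewed vector the mean fill (4) and the median fill (2) differ noticeably, which is the usual reason to prefer MEDIAN when a feature has outliers.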

Value

A data frame or data table containing the data set with the selected numeric features transformed.

Examples

# features: prefix with ~ to match by regexp, leave NULL to select all numeric fields, or name the attributes exactly
sample.df <- data.frame(ID = floor(runif(100, 0, 10000)),
EFF_DATE = Sys.time() + runif(100, 0, 24*60*60*100),
EFF_TO = Sys.time() + runif(100, 24*60*60*100+1, 24*60*60*1000),
CUST_SEGMENT_CHR = as.character(floor(runif(100,0,10))),
STATE_NAME = ifelse(runif(100,0,1) < 0.56, 'VIC', ifelse(runif(100,0,1) < 0.44,'NSW', 'QLD')),
REVENUE = floor(rnorm(100, 500, 200)),
NUM_FEAT_1 = rnorm(100, 1000, 250),
NUM_FEAT_2 = rnorm(100, 20, 2),
NUM_FEAT_3 = floor(rnorm(100, 3, 0.5)),
NUM_FEAT_4 = floor(rnorm(100, 100, 10)),
RFM_SEGMENT = factor(x = letters[floor(runif(100,1,6))], levels = c("a","b","c","d","e")))

# Introduce some missing values
sample.df$NUM_FEAT_1 <- ifelse(sample.df$NUM_FEAT_1 > 1150, NA, sample.df$NUM_FEAT_1)
sample.df$NUM_FEAT_2 <- ifelse(sample.df$NUM_FEAT_2 < 17, NA, sample.df$NUM_FEAT_2)
sample.df$NUM_FEAT_3 <- ifelse(sample.df$NUM_FEAT_3 > 3, NA, sample.df$NUM_FEAT_3)
sample.df$NUM_FEAT_4 <- ifelse(sample.df$NUM_FEAT_4 > 110 | sample.df$NUM_FEAT_4 < 88, NA, sample.df$NUM_FEAT_4)
sample.df$CUST_SEGMENT_CHR <- ifelse(sample.df$CUST_SEGMENT_CHR == '8', NA, sample.df$CUST_SEGMENT_CHR)

# Impute all missing numeric features to -1
(cleanNumericFeatures(sample.df, num.transform = "IMPUTE", num.impute = -1))
# Only choose the CUST_SEGMENT_CHR - will convert it to a numeric
(cleanNumericFeatures(sample.df, features = "CUST_SEGMENT_CHR", num.transform = "NONE"))
(str(cleanNumericFeatures(sample.df, features = "CUST_SEGMENT_CHR", num.transform = "NONE")))
# Supports regular expressions to choose columns, so doing the exact same thing but using a regex:
(cleanNumericFeatures(sample.df, features = "~*CHR", num.transform = "NONE"))
# Supports mean-valued imputation. If features not supplied, takes all numeric features into consideration.
(cleanNumericFeatures(sample.df, num.transform = "IMPUTE", num.impute.method = "MEAN"))
# Can also do SCALE/CENTER:
(cleanNumericFeatures(sample.df, features = "~*", num.transform = c("SCALE","CENTER")))
# Can also do SCALE/CENTER and combine it with imputation. Note that imputation is ALWAYS done before SCALE/CENTER.
(cleanNumericFeatures(sample.df, features = "~*", num.transform = c("SCALE","CENTER", "IMPUTE"), num.impute.method = "MEDIAN"))

ivanliu1989/RQuant documentation built on Sept. 13, 2019, 11:53 a.m.