cleanse.data.frame: Cleansing the dataset for classification modeling

View source: R/preprocess.R

cleanse.data.frameR Documentation

Cleansing the dataset for classification modeling

Description

The cleanse() cleanse the dataset for classification modeling

Usage

## S3 method for class 'data.frame'
cleanse(
  .data,
  uniq = TRUE,
  uniq_thres = 0.1,
  char = TRUE,
  missing = FALSE,
  verbose = TRUE,
  ...
)

cleanse(.data, ...)

Arguments

.data

a data.frame or a tbl_df.

uniq

logical. Set whether to remove the variables whose unique value is one.

uniq_thres

numeric. Set a threshold to removing variables when the ratio of unique values(number of unique values / number of observation) is greater than the set value.

char

logical. Set the change the character to factor.

missing

logical. Set whether to removing variables including missing value

verbose

logical. Set whether to echo information to the console at runtime.

...

further arguments passed to or from other methods.

Details

This function is useful when fit the classification model. This function does the following.: Remove the variable with only one value. And remove variables that have a unique number of values relative to the number of observations for a character or categorical variable. In this case, it is a variable that corresponds to an identifier or an identifier. And converts the character to factor.

Value

An object of data.frame or train_df. and return value is an object of the same type as the .data argument.

Examples

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

year <- "2018"

set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

dat <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(dat)

# cleansing dataset
newDat <- cleanse(dat)

# structure of cleansing dataset
str(newDat)

# cleansing dataset
newDat <- cleanse(dat, uniq = FALSE)

# structure of cleansing dataset
str(newDat)

# cleansing dataset
newDat <- cleanse(dat, uniq_thres = 0.3)

# structure of cleansing dataset
str(newDat)

# cleansing dataset
newDat <- cleanse(dat, char = FALSE)

# structure of cleansing dataset
str(newDat)


alookr documentation built on May 29, 2024, 10:38 a.m.