cleanse.data.frame: Cleansing the dataset for classification modeling
In alookr: Model Classifier for Binary Classification

Description

cleanse() cleanses the dataset for classification modeling.

Usage

## S3 method for class 'data.frame'
cleanse(
  .data,
  uniq = TRUE,
  uniq_thres = 0.1,
  char = TRUE,
  missing = FALSE,
  verbose = TRUE,
  ...
)

cleanse(.data, ...)

Arguments

`.data`	a data.frame or a `tbl_df`.
`uniq`	logical. Whether to remove variables that have only one unique value.
`uniq_thres`	numeric. Threshold for removing variables: a variable is removed when its ratio of unique values (number of unique values / number of observations) is greater than this value.
`char`	logical. Whether to convert character variables to factors.
`missing`	logical. Whether to remove variables that contain missing values.
`verbose`	logical. Set whether to echo information to the console at runtime.
`...`	further arguments passed to or from other methods.
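
The ratio that `uniq_thres` is compared against can be computed by hand. The following is a minimal base R sketch, not code from alookr; the `toy` data frame and its column names are hypothetical and exist only for illustration.

```r
# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  id   = paste0("k", 1:40),            # 40 distinct values (identifier-like)
  grp  = rep(c("a", "b"), each = 20),  # 2 distinct values
  year = "2018",                       # 1 distinct value
  stringsAsFactors = FALSE
)

# Ratio = number of unique values / number of observations
uniq_ratio <- sapply(toy, function(x) length(unique(x)) / nrow(toy))
uniq_ratio

# Variables whose ratio is greater than the default threshold of 0.1
names(uniq_ratio)[uniq_ratio > 0.1]
```

Only `id` exceeds the threshold here: its ratio is 1, while `grp` and `year` have ratios of 0.05 and 0.025.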

Details

This function is useful when fitting a classification model. It does the following: removes variables that have only one unique value; removes character or categorical variables whose number of unique values is high relative to the number of observations (such variables typically correspond to identifiers); and converts character variables to factors.
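
The steps described above can be approximated in base R. This is a rough sketch under stated assumptions, not the alookr implementation: `cleanse_sketch()` is a hypothetical helper, it assumes that `uniq` switches both unique-value checks on or off, and it leaves out the `missing` and `verbose` arguments.

```r
# Hypothetical helper that mimics the documented steps; NOT alookr's code.
cleanse_sketch <- function(df, uniq = TRUE, uniq_thres = 0.1, char = TRUE) {
  n <- nrow(df)

  if (uniq) {
    n_uniq <- vapply(df, function(x) length(unique(x)), integer(1))
    is_cat <- vapply(df, function(x) is.character(x) || is.factor(x),
                     logical(1))
    # Drop constant variables, and character/categorical variables
    # whose unique-value ratio exceeds the threshold (identifier-like).
    drop <- n_uniq == 1L | (is_cat & n_uniq / n > uniq_thres)
    df <- df[, !drop, drop = FALSE]
  }

  if (char) {
    is_chr <- vapply(df, is.character, logical(1))
    # Convert the remaining character variables to factors.
    if (any(is_chr)) df[is_chr] <- lapply(df[is_chr], factor)
  }

  df
}

# Hypothetical toy data for illustration.
toy <- data.frame(
  id    = paste0("k", 1:40),
  year  = "2018",
  count = rep(1:4, times = 10),
  flag  = rep(c("Y", "N"), times = 20),
  stringsAsFactors = FALSE
)

res <- cleanse_sketch(toy)
str(res)
```

In this sketch `id` (ratio 1) and `year` (one unique value) are dropped, `count` is kept because it is numeric, and `flag` is converted to a factor.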

Value

A data.frame or train_df object. The return value has the same class as the .data argument.

Examples

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))

year <- "2018"

set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)

set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)

set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

dat <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)
# structure of dataset
str(dat)

# cleanse the dataset with the default settings
newDat <- cleanse(dat)

# structure of the cleansed dataset
str(newDat)

# cleanse without the unique-value check (uniq = FALSE)
newDat <- cleanse(dat, uniq = FALSE)

# structure of the cleansed dataset
str(newDat)

# cleanse with the unique-ratio threshold raised to 0.3
newDat <- cleanse(dat, uniq_thres = 0.3)

# structure of the cleansed dataset
str(newDat)

# cleanse without converting character variables to factors (char = FALSE)
newDat <- cleanse(dat, char = FALSE)

# structure of the cleansed dataset
str(newDat)

alookr documentation built on May 29, 2024, 10:38 a.m.