refineFactors: Refine factors in two twin data frames.
In tsieger/tsiMisc: Miscelaneous utility functions, mostly related to exploratory statistical analysis

refineFactors

R Documentation

Refine factors in two twin data frames.

Description

Refine two data frames x1 and x2 to be usable as a pair of a train/test set pair in a modeling or classification task, such that a model/classifier (e.g. a binomial GLM) can be trained on the train set and can be then directly applied to the test set. Note that had the sets not been refined this way, a model trained on a train set could not have been directly applied to the test set because a new factor level (not appearing in the train set) would have appeared in it, giving no clue how to predict response using such an unknown factor level.

Usage

refineFactors(x1, x2, 
    unify = TRUE, dropSingular = TRUE, 
    naLimit = Inf, k = 5, 
    naLevelName = "(NA)", 
    verbose = FALSE, 
    debug = FALSE)

Arguments

`x1`	first data frame
`x2`	second data frame
`unify`	shall factors be unified?
`dropSingular`	shall factors having only a single level be removed?
`naLimit`	numeric columns containing more than 'naLimit' NA's (in `x1` and in `x2`) will be converted into a factor created by 'cut'-ting the numeric values into 'k' intervals, and adding a special 'naLevelName' level to hold the missing values
`k`	the number of intervals into which numeric columns having at least 'naLimit' NA's (in 'x1' and in 'x2') will be converted
`naLevelName`	the name of the special factor level used to represent missing values
`verbose`	report progress?
`debug`	if TRUE, debugs will be printed. If numeric of value greater than 1, `verbose` debugs will be produced.

Details

Usually, the refinement consists of i) making the levels of factors in individual columns in each of the sets identical, and ii) removing columns containing factors of only a single level. This first task is achieved by removing rows in which appears a factor of level not appearing in the twin data frame, and dropping unused levels from factors.

Value

A list of refined x1 and x2.

Author(s)

Tomas Sieger

Examples

# unify factor levels and remove constant factors:
x<-data.frame(x = 1:6, y = c('a','b','c','b','c','d'), z = c('d','c','c','c','c','d'))
x.train <- x[1:3, ]
x.test <- x[4:6, ]
print(x.train)
print(x.test)
refineFactors(x.train, x.test)
# Note: 'x[1,]' and 'x2[3,]' dropped because it had no counterpart
#   in the twin data frame.
# Note: 'x$z' dropped because after removal of 'x[1,]' and 'x2[3,]',
#   there was only a single factor level left, which was dropped, by default.

# unify factor levels but keep constant factors:
refineFactors(x.train, x.test, dropSingular = FALSE)
# Note: now 'x$z' is left

# convert numeric columns with many NA's into a factor
x<-data.frame(x = 1:10, y = c(NA,NA,1,2,3,NaN,NA,1,2,3), z = c(1,2,NA,4,5,1,2,NA,4,5))
x1 <- x[1:5, ]
x2 <- x[6:10, ]
print(x1)
print(x2)
refineFactors(x1, x2, naLimit=2)

# add a special 'NA' level to factors with many NA's
x<-data.frame(x = 1:10, y = factor(c(NA,NA,1,2,3,NA,NA,1,2,3)), z = factor(c(1,2,NA,4,5,1,2,NA,4,5)))
x1 <- x[1:5, ]
x2 <- x[6:10, ]
print(x1)
print(x2)
refineFactors(x1, x2, naLimit=2)

# NaN in factors differs from NA - this would behave differently:
# x<-data.frame(a = 1:10, b = factor(c(NA,NA,1,2,3,NA,NaN,1,2,3)), c=factor(c(1,2,NA,4,5,1,2,NA,4,5)))

tsieger/tsiMisc documentation built on Dec. 14, 2024, 3:23 p.m.