keysPool: Homogenize class values and create a long key by pooling...

View source: R/variableKey.R

keysPool R Documentation

Homogenize class values and create a long key by pooling variable keys.

Description

For long-format keys, this is one way to correct for errors in "class_old" or "class_new" for common variables. For a long key created by stacking together several long keys, or for a list of long keys, this will try to homogenize the classes by using a "highest common denominator" approach. If one key has x1 as a floating point, but another block of rows in the key has x1 as integer, then class must be changed to floating point (numeric). If another section of a key has x1 as a character, then character becomes the class.

Usage

keysPool(
  keylong = NULL,
  keysplit = NULL,
  classes = list(c("logical", "integer"), c("integer", "numeric"), c("ordered",
    "factor"), c("factor", "character")),
  colnames = c("class_old", "class_new"),
  textre = "TEXT$"
)

Arguments

keylong

A list of long keys, or just one long key, presumably a result of rbinding several long keys.

keysplit

Not often needed for user-level code. A list of key blocks, each of which is to be inspected and homogenized. Not used if a keylong argument is provided.

classes

A list of vectors specifying legal promotions.

colnames

Either c("class_old", "class_new"), "class_old", or "class_new". The first of these is the default.

textre

A regular expression matching a column name to be treated as character. The default matches any variable name ending in "TEXT".
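As a quick illustration of what the default textre pattern selects, the base R sketch below applies it to a few made-up variable names. The names are hypothetical, and the sketch shows only plain grepl matching; whether keysPool adds options such as case-insensitive matching is not stated here.

```r
## Default pattern from the Usage section
textre <- "TEXT$"
## Hypothetical variable names, for illustration only
varnames <- c("q1TEXT", "q2_TEXT", "TEXTq3", "q4")
## TRUE where the name ends in "TEXT"
matches <- grepl(textre, varnames)
print(data.frame(varnames, matches))
```

Only the first two names end in "TEXT", so only those would be candidates for treatment as character.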

Details

Users might run keyTemplate on several data sets, arriving at keys that need to be combined. The long versions of the keys can be stacked together by a function like rbind. If the values class_old and class_new for a single variable are inconsistent, then the "key stack" will fail the tests in keyCheck. This function automates the process of fixing the class variables by "promoting" classes where possible.

Begin with a simple example. In one data set, the value of x is drawn from integers 1L, 2L, 3L, while in another set it is floating values like 1.1, 2.2. After creating long format keys, and stacking them together, the values of class_old will clash. For x, we will observe both "integer" and "numeric" in the class_old column. In that situation, the class_old for all of the rows under consideration should be set as "numeric".
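The clash described above can be spotted with base R before calling keysPool. The following is a minimal sketch that uses a hand-built toy key with only two columns, name_old and class_old. A real long key from keyTemplate has more columns, and the object names used here (toykey, classes.seen, clashing) are illustrative assumptions.

```r
## Toy stand-in for two stacked long keys (illustrative columns only)
toykey <- data.frame(
    name_old = c("x", "x", "x", "x", "x", "y", "y"),
    class_old = c("integer", "integer", "integer", "numeric", "numeric",
                  "character", "character"),
    stringsAsFactors = FALSE
)
## Distinct class_old values observed for each variable
classes.seen <- lapply(split(toykey$class_old, toykey$name_old), unique)
## Variables whose class_old values disagree across the stacked keys
clashing <- names(classes.seen)[lengths(classes.seen) > 1]
print(classes.seen)
print(clashing)
```

In this toy key, x shows both "integer" and "numeric", so it is the variable whose class_old would need to be set to "numeric".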

The promotion schemes are described by the classes argument, with the most conservative changes listed first. The most destructive change is a conversion to character, for example from integer to character. Within each vector in classes, the last element is used to replace the preceding classes. For example, c("ordered", "factor", "character") means that the class_old values of "ordered" and "factor" will be replaced by "character".

The conversions specified by classes are tried, in order:

1. logical -> integer
2. integer -> numeric
3. ordered -> factor

If applying these conversions fails to homogenize a vector, then the class is changed to "character". For example, when the observed values of class_old are c("ordered", "numeric", "character"), the class is promoted to "character", which serves as the least common denominator.
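The promotion logic described in this section can be sketched in base R. The helper promoteClass below is hypothetical and is not part of kutils; it is one reading of the description above (replace the classes with the last element of a vector when every observed class appears in that vector, otherwise fall back to "character"), not the package's actual implementation.

```r
## Hypothetical helper, not part of kutils: homogenize the class labels
## observed for one variable, following the description in Details
promoteClass <- function(x,
                         classes = list(c("logical", "integer"),
                                        c("integer", "numeric"),
                                        c("ordered", "factor"),
                                        c("factor", "character"))) {
    for (ladder in classes) {
        ## Stop as soon as the labels agree
        if (length(unique(x)) == 1L) break
        ## Promote only if every observed class appears in this vector
        if (all(x %in% ladder)) {
            x[] <- ladder[length(ladder)]
        }
    }
    ## Fall back to character when no promotion homogenizes the labels
    if (length(unique(x)) > 1L) x[] <- "character"
    x
}

res1 <- promoteClass(c("integer", "integer", "numeric"))
res2 <- promoteClass(c("ordered", "numeric", "character"))
res3 <- promoteClass(c("ordered", "factor"),
                     classes = list(c("ordered", "factor", "character")))
print(res1)
print(res2)
print(res3)
```

Under these assumptions, the first call yields "numeric" throughout, the second falls back to "character", and the third shows the three-element vector from the text replacing both "ordered" and "factor" with "character".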

Value

A class-corrected key in the same format as the input, either a long key or a list of key elements.

Author(s)

Paul Johnson <pauljohn@ku.edu>

Examples

dat1 <- data.frame(x1 = as.integer(rnorm(100)), x2 = sample(c("Apple", "Orange"),
                   100, replace = TRUE), x3 = ifelse(rnorm(100) < 0, TRUE, FALSE))
dat2 <- data.frame(x1 = rnorm(100), x2 = ordered(sample(c("Apple", "Orange"),
                   100, replace = TRUE)), x3 = rbinom(100, 1, .5),
                   stringsAsFactors = FALSE)
key1 <- keyTemplate(dat1, long = TRUE)
key2 <- keyTemplate(dat2, long = TRUE)
keys2stack <- rbind(key1, key2)
keys2stack.fix <- keysPool(keys2stack)
keys2stack.fix2 <- keysPool(keys2stack.fix, colnames = "class_new")
## Sometimes this will not be able to homogenize
dat1 <- data.frame(x1 = as.integer(rnorm(100)),
                   x2 = sample(c("Apple", "Orange"), 100, replace = TRUE))
dat2 <- data.frame(x1 = rnorm(100),
                   x2 = sample(c("Apple", "Orange"), 100, replace = TRUE),
                   stringsAsFactors = FALSE)
key1 <- keyTemplate(dat1, long = TRUE)
key2 <- keyTemplate(dat2, long = TRUE)
## Create a stack of keys for yourself
keys2stack <- rbind(key1, key2)
keys.fix <- keysPool(keys2stack)
## Or supply a list of keys and let keysPool stack them for you
keys.fix2 <- keysPool(list(key1, key2)) 
## View(keys.fix)
## View(keys.fix2)


## If you have wide keys, convert them with wide2long, either by
key1 <- keyTemplate(dat1)
key2 <- keyTemplate(dat2)
keysstack.wide <- rbind(wide2long(key1), wide2long(key2))
keys.fix <- keysPool(keysstack.wide)
## or
keysPool(list(wide2long(key1), wide2long(key2)))

kutils documentation built on Sept. 17, 2023, 5:06 p.m.