Introduction

Working with factors in R can be frustrating. John Mount puts it well when he says that factors are not first-class citizens in R. Even the most basic code often behaves unexpectedly and without any warning. Let's start with a simple example, creating two factor vectors and combining them, that illustrates some of the issues:

lower <- factor(letters[1:3], levels = letters[2:4])
upper <- factor(LETTERS[1:5], levels = LETTERS[1:4])
(combined <- c(lower, upper))

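One can note the following issues: values that are not covered by levels ("a" in lower, "E" in upper) silently become NA, and c() (in R versions before 4.1.0) returns the underlying integer codes instead of a factor.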

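The goal of the refactor package is to make working with factors more natural and fun by providing cfactor, an enhanced wrapper for factor(); index_cfactor, for decoding many variables at once; and versions of cut, cc and append that are tailored to categorical data.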

cfactor

cfactor is essentially a wrapper for factor(); the 'c' stands for enhanced control. It gives warnings when empty factor levels are created from strings or when existing strings are not preserved. It also detects the level order from the numerical values within strings, which in certain cases works better than the ordering that factor applies.

unmatched factors

library(refactor)

string <- c("a", "b", "c")
cfactor(string, levels = c("b", "c", "d"))
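
The kind of check described above can be approximated in base R. The following is only a minimal sketch, not the refactor implementation; the helper name warn_unmatched and the wording of its warnings are hypothetical.

```r
# Hypothetical helper (not part of refactor): warn about values of x that
# have no matching level and about levels that never occur in x.
warn_unmatched <- function(x, levels) {
  lost  <- setdiff(unique(x[!is.na(x)]), levels)  # values that would become NA
  empty <- setdiff(levels, x)                     # levels without any data
  if (length(lost) > 0) {
    warning("values not in levels (will become NA): ",
            paste(lost, collapse = ", "))
  }
  if (length(empty) > 0) {
    warning("levels without data: ", paste(empty, collapse = ", "))
  }
  factor(x, levels = levels)
}

string <- c("a", "b", "c")

# Print each warning as a message so that both are visible immediately.
res <- withCallingHandlers(
  warn_unmatched(string, levels = c("b", "c", "d")),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
is.na(res)  # only the unmatched value "a" has become NA
```

Both problems are reported here, whereas factor() alone would return the same object without any message.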

detect levels

If levels are not explicitly supplied, the default behavior of factor for character input is to take the unique values and sort them as strings (sort(unique(as.character(x)))). This approach is fine if the string order coincides with the order of the numbers within the strings, but in general this is not the case, as the following examples illustrate:

A "clean" example

With all numbers having the same number of digits, factor can detect the order correctly.

easy_to_dectect <- c("EUR 11 - EUR 20", "EUR 1 - EUR 10", "EUR 21 - EUR 22")
factor(easy_to_dectect, ordered = TRUE) # correctly detects level

A more "dirty" example

However, in the general case, where the number of digits may differ, this no longer works. The category "EUR 100 - 101" comes second, but it should be last.

hard_to_dectect <- c("EUR 21 - EUR 22", "EUR 100 - 101", 
                     "EUR 1 - EUR 10", "EUR 11 - EUR 20")

factor(hard_to_dectect, ordered = TRUE)

cfactor detects levels using regular expressions. Concretely, it extracts the substrings preceding sep, removes everything except digits and the decimal point in x and orders the remaining numbers to find the order of the levels.

cfactor(hard_to_dectect, ordered = TRUE, sep = "-")
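
The detection steps described above can be sketched in base R. This is an illustration under simplifying assumptions, not the refactor implementation: the helper name order_by_leading_number is hypothetical, sep is treated as a fixed string, and only the part before the first separator is considered.

```r
# Hypothetical helper (not part of refactor): order levels by the number
# found before `sep` in each string.
order_by_leading_number <- function(x, sep = "-") {
  lv <- unique(x)
  # keep only the part of each level that precedes the separator
  first <- vapply(strsplit(lv, sep, fixed = TRUE), `[`, character(1), 1)
  # drop everything except digits and the decimal point, then convert
  num <- as.numeric(gsub("[^0-9.]", "", first))
  factor(x, levels = lv[order(num)], ordered = TRUE)
}

hard <- c("EUR 21 - EUR 22", "EUR 100 - 101",
          "EUR 1 - EUR 10", "EUR 11 - EUR 20")
levels(order_by_leading_number(hard))
```

Because the extracted values 21, 100, 1 and 11 are compared as numbers rather than as strings, "EUR 100 - 101" ends up as the last level.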

This detection algorithm can be turned off and the default ordering of factor can be applied by setting the sep argument to NULL. Also, in the absence of any numbers, the default ordering of factor is applied.

identical(
  cfactor(hard_to_dectect, ordered = TRUE, sep = NULL),
   factor(hard_to_dectect, ordered = TRUE)
)

Intersection of underlying values and labels

When creating a factor, you sometimes want to label the data, that is, assign a label to each value supplied in x. A situation to avoid is using labels that are identical to their underlying values for some but not all levels.

cfactor(x = c("a", "b", "c"), levels = c("a", "b", "c"), labels = c("a", "letter b", "b"))

The outcome of this is rather messy. Whereas factor lets you do this without returning any warning, cfactor issues one to inform you about the intersection. In addition, a second warning is issued, indicating that a is chosen as a representation of a itself. In a context where data are given labels distinct from their values, which is the whole point of labeling, this is in many cases probably not what the programmer intended.
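
To make the two situations concrete, the overlap between values and labels can be found with a few lines of base R. This is a sketch of the idea only; how cfactor performs the check internally is not shown in this document, and the variable names below are made up for the illustration.

```r
# Illustration only (not the refactor implementation).
lv  <- c("a", "b", "c")          # underlying values
lab <- c("a", "letter b", "b")   # labels assigned to them, in the same order

pos <- match(lab, lv)            # position of each label among the values

# labels that coincide with the value of a *different* level
clashing <- lab[!is.na(pos) & pos != seq_along(lab)]
# labels that simply repeat their own value
self <- lab[!is.na(pos) & pos == seq_along(lab)]

clashing
self
```

Here clashing contains "b", which labels the value "c" although "b" is itself a value, and self contains "a", which is labeled by itself.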

index_cfactor

If data are encoded, the labels argument of factor() can be used to label them, which is a common data pre-processing step. This works the same way with cfactor(). Here, we give an example where the relationship between the encoding and the label is stored in a data.frame, e.g. after it was imported from a spreadsheet.

data <- sample(x = 1:10, size = 20, replace = TRUE)
index <- data.frame(encoding = 1:10,
                    label = letters[1:10])

cfactor(data, levels = index$encoding, labels = index$label)

In a real-world situation, it is likely that many variables have to be decoded. index_cfactor was created to assist with this task. First, we need some sample data.

data <- data.frame(var1 = sample(x = 1:10, size = 20, replace = TRUE),
                   var2 = rep(1:2, 10),  # 20 values, like the other columns
                   var3 = sample(20),
                   var4 = 2,
                   var5 = sample(row.names(USArrests), size = 20),
                   stringsAsFactors = FALSE)
head(data)

index <- data.frame(var = rep(paste0("var", 1:3), c(10, 2, 20)),
                    encoding = c(1:10, 1:2, 1:20),
                    label = c(letters[1:10], c("male", "female"), LETTERS[1:20]))
head(index)

Now we use index_cfactor to decode the data.frame. Note that var4 and var5 are left as is, since no variable encoding for them is defined in index. Using the ... argument of index_cfactor, we can pass additional arguments to cfactor.

final <- head(index_cfactor(data = data, index = index, variable = "var", 
                            ordered = c(TRUE, TRUE, FALSE)))

print(final)
sapply(final, class)

index_cfactor converts all columns in data that have a match in index to factors with the respective encoding using cfactor. The further arguments variable, label and encoding refer to the column names in index that contain the respective information.
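
The decoding logic can be sketched with base R as follows. This is a simplified illustration, not the refactor implementation: the helper name decode_columns is hypothetical, it uses factor() instead of cfactor(), and its defaults for variable, encoding and label are assumptions chosen to match the column names used above.

```r
# Hypothetical helper (not part of refactor): decode every column of `data`
# that has entries in `index` and leave all other columns untouched.
decode_columns <- function(data, index, variable = "var",
                           encoding = "encoding", label = "label") {
  for (v in intersect(names(data), unique(index[[variable]]))) {
    rows <- index[[variable]] == v   # index rows that describe column v
    data[[v]] <- factor(data[[v]],
                        levels = index[[encoding]][rows],
                        labels = index[[label]][rows])
  }
  data
}

toy_data  <- data.frame(var1 = c(2, 1, 2), var2 = c("x", "y", "z"),
                        stringsAsFactors = FALSE)
toy_index <- data.frame(var = "var1", encoding = 1:2,
                        label = c("male", "female"),
                        stringsAsFactors = FALSE)

decoded <- decode_columns(toy_data, toy_index)
sapply(decoded, class)
```

Only var1 is converted to a factor; var2 stays a character column because toy_index holds no encoding for it.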

Other example strings

c("1 to 4", "5 to 6") # properly separated
c("1 to 4", "4 to 6") # not properly separated
c("from 1,000 to 2,000", "from 2000 to 4,000") # comma separated and 'from' and 'to'
c("4.0 / 4.1", "4.2 / 4.3") # point and slash separator
c("one minute", "three minutes", "1 hour")

tailored methods for categorical data

Instead of defining completely new functions, we decided to provide some S3 methods that extend existing R functions and tailor them to categorical data. They also warn and raise errors in more situations than their base R counterparts.

cut

Applying cut can result in categorical data, for example if integers are binned or if ordered factor levels are summarized into larger categories.

a cut method for integers

cut.default labels the bins in interval notation such as (0,10], which is rather inappropriate for integer values.

random <- sample(100)
cut.default(random, breaks = seq(0, 100, by = 10))[1:10]

refactor extends the S3 generic cut with the method cut.integer to provide more natural labels for this data type.

cut(random, breaks = seq(0, 100, by = 10))[1:10]
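
Labels of this kind can also be produced with base cut() by supplying them explicitly. The sketch below assumes right-closed bins and a "lower-upper" label format; the helper name integer_labels is hypothetical, and the exact labels produced by cut.integer may differ.

```r
# Hypothetical helper (not part of refactor): label right-closed integer
# bins as "lower-upper" instead of "(lower,upper]".
integer_labels <- function(breaks) {
  lower <- head(breaks, -1) + 1   # smallest integer inside each (a, b] bin
  upper <- tail(breaks, -1)       # largest integer inside each (a, b] bin
  ifelse(lower == upper, as.character(upper), paste0(lower, "-", upper))
}

breaks <- seq(0, 100, by = 10)
binned <- cut(c(1L, 10L, 11L, 100L), breaks = breaks,
              labels = integer_labels(breaks))
as.character(binned)
```

The values 1 and 10 fall into the bin labeled 1-10, 11 into 11-20 and 100 into 91-100.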

The remainder of the section will outline and describe how cut.integer deviates from the default cut method.


Creating missing values will yield a warning.

cut(sample(10), breaks = c(0, 3, 5))

It is possible to define bins with width 1. This will generate a label with just the value of the integer contained (e.g. 2 instead of 2-2) and issue a warning.

cut(sample(10), breaks = c(1, 4, 6, 8, 9, 10))

Unordered breaks will be ordered before proceeding and a warning will be issued.

cut(sample(10), breaks = c(10, 0, 3))

Decimal values for breaks will be rounded to integers and a warning will be issued.

cut(sample(10), breaks = c(1, 2.6, 5.1, 10))
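
The last two adjustments, sorting and rounding of breaks, can be sketched in base R. This is an illustration only; the helper name tidy_breaks and the warning texts are hypothetical and do not reproduce the messages issued by cut.integer.

```r
# Hypothetical helper (not part of refactor): tidy up breaks before binning
# integers and warn about every adjustment that was made.
tidy_breaks <- function(breaks) {
  if (is.unsorted(breaks)) {
    warning("breaks were not sorted; sorting them")
    breaks <- sort(breaks)
  }
  if (any(breaks != round(breaks))) {
    warning("breaks were not integers; rounding them")
    breaks <- round(breaks)
  }
  breaks
}

# Both warnings are triggered here; they are suppressed to keep the output short.
clean <- suppressWarnings(tidy_breaks(c(10, 2.6, 1, 5.1)))
clean
```

The breaks are first sorted to 1, 2.6, 5.1, 10 and then rounded to 1, 3, 5, 10.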

a cut method for ordered factors

This method allows one to combine the levels of a factor into fewer categories given some break points:

some_letters <- cfactor(letters, ordered = TRUE)
head(cut(some_letters, breaks = c("a", "q", "z"), 
         labels = c("beginning of the alphabet", "the rest of the alphabet"), 
         right = TRUE, include.lowest = TRUE))
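
The idea behind cutting an ordered factor can be illustrated in base R by binning the integer codes of the levels. This is a sketch under that assumption, not the refactor implementation; the short labels are chosen for the illustration only.

```r
# Illustration only: bin an ordered factor via the positions of its levels.
f <- factor(letters, levels = letters, ordered = TRUE)

# translate the break points from level names to level positions
break_pos <- match(c("a", "q", "z"), levels(f))

grouped <- cut(as.integer(f), breaks = break_pos,
               labels = c("a to q", "r to z"),
               right = TRUE, include.lowest = TRUE, ordered_result = TRUE)
table(grouped)
```

With right = TRUE and include.lowest = TRUE, the first group covers the 17 letters from a to q and the second group the remaining 9.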

cc

The purpose of cc is to combine elements into a vector, just as base c does. They only differ with regard to factor treatment, for which c most likely does not return what might be expected.

c(cfactor("a"), cfactor("b"))
cc(cfactor("a"), cfactor("b"))
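
A factor-preserving combination can be sketched in base R by going through the labels instead of the integer codes. This is a minimal illustration, not the refactor implementation, and the helper name combine_factors is hypothetical. Note that what base c() returns for factors depends on the R version: before R 4.1.0 it returned the integer codes, whereas later versions return a factor. The sketch does not rely on either behavior.

```r
# Hypothetical helper (not part of refactor): combine factors via their
# labels so that the result is a factor with the union of all levels.
combine_factors <- function(...) {
  parts      <- list(...)
  all_levels <- unique(unlist(lapply(parts, levels)))
  values     <- unlist(lapply(parts, as.character))
  factor(values, levels = all_levels)
}

both <- combine_factors(factor("a"), factor("b"))
both
levels(both)
```

Unlike cc, this sketch ignores any ordering of the levels and issues no warnings.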

cc can also deal with ordered factors. It is quite verbose in the sense that if the levels and/or the order of the levels do not match exactly, it will issue warnings.

a_b <- cfactor(c("a", "b"), ordered = TRUE)
b_d <- cfactor(c("b", "c", "d"), ordered = TRUE)
cc(a_b, b_d)

append

The append function is a simple extension of the base::append() function, which is slightly refined in order to better handle appending of factors.

As shown in the example below, the base version of append() converts factors back into their underlying integer representations in the process of combining them:

f <- factor(c("c", "a", "a", NA, "b", "a"), levels = c("a", "b", "c"))
g <- factor(sample(letters[4:10]), levels = sample(letters[4:10]))
base::append(f, g)

This is probably not what you'd expect to get. What refactor::append will give you instead is this:

refactor::append(f, g)

It returns a factor, preserves the levels, and also warns you about the NA value. Hopefully that is more in line with what you wanted.
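
The behavior described here can be approximated in base R as follows. This is a sketch under simplifying assumptions, not the refactor implementation: the helper name append_factors and its warning text are hypothetical, and ordered factors are not handled.

```r
# Hypothetical helper (not part of refactor): append one factor to another
# via their labels, keep all levels and warn about missing values.
append_factors <- function(x, values, after = length(x)) {
  if (anyNA(x) || anyNA(values)) {
    warning("NA values present in the input")
  }
  all_levels <- union(levels(x), levels(values))
  combined   <- append(as.character(x), as.character(values), after = after)
  factor(combined, levels = all_levels)
}

f2 <- factor(c("c", "a", NA, "b"), levels = c("a", "b", "c"))
g2 <- factor(c("e", "d"), levels = c("d", "e"))

# The NA in f2 triggers the warning; it is suppressed to keep the output short.
res <- suppressWarnings(append_factors(f2, g2))
res
levels(res)
```

The result is a factor whose levels are those of f2 followed by those of g2, and the NA value is kept in place.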

refactor::append() will also handle ordered factors, provided that the ordering is consistent. If levels and/or ordering are not consistent, then it will simply fall back to the base method.



jonmcalder/refactor documentation built on Nov. 16, 2020, 3:46 a.m.