Working with factors in R can be frustrating. John Mount, puts it well by saying that factors are not first-class citizens in R. Even the most basic code often behaves in an unexpected manner and without any warning. Let's start off with a simple example - creating two factor vectors and combining them together - that illustrates some of the issues:
lower <- factor(letters[1:3], levels = letters[2:4]) upper <- factor(LETTERS[1:5], levels = LETTERS[1:4]) (combined <- c(lower, upper))
One can note the following issues:
factor()
does not yield an error when NA
s are created silently, neither
when levels do not match x nor when too few levels are specified. c()
is used to combine the two, the levels of the two elements (which
are just attributes to integer vectors) are not conveyed, R simply combines the
underlying integers from each vector.lower
and upper
, so a 1 in the combined
vector could be either b
or an A
from one of the original vectors.The goal of the refactor
package is to make working with factors more natural
and fun by providing
cfactor
for factor
to enhance control at the point of factor
creation.index_cfactor
to decode numerical data into (ordered) factors
given the encoding.ordered
and factor
) where
current methods are not tailored for categorical data.cfactor
cfactor
has been defined to give warnings in cases when empty factors are
created from strings or when existing strings are not preserved. Essentially,
it is a wrapper for factor()
, and the 'c' stands for enhanced control. It also
has improved level order detection based on numerical values within strings that
is superior to the way factor
assesses this order in certain cases.
library(refactor) string <- c("a", "b", "c") cfactor(string, levels = c("b", "c", "d"))
The default behavior of factor
(if levels are not explicitly supplied), is to
first convert x
to character, and then to take the unique values and sort the
characters (sort(unique(as.character(x)))
). This approach is fine if the
string ordering was the same as ordering the numbers within the string, but in
general this is not the case, as the following examples will illustrate:
\ A "clean" example
With all numbers having the same number of digits, factor
can detect the order
correctly.
easy_to_dectect <- c("EUR 11 - EUR 20", "EUR 1 - EUR 10", "EUR 21 - EUR 22") factor(easy_to_dectect, ordered = TRUE) # correctly detects level
\ A more "dirty" example
However, in the general case, where number of digits might be destinct, this does not work anymore. The category "EUR 100 - 101" comes second, but it should be last.
hard_to_dectect <- c("EUR 21 - EUR 22", "EUR 100 - 101", "EUR 1 - EUR 10", "EUR 11 - EUR 20") factor(hard_to_dectect, ordered = TRUE)
cfactor
detects levels using regular expressions. Concretely, it extracts the
substrings preceding sep
, removes everything except digits and the decimal
point in x
and orders the remaining numbers to find the order of the levels.
cfactor(hard_to_dectect, ordered = TRUE, sep = "-")
This detection algorithm can be turned off and the default ordering of factor
can be applied by setting the sep
argument to NULL
. Also, in the absence of
any numbers, the default ordering of factor
is applied.
identical( cfactor(hard_to_dectect, ordered = TRUE, sep = NULL), factor(hard_to_dectect, ordered = TRUE) )
When creating a factor
you sometimes want to label that data, that is, assign
a label to each value supplied in x
. A situation to avoid is using labels
that are identical to its underlaying value for some but not all levels.
cfactor(x = c("a", "b", "c"), levels = c("a", "b", "c"), labels = c("a", "letter b", "b"))
The outcome of this is kind of messy. Whereas factor
lets you do that without
returing any warning, cfactor
will issue a such to inform about the
intersection.
\
In adition, a second warning that indicates that a
is choosen as a
representation of a
itself is issued. In a context where data are given
labels distinct from their values, which is the whole point of labeling, this
in many cases proabably not what the programmer wanted to see happening.
index_cfactor
If data is encoded, the labels
argument of factor()
can be used to label
the data, which is a common data pre-processing step. This works the same as
with cfactor()
. Here, we want to give an example where the relationship b
eteween the encoding and the label is stored in a data.frame
, e.g. after it
was imported from a spread sheet.
data <- sample(x = 1:10, size = 20, replace = TRUE) index <- data.frame(encoding = 1:10, label = letters[1:10]) cfactor(data, levels = index$encoding, labels = index$label)
\
In a real-world situation, it is likely that a lot of variables have to be
decoded. index_cfactor
was created to assist with this task. First, we need
some sample data.
data <- data.frame(var1 = sample(x = 1:10, size = 20, replace = TRUE), var2 = rep(1:2, 20), var3 = sample(20), var4 = 2, var5 = sample(row.names(USArrests), size = 20), stringsAsFactors = FALSE) head(data) index <- data.frame(var = rep(paste0("var", 1:3), c(10, 2, 20)), encoding = c(1:10, 1:2, 1:20), label = c(letters[1:10], c("male", "female"), LETTERS[1:20])) head(index)
Now we use index_cfactor
to decode the data.frame
. Note that var4
and
var5
are left as is, since no variable encoding for them is defined in
index
. Using the ...
argument of index_cfactor
, we can pass additional
arguments to cfactor
.
final <- head(index_cfactor(data = data, index = index, variable = "var", ordered = c(TRUE, TRUE, FALSE))) print(final) sapply(final, class)
index_cfactor
converts all columns in data
that have a match in index
to
factors with the respective encoding using cfactor
. Further arguments are
variable
, label
and encoding
referring to the column names in index that
contain the respective information.
Other example strings
c("1 to 4", "5 to 6") # properly separated c("1 to 4", "4 to 6") # not properly separated c("from 1,000 to 2,000", "from 2000 to 4,000") # comma separated and 'from' and 'to' c("4.0 / 4.1", "4.2 / 4.3") # point and slash separator c("one minute", "three minutes", "1 hour")
Instead of defining completely new functions, we decided to provide some S3 generics to extend the functionality of existing R functions and tailor them so they are better suited for categorical data. Also, a more extensive warning and error behavior than for their base R counterparts is implemented.
cut
Applying cut can result in categorical data, for example if integers are binned or if ordered factor levels are summarized into larger categories.
cut
method for integerscut.default
provides rather inappropriate labels for integer values.
random <- sample(100) cut.default(random, breaks = seq(0, 100, by = 10))[1:10]
refactor
extends the S3 method cut
with cut.integer
to provide more
natural labels for this data type.
cut(random, breaks = seq(0, 100, by = 10))[1:10]
The remainder of the section will outline and describe how cut.integer
deviates from the default cut
method.
Creating missing values will yield a warning.
cut(sample(10), breaks = c(0, 3, 5))
It is possible to define bins with width 1. This will generate a label with just the value of the integer containted (e.g. 2 instead of 2-2) and issue a warning.
cut(sample(10), breaks = c(1, 4, 6, 8, 9, 10))
Unordered breaks will be ordered before proceeding and a warning will be issued.
cut(sample(10), breaks = c(10, 0, 3))
Decimal values for breaks
will be rounded to integers and a warning will be
issued.
cut(sample(10), breaks = c(1, 2.6, 5.1, 10))
cut
method for ordered factorsThis method allows one to combine the levels of a factor into fewer categories given some break points:
some_letters <- cfactor(letters, ordered = TRUE) head(cut(some_letters, breaks = c("a", "q", "z"), labels = c("beginning of the alphabet", "the rest of the alphabeth"), right = TRUE, include.lowest = TRUE))
The purpose of cc
is to combine elements into a vector, just as base c
does.
They only differ with regard to factor treatment, for which c
most likely
does not return what might be expected.
c(cfactor("a"), cfactor("b")) cc(cfactor("a"), cfactor("b"))
cc
can also deal with ordered factors. It is quite verbose in the sense
that if the levels and/or the order of the levels do not match exactly, it will
issue warnings.
a_b <- cfactor(c("a", "b"), ordered = T) b_d <- cfactor(c("b", "c", "d"), ordered = T) cc(a_b, b_d)
The append function is a simple extension of the base::append()
function,
which is slightly refined in order to better handle appending of factors.
As shown in the example below, the base version of append()
converts factors
back into their underlying integer representations in the process of combining
them:
f <- factor(c('c','a','a',NA,'b','a'), levels= c('a','b','c')) g <- factor(sample(letters[4:10]), levels = sample(letters[4:10])) base::append(f, g)
This is probably not what you'd expect to get. What refactor::append
will give
you instead, is this:
refactor::append(f, g)
It returns a factor, preserves the levels, and also warns you about the NA
value. Hopefully that is more in line with what you wanted.
refactor::append()
will also handle ordered factors, provided that the
ordering is consistent. If levels and/or ordering are not consistent, then it
will simply fall back to the base method.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.