Base R Data Structures

Vectors

The most common data structure in R is the vector. R's vectors can be organised by their dimensionality (1d, 2d, or nd) and whether they're homogeneous or heterogeneous. This gives rise to the five data types most often used in data analysis:

| | Homogeneous | Heterogeneous | |----|---------------|---------------| | 1d | Atomic vector | List | | 2d | Matrix | Data frame | | nd | Array | |

Given an object, the best way to understand what data structures it is composed of is to use str(). str() is short for structure and it gives a compact, human readable description of any R data structure.

Vectors have three common properties:

They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.

``{block type='rmdnote'}is.vector()does not test if an object is a vector. Instead it returns TRUE only if the object is a vector with no attributes apart from names. Useis.atomic(x) || is.list(x)` to test if an object is actually a vector.

### Atomic Vectors

There are many "atomic" types of data: `logical`, `integer`, `double` and `character` (in this order, see below). There are also `raw` and `complex` but they are rarely used.

You can't mix types in an atomic vector (you can in a list). Coercion will automatically occur if you mix types:

```r
(a <- FALSE)
typeof(a)

(b <- 1:10)
typeof(b)
c(a, b)         ## FALSE is coerced to integer 0

(c <- 10.5)
typeof(c)
(d <- c(b, c))  ## coerced to double

c(d, "a")       ## coerced to character

50 < "7"

You can force coercion with as.logical, as.integer, as.double, as.numeric, and as.character. Most of the time the coercion rules are straight forward, but not always.

x <- c(TRUE, FALSE)
typeof(x)

as.integer(x)
as.numeric(x)
as.character(x)

However, coercion is not associative.

x <- c(TRUE, FALSE)

x2 <- as.integer(x)
x3 <- as.numeric(x2)
as.character(x3)

What would you expect this to return?

x <- c(TRUE, FALSE)

as.integer(as.character(x))

You can test for an "atomic" types of data with: is.logical, is.integer, is.double, is.numeric[^numeric], and is.character.

[^numeric]: is.numeric() is a general test for the “numberliness” of a vector and returns TRUE for both integer and double vectors. It is not a specific test for double vectors, which are often called numeric.

x <- c(TRUE, FALSE)

is.logical(x)
is.integer(x)

What would you expect these to return?

x <- 2

is.integer(x)
is.numeric(x)
is.double(x)

Missing values are specified with NA, which is a logical vector of length 1. NA will always be coerced to the correct type if used inside c(), or you can create NAs of a specific type with NA_real_ (a double vector), NA_integer_ and NA_character_.

Lists

Lists are different from atomic vectors because their elements can be of any type, including other lists. Lists can contain complex objects so it's not possible to pick one visual style that works for every list. You construct lists by using list() instead of c():

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)

Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.

x <- list(list(list(list(1))))
str(x)
is.recursive(x)

c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them. Compare the results of list() and c():

l1 <- list(1, 2)
c1 <- c(3, 4)
x <- list(l1, c1)
y <- c(l1, c1)
str(x)
str(y)

The typeof() a list is list. You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().

Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described in data frames) and linear models objects (as produced by lm()) are lists

NULL

Closely related to vectors is NULL, a singleton object often used to represent a vector of length 0. NULL is different than NA. For a good explanation of the differences see this blog post.

Attributes

All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list[^pairlist] (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes().

[^pairlist]: The reality is a little more complicated: attributes are actually stored in something called pairlists, which can you learn more about in Advanced R

a <- 1:3
attr(a, "createdBy") <- "Brian Davis"
attr(a, "version") <- 1.0
attr(a, "z") <- list(list())
a
attributes(a)
str(attributes(a))

The structure() function returns a new object with modified attributes. Care must be taken with attributes since, by default, most attributes are lost when modifying a vector.

attributes(a[1])
attributes(sum(a))

The only attributes not lost are the three most important:

Each of these attributes has a specific accessor function to get and set values. When working with these attributes, use names(x), dim(x), and class(x), not attr(x, "names"), attr(x, "dim"), and attr(x, "class").

Names

You can name a vector in a couple[^named-vectors] ways:

Named vectors are a great way to make an easy, human readable look up table. We will see this use case extensively when we get to data visualizations.

[^named-vectors]: There are a couple less common ways. See Advanced R

Factors

One important use of attributes is to define factors. A factor is a vector that can contains only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, "factor", which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values. Factors can also have labels which effect how the factors are displayed. By default the labels are the same as the levels.

The order of the levels of a factor can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level. This feature can also be used to customize order in plots that include factors, since by default factors are plotted in the order of their levels. Labels are also useful in plotting where you want the displayed text to be different than the underlying representation.

Factors are useful when you know the possible values a variable may take, even if you don't see all values in a given data set. Using a factor instead of a character vector makes it obvious when some groups contain no observations:

gender_char <- c("m", "m", "m")
gender_factor <- factor(gender_char, levels = c("m", "f"))

gender_char
table(gender_char)
gender_factor
table(gender_factor)
# See the underlying representation of a factor
unclass(gender_factor)

gender_factor2 <- factor(gender_char, levels = c("m", "f"), labels = c("Male", "Female"))
gender_factor2
table(gender_factor2)
# See the underlying representation of a factor
unclass(gender_factor2)

While factors look like (and often behave like) character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c()) will use the underlying integer values. For this reason, it is best to explicitly convert factors to character vectors if you need string-like behavior.

Unfortunately, many base R functions (like read.csv() and data.frame()) automatically convert character vectors to factors. This is sub-optimal, because there's no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument stringsAsFactors = FALSE to suppress this behavior, and then manually convert character vectors to factors using your knowledge of the data only when you need the behavior of factors.

Factors tend to be most useful in data visualization and table creations where you want to report all categories but some categories may not be present in your data, or when you want to order the categories in something other than the default ordering. We will revisit factors and there usefulness later when we study the tidyverse and in particular the forcats package.

Matrices and arrays

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.

Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():

# Two scalar arguments to specify rows and columns
a <- matrix(1:12, ncol = 3, nrow = 4)
a
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))
b

# You can also modify an object in place by setting dim()
vec <- 1:12
vec
class(vec)

dim(vec) <- c(3, 4)
vec
class(vec)

dim(vec) <- c(3, 2, 2)
vec
class(vec)

length() and names() have high-dimensional generalizations:

c() generalizes to cbind() and rbind() for matrices, and to abind::abind() for arrays. You can transpose a matrix with t(); the generalized equivalent for arrays is aperm().

You can test if an object is a matrix or array using is.matrix() and is.array(), or by looking at the length of the dim(). as.matrix() and as.array() make it easy to turn an existing vector into a matrix or array.

Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren't too important, but it's useful to know they exist in case you get strange output from a function (tapply() is a frequent offender). As always, use str() to reveal the differences.

Matrices and arrays are most useful for mathematical calculations (particularly when fitting models); lists and data frames are a better fit for most other programming tasks in R.

Data Frames

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows. You can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix), we will discuss this further when we discuss subsetting.

Creation

You create a data frame using data.frame(), which takes named vectors as input:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

``{block2, type='rmdwarning'} Beware data.frame()'s default behavior which turns strings into factors. UsestringsAsFactors = FALSE` to suppress this behavior.

```r
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
str(df)

Because a data.frame is an S3 class, its type reflects the underlying vector used to build it: the list.

typeof(df)

Testing and coercion

Because a data.frame is an S3 class, its type reflects the underlying vector used to build it: the list. To check if an object is a data frame, use is.data.frame():

is.data.frame(df)

You can coerce an object to a data frame with as.data.frame():

The automatic coercion that causes the most problems is if you select a single column of a data.frame. R will coerce the column to an atomic vector, which generally is not what you want[^data.frame].

(x1 <- df[, "y"])
str(x1)

(x2 <- df[, "y", drop = FALSE])
str(x2)

[^data.frame]: We'll revisit this when we get into R for Data Science and discuss tibbles

Combining data frames

You can combine data frames using cbind() and rbind():

cbind(df, data.frame(z = 3:1))
rbind(df, data.frame(x = 10, y = "z"))

When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match.

It's a common mistake to try and create a data frame by cbind()ing vectors together. This is unlikely to do what you want because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly:

# This is always a mistake
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)

good <- data.frame(a = 1:2, b = c("a", "b"))
str(good)

List and matrix columns

Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list. This is a powerful technique because a list can contain any other R object. This means that you can have a column of data frames, or model objects, or even functions! We will see this again when we discuss tidy data.

df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
df

However, when a list is given to data.frame(), it tries to put each item of the list into its own column, so this fails:

data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))

A workaround is to use I(), which causes data.frame() to treat the list as one unit:

dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(dfl)

I() adds the AsIs class to its input, but this can usually be safely ignored.

Similarly, it's also possible to have a column of a data frame that's a matrix or array, as long as the number of rows matches the data frame:

dfm <- data.frame(x = 1:3 * 10, y = I(matrix(1:9, nrow = 3)))
str(dfm)

Use list and array columns with caution. Many functions that work with data frames assume that all columns are atomic vectors, and the printed display can be confusing.

dfl[2, ]
dfm[2, ]


DavisBrian/rclassnotes documentation built on May 17, 2019, 8:19 a.m.