In olobiolo/Rdlazer: Resources for Teaching R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(Rdlazer)

a tour of R

Vector

a vector is 1-dimensional sequence of items in a particular order
the number of elements it the vector's length
a vector's length can be checked with length
a vector is the simplest data structure in R, hence it is called atomic
a vector can store only one type of data, e.g. logical or character, thus it has a type
virtually every data object is some kind of vector
even a single data point, e.g. a TRUE, is actually a vector of length 1
vectors can be created with the function vector: it can create any type (mode) of vector of any length
constructors for particular types exist: logical, numeric, character
when created with a constructor, all elements will be FALSE or 0 or "" (empty string)
functions exist for testing and converting types: is.character, as.character, etc.
any vector can contain NAs

\newline

an example of a character vector is letters

\newline

vectors are most commonly built by concatenating (merging) vectors with the function c

# merge two character vectors of length one
c('one', 'two')

when concatenating vectors of different types, some of them will be coerced to a type that can represent all the vectors in question

# merge a character vector with a numeric one
c('one', 1)
# the numeric is (silently!) coerced to character as a vector can only hold a single type

# concatenate existing vectors (variables)
v1 <- c('one', 'two')
v2 <- c(1, 2)
v3 <- c(FALSE, TRUE)
v4 <- c(v1, v2, v3)
v4

a vector can be treated as a set
use %in% ro is.element to test if something is an element of a vector:

'one' %in% v4

a vector can contain duplicated elements
use anyDuplicated to test if that is the case
use unique to get unique elements of a vector

\newline

inspect the beginning or end of a vector with head or tail, respectively (6 elements by default)
sort a vector with sort
reverse the item order with rev
replicate a vector with rep
create a vector of integers from i1 through i2 with the infix operator i1 : i2
use seq for more options for numeric sequences
use seq_along(x) to obtain a sequence of integers from 1 through length(x)

\newline

attributes

a vector can have attributes
an attribute is an additional feature of a vector, e.g. properties or information
the definition of attributes is necessarily vague as they can be anything, really
attributes are stored as a named list (more on lists below)
all attributes of an object can be viewed with the function attributes
a specific attribute can be viewed with attr
a frequently encountered attribute is names: a vector of character strings associated with each item
names must be unique, unless they are NA or ""
a vector that has a names attribute is a named vector
see .romans for a named vector of Roman numerals
names can be viewed and set with names(x)
names(x) returns names of x
names(x) <- [vector of names] sets names of x, much like assigning a variable
an equivalent method would be to use attr(x, "names")
names exists for convenience, as messing with names is such a common action

# get names
names(v2)

#set names
names(v2) <- c('first', 'second')
names(v2)

# remove names
names(v2) <- NULL
names(v2)

a vector can also be created with names when using c
elements are given as name = value pairs

v5 <- c('first' = 1, 'second' = 2, 'third' = 3)
names(v5)

# the names need not be quoted
v5 <- c(first = 1, second = 2, third = 3)

we will discuss more attributes when appropriate but names will keep coming back
the function as.vector will strip all atributes, leaving a bare vector

Every vector has a type and a length. It can also have attributes.

Vectorized Operations

most operations in R are vectorized
they are applied to each vector element separately and return a vector of the same length

n <- 1:6

# mathematical operation
n + 1

# change case
toupper(letters)

# compare numbers
n > 3

operations involving two vectors are done elementwise, by matching indices (1st with 1st, etc.)

n1 <- 1:4
n2 <- 4:1

n1
n2
n1 + n2
n1 / n2

when the lengths of the vectors involved do not match, the shorter vector is recycled
this can lead to unexpected results, so watch out

n3 <- rep(4, 4)
n4 <- 1:2

n3 + n4

l1 <- c(TRUE, FALSE)
paste(n3, l1)

\newline

of course, some operations do not deal with every element of a vector separately

s1 <- names(v5)
s1

length(s1)

but many do

nchar(s1)

be careful, read the documentation, test

l1 <- toupper(letters)
l2 <- LETTERS

identical(l1, l2)
l1 == l2

vna <- c(1, NA, 2, NA, 3, NA)

is.na(vna)
anyNA(vna)

Extraction/Subsetting

extraction, also called subsetting is picking out a some elements from an object
any language allows for extracting a single element from a sequence
R is especially robust and flexible at it
extraction is done with the [ operator

\newline

this simplest case is getting the n-th element of a vector: vec[n]

v4[1]

the number in the brackets is the index: the ordinal number of an element in the vector
the index is not a number - it is a vector of length 1
the index vector can be of any length

ind <- c(1,3)
v4[ind]

in fact, it can be any numeric vector at all

ind <- c(1, 1, 3, 4, 6, 6)
v4[ind]

# invalid index yields an NA
v4[7]

# fractions are rounded down
v4[2.3]
# this should be avoided regardless

if a vector is named, names can be used as indices

v5['first']
ind <- c('first', 'second', 'first', 'first')
v5[ind]

extraction can also be done with a logical vector, typically of the same length

ind <- c(T, T, F, T, F, F)
v4[ind]

a logical vector does not refer to indices and therefore its length matters

ten <- 1:10
x <- letters[ten]
x

# too long an index produces NAs
ind <- rep(T, 11)
x[ind]

# too short an index is recycled
ind <- c(T, T, F)
x[ind]

ind <- TRUE
x[ind]

a negative index will drop an item

v4
v4[-2]
v4[-c(2,5)]
v4[c(-2, -5)]

Replacement

R has a replacement functionality, a combination of extraction and assignment
a subset is used as a variable name in an assignment expression
the new value is assigned to the subset in the original object

ind <- 1:3 * 2

# extract elements
n[ind]

# replace elements
n[ind] <- NA
n

assigning to a non-existent index creates a new element

n[7] <- 15
n

as well as all intervening elements

n[10] <- 25
n

Matrix

a matrix is a 2-dimensional data structure
it has rows (dimension 1) and columns (dimension 2)
a matrix is created with the function matrix

m <- matrix(1:12, nrow = 3, ncol = 4)
m

by default a matrix is filled by columns, i.e. 1st column from 1st to last row, then 2nd column, etc.; this can be modified wyth the byrow argument of matrix
a matrx can be transposed with the funcion t

t(m)

we can check the number of rows and columns with nrow and ncol, respectively

\newline

a matrix is not a special atomic type with unique properties
it is merely a vector with an attribute dim
dim can be viewed or set with a special function dim

# get dimensions
dim(m)
# the first number is the number of rows and the second is the number of columns

# the attribute can be modified as will, even removed
dim(m) <- NULL
m
# note that the matrix is dismantled in the same way is is built by default:
# column after column

# dim can be set on any vector as long as the grid size matches its length
dim(m) <- c(4,3)
m
# note that, again, the matrix is build in the default order

since a matrix is basically a vector, it can hold only one type of data
it also has a length:

length(m)

subsetting a matrix

when extracting from a matrix, we must account for it having two dimensions
specify the range of each dimension and separate them with a comma:

m[1,3]
m[1:2, 2:3]

leaving one index empty will yield a whole row/column
be aware this returns a vector, not a 1 row matrix!

m1 <- m[1, ]
is.matrix(m1)
is.vector(m1)

in order to preserve tha matrix nature of the result, we must set the drop argument to FALSE:

m2 <- m[1, , drop = FALSE]
# mind the empty second dimension!

since it is a vector, we can also subset using a single index, bypassing the dimensions:

m[1]
m[7]

\newline

a matrix can be extended by adding rows or columns with rbind and cbind, respectively:

rbind(m, 101:103)
cbind(m, letters[1:4])
# note the coercion

c would do no good here as it strips attributes other than names

c(m, m)
cbind(m,m)

\newline

a matrix's dimensions can be named:

rownames(m) <- letters[1:4]
colnames(m) <- LETTERS[1:3]
m

we can now use the names for subsetting

m['a', 'B']
m['a', c('A', 'B'), drop = FALSE]
m[c('a', 'b'), c('A', 'B')]

the row and column names are stored a single attribute dimnames
dimanmes can be viewed or set with the function dimnames
it is a list with two unnamed elements
row and column names can be viewed or set separately with rownames and colnames

dimnames(m)
rownames(m)
colnames(m)

\newline

finally, replacement in a matrix works much the same as before

m[2,3] <- "hello"
m
# note the conversion to character

replacing to an empty index affects all items but keeps attributes

m[] <- "goodbye"
m

Array

an array is an n-dimensional data structure
it is a generalization of the matrix concept
a matrix is basically a specific case of an nd array with n = 2
an array is created with the function array
alternatively, you can take a vector and set its dim to a vector longer than 2

a1 <- array(1:12, dim = c(2, 2, 3))
a1

a2 <- 1:12
dim(a2) <- c(2, 2, 3)
a2

identical(a1, a2)

extending an array, e.g. to add a matrix to a 3d array requires abind from the abind package
see iris3 for an example array

\newline

otherwise an array behaves like a matrix
you need to specify three dimensions when extracting

a1[1, , ]

a1[, , 3, drop = F]

\newline

arrays are not often used nowadays

List

a list is a special type of vector
in a list any item can be of any type - including a list
this is called a recursive object
a list gives TRUE to is.list and is.recursive
look at state.center and ability.cov for examples of lists

\newline

a list is constructed with the function list
much like c, list takes a number of objects and returns a list object

l1 <- list(1, "one", TRUE, v5)
l1

concatenate lists with c or append
using list to add items to a list will take your list make it an element of a higher order list

l2 <- list(l1, "surprize")
l2

l3 <- c(l1, "surprize")
l3

turn a vector into a list with as.list and do the reverse with unlist
a list can be named just like any other vector
list names have restrictions just like variable names: no spaces, no special characters other than underscores or dashes (the latter never at the beginning)
forbidden names can be used, but must be backticked

l4 <- list(one = 1:3, two = 4:5, three = 7:8, `forbidden name` = 9)
l4

subsetting a list
it's a bit of a mess...

like any vector, a list can be extracted from
however, the recursive structure of a list calls for special measures

\newline

using the normal, single bracket returns a subset of the list
all normal rules apply
the result is always a list

# subset with numeric index
l4[2]

ind <- 2:3
l4[ind]

# subset with names
l4['three']

# subset with logical vector
ind <- c(T, F, F)
l4[ind]
# the logical vector has been recycled

# it's a list
x <- l4[1]
x
is.list(x)

a double bracket accepts only single indices and returns a single list item
using a double bracket returns the contents of that item of the list

# numeric
l4[[3]]

# names
l4[['forbidden name']]

# it's not a list
x <- l4[[3]]
x
is.list(x)
is.numeric(x)

Note: a double bracket can also be used on a normal vector. The difference to a single bracket is that a double one discards names. It also accepts only a length 1 index. [[ is virtually never used with atomic vectors.

then there is a special dollar operator for subsetting lists: $
it only accepts bare names of list elements: l4$one
it does exactly the same thing as a double bracket

l4$one

l4$`forbidden name`

A list is a kind of vector that can hold any item.
Subsetting a list with `[` yields a list of any length. Using `[[` or `$` yields a single list item.

R's Syntactic Flexibility

there is a somewhat confusing word: expression
for now, every function call is an expression
as soon as it is executed in the console, the expression is evaluated
its result is computed and possibly a value is returned
unless it is assigned, it is passed to print and, well, printed
however, assignment, like printing, is done by a function
indeed, the value of an expression can be used by any function
this means we can construct a complex expression where the value of one function call is used in another without assigning anything on the way

c(1, c(4, 7))

this can, obviously, be assigned rather than printed

v6 <- c(1, c(4, 7))

there are three function calls in the above expression: one to <- and two to c
sidenote:

| the arrow operator creates a binding rather than return a value | it actually does return, but invisibly | the value can be visualized by putting the expression in parentheses

(v6 <- c(1, c(4, 7)))

| this explains why we can do something like this:

a <- b <- 1:2
a
b

piling up function calls can get unintelligible really quickly:

head(c(paste(letters[1:3], LETTERS[8:5], c(1, 2)), 1:8), 7)

still, it is very useful

string <- 'This is a rather long character string, 
           like a sentence, it even has a period at the end.'

# get the five longest words from this sentence, in uppercase.
s <- strsplit(string, ' ')[[1]]
toupper(s[order(nchar(s), decreasing = T)][1:5])

(we will learn how to deal with the commas later)

Data Frame

a data frame is a 2-dimensional data structure
it is the closest thing to a spreadsheet that exists in R
a data frame is superficially similar to a matrix in that it has rows and columns
unlike a matrix, it can hold different types of data: each column is independent
see iris for an example of a data frame
a data frame has certain dimensions (number of rows and columns) and they can be checked

nrow(iris)

ncol(iris)

dim(iris)

test if an object is a data frame with is.data.frame
a data frame has two (virtually) obligatory attributes: row names and column names

rownames(iris)

colnames(iris)

unlike dimnames in a matrix, rownames and colnames are independent attributes
both are vectors of unique character strings
dimnames does still work, thoguh

dimnames(iris)

if not supplied when creating a data frame, they will be assigned default values
colnames can be set to NULL and be removed
rownames can be set to NULL but will be replaced by a character vector of integers

\newline

a data frame is created with the data.frame function, enumerating the vectors that build columns

# different lengths of columns throw an error
data.frame(1:4, letters[1:4], c('some', 'strings', 'here'))

data.frame(1:4, letters[1:4], c('some', 'strings', 'go', 'here'))
# the automatic names are unhelpful

d1 <- data.frame(first = 1:4, 
                 second = letters[1:4], 
                 'third' = c('some', 'strings', 'go', 'here'))
d1
# note names need not be quoted

matrices can be coerced to a data frame with as.data.frame

as.data.frame(m)

dimnames(m) <- NULL
as.data.frame(m)
# these automatic names are better

as can some lists but it the results may be surprising

l3
as.data.frame(l3)

l4
as.data.frame(l4)

manually created data frames are uncommon, usually they result from importing data

\newline

to be fair, it is possible for a column to be a list of lists rather than an atomic vector
this is called a list column
it is rarely used, just be aware it can happen

\newline

a data frame is actually a special type of list with some restrictions:
- all list elements are (usually) atomic vectors
- all list elements are of the same length
- the list must be named
- the list elements are the columns of the table
this is why length returns the number of columns rather than rows of a data frame
since columns are list elements, column names are actually the names of the list elements

colnames(iris)
names(iris)

identical(names(iris), colnames(iris))

\newline

data frames are where head starts coming in really handy, as it displays the first rows

head(iris)

another useful function is str, which displays the structure of an object

str(iris)

(What's this "Factor" thing here, you ask? Hold that thought.)

subsetting a data frame

use the [ operator and specify two dimensions

# use numeric indices
iris[1:5, 3:4]

# use row/column names
iris[c('1', '4', '9'), c('Petal.Length', 'Species')]

# leave columns unspecified to get them all
iris[1:5, ]

# leave rows unspecified to get them all
iris[, 4:5]

# selecting a single column will yield a vector, not a data frame
iris[, 'Species']

# unless you specify that dimensions not be dropped
iris[, 'Species', drop = F]

# use logical vectors - highly unorthodox but possible
iris[c(TRUE, FALSE), c(TRUE, FALSE)]
# note the recycling

# again, selecting a single column drops dimensions
iris[, c(F, F, T, F, F)]

# unless forbidden
iris[, c(F, F, T, F, F), drop = FALSE]

use the [ operator and only specify the column(s), like with a list

iris[4:5]
iris['Species']
# always yields a data frame, consistent with list behavior

use the [[ to get one column as a vector, like with a list

iris[[5]]
iris[['Species']]
# always yields a vector

iris[[5, drop = F]]
# keeping dimensions is not applicable

iris[[c('Species', 'Petal.Length')]]
# only one item

iris[[1:5]]
iris[[1:2]]
#madness

use the $ operator to extract one column as a vector, like with a list

iris$Species

\newline

so far this replicates list behavior
now for something specific to data frames
use a logical expression derived from the data frame to determine rows

iris[iris$Species == 'setosa', ]

what just happened?

# 1. the iris$Species column is compared to the string 'setosa'
iris$Species == 'setosa'
# this yields a logical vector EXACTLY the same length as iris$Species, i.e. the number of rows

# 2. this vector is then used for extracting rows
ind <- iris$Species == 'setosa'
iris[ind, ]

conceptually, this comes down to get me all rows, for which the column Species has the value "setosa"
and since logical expressions can be combined

iris[iris$Species == 'setosa' & iris$Sepal.Length >= 5, ]

of course, this can be combined with selecting columns

iris[iris$Species != 'virginica' & iris$Sepal.Length >= 5, 'Petal.Width']

replaceemnt in a data frame

all the subsetting mechanics can be used for very specific replacement

# suppose we want to omit some outliers from our calculations
# we can replace some data points with NA
iris[iris$Sepal.Length > 6 | 
       iris$Sepal.Length < 5 & 
       iris$Species == 'versicolor', 
     'Petal.Width'] <- NA
iris

operations with columns

data frames provide a convenient way of processing data
columns are vectors of equal length
operations involving several columns will always be performed rowwise (no recycling)

head(mtcars)

# compute how much horse power (hp) a car has per cylinder (cyl)
mtcars$hp / mtcars$cyl

the resulting vector can easily be saved as a new column

mtcars$hp.per.cyl <- mtcars$hp / mtcars$cyl
head(mtcars)

the same can be done with the string notation with [[

mtcars[['hp.per.cyl']] <- mtcars[['hp']] / mtcars[['cyl']]

adding new columns is often a useful way to add information without losing data

# rather than substituting NAs for numbers, 
# let's add a logical column that will flag certain observations as outliers
iris$outlier <- 
  ifelse(iris$Sepal.Length > 6 | 
           iris$Sepal.Length < 5 & 
           iris$Species == 'versicolor', 
         TRUE, FALSE)
iris

ifelse is a very useful function; study it carefully

A data frame is a special kind of list made up of atomic vectors.
Extracting rows and columns is robust.
Operations on columns are easy and reliable.

Factor

remember the factor thing in iris?

str(iris)

let's have a closer look

head(iris$Species)

str(iris$Species)

a typical reaction to factors

a factor is a special kind of vector, something between an integer and a character
it is a vector of integer values
it has a levels attribute
every level is a unique character string mapped to one of the integer values
see state.region for an example factor
factors are created with factor

ch <- rep(c('c', 'b', 'a'), 3)
f <- factor(ch)

ch

f

note that levels are sorted matched to sorted values

factor(1:3, labels = c('a', 'b', 'c'))

factor(3:1, labels = c('a', 'b', 'c'))

factor(1:3, labels = c('c', 'b', 'a'))

factor(3:1, labels = c('c', 'b', 'a'))

creating factors de novo is rather rare
vectors can also be converted to factors with as.factor

as.factor(ch)

test if a vector is a factor with is.factor
it is common behavior in R to automatically convert character vectors to factors
this is especially important when creating data frames
data.frame has a stringsAsFactors argument just for this purpose
beware numbers encoded as strings

nch <- as.character(rnorm(10))

nch

as.numeric(nch)

nf <- as.factor(nch)
nf

as.numeric(nf)

as.numeric just strips the levels attribute

\newline

Wherefore the factors?

Factors store categorical data. It used to be a very useful way to encode what is called discreet variables. Today much of what factors do can be don eby character vectors just as well but many procedures rely on a factor input, so they are not going away any time soon.

It may seem superfluous now but we will get to places where factors are helpful and even necessary.

I promise.

Factors are confusing so it's all right to be confused.

olobiolo/Rdlazer documentation built on Aug. 6, 2022, 11:37 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

olobiolo/Rdlazer
Resources for Teaching R

In olobiolo/Rdlazer: Resources for Teaching R

Vector

attributes

Every vector has a type and a length. It can also have attributes.

Vectorized Operations

Extraction/Subsetting

Replacement

Matrix

subsetting a matrix

Array

List

subsetting a list
it's a bit of a mess...

A list is a kind of vector that can hold any item.
Subsetting a list with `[` yields a list of any length. Using `[[` or `$` yields a single list item.

R's Syntactic Flexibility

Data Frame

subsetting a data frame

replaceemnt in a data frame

operations with columns

A data frame is a special kind of list made up of atomic vectors.
Extracting rows and columns is robust.
Operations on columns are easy and reliable.

Factor

Factors are confusing so it's all right to be confused.

R Package Documentation

Browse R Packages

We want your feedback!

olobiolo/Rdlazer Resources for Teaching R

In olobiolo/Rdlazer: Resources for Teaching R

Vector

attributes

Every vector has a type and a length. It can also have attributes.

Vectorized Operations

Extraction/Subsetting

Replacement

Matrix

subsetting a matrix

Array

List

subsetting a list it's a bit of a mess...

A list is a kind of vector that can hold any item. Subsetting a list with `[` yields a list of any length. Using `[[` or `$` yields a single list item.

R's Syntactic Flexibility

Data Frame

subsetting a data frame

replaceemnt in a data frame

operations with columns

A data frame is a special kind of list made up of atomic vectors. Extracting rows and columns is robust. Operations on columns are easy and reliable.

Factor

Factors are confusing so it's all right to be confused.

R Package Documentation

Browse R Packages

We want your feedback!

olobiolo/Rdlazer
Resources for Teaching R

subsetting a list
it's a bit of a mess...

A list is a kind of vector that can hold any item.
Subsetting a list with `[` yields a list of any length. Using `[[` or `$` yields a single list item.

A data frame is a special kind of list made up of atomic vectors.
Extracting rows and columns is robust.
Operations on columns are easy and reliable.