r Hm
is a package for the R programming language.
You don't need to be an R master to use r hm
, but there are some basic concepts from R
that you will need to learn, and ultimately, if you want to get really advanced, you'll
need to develop some R skills.
This document is a basic primer for R, which will teach you the basics you need to know in order to
make the most out of r hm
.
R code is made up of "expressions" like 2 + 2
, sqrt(2)
, or (x - mean(x))^2
.
As you can see, you can create very intuitive arithmetic expressions, like 5 / 2
or 3 * 3
.
However, the most common elements of R expressions are "calls" to functions.
A "function" in R is a pre-built bit of code that does something.
Most functions take one or more input arguments, and "return" some kind of output.
For example, the function sqrt()
takes a number as an input argument, and "returns" the square root of that number.
sqrt(2)
sqrt(9)
To "call" a function, we write the function's name, followed by parentheses (()
).
Any input arguments to the function must go inside the parentheses, separated by commas if there are more than one.
Here are some examples of common functions being "called" with zero or more input arguments:
abs(-3)
mean(1:5)
max(1, 5) # two arguments
c(1, 2, 3) # three arguments
Sys.time() # no arguments!
Different functions have different arguments they recognize, with specific names.
For example, the function log()
takes two arguments, called x
and base
.
Other functions can take any number of arguments, with any name.
You can learn about a function, including the arguments it accepts, by typing ?functionName
at the command line; for example, ?sqrt
or ?mean
.
If you see an argument called ...
, that tells you that the function can take any number of arguments.
We can explicitly "name" the function arguments we want by putting argname = argument
into our calls:
For example, you could say log(10, base = 2)
.
Named arguments are very useful when we are creating data, like vectors and data.frame
s (see below).
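As a small illustration of positional versus named arguments, the sketch below calls log() three ways. The variable names by_position, by_name, and reordered are made up for this example.

```r
# log() takes arguments called x and base.
by_position <- log(8, 2)             # matched by position: x = 8, base = 2
by_name     <- log(x = 8, base = 2)  # the same call with both arguments named
reordered   <- log(base = 2, x = 8)  # named arguments can appear in any order

by_position
by_name
reordered
```

All three calls compute the same thing, the base-2 logarithm of 8. Naming arguments mostly helps the reader, since the call then says what each value is for.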
Complex expressions might involve a large number of function calls, which can get tiresome to read (or write). For example, something like
log(round(sqrt(mean(x^2))), base = 2)
calls four functions!
An expression like that is a bit tricky to read, and it can be really easy to make a mistake where
you put the wrong number of parentheses.
As an alternative, R gives us the option of calling functions in a "pipe."
The way this works is that we use the "pipe" operator |>
, which takes an input on the left and "pipes" it into a function call on the right.
For example, we can rewrite the previous command as:
x^2 |> mean() |> sqrt() |> round() |> log(base = 2)
Much better!
To make things even cleaner, R will understand if you spread your expressions across multiple lines, by putting a new line after each |>
, or function argument:
x^2 |>
  mean() |>
  sqrt() |>
  round() |>
  log(base = 2)

max(sqrt(2),
    log(2),
    exp(2),
    pi / 2)
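To check that the nested and piped forms really are the same computation, here is a minimal sketch. The vector x and the names nested and piped are made up for illustration.

```r
x <- c(3, 4, 5, 6)  # hypothetical example data

# Nested form: read from the inside out.
nested <- log(round(sqrt(mean(x^2))), base = 2)

# Piped form: read from left to right.
piped <- x^2 |> mean() |> sqrt() |> round() |> log(base = 2)

identical(nested, piped)
```

Note that x^2 is evaluated before the first pipe, because ^ binds more tightly than |> in R.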
When coding in R, you'll often want to "save" data or other objects so you can reuse them.
We do this by "assigning" something (often the result of a function) to a "variable".
This is done using the assignment operators, either <-
or ->
.
A variable name can be any combination of upper and lowercase letters.
Let's calculate the square-root of two and save it to a variable:
tworoot <- sqrt(2)
We can then reuse that value as many times as we want:
tworoot^2
tworoot * 2
c(tworoot, tworoot)
You can also assign from left to right, using ->
.
This is useful in combination with pipes:
tworoot |> exp() |> round() -> newvalue
Note your variable names can also include _ , . , or numeric digits, as long as the name doesn't begin with a digit or an underscore (a name may begin with . only if the next character isn't a digit).
For example, X1
or my_name
are valid names---but not 2X
.
In R, there are two fundamental data structures that are used all the time: vectors and data.frames.
In R, the basic units---the atoms, if you will---of information are called "atomic" vectors. There are three basic atomic data types:
numeric values: 3, 4.2, -13, 254.30
character values: "note", "a", "do, a dear, a female dear"
logical values: TRUE, FALSE
You might be wondering, why are we calling these basic atoms "vectors"? Well, in R, the basic atomic data types are always considered a collection of ordered values. These ordered collections are called vectors. In the simple examples above, each vector only had a single value, so it just looks like one value---single values like this are often called "scalars". However, R doesn't really distinguish between scalars (single values) and vectors (multiple values)---everything is always a vector. (Still, we sometimes refer to length-1 vectors as scalars.)
To make a vector from scratch in R, use c()
, like so:
c(1, 2, 3)
c("Bach", "Mozart", "Beethoven", "Brahms")
c(TRUE, FALSE)
c(32.3)
32.3
In this example, we've created five vectors:
The first is a numeric vector of length 3.
The second is a character vector of length four (composers).
The third is a logical vector of length 2.
The last two are numeric vectors of length 1: c(32.3) and 32.3 are the same thing, a vector of length 1.
Notice that vectors can't mix-and-match different data types, which makes sense because a vector is a single type of thing.
But this means that commands like c(3, "a")
will actually create a character
vector, where the 3
is forced to be a character ("3"
).
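The coercion described above can be inspected directly. This is a minimal sketch; class() is a base R function that the text hasn't introduced, and the name mixed is made up for the example.

```r
mixed <- c(3, "a")      # a number and a string combined in one vector
class(mixed)            # "character": the whole vector has a single type
mixed                   # the 3 has become the string "3"

class(c(1, 2, 3))       # "numeric"
class(c(TRUE, FALSE))   # "logical"
length(32.3)            # 1: a "scalar" is just a vector of length 1
```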
Having everything be a vector all the time is very useful, because it allows us to think of and
use collections of data as a single thing.
If I give you, say, ten thousand numbers, you don't have to worry about manipulating ten thousand things:
rather, you just work with one thing: a vector, which happens to be of length 10,000.
In R, we call this vectorization---generally, in R and in r hm
we will constantly
be taking advantage of vectorization to make our lives super easy!
For an example of vectorization, watch this:
c(1, 1, 2, 3, 5, 8, 13, 21) * 2
We created two numeric vectors, the first eight Fibonacci numbers and the single value 2, and multiplied them together! Notice that the entire Fibonacci vector is multiplied by two! We don't have to worry about multiplying each number of the vector; it's done for us.
There are two ideal circumstances for working with vectors.
In the first case, we work with multiple vectors that are all the same length, so that each value in each vector is "lined up" with a value in the other vectors. If we, for example, add two such vectors together, each "lined up" pair of numbers is added:
c(1, 2, 3) + c(5, 4, 3)
paste(c('a', 'b', 'c'), c(1, 2, 3))
In the second case, one of the vectors is length-1 (a "scalar"). In this case, the scalar value is paired with each value in the longer vector (as in the Fibonacci example above).
c(1, 2, 3) + 5
paste(c('a', 'b', 'c'), 1)
What happens if we have vectors that are longer than one, but are not the same length? Well, R will generally attempt to "recycle" the shorter vector (that is, repeat it) as necessary to match the length of the longer vector. If the length of the shorter vector evenly divides the length of the longer vector, you generally won't have a problem:
c(1, 2, 3, 4) * c(2, 3)
If the division is not perfect, R will still "recycle" the shorter vector, but you'll get a warning:
c(1, 2, 3, 4) * c(2, 3, 4)
You see the warning message R gave us?
"longer object length is not a multiple of shorter object length"
That's R telling us that we've got an obvious mismatch in the lengths of our vectors.
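One way to picture recycling is to write out the repetition by hand with rep(). The sketch below does this for the evenly dividing case; the variable names are made up for the example.

```r
short <- c(2, 3)
long  <- c(1, 2, 3, 4)

recycled <- long * short                  # R repeats short behind the scenes
explicit <- long * rep(short, times = 2)  # the same thing spelled out: c(2, 3, 2, 3)

recycled
identical(recycled, explicit)
```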
Generally, it is best to work with vectors that are all the same length and/or scalar values (length-1 vectors), so you can avoid worrying about how exactly R is "recycling" values. This brings us to...
Factors are a useful modification of character
vectors, which keep track of all the possible values ("levels")
you expect in your data, even when some of those levels are missing from the vector.
This is mainly useful when we are counting data with table()
.
For example, let's consider the built-in R object called letters
:
letters
What happens if we call table()
on letters
?:
table(letters)
Every letter appears once in the table, duh! What if we randomly sample a handful of letters and table the result?
sample(letters, 15, replace = TRUE) |> table()
Notice that not all the letters appear in the table: if a letter never appears in the sample, it doesn't get counted.
Let's try something new: before sampling, I will call the command factor()
on letters
:
factor(letters) |> sample(15, replace = TRUE) |> table()
Ah! Now our table includes all possible letters, even though many of them appear 0
times.
So how does this work?
Well the factor()
function looks at a character
vector and outputs a new "factor" vector.
The factor vector acts just like a character
vector, except it remembers all the unique values,
or "levels", in the vector:
factor(letters)
Even if we remove some values from the factor vector, the vector will "remember" these levels. The factor will also remember the order of the levels, so you can make tables ordered the way you want them.
You can access, or set, the levels of a factor using the levels()
function, or with the levels
argument to the factor()
function itself.
Maybe we want to tabulate the letters, but put the vowels first:
factor(letters, levels = c('a', 'e', 'i', 'o', 'u', "b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "y", "z")) |> sample(15) |> table()
Note that if a character vector contains values that you don't include in your levels, those values will show up as NA in the resulting factor, and you may see warnings like "invalid factor level, NA generated."
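A minimal sketch of both behaviours, a level that never occurs and a value that isn't among the levels, using made-up data (the names notes and f are hypothetical):

```r
notes <- c("a", "b", "z", "a")                  # hypothetical data
f <- factor(notes, levels = c("a", "b", "c"))

levels(f)   # "a" "b" "c": the level "c" is remembered even though it never occurs
f           # "z" is not among the levels, so it shows up as NA
table(f)    # counts: a = 2, b = 1, c = 0; the NA is left out by default
```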
Data frames are the heart and soul of R.
A data.frame
is simply a collection of vectors that are all the same length---ideal for vectorized operations!
The vectors in a data.frame
are arranged as columns in a two-dimensional table.
Let's make a data frame by feeding some vectors to the data.frame() function:
X <- c("C", "D", "E", "F", "G", "A", "B", "C")
Y <- c(0, 2, 4, 5, 7, 9, 11, 12)
Z <- c("P1", "M2", "M3", "P4", "P5", "M6", "M7", "P8")
df <- data.frame(X, Y, Z)
df
Notice that each of your columns/vectors can be a different type, with no problem.
Also notice that each column has a name; we can inspect these names using the colnames()
function.
colnames(df)
Or change them:
colnames(df) <- c('Letters', "Semitones", "Intervals")
df
Finally, it's also possible to assign the column name we want when creating the data frame:
data.frame(Letters = X, Semitones = Y, Intervals = Z)
Remember, the vectors in a data.frame
must all be the same length.
If you try to make a data.frame
with vectors that don't match in length, you'll get an error "arguments imply differing number of rows
."
The one exception is that you can call data.frame
with some scalar single values, which will be automatically recycled to match
the length of the other vectors.
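For instance, in the sketch below a single value is recycled down a whole column. The column names and the name scale_df are made up for illustration.

```r
scale_df <- data.frame(Letters   = c("C", "D", "E"),
                       Semitones = c(0, 2, 4),
                       Key       = "C major")  # length-1 value, recycled to three rows

scale_df
nrow(scale_df)
```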
We often want to access the columns/vectors held in a data frame.
We can do this several ways.
One approach is with the $
operator, combined with the name of the column we want.
For example, we can get the Letters
column from the data frame we made above using df$Letters
.
Often, we'll want to write code that uses a bunch of different columns from the same data.frame---in fact, this is the main thing we do most of the time in R!
To avoid writing df$
over and over again, we can use the with()
function.
with()
allows us to drop "inside" our data.frame
, where our R commands can "see" the columns as variables:
with(df, paste(Intervals, Semitones, sep = ' = '))
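To compare the two approaches side by side, here is a small self-contained sketch. It builds its own three-row data frame, and the names by_dollar and by_with are made up for the example.

```r
df <- data.frame(Letters   = c("C", "D", "E"),
                 Semitones = c(0, 2, 4))

# With $, the data frame has to be named every time a column is used:
by_dollar <- paste(df$Letters, df$Semitones, sep = " = ")

# Inside with(), the column names can be used directly:
by_with <- with(df, paste(Letters, Semitones, sep = " = "))

by_with
identical(by_dollar, by_with)
```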
Sometimes we'll encounter data points which are irrelevant, meaningless, or "not applicable."
In other cases, there may be relevant data that is "missing."
R provides two distinct ways to represent missing/irrelevant data: NULL
and NA
.
NULL
is a special R object/variable, which is used to represent something that is totally missing or empty.
NULL
has no length (length(NULL) == 0
) and no value. It cannot be indexed. Many functions will give an error if passed a NULL
.
NA
is quite different than NULL
.
Any atomic vector can have an NA
value at any (or all) indices; in fact, you can have vectors of NA
values.
The NA
values are still "values" in a vector, but they are used to indicate when there are values that are missing or problematic.
Passing a vector with NA
values to most functions does not lead to an error, though you'll often get a warning message instead.
For example, consider what happens if we apply the command as.numeric()
to the following strings:
numbers <- as.numeric(c("1", "2", "apple", "4.2"))
numbers
Three of the strings in this vector are converted to numbers without a problem,
but the string "apple"
makes no sense as a number.
So what does R do?
It converts the three strings to numbers, just like as.numeric()
is supposed to,
but the "apple"
string appears as NA
in the output.
We also get a warning message: NAs introduced by coercion
.
You might see that warning sometimes, so now you know what it means!
What would happen if we tried applying a different function onto our vector with an NA
?
sqrt(numbers)
The sqrt()
function has no problem taking the square-roots of the three numbers, and it simply "propagates" the NA
value in its input through to its output.
The "propagation" of missing values is a very useful feature in R:
it makes sure that we keep track of what data is missing, while keeping our vectors all their original lengths.
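The sketch below shows a few ways to inspect and work around missing values. It uses is.na(), suppressWarnings(), and the na.rm argument of mean(), which are all base R but are not introduced in the text above.

```r
# suppressWarnings() hides the "NAs introduced by coercion" warning here.
numbers <- suppressWarnings(as.numeric(c("1", "2", "apple", "4.2")))

is.na(numbers)               # FALSE FALSE TRUE FALSE: marks the missing value
length(sqrt(numbers))        # 4: the NA keeps its place in the output
mean(numbers)                # NA: one missing value makes the mean unknown
mean(numbers, na.rm = TRUE)  # the NA is dropped before averaging

length(NULL)                 # 0: NULL, by contrast, holds no values at all
```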
getwd() --- Get R's current working directory.
setwd() --- Set R's working directory.
summary() --- Summarize the contents of an R object.
sort() --- Put values of a vector into ascending order.
  Use decreasing = TRUE for decreasing order.
rev() --- Reverse the order of a vector.
rep() --- Repeat a vector.
unique() --- Returns only the unique values of a vector.
x %in% y --- Which elements of the vector x appear in the vector y?
length() --- How long is the vector (or list())?
head() and tail() --- Return the first or last $N$ elements of a vector.
  Set the n argument to a natural number to control $N$.
x:y --- Create a sequence of integers from x to y.
  1:10 makes a vector of integers from one to ten.
seq() --- Create arbitrary sequences of numbers.
which() --- Which indices in a logical vector are true?
  which(c(TRUE, FALSE, TRUE)) returns c(1,3).
paste() --- Paste together multiple character strings.
nchar() --- Counts how many characters there are in each string of a character vector.
x + y --- Addition; $x + y$.
x - y --- Subtraction; $x - y$.
-x --- Negation; $-x$.
x * y --- Multiplication; $xy$.
x^y --- Exponentiation; $x^y$.
  Fractional exponents give roots, e.g. x^(1/3); $x^{\frac{1}{3}}$.
x / y --- Real division; $\frac{x}{y}$.
x %/% y --- Integer (floor) division; $\lfloor \frac{x}{y} \rfloor$.
x %% y --- x modulo y; $x \mod y$.
diff(x) --- Calculates the differences between consecutive values in a numeric vector.
  diff(c(5, 3)) is the same as 3 - 5.
sqrt(x) --- Square-root of numbers; $\sqrt{x}$.
abs(x) --- Absolute value of numbers; $|x|$.
round(x) --- Round number to nearest integer; $\lfloor x \rceil$.
log(x) --- Log of number (natural log by default); $\log(x)$.
sign(x) --- Sign (1, -1, or 0) of x; $\text{sgn}\ x$.
sum(x) --- The sum of a numeric vector.
max(x) --- The maximum value in a numeric vector.
min(x) --- The minimum value in a numeric vector.
range(x) --- The minimum and maximum values of a numeric vector.
  diff(range(x)) gives the distance between them.
mean(x) --- The arithmetic mean of a numeric vector.
median(x) --- The median of a numeric vector.
quantile(x) --- Other distribution quantiles of a numeric vector.
sample() --- Takes a random sample from a vector. Can also be used to randomize the order of a vector.
table() --- Tabulate all unique values in a vector, or cross-tabulate across multiple vectors.
  In r hm, you should use the similar count() instead!

sum((1:100) > 55)
sum(letters %in% c('a', 'e', 'i', 'o', 'u'))
mean((1:100) > 55)
sum(letters %in% c('a', 'e', 'i', 'o', 'u'))
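The sum() and mean() calls above rely on a fact worth spelling out: in arithmetic, R treats TRUE as 1 and FALSE as 0. So summing a logical vector counts its TRUE values, and averaging it gives the proportion that are TRUE. A minimal sketch, with made-up variable names over_55 and vowels:

```r
over_55 <- (1:100) > 55   # logical vector of length 100

sum(over_55)    # 45: how many values are TRUE
mean(over_55)   # 0.45: the proportion of values that are TRUE

vowels <- c("a", "e", "i", "o", "u")
sum(letters %in% vowels)   # 5: the number of vowels among the 26 letters
```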
To make your own function in R, you use the function
keyword, like so:
function(argument1, argument2) {
  # expressions to evaluate here, involving the arguments
  # (add as many arguments as you need)
}
For example, let's make a function that subtracts the mean from a vector of numbers.
We'll have one argument, which we'll call numbers
.
myfunc <- function(numbers) {
  mean <- mean(numbers)
  numbers - mean
}
We've created our function, and assigned it the name myfunc
, just like any other assignment.
Let's try it out:
myfunc(1:9)
Notice that the last expression in your function definition is the value that gets "returned" by the function.
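As a small illustration of the "last expression is returned" rule, the sketch below writes the same function twice: once relying on the last expression, and once with an explicit return(). return() is base R but is not introduced in the text, and the names center and center_explicit are made up for this example.

```r
# Implicit return: the value of the last expression is the result.
center <- function(numbers) {
  numbers - mean(numbers)
}

# Explicit return: equivalent, just more verbose.
center_explicit <- function(numbers) {
  return(numbers - mean(numbers))
}

center(1:9)
identical(center(1:9), center_explicit(1:9))
```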
If you are feeling lazy, you can also define a function with a few fewer keystrokes, using the shorthand \()
instead of function()
.
For example,
\(x) x + 1
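A short sketch of the shorthand in use. The name add_one is made up, and sapply() is a base R function (not covered in this primer) that applies a function to each element of a vector.

```r
add_one <- \(x) x + 1   # same as function(x) x + 1
add_one(1:3)            # 2 3 4

# Shorthand functions are handy as arguments to other functions:
sapply(c(1, 4, 9), \(x) sqrt(x) + 1)   # 2 3 4
```

The \() shorthand, like the |> pipe, requires R version 4.1 or later.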
TBA