knitr::opts_chunk$set(warning = FALSE, message = FALSE) options(knitr.table.format = 'markdown')
The krsp
package offers a set of tools for working with the KRSP MySQL database in R.
Functions fall into two categories: standard database queries and helper functions for writing custom queries. The former require very little knowledge of R or the database to run. For writing your own queries, this package takes a SQL-free approach, instead using the dplyr
MySQL interface. Therefore, if you're not already familiar with dplyr
, it will be useful to read through one of the many tutorials online or look through the vignette with vignette("introduction", "dplyr")
. I provide an extremely brief introduction below.
dplyr
basicsdplyr
is a package designed for manipulating data in data frames quickly and easily. The functions in this package are more consistent, intuitive, and predictable than the equivalent base R functions. There are five core verbs (aka functions) for working with single data frames:
select()
: select a subset of columns in a data frame.filter()
: select a subset of rows in a data frame.arrange()
: sort the rows in a data frame by the values in a column.mutate()
: add new columns to a data frame that are functions of existing columns.summarise()
: collapse a data frame and summarize data.Here's a brief demonstration of how these verbs work.
library(dplyr) mtc <- tbl_df(mtcars) mtc <- select(mtc, mpg, cyl, horse_power = hp) mtc filter(mtc, cyl == 4, horse_power < 70) arrange(mtc, desc(cyl), horse_power) mutate(mtc, mpg_per_hp = mpg / horse_power) summarize(mtc, mean_mpg = mean(mpg), num_records = n())
Note that in all cases the first arguement is a data frame, the subsequent arguements describe what you want to do with the data frame, and the result is always another data frame. Also, columns in the data frame can be referenced directly without quotes.
%>%
The pipe operator, %>%
, is used to chain multiple operations together in a way that avoids multiple, nested function calls. It can be used with any R functions, but it is particularly useful when working with dplyr
. Specifically, x %>% f(y)
is equivalent to f(x, y)
. Let's look at a simple example, the following two operations are equivalent:
mean(sqrt(exp(mtc$mpg))) mtc$mpg %>% exp %>% sqrt %>% mean
However, the approach using %>%
is much easier to read. Let's look at a dplyr
example:
mtc %>% filter(cyl == 4, horse_power < 70) %>% mutate(mpg_per_hp = mpg / horse_power) %>% arrange(mpg_per_hp)
Often you'll want to apply some operation independently to groups of rows in a data frame. The dplyr
function group_by()
splits a data frame into groups such that all subsequent operations occur within these groups. group_by
works with all of the verbs described above, but it's most commonly used with summarise()
mtc %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))
Often you'll have data spread across multiple data frames and you'll want to join those data frames together based on some common variable (called a key). dplyr
provides a series of functions to combine tables together, they vary according to what happens if values of the key appear in one table, but not the other:
inner_join(x, y)
: return all rows from x
where there are matching values in y
.left_join(x, y)
: return all rows from x
, regardless of whether or not there are matching values in y
. Rows in x
with no match in y
will have NA
values in the new columns.right_join(x, y)
: return all rows from y
, regardless of whether or not there are matching values in x
. Rows in y
with no match in x
will have NA
values in the new columns.full_join(x, y)
: return all rows from x
and y
.Here's a few simple examples taken from the UBC STAT545 website. First I create a couple simple tables to work with:
superheroes <- data.frame( name = c("Magneto", "Storm", "Mystique", "Batman", "Joker", "Catwoman", "Hellboy"), alignment = c("bad", "good", "bad", "good", "bad", "bad", "good"), publisher = c("Marvel", "Marvel", "Marvel", "DC", "DC", "DC", "Dark Horse Comics"), stringsAsFactors = FALSE ) superheroes publishers <- data.frame( publisher = c("DC", "Marvel", "Image"), yr_founded = c(1934, 1939, 1992), stringsAsFactors = FALSE ) publishers
Now I join them:
# Hellboy (superhero) and Image (publisher) are lost inner_join(superheroes, publishers) # Hellboy (superhero) has no yr_founded and Image (publisher) is lost left_join(superheroes, publishers) # Hellboy (superhero) is lost and Image (publisher) has no superheros right_join(superheroes, publishers)
Note that in the above examples, the join variable (i.e. the key) is inferred as the only variable in common between the tables: publisher
. This can be made explicit with:
inner_join(superheroes, publishers, by = "publisher")
Or, if the tables use different names for the key variable:
publishers <- rename(publishers, pub = publisher) inner_join(superheroes, publishers, by = c("publisher" = "pub"))
The functions within dplyr
are usually applied to data frames; however, they can be applied to tables within a SQL database in almost exactly the same way. The only additional steps are to define a connection to the database (i.e. specify where the database is and your login credentials), a process addresssed in the next section. Under the hood, dplyr
converts function calls to SQL code, but this process is invisible to the user.
For reference, the KRSP database has the following primary tables:
squirrel
: each squirrel gets a unique record in this table.trapping
: each record corresponds to a trapping event. Join to squirrel
by squirrel_id
.bevahiour
: each record corresponds to a behaviour observation. Join to squirrel
by squirrel_id
.litter
: each record corresponds to a litter for a given female. Join to squirrel
by squirrel_id
to find mother.juvenile
: each record corresponds to a juvenile squirrel within a given litter. Join to squirrel
by squirrel_id
to find juvenile, and to litter
by litter_id
.census
: the squirrel census is stored in this table, one record per squirrel at each census. Join to squirrel
by squirrel_id
.Before accessing tables in a MySQL database you need to create a connection object that defines the location and authentication details for the connection. The function krsp_connect()
is designed to create this connection object, which you will then pass as a parameter to other functions in this package. I outline three different ways of connecting to the KRSP database for three common scenarios.
The simplest situation is if you have a local copy of the database that you connect to with username root
and no password. This is the default for krsp_connect()
so creating a connection object is as simple as:
library(krsp) con <- krsp_connect()
Things get a little more complicated if you want to connect to a remote KRSP database. This could either be a cloud database hosted by Amazon Web Services (AWS) or a database on a computer on your local network, which would be the setup at Squirrel Camp. In either case, you'll need to configure the database correctly so it accepts connections, and set up users, passwords, and privileges. Before you proceed, you'll need the following pieces of information:
192.168.0.1
) or a host name (e.g. krsp.abc123.us-west-2.rds.amazonaws.com
).With this information there are two ways to connect: the quick/wrong way and the correct/secure way.
The easiest way to quickly connect to a remote database is to pass all the connection and authentication details directly to krsp_connect()
as parameters. This is usually a terrible idea because your username and password will be stored in plain text and, in the long run, it will be more work because you'll have to type all the details in each time you connect. Here's how you do it:
con <- krsp_connect(host = "krsp.abc123.us-west-2.rds.amazonaws.com", user = "msm", password = "s0mepaSSword")
where the host
, user
, and password
parameters are replaced with the correct values for your situation.
The one situation where this method may make sense is at Squirrel Camp. In this situation the database is isolated from the internet anyway, so it may be easiest to just create a single user (e.g. squirreler
) with SELECT
only privileges and no password. Then connecting is as simple as:
con <- krsp_connect(host = "192.168.0.1", user = "squirreler")
where 192.168.0.1
should be replaced with the IP address on the local network of the camp computer with the database.
The right way to connect to a remote database, especially in the cloud on AWS, is to specify the connection information in a my.cnf
connection file. This is a file that MySQL references to retrieve the location of the database and your credentials. This avoids having to store sensitive information, such as your password, within your R code.
On Mac OS and Linux the connection file is ~/.my.cnf
and on Windows it is located at C:/my.cnf
. Open this file, or create it if it doesn't already exist, and enter the following text:
[krsp-aws] username=msm password=s0mepaSSword host=krsp.abc123.us-west-2.rds.amazonaws.com port=3306 database=krsp
Make sure you edit this to include your own username and password, and the correct host, which is the Public DNS of the AWS instance or the IP address of a remote computer. Multiple connection profiles can be set up with different names, here I've used krsp-aws
for the name of this profile.
To reduce the risk of someone getting access to your database credentials, it's wise to set the permissions of the .my.cnf
file so only you can read it. On Mac OS or Linux, open the Terminal and enter the following command:
sudo chmod 600 ~/.my.cnf
To create a connection object provide the name of the my.cnf
profile to the group
parameter of krsp_connect()
:
con <- krsp_connect(group = "krsp-aws")
Once you have created a connection to the database you can reference a table by name with the tbl()
command as follows:
library(krsp) con <- krsp_connect() (trapping <- tbl(con, "trapping"))
Now you can treat the table just like a data frame! Let's say we want to know which squirrels are really fat (over 500 grams).
trapping %>% filter(scale_weight > 500) %>% select(squirrel_id, tagLft, tagRt, scale_weight) %>% arrange(scale_weight)
Or, how about a multi-table query. Let's figure out the average litter size for each grid:
litter <- tbl(con, "litter") juvenile <- tbl(con, "juvenile") litter_size <- inner_join(litter, juvenile, by = c("id" = "litter_id")) %>% group_by(grid, litter_id) %>% # count() counts the number of records within each group, stores as variable n count() %>% summarize(mean_size = mean(n)) %>% arrange(mean_size) collect(litter_size)
Looks like JO is the winner.
More comfortable with SQL than dplyr
? Can't figure out how to convert a SQL query into R code? No worries, the krsp
package has you covered. Use the function krsp_sql()
to run a pure SQL query and get the results as a data frame. For example, we can calculate the number of squirrels per grid.
sql <- " SELECT gr, COUNT(*) AS n_squirrels FROM squirrel GROUP BY gr ORDER BY n_squirrels DESC; " krsp_sql(con, sql) %>% knitr::kable()
Note that krsp_sql()
will only permit SELECT queries to avoid users accidentally changing the database.
dplyr
is a great tool for working with the KRSP database directly in R. However, there are some aspects of dplyr
that can lead to errors when working with SQL databases. Some of these errors and bugs can be quite subtle if you're not aware of what to look for. Read this section thoroughly to avoid these issues.
One important feature of dplyr
is that it tries to be as "lazy" as possible when dealing with databases. dplyr
code is translated behind the scenes into an SQL query and executed only when required. This means:
dplyr
.dplyr
keeps track of all the intermediate operations and only converts them to SQL and queries the database when results are explicitly requested.This is nice because it means queries are performed on the database side, which is typically faster than R. Furthermore, queries to the database and downloading data are kept to an absolute minimum. However, there is one very important catch with lazy evaluation: functions that don't have SQL equivalents will cause errors because they can't be translated. For example, I'll try to use the R function quantile()
to get the third quartile of squirrel weights on JO.
third <- trapping %>% filter(gr == "JO") %>% summarize(third = quantile(scale_weight, probs = 0.75, na.rm = TRUE)) third
There are two important things to note here. First, using quantile()
leads to an error because no equivalent function exists in SQL. Second, and more dangerous, note that the error only gets raised when the results are explicitily printed with the second line of code. The first line just constructs the query and doesn't actually run anything, so it doesn't discover that there's an error.
To avoid these issues, use the function collect()
to force dplyr
to run a query and return the results as a data frame.
third <- trapping %>% filter(gr == "JO") %>% collect() %>% summarize(third = quantile(scale_weight, probs = 0.75, na.rm = TRUE)) third
In general, it's best to do filter()
ing, select()
ing, and joining on the database side to take advantage of the improved speed, but use collect()
before using any R specific functions in a mutate()
or summarize()
statement. If you're ever in doubt about whether a function has a SQL equivalent, just use it then call collect()
immeditately after to force an error.
# sd() has a SQL equivalent trapping %>% summarize(test = sd(scale_weight)) %>% collect() # but median() doesn't trapping %>% summarize(test = median(scale_weight)) %>% collect()
R is case sensistive, MySQL is not. The can cause problems because MySQL allows you to use mixed case when defining variables. For example, a variable can be called LocX
, but referred to with locx
, LocX
, or LOCX
. dplyr
isn't so sloppy, if a variable is called LocX
in the database, it must be referred to as LocX
. This also applies to table names, for example, note the mixed case in FLastAll
.
To see this issue in action, note that the dbaMidden
table uses locX
, while the census
table uses locx
.
# fine tbl(con, "dbaMidden") %>% select(locX) %>% head # error tbl(con, "dbaMidden") %>% select(locx) %>% head # error tbl(con, "census") %>% select(locX) %>% head # fine tbl(con, "census") %>% select(locx) %>% head
To avoid this, just make sure you refer to variables with the correct case. Even better, only use lowercase when defining variables in the database.
By default, dplyr
only pulls a maximum of 100,000 from an SQL query. In most cases, this should be sufficient, howver, if you suspect your query may return more records than this, use n = Inf
in collect()
. For example, to pull the entire trapping table:
# only the first 100,000 rows are returned tbl(con, "trapping") %>% collect() %>% nrow() # return all rows tbl(con, "trapping") %>% collect(n = Inf) %>% nrow()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.