knitr::opts_chunk$set(warning = FALSE, message = FALSE)
options(knitr.table.format = 'markdown')

The krsp package offers a set of tools for working with the KRSP MySQL database in R. Functions fall into two categories: standard database queries and helper functions for writing custom queries. The former require very little knowledge of R or the database to run. For writing your own queries, this package takes a SQL-free approach, instead using the dplyr MySQL interface. Therefore, if you're not already familiar with dplyr, it will be useful to read through one of the many tutorials online or look through the vignette with vignette("introduction", "dplyr"). I provide an extremely brief introduction below.

dplyr basics

dplyr is a package designed for manipulating data in data frames quickly and easily. The functions in this package are more consistent, intuitive, and predictable than the equivalent base R functions. There are five core verbs (aka functions) for working with single data frames:

Here's a brief demonstration of how these verbs work.

library(dplyr)
mtc <- tbl_df(mtcars)
mtc <- select(mtc, mpg, cyl, horse_power = hp)
mtc
filter(mtc, cyl == 4, horse_power < 70)
arrange(mtc, desc(cyl), horse_power)
mutate(mtc, mpg_per_hp = mpg / horse_power)
summarize(mtc, mean_mpg = mean(mpg), num_records = n())

Note that in all cases the first arguement is a data frame, the subsequent arguements describe what you want to do with the data frame, and the result is always another data frame. Also, columns in the data frame can be referenced directly without quotes.

Chaining operations with %>%

The pipe operator, %>%, is used to chain multiple operations together in a way that avoids multiple, nested function calls. It can be used with any R functions, but it is particularly useful when working with dplyr. Specifically, x %>% f(y) is equivalent to f(x, y). Let's look at a simple example, the following two operations are equivalent:

mean(sqrt(exp(mtc$mpg)))
mtc$mpg %>% 
  exp %>% 
  sqrt %>% 
  mean

However, the approach using %>% is much easier to read. Let's look at a dplyr example:

mtc %>% 
  filter(cyl == 4, horse_power < 70) %>% 
  mutate(mpg_per_hp = mpg / horse_power) %>% 
  arrange(mpg_per_hp)

Grouped operations

Often you'll want to apply some operation independently to groups of rows in a data frame. The dplyr function group_by() splits a data frame into groups such that all subsequent operations occur within these groups. group_by works with all of the verbs described above, but it's most commonly used with summarise()

mtc %>% 
  group_by(cyl) %>% 
  summarise(mean_mpg = mean(mpg))

Joining tables

Often you'll have data spread across multiple data frames and you'll want to join those data frames together based on some common variable (called a key). dplyr provides a series of functions to combine tables together, they vary according to what happens if values of the key appear in one table, but not the other:

Here's a few simple examples taken from the UBC STAT545 website. First I create a couple simple tables to work with:

superheroes <- data.frame(
  name = c("Magneto", "Storm", "Mystique", "Batman", 
           "Joker", "Catwoman", "Hellboy"),
  alignment = c("bad", "good", "bad", "good",
                "bad", "bad", "good"),
  publisher = c("Marvel", "Marvel", "Marvel", "DC", 
                "DC", "DC", "Dark Horse Comics"),
  stringsAsFactors = FALSE
  )
superheroes

publishers <- data.frame(
  publisher = c("DC", "Marvel", "Image"),
  yr_founded = c(1934, 1939, 1992),
  stringsAsFactors = FALSE
  )
publishers

Now I join them:

# Hellboy (superhero) and Image (publisher) are lost
inner_join(superheroes, publishers)
# Hellboy (superhero) has no yr_founded and Image (publisher) is lost
left_join(superheroes, publishers)
# Hellboy (superhero) is lost and Image (publisher) has no superheros
right_join(superheroes, publishers)

Note that in the above examples, the join variable (i.e. the key) is inferred as the only variable in common between the tables: publisher. This can be made explicit with:

inner_join(superheroes, publishers, by = "publisher")

Or, if the tables use different names for the key variable:

publishers <- rename(publishers, pub = publisher)
inner_join(superheroes, publishers, by = c("publisher" = "pub"))

Working with the KRSP database

The functions within dplyr are usually applied to data frames; however, they can be applied to tables within a SQL database in almost exactly the same way. The only additional steps are to define a connection to the database (i.e. specify where the database is and your login credentials), a process addresssed in the next section. Under the hood, dplyr converts function calls to SQL code, but this process is invisible to the user.

For reference, the KRSP database has the following primary tables:

Connecting to the database

Before accessing tables in a MySQL database you need to create a connection object that defines the location and authentication details for the connection. The function krsp_connect() is designed to create this connection object, which you will then pass as a parameter to other functions in this package. I outline three different ways of connecting to the KRSP database for three common scenarios.

Local instance

The simplest situation is if you have a local copy of the database that you connect to with username root and no password. This is the default for krsp_connect() so creating a connection object is as simple as:

library(krsp)
con <- krsp_connect()

Remote instance

Things get a little more complicated if you want to connect to a remote KRSP database. This could either be a cloud database hosted by Amazon Web Services (AWS) or a database on a computer on your local network, which would be the setup at Squirrel Camp. In either case, you'll need to configure the database correctly so it accepts connections, and set up users, passwords, and privileges. Before you proceed, you'll need the following pieces of information:

With this information there are two ways to connect: the quick/wrong way and the correct/secure way.

The quick/wrong way

The easiest way to quickly connect to a remote database is to pass all the connection and authentication details directly to krsp_connect() as parameters. This is usually a terrible idea because your username and password will be stored in plain text and, in the long run, it will be more work because you'll have to type all the details in each time you connect. Here's how you do it:

con <- krsp_connect(host = "krsp.abc123.us-west-2.rds.amazonaws.com",
                    user = "msm",
                    password = "s0mepaSSword")

where the host, user, and password parameters are replaced with the correct values for your situation.

The one situation where this method may make sense is at Squirrel Camp. In this situation the database is isolated from the internet anyway, so it may be easiest to just create a single user (e.g. squirreler) with SELECT only privileges and no password. Then connecting is as simple as:

con <- krsp_connect(host = "192.168.0.1", user = "squirreler")

where 192.168.0.1 should be replaced with the IP address on the local network of the camp computer with the database.

The correct/secure way

The right way to connect to a remote database, especially in the cloud on AWS, is to specify the connection information in a my.cnf connection file. This is a file that MySQL references to retrieve the location of the database and your credentials. This avoids having to store sensitive information, such as your password, within your R code.

On Mac OS and Linux the connection file is ~/.my.cnf and on Windows it is located at C:/my.cnf. Open this file, or create it if it doesn't already exist, and enter the following text:

[krsp-aws]
username=msm
password=s0mepaSSword
host=krsp.abc123.us-west-2.rds.amazonaws.com
port=3306
database=krsp

Make sure you edit this to include your own username and password, and the correct host, which is the Public DNS of the AWS instance or the IP address of a remote computer. Multiple connection profiles can be set up with different names, here I've used krsp-aws for the name of this profile.

To reduce the risk of someone getting access to your database credentials, it's wise to set the permissions of the .my.cnf file so only you can read it. On Mac OS or Linux, open the Terminal and enter the following command:

sudo chmod 600 ~/.my.cnf

To create a connection object provide the name of the my.cnf profile to the group parameter of krsp_connect():

con <- krsp_connect(group = "krsp-aws")

Querying the database

Once you have created a connection to the database you can reference a table by name with the tbl() command as follows:

library(krsp)
con <- krsp_connect()
(trapping <- tbl(con, "trapping"))

Now you can treat the table just like a data frame! Let's say we want to know which squirrels are really fat (over 500 grams).

trapping %>% 
  filter(scale_weight > 500) %>% 
  select(squirrel_id, tagLft, tagRt, scale_weight) %>% 
  arrange(scale_weight)

Or, how about a multi-table query. Let's figure out the average litter size for each grid:

litter <- tbl(con, "litter")
juvenile <- tbl(con, "juvenile")
litter_size <- inner_join(litter, juvenile, by = c("id" = "litter_id")) %>% 
  group_by(grid, litter_id) %>% 
  # count() counts the number of records within each group, stores as variable n
  count() %>% 
  summarize(mean_size = mean(n)) %>% 
  arrange(mean_size)
collect(litter_size)

Looks like JO is the winner.

But what about SQL?

More comfortable with SQL than dplyr? Can't figure out how to convert a SQL query into R code? No worries, the krsp package has you covered. Use the function krsp_sql() to run a pure SQL query and get the results as a data frame. For example, we can calculate the number of squirrels per grid.

sql <- "
  SELECT gr, COUNT(*) AS n_squirrels
  FROM squirrel
  GROUP BY gr 
  ORDER BY n_squirrels DESC;
  "
krsp_sql(con, sql) %>% knitr::kable()

Note that krsp_sql() will only permit SELECT queries to avoid users accidentally changing the database.

Potential issues (Important!)

dplyr is a great tool for working with the KRSP database directly in R. However, there are some aspects of dplyr that can lead to errors when working with SQL databases. Some of these errors and bugs can be quite subtle if you're not aware of what to look for. Read this section thoroughly to avoid these issues.

Laziness

One important feature of dplyr is that it tries to be as "lazy" as possible when dealing with databases. dplyr code is translated behind the scenes into an SQL query and executed only when required. This means:

This is nice because it means queries are performed on the database side, which is typically faster than R. Furthermore, queries to the database and downloading data are kept to an absolute minimum. However, there is one very important catch with lazy evaluation: functions that don't have SQL equivalents will cause errors because they can't be translated. For example, I'll try to use the R function quantile() to get the third quartile of squirrel weights on JO.

third <- trapping %>% 
  filter(gr == "JO") %>% 
  summarize(third = quantile(scale_weight, probs = 0.75, na.rm = TRUE))
third

There are two important things to note here. First, using quantile() leads to an error because no equivalent function exists in SQL. Second, and more dangerous, note that the error only gets raised when the results are explicitily printed with the second line of code. The first line just constructs the query and doesn't actually run anything, so it doesn't discover that there's an error.

To avoid these issues, use the function collect() to force dplyr to run a query and return the results as a data frame.

third <- trapping %>% 
  filter(gr == "JO") %>% 
  collect() %>% 
  summarize(third = quantile(scale_weight, probs = 0.75, na.rm = TRUE))
third

In general, it's best to do filter()ing, select()ing, and joining on the database side to take advantage of the improved speed, but use collect() before using any R specific functions in a mutate() or summarize() statement. If you're ever in doubt about whether a function has a SQL equivalent, just use it then call collect() immeditately after to force an error.

# sd() has a SQL equivalent
trapping %>% 
  summarize(test = sd(scale_weight)) %>% 
  collect()
# but median() doesn't
trapping %>% 
  summarize(test = median(scale_weight)) %>% 
  collect()

Case sensitivity

R is case sensistive, MySQL is not. The can cause problems because MySQL allows you to use mixed case when defining variables. For example, a variable can be called LocX, but referred to with locx, LocX, or LOCX. dplyr isn't so sloppy, if a variable is called LocX in the database, it must be referred to as LocX. This also applies to table names, for example, note the mixed case in FLastAll.

To see this issue in action, note that the dbaMidden table uses locX, while the census table uses locx.

# fine
tbl(con, "dbaMidden") %>% 
  select(locX) %>% 
  head
# error
tbl(con, "dbaMidden") %>% 
  select(locx) %>% 
  head
# error
tbl(con, "census") %>% 
  select(locX) %>% 
  head
# fine
tbl(con, "census") %>% 
  select(locx) %>% 
  head

To avoid this, just make sure you refer to variables with the correct case. Even better, only use lowercase when defining variables in the database.

Record limit

By default, dplyr only pulls a maximum of 100,000 from an SQL query. In most cases, this should be sufficient, howver, if you suspect your query may return more records than this, use n = Inf in collect(). For example, to pull the entire trapping table:

# only the first 100,000 rows are returned
tbl(con, "trapping") %>% 
  collect() %>% 
  nrow()
# return all rows
tbl(con, "trapping") %>% 
  collect(n = Inf) %>% 
  nrow()


mstrimas/krsp documentation built on May 23, 2019, 8:16 a.m.