In mwhalen18/krspfecals: KRSP Auxilary Fecal Database

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(krspfecals)
library(RMySQL)
library(krsp)

To connect to the fecal database instance, you need to first create a connection object which defines the location and credentiuals for the connection. The base engine for doing this is the DBI package in R. Other packages like RMySQL are built on top of this package and are therefore the base architectures we need. For a while, the dbplyr package offered some source functions like dbplyr::src_dbi to make the database connection objects easier to work with in dplyr. This is what the original krsp package is built with. However, these functions are either deprecated or superceded and will likely be phased out in future CRAN releases. As a result, krspfecals is built with a slightly more fundemental architecture. I outline this in a bit more detail later on.

All functions for working with the fecal database begin with fecal and are named using lowerCamel notation.

We can use the function fecalConnect to create a connection object which will then be passed as a parameter into other helper functions.

con = fecalConnect(MySQL(), group = "krspfecals-aws")

How to connect to the database?

I am repeating what is already laid out in the krsp Vignettes...The best way to connect to the database is to create secure credentials on your local machine. There are a few ways to do this that are secure and repeatable. Given that most people will (should) have setup a group in their my.cnf file, I will outline that one first.

Creating groups in .my.cnf

A .cnf file is a configuration file that MySQL reads through. They live in a few different locations depending on whether they are global configurations or user specific configurations. You can read more here (https://dev.mysql.com/doc/refman/8.0/en/option-files.html). This allows you to specify MySQL credentials permanently without embeddign them in your R code.

On Unix machines, the defualt location for this configuration file is ~/.my.cnf and on Windows it is C:/my.cnf (Note this is where programs will LOOK for the file. You can put it anywhere but you'll have to change the default file parameter in other programs).

To open this file (or create it if it doesn't exist) open your Terminal and type the following command

sudo nano ~/.my.cnf

This will open (or create) the file. Enter the following information

[krspfecals-aws]
user = "USER"
password = "PASSW0RD"
host = "hormonedb.cfvasuoihvfm.ca-central-1.rds.amazonaws.com"
port = 3306
database = "krspfecals"

This same structure can be used to create a security group for the krsp database as well (see the krsp vignettes).

When you are finished, press CTRL+X, then press Y to save. It's a good idea to then restart MySQL on your machine.

sudo systemctl restart mysql

As a note, remember that this will only work on your local machine. If you have a desktop and a laptop you'll need to configure your keys seprately on each machine.

Direct login.

If you do not set your credentials up in the way described above then you will have to login directly to the database. This is fine for one-off queries that will never be shared, but this is potentially very insecure. If you login directly, you will need to provide a few parameters:

username: a valid username that has been assigned privilages to access the databse.
password: password for the given username
host: the endpoint of the server host for the instance of the database you want to connect to. This is either an IP address (eg 192.222.0.1 (IPv4) or FE80:CD00:0000:0CDE:1257:0000:211E:729C (IPv6)) or URL endpoint if you are connecting the an AWS instance.

NOTE: if you do not specify a host, it will automatically default to the AWS host

Database engine -- NItty Gritty (Don't Read if you don't care)

I want to lay out some of the underlyiung nuts and bolts of the engine so you can understand how it works and why it works a little different from the krsp database.

The database is built with MySQL, a simple relational database management system. R has a lot of ways of connecting to and interacting with SQL databases. The fundemental engine behind most of these methods is the DBI package. the RMySQL package optimizes DBI functions for working with MySQL in R. The dbplyr package acts as a sort of middle-man package between dplyr and DBI. For a while, DBI objects were not directly compatible with dplyr and so dbplyr implemented a src_dbi function to force compatibility.

The original krsp package was built on top of these src_ functions but these have since been superceded. Now, DBI objects are easily passed to dplyr so there is no need for these functions any more. They still exist but are no longer maintained and will likely be phased out in future CRAN releases.

Without dipping too heavily into the nitty-gritty here, the main difference is that src_dbi converted an S4 DBI object to an S3 object.

conn = krsp_connect(RMySQL::MySQL(), group = "krsp-aws")
typeof(conn)

class(conn)

inherits(conn, "DBIConnection")

We see here that our krsp connection is an S3 list type and it inherits methods from that src_ family I talked about earlier.

Compare this to the result of our fecal connection

con = fecalConnect(RMySQL::MySQL(), 
                     group = "krspfecals-aws")
typeof(con)

class(con)

inherits(con, "DBIConnection")
inherits(con, "MySQLConnection")

So we see the fecal connection is of the "Super Class" DBI Connection which is inherited by the MySQLConnection type. The result is that this opens us up to use more methods on our object. While, yes, its nice to stick to methods that are clearly defined by the package, for those used to working with databases this will avoid headaches caused by inheritance errors. Here's an example.

#We can use the built in functions to check which tables are in our krsp database
krsp_tables(conn)

#But if we wanted to use one of the standard functions to do the same thing...
DBI::dbListTables(conn)
RMySQL::dbListTables(conn)
dplyr::db_list_tables(conn)

#only one that works
dplyr::src_tbls(conn)

#This method is ONLY defined for the src_ class which is no longer maintained
?dbplyr::src_dbi #You can run this and see it is superceded

So, in order to ensure fitire compatibility with the base engines, krspfecals is built as an extension of the existing packages, rather than as a wrapper around them. This has a few side effects:

1. It's faster. Marginally for our purposes, but if we ever have millions of rows of data it'll maintain speed.
2. It will avoid any breakage caused by changes to the packages it is built on. So long as DBI and RMySQL maintain their base classes we should be okay.
3. krsp functions will NOT work on fecal connection objects and vice-versa. Yes, fewer people will use the fecal database package than the krsp package so the cross-compatibility is a minor issue, but it would be nice to have cross-compatibility. This will have MINOR effects for day to day use, but if you get sloppy and start mixing methods you'll run into unexpected troubles.
4. This is unrelated but the more auxillary packages and databases we have, we lose the magic of SQL. If your data are not of the same source (same host endpoint) then you can't use SQL joins without creating temporary tables. This is to say we should advocate for more centralized data within the same database.

Working with the database

More features will be added to the package to include handy functions for cleaning and summarizing data. I also plan to add some handy helper functions that I use a lot with the krsp package (like a function to automatically calculate local densities and functions to automatically generate summary data).

For now, the data are structures such that there is one main pipeline that can be modified ad libitum to suit your needs.

The data

We can easily see which data are currently housed in the database. I'll also demonstrate the benefit of the low-level architecture here.

#Use a built in method
fecalTables(con)

#Use any other method
DBI::dbListTables(con)
RMySQL::dbListTables(con)
dplyr::db_list_tables(con)

If you're curious why this third one doesn't work on S3 or S4, this thread points out that Hadley Wickham suggested it would be deprecated soon (in 2017) and encourages using dbListTables.

Data Documentation

I am currently in the process of documenting all the tables with their data types and explnations. You can see what documentation is curently available using the fecalTableDoc() function:

This pulls straight from a JSON file so the formatting might be weird...But you'll figure it out.

fecalTableDoc(con) #If you forget a table

fecalTableDoc(con, "poop_link")

The main fecal pipeline

In general, you should start with your necessary trapping records, link them through the poop_link table, then join to extracts, then to the necessary assays.

Here is the pipeline that captures all the data.

trapping = tbl(conn, "trapping") %>% 
  #filter() %>% 
  select(trapping_id = id, everything()) %>%#change from everything if you need less columns
  collect() # We must collect it into memory since the krsp and fecal databases are at different endpoints

poop_link = tbl(con, "poop_link") %>% 
  collect()

# Assume we are only interested in individuals we have cort data for
extracts = tbl(con, "extracts")
cort = tbl(con, "cort")

assays = inner_join(cort, extracts) %>% 
  collect()

joined = trapping %>%
  inner_join(poop_link) %>%
  inner_join(assays)

This table is then a collection of the trapping data and hormone data for every individual for whom we have hormone data. I use inner joins for simplicity, But you can use a left_join to keep individuals who don't have hormone data.

A worked example

Let's say we want to look at cort levels over pregnancy for females on Kloo and Sulpher from 2010 onwards. I've chosen an example that also uses the "litters" table for sake of demonstration. Well start with our primary tables. Some simple data wrangling.

library(lubridate)
litters = tbl(conn, "litter") %>% 
  filter(yr >= 2010,
         grid %in% c("KL", "SU"),
         ln == 1) %>% 
  select(squirrel_id, fieldBDate)

trapping = tbl(conn, "trapping") %>% 
  filter(gr %in% c("KL", "SU"),
         sex == "F",
         year(date) >= 2010) %>% 
  select(trapping_id = id, squirrel_id, date) %>% 
  left_join(litters) %>% 
  collect() %>% 
  mutate(across(contains("date"), yday))

We'll then calculate relative parturition and ust keep females who are pregnant

trapping = trapping %>% 
  mutate(rel_part = date-fieldBDate) %>% #pregnant are negative, lacatating are positive
  select(-c(contains("date"))) %>%  #pregnancy is from -35 -- 0 
  filter(rel_part >= -35 & rel_part<= 0)

#Note this section will rarely ever change. Copy and poasted from above
poop_link = tbl(con, "poop_link") %>% 
  collect()

# Assume we are only interested in individuals we have cort data for
extracts = tbl(con, "extracts")
cort = tbl(con, "cort")

assays = inner_join(cort, extracts) %>% 
  collect()

joined = trapping %>%
  inner_join(poop_link) %>%
  inner_join(assays)

Calculating hormone levels

The variable final_concentration refers to the steroid concentration in the well of the assay plate. We need to convert this to a concentration of steroid present in the feces.

$$ steroid(ng)/feces(g) = \frac{\frac{pg}{well} * ExtractVolume(\mu L) * DilutionFactor}{FecalWeight(g) * SampleVolume(\mu L) * 1000} $$ Where

$\frac{pg}{well}$ is the output concentration from the plate reader (cort$final_concentration)
$ExtractVolume$ is the volume of 80% MeOH used for the extraction in \mu L + fecal weight in mg (500\mu L + extracts$mass_extracted_g * 1000)
$DilutionFactor$ is the dilution of pre-EIA (e.g. 10 if the extracts were diluted 1:10) (cort$dilution)
$FecalWeight$ is the weight of feces extracted (expressed in g) (extracts$mass_extracted_g)
$SampleVolume$ is the volume transferred to EIA (e.g 10 \mu L) (cort$volume_used)

Extracts are always extracted in 500 \mu L of 80% MeOH

I have added a helper funciton to help with this, but be aware it is clunky and will only work on a joined table of extracts and hormone that is unaltered. This is mildly intentional.

So, to calculate your hormone concetration in ng/g

joined$cort_ng_g = fecalHormoneCalc(joined)

Now we can plot this.

library(ggplot2)
ggplot(joined, aes(x = rel_part, y = cort_ng_g))  + 
  geom_point(size = 0.5, alpha = 0.5) +
  scale_y_log10() +
  geom_smooth(aes(group = squirrel_id), method = "glm", se = F, size = 0.3, color = "darkgrey") +
  geom_smooth(method = "glm", se = F) +
  theme_minimal()