docs/articles/data_structure.md

title: "Data Structure / Table Joining" author: "Cale Basaraba" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Data Structure / Table Joining} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

Setup

Before exploring the tables in the EIDITH database using the eidith package, make sure to load it:

library(eidith)

For the purposes of this vignette, you also will want to download / use the dplyr package:

install.packages("dplyr")  
library(dplyr)

If you are having trouble installing the eidith R package, logon credentials, permissions, or have not downloaded the local database, please follow the instructions in the package setup tutorial.

EIDITH Database Structure

Like many databases, EIDITH is divided into several tables that are linked by unique ID fields (integers in this case). Below is a sketch of the six EIDITH tables and their basic linking structure. The lines connecting tables identify the fields, or columns, that link one table to the next. The fields with yellow keys next to their names are unique IDs -- this means that no two rows in that table can have the same value for this field.

In the events table, for instance, each event (or row) has a unique event_id, while in the animals table, each animal (or row) has a unique animal_id. Importantly, each animal in the animals table also has an event_id, which links it to an event. Since event_id is not unique in the animals table, there can be more than one animal entry with the same event_id -- this makes perfect sense, since there are likely multiple animals attached to a particular event. Using the ed_tables_conn() function at any time will display a table listing the ID fields linking tables together:

ed_tables_conn()
##                 T1               T2        conn
## 1           Animal         Specimen   animal_id
## 2           Animal            Event    event_id
## 3 TestIDSpecimenID             Test     test_id
## 4 TestIDSpecimenID            Virus     test_id
## 5         Specimen TestIDSpecimenID specimen_id
## 6             Test            Virus     test_id

These ID fields tie all of the tables together, allowing us to join tables together that contain information we are interested in analyzing. EIDITH is a large database with many fields per table, so it can be challenging at first to determine which table contains the relevant information. Generally, the location of the information we are looking for should be in the table that it is most closely tied to: if we are looking for the species of an animal, that field should be in the animals table, if we are looking for the genetic sequence of a virus, that information should be in the viruses table. If we are having a hard time finding the field we need, we can use the ?ed_metadata command to load the built-in help file that contains a searchable list of all EIDITH fields.

Testspecimens Table

The testspecimens table is different from the other five EIDITH tables: it is a lookup table whose only purpose is to allow the joining of the tests and specimens tables. It is necessary because one specimen can be linked to multiple tests and (due to pooled tests) one test can be linked to multiple specimens.

Joining Tables

Let's walk through an example of joining tables in order to get information about the specimens taken during different events. Specifically, we will examine different specimen types by habitat Looking at the database structure diagram above, we will need the events, animals, and specimens tables to link together information about events (habitat) with information about specimens (speciment type). Notice we cannot link rows in the specimens table directly to the events table. This is because the specimens table contains information about which animal the specimen was obtained from, but only the animals table has information about which animal is connected to a particular event.

First, we need to load these three tables from our locally stored EIDITH database into R. If you have EIDITH database access, you can use the functions ed_events(), ed_animals() and ed_specimens() to access your data, but for the purposes of this vignette we will use the mock data available using the ed_mock calls. For more information on the mock data included in the package, ?ed_mock will open a helpfile in RStudio.

mock_events <- ed_events_mock()
mock_animals <- ed_animals_mock()
mock_specimens <- ed_specimens_mock()

We can take a look at all the fields / columns in these tables by running names(mock_events), names(mock_animals), or names(mock_specimens) which will print out all the column names from these tables into our console. If we do this, we can see that, along with many other fields, each table has the ID fields highlighted in the database structure diagram above.

We will have to perform two joins in this example. First, we will join the mock_events table to the mock_animals table:

mock_ea <- inner_join(mock_events, mock_animals, by = "event_id")

In this example we used the inner_join() function from the dplyr package. Note that we use the by = "event_id" argument to specify that we want to join based on the linking key. This type of join will combine all columns from both tables, but will drop entries from the events table that have no linked entries in the animals table and vice versa. Depending on what kind of analysis we are interested in, we may want to perform a different kind of join. Running ?dplyr::join to read the dplyr join help file explains different joining options.

If we run names(mock_ea) we can see that we now have a dataframe with all the columns from both the mock_events and mock_animals tables. Our next step is to join this table to the mock_specimens table:

mock_eas <- inner_join(mock_ea, mock_specimens, by = "animal_id")

We now have a table that combines information from all three tables, and can create a visualization of specimen type by habitat type. We can use the ggplot2 package to create a clean stacked-bar graph. We can install and load the ggplot2 package using the code below:

install.packages("ggplot2")
library(ggplot2)

The following code creates a stacked-bar graph using the ggplot2 package.

ggplot(data = mock_eas) +       
  geom_bar(aes(x = specimen_type_id, fill = habitat_type)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.text = element_text("")) +
  labs(title = "Specimen Type by Habitat Type (Mock Data)", x = "Specimen Type", y = "Count (specimens)", fill = "Habitat Type")

plot of chunk 7

Conclusion

This vignette describes the data structure of the EIDITH database and leads the user through some simple joins to link information from the events and specimens tables. The same process can be done to visualize or analyze information from across any of the tables in EIDITH database.



ecohealthalliance/eidith documentation built on Aug. 30, 2022, 7:45 a.m.