Home

/

GitHub

/

README.md
In blex-max/ectosearchR: Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset

ectosearchR

ectosearchR - A simple wrapper package around DBI and Rpostgres for SQL querying the Van Der Linde 2018 ectomycorrhizal dataset hosted on Heroku Postgres. As this dataset grows with future sampling, so will this package.

Why use an SQL database?

Centralised, backed-up repository that can be easily accessed anywhere with internet (and written to disk for offline use if necessary).
Easy to add new data as soon as it becomes available - e.g. fieldwork data goes straight to this repository from my phone as we're sampling, no paper or tabulating stuff into excel.
Powerful & intuitive query language for data analysis.
No more reading in 5 csvs to do analysis! Much easier to interface with the data.

Installation & Getting Started

To install simply run the following code block in R:

install.packages('devtools')
devtools::install_github('blex-max/ectosearchR')

To connect to the database:

library(ectosearchR)
ecto_connect()

At the time of writing, the database contains the following tables

Table Name | Content :---: | :---: otu_taxa | taxonomic info about each of the OTUs included in the final dataset otu_occurrence | An abundance matrix where each row is an OTU, and each column a ICP plot, named accordingly occurrence_by_plot | An abundance matrix where each row is an ICP plot, and each column an OTU - this is useful since it's not easy to just flip the matrix in SQL, so sometimes this makes certain queries easier plot_metadata | metadata about each of the sampled ICP forest plots, e.g. country, major tree species, date sampled, etc. seq_data | DNA sequence data for all sequenced root tips and associated metadata including matched OTU/sh, plot code, sequence length, etc.

We can check what tables are presently in the database using:

ecto_tables()

And we can check what fields (columns) each table contains using:

ecto_fields(tablename)

To grab the whole of any given table we can use a simple SQL query and the query function included in the package:

df <- ecto_query("SELECT * FROM otu_taxa")

* simply means 'every column', and this query will return a dataframe containing all taxonomic records. This dataframe can then be handled/subset in R. However, we can also construct more complex queries using PostgreSQL syntax, which is more powerful and much more readable. It's also reasonably intuitive. See the next section for a real, practical example on this dataset. Otherwise, there are myriad guides to PostgreSQL query syntax across the web.

Finally, to see all functions and documentation use:

help(package = ectosearchR)

Building a Query Step by Step (Example 1):

Which ascomycota genera contain species that appear in more than the average number of plots? What families do those genera belong to? We'll use this question to have a quick look at how you can interact with the database using SQL. This is not intended to be a tutorial, more of a small showcase.

First, let's look at the table 'otu_occurrence' as that seems like the best place to start

library(ectosearchR)
ecto_connect()

ecto_fields('otu_occurrence')

We can see that in the table 'otu_occurrence' we have a field 'distinct plots', and we have a field of SH codes called 'sh'. Sounds promising. Now, which OTUs have a greater value than average in 'distinct plots'?. We'll select SH codes, filtering by their corresponding value in 'distinct plots'.

ecto_query('SELECT sh 
  FROM otu_occurrence 
  WHERE distinct_plots >= 
    (SELECT AVG(distinct_plots) FROM otu_occurrence)')

But to which genera do they belong? By nesting our previous query as a WHERE condition, we can find out.

ecto_query('SELECT genus
  FROM otu_taxa
  WHERE sh IN 
      (SELECT sh 
      FROM otu_occurrence 
      WHERE distinct_plots >= 
        (SELECT AVG(distinct_plots) FROM otu_occurrence))
')

There are basidiomycota in here! We'll filter them out with an additional conditional in the WHERE clause using AND. There's also a lot of the same genera repeated where more than one species meets the WHERE criteria. We can use DISTINCT with SELECT to get only the unique ones.

ecto_query('SELECT DISTINCT genus
  FROM otu_taxa
  WHERE sh IN 
      (SELECT sh 
      FROM otu_occurrence 
      WHERE distinct_plots >= 
        (SELECT AVG(distinct_plots) FROM otu_occurrence))
  AND phylum = "Ascomycota"
')

Finally, We also wanted the family each of these genera are in, so let's SELECT family also.. This will give us the answer to our question, so we'll assign this dataframe to a variable named 'result'.

result <- ecto_query('SELECT DISTINCT genus, family
  FROM otu_taxa
  WHERE sh IN 
      (SELECT sh 
      FROM otu_occurrence 
      WHERE distinct_plots >= 
        (SELECT AVG(distinct_plots) FROM otu_occurrence))
  AND phylum = "Ascomycota"
')

ecto_disconnect()

Done!

Combining SQL and R (Example 2):

Here's a real example of the database in action - again not as a tutorial, but to show how it can be/has been deployed. I wanted to run the LULU algorithm (Frøslev et al. 2017, Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates, https://www.nature.com/articles/s41467-017-01312-x) to look for potentially spurious OTUs which are in fact errors of more common OTUs with which they co-occur. This requires a table of all OTUs and their abundance in each sample, in our case an ICP plot, and sequence data for all OTUs.

library(ectosearchR)
ecto_connect()

# create a table of OTU abundance per sample (ICP plot) for LULU algorithm
otu_table <- ecto_query('SELECT * FROM OTU_occurrence') # get abundance data for all OTUs

A couple of lines of R got this into the format LULU required for the OTU table. Onto the sequence data.

# get single seq to represent each otu in the table
# first get a single seq for each SH, using the DISTINCT ON () SQL clause. 
# Also grab the SH number of that seq, the sample/root tip ID code, 
# and whether or not the seq has a forward and reverse complement.
rep_seq <- ecto_query('SELECT DISTINCT ON (sh) 
                            sh, compliment, sample_id, sample_seq
                        FROM seq_data;') # one seq for each SH!

Another line to filter that table down to only SHs in the OTU table.

After some further work, the algorithm outputs a dataframe of potentially erroneous OTUs, and their 'true' identity, or parent OTU. However, as recieved these are just pairs of SHs with no taxonomic information, or anything else. We can improve on this. I've called this dataframe 'map' in my R script.

# the rownames of 'map' correspond to the erroneous OTU SHs. Here I'm prepreparing those
# rownames into a string for some subsequent SQL queries to get more data on them all at once. 
# Each SH needs single quotes to be parsed properly in the next step.

in_sql <- paste0("('", paste(rownames(map), collapse = "', '"), "')")


tax <- ecto_query(paste0("SELECT family, genus, species FROM otu_taxa WHERE sh IN", in_sql)) 
# all taxonomic data in a dataframe using the SQL WHERE clause and IN operator!

To grab that same information for the parent or true OTU is a little more tricky. A few of the parent OTUs had more than one erroneous child, and so appeared more than once our map dataframe. This means if we query using the WHERE and IN syntax as above, we'll get a dataframe with fewer rows than we might expect and won't be able to join the two together nicely. lapply and an anonymous function let us nicely skirt this issue by retrieving taxonomic data for each row in our map object individually.

parent_tax <- lapply(map$parent_id, \(x) ecto_query(paste0("SELECT family, genus, species FROM otu_taxa WHERE sh = '", x, "';"))) 
# note the single quotes again. 
# The semi-colon is just a convention to indicate the end of the SQL query.

parent_tax <- do.call('rbind', parent_tax) # the output needs a little tidying
colnames(parent_tax) <- paste0('PARENT_', colnames(parent_tax))


bad_otus <- cbind(map, seq_complement, tax, parent_tax) 
# bring it all together and we have all the info we need.

ecto_disconnect() # close connection to the database

All data accession done without ever reading a single file and much more neatly than we could have done in R alone!

blex-max/ectosearchR documentation built on Dec. 19, 2021, 10:41 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

blex-max/ectosearchR
Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset

README.md
In blex-max/ectosearchR: Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset

ectosearchR

Why use an SQL database?

Installation & Getting Started

Building a Query Step by Step (Example 1):

Combining SQL and R (Example 2):

R Package Documentation

Browse R Packages

We want your feedback!

blex-max/ectosearchR Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset

README.md In blex-max/ectosearchR: Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset

ectosearchR

Why use an SQL database?

Installation & Getting Started

Building a Query Step by Step (Example 1):

Combining SQL and R (Example 2):

R Package Documentation

Browse R Packages

We want your feedback!

blex-max/ectosearchR
Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset

README.md
In blex-max/ectosearchR: Simple Wrapper Package around DBI for SQL Querying Van Der Linde 2018 Ectomycorrhizal Dataset