knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(kewr)
library(dplyr)
library(tidyr)

The Tree of Life is a database of specimens sequenced as part of Kew's efforts to build a comprehensive evolutionary tree of life for flowering plants.

This package accesses data from the Tree of Life Explorer, an output of the Plant and Fungal Trees of Life Project (PAFTOL). The data in the Tree of Life is generated by target sequence capture using the universal Angiosperms353 probe set.

The Tree of Life contains information about specimens that have been sequenced and genes recovered in the process. It lets you download sequence data for the specimens, as well as alignments and trees for the genes.

Viewing the Tree of Life

The Tree of Life Explorer lets users view the tree of life constructed from the current dataset of samples.

You can load the tree with kewr:

tree <- load_tol()
tree

This reads the tree as a single string, so you need another package, such as ape, to parse and view it.
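As a minimal sketch of that parsing step, the example below reads a tree string with ape's read.tree. The toy string stands in for the real tree text, and it assumes the tree file is in Newick format, which this vignette does not confirm. Inspect the object returned by load_tol to find where the string is stored.

```r
library(ape)

# Toy Newick string standing in for the real tree text (an assumption for illustration).
newick <- "((Myrcia:1.0,Eugenia:1.0):0.5,Syzygium:1.5);"

# read.tree() parses Newick text supplied through its `text` argument.
phy <- read.tree(text = newick)

# Inspect the tip labels, then draw the tree.
phy$tip.label
plot(phy)
```

The same call should work on the real tree text once you have extracted it as a character string.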

Searching ToL for specimens

The Tree of Life contains information about the specimens that have been sequenced to construct the tree. The long-term aim is to sample at least one species from every flowering plant genus. This means that, typically, there will be one specimen per species.

You can search this information using the search_tol function. There is no filtering or keyword-search functionality, so queries are just the name of an order/family/genus/species. For example, to get all specimens for the genus Myrcia:

specimens <- search_tol("Myrcia")
specimens

Searching works by exact matching, and the taxonomy follows WCVP, so only accepted names will work. For example, if we misspell Myrcia, we get nothing:

search_tol("Mercya")

And if we search for an outdated synonym we get nothing:

search_tol("Gomidesia")

But searching by a higher taxon, such as a family, will work:

specimens <- search_tol("Myrtaceae")
specimens

A search like this can match more records than come back in a single page of results. To get all of them, we can either increase the limit in the search function:

myrts_all <- search_tol("Myrtaceae", limit=500)
myrts_all

Or do paged searching:

myrts1 <- search_tol("Myrtaceae")
myrts2 <- request_next(myrts1)
myrts2

And we can tidy our results into a dataframe:

tidied <- tidy(myrts_all)
tidied

Some information is nested inside the tidied dataframe, but we can get to it by unnesting:

tidied %>%
  select(id, raw_reads, taxonomy) %>%
  unnest(cols=c(taxonomy, raw_reads), names_sep="_")
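To show what the unnesting step does without querying the API, here is a small sketch on a toy tibble. The inner field names (genus, family, reads_count) are hypothetical and only illustrate how names_sep builds the new column names. The real nested fields may differ.

```r
library(tibble)
library(tidyr)

# Toy data with hypothetical nested fields; the real column contents may differ.
toy <- tibble(
  id = c(1, 2),
  taxonomy = list(
    tibble(genus = "Myrcia", family = "Myrtaceae"),
    tibble(genus = "Eugenia", family = "Myrtaceae")
  ),
  raw_reads = list(
    tibble(reads_count = 1000),
    tibble(reads_count = 2500)
  )
)

# Each nested column is spread into columns named <outer>_<inner>.
flat <- unnest(toy, cols = c(taxonomy, raw_reads), names_sep = "_")
names(flat)
```

Prefixing with the outer column name keeps fields from different nested columns distinct when they share an inner name.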

Getting gene information

The Tree of Life also contains information about the genes captured during sequencing. These can be accessed using the search_tol function:

genes_all <- search_tol(genes=TRUE, limit=500)
tidy(genes_all)

Genes cannot currently be queried, though, so the best approach is to retrieve all of them.

Looking up a record

Information about a single specimen or gene can be looked up using its ID:

specimen <- lookup_tol("2660")
specimen
gene <- lookup_tol("51", type="gene")
gene

Loading data

Records returned by search_tol and lookup_tol contain links to data files on an SFTP server. You can load these into R using the load_tol function. As you saw at the top of this vignette, if you don't provide any URL to load_tol, it will load the whole Tree of Life tree file.

To load a sequence file for a particular specimen:

load_tol(specimen$fasta_file_url)

To load a sequence file for a gene:

load_tol(gene$fasta_file_url)

Or the alignment file:

load_tol(gene$alignment_file_url)

Or the gene tree:

load_tol(gene$tree_file_url)

All files are returned as strings, so you will need to parse them to use them downstream.
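As one possible way to parse a sequence file, the sketch below splits FASTA text into a named character vector using base R only. The helper parse_fasta is hypothetical (it is not part of kewr), and the toy string stands in for the text of a real file, which you would first need to extract from whatever load_tol returns.

```r
# Hypothetical helper (not part of kewr): parse FASTA text into a named
# character vector, with one element per record.
parse_fasta <- function(txt) {
  lines <- strsplit(txt, "\r?\n")[[1]]
  lines <- lines[nzchar(lines)]

  # Header lines start with ">"; every other line is sequence data.
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)

  # Join the sequence lines belonging to each record.
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")

  # Name each sequence after its header, dropping the leading ">".
  headers <- sub("^>", "", lines[is_header])
  setNames(as.vector(seqs), headers[as.integer(names(seqs))])
}

# Toy FASTA text standing in for a downloaded file (an assumption for illustration).
toy_fasta <- ">gene_a\nACGT\nAC\n>gene_b\nGG\n"
sequences <- parse_fasta(toy_fasta)
nchar(sequences)
```

For real analyses, a dedicated package is likely a better choice than a hand-rolled parser. This sketch only illustrates the structure of the text.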

If you want to download these files directly, you can use the download_tol function.



barnabywalker/kewr documentation built on July 5, 2022, 5:37 p.m.