BiocStyle::markdown()
Authors: V. Keith Hughitt
Modified: `r file.info("EuPathDB.Rmd")$mtime`
Compiled: `r date()`
This tutorial describes how to query and use annotations from EuPathDB, the Eukaryotic Pathogen Genomics Resource, and how to create and access local R packages derived from it.
The EuPathDB R package may also be used to generate local copies of some of the resources provided by eupathdb.org. These include the OrgDb, TxDb, BSgenome, and OrganismDbi packages built later in this tutorial.
Because EuPathDB updates its resources constantly, it can be useful to generate these resources locally. In addition, the EuPathDB R package provides shortcut functions for gathering data frames (tables) of annotation data from these resources.
When AnnotationHub and the EuPathDB web resources get out of sync, one might want to install and manipulate the newest version of the available EuPathDB data. The following blocks demonstrate how one might do those tasks.
The following shows how one might gather information about Saccharomyces cerevisiae from fungidb.org.
If one is creating BSgenome packages, the number of open files may quickly pass the default limit of 1024 imposed by most Linux systems. This is more than a little annoying, so you will very much want to do something like this:
```{bash ulimit, eval=FALSE}
ulimit -HSn 4096 ## If you can do it, make this number higher.
```

```r
library(EuPathDB)
## Ask for the version 42 data from fungidb for species with 'cerevisiae' in the name.
sc_entry <- get_eupath_entry(species="cerevisiae", webservice="fungidb", eu_version="v42")
sc_name <- sc_entry[["Species"]]
sc_entry
```
Now that we have the canonical name for yeast from fungidb, we can create a fresh orgdb package. I will not actually run this for the vignette because it takes a long time and prints a lot to screen.
```r
orgdb_pkg <- make_eupath_orgdb(sc_entry)
txdb_pkg <- make_eupath_txdb(sc_entry)
bsgenome_pkg <- make_eupath_bsgenome(sc_entry)
organ_pkg <- make_eupath_organismdbi(sc_entry)
```
You will have to trust that the above worked in my R session and installed for me packages named:
As the names suggest, these were derived from fungidb.org revision 43.
The following demonstrates ways to extract data from the generated orgdb package.
Because the eupathdb is constantly evolving, the get_eupath_pkgnames() function uses the metadata generated by download_eupath_metadata() to provide a differently named package for each eupathdb revision.
```r
orgdb_pkg <- get_eupath_pkgnames(sc_entry)
sc_orgdb <- orgdb_pkg$orgdb
## Here is the name of the current yeast package.
sc_orgdb
## Thus we see the v41 (as of late 2018), a number which presumably will continue increasing.
## We can set the version parameter to change this if we have a previous version installed.
## Now get the set of available columns from it:
library(sc_orgdb, character.only=TRUE)
pkg <- get0(sc_orgdb)
avail_columns <- AnnotationDbi::columns(pkg)
head(avail_columns)
## There are lots of columns!
length(avail_columns)
```
EuPathDB provides an astonishing range of information. When creating the OrgDB packages, I prefixed the column names of the various data types as follows:
Once we have a package loaded, everything else is an application of the AnnotationDbi interface. Here are a few examples.
The load_orgdb_annotations() and load_orgdb_go() functions are simply wrappers around AnnotationDbi which fill in the various arguments and more quickly return the annotations of likely interest.
```r
## The columns which begin with strings like 'PATHWAY' or 'INTERPRO' are actually separate
## sql tables in the orgdb database, and as such will lead to a hugely redundant data table
## if we select them.
chosen_columns_idx <- grepl(x=avail_columns, pattern="^ANNOT")
chosen_columns <- avail_columns[chosen_columns_idx]

## Now we have a set of columns of interest, let us get a data table/data frame.
sc_annot <- load_orgdb_annotations(orgdb=sc_orgdb, keytype="gid", fields=chosen_columns)
## load_orgdb_annotations will fill out separate dataframes for each annotation type,
## genes, exons, transcripts, etc. In this case, we only want the genes
## (The eupathdb does not provide much information for the others.)
sc_genes <- sc_annot[["genes"]]
dim(sc_genes)
head(sc_genes)
## Yay! We have data about S. cerevisiae!

chosen_columns_idx <- grepl(x=avail_columns, pattern="^GO")
chosen_columns <- avail_columns[chosen_columns_idx]
sc_go <- load_orgdb_go(sc_orgdb, columns=chosen_columns)
head(sc_go)
## Yay! Gene ontology data for S. cerevisiae!

chosen_columns_idx <- grepl(x=avail_columns, pattern="^INTERPRO")
chosen_columns <- avail_columns[chosen_columns_idx]
sc_interpro <- load_orgdb_go(sc_orgdb, columns=chosen_columns)
head(sc_interpro)
## Interpro data for S. cerevisiae!

chosen_columns_idx <- grepl(x=avail_columns, pattern="^PATHWAY")
chosen_columns <- avail_columns[chosen_columns_idx]
sc_path <- load_orgdb_go(sc_orgdb, columns=chosen_columns)
head(sc_path)
```
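The prefix-based column selection used above can be tried without any installed OrgDB package. The following is a minimal sketch in base R only; the column names in `example_columns` and the helper `select_by_prefix()` are made up for illustration and are not part of the EuPathDB package or taken from a real orgdb.

```r
## Hypothetical column names, standing in for AnnotationDbi::columns(pkg).
example_columns <- c("GID", "ANNOT_GENE_TYPE", "ANNOT_CHROMOSOME",
                     "GO_ID", "GO_TERM_NAME", "INTERPRO_ID", "PATHWAY_NAME")

## Hypothetical helper: keep only the columns which start with the given prefix.
select_by_prefix <- function(columns, prefix) {
  idx <- grepl(x=columns, pattern=paste0("^", prefix))
  columns[idx]
}

annot_columns <- select_by_prefix(example_columns, "ANNOT")
go_columns <- select_by_prefix(example_columns, "GO")
print(annot_columns)
print(go_columns)
```

Selecting one prefix at a time mirrors the approach above: each prefix corresponds to a separate table, so mixing prefixes in a single query is what produces the redundant rows.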
Remembering the keytype and package names and everything can be annoying. The following attempts to make that easier. In this instance, the only thing we need to provide is a unique substring for the species of interest. If the substring is not unique, this should show the matches and choose the first.
```r
## The function load_eupath_annotations() provides a shortcut to the above.
sc_annot <- load_eupath_annotations(species="S288c", eu_version="v42", webservice="fungidb")
dim(sc_annot)
```
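The substring behaviour described above (report the matches, then take the first) can be illustrated with a short base R sketch. This is not the package's implementation: `pick_species()` is a hypothetical helper, and the species vector is an illustrative stand-in rather than metadata downloaded from fungidb.

```r
## Illustrative species names, standing in for the downloaded metadata.
available_species <- c("Saccharomyces cerevisiae YJM789",
                       "Saccharomyces cerevisiae S288c",
                       "Candida albicans SC5314")

## Hypothetical helper: match a literal substring, report ambiguity, take the first hit.
pick_species <- function(query, species) {
  hits <- species[grepl(x=species, pattern=query, fixed=TRUE)]
  if (length(hits) == 0) {
    stop("No species matched: ", query)
  }
  if (length(hits) > 1) {
    message("Found ", length(hits), " matches, using the first: ",
            paste(hits, collapse="; "))
  }
  hits[1]
}

unique_hit <- pick_species("S288c", available_species)
first_hit <- pick_species("cerevisiae", available_species)
```

A strain-level substring such as "S288c" is the safer choice, because a species-level substring like "cerevisiae" silently depends on the order of the entries.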
A task I find myself needing to do fairly often is to get the set of genes orthologous to a given gene. There are lots of methods to handle this question; one of them includes the nice table provided by the eupathdb. Let us query that.
```r
sc_ortho <- extract_eupath_orthologs(sc_orgdb)
dim(sc_ortho)
head(sc_ortho)
summary(sc_ortho)
```

```r
sessionInfo()
```