geomedb: geomedb - an R package for querying metadata and associated...



The Genomic Observatory Metadatabase (GEOME) is an open access repository for geographic and ecological metadata associated with sequenced samples. This package retrieves GEOME data for analysis. See the GEOME website for more information.


The geomedb package provides functions for querying GEOME directly, as well as wrappers for sratoolkit executables. Used together, these let you download all metadata relevant to your query from GEOME and then download all associated SRA sequences.

GEOME Query Functions

SRAtoolkit Functions

Example Usage

Inggat is working on Orangebar Tang (Acanthurus olivaceus) in the Philippines, and would like to download any genetic data that may be available in GEOME from previous research.

First, she searches for all GEOME samples of this species.

acaoli <- queryMetadata(entity = "Sample", query = "genus = Acanthurus AND specificEpithet = olivaceus")
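A quick way to see how returned samples are distributed across projects is to tabulate them. The sketch below uses a small mock data frame in place of the real query result so that it runs without network access. The column names (`materialSampleID`, `projectId`) and the values are assumptions for illustration only, not confirmed GEOME field names or real counts.

```r
# Mock stand-in for the data frame returned by queryMetadata();
# the column names and values here are hypothetical.
acaoli_mock <- data.frame(
  materialSampleID = c("S1", "S2", "S3", "S4"),
  projectId = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Total number of samples returned
n_samples <- nrow(acaoli_mock)

# Number of samples per project, largest first
per_project <- sort(table(acaoli_mock$projectId), decreasing = TRUE)

print(n_samples)
print(per_project)
```

With a real result, the same two calls would be applied to `acaoli`, after checking `colnames(acaoli)` for the actual name of the project column.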

Seeing that there are 787 samples in the database, mostly from the DIPnet project (projectID = 1), she decides to exclude any samples that are not from that project. She then downloads all data from Sanger-sequenced mitochondrial Cytochrome B into a DNAbin object as well as a FASTA file in her working directory.

acaoli_seqs <- querySanger(projects = 1, locus = "CYB", query = "genus = Acanthurus AND specificEpithet = olivaceus")

Then she repeats her query for samples that are associated with massively parallel sequencing reads in the SRA.

acaoli_sra <- queryMetadata(entity = "fastqMetadata", query = "genus = Acanthurus AND specificEpithet = olivaceus AND _exists_:bioSample", select = c("Event", "Sample"))

This query returns a list object with three data frames representing entities (tables) in GEOME: 'fastqMetadata' contains metadata from the SRA, 'Samples' contains metadata about the samples, and 'Events' contains metadata about the sampling events that obtained the samples. By including "_exists_:bioSample" in her query, Inggat selected only samples that have associated SRA data (biosamples).
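Working with a list of data frames like this can be sketched as follows. The example builds a mock list rather than calling GEOME, so it runs offline. The element names follow the description above, while the column names and contents are hypothetical; with a real result it is worth calling `names()` first rather than assuming a particular spelling.

```r
# Mock stand-in for the list returned by queryMetadata() with `select`;
# element names follow the description above, contents are hypothetical.
acaoli_sra_mock <- list(
  fastqMetadata = data.frame(materialSampleID = c("S1", "S2")),
  Samples = data.frame(materialSampleID = c("S1", "S2")),
  Events = data.frame(eventID = "E1")
)

# Check which entities came back before relying on a particular name
entity_names <- names(acaoli_sra_mock)

# Number of rows in each entity table
rows_per_entity <- vapply(acaoli_sra_mock, nrow, integer(1))

# Extract a single table by name
fastq_md <- acaoli_sra_mock[["fastqMetadata"]]

print(entity_names)
print(rows_per_entity)
```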

Inggat now uses prefetch to download the .sra files for the samples she has queried into her working directory. She then uses fasterqDump to convert these .sra files into fastq files and rename them after the materialSampleID originally supplied by the previous author, setting 'cleanup = T' to delete the .sra files afterwards.

prefetch(queryMetadata_object = acaoli_sra)

fasterqDump(queryMetadata_object = acaoli_sra, filenames = "IDs", source = "local", cleanup = T)

This two-step approach is generally faster, but Inggat could also have used fasterqDump() on its own to download fastq files directly from the SRA. If she had Aspera Connect installed, with the ascp executable in her $PATH, her download would be even faster. If she were using Windows, she would use fastqDump instead, which is single-threaded.
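Before running the SRA steps, it can help to confirm which command-line tools are visible to the R session. The following is a minimal sketch using only base R. It assumes the standard SRA Toolkit executable names (`prefetch`, `fasterq-dump`, `fastq-dump`) and the Aspera `ascp` command; whether geomedb locates the executables in exactly this way is an assumption.

```r
# Names of the command-line tools this workflow relies on
tools <- c("prefetch", "fasterq-dump", "fastq-dump", "ascp")

# Sys.which() returns the full path of each executable,
# typically "" when it is not found on the PATH
tool_paths <- Sys.which(tools)
found <- nzchar(tool_paths)

# Report which tools are available in this session
print(data.frame(tool = tools, found = unname(found)))

# .Platform$OS.type is "windows" on Windows and "unix" elsewhere
on_windows <- identical(.Platform$OS.type, "windows")
print(on_windows)
```

If `ascp` is reported as missing, the downloads should still be possible, only without the Aspera speed-up described above.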

geomedb documentation built on July 15, 2020, 5:07 p.m.