geomedb: geomedb - an R package for querying metadata and associated...

geomedb R Documentation

geomedb - an R package for querying metadata and associated genetic sequences from GEOME

Description

The Genomic Observatories Metadatabase (GEOME) is an open-access repository for geographic and ecological metadata associated with sequenced samples. This package retrieves GEOME data for analysis. See http://www.geome-db.org for more information about GEOME.

Details

The geomedb package provides functions for querying GEOME directly, as well as wrappers for sratoolkit executables. Used together, they make it possible to download all metadata relevant to a query from GEOME and then download all associated SRA sequences.

GEOME Query Functions

  • listProjects: Get a list of projects in GEOME

  • listExpeditions: Get a list of expeditions for a GEOME project

  • listEntities: Get a list of entities (i.e. tables) available to query

  • listLoci: Get a list of loci that are stored in FASTA format directly in GEOME (not in the SRA)

  • queryMetadata: Query metadata from the GEOME database

  • querySanger: Query Sanger sequences directly from the GEOME database

SRAtoolkit Functions

  • fasterqDump: Download or convert fastq data from NCBI Sequence Read Archive using multiple threads

  • fastqDump: Download or convert fastq data from NCBI Sequence Read Archive in a single thread (Windows compatible)

  • prefetch: Download data from NCBI Sequence Read Archive in .sra format using FASP or HTTPS protocols
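Because these functions wrap external executables, it can help to confirm that the executables are visible from R before starting a large download. The following is a minimal sketch using only base R. It assumes the standard sratoolkit command names (prefetch, fasterq-dump, fastq-dump) plus Aspera's ascp, and it does not show how geomedb itself locates them.

```r
# Command-line tools assumed to back the wrappers (standard sratoolkit names),
# plus the optional Aspera client.
tools <- c("prefetch", "fasterq-dump", "fastq-dump", "ascp")

# Sys.which() returns the full path of each executable, or "" if not on the PATH.
paths <- Sys.which(tools)

# TRUE where an executable was found
found <- paths != ""
names(found) <- tools
print(found)

if (!all(found[c("prefetch", "fasterq-dump")])) {
  message("Some sratoolkit executables were not found on the PATH.")
}
```

A missing ascp only means that the Aspera route is unavailable; fastq-dump matters mainly on Windows.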

Example Usage

Inggat is working on Orangebar Tang (Acanthurus olivaceus) in the Philippines, and would like to download any genetic data that may be available in GEOME from previous research.

First, she searches for all GEOME samples of this species.

acaoli <- queryMetadata(entity = "Sample", query = "genus = Acanthurus AND specificEpithet = olivaceus")

Seeing that there are 787 samples in the database, mostly from the DIPnet project (projectID = 1), she decides to exclude any samples that are not from that project. She then downloads all data from Sanger-sequenced mitochondrial Cytochrome B into a DNAbin object as well as a FASTA file in her working directory.

acaoli_seqs <- querySanger(projects = 1, locus = "CYB", query = "genus = Acanthurus AND specificEpithet = olivaceus")

Then she repeats her query for samples that are associated with massively parallel sequencing reads in the SRA.

acaoli_sra <- queryMetadata(entity = "fastqMetadata", query = "genus = Acanthurus AND specificEpithet = olivaceus AND _exists_:bioSample", select = c("Event", "Sample"))

This query returns a list of three data frames, each representing an entity (table) in GEOME: 'fastqMetadata' contains metadata from the SRA, 'Sample' contains metadata about the samples, and 'Event' contains metadata about the sampling events in which the samples were collected. By including "_exists_:bioSample" in her query, Inggat selects only samples that have associated SRA data (biosamples).
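A returned list of this kind can be inspected with base R before anything is downloaded. The sketch below uses a toy stand-in so that it runs without a network connection. The element names, column names and values are invented for illustration, so check names(acaoli_sra) for the names that the real query returns.

```r
# Toy stand-in for a list of data frames such as queryMetadata() returns
# when 'select' is used. All names and values here are illustrative only.
toy_result <- list(
  fastqMetadata = data.frame(materialSampleID = c("s1", "s2")),
  Sample        = data.frame(materialSampleID = c("s1", "s2")),
  Event         = data.frame(eventID = "e1")
)

# Which entity tables came back?
print(names(toy_result))

# Number of rows in each entity table
rows_per_entity <- vapply(toy_result, nrow, integer(1))
print(rows_per_entity)

# Extract a single table by name for further work
samples <- toy_result[["Sample"]]
```

The same three calls (names, vapply with nrow, and [[ ]]) apply unchanged to the real object.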

Inggat now uses prefetch to download the .sra files for the samples returned by this query into her working directory. She then uses fasterqDump to convert these .sra files into fastq files, rename them using the materialSampleID that the original author supplied, and, with 'cleanup = T', delete the .sra files.

prefetch(queryMetadata_object = acaoli_sra)

fasterqDump(queryMetadata_object = acaoli_sra, filenames = "IDs", source = "local", cleanup = T)

This two-step approach is generally faster, but Inggat could also have simply used fasterqDump() to download fastq files directly from the SRA. If she has Aspera Connect installed, with the ascp executable in her $PATH, her download would be even faster. If she were using Windows, she would use fastqDump instead, which is single-threaded.
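The queries in this example all share the same species filter, so the query string could be assembled once and reused. The helper below, species_query, is hypothetical and not part of geomedb. It simply joins clauses with " AND ", matching the query strings used above.

```r
# Hypothetical helper (not a geomedb function): build a GEOME query string
# from a genus, a specific epithet, and any number of extra clauses.
species_query <- function(genus, epithet, ...) {
  clauses <- c(
    paste("genus =", genus),
    paste("specificEpithet =", epithet),
    ...
  )
  paste(clauses, collapse = " AND ")
}

# Filter used for the Sample and Sanger queries
base_query <- species_query("Acanthurus", "olivaceus")

# Same filter, restricted to records that have an SRA biosample
sra_query <- species_query("Acanthurus", "olivaceus", "_exists_:bioSample")

print(base_query)
print(sra_query)
```

Either string can then be passed as the query argument of queryMetadata or querySanger.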


DIPnet/fimsR-access documentation built on Nov. 12, 2022, 2:41 a.m.