R/geomedb.R

#' geomedb - an R package for querying metadata and associated genetic sequences from GEOME
#' 
#' The Genomic Observatory Metadatabase (GEOME Database) is an open access repository for
#' geographic and ecological metadata associated with sequenced samples. This package is used to retrieve
#' GEOME data for analysis. See \url{http://www.geome-db.org} for more information regarding GEOME.
#' 
#' The geomedb package provides functions for querying GEOME directly, as well as wrappers for
#' \href{https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc}{sratoolkit} executables. Used together, these let you
#' download all metadata relevant to your query from GEOME and then download all associated SRA sequences.
#' 
#' @section GEOME Query Functions:
#' \itemize{
#' 
#' \item \code{\link{listProjects}}: Get a list of projects in GEOME
#' \item \code{\link{listExpeditions}}: Get a list of expeditions for a GEOME project
#' \item \code{\link{listEntities}}: Get a list of entities (i.e. tables) available to query
#' \item \code{\link{listLoci}}: Get a list of loci that are stored in FASTA format directly in GEOME (not in the SRA)
#' \item \code{\link{queryMetadata}}: Query metadata from the GEOME database
#' \item \code{\link{querySanger}}: Query Sanger sequences directly from the GEOME database
#' }
#' 
#' @section SRAtoolkit Functions:
#' \itemize{
#' \item \code{\link{fasterqDump}}: Download or convert fastq data from NCBI Sequence Read Archive using multiple threads
#' \item \code{\link{fastqDump}}: Download or convert fastq data from NCBI Sequence Read Archive in a single thread (Windows compatible)
#' \item \code{\link{prefetch}}: Download data from NCBI Sequence Read Archive in .sra format using FASP or HTTPS protocols
#' }
#' 
#' @section Example Usage:
#' 
#' Inggat is working on Orangebar Tang (\emph{Acanthurus olivaceus}) in the Philippines, and would like to download any genetic data
#' that may be available in GEOME from previous research.
#' 
#' First, she searches for all GEOME samples of this species.
#' 
#'  \code{
#'  acaoli <- queryMetadata(entity = "Sample", query = "genus = Acanthurus AND specificEpithet = olivaceus")
#'          
#'  } 
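#' 
#' Because the same species filter recurs in the queries that follow, it can be convenient to build the
#' query string once and reuse it. This is a minimal sketch in base R; the variable names \code{genus},
#' \code{epithet} and \code{species_query} are illustrative and are not part of geomedb.
#' 
#' \code{
#' 
#' genus <- "Acanthurus"
#' epithet <- "olivaceus"
#' 
#' # paste() joins its arguments with single spaces by default
#' species_query <- paste("genus =", genus, "AND specificEpithet =", epithet)
#' 
#' acaoli <- queryMetadata(entity = "Sample", query = species_query)
#' 
#' }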
#' 
#' Seeing that there are 787 samples in the database, mostly from the DIPnet project (projectID = 1), she decides to
#' exclude any samples that are not from that project. She then downloads all data from Sanger-sequenced mitochondrial Cytochrome B into
#' a DNAbin object as well as a FASTA file in her working directory.
#' 
#' \code{
#'
#' acaoli_seqs <- querySanger(projects = 1, locus = "CYB", query = "genus = Acanthurus AND specificEpithet = olivaceus")
#' 
#' }
#' 
#' Then she repeats her query for samples that are associated with massively parallel sequencing reads in the SRA.
#' \code{
#' 
#' acaoli_sra <- queryMetadata(
#'     entity = "fastqMetadata", 
#'     query = "genus = Acanthurus AND specificEpithet = olivaceus AND _exists_:bioSample",
#'     select = c("Event", "Sample"))
#' 
#' }
#' 
#' This query returns a list object with three data frames representing entities (tables) in GEOME: 'fastqMetadata'
#' contains metadata from the SRA, 'Sample' contains metadata about the samples, and 'Event' contains metadata about
#' the sampling events in which the samples were collected. By including "_exists_:bioSample" in her query, Inggat
#' selected only samples that have associated SRA data (biosamples).
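#' 
#' To check what was returned, the list can be inspected with base R. This sketch assumes that the list
#' elements are named after their entities; \code{names()} shows the names actually used.
#' 
#' \code{
#' 
#' names(acaoli_sra)          # names of the returned data frames
#' lapply(acaoli_sra, dim)    # number of rows and columns in each data frame
#' 
#' }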
#' 
#' Inggat now uses \code{\link{prefetch}} to download .sra files for the samples returned by this query into her
#' working directory. She then uses \code{\link{fasterqDump}} to convert these .sra files into fastq files, rename them
#' using the materialSampleID that the original author supplied, and delete the .sra files by setting \code{cleanup = TRUE}.
#' 
#'
#' 
#' \code{
#' 
#' prefetch(queryMetadata_object = acaoli_sra)
#' 
#' }
#' 
#' \code{
#' fasterqDump(queryMetadata_object = acaoli_sra, filenames = "IDs", source = "local", cleanup = TRUE)
#' }
#' 
#' This two-step approach is generally faster, but Inggat could also have used \code{\link{fasterqDump}} alone to download
#' fastq files directly from the SRA. If she has \href{https://downloads.asperasoft.com/connect2/}{Aspera Connect} installed, with the ascp executable in her $PATH, her download
#' will be even faster. If she were using Windows, she would use \code{\link{fastqDump}} instead, which is single-threaded.
#' 
#' 
#' @docType package
#' @name geomedb
#' @keywords internal
NULL
biocodellc/fimsR-access documentation built on Nov. 17, 2022, 1:56 a.m.