This vignette is for those wishing to process online resources into AnnotationHub resources. There are four steps involved in creating resources.
This vignette is intended for users who are comfortable with R. Be
sure to use Bioc-devel. Install the AnnotationHubData package
using biocLite("AnnotationHubData")
.
The AnnotationHubData package provides a place where we store code
to processes online resources into AnnotationHub
resources. AnnotationHubMetadata
objects are used to describe an
online resource, including a function ('recipe') that creates the
resource.
The steps involved include writing a recipe are:
AnnotationHubMetadata
objectsAnnotationHubMetadata
Generating FunctionsThe following example function takes files from the latest release of
inparanoid and processes them into a list of AnnotationHubMetadata
objects (e.g., using Map()
).
The function body can contain hard-coded information, but many recipes
will 'scrape' internet resources, creating many AnnotationHub
resources; in the example below, .inparanoidMetadataFromUrl()
visits
the inparanoid web site to find current data. The Recipe:
field
indicates the function that will perform optional step 2, transforming
the original source file(s) to an object that is more convenient for
user input. This function returns a list of AnnotationHubMetadata
objects.
makeinparanoid8ToAHMs <- function(currentMetadata, justRunUnitTests=FALSE, BiocVersion=biocVersion()) { baseUrl <- 'http://inparanoid.sbc.su.se/download/current/Orthologs_other_formats' ## Make list of metadata in a helper function meta <- .inparanoidMetadataFromUrl(baseUrl) ## then make AnnotationHubMetadata objects. Map(AnnotationHubMetadata, Description=meta$description, Genome=meta$genome, SourceFile=meta$sourceFile, SourceUrl=meta$sourceUrl, SourceVersion=meta$sourceVersion, Species=meta$species, TaxonomyId=meta$taxonomyId, Title=meta$title, RDataPath=meta$rDataPath, MoreArgs=list( Coordinate_1_based = TRUE, DataProvider = baseUrl, Maintainer = "Marc Carlson <mcarlson@fhcrc.org>", RDataClass = "SQLiteFile", RDataDateAdded = Sys.time(), RDataVersion = "0.0.1", Recipe = "AnnotationHubData:::inparanoid8ToDbsRecipe", Tags = c("Inparanoid", "Gene", "Homology", "Annotation"))) }
Here is a listing of AnntotationHubMetadata
arguments:
AnnotationHubRoot
: 'character(1)' Absolute path to directory
structure containing resources to be added to AnnotationHub. SourceUrl
: 'character()' URL where resource(s) can be foundSourceType
: 'character()' which indicates what kind of resource was
initially processed. The preference is to name the type of resource
if it's a single file type and to name where the resources came from
if it is a compound resource. So Typical answers would be like:
'BED','FASTA' or 'Inparanoid' etc.SourceVersion
: 'character(1)' Version of original fileSourceLastModifiedDate
: 'POSIXct()' The date when the source was
last modified. Leaving this blank should allow the values to be
retrieved for you (if your sourceURL is valid).SourceMd5
: 'character()' md5 hash of original fileSourceSize
: 'numeric(1)' Number of bytes in original fileDataProvider
: 'character(1)' Where did this resource come from?Title
: 'character(1)' Title for this resourceDescription
: 'character(1)' Description of the resourceSpecies
: 'character(1)' Species nameTaxonomyId
: 'character(1)' NCBI codeGenome
: 'character(1)' Name of genome buildTags
: 'character()' Free-form tagsRecipe
: 'character(1)' Name of recipe functionRDataClass
: 'character(1)' Class of derived object (e.g. 'GRanges')RDataDateAdded
: 'POSIXct()' Date added to AnnotationHub. Used to
determine snapshots.RDataPath
: 'character(1)' file path to serialized formMaintainer
: 'character(1)' Maintainer name and email address, 'A
Maintainer BiocVersion
: 'character(1)' Under which resource was builtCoordinate_1_based
: 'logical(1)' Do coordinates start with 1 or 0?DispatchClass
: 'character(1)' string used to indicate which code
should be called by the client when the resource is downloaded. This
is often the same as the RDataClass. But it is allowed to be a
different value so that the client can do something different
internally if required.Location_Prefix
: 'character(1)' The location prefix is the base path
where the resource is coming from; the default is the Bioconductor
AnnotationHub server.Notes
: 'character()' Notes about the resource.A (optional) recipe function transformed the source data into an
object served by AnnotationHub to the user. It takes a single
AnnotationHubMetadata
object as an argument. Below is a recipe that
generates an inparanoid database object from the metadata stored in
it's AnnotationHubMetadata
object. Note that the recipe will be
invoked on the hub, so should output data to the location specified in
the input AnnotationHubMetadata
object.
inparanoid8ToDbsRecipe <- function(ahm) { require(AnnotationForge) inputFiles <- metadata(ahm)$SourceFile dbname <- makeInpDb(dir=file.path(inputFiles,""), dataDir=tempdir()) db <- loadDb(file=dbname) outputPath <- file.path(metadata(ahm)$AnnotationHubRoot, metadata(ahm)$RDataPath) saveDb(db, file=outputPath) outputFile(ahm) }
While writing this function, care has to be taken for a couple of fields:
Case 1 (common) - The AnnotationHub resource is downloaded directly to the user cache without any pre-processing, then
SourceUrls
specify the original resource locatoin = Location_Prefix + RDataPathNA_character_
Example -
SourceUrls = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz" Location_Prefix = "http://hgdownload.cse.ucsc.edu/", RDataPath = "goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz" Recipe = NA_character_
Case 2 - The AnnotationHub resource requires pre-processing
SourceUrls
should merely document the original location of the
untouched fileLocation_Prefix
+ RDataPath
should be equal to the file path on the
amazon machine where all pre-processed files are stored.Example -
SourceUrls="http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz", Location_Prefix = "http://s3.amazonaws.com/annotationhub/", RDataPath="chainfile/dummy.Rda"
If this seems confusing, please note how in both of these cases, the sourceUrl needs to reflect the location that the resource will actually come from once when the client is in use.
One can post-process a file when it is instantiated into
AnnotationHub from the user's cache. An example, would be a BED
file, downloaded to the user's cache but input into R as a GRanges
using rtrackler::import
. Implement this by defining a class that
extends AnnotationHubResource
and that implements a get1()
method.
setClass("BEDFileResource", contains="AnnotationHubResource") setMethod(".get1", "BEDFileResource", function(x, ...) { .require("rtracklayer") yy <- getHub(x) dat <- rtracklayer::BEDFile(cache(yy)) rtracklayer::import(dat, format="bed", genome=yy$genome, ...) })
The class and method definition typically need to be added to AnnotationHub, and require coordination with us.
At this point make sure that the AnnotationHubMetadata
generating
function produces a list of AnnotationHubMetadata
objects and that
the recipe (if needed) produces an appropriate output path. Contact
us to add your recipe to the production hub.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.