Creating An ExperimentHub Package

Overview

ExperimentHubData provides tools to add or modify resources in Bioconductor's ExperimentHub. This 'hub' houses curated data from courses, publications or experiments. The resources are generally not files of raw data (as can be the case in AnnotationHub) but instead are R / Bioconductor objects such as GRanges, SummarizedExperiment, data.frame etc. Each resource has associated metadata that can be searched through the ExperimentHub client interface.

New resources

Resources are contributed to ExperimentHub in the form of a package. The package contains the resource metadata, man pages, vignette and any supporting R functions the author wants to provide. This is a similar design to the existing Bioconductor experimental data packages except the data are stored in AWS S3 buckets instead of the data/ directory of the package.

Below are the steps required for adding new resources.

Notify Bioconductor team member

The man page and vignette examples in the data experiment package will not work until the data are available in ExperimentHub. Adding the data to AWS S3 and the metadata to the production database involves assistance from a Bioconductor team member. Please read the section "Uploading Data to S3".

Building the data experiment package

When a resource is downloaded from ExperimentHub, the associated data experiment package is loaded in the workspace, making the man pages and vignettes readily available. Because documentation plays an important role in understanding these curated resources, please take the time to develop clear man pages and a detailed vignette. These documents provide essential background to the user and guide appropriate use of the resources.

Below is an outline of package organization. The files listed are required unless otherwise stated.

inst/extdata/

An example data experiment package metadata.csv file can be found here

inst/scripts/

vignettes/

R/

man/

DESCRIPTION / NAMESPACE
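
The layout outlined above can be scaffolded from the shell. This is only a sketch under stated assumptions: the package name myExperimentPackage is the illustrative one used in the metadata example in this document, and every file is created empty as a placeholder for the author to fill in.

```shell
#!/bin/sh
# Scaffold the directory layout of a data experiment package.
# Assumption: the package is called myExperimentPackage (illustrative name).
pkg_root="myExperimentPackage"

for dir in inst/extdata inst/scripts vignettes R man; do
    mkdir -p "$pkg_root/$dir"
done

# Empty placeholders; touch does not overwrite existing content.
touch "$pkg_root/DESCRIPTION" "$pkg_root/NAMESPACE"
touch "$pkg_root/inst/scripts/make-metadata.R"   # script that produces metadata.csv
touch "$pkg_root/inst/extdata/metadata.csv"      # output of make-metadata.R

# Show what was created.
find "$pkg_root" | sort
```

Note that no data/ directory is created, since the data themselves live in S3 rather than in the package.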

Data objects

Data are not formally part of the software package and are stored separately in AWS S3 buckets. The author should follow instructions in the section "Uploading Data to S3".

Metadata

When you are satisfied with the representation of your resources in make-metadata.R (which produces metadata.csv) the Bioconductor team member will add the metadata to the production database.

Package review

Once the data are in AWS S3 and the metadata have been added to the production database, the man pages and vignette can be finalized. When the package passes R CMD build and check it can be submitted to the package tracker for review. The package should be submitted without any of the data that is now located on S3; this keeps the package lightweight and minimal in size while still providing access to the large data files stored on S3. If the data files were added to the GitHub repository, please see "removing large data files and clean git tree" to remove the large files and reduce package size.
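
Before submitting, it can help to confirm that no large data files remain in the package tree. The following is a minimal sketch, not an official check: the helper name large_files, the default directory name and the 5120 KiB cut-off are all assumptions chosen for illustration.

```shell
#!/bin/sh
# List files above a size threshold so they can be removed before submission.
# Assumptions: the package lives in ./myExperimentPackage and 5120 KiB is an
# illustrative cut-off; adjust both to your situation.
pkg_root="myExperimentPackage"
max_kb=5120

large_files() {
    # $1 = directory to scan, $2 = threshold in KiB
    # -size +Nk matches files larger than N KiB; the .git directory is skipped.
    find "$1" -type f -size +"${2}k" ! -path '*/.git/*'
}

if [ -d "$pkg_root" ]; then
    large_files "$pkg_root" "$max_kb"
else
    echo "no such directory: $pkg_root" >&2
fi
```

An empty listing means nothing in the working tree exceeds the threshold. Files that were committed earlier and later deleted still live in the git history, which this scan does not inspect.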

Often these data packages are created as a supplement to a software package. There is a process for submitting multiple packages under the same issue.

Add additional resources

Metadata for new versions of the data can be added to the same package as they become available.

Contact [email protected] or [email protected] with any questions.

Bug fixes

A bug fix may involve a change to the metadata, data resource or both.

Update the resource

Update the metadata

New metadata records can be added for new resources but modifying existing records is discouraged. Record modification will only be done in the case of bug fixes.

Remove resources

Removing resources should be done with caution. The intent is that ExperimentHub be a 'reproducible' resource by providing a stable snapshot of the data. Data made available in Bioconductor version x.y.z should be available for all versions greater than x.y.z. Unfortunately this is not always possible. If you find it necessary to remove data from ExperimentHub please contact [email protected] or [email protected] for assistance.

When a resource is removed from ExperimentHub, the 'status' field in the metadata is modified to explain why it is no longer available. Once this status is changed, the ExperimentHub() constructor will not list the resource among the available ids. An attempt to extract the resource with '[[' and the EH id will return an error along with the status message.

Uploading Data to S3

Instead of providing the data files via dropbox, ftp, etc. we will grant temporary access to an S3 bucket where you can upload your data. Please email [email protected] for access.

You will be given access to the 'AnnotationContributor' user. Ensure that the AWS CLI is installed on your machine. See instructions for installing AWS CLI here. Once you have requested access you will be emailed a set of keys. There are two options for setting up the AnnotationContributor profile:

  1. Update your .aws/config file to include the following, updating the keys accordingly:
[profile AnnotationContributor]
output = text
region = us-east-1
aws_access_key_id = ****
aws_secret_access_key = ****
  2. If you can't find the .aws/config file, run the following command, entering the appropriate information from above:
aws configure --profile AnnotationContributor
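
To see which of the two options applies, one can check whether the config file already defines the profile. This is a minimal sketch: the helper name has_profile is hypothetical, and the default location ~/.aws/config is an assumption about your setup.

```shell
#!/bin/sh
# Check whether an AWS CLI config file already defines a given profile.
# Assumption: the config file is at the default location ~/.aws/config.
has_profile() {
    # $1 = config file, $2 = profile name
    # -x matches the whole line, -F treats the pattern as a fixed string.
    [ -f "$1" ] && grep -qxF "[profile $2]" "$1"
}

config_file="$HOME/.aws/config"
if has_profile "$config_file" AnnotationContributor; then
    echo "profile AnnotationContributor found in $config_file"
else
    echo "profile AnnotationContributor not found in $config_file"
fi
```

The check only looks for the section header; it does not verify that the keys inside it are valid.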

After the configuration is set you should be able to upload resources using

aws --profile AnnotationContributor s3 cp test_file.txt s3://annotation-contributor/test_file.txt --acl public-read

# to upload a directory

aws --profile AnnotationContributor s3 cp test_dir s3://annotation-contributor/test_dir --recursive --acl public-read

Please upload the data with the appropriate directory structure, including subdirectories as necessary (i.e. top directory must be software package name, then if applicable, subdirectories of versions, ...)
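
One way to get the directory structure right is to stage the files locally under a top directory named after the package and upload that directory recursively. The sketch below only prints the upload command rather than running it. The package and file names are the illustrative ones from the metadata example in this document, the staging directory name is an assumption, and the files are empty stand-ins for real data.

```shell
#!/bin/sh
# Stage files locally in the layout expected on S3, then print (not run)
# the recursive upload command.
# Assumptions: illustrative package and file names; empty stand-in files.
pkg="myExperimentPackage"
staging="staging"

mkdir -p "$staging/$pkg"
touch "$staging/$pkg/SEobject.rda"
touch "$staging/$pkg/mydatabase.sqlite"

# Paths relative to the staging directory; in the example metadata these
# are the values of the RDataPath column.
(cd "$staging" && find "$pkg" -type f | sort)

upload_cmd="aws --profile AnnotationContributor s3 cp $staging/$pkg s3://annotation-contributor/$pkg --recursive --acl public-read"
echo "$upload_cmd"
```

Version subdirectories, where applicable, would sit between the package directory and the files.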

Once the upload is complete, email [email protected] to continue the process. To add the data officially, the data must be uploaded and the metadata.csv file must be created in the GitHub repository.

Validating

The best way to validate record metadata is to read inst/extdata/metadata.csv with ExperimentHubData::makeExperimentHubMetadata(). If that is successful the metadata are ready to go.

Example metadata.csv file and more information

As described above, the metadata.csv file (or multiple metadata.csv files) will need to be created before the data can be added to the database. To ensure proper formatting, run ExperimentHubData::makeExperimentHubMetadata() on the package with any/all metadata files, and address any ERRORs that occur. Each object uploaded to S3 should have an entry in the metadata file. Briefly, a description of the metadata columns required:

Any additional columns in the metadata.csv file will be ignored but could be included for internal reference.

This is a dummy example, but hopefully it will give you an idea of the format. Let's say I have a package myExperimentPackage and I upload two files: one a SummarizedExperiment of expression data saved as a .rda, and the other a sqlite database, both considered simulated data. You would want the following saved as a csv (comma-separated output), but for easier viewing we show it as a table:

Title | Description | BiocVersion | Genome | SourceType | SourceUrl | SourceVersion | Species | TaxonomyId | Coordinate_1_based | DataProvider | Maintainer | RDataClass | DispatchClass | RDataPath
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Simulated Expression Data | Simulated Expression values for 12 samples and 12000 probes | 3.9 | NA | Simulated | http://mylabshomepage | v1 | NA | NA | NA | http://bioconductor.org/packages/myExperimentPackage | Bioconductor Maintainer maintainer@bioconductor.org | SummarizedExperiment | Rda | myExperimentPackage/SEobject.rda
Simulated Database | Simulated Database containing gene mappings | 3.9 | hg19 | Simulated | http://bioconductor.org/packages/myExperimentPackage | v2 | Homo sapiens | 9606 | NA | http://bioconductor.org/packages/myExperimentPackage | Bioconductor Maintainer maintainer@bioconductor.org | SQLiteConnection | SQLiteFile | myExperimentPackage/mydatabase.sqlite
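
As an actual comma-separated file, the two records could look like the sketch below, which also runs a quick field-count check before the file is handed to the R validation step. The file name metadata-example.csv and the helper check_fields are hypothetical. The check assumes that no field contains a comma; metadata with embedded commas needs quoting and a real CSV parser.

```shell
#!/bin/sh
# Write the two example records as CSV and check that every row has as
# many fields as the header.
# Assumption: no field contains a comma, so a plain comma split is enough.
csv="metadata-example.csv"

cat > "$csv" <<'EOF'
Title,Description,BiocVersion,Genome,SourceType,SourceUrl,SourceVersion,Species,TaxonomyId,Coordinate_1_based,DataProvider,Maintainer,RDataClass,DispatchClass,RDataPath
Simulated Expression Data,Simulated Expression values for 12 samples and 12000 probes,3.9,NA,Simulated,http://mylabshomepage,v1,NA,NA,NA,http://bioconductor.org/packages/myExperimentPackage,Bioconductor Maintainer maintainer@bioconductor.org,SummarizedExperiment,Rda,myExperimentPackage/SEobject.rda
Simulated Database,Simulated Database containing gene mappings,3.9,hg19,Simulated,http://bioconductor.org/packages/myExperimentPackage,v2,Homo sapiens,9606,NA,http://bioconductor.org/packages/myExperimentPackage,Bioconductor Maintainer maintainer@bioconductor.org,SQLiteConnection,SQLiteFile,myExperimentPackage/mydatabase.sqlite
EOF

check_fields() {
    # $1 = CSV file; exits non-zero if any row differs from the header width
    awk -F',' '
        BEGIN   { bad = 0 }
        NR == 1 { n = NF; next }
        NF != n { bad = 1; printf "line %d has %d fields, expected %d\n", NR, NF, n }
        END     { exit bad }
    ' "$1"
}

if check_fields "$csv"; then
    echo "$csv: all rows match the header"
fi
```

This only catches missing or extra fields. Whether the values themselves are acceptable is decided by ExperimentHubData::makeExperimentHubMetadata().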




ExperimentHub documentation built on May 6, 2019, 3:05 a.m.