Cheat Sheet


Cheat Sheet for adding Annotations

This doc describes how to add resources to AnnotationHub. In general, these instructions pertain to core team members only.


AnnotationHub


Run Periodically - whenever there is a new release

Ensembl release

This requires generating two types of resources:

  1. GTF -> GRanges on the fly conversion
  2. 2Bit
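
For context, a hedged sketch of how end users eventually retrieve these two resource types once they are live (the AH IDs below are placeholders, not real records):

library(AnnotationHub)
hub <- AnnotationHub()
gr     <- hub[["AH00001"]]   # a GTF resource; imported on the fly and returned as a GRanges (placeholder ID)
twobit <- hub[["AH00002"]]   # a 2bit resource; typically returned as a TwoBitFile reference (placeholder ID)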

GTF

On Local Machine:

  1. Navigate to the AnnotationHub_docker directory or create such a directory by following the instructions here.

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)
# Since this grabs the file and converts on the fly
# There is actually no need to run with metadataOnly=FALSE
# 
# This periodically fails. We realized that after a certain number
# of successive hits to the Ensembl ftp site, the site starts asking
# for a username:password or simply blocks entirely.
# We have tried increasing the sleep time between failed attempts,
# with a max retry of 3.  Normally this function will run completely
# after 1-3 attempts.
# In the future maybe consider adding a conditional userpwd argument to getURL
meta <- updateResources(getwd(),
              BiocVersion = "3.6",
              preparerClasses = "EnsemblGtfImportPreparer",
              metadataOnly = TRUE, insert = FALSE,
              justRunUnitTest = FALSE, release = "89")
# test/check meta
pushMetadata(meta, url)

# you could rerun updateResources with insert=TRUE to do the push but I like to check resource data
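
A minimal sketch of the kind of check referred to above, before pushing (meta is the list returned by updateResources(); which fields you eyeball is up to you):

length(meta)        # number of new records
class(meta[[1]])    # expect "AnnotationHubMetadata"
meta[[1]]           # print one record and spot-check title, rdatapath, sourceurl, etc.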

  4. exit R

  5. Convert db to sqlite (puts the file in the data/ directory)

sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh

  6. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

2Bit

On EC2 Instance:

The files will be downloaded, converted, and pushed to the S3 bucket. This should be done on the EC2 instance val_annotations. If it is not running, start the EC2 instance on AWS and log in as user ubuntu.

Because this can take a while, it is recommended to use the screen application. Some useful screen commands:

- start screen by typing 'screen'
- cd to the directory you want to be in, then start the process or code you want to run
- detach from the screen session with 'Ctrl-a' 'd'
- list screen sessions with 'screen -ls'
- reconnect to a specific session (e.g., XYZ) with 'screen -r XYZ'
  1. Start screen
  2. Navigate to directory to run code
  3. In R:
library(AnnotationHubData)
# to populate S3 bucket need metadataOnly = FALSE
# metadataOnly will control population of S3 bucket
# insert controls inserting into database
meta <- updateResources(getwd(),
            BiocVersion = "3.6",
            preparerClasses = "EnsemblTwoBitPreparer",
            metadataOnly = FALSE, insert = FALSE,
            justRunUnitTest = FALSE, release = "89")

# a suggested step is to save(meta, file="metadataForTwoBit")
# and scp to local machine

  4. Check that the S3 bucket is being populated.
  5. Once finished, close everything and stop the EC2 instance.

On Local Machine:

  1. Navigate to the AnnotationHub_docker directory

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# option 1:
meta <- updateResources(getwd(),
            BiocVersion = "3.6",
            preparerClasses = "EnsemblTwoBitPreparer",
            metadataOnly = TRUE, insert = FALSE,
            justRunUnitTest = FALSE, release = "89")


# option 2:
# if you saved the meta from the EC2 instance
load("metadataForTwoBit")

pushMetadata(meta, url)

# you could rerun updateResources with insert=TRUE to do the push but I like to check resource data

  4. exit R

  5. Convert db to sqlite (puts the file in the data/ directory)

sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh

  6. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

After the GTF or 2bit resources are added to the production database, you should be able to get the resources and see the updated timestamp of the database with something like the following:

library(AnnotationHub)
hub = AnnotationHub()
length(query(hub, c("ensembl", "gtf", "release-89")))
length(query(hub, c("fasta", "release-89", "twobit")))
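
To see the database timestamp itself, the hub's snapshot date can also be checked (snapshotDate() is the AnnotationHub accessor for the date of the database the client is using):

snapshotDate(hub)   # should reflect the newly updated database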

User Contributed Resources

Contributors will generally reach out when they want to add or update annotations in AnnotationHub. In the past they provided the annotations through an application like Dropbox; we have since updated the process, and contributors now upload files directly to S3 in a temporary location. Send them the instructions found here.

A key will have to be generated for them to access and use the AnnotationContributor account. Go here. Under the 'Security credentials' tab, click 'Create access key'. Send the Access key ID to the contributor; the Secret access key is stored in AWS. When the contributor is done, you can delete the key (small 'x' at the right of the key row).

Advise that their data should be in a directory with the same name as the software package that will access the annotations; subdirectories to keep track of versions are strongly encouraged.

Once the data is uploaded to S3 move the data to the proper location.

We will need a copy of the package to generate and test the annotations. Request a link to the package from the user.

Follow instructions here

In general, generate the list of AnnotationHubMetadata objects with makeAnnotationHubMetadata() or updateResources(). To test that the metadata.csv is properly formatted, run makeAnnotationHubMetadata().
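
For example, a minimal sketch (the package path is hypothetical; makeAnnotationHubMetadata() reads the metadata.csv from the package's inst/extdata directory):

library(AnnotationHubData)
# local checkout of the contributed package (hypothetical path)
meta <- makeAnnotationHubMetadata("/path/to/ContributedPkg")
# errors here usually point to problems in inst/extdata/metadata.csv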

Some suggested testing procedures can be found here

When satisfied, start the AnnotationHub docker and add the resources to the docker database.

  1. Navigate to the AnnotationHub_docker directory

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
library(AnnotationHubData)
url <- getOption("AH_SERVER_POST_URL")

# run the appropriate makeAnnotationHubMetadata() call and then
# pushMetadata(meta[[1]], url)

  4. exit R

  5. Test: From the list of running dockers (sudo docker ps), find the process with db in the name, e.g. test_db_1. Connect to the container with sudo docker exec -ti test_db bash (substituting the actual container name). Log into mysql with mysql -p -u ahuser (the password is the same as the exported MYSQL_REMOTE_PASSWORD). Explore with mysql commands like select * from resources order by id desc limit 5;

  6. Convert db to sqlite (puts the file in the data/ directory). Which command to run will depend on the name of the process, but it should be one of the following:

sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh

sudo docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh

  7. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

Run at Release Time - after new builds of annotations

makeStandardOrgDbs

This recipe should be run after the new OrgDb packages built for the release are available in the devel repo. The code essentially loads the current packages, extracts the sqlite file, and creates some basic metadata.

The BiocVersion should be whatever the next release version will be (the current devel, soon to be the release). The OrgDb resources get the same name when they are regenerated - they aren't tied to a genome build, so that's not a distinguishing feature in the title. We only want one OrgDb for each species available in a release, and the BiocVersion is what we use to filter which records are exposed.
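
To see the effect of that filtering from the client side, something like the following can be used (a hedged sketch; the hub only exposes the OrgDb records whose BiocVersion matches the client's Bioconductor version):

library(AnnotationHub)
hub <- AnnotationHub()
orgs <- query(hub, "OrgDb")
length(orgs)                  # roughly one OrgDb per species
head(mcols(orgs)$species)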

On AWS:

Create S3 bucket based on rdatapath: annotationhub/ncbi/standard/

On Local Machine:

  1. Navigate to the AnnotationHub_docker directory

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# see the man page for clarification on testing and actively pushing
# ?makeStandardOrgDbsToAHM
# to populate S3 bucket need metadataOnly = FALSE
# metadataOnly will control population of S3 bucket
# insert controls inserting into database
meta <- updateResources(getwd(),
            BiocVersion = "3.5",
            preparerClasses = "OrgDbFromPkgsImportPreparer",
            metadataOnly = TRUE, insert = FALSE,
            justRunUnitTest = FALSE,
            downloadOrgDbs=TRUE)
# downloadOrgDbs can be FALSE for subsequent runs

pushMetadata(meta, url)

  4. exit R

  5. Convert db to sqlite (puts the file in the data/ directory)

sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh

  6. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

  7. After uploading to production, you can test that the resources are available in release and devel:

library(AnnotationHub)
hub <- AnnotationHub()
query(hub, "OrgDb")
table(mcols(query(hub, "OrgDb"))$rdatadateadded)

If testing before the release occurs, debug(AnnotationHub:::.uid0) so that query2's biocversion is the desired test version, or try setAnnotationHubOption("TESTING", TRUE).

makeStandardTxDbs

This recipe should be run after the new TxDbs have been built and are in the devel repo. The code loads the packages, extracts the sqlite file and creates metadata.

The BiocVersion should be whatever the next release version will be (the current devel, soon to be the release). As with the OrgDb recipe, regenerated resources get the same name; we only want one of each TxDb available in a release, and the BiocVersion is what we use to filter which records are exposed.

On AWS:

Create S3 bucket based on rdatapath: annotationhub/ucsc/standard/

On Local Machine:

  1. Navigate to the AnnotationHub_docker directory

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# see the man page for clarification on testing and actively pushing
# follow example in man page
# will need the list of updated or added TxDbs to make a character vector of files

?makeStandardTxDbsToAHM
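
# e.g., a character vector of the updated/added TxDb packages (the names below
# are examples only; use the actual list for this release, per the man page):
# txdbs <- c("TxDb.Hsapiens.UCSC.hg38.knownGene",
#            "TxDb.Mmusculus.UCSC.mm10.knownGene")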



#
# The following will hopefully be implemented in the future
#
#meta <- updateResources(getwd(),
#                        BiocVersion = "3.5",
#                        preparerClasses = "TxDbFromPkgsImportPreparer",
#                        metadataOnly = TRUE, insert = FALSE,
#                        justRunUnitTest = FALSE,
#                        downloadTxDbs=TRUE)
# downloadTxDbs can be FALSE for subsequent runs



pushMetadata(meta, url)

  4. exit R

  5. Convert db to sqlite (puts the file in the data/ directory)

sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh

  6. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

  7. After uploading to production, you can test that the resources are available in release and devel:

library(AnnotationHub)
hub <- AnnotationHub()
query(hub, "TxDb")
table(mcols(query(hub, "TxDb"))$rdatadateadded)

makeNCBIToOrgDbs (Non Standard Orgs)

This code generates ~1400 non-standard OrgDb sqlite files from NCBI. These are less comprehensive than the standard OrgDb packages. It's best to run this on an EC2 instance. You can run it locally if your machine has enough space to download the files from NCBI, but keep in mind this code takes several hours to run.

The BiocVersion should be whatever the next release version will be (the current devel, soon to be the release). The OrgDb resources get the same name when they are regenerated - they aren't tied to a genome build, so that's not a distinguishing feature in the title. We only want one OrgDb for each species available in a release, and the BiocVersion is what we use to filter which records are exposed.

Before running, make sure the AnnotationForge/inst/extdata/viableIDs.rda and GenomeInfoDbData/data/specData.rda files have been updated and pushed in their respective packages. The scripts to generate these data files are in the packages' inst/scripts directories, as viableIDs.R and updateGenomeInfoDbData.R respectively.
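
A hedged sanity check that the installed copies of those packages contain the updated data files (this just loads them and lists their contents; the object names inside the .rda are not assumed):

e <- new.env()
load(system.file("extdata", "viableIDs.rda", package = "AnnotationForge"), envir = e)
ls(e)                          # see what the file provides
data(specData, package = "GenomeInfoDbData")
str(specData, max.level = 1)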

On AWS:

Create S3 bucket: annotationhub/ncbi/uniprot/

On EC2 Instance:

The files will be downloaded, converted, and pushed to the S3 bucket. This should be done on the EC2 instance val_annotations. If it is not running, start the EC2 instance on AWS and log in as user ubuntu.

Because this can take a while, it is recommended to use the screen application. Some useful screen commands:

- start screen by typing 'screen'
- cd to the directory you want to be in, then start the process or code you want to run
- detach from the screen session with 'Ctrl-a' 'd'
- list screen sessions with 'screen -ls'
- reconnect to a specific session (e.g., XYZ) with 'screen -r XYZ'
  1. Start screen
  2. Navigate to directory to run code
  3. In R:
library(AnnotationHubData)

# see the man page for clarification on testing and actively pushing
# ?makeNCBIToOrgDbsToAHM
# to populate S3 bucket need metadataOnly = FALSE
# metadataOnly will control population of S3 bucket
# insert controls inserting into database
meta <- updateResources(getwd(),
            BiocVersion = "3.5",
            preparerClasses = "NCBIImportPreparer",
            metadataOnly = TRUE, insert = FALSE,
            justRunUnitTest = FALSE)

# a suggested step is to save(meta, file="metadataForTwoBit")
# and scp to local machine

  4. Check that the S3 bucket is being populated.
  5. Once finished, close everything and stop the EC2 instance.

Note: We have had issues in the past with the recipe completing for all desired resources. The recipe, when repeatedly run, will check the appropriate S3 bucket to determine which resources still need to be processed. The helper function needToRerunNonStandardOrgDb can be run to determine if a repeat call should be made. This is important because if all the resources are on AWS and metadataOnly=FALSE, it will assume you want to overwrite all the files and begin the generation over.

On Local Machine:

  1. Navigate to the AnnotationHub_docker directory

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# option 1:
meta <- updateResources(getwd(),
            BiocVersion = "3.5",
            preparerClasses = "NCBIImportPreparer",
            metadataOnly = TRUE, insert = FALSE,
            justRunUnitTest = FALSE)


# option 2:
# if you saved the meta from the EC2 instance
load("metadataForTwoBit")


pushMetadata(meta, url)

  4. exit R

  5. Convert db to sqlite (puts the file in the data/ directory)

sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh

  6. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

ExperimentHub

ExperimentHub resources are added upon request, or when it is recommended that a package be an experiment data package rather than a software package. The maintainer of the package that will use the data will reach out. The process is then similar to adding contributor resources with AnnotationHubData.

The user will upload files to S3 in a temporary location. Send them the instructions found here

A key will have to be generated for them to access and use the AnnotationContributor account. Go here. Under the 'Security credentials' tab, click 'Create access key'. Send the Access key ID to the contributor; the Secret access key is stored in AWS. When the contributor is done, you can delete the key (small 'x' at the right of the key row).

Advise that their data should be in a directory with the same name as the software package that will use the experiment data when uploading (i.e., software package Test would upload files to S3 under a folder Test: Test/file1, Test/file2, etc.). If subdirectories are needed that is okay, but ensure the RDataPath in the metadata.csv reflects this structure.

Once the data is uploaded to S3 move the data to the proper location.

We will need a copy of the package to generate and test the annotations. Request a link to the package from the user. The following should be sent to the user to ensure the package is set up correctly: instructions.

In general, generate the list of ExperimentHubMetadata objects with makeExperimentHubMetadata() or addResources(). To test that the metadata.csv is properly formatted, run makeExperimentHubMetadata().
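
For example, a minimal sketch (the package path is hypothetical; makeExperimentHubMetadata() reads the metadata.csv from the package's inst/extdata directory):

library(ExperimentHubData)
# local checkout of the contributed package (hypothetical path)
meta <- makeExperimentHubMetadata("/path/to/ContributedExperimentPkg")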

Info on ExperimentHub docker and how to set up docker directory.

When satisfied, start the ExperimentHub docker and add the resources to the docker database.

  1. Navigate to the ExperimentHub_docker directory

  2. Start the docker:

export MYSQL_REMOTE_PASSWORD=***  (See credentials doc)
sudo docker-compose up

  3. In a new terminal start R:
options(EXPERIMENT_HUB_SERVER_POST_URL="http://localhost:4000/resource")
options(EXPERIMENT_HUB_URL="http://localhost:4000")
library(ExperimentHubData)
url <- getOption("EXPERIMENT_HUB_SERVER_POST_URL")

# run the appropriate makeExperimentHubMetadata() call and, if necessary, the following:
# pushMetadata(meta, url)

  4. exit R

  5. Test: From the list of running dockers (sudo docker ps), find the process with db in the name, e.g. test_db_1. Connect to the container with sudo docker exec -ti test_db bash (substituting the actual container name). Log into mysql with mysql -p -u hubuser (the password is the same as the exported MYSQL_REMOTE_PASSWORD). Explore with mysql commands like select * from resources order by id desc limit 5;

  6. Convert db to sqlite (puts the file in the data/ directory). Which command to run will depend on the name of the process, but it should be one of the following:

sudo docker exec experimenthubdocker_experimenthub_1 bash /bin/backup_db.sh

sudo docker exec experimenthub_docker_experimenthub_1 bash /bin/backup_db.sh

  7. If satisfied, copy this file to annotationhub.bioconductor.org and follow instructions for updating production database

Some other Notes and helpful hints:

  1. If a new recipe needs to be added, the recipe is added in AnnotationHub. Be sure to then update the version of the AnnotationHub dependency in the DESCRIPTION of ExperimentHub and bump the version of ExperimentHub.

  2. Remember that when an ExperimentHub package is accepted, it is uploaded to a different repo (the data/experiment repository rather than the software repository).


