The installation of the archs4 package is a bit more involved than a standard package installation and can be roughly broken down into three steps.

  1. Install the R package along with its dependencies.
  2. Download a number of (large) data files into a specific folder.
  3. Generate metadata from the files downloaded in (2) for downstream use.

We will walk you through each step in this section.

R Package Installation

The arcsh4 package depends on other packages that are available through both [CRAN][cran] and [Bioconductor][bioc]. For that reason, we will use the [BiocInstaller::biocLite()][biocLite] function to install this package, which can seamlessly install packages from github, CRAN, and Bioconductor.

source("https://bioconductor.org/biocLite.R")
biocLite("denalitherapeutics/archs4", build_vignettes=TRUE)
library("archs4")

When you first load the archs4 library, you will notice a startup message telling you that something isn't quite right with your archs4 installation. The message will look something like this:

Note that your default archs4 data directory is NOT setup correctly

  * Run `archs4_local_data_dir_validate()` to diagnose
  * Refer to the ARCHS4 Data Download section of the archs4 vignette for more information

Your default archs4 data directory (`getOption("archs4.datadir")`) is:

  ~/.archs4data

In order for the package to work correctly, you must download a number of files which are enumerated in the Data File Download section below into a single directory. You will then instruct the archs4 package the path to the directory that holds all of these files by setting the value of R's global "archs4.datadir" option to be the path to that directory.

Data File Download

You will have to create a directory on your filesystem which will hold a number of data files that the archs4 package depends on. Let's call this directory $ARCHS4DIR, which we will define here to be ~/archs4v6data.

The archs4 package provides the archs4_local_data_dir_create() convenience function which creates this directory and copies over a meta.yaml file into that directory. The purpose of this file is to specify the names of the downloaded files that correspond to the human and mouse-level gene and transcript-level data.

library(archs4)
archs4dir <- "~/archs4v6data"
archs4_local_data_dir_create(archs4dir)

Once this directory is created successfully, you will then have to download the following files into it:

sysdir <- system.file("extdata", package = "archs4")
cat(archs4:::md_archs4_download_bullet_list(sysdir))

The enumerated items above contain links to the files that need to be downloaded. You can right-click on them and select Save As ... and instruct your web-browser to save them to your local $ARCHS4DIR.

NOTE: Most all of the archs4 functions accept a datadir parameter, which should be the path to $ARCHS4DIR. For convenience, the default value of this parameter is always set to getOption("archs4.datadir"). This means that you can modify your ~/.Rprofile file to set the value of this option to "~/archs4v2data" (for instance), so that the package will always look there by default. If this option is not set in your ~/.Rprofile, the default value for this option is "~/.archs4data".

Feature-Level Metadata Generation

The datasets currently made available by the [ARCHS4 Project][archs4web] only provide minimal feature-level metadata:

We want to augment these features with richer annotations, such as the ensembl gene identifiers or gene biotypes, for instance.

To make such data generation automatic and easy for the user, once you have downloaded the Ensembl GTF files listed above into the $ARCHS4DIR, you can run the create_augmented_feature_info() to extract these extra feature-level metadata from the GTF files and store them as tables inside $ARCHS4DIR for later use.

create_augmented_feature_info(archs4dir)

This function will load and parse the GTF files from human and mouse, and create gene- and transcript-level *.csv.gz files in the $ARCHS4DIR which the archs4 package will then later use downstream.

Once your $ARCHS4DIR is setup, you may find it convenient to set the default value for R's global "archs4.datadir" option to the $ARCHS4DIR directory you just setup. To do so, you can put the following line in your ~/.Rprofile file:

options(archs4.datadir = "~/archs4v2data")

Library Size and Normalization Factors

It is often convenient to extract normalized versions of the count data from the gene-level expression matrices. In order to do this "on-the-fly" we provide the estimate_repository_norm_factors() functions, which accepts an Archs4Repository object and essentially performs the steps necessary to create edgeR::TMM-like normalization factors across the entire ARCSH4 expression atlas.

In order to avoid laoding the entire expression matrix into memory, the estimate_repository_norm_factors() splits up the processing in batches, loading a subset of samples at a time until it completes. This process will take quite some time (around two hours on a modern laptop).

Note that recent versions of the ARCHS4 data do provide meta/reads_aligned and meta/total_reads entries. We don't use those here, but we can compare the values their with what we calculate above.

a4 <- Archs4Repository(archs4dir)
estimate_repository_norm_factors(a4)

ARCHS4 Installation Heatlh

Because the installation of this package is a bit more involved than most, we have also provided an archs4_local_data_dir_validate() function, which you can run over your $ARCHS4DIR in order to check on "the health" of your install.

This function will simply look at your $ARCHS4DIR to ensure that the required files are there, and tries to give you helpful error messages if not.

For instance, if the first two files enumerated in the Data File Download section were missing from your $ARCHS4DIR (ie. human_matrix.h5 and human_hiseq_transcript_v2.h5), you would be warned that "something isn't right" when you first load the archs4 package. You could then run the archs4_local_data_dir_validate() to see what is wrong:

archs4_local_data_dir_validate(archs4dir)
#> The following ARCHS4 files are missing, please download them:
#>   * human_matrix.h5: https://s3.amazonaws.com/mssm-seq-matrix/human_matrix.h5
#>   * human_hiseq_transcript_v2.h5: #> https://s3.amazonaws.com/mssm-seq-matrix/human_hiseq_transcript_v2.h5

NOTE: If all installation and data download/processing steps have been completed successfully, a call to archs4_local_data_dir_validate() will simply return TRUE.

Package Development

If you are developing this package, you will find that it will be convenient to symlink the package's default archs4.datadir path (~/.arcsh4data) to the $ARCHS4DIR you just setup. This is because often times things like roxygen2 document compilation, unit testing, etc. happen in a vanilla R workspace, which won't run the configuration that is prescribed in your ~/.Rprofile file.



denalitherapeutics/archs4 documentation built on May 17, 2019, 1:29 p.m.