The installation of the archs4
package is a bit more involved than a standard
package installation and can be roughly broken down into three steps.
We will walk you through each step in this section.
The arcsh4
package depends on other packages that are available through both
[CRAN][cran] and [Bioconductor][bioc]. For that reason, we will use the
[BiocInstaller::biocLite()
][biocLite] function to install this package, which
can seamlessly install packages from github, CRAN, and Bioconductor.
source("https://bioconductor.org/biocLite.R") biocLite("denalitherapeutics/archs4", build_vignettes=TRUE) library("archs4")
When you first load the archs4
library, you will notice a startup message
telling you that something isn't quite right with your archs4
installation.
The message will look something like this:
Note that your default archs4 data directory is NOT setup correctly * Run `archs4_local_data_dir_validate()` to diagnose * Refer to the ARCHS4 Data Download section of the archs4 vignette for more information Your default archs4 data directory (`getOption("archs4.datadir")`) is: ~/.archs4data
In order for the package to work correctly, you must download a number of files
which are enumerated in the Data File Download section
below into a single directory. You will then instruct the archs4
package the
path to the directory that holds all of these files by setting the value of R's
global "archs4.datadir
" option to be the path to that directory.
You will have to create a directory on your filesystem which will hold a number
of data files that the archs4
package depends on. Let's call this
directory $ARCHS4DIR
, which we will define here to be ~/archs4v6data
.
The archs4
package provides the archs4_local_data_dir_create()
convenience
function which creates this directory and copies over a meta.yaml
file into
that directory. The purpose of this file is to specify the names of the
downloaded files that correspond to the human and mouse-level gene and
transcript-level data.
library(archs4) archs4dir <- "~/archs4v6data" archs4_local_data_dir_create(archs4dir)
Once this directory is created successfully, you will then have to download the following files into it:
sysdir <- system.file("extdata", package = "archs4") cat(archs4:::md_archs4_download_bullet_list(sysdir))
The enumerated items above contain links to the files that need to be
downloaded. You can right-click on them and select Save As ...
and instruct
your web-browser to save them to your local $ARCHS4DIR
.
NOTE: Most all of the archs4
functions accept a datadir
parameter, which
should be the path to $ARCHS4DIR
. For convenience, the default value of this
parameter is always set to getOption("archs4.datadir")
. This means that you
can modify your ~/.Rprofile
file to set the value of this option to
"~/archs4v2data"
(for instance), so that the package will always look there
by default. If this option is not set in your ~/.Rprofile
, the
default value for this option is "~/.archs4data".
The datasets currently made available by the [ARCHS4 Project][archs4web] only provide minimal feature-level metadata:
We want to augment these features with richer annotations, such as the ensembl gene identifiers or gene biotypes, for instance.
To make such data generation automatic and easy for the user, once you have
downloaded the Ensembl GTF files listed above into the $ARCHS4DIR
, you can
run the create_augmented_feature_info()
to extract these extra feature-level
metadata from the GTF files and store them as tables inside $ARCHS4DIR
for
later use.
create_augmented_feature_info(archs4dir)
This function will load and parse the GTF files from human and mouse, and
create gene- and transcript-level *.csv.gz
files in the $ARCHS4DIR
which
the archs4
package will then later use downstream.
Once your $ARCHS4DIR
is setup, you may find it convenient to set the default
value for R's global "archs4.datadir"
option to the $ARCHS4DIR
directory you
just setup. To do so, you can put the following line in your ~/.Rprofile
file:
options(archs4.datadir = "~/archs4v2data")
It is often convenient to extract normalized versions of the count data from
the gene-level expression matrices. In order to do this "on-the-fly" we provide
the estimate_repository_norm_factors()
functions, which accepts an
Archs4Repository
object and essentially performs the steps necessary to
create edgeR::TMM-like normalization factors across the entire ARCSH4 expression
atlas.
In order to avoid laoding the entire expression matrix into memory, the
estimate_repository_norm_factors()
splits up the processing in batches,
loading a subset of samples at a time until it completes. This process will
take quite some time (around two hours on a modern laptop).
Note that recent versions of the ARCHS4 data do provide meta/reads_aligned
and meta/total_reads
entries. We don't use those here, but we can compare
the values their with what we calculate above.
a4 <- Archs4Repository(archs4dir) estimate_repository_norm_factors(a4)
Because the installation of this package is a bit more involved than most,
we have also provided an archs4_local_data_dir_validate()
function, which
you can run over your $ARCHS4DIR
in order to check on "the health" of your
install.
This function will simply look at your $ARCHS4DIR
to ensure that the required
files are there, and tries to give you helpful error messages if not.
For instance, if the first two files enumerated in the
Data File Download section were missing from
your $ARCHS4DIR
(ie. human_matrix.h5
and human_hiseq_transcript_v2.h5
),
you would be warned that "something isn't right" when you first load the
archs4
package. You could then run the archs4_local_data_dir_validate()
to see what is wrong:
archs4_local_data_dir_validate(archs4dir) #> The following ARCHS4 files are missing, please download them: #> * human_matrix.h5: https://s3.amazonaws.com/mssm-seq-matrix/human_matrix.h5 #> * human_hiseq_transcript_v2.h5: #> https://s3.amazonaws.com/mssm-seq-matrix/human_hiseq_transcript_v2.h5
NOTE: If all installation and data download/processing steps have been
completed successfully, a call to archs4_local_data_dir_validate()
will simply
return TRUE
.
If you are developing this package, you will find that it will be convenient
to symlink the package's default archs4.datadir
path (~/.arcsh4data
) to the
$ARCHS4DIR
you just setup. This is because often times things like roxygen2
document compilation, unit testing, etc. happen in a vanilla R workspace, which
won't run the configuration that is prescribed in your ~/.Rprofile
file.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.