HCABrowser: Browse the Human Cell Atlas data portal

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

What is the Human Cell Atlas?

From the Human Cell Atlas (HCA) website:

The cell is the core unit of the human body -- the key to understanding the biology of health and the ways in which molecular dysfunction leads to disease. Yet our characterization of the hundreds of types and subtypes of cells in the human body is limited, based partly on techniques that have limited resolution and classifications that do not always map neatly to each other. Genomics has offered a systematic approach, but it has largely been applied in bulk to many cell types at once -- masking critical differences between cells -- and in isolation from other valuable sources of data.

Recent advances in single-cell genomic analysis of cells and tissues have put systematic, high-resolution and comprehensive reference maps of all human cells within reach. In other words, we can now realistically envision a human cell atlas to serve as a basis for both understanding human health and diagnosing, monitoring, and treating disease.

At its core, a cell atlas would be a collection of cellular reference maps, characterizing each of the thousands of cell types in the human body and where they are found. It would be an extremely valuable resource to empower the global research community to systematically study the biological changes associated with different diseases, understand where genes associated with disease are active in our bodies, analyze the molecular mechanisms that govern the production and activity of different cell types, and sort out how different cell types combine and work together to form tissues.

The Human Cell Atlas facilitates queries on its [data coordination platform with a RESTful API](https://dss.data.humancellatlas.org/).

Installation

To install this package, use Bioconductor's BiocManager package.

# Install BiocManager only if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install('HCABrowser')

library(HCABrowser)

Connecting to the Human Cell Atlas

The HCABrowser package requires network connectivity, and the HCA's Data Coordination Platform (DCP) must be operational. This package is meant for users who have a working knowledge of the HCA DCP's schema and want access to the full richness of the HCA's content. For a simpler way to access and download data from the Human Cell Atlas, please use the HCAExplorer package.

The HCABrowser object serves as the representation of the Human Cell Atlas. Upon creation, it automatically performs a cursory query and displays a small table showing the first few bundles of the entire HCA. This initial table contains the columns that we have determined are most useful to users. The output also displays the URL of the HCA DCP instance being used, the current query, whether bundles or files are being displayed, and the number of bundles in the results.

By default, ten bundles per page are displayed for "raw" output and 100 for "summary" output. The host argument sets which HCA DCP instance the client accesses, while the api_url argument sets which schema the client uses. These three values can be changed in the constructor. The default values for host and api_url are given below.

If the HCA cannot be reached, an error will be thrown displaying the status of the request.

hca <-
    HCABrowser(
        api_url='https://dss.data.humancellatlas.org/v1/swagger.json',
        host = 'dss.data.humancellatlas.org/v1'
    )
hca
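Because the constructor throws an error when the HCA cannot be reached, a script may want to guard the call rather than stop. The following is a minimal sketch in base R, not part of the package: try_connect() is a hypothetical helper, and the failing function and its error message are stand-ins so the example runs without network access.

```r
# Hypothetical helper: run a zero-argument connection function and return
# NULL (with a message) instead of stopping when it signals an error.
try_connect <- function(connect) {
    tryCatch(
        connect(),
        error = function(e) {
            message("Could not reach the HCA: ", conditionMessage(e))
            NULL
        }
    )
}

# Stand-in for a constructor call that fails (made-up error message).
failing <- function() stop("request failed with status 503")

res <- try_connect(failing)
is.null(res)
```

In a real session the helper could be called as try_connect(function() HCABrowser()), with the rest of the script skipped when the result is NULL.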

Upon displaying the object, multiple fields can be seen:

- The class: HCABrowser
- The hca dcp host address that is currently being used.
- The current query (assigned using filter())
- The current selection (assigned using select())

You may notice that some columns are already selected. These columns are automatically selected to allow the user some initial view of the hca.

Querying the HCABrowser

The HCABrowser package extends the functionality of the dplyr package's filter() and select() methods.

The filter() method allows the user to query the HCA by relating fields to certain values. Character fields can be queried using the operators:

- ==
- !=
- %in%
- %startsWith%
- %endsWith%
- %contains%

Numeric fields can be queried with the operators:

- ==
- !=
- %in%
- >
- <
- >=
- <=

Queries can be enclosed in parentheses: ()

Queries can be negated by placing the ! symbol in front.

Combination operators can be used to combine queries:

- &
- |
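To get a feel for how the character operators, negation, and combination operators compose, the sketch below applies them to an ordinary character vector. The three infix functions are illustrative stand-ins written with base R under assumed semantics (prefix, suffix, and fixed-substring matching); they are not the package's implementation, and the organ values are made up.

```r
# Illustrative stand-ins built on base R; NOT the HCABrowser implementation.
`%startsWith%` <- function(x, prefix) startsWith(x, prefix)
`%endsWith%`   <- function(x, suffix) endsWith(x, suffix)
`%contains%`   <- function(x, pattern) grepl(pattern, x, fixed = TRUE)

# Made-up values standing in for a character field.
organ <- c("brain", "Brain", "heart", "bone marrow")

starts_b  <- organ %startsWith% "b"
has_rain  <- organ %contains% "rain"

# Negation, parentheses, and the combination operators & and |.
not_brain <- !(organ %in% c("brain", "Brain")) &
    (organ %endsWith% "t" | organ == "bone marrow")

organ[not_brain]
```

Inside filter() the same expressions are written against field names rather than local vectors.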

Both "brain" and "Brain" are available values for the organ field. Since these values are the result of input by other users, there may be errors or inconsistencies. To be safe, both values can be matched with any of the following queries:

# Three equivalent ways to match either spelling of the organ value.
hca2 <-
    hca %>%
    filter('files.specimen_from_organism_json.organ.text' == c('Brain', 'brain'))
hca2 <-
    hca %>%
    filter(
        'files.specimen_from_organism_json.organ.text' %in% c('Brain', 'brain')
    )
hca2 <-
    hca %>%
    filter(
        'files.specimen_from_organism_json.organ.text' == 'Brain' |
        'files.specimen_from_organism_json.organ.text' == 'brain'
    )
hca2
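When a field has more spelling variants than two, it can help to collect them programmatically before building the %in% query. The sketch below is a base R illustration on a made-up vector of values; how the values are obtained from the HCA is outside its scope.

```r
# Made-up vector standing in for the values reported for an organ field.
organ_values <- c("brain", "Brain", "heart", "BRAIN", "blood", "brain")

# Keep every distinct spelling that equals "brain" once case is ignored.
brain_variants <- unique(organ_values[tolower(organ_values) == "brain"])
brain_variants
```

The resulting vector can then take the place of c('Brain', 'brain') on the right-hand side of %in%.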

If we also wish to search for results based on the NCBI Taxon ID for mouse, 10090, as well as brain, we can perform this query in a variety of ways.

# 1. Chain two filter() calls.
hca2 <-
    hca %>%
    filter(
        'files.specimen_from_organism_json.organ.text' %in% c('Brain', 'brain')
    ) %>%
    filter('specimen_from_organism_json.biomaterial_core.ncbi_taxon_id' == 10090)
# 2. Pass both conditions to a single filter() call.
hca2 <-
    hca %>%
    filter(
        'files.specimen_from_organism_json.organ.text' %in% c('Brain', 'brain'),
        'specimen_from_organism_json.biomaterial_core.ncbi_taxon_id' == 10090
    )
# 3. Combine the conditions explicitly with &.
hca <-
    hca %>%
    filter(
        'files.specimen_from_organism_json.organ.text' %in% c('Brain', 'brain') &
        'specimen_from_organism_json.biomaterial_core.ncbi_taxon_id' == 10090
    )
hca

The HCABrowser package is able to handle arbitrarily complex queries on the Human Cell Atlas.

hca2 <-
    hca %>%
    filter((!organ.text %in% c('Brain', 'blood')) & (
        files.specimen_from_organism_json.genus_species.text == "Homo sapiens" |
        library_preparation_protocol_json.library_construction_approach.text == 'Smart-seq2'
    ))
hca2

The HCABrowser object can undo the most recent queries run on it.

# Apply two filters in sequence ...
hca <-
    hca %>%
    filter('files.specimen_from_organism_json.organ.text' == 'heart')
hca <-
    hca %>%
    filter('files.specimen_from_organism_json.organ.text' != 'brain')
# ... then undo both of them.
hca <- undoEsQuery(hca, n = 2)
hca

To start from a fresh query while retaining the other modifications made to the HCABrowser object, use the resetEsQuery() method.

hca <- resetEsQuery(hca)
hca

Using fields(), we can find that the fields paired_end and organ.ontology are available. These fields can be shown in our resulting HCABrowser object using the select() method.

hca2 <-
    hca %>%
    select('paired_end', 'organ.ontology')
hca2 <-
    hca %>%
    select(c('paired_end', 'organ.ontology'))
hca2

Use examples

The remainder of this vignette will review applicable uses for this package.

Download a Bundle

The following retrieves the first ten results satisfying the filter criterion.

hca <-
    HCABrowser() %>%
    filter('files.specimen_from_organism_json.organ.text' == "brain")
result <- searchBundles(hca, replica='aws')
result

Each result is represented as a 'bundle' of primary data and (with output_format = 'summary') associated metadata. Bundles have a unique identifier, bundle_fqid, consisting of an identifier and a version (date) stamp.

result %>%
    as_tibble() %>%
    select("bundle_fqid")

Use the searchBundles() parameter per_page= to influence the number of bundles returned. The maximum per_page= value when output_format = "summary" is 500; when output_format = "raw" the maximum is 10. Use nextResults() to obtain the next per_page set of results.

nextResults(result) %>%
    as_tibble() %>%
    select("bundle_fqid")

The uuid and version components of the fqid are easily obtained, e.g.,

re <- "^([^\\.]+)\\.(.*)$" # uuid / version as before / after the first '.'
tbl <-
    result %>%
    as_tibble() %>%
    mutate(
        uuid = sub(re, "\\1", bundle_fqid),
        version = sub(re, "\\2", bundle_fqid)
    ) %>%
    select(uuid, version)
tbl
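The pattern above splits at the first '.' only, which matters if the version stamp itself contains a '.'. The sketch below checks this in base R on a single made-up identifier; the identifier is a placeholder in the uuid.version layout, not a real bundle, and the assumption that versions may contain a '.' is mine.

```r
# Made-up identifier in the uuid.version layout (not a real bundle).
fqid <- "00000000-0000-0000-0000-000000000000.2019-01-01T000000.000000Z"

re <- "^([^\\.]+)\\.(.*)$"   # same pattern as above
uuid <- sub(re, "\\1", fqid)
version <- sub(re, "\\2", fqid)

# Pasting the two parts back together with '.' recovers the original string.
identical(paste(uuid, version, sep = "."), fqid)
```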

Use checkoutBundle() to 'check out' a bundle from the HCA to the Amazon or Google cloud.

## provide 'version' for precise control
checkout_job_id <- checkoutBundle(hca, tbl$uuid[1])

The checkout_job_id can be used to query the status of the bundle.

status <- getBundleCheckout(hca, checkout_job_id)
status %>% str()

The status will change from 'RUNNING' to 'SUCCEEDED', showing the location of the bundle for download via AWS or GCP command line tools.

> getBundleCheckout(hca, checkout_job_id=jid) %>% str()
List of 1
 $ status: chr "RUNNING"
> ## time passes
> getBundleCheckout(hca, checkout_job_id=jid) %>% str()
List of 2
 $ location: chr "s3://org-hca-dss-checkout-prod/bundles/fffcea5e-...
 $ status  : chr "SUCCEEDED"
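Rather than re-running the status query by hand, the wait can be wrapped in a small polling loop. This is a hedged sketch in base R: wait_for_checkout() and make_stub() are hypothetical helpers, the stub only imitates the 'RUNNING' then 'SUCCEEDED' sequence shown above, and the location it returns is a placeholder.

```r
# Hypothetical helper: call `get_status()` until the status is no longer
# "RUNNING", sleeping `interval` seconds between attempts.
wait_for_checkout <- function(get_status, interval = 5, max_tries = 60) {
    for (i in seq_len(max_tries)) {
        status <- get_status()
        if (!identical(status$status, "RUNNING"))
            return(status)
        Sys.sleep(interval)
    }
    stop("checkout still RUNNING after ", max_tries, " attempts")
}

# Stub standing in for the real status query: reports RUNNING twice, then
# SUCCEEDED with a placeholder location.
make_stub <- function() {
    calls <- 0
    function() {
        calls <<- calls + 1
        if (calls < 3) {
            list(status = "RUNNING")
        } else {
            list(location = "s3://example-bucket/placeholder", status = "SUCCEEDED")
        }
    }
}

final <- wait_for_checkout(make_stub(), interval = 0)
final$status
```

With a live connection, get_status could be function() getBundleCheckout(hca, checkout_job_id); any status other than 'RUNNING' ends the loop, so the caller should still check for 'SUCCEEDED' before using the location.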

Download a File

hca <-
    HCABrowser() %>%
    filter('files.specimen_from_organism_json.organ.text' == "brain")
results <-
    searchBundles(hca, output_format='raw') %>%
    results()
result <- results[[1]]

result is a list-of-lists structure. Files associated with a bundle are at the following path:

tbl <-
    result$metadata$manifest$files %>%
    bind_rows() %>%
    mutate(`content-type` = noquote(`content-type`))
tbl

The following filters the manifest to the name and uuid of a fastq file.

fastq <- 
    tbl %>%
    filter(endsWith(name, "fastq.gz")) %>%
    select(name, uuid) %>%
    head(1)
fastq
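The flatten-then-filter pattern used above can be tried offline. The sketch below uses base R only and a made-up two-entry manifest; the entry names and uuids are placeholders, and real manifests carry more fields than shown here.

```r
# Made-up manifest entries in a list-of-lists layout (placeholder values).
files <- list(
    list(name = "sample_R1.fastq.gz", uuid = "uuid-1"),
    list(name = "project.json",       uuid = "uuid-2")
)

# Turn each entry into a one-row data frame, then stack the rows.
manifest <- do.call(
    rbind,
    lapply(files, as.data.frame, stringsAsFactors = FALSE)
)

# Keep the name and uuid of entries whose name ends in "fastq.gz".
fastq_rows <- manifest[endsWith(manifest$name, "fastq.gz"), c("name", "uuid")]
fastq_rows
```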

Download the file with

destination <- file.path(tempdir(), fastq$name)
getFile(HCABrowser(), fastq$uuid, destination = destination)

Developer notes

The S3 object-oriented programming paradigm is used.
Methods from the dplyr package can be used to manipulate objects in the HCABrowser package.
In the future, we wish to expand the functionality of this package to cover the remaining functionality of the hca dcp api.