knitr::opts_chunk$set(
  collapse = TRUE
)

Motivation & Introduction

The purpose of this package is to make it easy to query the Human Cell Atlas Data Portal via their data browser API. Visit the Human Cell Atlas for more information on the project.

Installation and getting started

Evaluate the following code chunk to install packages required for this vignette.

## install from Bioconductor if you haven't already
pkgs <- c("httr", "dplyr", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)

Load the packages into your R session.

library(httr)
library(dplyr)
library(LoomExperiment)
library(hca)

Example: Discover and download a 'loom' file

To illustrate use of this package, consider the task of downloading a 'loom' file summarizing single-cell gene expression observed in an HCA research project. This could be accomplished by visiting the HCA data portal (at https://data.humancellatlas.org/explore) in a web browser and selecting projects interactively, but it is valuable to accomplish the same goal in a reproducible, flexible, programmatic way. We will (1) discover projects available in the HCA Data Coordinating Center that have loom files; and (2) retrieve the file from the HCA and import the data into R as a 'LoomExperiment' object. For illustration, we focus on the 'Single cell transcriptome analysis of human pancreas reveals transcriptional signatures of aging and somatic mutation patterns' project.

Discover projects with loom files

Use projects() to retrieve the first 200 projects in the HCA's default catalog.

projects(size = 200)

Use filters() to restrict the projects to just those that contain at least one 'loom' file.

project_filter <- filters(fileFormat = list(is = "loom"))
project_tibble <- projects(project_filter)
project_tibble

Use standard R commands to further filter projects to the one we are interested in, with title starting with "Single...". Extract the unique projectId for the first project with this title.

project_tibble |>
    filter(startsWith(projectTitle, "Single")) |>
    head(1) |>
    t()

projectIds <-
    project_tibble |>
    filter(startsWith(projectTitle, "Single")) |>
    dplyr::pull(projectId)

projectId <- projectIds[1]

A project id can be used to discover the title or additional project information.

project_title(projectId)

project_information(projectId)

Discover and download the loom file of interest

files() retrieves (the first 1000) files from the Human Cell Atlas data portal. Construct a filter to restrict the files to loom files from the project we are interested in.

file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "loom")
)

# only the two smallest files
file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc")

file_tibble

files_download() will download one or more files (one for each row) in file_tibble. The download is more complicated than simply following the url column of file_tibble, so it is not possible to simply copy the url into a browser. We'll download the file and then immediately import it into R.

file_locations <- file_tibble |> files_download()

LoomExperiment::import(unname(file_locations[1]),
                       type ="SingleCellLoomExperiment")

Note that files_download() uses [BiocFileCache][https://bioconductor.org/packages/BiocFileCache], so individual files are only downloaded once.

Example: Illustrating access to h5ad files

This example walks through the process of file discovery and retrieval in a little more detail, using h5ad files created by the Python AnnData analysis software and available for some experiments in the default catalog.

Projects facets and terms

The first challenge is to understand what file formats are available from the HCA. Obtain a tibble describing the 'facets' of the data, the number of terms used in each facet, and the number of distinct values used to describe projects.

projects_facets()

Note the fileFormat facet, and repeat projects_facets() to discover detail about available file formats

projects_facets("fileFormat")

Note that there are 8 uses of the h5ad file format. Use this as a filter to discover relevant projects.

filters <- filters(fileFormat = list(is = "h5ad"))
projects(filters)

Projects columns

The default tibble produced by projects() contains only some of the information available; the information is much richer.

To obtain a tibble with an expanded set of columns, you can specify that using the as parameter set to "tibble_expanded".

# an expanded set of columns for all or the first 4 projects
projects(as = 'tibble_expanded', size = 4)

In the next sections, we'll cover other options for the as parameter, and the data formats they return.

projects() as an R list

Instead of retrieving the result of projects() as a tibble, retrieve it as a 'list-of-lists'

projects_list <- projects(size = 200, as = "list")

This is a complicated structure. We will use lengths(), names(), and standard R list selection operations to navigate this a bit. At the top level there are three elements.

lengths(projects_list)

hits represents each project as a list, e.g,.

lengths(projects_list$hits[[1]])

shows that there are 10 different ways in which the first project is described. Each component is itself a list-of-lists, e.g.,

lengths(projects_list$hits[[1]]$projects[[1]])
projects_list$hits[[1]]$projects[[1]]$projectTitle

One can use standard R commands to navigate this data structure, and to, e.g., extract the projectTitle of each project.

projects() as an lol

Use as = "lol" to create a more convenient way to select, filter and extract elements from the list-of-lists by projects().

lol <- projects(size = 200, as = "lol")
lol

Use lol_select() to restrict the lol to particular paths, and lol_filter() to filter results to paths that are leafs, or with specific numbers of entries.

lol_select(lol, "hits[*].projects[*]")
lol_select(lol, "hits[*].projects[*]") |>
    lol_filter(n == 44, is_leaf)

lol_pull() extracts a path from the lol as a vector; lol_lpull() extracts paths as lists.

titles <- lol_pull(lol, "hits[*].projects[*].projectTitle")
length(titles)
head(titles, 2)

Creating projects() tibbles with specific columns

The path or its abbreviation can be used to specify the columns of the tibble to be returned by the projects() query.

Here we retrieve additional details of donor count and total cells by adding appropriate path abbreviations to a named character vector. Names on the character vector can be used to rename the path more concisely, but the paths must uniquely identify elements in the list-of-lists.

columns <- c(
    projectId = "hits[*].entryId",
    projectTitle = "hits[*].projects[*].projectTitle",
    genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]",
    donorCount = "hits[*].donorOrganisms[*].donorCount",
    cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]",
    totalCells = "hits[*].cellSuspensions[*].totalCells"
)
projects <- projects(filters, columns = columns)
projects

Note that the cellSuspensions.organ and totalCells columns have more than one entry per project.

projects |>
   select(projectId, cellSuspensions.organ, totalCells)

In this case, the mapping between cellSuspensions.organ and totalCells is clear, but in general more refined navigation of the lol structure may be necessary.

projects |>
    select(projectId, cellSuspensions.organ, totalCells) |>
    filter(
        ## 2023-06-06 two projects have different 'organ' and
        ## 'totalCells' lengths, causing problems with `unnest()`
        lengths(cellSuspensions.organ) == lengths(totalCells)
    ) |>
    tidyr::unnest(c("cellSuspensions.organ", "totalCells"))

Select the following entry, augment the filter, and query available files

projects |>
    filter(startsWith(projectTitle, "Reconstruct")) |>
    glimpse()

This approach can be used to customize the tibbles returned by the other main functions in the package, files(), samples(), and bundles().

File download

The relevant file can be selected and downloaded using the technique in the first example.

filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) |>
    head(1)            # only first file, for demonstration
files |> t()
file_path <- files_download(files)

"h5ad" files can be read as SingleCellExperiment objects using the zellkonverter package.

## don't want large amount of data read from disk
sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
sce

Example: A multiple file download

project_filter <- filters(fileFormat = list(is = "csv"))
project_tibble <- projects(project_filter)

project_tibble |>
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    )

projectId <-
    project_tibble |>
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    ) |>
    pull(projectId)

file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "csv")
)

## first 4 files will be returned
file_tibble <- files(file_filter, size = 4)

file_tibble |>
    files_download()

Example: Exploring the pagination feature

The files(), bundles(), and samples() can all return many 1000's of results. It is necessary to 'page' through these to see all of them. We illustrate pagination with projects(), retrieving only 30 projects.

Pagination works for the default tibble output

page_1_tbl <- projects(size = 30)
page_1_tbl

page_2_tbl <- page_1_tbl |> hca_next()
page_2_tbl

## should be identical to page_1_tbl
page_2_tbl |> hca_prev()

Pagination also works for the lol objects

page_1_lol <- projects(size = 5, as = "lol")
page_1_lol |>
    lol_pull("hits[*].projects[*].projectTitle")

page_2_lol <-
    page_1_lol |>
    hca_next()
page_2_lol |>
    lol_pull("hits[*].projects[*].projectTitle")

## should be identical to page_1_lol
page_2_lol |>
    hca_prev() |>
    lol_pull("hits[*].projects[*].projectTitle")

Example: Obtaining other data entities

Much like projects() and files(), samples() and bundles() allow you to provide a filter object and additional criteria to retrieve data in the form of samples and bundles respectively

heart_filters <- filters(organ = list(is = "heart"))
heart_samples <- samples(filters = heart_filters, size = 4)
heart_samples

heart_bundles <- bundles(filters = heart_filters, size = 4)
heart_bundles

Example: Obtaining summaries of project catalogs

HCA experiments are organized into catalogs, each of which can be summarized with the hca::summary() function

heart_filters <- filters(organ = list(is = "heart"))
hca::summary(filters = heart_filters, type = "fileTypeSummaries")
first_catalog <- catalogs()[1]
hca::summary(type = "overview", catalog = first_catalog)

Example: Obtaining details on individual projects, files, samples, and bundles

Each project, file, sample, and bundles has its own unique ID by which, in conjunction with its catalog, can be to uniquely identify them.

heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)
heart_projects

projectId <-
    heart_projects |>
    filter(
        startsWith(
            projectTitle,
            "Cells of the adult human"
        )
    ) |>
    dplyr::pull(projectId)

result <- projects_detail(uuid = projectId)

The result is a list containing three elements representing information for navigating next or previous (alphabetical, by default) (pagination) project, the filters (termFacets) available, and details of the project (hits).

names(result)

As mentioned above, the hits are a complicated list-of-lists structure. A very convenient way to explore this structure visually is with listview::jsonedit(result). Selecting individual elements is possible using the lol interface; an alternative is cellxgenedp::jmespath().

lol(result)

Exploring manifest files

See the accompanying "Human Cell Atlas Manifests" vignette on details pertaining to the use of the manifest endpoint and further annotation of .loom files.

Session info

sessionInfo()


Bioconductor/hca documentation built on March 27, 2024, 3:15 a.m.