In r-spark/sparkhail: A 'Sparklyr' Extension for 'Hail'

Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS). Hail is exposed as a Python library, using primitives for distributed queries and linear algebra implemented in Scala, Spark, and increasingly C++.

The sparkhail is a R extension using sparklyr package. The idea is to help R users to use Hail functionalities with the well know tidyverse sintax. In this README we are going to reproduce the GWAS tutorial using sparkhail, sparklyr, dplyr and ggplot2.

Installation

You can install the sparkhail package from CRAN as follows:

install.packages("sparkhail")

To upgrade to the latest version of sparkhail, run the following command and restart your R session:

install.packages("devtools")
devtools::install_github("r-spark/sparkhail")

You can install Hail manually or using hail_install().

sparkhail::hail_install()

Read a matrix table

The data in Hail is naturally represented as a Hail MatrixTable. The sparkhail converts the MatrixTable to dataframe, in this way is easier to manipulate the data using dplyr.

library(sparkhail)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.4", config = hail_config())

hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))

Convert to spark Data Frame as follows

df <- hail_dataframe(mt)

Getting to know our data

You can see the data structure using glimpse().

library(dplyr)
glimpse(df)

It’s important to have easy ways to slice, dice, query, and summarize a dataset. The conversion to dataframe is a good way to use dplyr verbs. Let's see some examples.

df %>% 
  dplyr::select(locus, alleles) %>% 
  head(5)

Here is how to peek at the first few sample IDs:

s <- hail_ids(mt)
s

The genotype calls are in entries column and we can see it using hail_entries() function. This function selects and explodes the data frame using sparklyr.nested.

hail_entries(df)

Adding column fields

A Hail MatrixTable can have any number of row fields and column fields for storing data associated with each row and column. Annotations are usually a critical part of any genetic study. Column fields are where you’ll store information about sample phenotypes, ancestry, sex, and covariates. Row fields can be used to store information like gene membership and functional impact for use in QC or analysis.

The file provided contains the sample ID, the population and “super-population” designations, the sample sex, and two simulated phenotypes (one binary, one discrete).

This file is a standard text file and can be imported using sparklyr.

annotations <- spark_read_csv(sc, "table", 
                              path = system.file("extdata/1kg_annotations.txt",
                                                 package = "sparkhail"),
                              overwrite = TRUE, 
                              delimiter = "\t")

A good way to peek at the structure of a Table is to look at its schema.

glimpse(annotations)

Now we’ll use this table to add sample annotations to our dataset. To merge these data we can use joins.

annotations_sample <- inner_join(s, annotations, by = c("s" = "Sample"))

Query functions

We will start by looking at some statistics of the information in our data. We can aggregate using group_by() and count the number of occurrences using tally().

annotations %>%
  group_by(SuperPopulation) %>%
  tally()

We can use sdf_describe() to see the summary statistics of the data.

sdf_describe(annotations)

However, these metrics aren’t perfectly representative of the samples in our dataset. Here’s why:

sdf_nrow(annotations)
sdf_nrow(annotations_sample)

Since there are fewer samples in our dataset than in the full thousand genomes cohort, we need to look at sample annotations on the dataset.

annotations_sample %>%
  group_by(SuperPopulation) %>%
  tally()

sdf_describe(annotations)

Let's see another example, now we are going to calculate the counts of each of the 12 possible unique SNPs (4 choices for the reference base * 3 choices for the alternate base). To do this, we need to get the alternate allele of each variant and then count the occurences of each unique ref/alt pair. The alleles column is nested, because of this, we need to separete this column using sdf_separate_column().

df %>% 
  sdf_separate_column("alleles") %>% 
  group_by(alleles_1, alleles_2) %>% 
  tally() %>% 
  arrange(-n)

It’s nice to see that we can actually uncover something biological from this small dataset: we see that these frequencies come in pairs. C/T and G/A are actually the same mutation, just viewed from from opposite strands. Likewise, T/A and A/T are the same mutation on opposite strands. There’s a 30x difference between the frequency of C/T and A/T SNPs. Why?

The last example, what about genotypes? Hail can query the collection of all genotypes in the dataset, and this is getting large even for our tiny dataset. Our 284 samples and 10,000 variants produce 10 million unique genotypes. Let's plot this using the ggplot2 and dbplot package.

library(dbplot)
library(ggplot2)
library(sparklyr.nested) # to access DP nested in info column

df %>% 
  sdf_select(DP = info.DP) %>% 
  dbplot_histogram(DP) +
  labs(title = "Histogram for DP", y = "Frequency")