In RajLabMSSM/echotabix: echoverse module: automated creation and querying of tabix files

library(echotabix)

Convert

#### Example with full data ####
tmp <- echodata::example_fullSS()  

tabix_files <- echotabix::convert(target_path = tmp,
                                  start_col = "BP")

Query

query automatically chooses the correct methods to perform a query by inferring:

Whether the file is local (e.g. on your computer) or remote (e.g. on a server accessed with a URL).
Whether the file is in VCF or table format.
Whether your query's genome build (query_genome) matches the full data's genome build (target_genome), and if not performs liftover before querying.

query_granges is used to specify which coordinates you want to query from the target file. This argument is rather flexible and can take any of the following:

A GRanges object with one or more rows, across one or more chromosomes.
A data.table containing the columns "CHR","POS", and "SNP", which is automatically converted to a GRanges object.
The direct output of the echotabix::construct_query() function, for additional flexibility and non-standard column names.

Local file

query_dat <- echodata::BST1

query_res <- echotabix::query(target_path = tabix_files$path,
                              query_granges = query_dat)

Customised queries

You can also gain more customised control over the query with the helper functionconstruct_query() (which is called internally by default when query_granges is not a GRanges object.

Using min/max ranges

query_granges1 <- echotabix::construct_query(query_chrom = "chr4", 
                                             query_start_pos = 5000,
                                             query_end_pos = 1e8L)
query_res2 <- echotabix::query(target_path = tabix_files$path,
                               query_granges = query_granges1)

Using a query_dat object

query_dat2 <- echodata::LRRK2[1:100,]
## Rename columns for demo purposes
data.table::setnames(query_dat2, c("CHR","SNP"),c("seqnames","rsID"))

query_granges2 <- echotabix::construct_query(query_dat = query_dat2, 
                                             query_chrom_col = "seqnames", 
                                             query_snp_col = "rsID")
query_res2 <- echotabix::query(target_path = tabix_files$path,
                               query_granges = query_granges2)

Standardise column names

Instead of specifying each column name, you can also select standardise_colnames=TRUE to automatically rename all columns to a standard nomenclature using MungeSumstats::format_header.

query_granges3 <- echotabix::construct_query(query_dat = query_dat2,
                                             standardise_colnames = TRUE)
query_res2 <- echotabix::query(target_path = tabix_files$path,
                               query_granges = query_granges2)

Remote file

Next, we'll query whole-genome sequencing data from the 1000 Genomes Project.

Specifically, we'll extract a given genomic range for a subset of samples (individuals).

target_path <- file.path(
    "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
    "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
)
## Fewer samples will speed up queries substantially (15-30s vs. >1.2min). 
samples <- c("HG00097","HG00099","HG00100","HG00101","HG00102")
dat2 <- echodata::BST1[1:50,]

#### Query ####
vcf <- echotabix::query(target_path = target_path,
                        query_granges = dat2, 
                        samples = samples, 
                        query_save = TRUE,
                        as_datatable = FALSE, 
                        query_force_new = TRUE)
print(vcf)

Convert and query

Merges the convert and query functions into a single pipeline.

If the target_path file is not already a tabix file, it will first be converted to a sorted, bgzip-compressed, tabix-indexed file, and then proceed to the query step.
If the target_path file is already a tabix file, the function will detect this automatically and skip right to the query step.

query_dat <- echodata::BST1
target_path <- echodata::example_fullSS() 

query_res <- echotabix::convert_and_query( 
    target_path = target_path,
    target_start_col = "BP", 
    query_granges = query_dat,
    query_force_new = TRUE) 

knitr::kable(head(query_res))