extract_snvs_from_RefSNP_json: Extract SNPs of type "snv" from a RefSNP JSON file
In hpages/SNPlocsForge: Tools for building SNPlocs data packages

View source: R/extract_snvs_from_RefSNP_json.R

extract_snvs_from_RefSNP_json

R Documentation

Extract SNPs of type "snv" from a RefSNP JSON file

Description

Extract SNPs of type "snv" from a RefSNP JSON file.

Usage

extract_snvs_from_RefSNP_json(con, dump_dir,
                              chunksize=10000, BPPARAM=NULL)

Arguments

`con`	File path or connection to a RefSNP JSON file (compressed files are supported).
`dump_dir`	Path to the directory where to dump the snvs.
`chunksize`	How many JSON lines to load at once in memory. Set to -1 to load the entire file in memory (strongly discouraged on a big JSON file!)
`BPPARAM`	`NULL` or a BiocParallelParam instance (from the BiocParallel package). This controls how the individual chunks are going to be processed once loaded in memory. Note that chunks are always loaded in memory sequentially. After being loaded in memory, the JSON lines in the current chunk are either processed sequentially (if `BPPARAM` is `NULL`) or in parallel (if `BPPARAM` is a BiocParallelParam instance).

Details

RefSNP JSON files are made available by dbSNP for each release (a.k.a. build). For example, the RefSNP JSON files for dbSNP build 155 are available at https://ftp.ncbi.nih.gov/snp/archive/b155/JSON/. These files are compressed and have one RefSNP id per line.

extract_snvs_from_RefSNP_json() will only consider RefSNP ids of variant type "snv". Furthermore, for each RefSNP id of variant type "snv", it will only consider its placements on sequences of type "refseq_chromosome". The function will extract these placements plus their alleles and write them to output files in dump_dir. One output file will get created per sequence id.

All the output files are tab-delimited files with one row per snv and the following columns:

RefSNP id
is preferred top level placement (PTLP)
alleles position (zero-based)
deleted sequence
inserted sequences

Value

The number of RefSNP ids processed (as an invisible integer). This is equal to the number of lines in the RefSNP JSON file.

Note

extract_snvs_from_RefSNP_json() is **very** slow! Depending on the chromosome and your machine, it will only process between 40 and 80 RefSNP ids per second. At this speed it would take about 12 days just to process refsnp-chr1.json.bz2 (83578784 RefSNP ids). Using 9 workers (e.g. by setting BPPARAM to MulticoreParam(9)) makes this only about 3 times faster (i.e. 4 days instead of 12), which is also disappointing.

Examples

json_file <- system.file("extdata", "refsnp-chrMT.json",
                         package="SNPlocsForge")
dump_dir <- file.path(tempdir(), "chrMT")
dir.create(dump_dir)

## Should take about 15 sec.:
extract_snvs_from_RefSNP_json(json_file, dump_dir, BPPARAM=MulticoreParam(6))

## Let's take a look at the output:
old_wd <- setwd(dump_dir)
dir()
cat(head(readLines("NC_012920.1.tab")), sep="\n")
read.delim("NC_012920.1.tab", header=FALSE, nrows=15)
setwd(old_wd)
unlink(dump_dir, recursive=TRUE)

hpages/SNPlocsForge documentation built on Nov. 9, 2023, 11:17 a.m.