seqVCF2GDS: Reformat VCF Files
In zhengxwen/SeqArray: Data management of large-scale whole-genome sequence variant calls using GDS files

seqVCF2GDS

R Documentation

Description

Converts Variant Call Format (VCF) files, or binary VCF (BCF) files, to SeqArray GDS files.

Usage

seqVCF2GDS(vcf.fn, out.fn, header=NULL, storage.option="LZMA_RA",
    info.import=NULL, fmt.import=NULL, genotype.var.name="GT",
    ignore.chr.prefix="chr", scenario=c("general", "imputation"),
    reference=NULL, start=1L, count=-1L, variant_count=NA_integer_,
    optimize=TRUE, raise.error=TRUE, digest=TRUE, use_Rsamtools=NA,
    parallel=FALSE, verbose=TRUE)
seqBCF2GDS(bcf.fn, out.fn, header=NULL, storage.option="LZMA_RA",
    info.import=NULL, fmt.import=NULL, genotype.var.name="GT",
    ignore.chr.prefix="chr", scenario=c("general", "imputation"),
    reference=NULL, optimize=TRUE, raise.error=TRUE, digest=TRUE,
    bcftools="bcftools", verbose=TRUE)

Arguments

`vcf.fn`	the file name(s) of VCF format; or a `connection` object
`bcf.fn`	a file name of binary VCF format (BCF)
`out.fn`	the file name of output GDS file
`header`	if NULL, `header` is set to be `seqVCF_Header(vcf.fn)`
`storage.option`	the storage and compression option: "LZMA_RA" (the default) uses the LZMA compression algorithm, which gives a higher compression ratio; "ZIP_RA" (equivalent to `seqStorageOption("ZIP_RA")`) uses zlib; "LZ4_RA" uses an extremely fast compression and decompression algorithm. "ZIP_RA.max", "LZMA_RA.max" and "LZ4_RA.max" select the same algorithms at their maximum compression level. The suffix "_RA" indicates that fine-level random access is available; see `seqStorageOption` for more details
`info.import`	characters, the variable name(s) in the INFO field for import; or `NULL` for all variables
`fmt.import`	characters, the variable name(s) in the FORMAT field for import; or `NULL` for all variables
`genotype.var.name`	the ID for genotypic data in the FORMAT column; "GT" by default (in VCF v4)
`ignore.chr.prefix`	a character vector giving the chromosome-name prefix(es) to be ignored, e.g., `"chr"`; matching is not case-sensitive
`scenario`	"general": use float32 to store floating-point numbers (by default); "imputation": use packedreal16 to store DS and GP in the FORMAT field with four decimal place accuracy
`reference`	genome reference, like "hg19", "GRCh37"; if the genome reference is not available in VCF files, users could specify the reference here
`start`	the starting variant if importing part of VCF files
`count`	the maximum number of variants to import if importing part of VCF files; -1 means importing to the end
`variant_count`	`NA_integer_` (default) or a numeric vector specifying the number of variants in each VCF file in `vcf.fn`; only applicable when multiple cores are used; if the number of variants is known, the conversion can skip counting the variants before splitting the file(s); `variant_count` may be approximate
`optimize`	if `TRUE`, optimize the access efficiency by calling `cleanup.gds`
`raise.error`	`TRUE`: throw an error if numeric conversion fails; `FALSE`: return a missing value if numeric conversion fails
`digest`	a logical value (TRUE/FALSE) or a character string ("md5", "sha1", "sha256", "sha384" or "sha512"); if `TRUE`, MD5 hash codes are added to the GDS file; if a digest algorithm is named, hash codes are computed with that algorithm
`use_Rsamtools`	only applicable when multiple cores are used; passed to `seqVCF_Header` to get the total number of variants; `NA`: use Rsamtools if it is installed; `FALSE`: do not use the Rsamtools package; `TRUE`: use Rsamtools, and fail if it is not installed
`parallel`	`FALSE` (serial processing), `TRUE` (parallel processing), a numeric value indicating the number of cores, or a cluster object for parallel processing; `parallel` is passed to the argument `cl` in `seqParallel`, see `seqParallel` for more details
`verbose`	if `TRUE`, show information
`bcftools`	the path of the program `bcftools`
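
The `start` and `count` arguments can be illustrated with a partial import. This is a minimal sketch, assuming the package's example VCF file holds more than 110 variants; the output path is a hypothetical temporary file, not something the package requires.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
part.fn <- tempfile(fileext=".gds")   # hypothetical output location

# import at most 100 variants, beginning with the 11th variant in the file
seqVCF2GDS(vcf.fn, part.fn, storage.option="ZIP_RA",
    start=11L, count=100L, verbose=FALSE)

# count the variants that were actually imported
f <- seqOpen(part.fn)
n.variant <- length(seqGetData(f, "variant.id"))
seqClose(f)
n.variant

unlink(part.fn, force=TRUE)
```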

Details

If vcf.fn contains more than one file, seqVCF2GDS merges all of the VCF files into a single GDS file, provided they contain the same samples. This is useful when variant data are split by chromosome.
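
A minimal sketch of a multi-file call follows. To stay runnable without extra data it passes the package's example VCF file twice, which is an assumption made purely for illustration; in practice `vcf.fn` would list distinct per-chromosome files that share the same samples.

```r
library(SeqArray)

one.fn <- seqExampleFileName("vcf")
# stand-in for per-chromosome files: the same example file, listed twice
vcf.files <- c(one.fn, one.fn)
merged.fn <- tempfile(fileext=".gds")   # hypothetical output location

seqVCF2GDS(vcf.files, merged.fn, storage.option="ZIP_RA", verbose=FALSE)

# the merged file holds the variants of all input files
f <- seqOpen(merged.fn)
n.merged <- length(seqGetData(f, "variant.id"))
seqClose(f)
n.merged

unlink(merged.fn, force=TRUE)
```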

The real numbers in the VCF file(s) are stored in 32-bit floating-point format by default. Users can set storage.option=seqStorageOption(float.mode="float64") to switch to 64-bit floating-point format. Alternatively, packed real numbers can be used by setting storage.option=seqStorageOption(float.mode="packedreal16:scale=0.0001").
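
The floating-point setting can be combined with an explicit compression method, as sketched below. Naming the compression in `seqStorageOption()` is a precaution rather than a documented requirement: the default compression of `seqStorageOption()` itself may differ from the "LZMA_RA" default of `seqVCF2GDS()`. The output path is a hypothetical temporary file.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
f64.fn <- tempfile(fileext=".gds")   # hypothetical output location

# LZMA compression with random access, real numbers stored as 64-bit floats
opt <- seqStorageOption("LZMA_RA", float.mode="float64")
seqVCF2GDS(vcf.fn, f64.fn, storage.option=opt, verbose=FALSE)

created <- file.exists(f64.fn)
unlink(f64.fn, force=TRUE)
```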

By default, the compression method is "LZMA_RA" (https://tukaani.org/xz/, LZMA algorithm with default compression level + independent data blocks for fine-level random access). Users can maximize the compression ratio by storage.option="LZMA_RA.max" or storage.option=seqStorageOption("LZMA_RA.max"). LZMA is known to have higher compression ratio than the zlib algorithm. LZ4 (https://github.com/lz4/lz4) is an option via storage.option="LZ4_RA" or storage.option=seqStorageOption("LZ4_RA").
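
One way to choose among these options is to convert the same input with each method and compare the resulting file sizes. The sketch below does this for the package's example file; the relative sizes depend on the data, so no particular ordering is claimed here.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
methods <- c("ZIP_RA", "LZMA_RA", "LZMA_RA.max", "LZ4_RA")

# convert once per compression method and record the output size in bytes
sizes <- sapply(methods, function(m) {
    fn <- tempfile(fileext=".gds")
    on.exit(unlink(fn, force=TRUE))   # remove the temporary GDS file afterwards
    seqVCF2GDS(vcf.fn, fn, storage.option=m, verbose=FALSE)
    file.size(fn)
})
sizes
```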

If multiple cores/processes are specified in parallel, all VCF files are scanned to calculate the total number of variants before format conversion, and then split by the number of cores/processes.
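
When the number of variants is already known, it can be supplied through `variant_count` so that the scan is skipped. The sketch below obtains the count from `seqVCF_Header()`; it assumes that `getnum=TRUE` fills the `num.variant` element of the returned header. In real use the count would come from an earlier run or an external index, since computing it here costs the same scan.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
par.fn <- tempfile(fileext=".gds")   # hypothetical output location

# count the variants once (this reads through the whole file)
hdr <- seqVCF_Header(vcf.fn, getnum=TRUE)
n <- hdr$num.variant

# convert with two processes, reusing the known count
seqVCF2GDS(vcf.fn, par.fn, storage.option="ZIP_RA",
    parallel=2L, variant_count=n, verbose=FALSE)

created <- file.exists(par.fn)
unlink(par.fn, force=TRUE)
```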

storage.option="Ultra" and storage.option="UltraMax" need much larger memory than other compression methods. Users may consider using seqRecompress to recompress the GDS file after calling seqVCF2GDS() with storage.option="ZIP_RA", since seqRecompress() compresses data nodes one by one, taking much less memory than "Ultra" and "UltraMax".

If storage.option="LZMA_RA" runs out of memory (e.g., there are too many annotation fields in the VCF file), users could use storage.option="ZIP_RA" and then call seqRecompress(, compress="LZMA").
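
The two-step approach described above can be sketched as follows: convert with "ZIP_RA" first, then recompress the finished file in place. The output path is a hypothetical temporary file, and the size comparison is only for inspection.

```r
library(SeqArray)

vcf.fn <- seqExampleFileName("vcf")
gds.fn <- tempfile(fileext=".gds")   # hypothetical output location

# step 1: convert with zlib compression, which needs less memory
seqVCF2GDS(vcf.fn, gds.fn, storage.option="ZIP_RA", verbose=FALSE)
size.zip <- file.size(gds.fn)

# step 2: recompress the data nodes one by one with LZMA
seqRecompress(gds.fn, compress="LZMA", verbose=FALSE)
size.lzma <- file.size(gds.fn)

c(ZIP_RA=size.zip, LZMA=size.lzma)

unlink(gds.fn, force=TRUE)
```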

Value

Returns the file name of the output GDS file as an absolute path.

Author(s)

Xiuwen Zheng

References

Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27, 2156-2158.

Examples

# load the package
library(SeqArray)

# the VCF file
vcf.fn <- seqExampleFileName("vcf")

# conversion
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA")

# conversion in parallel
seqVCF2GDS(vcf.fn, "tmp_p2.gds", storage.option="ZIP_RA", parallel=2L)


# display
(f <- seqOpen("tmp.gds"))
seqClose(f)



# convert without the INFO fields
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA",
    info.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)



# convert without the INFO and FORMAT fields
seqVCF2GDS(vcf.fn, "tmp.gds", storage.option="ZIP_RA",
    info.import=character(0), fmt.import=character(0))

# display
(f <- seqOpen("tmp.gds"))
seqClose(f)


# delete the temporary files
unlink(c("tmp.gds", "tmp_p2.gds"), force=TRUE)

zhengxwen/SeqArray documentation built on April 14, 2025, 2:19 a.m.