vcftobd: Convert a VCF file to Format 5 binary dosage files

View source: R/format5.R

vcftobd R Documentation

Description

Reads the DS (dosage) and GP (genotype probabilities) FORMAT fields from a bgzipped, tabix-indexed VCF file, such as those produced by imputation servers like the Michigan Imputation Server, and writes a pair of Format 5 BinaryDosage files: a .bdose data file and a companion .bdi metadata file.

Usage

vcftobd(
  vcffile,
  bdose_file,
  region = NULL,
  snpidformat = 0L,
  bdoptions = character(0)
)

Arguments

vcffile

Path to the bgzipped, tabix-indexed VCF file.

bdose_file

Path for the output .bdose file. The companion .bdi metadata file is written to paste0(bdose_file, ".bdi").

region

Optional genomic region string in bcftools format (e.g. "chr21" or "chr21:1-5000000"). Requires a tabix index. Default NULL processes the entire file.

snpidformat

Integer controlling how SNP IDs are stored.

-1

Generate IDs as chr:pos:ref:alt; equivalent to 2 for Format 5.

0

Use the IDs as they appear in the VCF file (default). If all IDs match format 1 or format 2, that format is detected automatically.

1

Store IDs as chr:pos. An error is raised if the VCF already uses chr:pos:ref:alt format, as information would be lost.

2

Store IDs as chr:pos:ref:alt.

3

Store IDs as chr:pos_ref_alt.
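The ID layouts listed for snpidformat can be illustrated with a small sketch. The helper make_snpid below is hypothetical and not part of BinaryDosage; it only mirrors the string patterns documented for values 1, 2 (or -1), and 3.

```r
# Hypothetical helper, not part of BinaryDosage: builds an ID string
# following the layouts documented for snpidformat 1, 2 (or -1), and 3.
make_snpid <- function(chr, pos, ref, alt, snpidformat) {
  switch(as.character(snpidformat),
    "1"  = paste(chr, pos, sep = ":"),                 # chr:pos
    "-1" = ,                                           # falls through to "2"
    "2"  = paste(chr, pos, ref, alt, sep = ":"),       # chr:pos:ref:alt
    "3"  = paste0(chr, ":", pos, "_", ref, "_", alt),  # chr:pos_ref_alt
    stop("unsupported snpidformat: ", snpidformat)
  )
}

make_snpid("chr21", 9411245L, "C", "A", 1L)
make_snpid("chr21", 9411245L, "C", "A", 2L)
make_snpid("chr21", 9411245L, "C", "A", 3L)
```

Format 1 drops the alleles, which is why converting IDs that are already chr:pos:ref:alt to format 1 would lose information.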

bdoptions

Character vector specifying which per-SNP statistics to store: any combination of "aaf" (alternate allele frequency), "maf" (minor allele frequency), and "rsq" (imputation r-squared). For each requested statistic, values are read from the corresponding VCF INFO field (AF, MAF, or R2, respectively) if that field is present for the first SNP; otherwise they are calculated from the dosage data. The default, character(0), stores no statistics.
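As a rough illustration of what "calculated from the dosage data" can mean for the two frequency statistics, the sketch below uses the usual definitions for diploid dosages. The helpers calc_aaf and calc_maf are hypothetical, the sample values are invented, and the package's own calculation (including how it derives rsq) is not shown here and may differ.

```r
# Hypothetical helpers, assuming diploid dosages in [0, 2] that count
# copies of the alternate allele; NA marks a missing sample.
calc_aaf <- function(ds) mean(ds, na.rm = TRUE) / 2

calc_maf <- function(ds) {
  aaf <- calc_aaf(ds)
  min(aaf, 1 - aaf)  # the minor allele is whichever allele is rarer
}

ds <- c(0, 1, 2, 2, NA, 1.5)  # invented dosages for one SNP
calc_aaf(ds)  # 0.65
calc_maf(ds)  # 0.35
```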

Details

The .bdose file begins with a 4-byte magic number followed by one gzip-compressed block per SNP. Each block contains the DS values for all samples followed by the GP values, encoded as unsigned 16-bit integers (round(value * 10000); 0xffff = missing).
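The 16-bit encoding described above can be sketched in a few lines. The helpers encode_u16 and decode_u16 are hypothetical and only mirror the stated rule (round(value * 10000), with 0xffff for missing); byte order and the compression step are not covered.

```r
# Hypothetical helpers illustrating the documented value encoding.
MISSING_CODE <- 65535L  # 0xffff

encode_u16 <- function(x) {
  code <- as.integer(round(x * 10000))
  code[is.na(x)] <- MISSING_CODE  # missing values get the sentinel code
  code
}

decode_u16 <- function(code) {
  x <- code / 10000
  x[code == MISSING_CODE] <- NA_real_  # sentinel maps back to NA
  x
}

ds <- c(0, 0.9876, 2, NA)  # invented dosages
codes <- encode_u16(ds)    # 0, 9876, 20000, 65535
decode_u16(codes)
```

With this scaling the largest dosage, 2, maps to 20000, so every valid value stays well below the 0xffff sentinel, and stored values carry four decimal places of precision.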

The .bdi file is an RDS-serialised R list of class "genetic-info" with the following elements:

filename

Path to the associated .bdose file.

usesfid

Logical; always FALSE for VCF-sourced files.

samples

data.frame with columns fid (empty) and sid (sample IDs).

onechr

Logical; TRUE if all SNPs are on a single chromosome.

snpidformat

Numeric; resolved SNP ID format (see snpidformat parameter).

snps

data.frame with columns chromosome, location, snpid, reference, alternate.

snpinfo

Named list of per-SNP annotations requested via bdoptions. Each element is a numeric vector with one value per SNP. Values are read from the VCF INFO column (AF for aaf, MAF for maf, R2 for rsq) if that field is present for the first SNP; otherwise they are calculated from the dosage values.

additionalinfo

List of class "bdose-info" with format, subformat, headersize, numgroups, and groups.

datasize

Integer vector of length 0 (unused in Format 5).

indices

Numeric vector of byte offsets into .bdose, one per SNP.
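Because the .bdi file is an ordinary RDS file, it can be inspected with base R. The sketch below builds a mock list with a subset of the elements described above, saves it, and reads it back; all values are invented for illustration and do not come from a real conversion.

```r
# Mock of the documented .bdi layout; every value here is invented and
# only a subset of the documented elements is included.
mock_info <- structure(
  list(
    filename    = "example.bdose",
    usesfid     = FALSE,
    samples     = data.frame(fid = "", sid = c("S1", "S2")),
    onechr      = TRUE,
    snpidformat = 2,
    snps        = data.frame(
      chromosome = "chr21",
      location   = c(100L, 200L),
      snpid      = c("chr21:100:A:G", "chr21:200:C:T"),
      reference  = c("A", "C"),
      alternate  = c("G", "T")
    ),
    snpinfo     = list(),
    datasize    = integer(0),
    indices     = c(4, 60)  # invented byte offsets, one per SNP
  ),
  class = "genetic-info"
)

bdi_path <- tempfile(fileext = ".bdi")
saveRDS(mock_info, bdi_path)

# Reading a .bdi file back is a plain readRDS() call.
info <- readRDS(bdi_path)
stopifnot(inherits(info, "genetic-info"))
nrow(info$snps)       # number of SNPs
length(info$indices)  # one byte offset per SNP
```

For a real file, replace bdi_path with the path written by vcftobd, that is, paste0(bdose_file, ".bdi").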

Value

NULL, invisibly. The function is called for its side effect of writing the .bdose and .bdi files.
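A call might look like the following sketch. The file names are hypothetical, and the example assumes that vcftobd is exported by the installed BinaryDosage version; the call is skipped when the package or the input file is not available.

```r
# Hypothetical paths; replace with real files.
vcf_path   <- "chr21.dose.vcf.gz"  # assumed to be bgzipped and tabix-indexed
bdose_path <- "chr21.bdose"
bdi_path   <- paste0(bdose_path, ".bdi")  # where the metadata file is written

if (requireNamespace("BinaryDosage", quietly = TRUE) && file.exists(vcf_path)) {
  BinaryDosage::vcftobd(
    vcffile     = vcf_path,
    bdose_file  = bdose_path,
    region      = "chr21:1-5000000",  # restrict to the first 5 Mb of chr21
    snpidformat = 2L,                 # store IDs as chr:pos:ref:alt
    bdoptions   = c("aaf", "rsq")     # keep allele frequency and r-squared
  )
  file.exists(c(bdose_path, bdi_path))
}
```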


BinaryDosage documentation built on April 30, 2026, 1:09 a.m.