vcf_sanity_check: Perform a Sanity Check on a VCF File

View source: R/utils.R

vcf_sanity_check    R Documentation

Perform a Sanity Check on a VCF File

Description

This function performs a series of checks on a VCF file to ensure its validity and integrity. It verifies the presence of required headers, columns, and data fields, and checks for common issues such as missing or malformed data.

Usage

vcf_sanity_check(
  vcf_path,
  n_data_lines = 100,
  max_markers = 10000,
  verbose = FALSE
)

Arguments

vcf_path

A character string specifying the path to the VCF file. The file can be plain text or gzipped.

n_data_lines

An integer specifying the number of data lines to sample for detailed checks. Default is 100.

max_markers

An integer specifying the maximum number of markers allowed in the VCF file. Default is 10,000.

verbose

A logical value indicating whether to print detailed messages during the checks. Default is FALSE.
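As a rough illustration of how the arguments above relate to the file contents, the following base R sketch reads a plain or gzipped VCF, keeps the first `n_data_lines` data lines, and counts the markers. The helper name `read_vcf_sample` is hypothetical and is not part of Qploidy; the package's own reading logic may differ.

```r
# Hypothetical helper (not part of Qploidy): read a plain-text or gzipped VCF
# and return its header lines, a sample of data lines, and the marker count.
read_vcf_sample <- function(vcf_path, n_data_lines = 100) {
  # gzfile() reads gzip-compressed files and also uncompressed text files
  con <- gzfile(vcf_path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con)

  # Header and column lines start with '#'; everything else is a marker record
  is_header <- startsWith(lines, "#")
  data_lines <- lines[!is_header & nzchar(lines)]

  list(
    header      = lines[is_header],
    data_sample = head(data_lines, n_data_lines),
    n_markers   = length(data_lines)
  )
}
```

With such a helper, the marker count could be compared against a limit with `read_vcf_sample(path)$n_markers > 10000`. This sketch reads the whole file into memory, which is an assumption made for brevity.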

Details

The function performs the following checks:

- **VCF_header**: Verifies the presence of the '##fileformat' header.
- **VCF_columns**: Ensures required columns ('#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO') are present.
- **max_markers**: Checks if the total number of markers exceeds the specified limit.
- **GT**: Verifies the presence of the 'GT' (genotype) field in the FORMAT column.
- **allele_counts**: Checks for allele-level count fields (e.g., 'AD', 'RA', 'AO', 'RO').
- **samples**: Ensures sample/genotype columns are present.
- **chrom_info** and **pos_info**: Verifies the presence of 'CHROM' and 'POS' columns.
- **ref_alt**: Ensures 'REF' and 'ALT' fields contain valid nucleotide codes.
- **multiallelics**: Identifies multiallelic sites (ALT field with commas).
- **phased_GT**: Checks for phased genotypes (presence of '|' in the 'GT' field).
- **duplicated_samples**: Checks for duplicated sample IDs.
- **duplicated_markers**: Checks for duplicated marker IDs.
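To make a few of the checks listed above concrete, here is a minimal base R sketch that operates on VCF lines already read into a character vector. It is not the Qploidy implementation: the helper name `sketch_vcf_checks` is hypothetical, only a subset of the checks is covered, and each `TRUE` here simply means "the feature was detected", which may not match the convention used by `vcf_sanity_check`.

```r
# Hypothetical helper (not the Qploidy implementation): a subset of the
# documented checks, applied to a character vector of VCF lines.
sketch_vcf_checks <- function(lines) {
  required <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

  # Column names come from the single line that starts with '#CHROM'
  col_line <- grep("^#CHROM", lines, value = TRUE)
  cols <- if (length(col_line)) {
    strsplit(col_line[1], "\t", fixed = TRUE)[[1]]
  } else {
    character(0)
  }
  samples <- if (length(cols) > 9) cols[10:length(cols)] else character(0)

  # Split each data line into its tab-separated fields
  data_lines <- lines[!startsWith(lines, "#") & nzchar(lines)]
  fields <- strsplit(data_lines, "\t", fixed = TRUE)
  field <- function(i) {
    vapply(fields, function(x) if (length(x) >= i) x[i] else NA_character_,
           character(1))
  }
  alt <- field(5)  # ALT is the 5th column
  fmt <- field(9)  # FORMAT is the 9th column

  # The VCF specification puts GT first in FORMAT when present, so the
  # genotype is taken as the first colon-separated subfield of each sample
  gt <- unlist(lapply(fields, function(x) {
    if (length(x) > 9) sub(":.*$", "", x[10:length(x)]) else character(0)
  }))

  c(
    VCF_header         = any(startsWith(lines, "##fileformat")),
    VCF_columns        = all(required %in% cols),
    GT                 = length(fmt) > 0 && all(grepl("(^|:)GT(:|$)", fmt)),
    samples            = length(samples) > 0,
    multiallelics      = any(grepl(",", alt, fixed = TRUE)),
    phased_GT          = any(grepl("|", gt, fixed = TRUE)),
    duplicated_samples = anyDuplicated(samples) > 0
  )
}
```

The result is a named logical vector, similar in shape to the 'checks' element described under Value.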

Value

A list containing:

- 'checks': A named vector indicating the results of each check (TRUE or FALSE).
- 'messages': A data frame containing messages for each check, indicating success or failure.
- 'duplicates': A list containing any duplicated sample or marker IDs found in the VCF file.
- 'ploidy_max': The maximum ploidy detected from the genotype field, if applicable.
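Because 'checks' is a named logical vector, the entries reported as FALSE can be pulled out by name. The sketch below uses a mock object with invented values in place of a real result; in practice the list would come from a call such as `vcf_sanity_check("path/to/file.vcf.gz")`, where the path is hypothetical. The layout of 'messages' and 'duplicates' is not shown because it is not specified here.

```r
# Mock result with invented values, standing in for the list returned by
# vcf_sanity_check(); only the 'checks' element is mimicked.
res <- list(
  checks = c(VCF_header = TRUE, VCF_columns = TRUE, GT = FALSE)
)

# Names of the checks whose result is FALSE
false_checks <- names(res$checks)[!res$checks]

if (length(false_checks) > 0) {
  message("Checks reported as FALSE: ", paste(false_checks, collapse = ", "))
}
```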


Qploidy documentation built on June 8, 2025, 10 a.m.