vcf_sanity_check | R Documentation |
This function performs a series of checks on a VCF file to ensure its validity and integrity. It verifies the presence of required headers, columns, and data fields, and checks for common issues such as missing or malformed data.
vcf_sanity_check(
  vcf_path,
  n_data_lines = 100,
  max_markers = 10000,
  verbose = FALSE
)
vcf_path | A character string specifying the path to the VCF file. The file can be plain text or gzipped. |
n_data_lines | An integer specifying the number of data lines to sample for detailed checks. Default is 100. |
max_markers | An integer specifying the maximum number of markers allowed in the VCF file. Default is 10,000. |
verbose | A logical value indicating whether to print detailed messages during the checks. Default is FALSE. |
The function performs the following checks:
- **VCF_header**: Verifies the presence of the '##fileformat' header.
- **VCF_columns**: Ensures required columns ('#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO') are present.
- **max_markers**: Checks if the total number of markers exceeds the specified limit.
- **GT**: Verifies the presence of the 'GT' (genotype) field in the FORMAT column.
- **allele_counts**: Checks for allele-level count fields (e.g., 'AD', 'RA', 'AO', 'RO').
- **samples**: Ensures sample/genotype columns are present.
- **chrom_info** and **pos_info**: Verify the presence of the 'CHROM' and 'POS' columns.
- **ref_alt**: Ensures 'REF' and 'ALT' fields contain valid nucleotide codes.
- **multiallelics**: Identifies multiallelic sites (ALT field with commas).
- **phased_GT**: Checks for phased genotypes (presence of '|' in the 'GT' field).
- **duplicated_samples**: Checks for duplicated sample IDs.
- **duplicated_markers**: Checks for duplicated marker IDs.
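To make a few of these checks concrete, here is a minimal base-R sketch that applies similar logic to a tiny in-memory VCF. It is not the function's actual implementation: the toy records, the variable names, and the convention that TRUE means "condition detected" are all assumptions made for illustration. The allele-count line at the end is likewise only one plausible way to derive a maximum ploidy from the GT field.

```r
# Toy VCF held in memory; every record below is made up for illustration.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "chr1\t100\tm1\tA\tG\t.\tPASS\t.\tGT:AD\t0/1:5,3\t1|1:0,9",
  "chr1\t200\tm2\tC\tT,G\t.\tPASS\t.\tGT:AD\t0/2:4,0,4\t0/0:8,0,0",
  "chr2\t300\tm2\tG\tA\t.\tPASS\t.\tGT:AD\t0/0:7,0\t0/1:3,3"
)

required_cols <- c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")

# Split the column-name line and the data records on tabs.
col_line  <- vcf_lines[startsWith(vcf_lines, "#CHROM")][1]
col_names <- strsplit(col_line, "\t", fixed = TRUE)[[1]]
records   <- strsplit(vcf_lines[!startsWith(vcf_lines, "#")], "\t", fixed = TRUE)
body      <- do.call(rbind, records)
colnames(body) <- col_names

# Sample columns follow the nine fixed columns (#CHROM ... FORMAT).
sample_ids <- col_names[-(1:9)]

# Keep only the first subfield of each sample entry (GT comes first here).
gt <- sub(":.*$", "", as.vector(body[, sample_ids]))

# In this sketch TRUE means "condition detected"; the real function's
# convention for TRUE/FALSE may differ.
checks <- c(
  VCF_header         = any(startsWith(vcf_lines, "##fileformat")),
  VCF_columns        = all(required_cols %in% col_names),
  GT                 = all(grepl("(^|:)GT(:|$)", body[, "FORMAT"])),
  samples            = length(sample_ids) > 0,
  multiallelics      = any(grepl(",", body[, "ALT"], fixed = TRUE)),
  phased_GT          = any(grepl("|", gt, fixed = TRUE)),
  duplicated_samples = anyDuplicated(sample_ids) > 0,
  duplicated_markers = anyDuplicated(body[, "ID"]) > 0
)
print(checks)

# Assumed approach: ploidy as the number of alleles in each GT call.
ploidy_max <- max(lengths(strsplit(gt, "[/|]")))
print(ploidy_max)
```

In this toy input, the second record has two ALT alleles, one genotype uses '|', and the ID 'm2' appears twice, so the corresponding entries are flagged.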
A list containing:
- 'checks': A named vector indicating the results of each check (TRUE or FALSE).
- 'messages': A data frame containing messages for each check, indicating success or failure.
- 'duplicates': A list containing any duplicated sample or marker IDs found in the VCF file.
- 'ploidy_max': The maximum ploidy detected from the genotype field, if applicable.
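A short usage sketch may help show how the returned list can be inspected. The temporary file and its single record are made up for illustration, and the call is guarded so the snippet runs even when the package providing `vcf_sanity_check()` is not attached. The element names used below are the ones documented above; their exact contents are not shown here because they depend on the function itself.

```r
# Write a tiny, made-up VCF to a temporary file.
vcf_path <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "chr1\t100\tm1\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1"
), vcf_path)

# Only call the function if it is available in the current session.
if (exists("vcf_sanity_check", mode = "function")) {
  res <- vcf_sanity_check(
    vcf_path,
    n_data_lines = 100,
    max_markers = 10000,
    verbose = TRUE
  )
  print(res$checks)      # named logical vector, one entry per check
  print(res$messages)    # data frame of per-check messages
  str(res$duplicates)    # duplicated sample or marker IDs, if any
  print(res$ploidy_max)  # maximum ploidy detected from GT, if applicable
}
```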