vcf2diem: Convert vcf files to diem format
In diemr: Genome Polarization via Diagnostic Index Expectation Maximization

Description

Reads vcf files and writes genotypes of the most frequent alleles based on chromosome positions to diem format.

Usage

vcf2diem(
  SNP,
  filename,
  chunk = 1L,
  requireHomozygous = TRUE,
  ChosenInds = "all",
  maxMissing = 0L,
  bed = FALSE
)

Arguments

`SNP`	A character vector with a path to the '.vcf' or '.vcf.gz' file, or a `vcfR` object. Diploid data are currently supported.
`filename`	A character vector with a path where to save the converted genotypes.
`chunk`	Numeric indicating how the result should be split into separate files: either the number of output files or the number of sites per file (see Details).
`requireHomozygous`	A logical or numeric vector indicating whether to require each site to have one or more homozygous individuals for each allele.
`ChosenInds`	A numeric or logical vector of indices of individuals to be included in the analysis. The default `"all"` includes all individuals.
`maxMissing`	A numeric value specifying the maximum tolerated missing data per site. Values less than 1 are interpreted as a proportion of individuals (e.g., 0.25 allows up to 25% missing), and values greater than 1 are interpreted as the maximum number of individuals allowed to have missing genotypes. The default `0L` disables filtering by missing data, meaning that no sites are excluded because of missing genotypes.
`bed`	Logical. If `TRUE`, export `includedSites` and `omittedSites` in 3-column BED format.
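
The `maxMissing` rule can be read as a small threshold calculation. The sketch below is illustrative only and is not the code used by diemr: `max_missing_count` is a hypothetical helper, and both the rounding of proportions with `floor` and the treatment of a value of exactly 1 as a count are assumptions, because the argument description does not specify them.

```r
# Hypothetical helper (not part of diemr): translate maxMissing into the
# maximum number of individuals allowed to have a missing genotype at a site.
max_missing_count <- function(maxMissing, n_individuals) {
  stopifnot(is.numeric(maxMissing), length(maxMissing) == 1, maxMissing >= 0)
  if (maxMissing == 0) {
    # Default: filtering is disabled, so any amount of missing data is tolerated.
    return(n_individuals)
  }
  if (maxMissing < 1) {
    # Proportion of individuals; rounding down is an assumption.
    return(floor(maxMissing * n_individuals))
  }
  # Count of individuals (a value of exactly 1 is treated as a count here,
  # which is also an assumption).
  min(maxMissing, n_individuals)
}

max_missing_count(0.25, 20)  # proportion: 25% of 20 individuals
max_missing_count(3, 20)     # count: at most 3 individuals
max_missing_count(0, 20)     # default: no filtering
```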

Details

Importing vcf files larger than 1 GB, or files containing multiallelic genotypes, into R as vcfR objects is not recommended. Instead, provide the path to the vcf file in SNP. vcf2diem then reads the file line by line, which is the preferred approach to data conversion, especially for very large and complex genomic datasets.

The number of files vcf2diem creates depends on the chunk argument and class of the SNP object.

When chunk = 1L, all included sites will be written into one file. Use this option when you neither use parallelization nor differentiate ploidy in some compartments.
Other values of chunk < 100 are interpreted as the number of files into which to split the data in SNP. For a SNP object of class vcfR, the number of sites per file is calculated from the dimensions of SNP. When the class of SNP is character, the number of sites per file is approximated from a model and reported in a message. If this number of sites per file is inappropriate for the expected output, provide the intended number of sites per file as a chunk value of at least 100 (values greater than 10000 are recommended for genomic data). vcf2diem will scan the whole input specified in SNP, creating additional output files until the last line in SNP is reached.
Values of chunk >= 100 mean that each output file in diem format will contain chunk lines of the data in SNP.
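
The three cases for chunk can be summarised in a short sketch. It is not the code used by vcf2diem: `chunk_plan` is a hypothetical helper, it assumes the total number of sites is known in advance (as for a vcfR object), it rounds the number of sites per file up, and it ignores sites that are omitted as uninformative.

```r
# Hypothetical helper (not part of diemr): how chunk maps to output files
# when the total number of sites is known.
chunk_plan <- function(chunk, n_sites) {
  stopifnot(chunk >= 1, n_sites >= 1)
  if (chunk == 1) {
    # Everything goes into a single file.
    sites_per_file <- n_sites
  } else if (chunk < 100) {
    # chunk is a number of files; rounding up is an assumption.
    sites_per_file <- ceiling(n_sites / chunk)
  } else {
    # chunk is the number of sites (lines) per file.
    sites_per_file <- chunk
  }
  list(
    sites_per_file = sites_per_file,
    n_files = ceiling(n_sites / sites_per_file)
  )
}

chunk_plan(1, 50000)      # one file with all sites
chunk_plan(4, 50000)      # four files
chunk_plan(10000, 50000)  # files of 10000 sites each
```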

When the vcf file contains sites not informative for genome polarisation, those are removed and listed in a file ending with omittedSites.txt in the directory specified in the filename argument or in the working directory. The omitted loci are identified by their values in the CHROM and POS columns, and include the QUAL column data. The last column is an integer specifying the reason why the respective site was omitted. The reasons sites are not informative for genome polarisation using diem are:

1. Site has fewer than 2 alleles representing substitutions.
2. Required homozygous individuals for the two most frequent alleles are not present (optional, controlled by the requireHomozygous argument).
3. The second most frequent allele is found only in one heterozygous individual.
4. Dataset is invariant for the most frequent allele.
5. Dataset is invariant for the allele listed as the first ALT in the vcf input.
6. Site is not genotyped for a required number of individuals.
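
Once the omitted sites are loaded into R, the reason codes can be tabulated to see which filter removed the most sites. The sketch below uses made-up rows rather than real vcf2diem output. The column names and the mapping of integer codes 1 to 6 onto the reasons in the order listed above are assumptions; check them against your own omittedSites.txt before relying on them.

```r
# Made-up rows for illustration only; real files are written by vcf2diem.
# Column names are assumptions.
omitted <- data.frame(
  CHROM  = c("chr1", "chr1", "chr2", "chr2"),
  POS    = c(101L, 250L, 77L, 910L),
  QUAL   = c(50, 32, 99, 12),
  reason = c(1L, 3L, 3L, 6L)
)

# Assumed mapping: codes follow the order of the reasons listed above.
reason_labels <- c(
  "fewer than 2 substitution alleles",
  "required homozygotes not present",
  "second allele in a single heterozygote",
  "invariant for most frequent allele",
  "invariant for first ALT allele",
  "not genotyped in enough individuals"
)

omitted$reason_label <- factor(omitted$reason, levels = 1:6,
                               labels = reason_labels)

# Count omitted sites per reason and show only the reasons that occur.
reason_counts <- table(omitted$reason_label)
reason_counts[reason_counts > 0]
```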

The CHROM, POS, and QUAL information for loci included in the converted files is listed in the file ending with includedSites.txt. An additional column shows which allele is encoded as 0 in its homozygous state and which is encoded as 2.

When bed = TRUE, both includedSites.txt and omittedSites.txt contain simplified 0-based site coordinates in the standard 3-column BED format: chromosome, start (POS - 1), and end (POS). All other columns described above are omitted in this case.
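
The coordinate conversion described above is a one-line shift. As a minimal sketch, with `pos_to_bed` being a hypothetical helper rather than a diemr function and the site positions made up for illustration:

```r
# Hypothetical helper (not part of diemr): convert 1-based VCF positions
# to 0-based, half-open 3-column BED intervals.
pos_to_bed <- function(chrom, pos) {
  data.frame(chrom = chrom, start = pos - 1L, end = pos)
}

bed_sites <- pos_to_bed(c("chr1", "chr1", "chr2"), c(101L, 250L, 77L))
bed_sites
```

Each site becomes an interval of length 1, so the output can be passed to tools that expect standard BED input.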

The file sampleNames.txt will contain the individual names listed from column 10 onward in the VCF header, restricted to those specified by the ChosenInds argument.
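
To illustrate where the sample names come from, the sketch below splits a made-up VCF header line and keeps the individuals selected by a numeric `ChosenInds`. The header content and sample names are invented for the example, and the code only mirrors the description above; it is not taken from diemr.

```r
# A made-up VCF header line: nine fixed columns, followed by sample columns.
header <- paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                  "INFO", "FORMAT", "ind1", "ind2", "ind3", "ind4"),
                collapse = "\t")

# Individual names start at column 10.
fields <- strsplit(header, "\t", fixed = TRUE)[[1]]
all_samples <- fields[10:length(fields)]

# Keep only the chosen individuals (indices are relative to the samples).
ChosenInds <- c(1, 3, 4)
sample_names <- all_samples[ChosenInds]
sample_names
```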

Value

No value returned, called for side effects.

Author(s)

Natalia Martinkova

Filip Jagos 521160@mail.muni.cz

Jachym Postulka 506194@mail.muni.cz

Examples

## Not run: 
# vcf2diem writes files to the working directory or to a specified folder;
# make sure that location has write permission
myofile <- system.file("extdata", "myotis.vcf", package = "diemr")

vcf2diem(SNP = myofile, filename = "test1")
vcf2diem(SNP = myofile, filename = "test2", chunk = 3)

## End(Not run)

diemr documentation built on Dec. 11, 2025, 5:07 p.m.