import.vcf: Import VCF file

View source: R/handle-vcf.R

import.vcf R Documentation

Import VCF file

Description

A wrapper for the VCF import function in the vcfR package that formats VCF data for PGS application with apply.polygenic.score().

Usage

import.vcf(
  vcf.path,
  long.format = FALSE,
  info.fields = NULL,
  format.fields = NULL,
  verbose = FALSE
)

Arguments

vcf.path

A character string indicating the path to the VCF file to be imported.

long.format

A logical indicating whether the imported VCF should also be converted into long format (one row per sample-variant combination).

info.fields

A character vector indicating the INFO fields to be imported. Only applicable when long.format is TRUE.

format.fields

A character vector indicating the FORMAT fields to be imported. Only applicable when long.format is TRUE.

verbose

A logical indicating whether verbose output should be printed by vcfR.

Value

A list of two elements containing the imported VCF information in wide format and, if requested, in long format.

Output Structure

The output list contains the following elements:

  • split.wide.vcf.matrices: A list with two elements: a data.table of fixed VCF fields and a matrix of genotyped alleles.

  • combined.long.vcf.df: NULL by default. If long.format == TRUE, a list with two elements inherited from vcfR: a data frame of metadata from the VCF header and a data frame of all requested VCF fields (including INFO and FORMAT fields) in long format. The number of rows is equal to the number of samples times the number of sites in the VCF.

The split.wide.vcf.matrices list contains the following elements:

  • genotyped.alleles: A matrix of genotyped alleles (e.g. "A/C"). Rows are unique sites and columns are unique samples in the input VCF.

  • vcf.fixed.fields: A data.table of the following fixed (not varying by sample) VCF fields: CHROM, POS, ID, REF, ALT. It also contains one additional column, allele.matrix.row.index, indicating the corresponding row in genotyped.alleles.
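
The link between the two wide-format elements can be sketched as follows. This is a minimal illustration on hand-built mock objects, not output from import.vcf(): the variant and sample values are invented, and a plain data.frame stands in for the data.table described above. Only the element and column names come from this page.

```r
# Mock stand-ins for the two elements of split.wide.vcf.matrices.
# All values below are invented for illustration.
genotyped.alleles <- matrix(
    c('A/C', 'A/A', 'G/G', 'G/T', 'T/T', 'C/T'),
    nrow = 3,
    byrow = TRUE,
    dimnames = list(NULL, c('sample1', 'sample2'))
    );
vcf.fixed.fields <- data.frame(
    CHROM = c('chr1', 'chr1', 'chr2'),
    POS = c(100, 250, 75),
    ID = c('rs1', 'rs2', 'rs3'),
    REF = c('A', 'G', 'T'),
    ALT = c('C', 'T', 'C'),
    allele.matrix.row.index = 1:3,
    stringsAsFactors = FALSE
    );

# Select variants of interest from the fixed fields, then use
# allele.matrix.row.index to pull the matching genotype rows.
chr1.variants <- vcf.fixed.fields[vcf.fixed.fields$CHROM == 'chr1', ];
chr1.genotypes <- genotyped.alleles[
    chr1.variants$allele.matrix.row.index, ,
    drop = FALSE
    ];
rownames(chr1.genotypes) <- chr1.variants$ID;
```

Subsetting through the index column rather than by position keeps the genotype rows aligned with the fixed fields even after the fixed fields have been filtered or reordered.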

The combined.long.vcf.df list contains the following elements:

  • meta: A data frame of metadata parsed from the VCF header.

  • dat: A data frame of all default VCF fields and all requested INFO and FORMAT fields in long format. Number of rows is equal to the number of unique samples times the number of unique sites in the VCF.

The wide format is intended to efficiently contain the bare minimum information required for PGS application. It intentionally excludes much of the additional information included in a typical VCF, and splits off genotypes into a separate matrix for easy manipulation. Users who wish to retain additional information from the INFO and FORMAT fields, e.g. for variant filtering, can do so with the long format. However, the long format requires substantially more memory to store, and is not recommended for large input files.
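
The trade-off described above can be illustrated with a small mock of a long-format table. This is a sketch under stated assumptions, not output from import.vcf(): the values are invented, and the column names Indiv and gt_DP are assumptions modelled on vcfR's tidy output that may not match the actual columns in dat.

```r
# Invented site and sample identifiers for illustration.
site.ids <- c('rs1', 'rs2', 'rs3');
sample.ids <- c('sample1', 'sample2');

# Long format: one row per sample-variant combination.
# expand.grid varies the first argument fastest.
long.dat <- expand.grid(
    ID = site.ids,
    Indiv = sample.ids,
    stringsAsFactors = FALSE
    );

# Hypothetical per-genotype read depth (a FORMAT field), invented values.
long.dat$gt_DP <- c(30, 8, 25, 12, 40, 5);

# The row count grows as sites times samples.
n.long.rows <- nrow(long.dat);

# Variant filtering on a FORMAT field, which the wide format does not retain.
passing <- long.dat[long.dat$gt_DP >= 10, ];
```

Every fixed field is repeated once per sample in this layout, which is why memory use grows much faster than in the wide format.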

Examples

# Example VCF
vcf <- system.file(
    'extdata',
    'HG001_GIAB.vcf.gz',
    package = 'ApplyPolygenicScore',
    mustWork = TRUE
    );
vcf.data <- import.vcf(vcf.path = vcf, long.format = TRUE);

ApplyPolygenicScore documentation built on Aug. 21, 2025, 5:43 p.m.