import.vcf | R Documentation |
A wrapper for the VCF import function in the vcfR package that formats VCF data for PGS application with apply.polygenic.score().
import.vcf(
vcf.path,
long.format = FALSE,
info.fields = NULL,
format.fields = NULL,
verbose = FALSE
)
vcf.path |
A character string indicating the path to the VCF file to be imported. |
long.format |
A logical indicating whether the VCF import should be converted into long format (one row per sample-variant combination) |
info.fields |
A character vector indicating the INFO fields to be imported, only applicable when long.format is TRUE. |
format.fields |
A character vector indicating the FORMAT fields to be imported, only applicable when long.format is TRUE. |
verbose |
A logical indicating whether verbose output should be printed by vcfR. |
A list of two elements containing imported VCF information in wide format and in long format if requested.
Output Structure
The output list contains the following elements:
split.wide.vcf.matrices: A list with two elements: a data.table of fixed VCF fields and a matrix of genotyped alleles.
combined.long.vcf.df: NULL by default. If long.format == TRUE, a list with two elements inherited from vcfR: a data frame of metadata from the VCF header and a data frame of all requested VCF fields (including INFO and FORMAT fields) in long format. The number of rows equals the number of samples times the number of sites in the VCF.
The split.wide.vcf.matrices list contains the following elements:
genotyped.alleles: A matrix of genotyped alleles (e.g. "A/C"). Rows are unique sites and columns are unique samples in the input VCF.
vcf.fixed.fields: A data table of the following fixed (not varying by sample) VCF fields: CHROM, POS, ID, REF, ALT. It also includes one additional column, allele.matrix.row.index, indicating the corresponding row in genotyped.alleles.
The combined.long.vcf.df list contains the following elements:
meta: A data frame of metadata parsed from the VCF header.
dat: A data frame of all default VCF fields and all requested INFO and FORMAT fields in long format. Number of rows is equal to the number of unique samples times the number of unique sites in the VCF.
The wide format is intended to efficiently contain the bare minimum information required for PGS application. It intentionally excludes much of the additional information included in a typical VCF, and splits off genotypes into a separate matrix for easy manipulation. If users wish to retain additional information from the INFO and FORMAT fields (e.g. for variant filtering), the long format allows this. However, the long format requires substantially more memory to store, because it holds one row per sample-variant combination, and is not recommended for large input files.
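The relationship between the two wide-format elements can be illustrated with a minimal base R sketch. The toy data below are hypothetical and only mimic the documented structure: a plain data.frame stands in for the data.table that import.vcf() returns, and the sample names, IDs, and genotypes are invented for illustration.

```r
# Hypothetical toy data mimicking the documented wide-format structure.
# Rows of the matrix are sites; columns are samples.
genotyped.alleles <- matrix(
    c('A/C', 'A/A',
      'G/G', 'G/T',
      'T/T', 'C/T'),
    nrow = 3,
    byrow = TRUE,
    dimnames = list(NULL, c('sample1', 'sample2'))
    );

# Fixed fields, one row per site. A base data.frame is used here as a
# stand-in for the data.table described above (an assumption of this sketch).
vcf.fixed.fields <- data.frame(
    CHROM = c('chr1', 'chr1', 'chr2'),
    POS = c(100L, 200L, 300L),
    ID = c('rs1', 'rs2', 'rs3'),
    REF = c('A', 'G', 'T'),
    ALT = c('C', 'T', 'C'),
    allele.matrix.row.index = 1:3,
    stringsAsFactors = FALSE
    );

# Select sites by a fixed field, then use allele.matrix.row.index to
# pull the matching genotype rows from the matrix.
chr1.sites <- vcf.fixed.fields[vcf.fixed.fields$CHROM == 'chr1', ];
chr1.genotypes <- genotyped.alleles[
    chr1.sites$allele.matrix.row.index,
    ,
    drop = FALSE
    ];
rownames(chr1.genotypes) <- chr1.sites$ID;
print(chr1.genotypes);

# The long format would hold one row per sample-site combination.
expected.long.rows <- nrow(genotyped.alleles) * ncol(genotyped.alleles);
```

Keeping the row index as an explicit column means the fixed fields can be filtered or reordered without losing track of which matrix row each site refers to.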
# Example VCF
vcf <- system.file(
'extdata',
'HG001_GIAB.vcf.gz',
package = 'ApplyPolygenicScore',
mustWork = TRUE
);
vcf.data <- import.vcf(vcf.path = vcf, long.format = TRUE);
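As a follow-up, the returned list can be inspected using the element names described in the Output Structure section. This is a sketch that assumes the ApplyPolygenicScore package is installed and that the example VCF ships with it as shown above; it only uses base R accessors on the documented elements.

```r
library(ApplyPolygenicScore);

# Locate the example VCF bundled with the package
vcf <- system.file(
    'extdata',
    'HG001_GIAB.vcf.gz',
    package = 'ApplyPolygenicScore',
    mustWork = TRUE
    );
vcf.data <- import.vcf(vcf.path = vcf, long.format = TRUE);

# Wide format: fixed fields and the genotype matrix
wide <- vcf.data$split.wide.vcf.matrices;
head(wide$vcf.fixed.fields);
dim(wide$genotyped.alleles);

# Long format: present only because long.format = TRUE was requested
long <- vcf.data$combined.long.vcf.df;
head(long$meta);
head(long$dat);
```

With long.format = FALSE (the default), combined.long.vcf.df is NULL, so code that reads it should check for that case first.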