vcf2DT: VCF file to data table

View source: R/vcf2DT.R

vcf2DT    R Documentation

VCF file to data table

Description

Reads a VCF file and converts it to a long-format data table. Note that although the data.table object class is very memory efficient, very large genomic datasets might take a long time to read in and/or be difficult to hold in memory. Take your operating system and the size of your input dataset into consideration when using this function.

Usage

vcf2DT(vcfFile, dropCols = NULL, keepComments = FALSE, keepInfo = FALSE)

Arguments

vcfFile

Character: The path to the input VCF file.

dropCols

Character: Vector of column names from the VCF that you want to drop from the output data table. Use this for any column that occurs before the 'FORMAT' column in the original VCF file. Default = NULL.

keepComments

Logical: Should the VCF comments be kept? Default = FALSE. If TRUE, the comments are returned as an attribute of the output; see Value for how to access them.

keepInfo

Logical: Should the VCF info for each locus be kept? Default = FALSE.
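
Because dropCols only applies to columns that occur before 'FORMAT', it can help to check which columns those are in a given file. The following is a minimal sketch in base R, not part of genomalicious: vcfFixedCols is a hypothetical helper, the toy VCF is made up for illustration, and the sketch assumes that the leading '#' of '#CHROM' is not part of the column name (as the $CHROM naming under Value suggests).

```r
# Hypothetical helper (not part of genomalicious): list the VCF columns
# that occur before FORMAT, i.e. candidate values for dropCols.
vcfFixedCols <- function(vcfFile) {
  con <- file(vcfFile, open = 'r')
  on.exit(close(con))
  # Read line by line until the column header line (starts with '#CHROM')
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) stop('No #CHROM header line found.')
    if (startsWith(line, '#CHROM')) break
  }
  # Assumption: the leading '#' is not part of the column name
  cols <- strsplit(sub('^#', '', line), '\t', fixed = TRUE)[[1]]
  formatIdx <- match('FORMAT', cols)
  # A VCF without genotype data has no FORMAT column
  if (is.na(formatIdx)) return(cols)
  cols[seq_len(formatIdx - 1)]
}

# Toy VCF written to a temporary file for illustration
toyVcf <- tempfile(fileext = '.vcf')
writeLines(c(
  '##fileformat=VCFv4.2',
  paste(c('#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO',
          'FORMAT', 'Ind1', 'Ind2'), collapse = '\t'),
  paste(c('chr1', '100', '.', 'A', 'T', '30', 'PASS', 'NS=2',
          'GT:DP', '0/1:12', '1/1:8'), collapse = '\t')
), toyVcf)

fixedCols <- vcfFixedCols(toyVcf)
print(fixedCols)
```

Only the header is read, so the check stays cheap even for large files.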

Details

Data tables are an excellent way of handling genotype and sequence read information in R, but they are not necessarily the most efficient way to do so for very large genomic datasets. Keep your operating system and the size of your dataset in mind before using this function. Most RADseq datasets should be manageable, but whole-genome data can be challenging if you do not have a lot of available memory. You can always try loading subsets of your dataset (e.g., by chromosome or contig) to see how feasible it is to load with this function.
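
One way to try the subsetting approach described above is to write each chromosome or contig to its own file before reading it in. The following is a minimal sketch in base R, not part of genomalicious: subsetVcfByChrom is a hypothetical helper and the toy VCF is made up for illustration. It reads the whole file as text lines, so for very large files a dedicated command-line tool may be a better choice.

```r
# Hypothetical helper (not part of genomalicious): write the records for one
# chromosome to a new VCF file, keeping all header lines.
subsetVcfByChrom <- function(vcfFile, chrom,
                             outFile = tempfile(fileext = '.vcf')) {
  vcfLines <- readLines(vcfFile)
  isHeader <- startsWith(vcfLines, '#')
  # The first tab-delimited field of each record is CHROM
  recordChrom <- sub('\t.*$', '', vcfLines)
  keep <- isHeader | recordChrom == chrom
  writeLines(vcfLines[keep], outFile)
  outFile
}

# Toy VCF with two chromosomes, written to a temporary file
toyVcf <- tempfile(fileext = '.vcf')
writeLines(c(
  '##fileformat=VCFv4.2',
  paste(c('#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO',
          'FORMAT', 'Ind1'), collapse = '\t'),
  paste(c('chr1', '100', '.', 'A', 'T', '30', 'PASS', '.', 'GT', '0/1'),
        collapse = '\t'),
  paste(c('chr1', '200', '.', 'G', 'C', '30', 'PASS', '.', 'GT', '0/0'),
        collapse = '\t'),
  paste(c('chr2', '150', '.', 'C', 'A', '30', 'PASS', '.', 'GT', '1/1'),
        collapse = '\t')
), toyVcf)

chr1Path <- subsetVcfByChrom(toyVcf, 'chr1')
chr1Lines <- readLines(chr1Path)
print(chr1Lines)

# Each subset could then be read in separately, e.g.:
# chr1DT <- vcf2DT(vcfFile = chr1Path)
```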

Value

A data.table object is returned with all the columns contained in the original VCF file with some additions:

  • A column called $LOCUS is generated. This is the concatenation of the $CHROM and $POS columns to form a locus ID: "CHROM:POS".

  • A column called $SAMPLE is generated. This contains the sample IDs, which are the names of the columns that follow the $FORMAT column in the original VCF.

  • The items in the original $FORMAT column of the VCF are given their own columns.
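
To illustrate what these additions look like, the following is a minimal sketch of a wide-to-long conversion on a toy table using data.table. It is not the implementation used by vcf2DT: the toy data and the VALUE column name are made up, the ':' separator in $LOCUS follows the "CHROM:POS" description above, and the sketch assumes every row shares the same $FORMAT string.

```r
library(data.table)

# Toy wide table mimicking the layout of a VCF body (made-up data)
wide <- data.table(
  CHROM = c('chr1', 'chr2'),
  POS = c(100L, 250L),
  FORMAT = c('GT:DP', 'GT:DP'),
  Ind1 = c('0/1:12', '0/0:9'),
  Ind2 = c('1/1:8', '0/1:15')
)

# One row per locus-by-sample combination; sample column names become $SAMPLE
long <- melt(wide, id.vars = c('CHROM', 'POS', 'FORMAT'),
             variable.name = 'SAMPLE', value.name = 'VALUE',
             variable.factor = FALSE)

# Locus ID built from CHROM and POS (separator is an assumption)
long[, LOCUS := paste(CHROM, POS, sep = ':')]

# Give each FORMAT item its own column
# (assumes all rows share the same FORMAT string)
formatNames <- strsplit(wide$FORMAT[1], ':', fixed = TRUE)[[1]]
long[, (formatNames) := tstrsplit(VALUE, ':', fixed = TRUE)]

# Remove the intermediate column holding the unsplit sample field
long[, VALUE := NULL]

print(long)
```

Each locus now appears once per sample, which is why the long format grows quickly with the number of samples.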


Note that for VCF files produced by Stacks, the $CHROM column is given the same value as the $ID column.

When keepInfo==TRUE and/or keepComments==TRUE, the VCF info and comments are returned as attributes of the data table. For example, if the returned object is vcfDT, you can access the info and comments (respectively) with attr(vcfDT, 'vcf_info') and attr(vcfDT, 'vcf_comments').

Examples

# Load the package so that vcf2DT() is available
library(genomalicious)

# Create a link to raw external datasets in genomalicious
genomaliciousExtData <- paste0(find.package('genomalicious'), '/extdata')

# This command here shows you the VCF file that comes with genomalicious
list.files(path=genomaliciousExtData, pattern='indseq.vcf')

# Use this to create a path to that file
vcfPath <- paste0(genomaliciousExtData, '/data_indseq.vcf')

# You can read the file in as lines to see what it
# looks like (%>% is the magrittr pipe):
readLines(vcfPath) %>% head
readLines(vcfPath) %>% tail

# Now read it in as a data table
readVcf1 <- vcf2DT(vcfFile=vcfPath)
readVcf1 %>% print()

# Read in VCF, but drop some columns,
# and keep comments and info.
readVcf2 <- vcf2DT(vcfPath
   , dropCols=c('QUAL')
   , keepComments=TRUE
   , keepInfo=TRUE)

readVcf2 %>% print

attr(readVcf2, 'vcf_comments')
attr(readVcf2, 'vcf_info')


j-a-thia/genomalicious documentation built on Oct. 19, 2024, 7:51 p.m.