vcf2DT | R Documentation |
Reads a VCF file and converts it to a long-format data table. Note that while the data.table object class is very memory efficient, very large genomic datasets might take longer to read in and/or be difficult to hold in memory. Take your operating system and the size of your input dataset into consideration when using this function.
vcf2DT(vcfFile, dropCols = NULL, keepComments = FALSE, keepInfo = FALSE)
vcfFile | Character: The path to the input VCF file. |
dropCols | Character: Vector of column names from the VCF that you want to drop from the output data table. Use this for any column that occurs before the 'FORMAT' column in the original VCF file. Default = NULL. |
keepComments | Logical: Should the VCF comments be kept? Default = FALSE. |
keepInfo | Logical: Should the VCF info for each locus be kept? Default = FALSE. |
Although data tables are an excellent way of handling genotype and sequence read information in R, they are not necessarily the most efficient way to do so for very large genomic datasets. Keep your operating system and the size of your dataset in mind before using this function. Most RADseq datasets should be manageable, but whole-genome data can be challenging if you do not have much available memory. You can always try loading subsets of your dataset (e.g., by chromosome or contig) to see how feasible it is to load with this function.
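One way to try the subset approach is to write a smaller VCF per chromosome before reading it in. The following is a minimal base R sketch under stated assumptions: the toy VCF content, the sample name ind_A, and the helper subsetVcfByChrom are all hypothetical and are not part of genomalicious.

```r
# Write a tiny, made-up VCF to a temporary file so the sketch is self-contained.
header <- c('#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ind_A')
vcfLines <- c(
  '##fileformat=VCFv4.2',
  paste(header, collapse = '\t'),
  paste(c('chr1', '101', '.', 'A', 'G', '.', 'PASS', '.', 'GT', '0/1'), collapse = '\t'),
  paste(c('chr1', '250', '.', 'C', 'T', '.', 'PASS', '.', 'GT', '0/0'), collapse = '\t'),
  paste(c('chr2', '77', '.', 'G', 'A', '.', 'PASS', '.', 'GT', '1/1'), collapse = '\t')
)
fullPath <- tempfile(fileext = '.vcf')
writeLines(vcfLines, fullPath)

# Hypothetical helper: copy all header lines plus the records of one
# chromosome into a new VCF file, and return the path of that file.
subsetVcfByChrom <- function(vcfFile, chrom, outFile) {
  lines <- readLines(vcfFile)
  isHeader <- startsWith(lines, '#')
  # The chromosome is the first tab-delimited field of each record.
  recordChrom <- sub('\t.*$', '', lines)
  keep <- isHeader | recordChrom == chrom
  writeLines(lines[keep], outFile)
  invisible(outFile)
}

chr1Path <- subsetVcfByChrom(fullPath, 'chr1', tempfile(fileext = '.vcf'))
readLines(chr1Path)
# The smaller file could then be read with: vcf2DT(vcfFile=chr1Path)
```

Because this sketch still reads the whole file as text once, it is only a convenience for moderately sized files; for very large files, splitting with a command-line tool before loading into R is likely the better choice.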
A data.table object is returned with all the columns contained in the original VCF file, with some additions:
A column called $LOCUS is generated. This is the concatenation of the $CHROM and $POS columns to form a locus ID, "CHROM:POS".
A column called $SAMPLE is generated. This contains the sample IDs, which are the columns that follow the $FORMAT column in the original VCF.
The items in the original $FORMAT column of the VCF are given their own columns.
Note, for VCF files produced by Stacks, the $CHROM is given the same value as the $ID column.
When keepInfo==TRUE and/or keepComments==TRUE, these are returned as attributes. E.g., if the returned object is vcfDT, then you can access Info and Comments (respectively) with: attr(vcfDT, 'vcf_info') and attr(vcfDT, 'vcf_comments').
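To make the shape of the returned table concrete, the sketch below builds a toy wide table and reshapes it into the long layout described above. It is an illustration only, not the package's implementation: the data values, the sample names ind_A and ind_B, and the intermediate column FIELDS are hypothetical, and it assumes every locus shares the same FORMAT string.

```r
library(data.table)

# A toy wide table mimicking the body of a VCF (made-up values).
wide <- data.table(
  CHROM  = c('chr1', 'chr1', 'chr2'),
  POS    = c(101L, 250L, 77L),
  FORMAT = 'GT:DP',
  ind_A  = c('0/0:12', '0/1:8', '1/1:20'),
  ind_B  = c('0/1:15', '0/0:9', '0/1:11')
)

# Locus ID in the "CHROM:POS" style.
wide[, LOCUS := paste(CHROM, POS, sep = ':')]

# Wide to long: one row per locus-by-sample combination.
long <- melt(
  wide,
  id.vars = c('CHROM', 'POS', 'LOCUS', 'FORMAT'),
  variable.name = 'SAMPLE',
  value.name = 'FIELDS',
  variable.factor = FALSE
)

# Give each FORMAT item its own column (assumes one shared FORMAT string).
formatItems <- strsplit(wide$FORMAT[1], ':', fixed = TRUE)[[1]]
long[, (formatItems) := tstrsplit(FIELDS, ':', fixed = TRUE)]
long[, FIELDS := NULL]

print(long)
```

The result has one row for each locus in each sample, so a table with 3 loci and 2 samples yields 6 rows. This is why long-format tables grow quickly with the number of samples.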
# Load the package that provides vcf2DT
library(genomalicious)
# Create a link to raw external datasets in genomalicious
genomaliciousExtData <- paste0(find.package('genomalicious'), '/extdata')
# This command shows you the VCF file that comes with genomalicious
list.files(path=genomaliciousExtData, pattern='indseq.vcf')
# Use this to create a path to that file
vcfPath <- paste0(genomaliciousExtData, '/data_indseq.vcf')
# You can read the file in as lines to see what it looks like:
head(readLines(vcfPath))
tail(readLines(vcfPath))
# Now read it in as a data table
readVcf1 <- vcf2DT(vcfFile=vcfPath)
print(readVcf1)
# Read in the VCF, but drop some columns,
# and keep comments and info.
readVcf2 <- vcf2DT(vcfPath, dropCols=c('QUAL'), keepComments=TRUE, keepInfo=TRUE)
print(readVcf2)
# Comments and info are stored as attributes
attr(readVcf2, 'vcf_comments')
attr(readVcf2, 'vcf_info')