snpgdsVCF2GDS_R: Reformat a VCF file (R implementation)

View source: R/Conversion.R

Description

Reformat a Variant Call Format (VCF) file as a GDS file, using an implementation written in R.

Usage

snpgdsVCF2GDS_R(vcf.fn, out.fn, nblock=1024,
    method = c("biallelic.only", "copy.num.of.ref"),
    compress.annotation="LZMA_RA", snpfirstdim=FALSE, option = NULL,
    verbose=TRUE)

Arguments

vcf.fn

the file name of the input VCF file; vcf.fn can be a vector of file names, see details

out.fn

the file name of the output GDS file

nblock

the number of lines held in the read buffer

method

either "biallelic.only" by default or "copy.num.of.ref", see details

compress.annotation

the compression method for the GDS variables, except "genotype"; optional values are defined in the function add.gdsn

snpfirstdim

if TRUE, genotypes are stored in individual-major mode (i.e., all SNPs are listed for the first individual, then all SNPs for the second individual, etc.)

option

NULL or an object from snpgdsOption, see details

verbose

if TRUE, show information
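
The two storage orders selected by snpfirstdim can be pictured with a small base R matrix. This is only a sketch: the toy dosages are taken from the first two records of the bundled sequence.vcf, and R's column-major order stands in for the storage order, whereas the real on-disk layout is managed by gdsfmt.

```r
# Toy reference-allele dosages: 3 samples (rows) x 2 SNPs (columns).
geno <- matrix(c(2L, 1L, 0L,    # SNP 1: NA00001, NA00002, NA00003
                 2L, 1L, 2L),   # SNP 2
               nrow = 3,
               dimnames = list(c("NA00001", "NA00002", "NA00003"),
                               c("snp1", "snp2")))

# SNP-major (sample x SNP): all samples for SNP 1, then all samples for SNP 2.
snp_major <- as.vector(geno)

# Individual-major (SNP x sample): all SNPs for the first individual,
# then all SNPs for the second individual, and so on.
ind_major <- as.vector(t(geno))

print(snp_major)
print(ind_major)
```

Both vectors hold the same six values; only the order in which they are laid out differs.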

Details

GDS – Genomic Data Structure, a format for storing array-oriented genetic data; it is the file format used by the gdsfmt package.

VCF – the Variant Call Format, a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.

If vcf.fn contains more than one file name, snpgdsVCF2GDS_R merges all datasets together, provided that they all contain the same samples. This is useful for combining genetic data when the VCF data are split by chromosome.
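
Before passing several files at once, it can help to confirm that they share the same samples. The following base R sketch compares the sample columns of the "#CHROM" header lines. The helper vcf_samples and the two toy files are hypothetical and written only for this illustration; they are not part of SNPRelate.

```r
# Hypothetical helper: return the sample IDs from the "#CHROM" header line.
# Reads the whole file, which is fine for a sketch but wasteful for large VCFs.
vcf_samples <- function(fn) {
    lines <- readLines(fn)
    header <- lines[startsWith(lines, "#CHROM")][1]
    fields <- strsplit(header, "\t", fixed = TRUE)[[1]]
    fields[-(1:9)]   # the first nine columns are the fixed VCF fields
}

# Two toy per-chromosome files written to temporary paths.
tab <- function(...) paste(c(...), collapse = "\t")
hdr <- tab("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
           "FORMAT", "NA00001", "NA00002")
f1 <- tempfile(fileext = ".vcf")
f2 <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.1", hdr,
             tab("1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0|0", "0|1")), f1)
writeLines(c("##fileformat=VCFv4.1", hdr,
             tab("2", "200", ".", "C", "T", ".", "PASS", ".", "GT", "0|1", "1|1")), f2)

samples <- lapply(c(f1, f2), vcf_samples)
same_samples <- all(vapply(samples, identical, logical(1), samples[[1]]))
print(same_samples)

# If the check passes, the files could be converted together, e.g.
# snpgdsVCF2GDS_R(c(f1, f2), "merged.gds")
```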

method = "biallelic.only": extract bi-allelic and polymorphic SNP data (excluding monomorphic variants); method = "copy.num.of.ref": extract and store the dosage (0, 1, 2) of the reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs, indels and structural variants.
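
As a rough illustration of the two methods, the base R sketch below counts reference alleles in VCF GT strings and flags records that look like bi-allelic SNPs. The helper ref_dosage and the simple filter are assumptions made for this example; the package's own parsing and filtering rules may differ in detail (for instance in how missing calls are coded).

```r
# Hypothetical helper: number of reference alleles ("0") in one GT string.
ref_dosage <- function(gt) {
    alleles <- strsplit(gt, "[|/]")[[1]]       # split phased or unphased calls
    if (any(alleles == ".")) return(NA_integer_)
    sum(alleles == "0")
}

# GT fields of the first record (rs6054257) in the bundled sequence.vcf.
gt <- c(NA00001 = "0|0", NA00002 = "1|0", NA00003 = "1/1")
dosage <- vapply(gt, ref_dosage, integer(1))
print(dosage)

# A multi-allelic call such as "1|2" carries no copy of the reference allele.
print(ref_dosage("1|2"))

# REF and ALT columns of the five records in sequence.vcf.
ref <- c("G", "T", "A", "T", "GTC")
alt <- c("A", "A", "G,T", ".", "G,GTCT")

# Simplified filter: single-base REF and a single-base ALT allele.
keep <- nchar(ref) == 1 & alt %in% c("A", "C", "G", "T")
print(sum(keep))
```

With this simplified filter two of the five records are kept, which agrees with the SNP counts reported in the example output for "biallelic.only" (2 SNPs) and "copy.num.of.ref" (5 SNPs).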

Haploid and triploid calls are allowed in the conversion. The variable snp.id stores the original row index of each variant, and the variable snp.rs.id stores the rs id.

The user can use option to specify the range of codes for autosomes. Humans have 22 autosomes (coded 1 to 22), whereas dogs have 38. The default settings are for humans. To import a dog VCF file, pass option = snpgdsOption(autosome.end=38). New chromosome codes can also be defined: e.g., with option = snpgdsOption(Z=27), "Z" is replaced by the number 27.
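
A minimal sketch of passing an option object, assuming SNPRelate is installed. The bundled human sequence.vcf is used here only as a stand-in for a dog VCF file, and the output is written to a temporary path chosen for this example.

```r
library(SNPRelate)

# Stand-in input: the small VCF file shipped with SNPRelate.
vcf.fn <- system.file("extdata", "sequence.vcf", package = "SNPRelate")

# Temporary output file, so that nothing is left in the working directory.
out.fn <- tempfile(fileext = ".gds")

# Declare 38 autosomes (as for dogs) instead of the human default of 22.
opt <- snpgdsOption(autosome.end = 38)

snpgdsVCF2GDS_R(vcf.fn, out.fn, method = "biallelic.only",
    option = opt, verbose = FALSE)
snpgdsSummary(out.fn)
```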

Value

None.

Author(s)

Xiuwen Zheng

References

The variant call format and VCFtools. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R; 1000 Genomes Project Analysis Group. Bioinformatics. 2011 Aug 1;27(15):2156-8. Epub 2011 Jun 7.

See Also

snpgdsVCF2GDS, snpgdsOption, snpgdsBED2GDS

Examples

# Load the package (this also attaches gdsfmt)
library(SNPRelate)

# The VCF file shipped with SNPRelate
vcf.fn <- system.file("extdata", "sequence.vcf", package="SNPRelate")
cat(readLines(vcf.fn), sep="\n")

# Keep only bi-allelic, polymorphic SNPs
snpgdsVCF2GDS_R(vcf.fn, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")

snpgdsVCF2GDS_R(vcf.fn, "test2.gds", method="biallelic.only")
snpgdsSummary("test2.gds")

# Store the dosage of the reference allele for all variant sites
snpgdsVCF2GDS_R(vcf.fn, "test3.gds", method="copy.num.of.ref")
snpgdsSummary("test3.gds")

snpgdsVCF2GDS_R(vcf.fn, "test4.gds", method="copy.num.of.ref")
snpgdsSummary("test4.gds")

Example output

Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3
Start snpgdsVCF2GDS ...
	Extracting bi-allelic and polymorhpic SNPs.
	Scanning ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
	content: 5 rows x 12 columns
Mon Feb 11 21:49:39 2019 	store sample id, snp id, position, and chromosome.
	start writing: 3 samples, 2 SNPs ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
[1] 1
Mon Feb 11 21:49:39 2019 	Done.
The file name: /work/tmp/test1.gds 
The total number of samples: 3 
The total number of SNPs: 2 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
Start snpgdsVCF2GDS ...
	Extracting bi-allelic and polymorhpic SNPs.
	Scanning ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
	content: 5 rows x 12 columns
Mon Feb 11 21:49:39 2019 	store sample id, snp id, position, and chromosome.
	start writing: 3 samples, 2 SNPs ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
[1] 1
Mon Feb 11 21:49:39 2019 	Done.
The file name: /work/tmp/test2.gds 
The total number of samples: 3 
The total number of SNPs: 2 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
Start snpgdsVCF2GDS ...
	Storing dosage of the reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs, indels and structural variants.
	Scanning ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
	content: 5 rows x 12 columns
Mon Feb 11 21:49:39 2019 	store sample id, snp id, position, and chromosome.
	start writing: 3 samples, 5 SNPs ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
Mon Feb 11 21:49:39 2019 	Done.
Some of 'snp.allele' are not standard (e.g., A/G,T).
The file name: /work/tmp/test3.gds 
The total number of samples: 3 
The total number of SNPs: 5 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
The number of valid samples: 3 
The number of biallelic unique SNPs: 2 
Start snpgdsVCF2GDS ...
	Storing dosage of the reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs, indels and structural variants.
	Scanning ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
	content: 5 rows x 12 columns
Mon Feb 11 21:49:39 2019 	store sample id, snp id, position, and chromosome.
	start writing: 3 samples, 5 SNPs ...
	file: /usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf
Mon Feb 11 21:49:39 2019 	Done.
Some of 'snp.allele' are not standard (e.g., A/G,T).
The file name: /work/tmp/test4.gds 
The total number of samples: 3 
The total number of SNPs: 5 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
The number of valid samples: 3 
The number of biallelic unique SNPs: 2 

SNPRelate documentation built on Nov. 8, 2020, 5:31 p.m.