snpgdsVCF2GDS: Reformat VCF file(s)
In SNPRelate: Parallel Computing Toolset for Relatedness and Principal Component Analysis of SNP Data

Description Usage Arguments Details Value Author(s) References See Also Examples

Reformat Variant Call Format (VCF) file(s)

1
2
3

snpgdsVCF2GDS(vcf.fn, out.fn, method=c("biallelic.only", "copy.num.of.ref"),
    snpfirstdim=FALSE, compress.annotation="LZMA_RA", compress.geno="",
    ref.allele=NULL, ignore.chr.prefix="chr", verbose=TRUE)

`vcf.fn`	the file name of VCF format, `vcf.fn` can be a vector, see details
`out.fn`	the file name of output GDS
`method`	either "biallelic.only" by default or "copy.num.of.ref", see details
`snpfirstdim`	if TRUE, genotypes are stored in the individual-major mode, (i.e, list all SNPs for the first individual, and then list all SNPs for the second individual, etc)
`compress.annotation`	the compression method for the GDS variables, except "genotype"; optional values are defined in the function `add.gdsn`
`compress.geno`	the compression method for "genotype"; optional values are defined in the function `add.gdsn`
`ref.allele`	`NULL` or a character vector indicating reference allele (like `"A", "G", "T", NA, ...`) for each site where `NA` to use the original reference allele in the VCF file(s). The length of character vector should be the total number of variants in the VCF file(s).
`ignore.chr.prefix`	a vector of character, indicating the prefix of chromosome which should be ignored, like "chr"; it is not case-sensitive
`verbose`	if `TRUE`, show information

GDS – Genomic Data Structures used for storing genetic array-oriented data, and the file format used in the gdsfmt package.

VCF – The Variant Call Format (VCF), which is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.

If there are more than one file names in vcf.fn, snpgdsVCF2GDS will merge all dataset together if they all contain the same samples. It is useful to combine genetic/genomic data together if VCF data are divided by chromosomes.

method = "biallelic.only": to exact bi-allelic and polymorhpic SNP data (excluding monomorphic variants); method = "copy.num.of.ref": to extract and store dosage (0, 1, 2) of the reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs, indels and structural variants.

Haploid and triploid calls are allowed in the transfer, the variable snp.id stores the original the row index of variants, and the variable snp.rs.id stores the rs id.

When snp.chromosome in the GDS file is character, SNPRelate treats a chromosome as autosome only if it can be converted to a numeric value ( like "1", "22"). It uses "X" and "Y" for non-autosomes instead of numeric codes. However, some software format chromosomes in VCF files with a prefix "chr". Users should remove that prefix when importing VCF files by setting ignore.chr.prefix = "chr".

The extended GDS format is implemented in the SeqArray package to support the storage of single nucleotide variation (SNV), insertion/deletion polymorphism (indel) and structural variation calls. It is strongly suggested to use SeqArray for large-scale whole-exome and whole-genome sequencing variant data instead of SNPRelate.

Return the file name of GDS format with an absolute path.

Xiuwen Zheng

The variant call format and VCFtools. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R; 1000 Genomes Project Analysis Group. Bioinformatics. 2011 Aug 1;27(15):2156-8. Epub 2011 Jun 7.

http://corearray.sourceforge.net/

snpgdsBED2GDS

# the VCF file
vcf.fn <- system.file("extdata", "sequence.vcf", package="SNPRelate")
cat(readLines(vcf.fn), sep="\n")

snpgdsVCF2GDS(vcf.fn, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")

snpgdsVCF2GDS(vcf.fn, "test2.gds", method="biallelic.only", snpfirstdim=TRUE)
snpgdsSummary("test2.gds")

snpgdsVCF2GDS(vcf.fn, "test3.gds", method="copy.num.of.ref", snpfirstdim=TRUE)
snpgdsSummary("test3.gds")

snpgdsVCF2GDS(vcf.fn, "test4.gds", method="copy.num.of.ref")
snpgdsSummary("test4.gds")

snpgdsVCF2GDS(vcf.fn, "test5.gds", method="copy.num.of.ref",
    ref.allele=c("A", "T", "T", "T", "A"))
snpgdsSummary("test5.gds")



# open "test1.gds"
(genofile <- snpgdsOpen("test1.gds"))

read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))

# close the file
snpgdsClose(genofile)


# open "test2.gds"
(genofile <- snpgdsOpen("test2.gds"))

read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))

# close the file
snpgdsClose(genofile)


# open "test3.gds"
(genofile <- snpgdsOpen("test3.gds"))

read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))

# close the file
snpgdsClose(genofile)


# open "test4.gds"
(genofile <- snpgdsOpen("test4.gds"))

read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "snp.allele"))
read.gdsn(index.gdsn(genofile, "genotype"))

# close the file
snpgdsClose(genofile)


# open "test5.gds"
(genofile <- snpgdsOpen("test5.gds"))

read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "snp.allele"))
read.gdsn(index.gdsn(genofile, "genotype"))

# close the file
snpgdsClose(genofile)


# delete the temporary files
unlink(paste("test", 1:5, ".gds", sep=""), force=TRUE)

Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3
VCF Format ==> SNP GDS Format
Method: exacting biallelic SNPs
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
	import 2 variants.
+ genotype   { Bit2 3x2, 2B } *
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'test1.gds' (2.4K)
    # of fragments: 39
    save to 'test1.gds.tmp'
    rename 'test1.gds.tmp' (2.2K, reduced: 228B)
    # of fragments: 20
The file name: /work/tmp/test1.gds 
The total number of samples: 3 
The total number of SNPs: 2 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
VCF Format ==> SNP GDS Format
Method: exacting biallelic SNPs
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
	import 2 variants.
+ genotype   { Bit2 3x2, 2B } *
SNP genotypes: 3 samples, 2 SNPs
Genotype matrix is being transposed ...
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'test2.gds' (2.5K)
    # of fragments: 41
    save to 'test2.gds.tmp'
    rename 'test2.gds.tmp' (2.2K, reduced: 333B)
    # of fragments: 20
The file name: /work/tmp/test2.gds 
The total number of samples: 3 
The total number of SNPs: 2 
SNP genotypes are stored in individual-major mode (SNP X Sample).
VCF Format ==> SNP GDS Format
Method: dosage (0,1,2) of reference allele for all variant sites
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
	import 5 variants.
+ genotype   { Bit2 3x5, 4B } *
SNP genotypes: 3 samples, 5 SNPs
Genotype matrix is being transposed ...
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'test3.gds' (2.6K)
    # of fragments: 41
    save to 'test3.gds.tmp'
    rename 'test3.gds.tmp' (2.3K, reduced: 335B)
    # of fragments: 20
Some of 'snp.allele' are not standard (e.g., A/G,T).
The file name: /work/tmp/test3.gds 
The total number of samples: 3 
The total number of SNPs: 5 
SNP genotypes are stored in individual-major mode (SNP X Sample).
The number of valid samples: 3 
The number of biallelic unique SNPs: 2 
VCF Format ==> SNP GDS Format
Method: dosage (0,1,2) of reference allele for all variant sites
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
	import 5 variants.
+ genotype   { Bit2 3x5, 4B } *
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'test4.gds' (2.5K)
    # of fragments: 39
    save to 'test4.gds.tmp'
    rename 'test4.gds.tmp' (2.3K, reduced: 228B)
    # of fragments: 20
Some of 'snp.allele' are not standard (e.g., A/G,T).
The file name: /work/tmp/test4.gds 
The total number of samples: 3 
The total number of SNPs: 5 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
The number of valid samples: 3 
The number of biallelic unique SNPs: 2 
VCF Format ==> SNP GDS Format
Method: dosage (0,1,2) of reference allele for all variant sites
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
	import 5 variants.
+ genotype   { Bit2 3x5, 4B } *
Optimize the access efficiency ...
Clean up the fragments of GDS file:
    open the file 'test5.gds' (2.5K)
    # of fragments: 39
    save to 'test5.gds.tmp'
    rename 'test5.gds.tmp' (2.3K, reduced: 228B)
    # of fragments: 20
Some of 'snp.allele' are not standard (e.g., T/A,G).
The file name: /work/tmp/test5.gds 
The total number of samples: 3 
The total number of SNPs: 5 
SNP genotypes are stored in SNP-major mode (Sample X SNP).
The number of valid samples: 3 
The number of biallelic unique SNPs: 2 
File: /work/tmp/test1.gds (2.2K)
+    [  ] *
|--+ sample.id   { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id   { Int32 2 ZIP_ra(300.0%), 31B }
|--+ snp.rs.id   { Str8 2 ZIP_ra(263.6%), 36B }
|--+ snp.position   { Int32 2 ZIP_ra(325.0%), 33B }
|--+ snp.chromosome   { Str8 2 ZIP_ra(400.0%), 31B }
|--+ snp.allele   { Str8 2 ZIP_ra(325.0%), 33B }
|--+ genotype   { Bit2 3x2, 2B } *
\--+ snp.annot   [  ]
   |--+ qual   { Float32 2 ZIP_ra(325.0%), 33B }
   \--+ filter   { Str8 2 ZIP_ra(300.0%), 34B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""         
     [,1] [,2]
[1,]    2    2
[2,]    1    1
[3,]    0    2
File: /work/tmp/test2.gds (2.2K)
+    [  ] *
|--+ sample.id   { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id   { Int32 2 ZIP_ra(300.0%), 31B }
|--+ snp.rs.id   { Str8 2 ZIP_ra(263.6%), 36B }
|--+ snp.position   { Int32 2 ZIP_ra(325.0%), 33B }
|--+ snp.chromosome   { Str8 2 ZIP_ra(400.0%), 31B }
|--+ snp.allele   { Str8 2 ZIP_ra(325.0%), 33B }
|--+ genotype   { Bit2 2x3, 2B } *
\--+ snp.annot   [  ]
   |--+ qual   { Float32 2 ZIP_ra(325.0%), 33B }
   \--+ filter   { Str8 2 ZIP_ra(300.0%), 34B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""         
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    2    1    2
File: /work/tmp/test3.gds (2.3K)
+    [  ] *
|--+ sample.id   { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id   { Int32 5 ZIP_ra(160.0%), 39B }
|--+ snp.rs.id   { Str8 5 ZIP_ra(146.9%), 54B }
|--+ snp.position   { Int32 5 ZIP_ra(190.0%), 45B }
|--+ snp.chromosome   { Str8 5 ZIP_ra(153.3%), 30B }
|--+ snp.allele   { Str8 5 ZIP_ra(148.3%), 50B }
|--+ genotype   { Bit2 5x3, 4B } *
\--+ snp.annot   [  ]
   |--+ qual   { Float32 5 ZIP_ra(180.0%), 43B }
   \--+ filter   { Str8 5 ZIP_ra(129.2%), 38B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""          "rs6040355" ""          "microsat1"
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    2    1    2
[3,]    0    0    0
[4,]    2    2    2
[5,]    1    1    0
File: /work/tmp/test4.gds (2.3K)
+    [  ] *
|--+ sample.id   { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id   { Int32 5 ZIP_ra(160.0%), 39B }
|--+ snp.rs.id   { Str8 5 ZIP_ra(146.9%), 54B }
|--+ snp.position   { Int32 5 ZIP_ra(190.0%), 45B }
|--+ snp.chromosome   { Str8 5 ZIP_ra(153.3%), 30B }
|--+ snp.allele   { Str8 5 ZIP_ra(148.3%), 50B }
|--+ genotype   { Bit2 3x5, 4B } *
\--+ snp.annot   [  ]
   |--+ qual   { Float32 5 ZIP_ra(180.0%), 43B }
   \--+ filter   { Str8 5 ZIP_ra(129.2%), 38B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""          "rs6040355" ""          "microsat1"
[1] "G/A"        "T/A"        "A/G,T"      "T/."        "GTC/G,GTCT"
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    2    0    2    1
[2,]    1    1    0    2    1
[3,]    0    2    0    2    0
File: /work/tmp/test5.gds (2.3K)
+    [  ] *
|--+ sample.id   { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id   { Int32 5 ZIP_ra(160.0%), 39B }
|--+ snp.rs.id   { Str8 5 ZIP_ra(146.9%), 54B }
|--+ snp.position   { Int32 5 ZIP_ra(190.0%), 45B }
|--+ snp.chromosome   { Str8 5 ZIP_ra(153.3%), 30B }
|--+ snp.allele   { Str8 5 ZIP_ra(135.5%), 49B }
|--+ genotype   { Bit2 3x5, 4B } *
\--+ snp.annot   [  ]
   |--+ qual   { Float32 5 ZIP_ra(180.0%), 43B }
   \--+ filter   { Str8 5 ZIP_ra(129.2%), 38B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""          "rs6040355" ""          "microsat1"
[1] "A/G"          "T/A"          "T/A,G"        "T/."          "A/GTC,G,GTCT"
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    2    1    2    0
[2,]    1    1    1    2    0
[3,]    2    2    2    2    0