Description Usage Arguments Details Value Author(s) References See Also Examples
View source: R/SNPRelate_Main.r
Reformat a Variant Call Format (VCF) file
1 2 3 4 | snpgdsVCF2GDS(vcf.fn, outfn.gds, nblock=1024,
method = c("biallelic.only", "copy.num.of.ref"),
compress.annotation="ZIP.max", snpfirstdim=FALSE, option = NULL,
verbose=TRUE)
|
vcf.fn |
the file name of VCF format, |
outfn.gds |
the output gds file |
nblock |
the buffer lines |
method |
either "biallelic.only" by default or "copy.num.of.ref", see details |
compress.annotation |
the compression flag of the nodes stored, except
"genotype"; the string value is defined in the function of |
snpfirstdim |
if TRUE, genotypes are stored in the individual-major mode, (i.e, list all SNPs for the first individual, and then list all SNPs for the second individual, etc) |
option |
|
verbose |
if TRUE, show information |
GDS – Genomic Data Structures used for storing genetic array-oriented data, and the file format used in the gdsfmt package.
VCF – The Variant Call Format (VCF), which is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations.
If there are more than one file name in vcf.fn
, snpgdsVCF2GDS
will
merge all dataset together once they all contain the same samples. It is useful to
combine genetic data if VCF data are divided by chromosomes.
method = "biallelic.only"
: to exact bi-allelic and polymorhpic SNP data;
method = "copy.num.of.ref"
: to extract and store dosage (0, 1, 2) of the
reference allele for all variant sites, including bi-allelic SNPs, multi-allelic SNPs,
indels and structural variants.
Haploid and triploid calls are allowed in the transfer, the variable
snp.id
stores the original the row index of variants, and the variable
snp.rs.id
stores the rs id.
The user could use option
to specify the range of code for autosomes.
For humans there are 22 autosomes (from 1 to 22), but dogs have 38 autosomes.
Note that the default settings are used for humans. The user could call
option = snpgdsOption(autosome.end=38)
for importing the VCF file of dog.
It also allow define new chromosome coding, e.g., option = snpgdsOption(Z=27)
.
None.
Xiuwen Zheng
The variant call format and VCFtools. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R; 1000 Genomes Project Analysis Group. Bioinformatics. 2011 Aug 1;27(15):2156-8. Epub 2011 Jun 7.
http://corearray.sourceforge.net/
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | # The VCF file
vcf.fn <- system.file("extdata", "sequence.vcf", package="SNPRelate")
cat(readLines(vcf.fn), sep="\n")
snpgdsVCF2GDS(vcf.fn, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")
snpgdsVCF2GDS(vcf.fn, "test2.gds", method="biallelic.only", snpfirstdim=TRUE)
snpgdsSummary("test2.gds")
snpgdsVCF2GDS(vcf.fn, "test3.gds", method="copy.num.of.ref")
snpgdsSummary("test3.gds")
snpgdsVCF2GDS(vcf.fn, "test4.gds", method="copy.num.of.ref", snpfirstdim=TRUE)
snpgdsSummary("test4.gds")
# open "test1.gds"
(genofile <- openfn.gds("test1.gds"))
read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))
# close the genotype file
closefn.gds(genofile)
# open "test2.gds"
(genofile <- openfn.gds("test2.gds"))
read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))
# close the genotype file
closefn.gds(genofile)
# open "test3.gds"
(genofile <- openfn.gds("test3.gds"))
read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))
# close the genotype file
closefn.gds(genofile)
# open "test4.gds"
(genofile <- openfn.gds("test4.gds"))
read.gdsn(index.gdsn(genofile, "sample.id"))
read.gdsn(index.gdsn(genofile, "snp.rs.id"))
read.gdsn(index.gdsn(genofile, "genotype"))
# close the genotype file
closefn.gds(genofile)
|
Loading required package: gdsfmt
SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
VCF Format ==> SNP GDS Format
Method: exacting biallelic SNPs
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
import 2 variants.
+ genotype { Bit2 3x2, 2B } *
Optimize the access efficiency ...
Clean up the fragments of GDS file:
open the file 'test1.gds' (2.4K)
# of fragments: 39
save to 'test1.gds.tmp'
rename 'test1.gds.tmp' (2.2K, reduced: 228B)
# of fragments: 20
The file name: /work/tmp/test1.gds
The total number of samples: 3
The total number of SNPs: 2
SNP genotypes are stored in SNP-major mode (Sample X SNP).
VCF Format ==> SNP GDS Format
Method: exacting biallelic SNPs
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
import 2 variants.
+ genotype { Bit2 3x2, 2B } *
SNP genotypes: 3 samples, 2 SNPs
Genotype matrix is being transposed ...
Optimize the access efficiency ...
Clean up the fragments of GDS file:
open the file 'test2.gds' (2.5K)
# of fragments: 41
save to 'test2.gds.tmp'
rename 'test2.gds.tmp' (2.2K, reduced: 333B)
# of fragments: 20
The file name: /work/tmp/test2.gds
The total number of samples: 3
The total number of SNPs: 2
SNP genotypes are stored in individual-major mode (SNP X Sample).
VCF Format ==> SNP GDS Format
Method: dosage (0,1,2) of reference allele for all variant sites
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
import 5 variants.
+ genotype { Bit2 3x5, 4B } *
Optimize the access efficiency ...
Clean up the fragments of GDS file:
open the file 'test3.gds' (2.5K)
# of fragments: 39
save to 'test3.gds.tmp'
rename 'test3.gds.tmp' (2.3K, reduced: 228B)
# of fragments: 20
Some of 'snp.allele' are not standard (e.g., A/G,T).
The file name: /work/tmp/test3.gds
The total number of samples: 3
The total number of SNPs: 5
SNP genotypes are stored in SNP-major mode (Sample X SNP).
The number of valid samples: 3
The number of biallelic unique SNPs: 2
VCF Format ==> SNP GDS Format
Method: dosage (0,1,2) of reference allele for all variant sites
Number of samples: 3
Parsing "/usr/local/lib/R/site-library/SNPRelate/extdata/sequence.vcf" ...
import 5 variants.
+ genotype { Bit2 3x5, 4B } *
SNP genotypes: 3 samples, 5 SNPs
Genotype matrix is being transposed ...
Optimize the access efficiency ...
Clean up the fragments of GDS file:
open the file 'test4.gds' (2.6K)
# of fragments: 41
save to 'test4.gds.tmp'
rename 'test4.gds.tmp' (2.3K, reduced: 335B)
# of fragments: 20
Some of 'snp.allele' are not standard (e.g., A/G,T).
The file name: /work/tmp/test4.gds
The total number of samples: 3
The total number of SNPs: 5
SNP genotypes are stored in individual-major mode (SNP X Sample).
The number of valid samples: 3
The number of biallelic unique SNPs: 2
File: /work/tmp/test1.gds (2.2K)
+ [ ] *
|--+ sample.id { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id { Int32 2 ZIP_ra(300.0%), 31B }
|--+ snp.rs.id { Str8 2 ZIP_ra(263.6%), 36B }
|--+ snp.position { Int32 2 ZIP_ra(325.0%), 33B }
|--+ snp.chromosome { Str8 2 ZIP_ra(400.0%), 31B }
|--+ snp.allele { Str8 2 ZIP_ra(325.0%), 33B }
|--+ genotype { Bit2 3x2, 2B } *
\--+ snp.annot [ ]
|--+ qual { Float32 2 ZIP_ra(325.0%), 33B }
\--+ filter { Str8 2 ZIP_ra(300.0%), 34B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""
[,1] [,2]
[1,] 2 2
[2,] 1 1
[3,] 0 2
File: /work/tmp/test2.gds (2.2K)
+ [ ] *
|--+ sample.id { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id { Int32 2 ZIP_ra(300.0%), 31B }
|--+ snp.rs.id { Str8 2 ZIP_ra(263.6%), 36B }
|--+ snp.position { Int32 2 ZIP_ra(325.0%), 33B }
|--+ snp.chromosome { Str8 2 ZIP_ra(400.0%), 31B }
|--+ snp.allele { Str8 2 ZIP_ra(325.0%), 33B }
|--+ genotype { Bit2 2x3, 2B } *
\--+ snp.annot [ ]
|--+ qual { Float32 2 ZIP_ra(325.0%), 33B }
\--+ filter { Str8 2 ZIP_ra(300.0%), 34B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" ""
[,1] [,2] [,3]
[1,] 2 1 0
[2,] 2 1 2
File: /work/tmp/test3.gds (2.3K)
+ [ ] *
|--+ sample.id { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id { Int32 5 ZIP_ra(160.0%), 39B }
|--+ snp.rs.id { Str8 5 ZIP_ra(146.9%), 54B }
|--+ snp.position { Int32 5 ZIP_ra(190.0%), 45B }
|--+ snp.chromosome { Str8 5 ZIP_ra(153.3%), 30B }
|--+ snp.allele { Str8 5 ZIP_ra(148.3%), 50B }
|--+ genotype { Bit2 3x5, 4B } *
\--+ snp.annot [ ]
|--+ qual { Float32 5 ZIP_ra(180.0%), 43B }
\--+ filter { Str8 5 ZIP_ra(129.2%), 38B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" "" "rs6040355" "" "microsat1"
[,1] [,2] [,3] [,4] [,5]
[1,] 2 2 0 2 1
[2,] 1 1 0 2 1
[3,] 0 2 0 2 0
File: /work/tmp/test4.gds (2.3K)
+ [ ] *
|--+ sample.id { Str8 3 ZIP_ra(125.0%), 37B }
|--+ snp.id { Int32 5 ZIP_ra(160.0%), 39B }
|--+ snp.rs.id { Str8 5 ZIP_ra(146.9%), 54B }
|--+ snp.position { Int32 5 ZIP_ra(190.0%), 45B }
|--+ snp.chromosome { Str8 5 ZIP_ra(153.3%), 30B }
|--+ snp.allele { Str8 5 ZIP_ra(148.3%), 50B }
|--+ genotype { Bit2 5x3, 4B } *
\--+ snp.annot [ ]
|--+ qual { Float32 5 ZIP_ra(180.0%), 43B }
\--+ filter { Str8 5 ZIP_ra(129.2%), 38B }
[1] "NA00001" "NA00002" "NA00003"
[1] "rs6054257" "" "rs6040355" "" "microsat1"
[,1] [,2] [,3]
[1,] 2 1 0
[2,] 2 1 2
[3,] 0 0 0
[4,] 2 2 2
[5,] 1 1 0
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.