read.vcf: Read Variant Calling Format Files

Description Usage Arguments Details Value Note Author(s) References See Also Examples

View source: R/IO.R

Description

This function reads allelic data from VCF (variant calling format) files.

Usage

1
read.vcf(file, from = 1, to = 10000, which.loci = NULL, quiet = FALSE)

Arguments

file

a file name specified by either a variable of mode character, or a quoted string.

from, to

the loci to read; by default, the first 10,000.

which.loci

an alternative way to specify which loci to read is to give their indices (see link{VCFloci} how to obtain them).

quiet

a logical: should the progress of the operation be printed?

Details

The VCF file can be compressed (*.gz) or not, but compressed files cannot be read remotely (see examples).

A TABIX file is not required (and will be ignored if present).

In the VCF standard, missing data are represented by a dot and these are read “as is” by the present function without trying to substitute by NA.

Value

an object of class c("loci", "data.frame").

Note

Like for VCFloci, the present function can read either compressed (*.gz) or uncompressed files. There should be no difference in performance between both types of files if they are relatively small (less than 1 Gb as uncompressed, equivalent to ~50 Mb when compressed). For bigger files, it is more efficient to uncompress them (if disk space is sufficient), especially if they have to be accessed several times during the same session.

Author(s)

Emmanuel Paradis

References

http://www.1000genomes.org/node/101

https://github.com/samtools/hts-specs

See Also

VCFloci, read.loci, read.gtx, write.loci

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
## Not run: 
## Chr Y from the 1000 genomes:
a <- "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502"
b <- "ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz"
## WARNING: the name of the file above may change
url <- paste(a, b, sep = "/")
## file is compressed, so we download first:
download.file(url, "chrY.vcf.gz")
## no need to uncompress to read now that the file is local:
(info <- VCFloci("chrY.vcf.gz"))
str(info) # show the modes of the columns

SNP <- is.snp(info)
table(SNP) # how many loci are SNPs?
## compare with:
table(getINFO(info, "VT"))

op <- par(mfcol = c(4, 1), xpd = TRUE)
lim <- c(2.65e6, 2.95e6)
## distribution of SNP and non-SNP mutations along the Y chr:
plot(info$POS, !SNP, "h", col = "red", main = "non-SNP mutations",
     xlab = "Position", ylab = "", yaxt = "n")
rect(lim[1], -0.1, lim[2], 1.1, lwd = 2, lty = 2)
plot(info$POS, SNP, "h", col = "blue", main = "SNP mutations",
     xlab = "Position", ylab = "", yaxt = "n")
rect(lim[1], -0.1, lim[2], 1.1, lwd = 2, lty = 2)
par(xpd = FALSE)
## same focusing on a smaller portion of the chromosome:
plot(info$POS, !SNP, "h", col = "red", xlim = lim, xlab = "Position",
     ylab = "", yaxt = "n")
plot(info$POS, SNP, "h", col = "blue", xlim = lim, xlab = "Position",
     ylab = "", yaxt = "n")
par(op)

## read both types of mutations separately:
X.SNP <- read.vcf("chrY.vcf.gz", which.loci = which(SNP))
X.other <- read.vcf("chrY.vcf.gz", which.loci = which(!SNP))

identical(rownames(X.SNP), VCFlabels("chrY.vcf.gz")) # TRUE
cat(VCFheader("chrY.vcf.gz"))

## get haplotypes for the first 10 loci:
h <- haplotype(X.SNP, 1:10)
## plot their frequencies:
op <- par(mar = c(3, 10, 1, 1))
plot(h, horiz=TRUE, las = 1)
par(op)

## End(Not run)

pegas documentation built on May 29, 2017, 6:33 p.m.