LoadFiltering: To load and filter variants in batch mode


View source: R/LoadFiltering.R

Description

Loads data from study subjects and performs position-level quality filtering. The index.txt file contains the group status and VCF file location of each subject. The function takes the index.txt file as input and loads the variant and sequence call files automatically.

Usage

LoadFiltering(file, datadir=NULL, filtering=TRUE, alter.PL=20,
alter.AD=3, alter.ADP=NULL, QUAL=20, DP=c(10,500), GQ=20, FILTER=NULL,
tabix="tabix", parallel=FALSE, pn=4, type=NULL, ...)

Arguments

file

Formatted input file including the annotation information of study subjects.

datadir

The working directory containing the index file and the variant data. If it is NULL, the annotation file must give the absolute paths of the variant files.

filtering

Logical value. Whether to filter VCF data by specified quality criteria.

alter.PL

Phred-scaled genotype likelihoods of variant call to define a variant. The PL information can be extracted from PL column (both GATK and Samtools) in the VCF data.

alter.AD

The minimum depth of variant allele when alter is TRUE. The information of variant allele depth can be extracted from AD (GATK) or DP4 (Samtools) column in the VCF data.

alter.ADP

The minimum percentage of read depth containing variant allele.

QUAL

Phred-scaled variant likelihoods of variant call. The QUAL information can be extracted from QUAL column (both GATK and Samtools) in the VCF data.

DP

The minimum and maximum of position-level read depth. The DP information can be extracted from DP column (both GATK and Samtools) in the VCF data.

GQ

Phred-scaled score for most likely genotype at position of interest. The GQ information can be extracted from GQ column (both GATK and Samtools) in the VCF data. If NULL, the option will be ignored.

FILTER

'NULL' or 'PASS'. VCF files produced by GATK label the quality status of each position in the FILTER column. If the VCF data are produced by Samtools, the FILTER column will be empty. If 'NULL' is set, all variants will be parsed. If 'PASS' is set, only variants with the 'PASS' label will be parsed.

tabix

The file path of the tabix executable.

parallel

If TRUE, the function will run in parallel mode.

pn

The number of CPUs to use if parallel is TRUE.

type

MPI type. See help(sfInit) for details.

...

Arguments to pass to the method sfInit of the snowfall package.

Details

file

The input file contains the annotation information of each sample, one sample per row. Its four tab-separated columns are the sample name (required), the group status (required), the variant call file name (required) and the sequence call file name (optional). The sample name column lists the name of each sample. The group status column lists the group each sample belongs to (e.g., aggressive, benign or normal). The variant call file name column lists the path of the VCF-formatted variant call file. The sequence call file name column lists the path of the compressed VCF sequence call file.

High-volume data in tab-delimited VCF format can be efficiently compressed with the bgzip program and retrieved with the tabix program from the open-source Samtools package. If a VCF file is compressed by bgzip, tabix must be installed. If tabix is not on the system PATH, its path must be given in the tabix argument.
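As an illustration of the layout described above, the following sketch writes a small tab-delimited index file with base R and reads it back. The sample names and file paths are invented, and the absence of a header row is an assumption, since this page does not say whether one is expected.

```r
# Hypothetical sample annotations; names and paths are made up for illustration.
index <- data.frame(
  sample   = c("S1", "S2", "S3"),
  group    = c("aggressive", "benign", "normal"),
  variant  = c("S1.var.vcf", "S2.var.vcf", "S3.var.vcf"),
  sequence = c("S1.seq.vcf.gz", "S2.seq.vcf.gz", "S3.seq.vcf.gz"),
  stringsAsFactors = FALSE
)

# Write tab-delimited text. Omitting the header row is an assumption.
index_path <- file.path(tempdir(), "index.txt")
write.table(index, index_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to check the layout: one row per sample, four columns.
check <- read.delim(index_path, header = FALSE, stringsAsFactors = FALSE)
print(check)
```

If datadir is left as NULL, the file name columns would need to hold absolute paths instead of the relative names used here.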

Quality criteria

Details of the quality scores in VCF data can be found at http://www.1000genomes.org/node/101.
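To make the position-level thresholds concrete, here is a minimal sketch that applies QUAL, DP, GQ and FILTER cut-offs to a toy table of calls. It is not the package's implementation: filter_calls and the calls data frame are hypothetical, and treating every threshold as inclusive is an assumption.

```r
# Toy per-position call table; column names mirror VCF fields, values are made up.
calls <- data.frame(
  POS    = c(101, 202, 303, 404, 505),
  QUAL   = c(50, 15, 60, 45, 80),
  FILTER = c("PASS", "PASS", "LowQual", "PASS", "PASS"),
  DP     = c(30, 40, 25, 5, 100),
  GQ     = c(99, 60, 40, 30, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows that meet every active criterion.
# A NULL value for GQ or FILTER switches that criterion off.
filter_calls <- function(x, QUAL = 20, DP = c(10, 500), GQ = 20, FILTER = NULL) {
  keep <- x$QUAL >= QUAL & x$DP >= DP[1] & x$DP <= DP[2]
  if (!is.null(GQ)) keep <- keep & x$GQ >= GQ
  if (!is.null(FILTER)) keep <- keep & x$FILTER == FILTER
  x[keep, , drop = FALSE]
}

kept      <- filter_calls(calls)                   # default thresholds
kept_pass <- filter_calls(calls, FILTER = "PASS")  # additionally require PASS
print(kept$POS)
print(kept_pass$POS)
```

With the default values, position 202 fails on QUAL, 404 on the minimum DP and 505 on GQ; requiring the 'PASS' label also drops position 303.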

parallel

By default, the function extracts calls in sequential mode. If parallel is TRUE, the function extracts calls in parallel mode. The packages Rmpi and snowfall are required for parallel mode.
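The general snowfall pattern behind a parallel run can be sketched as follows. This is not the function's internal code: count_chars is a hypothetical stand-in for per-sample work, a socket cluster (type = "SOCK") is used so that Rmpi is not needed, and the correspondence between pn and the cpus argument of sfInit is an assumption.

```r
# Stand-in for per-sample work; a real run would parse one VCF file per sample.
count_chars <- function(sample_name) nchar(sample_name)
samples <- c("S1", "S22", "S333")

if (requireNamespace("snowfall", quietly = TRUE)) {
  library(snowfall)
  # Start two workers on a socket cluster, map over the samples, then shut down.
  sfInit(parallel = TRUE, cpus = 2, type = "SOCK")
  res <- sfLapply(samples, count_chars)
  sfStop()
} else {
  # Sequential fallback when snowfall is not installed.
  res <- lapply(samples, count_chars)
}
print(unlist(res))
```

Both branches return a list with one element per sample, so downstream code does not need to know which mode was used.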

Value

The function returns a varlist with three components: vcflist, VarIndex and Samples.

vcflist

A list of vcf objects, one for each sample. If filtering is TRUE, the variant data are filtered by the specified quality criteria.

VarIndex

The indexes for all variant positions. TRUE denotes the presence of variant. FALSE denotes the absence of variant. NA denotes low coverage.

Samples

Sample annotations from the input index file.
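The three-valued coding of VarIndex can be summarised with ordinary logical operations, as in the sketch below. The layout used here (a logical matrix with one row per variant position and one column per sample) is an assumption made for illustration; this page does not state the actual structure.

```r
# Hypothetical index: rows are variant positions, columns are samples.
# TRUE = variant present, FALSE = variant absent, NA = low coverage.
var_index <- matrix(
  c(TRUE,  FALSE, NA,
    TRUE,  TRUE,  TRUE,
    FALSE, NA,    FALSE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("chr1:101", "chr1:202", "chr2:303"), c("S1", "S2", "S3"))
)

# Samples carrying the variant at each position, ignoring low-coverage calls.
n_present <- rowSums(var_index, na.rm = TRUE)

# Samples with low coverage at each position.
n_low_cov <- rowSums(is.na(var_index))

print(cbind(n_present, n_low_cov))
```

Keeping NA distinct from FALSE matters because a low-coverage position gives no evidence that the variant is absent.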

Author(s)

Qiang Hu

See Also

filtervcf

Examples

#setwd(system.file("extdata", package="VPA"))
#varflt <- LoadFiltering(file="index1.txt", filtering=TRUE, alter.PL=20,
#alter.AD=3)
#pattern <- cbind(A=c(1/4,1), B=c(0,0))
#varRes1 <- Patterning(varflt, pattern, var.PL=c(FALSE, TRUE))

VPA documentation built on May 2, 2019, 4:45 p.m.