filterLargeVCF: Pre-process of Large VCF File(s)


filterLargeVCF R Documentation

Pre-process of Large VCF File(s)

Description

Filter/extract one or multiple gene(s)/range(s) from a large *.vcf/*.vcf.gz file.

Usage

filterLargeVCF(VCFin = VCFin, VCFout = VCFout,
                Chr = Chr,
                POS = NULL,
                start = start,
                end = end,
                override = TRUE)

Arguments

VCFin

Path of input *.vcf/*.vcf.gz file.

VCFout

Path(s) of output *.vcf/*.vcf.gz file.

Chr

A single CHROM name or a vector of CHROM names.

POS, start, end

The range(s) to extract from the original VCF. POS: a vector consisting of a start and an end position, or a list with length equal to Chr, e.g. list(c(1,200), c(300,500), c(300,400)) indicates 3 ranges (1~200, 300~500 and 300~400). If POS is NULL, start and end are required, e.g. start = c(1, 30) and end = c(200, 150) indicates 2 ranges (1~200 and 30~150).

override

Whether to overwrite an existing file; defaults to TRUE.
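
The two ways of giving ranges (a POS list, or paired start and end vectors) describe the same information. As a minimal sketch in base R, using the example values from the POS description, paired vectors can be turned into the list form. The variable names are illustrative only, and nothing here calls geneHapR.

```r
# Two ranges given as paired vectors: 1~200 and 30~150
start <- c(1, 30)
end   <- c(200, 150)

# Pair each start with its end: one c(start, end) vector per range
POS <- Map(c, start, end)

# Inspect the resulting list of ranges
str(POS)
```

The resulting list has one element per range, which is the shape described for POS when several ranges are supplied.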

Details

This package imports VCF files with 'vcfR', which imports and manipulates VCF files efficiently in 'R'. However, importing a large VCF file is time- and memory-consuming. It is recommended to first filter/extract the variants in the target range with filterLargeVCF().

When filtering/extracting multiple genes/ranges, Chr and POS must have equal length. Results are saved to a single file if the user provides a single file path, or to multiple VCF files if a vector of file paths of the same length is provided.
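
The length rules for multiple ranges can be checked before a large file is read. The following is a sketch only: check_ranges() is a hypothetical helper, not part of geneHapR, and whether filterLargeVCF() performs equivalent checks internally is not stated here.

```r
# Hypothetical helper (not part of geneHapR): verify that the arguments
# for a multi-range extraction are mutually consistent.
check_ranges <- function(Chr, POS, VCFout) {
  # A single range may be given as a bare c(start, end) vector
  if (!is.list(POS)) POS <- list(POS)

  # Chr and POS must have equal length
  if (length(Chr) != length(POS))
    stop("Chr and POS must have equal length")

  # Every range needs exactly a start and an end
  if (!all(lengths(POS) == 2))
    stop("each element of POS must be c(start, end)")

  # Either one output file for everything, or one file per range
  if (!length(VCFout) %in% c(1L, length(Chr)))
    stop("VCFout must be a single path or one path per range")

  invisible(TRUE)
}

# Three ranges on the same scaffold, one output file per range
check_ranges(Chr    = rep("scaffold_1", 3),
             POS    = list(c(4300, 5000), c(5000, 6000), c(5000, 7000)),
             VCFout = c("filtered1.vcf.gz", "filtered2.vcf.gz",
                        "filtered3.vcf.gz"))
```

Failing early with a clear message is cheaper than discovering a length mismatch after a long read of the input file.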

However, if hundreds of genes/ranges need to be extracted from very large VCF file(s), it is preferable to process them with other Linux tools in a script on a server, such as 'vcftools' and 'bcftools'.
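
For the command-line route, one possible approach is to generate the 'bcftools' commands from a table of ranges and then run them in a shell script. The sketch below only builds the command strings and does not run them. The file names input.vcf.gz and filteredN.vcf.gz are placeholders, and it assumes the input is bgzip-compressed and indexed, which 'bcftools view -r' requires.

```r
# Target ranges (the example values used on this page)
chr   <- rep("scaffold_1", 3)
start <- c(4300, 5000, 5000)
end   <- c(5000, 6000, 7000)

# bcftools region syntax is CHROM:START-END
regions <- sprintf("%s:%d-%d", chr, as.integer(start), as.integer(end))

# One command per range; input and output names are placeholders
cmds <- sprintf("bcftools view -r %s -O z -o %s %s",
                regions,
                sprintf("filtered%d.vcf.gz", seq_along(regions)),
                "input.vcf.gz")

# Print the commands, e.g. to paste into a shell script
writeLines(cmds)
```

With hundreds of ranges, the same vectors could be read from a table of gene coordinates instead of being typed by hand.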

Value

No return value

Examples


 # Filtering of a small VCF should be done with `filter_vcf()`.
 # Here, however, a mini VCF is used just for example and testing.

 vcfPath <- system.file("extdata", "var.vcf.gz", package = "geneHapR")

 oldDir <- getwd()
 temp_dir <- tempdir()
 if(! dir.exists(temp_dir))
   dir.create(temp_dir)
 setwd(temp_dir)
 # extract a single gene/range from large vcf
 filterLargeVCF(VCFin = vcfPath, VCFout = "filtered.vcf.gz",
                Chr = "scaffold_1", POS = c(4300,5000), override = TRUE)

 # extract multi genes/ranges from large vcf
 filterLargeVCF(VCFin = vcfPath,
                VCFout = c("filtered1.vcf.gz",
                           "filtered2.vcf.gz",
                           "filtered3.vcf.gz"),
                Chr = rep("scaffold_1", 3),
                POS = list(c(4300, 5000),
                           c(5000, 6000),
                           c(5000, 7000)),
                override = TRUE)

 # restore the original working directory
 setwd(oldDir)


geneHapR documentation built on May 29, 2024, 11:59 a.m.