read_vcf | R Documentation |
The function reads VCF files for radiator and
generate a connection SeqArray
GDS object/file of class SeqVarGDSClass
(Zheng et al. 2017).
The Genomic Data Structure (GDS) file format is detailed in
gdsfmt.
The function as an advance mode that allows various filtering arguments to be used. Handy to prune the dataset based on various QCs but also to remove artifacts, duplicated information and thus lower the overall noise in the VCF.
Used internally in radiator and might be of interest for users.
read_vcf(
data,
strata = NULL,
filename = NULL,
vcf.stats = TRUE,
parallel.core = parallel::detectCores() - 1,
verbose = TRUE,
...
)
data |
(path, character) The VCF markers are biallelic SNPs or haplotypes.
To make the VCF population-ready, you have to use
|
strata |
(optional)
The strata file is a tab delimited file with a minimum of 2 columns headers:
|
filename |
(optional) The file name of the Genomic Data Structure (GDS) file.
radiator will append |
vcf.stats |
(optional, logical) Generates individuals and markers
statistics helpful for filtering.
These are very fast to generate and because computational
cost is minimal, even for huge VCFs, the default is |
parallel.core |
(optional) The number of core used for parallel
execution during import.
Default: |
verbose |
(optional, logical) When |
... |
(optional) To pass further arguments for fine-tuning the function. |
A vcf file of 35 GB with ~4 millions SNPs take about ~7 min with 8 CPU. A vcf file of 21 GB with ~2 millions SNPs take about ~5 min with 7 CPU.
After the file is generated, you can close your computer and come back to it a months later and it's now a matter of sec to open a connection. See example below.
markers heterozygosity and inbreeding coefficient:
The heterozygosity statistics (het obs and Fis) presented in this function
are calculated globally across strata, without the possibility to filter out
markers. We think this filtering for these statistics are best
addressed in filter_het
, filter_fis
and
filter_hwe
, after outlier individuals and
markers (for other stats) are blacklisted. Why ? The reason is that an excess of
heterozygotes and negative Fis values are not a bad thing per se.
Genomic regions under balancing selection may contain such markers and
statistics.
The function returns a GDS object.
PLINK: radiator fills the LOCUS
column of PLINK VCFs with
a unique integer based on the CHROM
column
(as.integer(factor(x = CHROM))
).
The COL
column is filled with 1L for lack of bettern info on this.
Not what you need ? Open an issue on GitHub for a request.
ipyrad: the pattern locus_
in the CHROM
column
is removed and used. The COL
column is filled with the same value as
POS
.
GATK: Some VCF have an ID
column filled with .
,
the LOCUS information is all contained along the linkage group in the
CHROM
column. To make it work with
radiator,
the ID
column is filled with the POS
column info.
platypus: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above.
freebayes: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above. samtools: Some VCF files don't have an ID filed with values, here the same thing is done as GATK VCF files above. stacks: with de novo approaches, the CHROM column is filled with "1", the LOCUS column correspond to the CHROM section in stacks VCF and the COL column is POS -1. With a reference genome, the ID column in stacks VCF is separated into "LOCUS", "COL", "STRANDS".
stacks problem: current version as some intrinsic problem with missing allele depth info, during the tidying process a message will highlight the number of genotypes impacted by the problem. When possible, the problem is corrected by adding the read depth info into the allele depth field.
dots-dots-dots ... allows to pass several arguments for fine-tuning the function:
whitelist.markers
: detailed in filter_whitelist
.
blacklist.id
: detailed in tidy_genomic_data
.
pop.select
: detailed in tidy_genomic_data
.
pop.levels
: detailed in tidy_genomic_data
.
pop.labels
: detailed in tidy_genomic_data
.
filter.strands
: (optional, character) Filter duplicate SNPs
found on different strands (+/-), options are:
"keep.both", "best.stats", "blacklist"
. "keep.both"
: does nothing
and duplicated markers are kept (not recommended, but here for testing purposes),
"best.stats"
: will keep only one, based on the best statistics
(MAC and missingness). "blacklist"
: discard all duplicated markers.
Default (filter.strands = "blacklist"
).
filter.common.markers
: (logical) Default: filter.common.markers = TRUE
.
Documented in filter_common_markers
.
filter.ma
: (integer) Default: filter.ma = NULL
.
To blacklist markers below a specific Minor Allele Count, Frequency or Depth (calculated overall/global).
Documented in filter_ma
.
filter.coverage
: (optional, string)
Default: filter.coverage = NULL
. To blacklist markers based on mean coverage.
Documented in filter_coverage
.
filter.genotyping
: (integer) To blacklist markers with too
many missing data. e.g. filter.genotyping = 0.1
, will only keep
markers with missing rate <= to 10 percent.
Documented in filter_genotyping
.
filter.snp.position.read
: 3 options available, "outliers", "q75", "iqr"
.
This argument will blacklist markers based on it's position on the read.
filter.snp.position.read = "outliers"
, will remove markers based
on outlier statistics of the position on the reads. e.g. if majority of SNPs
are found between 10 and 90 pb, and very few above and below, those markers are
discarded. Use this function argument with dataset with problematic assembly,
or de novo assembly with undocumented or poorly selected
mismatch threshold.
filter.short.ld
: Described in filter_ld
filter.long.ld
: Described in filter_ld
long.ld.missing
: Described in filter_ld
ld.method
: (optional, character) The values available are
"composite"
, for LD composite measure, "r"
for R coefficient
(by EM algorithm assuming HWE, it could be negative), "r2"
for r^2,
"dprime"
for D',
"corr"
for correlation coefficient. The method corr and composite are
equivalent when SNPs are coded based on the presence of the alternate allele
(0, 1, 2
).
Default: ld.method = "r2"
.
filter.individuals.missing
: (double) Use this argument to
blacklist individuals with too many missing data.
e.g. filter.individuals.missing = 0.7
, will remove individuals with >
0.7 or 70
with some dataset.
markers.info
: (character) With default: markers.info = NULL
,
all the variable in the vcf INFO field are imported.
To import only DP (the SNP total read depth) and AF (the SNP allele frequency),
use markers.info = c("DP", "AF")
.
Using, markers.info = character(0)
will not import INFO variables.
path.folder
: to write ouput in a specific path
(used internally in radiator). Default: path.folder = getwd()
.
If the supplied directory doesn't exist, it's created.
random.seed
: (integer, optional) For reproducibility, set an integer
that will be used inside codes that uses randomness. With default,
a random number is generated, printed and written in the directory.
Default: random.seed = NULL
.
subsample.markers.stats
: By default, when no filters are
requested and that the number of markers is > 200K,
0.2 of markers are randomly selected to generate the
statistics (individuals and markers). This is an all-around and
reliable number.
In doubt, overwrite this value by using 1 (all markers selected) and
expect a small computational cost.
Thierry Gosselin thierrygosselin@icloud.com
Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray – A storage-efficient high-performance data format for WGS variant calls. Bioinformatics.
Danecek P, Auton A, Abecasis G et al. (2011) The variant call format and VCFtools. Bioinformatics, 27, 2156-2158.
filter_ld
tidy_genomic_data
tidy_vcf
## Not run:
# require(SeqArray)
# with built-in defaults:
data <- radiator::read_vcf(data = "populations.snps.vcf")
# Because no strata file is provided, the individuals are assumed to be in
# the same group.
# If you need to filter the VCF I recommend using ?radiator::filter_rad
# But if you're certain about the filters thresholds, additional arguments are available:
data <- radiator::read_vcf(
data = "populations.snps.vcf",
strata = "strata_salamander.tsv",
path.folder = "salamander",
filter.individuals.missing = "outliers",
filter.common.markers = TRUE,
filter.strands = "blacklist",
filter.ma = 4,
filter.genotyping = 0.3,
filter.snp.position.read = "outliers",
filter.short.ld = "mac",
filter.long.ld = NULL,
verbose = TRUE
)
# In a new R session, to re-open the GDS file/connection:
data <- radiator::read_rad(data = "radiator_20200911@0748.gds")
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.