vcf_statistics: Calculating statistics on VCF-files.

Description Usage Arguments Value Note Author(s) References See Also Examples

View source: R/vcf_statistics.R

Description

This function estimates variable recombination rates from population genetic data using VCF-files. Therefore, a segmentation algorithm with specific segment lengths (segLength) and type-I error probability (alpha, α) is applied. The returned object can be plotted with the plot-function of the package stepR.

Usage

1
2
3
4
5
vcf_statistics(seqName, alpha = 0.05, quant = 0.35, segLength = 1000, pathLDhat = "",
       pathPhi = "", format = "fasta", refName = NULL, start = NULL, constant = F,
       rescale = F, status = T, polyThres = 0, cores = 1, accept = F,
       demography = F, regMod = "", out = "", lengthofseq = NULL,
       chr = NULL, startofseq = NULL, endofseq = NULL)

Arguments

seqName

A character string containing the full path and the name of the sequence file in fasta of vcf format. It is necessary to add the extension ("fileName.fa", "fileName.fasta", "fileName.vcf") in order to run LDJump. In case that format equals to DNABin the seqName equals to the name of the DNABin-object (without any extension).

alpha

A value from the interval (0,1) for the type-I error probability α used in the segmentation algorithm. We recommend to use 0.05. We enabled to estimate the recombination map efficiently (without recalculating all summary statistics) under several type-I errors when LDJump is applied with a vector of type-I error probabilities.

quant

A value between 0.1 and 0.5 with 0.05 distances in between which reflects the quantile used in the quantile regression. We recommend to use the 0.35 quantile which is the default value.

segLength

An integer value for the length of the segments, provided by the user. The default value of 1000 is our recommended value (1kb). The number of resulting segments, based on the sequence length is calculated within the funtion.

pathLDhat

A character string containing the path to LDhat. This path and the installation of LDhat is necessary for the computation of the package.

pathPhi

A character string containing the path to PhiPack. This path and the installation of PhiPack is necessary for the computation of the package.

format

A character string which can be fasta, vcf, or DNAStringSet. If fasta is used, the package will proceed with the computation of the recombination map. If vcf, the package will convert the data in vcf format to fasta format with the function vcfR_to_fasta and then proceed as in case fasta. For the last format the seqName must equal to the DNABin-object which contains the sequences.

refName

An (optional) path to the reference sequence for the region of interest downloaded from e.g. http://phase3browser.1000genomes.org/index.html. Only to be used in case that format == "vcf".

start

An (optional) integer value which reflects the starting position of the sequences in bp. Only to be used in case that format == "vcf".

constant

an optional logical value: by default FALSE estimating variable recombination rates. If TRUE, the constant recombination rate for the full sequence is estimated.

rescale

an optional logical value: by default FALSE the sequence length is not rescaled to 0 and 1. If TRUE this rescaling is performed.

status

an optional logical value: by default TRUE such that the current processing status of the segments is printed.

polyThres

a numeric value between 0 and 1. Used in data manipulation function DNAbin2genind: the minimum frequency of a minor allele for a locus to be considered as polymorphic (default to 0).

cores

a positive integer value which is by default 1. This integer reflects the number of cores to be used. Hence, when setting to an integer larger then one the same number of cores are used to compute the recombination map.

accept

an optional logical value: by default FALSE and LDJump checks for segments with less than 2 SNPs and requires user input to proceed. If set to TRUE, it is accepted that the rates for these segments are estimated via imputation.

demography

an optional logical value: by default FALSE indicating that the recombination rate estimation is estimated under neutrality. If TRUE the regression model estimated based on samples from populations under a bottleneck followed by rapid growth is used.

regMod

an optional character string: for the default empty string "" LDJump uses an existing regression model (constant population size or simple demography example, depending on demography). In oder to use the regression model estimated by user input demography, then this variable should equal to the name of the regression object. Please see the examples for more details.

out

an optional character string: by default an empty string "". Can be set to any user-defined string in order to rename all output files used within LDJump. This parameter enables to run LDJump from the same directory without creating interfering files in the working directory.

lengthofseq

an integer value describing the length of the sequence (Only required when running LDJump with VCF-Files). It is used to compute the number of segments and to loop through each segment.

chr

either an integer value between 1-22 or a character value "X"/"Y" describing which chromosome is used to run LDJump on (Only required when running LDJump with VCF-Files). It is required for the vcftools system call in order to slice the VCF-File into several segments.

startofseq

an integer value describing at which position the sequence to be analyzed starts (Only required when running LDJump with VCF-Files). The starting value is provided to vcftools to select the appropriate range for splicing the VCF-File into segments.

endofseq

an integer value describing at which position the sequence to be analyzed ends (Only required when running vcftools with VCF-Files).The ending value is provided to vcftools to select the appropriate range for splicing the VCF-File into segments.

Value

The following list is returned in the case of estimating variable recombination rates (constant == FALSE).

seq.full.cor

The final estimate of the recombination map. Depiction with plot-function of stepR package.

pr.full.cor

The (constant) estimates of the recombination rate per segment.

help

A helper matrix containing the summary statistics per segment used in the regression model.

alpha

The type-I error probability α.

nn

The number of individuals (more precisely sequences) for which the recombination map was estimated.

ll

Total sequence length

segs

The number of segments by which the sequence is divided. Resulting from the user-defined segment length (segLength).

For constant recombination rate estimation across the whole sequences (constant == TRUE), we provide the same list except for seq.full.cor.

Note

This function only works with unix and having PhiPack installed. We strongly recommend to also install LDhat (Auton and McVean (2007)) in order to decrease the computational cost of estimating recombination maps. Please properly check all paths to PhiPack and in case of LDhat as well as the sequence files. Previous versions (older than v 0.2.1) required lookup tables within the pairwise estimate of LDhat. These files should be located in the path "pathToLDhat/LDhat-master/lk_files". Lookup tables are contained in LDhat, but we still provide several lookup tables here. We strongly recommend to use the most recent version of LDJump in order to estimate recombination rates.

Author(s)

Philipp Hermann philipp.hermann@jku.at, Andreas Futschik, Fardokhtsadat Mohammadi fardokht.fm@gmail.com

References

Auton, A. and McVean, G. (2007). Recombination rate estimation in the presence of hotspots. Genome Research, 17(8), 1219-1227.

Bruen, T. C., Philippe, H., and Bryant, D. (2006). A simple and robust statistical test for detecting the presence of recombination. Genetics, 172(4):2665-2681.

Frick, K., Munk, A., and Sieling, H. (2014). Multiscale change-point inference. Journal of the Royal Statistical Society: Series B, 76(3), 495-580.

Futschik, A., Hotz, T., Munk, A., and Sieling, H. (2014). Multiscale DNA partitioning: Statistical evidence for segments. Bioinformatics, 30(16), 2255-2262.

Hermann, P., Heissl, A., Tiemann-Boege, I., and Futschik, A. (2019), LDJump: Estimating Variable Recombination Rates from Population Genetic Data. Mol Ecol Resour. doi:10.1111/1755-0998.12994.

Jombart T. and Ahmed I. (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. doi:10.1093/bioinformatics/btr521

Knaus BJ and Grünwald NJ (2017). VCFR: a package to manipulate and visualize variant call format data in R. Molecular Ecology Resources, 17(1), pp. 44-53. ISSN 757, doi:10.1111/1755-0998.12549.

McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. (2004). The fine-scale structure of recombination rate variation in the human genome. Science, 304(5670), 581-584.

Paradis E., Claude J. & Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.

The 1000 Genomes Project Consortium (2015). Aglobal reference for human genetic variation. Nature, 526(7571), 68-74.

Wood, S.N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B) 73(1):3-36

See Also

summary_statistics, vcfR_to_fasta, getPhi, get_smuce, smuceR, rq, gam, vcfR2DNAbin, diseq, genotype, readDNAStringSet, calcRegMod

Examples

1
2
3
4
5
6
7
##### Do not run these examples                                                             #####
##### result = LDJump(fileName, alpha = 0.05, segLength = 1000,                             #####
#####                 pathLDhat = getwd(), format = "fasta")                                #####
##### plot(results)                                                                         #####
##### results = LDJump("/pathToSample/HatLandscapeN16Len1000000Nrhs15_th0.01_540_1.fa",     #####
##### alpha = 0.05, segLength = 1000, pathLDhat = "/pathToLDhat", pathPhi = "/pathToPhi",   #####
##### format = "fasta", refName = NULL                                                      #####

PhHermann/LDJump documentation built on Nov. 16, 2019, 12:53 p.m.