summary_statistics: Summary Statistics per Segment


View source: R/summary_statistics.R

Description

This function computes summary statistics for every segment of the sequence. It also generates the per-segment sequence files that LDhat and other packages then use to estimate all necessary parameters.

Usage

summary_statistics(x, s, segLength, segs, seqName, nn,
                   pathLDhat, pathPhi, status, polyThres, out, format, startofseq)

Arguments

x

An integer control variable for the considered segment of the DNA sequence.

s

An XStringSet object, as read by readDNAStringSet.

segLength

An integer value for the length of the segments, provided by the user. The default value of 1000 is our recommended value (1kb). The number of resulting segments, based on the sequence length, is calculated within the function.

segs

A (non-negative) integer which reflects the number of segments considered. It is calculated in the program based on the user-defined segLength.
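
The relation between segLength and segs can be sketched as follows. This is only an illustration: it assumes that the number of segments is the sequence length divided by segLength, rounded up, with a possibly shorter last segment. The rule used inside LDJump may differ, and seq_len_bp, seg_start, seg_end and segments are hypothetical names used only here.

```r
# Hypothetical sketch: derive the number of segments and their boundaries
# from a sequence length and a user-defined segment length.
# Assumption: the last segment may be shorter; LDJump's own rule may differ.
seq_len_bp <- 4500L   # illustrative total sequence length in bp
segLength  <- 1000L   # recommended segment length (1 kb)

segs <- ceiling(seq_len_bp / segLength)

# Start and end position (1-based, inclusive) of each segment
seg_start <- (seq_len(segs) - 1L) * segLength + 1L
seg_end   <- pmin(seq_len(segs) * segLength, seq_len_bp)

# One row per segment; column x plays the role of the control variable x
segments <- data.frame(x = seq_len(segs), start = seg_start, end = seg_end)
print(segments)
```

Each row corresponds to one value of the control variable x described above.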

seqName

A character string containing the full path and the name of the sequence file in fasta or vcf format. The extension must be included ("fileName.fa", "fileName.fasta", "fileName.vcf") in order to run LDJump. If format equals DNABin, seqName is the name of the DNABin object (without any extension).

nn

An integer giving the number of sequences sampled from the population to be analyzed. For diploid samples this is twice the number of individuals.

pathLDhat

A character string containing the path to LDhat. This path and an installation of LDhat are required for the computations of the package.

pathPhi

A character string containing the path to PhiPack. This path and an installation of PhiPack are required for the computations of the package.

status

An optional logical value; TRUE by default, so that the current processing status of the segments is printed.

polyThres

A numeric value between 0 and 1. Used in the data manipulation function DNAbin2genind: the minimum frequency of a minor allele for a locus to be considered polymorphic (defaults to 0).

out

An optional character string; by default the empty string "". Can be set to any user-defined string in order to rename all output files used within LDJump. This parameter makes it possible to run LDJump from the same directory without creating interfering files in the working directory.

format

A character string describing the format of the input file, e.g. "fasta" or "vcf". The default is "fasta".

startofseq

An integer value giving the position at which the sequence to be analyzed starts (only required when running LDJump with VCF files). The starting value is passed to vcftools to select the appropriate range for splitting the VCF file into segments. In summary_statistics, the same value is used to loop over each FASTA segment.

Value

This function returns a concatenated vector of all computed summary statistics:

hahe

The haplotype heterozygosity of the considered segment. Returned with stats.

tajd

Tajima's D. Only used in the regression model for demography.

haps

The number of haplotypes. Later on it is normalized by sequence length and number of individuals.

apwd

Average pairwise differences. Later it is normalized by sequence length.

vapw

Variance of pairwise differences. Later it is normalized by sequence length.

wath

Watterson's theta. Later it is normalized by sequence length.

phis

A vector containing the four summary statistics obtained from PhiPack as MaxChi, NSS, mean(Phi) and var(Phi).
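
To make several of the quantities above concrete, here is a minimal base-R sketch on a toy alignment. It uses common textbook definitions (haplotype heterozygosity as n/(n-1) * (1 - sum of squared haplotype frequencies), Watterson's theta as the number of segregating sites divided by the harmonic number a_n). This is an assumption for illustration: LDJump relies on external packages and tools for these values, and its exact definitions (for example sample versus population variance) and its later normalization may differ. The alignment aln and all helper names are hypothetical.

```r
# Toy alignment: 4 sequences (rows) x 6 sites (columns); hypothetical data.
aln <- rbind(
  c("A", "C", "G", "T", "A", "C"),
  c("A", "C", "G", "T", "A", "C"),
  c("A", "T", "G", "T", "A", "C"),
  c("G", "T", "G", "T", "A", "A")
)
n <- nrow(aln)   # number of sequences (the role of nn)

# haps: number of distinct haplotypes in the segment
hap_strings <- apply(aln, 1, paste, collapse = "")
haps <- length(unique(hap_strings))

# hahe: haplotype heterozygosity, n/(n-1) * (1 - sum(p_i^2))
p <- as.numeric(table(hap_strings)) / n
hahe <- n / (n - 1) * (1 - sum(p^2))

# Number of differing sites for every pair of sequences
pairs <- combn(n, 2)
pwd <- apply(pairs, 2, function(ij) sum(aln[ij[1], ] != aln[ij[2], ]))
apwd <- mean(pwd)   # average pairwise differences
vapw <- var(pwd)    # sample variance of pairwise differences (assumption)

# wath: Watterson's theta, S / a_n with a_n = sum_{i=1}^{n-1} 1/i
S <- sum(apply(aln, 2, function(site) length(unique(site)) > 1))
a_n <- sum(1 / seq_len(n - 1))
wath <- S / a_n

print(c(haps = haps, hahe = hahe, apwd = apwd, vapw = vapw, wath = wath))
```

The values are unnormalized; the normalization by sequence length (and, for haps, by the number of individuals) mentioned above is applied later and is not reproduced here.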

Author(s)

Philipp Hermann philipp.hermann@jku.at, Andreas Futschik, Fardokhtsadat Mohammadi fardokht.fm@gmail.com

References

Auton, A. and McVean, G. (2007). Recombination rate estimation in the presence of hotspots. Genome Research, 17(8), 1219–1227.

Bruen, T. C., Philippe, H., and Bryant, D. (2006). A simple and robust statistical test for detecting the presence of recombination. Genetics, 172(4):2665-2681.

Jombart T. and Ahmed I. (2011) adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics. doi:10.1093/bioinformatics/btr521

Hermann, P., Heissl, A., Tiemann-Boege, I., and Futschik, A. (2019), LDJump: Estimating Variable Recombination Rates from Population Genetic Data. Mol Ecol Resour. doi:10.1111/1755-0998.12994.

McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R., and Donnelly, P. (2004). The fine-scale structure of recombination rate variation in the human genome. Science, 304(5670), 581–584.

Paradis E., Claude J. & Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289-290.

See Also

LDJump, vcfR_to_fasta, getPhi, get_smuce, readDNAStringSet, DNAbin2genind

Examples

##### Do not run these examples                                                      #####
##### In LDJump.R the function is called as follows                                  #####
##### sapply(1:segs,summary_statistics,s=s,segs=segs,seqName=seqName,nn=nn,ll = ll)  #####
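
The calling pattern quoted in the comment above can be tried without LDhat or PhiPack by substituting a stand-in function. The sketch below is not the package's code: toy_statistics is a hypothetical replacement for summary_statistics, and s is a plain character string here rather than an XStringSet. It only shows how sapply passes each segment index as the first argument and collects one column of statistics per segment.

```r
# Hypothetical stand-in for summary_statistics(): returns a named vector
# of two toy statistics for segment x. It does not call LDhat or PhiPack.
toy_statistics <- function(x, s, segLength) {
  from  <- (x - 1) * segLength + 1
  to    <- min(x * segLength, nchar(s))
  bases <- strsplit(substr(s, from, to), "")[[1]]
  c(len = length(bases), gc = mean(bases %in% c("G", "C")))
}

s <- "ACGTGGCCAATT"   # toy sequence of 12 bases (plain string, assumption)
segLength <- 4
segs <- ceiling(nchar(s) / segLength)

# Each index in 1:segs is passed as x; the remaining arguments are fixed.
res <- sapply(1:segs, toy_statistics, s = s, segLength = segLength)
print(res)   # matrix with one row per statistic and one column per segment
```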

PhHermann/LDJump documentation built on Nov. 16, 2019, 12:53 p.m.