Test of Purine-Pyrimidine Parity Based on Euclidean Distance

Description

Performs a test proposed by Hart and Martínez (2011) for the equivalence of the relative frequencies of purines (A+G) and pyrimidines (C+T) in DNA sequences. It does this by checking whether the mononucleotide frequencies of a DNA sequence satisfy the relationship A+G=C+T.

Usage

agct.test(x, alg=c("exact", "simulate", "lower", "Lower", "upper"), n)

Arguments

x

either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.

alg

the algorithm for computing the p-value. If alg is specified as “exact” (the default value), the p-value for the test is computed exactly. If set to “simulate”, the p-value is obtained via Monte Carlo simulation. If set to “lower”, an analytic lower bound on the p-value is computed. If set to “upper”, an analytic upper bound on the p-value is computed. “lower” and “upper” are based on formulae in Hart and Martínez (2011). A tighter (though unpublished) lower bound on the p-value may be obtained by specifying “Lower”.

n

The number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 is recommended.

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the A, C, G and T nucleic acids.

Let A, C, G and T denote the relative frequencies of the nucleotide bases appearing in a DNA sequence. This function carries out a statistical hypothesis test of whether the relative frequencies satisfy the relation A+G=C+T, that is, whether purines {A,G} occur as often as pyrimidines {C,T} in a DNA sequence. The relationship can be rewritten as A-T=C-G, from which it is easy to see that the property being tested is a generalisation of Chargaff's second parity rule for mononucleotides, which states that A=T and C=G. The test is set up as follows:
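
As a minimal sketch of the relation being tested, the base R snippet below tabulates the relative frequencies of a short made-up sequence and compares A+G with C+T. The helper name base.freqs and the example sequence are illustrative assumptions and are not part of the package.

```r
# Hypothetical helper (not part of the package): relative frequencies of
# a, c, g, t in a character vector holding one nucleotide per element.
base.freqs <- function(dna) {
  counts <- table(factor(tolower(dna), levels = c("a", "c", "g", "t")))
  freqs <- as.numeric(counts) / sum(counts)
  names(freqs) <- c("A", "C", "G", "T")
  freqs
}

# Made-up sequence with 4 a, 1 g, 2 c and 3 t.
dna <- strsplit("aaaagccttt", "")[[1]]
f <- base.freqs(dna)

purines <- f[["A"]] + f[["G"]]      # A + G
pyrimidines <- f[["C"]] + f[["T"]]  # C + T

# (A - T) - (C - G) = (A + G) - (C + T), so A+G = C+T holds
# exactly when A - T = C - G.
lhs <- f[["A"]] - f[["T"]]
rhs <- f[["C"]] - f[["G"]]
isTRUE(all.equal(purines, pyrimidines))
isTRUE(all.equal(lhs, rhs))
```

In this example purine-pyrimidine parity holds even though A differs from T and C differs from G, which illustrates why the property is weaker than Chargaff's second parity rule.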

H0: A+G != C+T
H1: A+G = C+T

The vector (A,C,G,T) is assumed to come from a Dirichlet(1,1,1,1) distribution on the 3-simplex under the null hypothesis.
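
To make the null model concrete, the sketch below draws frequency vectors from Dirichlet(1,1,1,1) using the standard construction of normalising independent Exp(1) variables. The helper name rdirichlet.unif4 is hypothetical, and this is not necessarily how the package samples internally.

```r
# Hypothetical helper (not part of the package): draw m vectors (A, C, G, T)
# from Dirichlet(1,1,1,1) by normalising independent Exp(1) variables.
rdirichlet.unif4 <- function(m) {
  e <- matrix(rexp(4 * m), ncol = 4)
  p <- e / rowSums(e)   # each row is divided by its own sum
  colnames(p) <- c("A", "C", "G", "T")
  p
}

set.seed(1)
null.draws <- rdirichlet.unif4(10000)

# Under this null the purine fraction A+G varies over (0, 1)
# (it follows a Beta(2,2) distribution) rather than being fixed at 1/2.
purine.fraction <- null.draws[, "A"] + null.draws[, "G"]
summary(purine.fraction)
```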

The test statistic etaV is the Euclidean distance from the relative frequency vector (A,C,G,T) to the closest point in the square set thetaV = {(x, y, 1/2-x, 1/2-y) : 0 <= x,y <= 1/2}, which divides the 3-simplex into two equal parts. etaV lies in the range [0, sqrt(3/8)].
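
The definition above can be sketched in base R as follows. The helper names eta.v and sim.p.value are hypothetical and are not part of the package. The p-value sketch assumes that small distances count as evidence for H1, so that the p-value is the null probability of a distance no larger than the one observed; this direction is inferred from the hypotheses above rather than taken from the package source. The default of 10000 replications is chosen only to keep the sketch fast.

```r
# Hypothetical helper (not part of the package): Euclidean distance from
# p = (A, C, G, T) to the square {(x, y, 1/2 - x, 1/2 - y) : 0 <= x, y <= 1/2}.
eta.v <- function(p) {
  p <- as.numeric(p)
  clamp <- function(z) min(max(z, 0), 0.5)
  # The squared distance splits into an x-part (A, G) and a y-part (C, T);
  # each part is minimised at its unconstrained optimum clamped to [0, 1/2].
  x <- clamp((p[1] - p[3] + 0.5) / 2)
  y <- clamp((p[2] - p[4] + 0.5) / 2)
  nearest <- c(x, y, 0.5 - x, 0.5 - y)
  sqrt(sum((p - nearest)^2))
}

# Monte Carlo estimate of P(etaV <= observed) under Dirichlet(1,1,1,1).
# Assumption: small distances are taken as evidence for H1.
sim.p.value <- function(p, n = 10000) {
  e <- matrix(rexp(4 * n), ncol = 4)
  null.stats <- apply(e / rowSums(e), 1, eta.v)
  mean(null.stats <= eta.v(p))
}

eta.v(c(0.4, 0.2, 0.1, 0.3))   # parity holds, so the distance is zero
eta.v(c(1, 0, 0, 0))           # a vertex of the simplex, at distance sqrt(3/8)

set.seed(42)
sim.p.value(c(0.31, 0.20, 0.21, 0.28))
```

Clamping is what distinguishes the distance to the square from the distance to the plane A+G=1/2: when the orthogonal projection onto the plane has a negative coordinate, the nearest point of the square lies on its boundary instead.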

Value

A list with class "htest.ext" containing the following components:

statistic

the value of the test statistic.

p.value

the p-value of the test.

method

a character string indicating what type of test was performed.

data.name

a character string giving the name of the data.

estimate

the probability vector used to derive the test statistic.

stat.desc

a brief description of the test statistic.

null

the null hypothesis (H0) of the test.

alternative

the alternative hypothesis (H1) of the test.

Note

agct.test(x, alg="upper") is equivalent to ag.test(x, alg="simplex") except that the p-value computed using the formula for alg="upper" is exact for the test statistic etaV* used in ag.test, whereas it is merely an upper bound on the p-value for etaV.

Author(s)

Andrew Hart and Servet Martínez

References

Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

See Also

chargaff0.test, chargaff1.test, chargaff2.test, ag.test, chargaff.gibbs.test

Examples

# Demonstration on a real viral sequence
data(pieris)
agct.test(pieris)

# Simulate a synthetic DNA sequence that does not exhibit purine-pyrimidine parity
trans.mat <- matrix(c(.4, .1, .4, .1,
                      .2, .1, .6, .1,
                      .4, .1, .3, .2,
                      .1, .2, .4, .3),
                    ncol=4, byrow=TRUE)
# Named sim.seq rather than seq to avoid masking base::seq
sim.seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
agct.test(sim.seq)