agct.test: Test of Purine-Pyrimidine Parity Based on Euclidean distance

agct.test    R Documentation

Description

Performs a test proposed by Hart and Martínez (2011) for the equivalence of the relative frequencies of purines (A+G) and pyrimidines (C+T) in DNA sequences. It does this by checking whether or not the mononucleotide frequencies of a DNA sequence satisfy the relationship A+G=C+T.

Usage

agct.test(x, alg=c("exact", "simulate", "lower", "Lower", "upper"), n)

Arguments

x

either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.

alg

the algorithm for computing the p-value. If set to "simulate", the p-value is obtained via Monte Carlo simulation. If set to "lower", an analytic lower bound on the p-value is computed. If set to "upper", an analytic upper bound on the p-value is computed. Both "lower" and "upper" are based on formulae in Hart and Martínez (2011). A tighter (though unpublished) lower bound on the p-value may be obtained by specifying "Lower". If alg is specified as "exact" (the default value), the p-value for the test is computed exactly.

n

the number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 (1e7) is recommended.

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the nucleotides A, C, G and T.

Let A, C, G and T denote the relative frequencies of the nucleotide bases appearing in a DNA sequence. This function carries out a statistical hypothesis test of whether the relative frequencies satisfy the relation A+G=C+T, that is, whether purines \{A, G\} occur as often as pyrimidines \{C, T\} in a DNA sequence. The relationship can be rewritten as A-T=C-G, from which it is easy to see that the property being tested is a generalisation of Chargaff's second parity rule for mononucleotides, which states that A=T and C=G. The test is set up as follows:

H_0: A+G \neq C+T
H_1: A+G = C+T
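
The equivalence between A+G=C+T and A-T=C-G can be checked numerically on a toy sequence. The following is a minimal sketch in base R; the helper base_freqs and the example sequence are hypothetical and are not part of the spgs package.

```r
# Hypothetical helper: relative frequencies of a, c, g, t in a character
# vector holding one nucleotide per element (not an spgs function).
base_freqs <- function(dna) {
  counts <- table(factor(tolower(dna), levels = c("a", "c", "g", "t")))
  freqs <- as.numeric(counts) / sum(counts)
  names(freqs) <- c("a", "c", "g", "t")
  freqs
}

# Toy sequence, made up for illustration: 3 a, 2 c, 3 g, 2 t
dna <- strsplit("aagctgcatg", "")[[1]]
f <- base_freqs(dna)

purine     <- f[["a"]] + f[["g"]]   # A + G
pyrimidine <- f[["c"]] + f[["t"]]   # C + T

# (A+G) - (C+T) equals (A-T) - (C-G), so one difference vanishes
# exactly when the other does.
lhs <- purine - pyrimidine
rhs <- (f[["a"]] - f[["t"]]) - (f[["c"]] - f[["g"]])
print(c(purine = purine, pyrimidine = pyrimidine))
print(isTRUE(all.equal(lhs, rhs)))
```

For this toy sequence the purine share differs from the pyrimidine share, so the parity relation does not hold exactly.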

The vector (A,C,G,T) is assumed to come from a Dirichlet(1,1,1,1) distribution on the 3-simplex under the null hypothesis.
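
To get a feel for this null distribution, frequency vectors can be drawn from Dirichlet(1,1,1,1) by normalising independent Exp(1) variables, which is a standard construction of the uniform distribution on the simplex. The sketch below is an illustration only; the function name rdirichlet_uniform is hypothetical, and this is not claimed to be how agct.test performs its own simulation.

```r
set.seed(1)

# Hypothetical helper: n draws from Dirichlet(1,1,1,1), one draw per row.
# Each row is a vector of four independent Exp(1) values divided by their sum.
rdirichlet_uniform <- function(n) {
  e <- matrix(rexp(4 * n), ncol = 4,
              dimnames = list(NULL, c("A", "C", "G", "T")))
  e / rowSums(e)
}

draws <- rdirichlet_uniform(100000)

# Purine share A+G for each simulated frequency vector
purine_share <- draws[, "A"] + draws[, "G"]

print(range(rowSums(draws)))   # every row sums to 1 (up to rounding)
print(mean(purine_share))      # close to 0.5 on average
```

Although the average purine share is near 1/2 under this distribution, individual draws are spread over the whole interval [0,1], so exact parity is not typical of a uniformly drawn frequency vector.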

The test statistic \eta_V is the Euclidean distance from the relative frequency vector (A,C,G,T) to the closest point in the square set \theta_V=\{(x,y,1/2-x,1/2-y) : 0 \le x,y \le 1/2\}, which divides the 3-simplex into two equal parts. \eta_V lies in the range [0,\sqrt{3/8}].
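
The distance described above can be computed directly, because the squared distance separates into a term in x (involving A and G) and a term in y (involving C and T), each a one-dimensional quadratic that is minimised by clamping its unconstrained minimiser to [0,1/2]. The following is a minimal sketch derived from the definition of \theta_V; the function eta_v is hypothetical and is not taken from the spgs source.

```r
# Hypothetical helper: Euclidean distance from p = (A, C, G, T) to the
# nearest point (x, y, 1/2 - x, 1/2 - y) with 0 <= x, y <= 1/2.
eta_v <- function(p) {
  stopifnot(length(p) == 4, all(p >= 0), abs(sum(p) - 1) < 1e-8)
  p <- as.numeric(p)
  clamp <- function(z) min(max(z, 0), 0.5)
  # Minimise (A - x)^2 + (G - (1/2 - x))^2 over x, then clamp to [0, 1/2]
  x <- clamp((p[1] - p[3] + 0.5) / 2)
  # Minimise (C - y)^2 + (T - (1/2 - y))^2 over y, then clamp to [0, 1/2]
  y <- clamp((p[2] - p[4] + 0.5) / 2)
  nearest <- c(x, y, 0.5 - x, 0.5 - y)
  sqrt(sum((p - nearest)^2))
}

# A vector with A+G = C+T lies in the set, so its distance is zero
print(eta_v(c(0.25, 0.25, 0.25, 0.25)))

# A vertex of the simplex attains the upper end of the range, sqrt(3/8)
print(eta_v(c(1, 0, 0, 0)))
print(sqrt(3 / 8))
```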

Value

A list with class "htest.ext" containing the following components:

statistic

the value of the test statistic.

p.value

the p-value of the test.

method

a character string indicating what type of test was performed.

data.name

a character string giving the name of the data.

estimate

the probability vector used to derive the test statistic.

stat.desc

a brief description of the test statistic.

null

the null hypothesis (H_0) of the test.

alternative

the alternative hypothesis (H_1) of the test.
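
The components listed above can be read from the returned list with the usual $ operator. A short sketch, assuming the spgs package is installed (the guard skips the example otherwise); the variable name res is arbitrary.

```r
# Runs only if spgs is available; otherwise nothing is executed.
if (requireNamespace("spgs", quietly = TRUE)) {
  library(spgs)
  data(pieris)                 # example sequence shipped with spgs
  res <- agct.test(pieris)     # default alg = "exact"

  print(res$statistic)         # value of the test statistic
  print(res$p.value)           # p-value of the test
  print(res$estimate)          # probability vector behind the statistic
  print(res$method)            # description of the test performed
}
```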

Note

agct.test(x, alg="upper") is equivalent to ag.test(x, alg="simplex") except that the p-value computed using the formula for alg="upper" is exact for the test statistic \eta_V^* used in ag.test, whereas it is merely an upper bound on the p-value for \eta_V.

Author(s)

Andrew Hart and Servet Martínez

References

Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

See Also

chargaff0.test, chargaff1.test, chargaff2.test, ag.test, chargaff.gibbs.test

Examples

library(spgs)

# Demonstration on a real viral sequence
data(pieris)
agct.test(pieris)

# Simulate a synthetic DNA sequence that does not exhibit purine-pyrimidine parity.
# Each row of the transition matrix sums to 1.
trans.mat <- matrix(c(.4, .1, .4, .1,
                      .2, .1, .6, .1,
                      .4, .1, .3, .2,
                      .1, .2, .4, .3),
                    ncol=4, byrow=TRUE)
# Named sim.seq to avoid masking base::seq
sim.seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
agct.test(sim.seq)

spgs documentation built on Oct. 3, 2023, 5:07 p.m.