chargaff0.test: Vector Test of Chargaff's Second Parity Rule (CSPR) for...
In spgs: Statistical Patterns in Genomic Sequences

chargaff0.test

R Documentation

Vector Test of Chargaff's Second Parity Rule (CSPR) for Mononucleotides

Description

Performs the vector test of Chargaff's second parity rule (CSPR) for mononucleotides proposed in Hart and Martínez (2001).

Usage

chargaff0.test(x, alg=c("exact", "simulate", "lower", "upper", "Lower", "Upper"), n, 
no.p.value=FALSE)

Arguments

`x`	either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.
`alg`	the algorithm for computing the p-value. If set to “‘⁠simulate⁠’”, the p-value is obtained via Monte Carlo simulation. If set to “‘⁠lower⁠’”, an analytic lower bound on the p-value is computed. If set to “‘⁠upper⁠’”, an analytic upper bound on the p-value is computed. “‘⁠lower⁠’” and “‘⁠upper⁠’” are based on formulae in Hart and Martínez (2011). a Tighter (though unpublished) lower /upper bound on the p-value may be obtained by specifying “‘⁠Lower⁠’”/“‘⁠Upper⁠’”. If ‘⁠type⁠’ is specified as “‘⁠exact⁠’” (the default value),the p-value for the test is computed exactly for small values of the test statistic and crudely approximated for large values. See the note below for further details.
`n`	The number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 is recommended.
`no.p.value`	If ‘⁠TRUE⁠’, do not compute the p-value. The default is ‘⁠FALSE⁠’.

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the A, C, G and T nucleic acids.

Letting A, C, G and T denote the relative frequencies of their corresponding nucleic acids, this function performs the following hypothesis test:

H_0: A\neq T or C\neq G
H_1: A=T and C=G

The vector (A,C,G,T) is assumed to come from a Dirichlet(1,1,1,1) distribution on the 3-simplex under the null hypothesis.

The test statistic is \eta_0 = \sqrt{(A-T)^2/2+(C-G)^2/2}.

Value

A list with class "htest.ext" containing the following components:

`statistic`	the value of the test statistic.
`p.value`	the p-value of the test. Only included if no.p.value is FALSE.
`method`	a character string indicating what type of test was performed.
`data.name`	a character string giving the name of the data.
`estimate`	the probability vector used to derive the test statistic.
`stat.desc`	a brief description of the test statistic.
`null`	the null hypothesis (`H_0`) of the test.
`alternative`	the alternative hypothesis (`H_1`) of the test.

Note

Currently, regardless of the algorithm (alg) selected, the p-value or bound is only computed correctly when the test statistic is smaller than or equal to \sqrt{2}/4. A value of 1 is returned when the test statistic is greater than \sqrt{2}/4. This is not accurate, but shouldn't matter as it is well within the acceptance region of the null hypothesis.

The algebraically computed bounds on the p-value obtained when ‘⁠alg⁠’ is set to either “‘⁠lower⁠’”or “‘⁠upper⁠’” are not as tight as those corresponding to “‘⁠Lower⁠’” and “‘⁠Upper⁠’”, which should be generally preferred. However, “‘⁠exact⁠’” or “‘⁠simulate⁠’” should be employed for real- world analysis.

‘⁠no.p.value⁠’ suppresses computation of the p-value when it is set to ‘⁠TRUE⁠’. This may be useful wen using this function to help simulate the test statistic.

Author(s)

Andrew Hart and Servet Martínez

References

Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

Examples

#Demonstration on real bacterial sequence
data(nanoarchaeum)
chargaff0.test(nanoarchaeum)

#Simulate synthetic DNA sequence that does not satisfy Chargaff's second parity rule
trans.mat <- matrix(c(.4, .1, .4, .1, .2, .1, .6, .1, .4, .1, .3, .2, .1, .2, .4, .3), 
ncol=4, byrow=TRUE)
seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
chargaff0.test(seq)

spgs documentation built on Oct. 3, 2023, 5:07 p.m.