Matrix Test of CSPR for Mononucleotides

Description

Performs a test of Chargaff's second parity rule (CSPR) for mononucleotides on a genomic sequence using a 4X4 stochastic matrix estimated from the sequence. The test was proposed by Hart and Mart<ed>nez (2011).

Usage

1
chargaff1.test(x, alg=c("table", "simulate", "upper"), n, no.p.value=FALSE)

Arguments

x

either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.

alg

the algorithm for computing the p-value. If set to “simulate”, the p-value is obtained via Monte Carlo simulation. If set to “upper”, an analytic upper bound on the p-value is computed. “upper” are based on formulae in Hart and Mart<ed>nez (2011). If type is specified as “table” (the default value),the p-value for the test is obtained from a linear interpolation of a look-up table. See the note below for further details.

n

The number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 is recommended.

no.p.value

If TRUE, do not compute the p-value. The default is FALSE.

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the A, C, G and T nucleic acids.

This function performs a test of Chargaff's second parity rule for mononucleotides based on a 4X4 stochastic matrix P estimated from the empirical dinucleotide distribution of a genomic sequence . The a,b) entry of P gives the empirical probability (relative frequency) that a nucleotide a is followed by a nucleotide b in the sequence. The test is set up as follows:

H0: the sequence (or matrix P) does not comply with CSPR for mononucleotides
H1: the sequence (or matrix P) complies with CSPR for mononucleotides

Value

A list with class "htest.ext" containing the following components:

statistic

the value of the test statistic.

p.value

the p-value of the test. Only included if no.p.value is FALSE.

method

a character string indicating what type of test was performed.

data.name

a character string giving the name of the data.

f

the 2-element vector used in calculating the test statistic.

estimate

the stochastic matrix P used to derive the test statistic.

stat.desc

a brief description of the test statistic.

null

the null hypothesis (H0) of the test.

alternative

the alternative hypothesis (H1) of the test.

Note

Currently, the look-up table that is employed when alg is set to “table” does not provide an accurate p-value when the statistic is smaller than 0.00806. Care should be taken when adjusting p-values for multiple testing.

The algebraically computed upper bound on the p-value obtained when alg is set to “upper” is not very tight and not suitable for real- world applications.

no.p.value suppresses computation of the p-value when it is set to TRUE. This may be useful wen using this function to help simulate the test statistic.

Author(s)

Andrew Hart and Servet Mart<ed>nez

References

Hart, A.G. and Mart<ed>nez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

See Also

chargaff0.test, chargaff2.test, agct.test, ag.test, chargaff.gibbs.test

Examples

1
2
3
4
5
6
7
8
9
#Demonstration on real bacterial sequence
data(nanoarchaeum)
chargaff1.test(nanoarchaeum)

#Simulate synthetic DNA sequence that does not satisfy Chargaff's second parity rule
trans.mat <- matrix(c(.4, .1, .4, .1, .2, .1, .6, .1, .4, .1, .3, .2, .1, .2, .4, .3), 
ncol=4, byrow=TRUE)
seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
chargaff1.test(seq)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.