scanning: Scan Statistics
In miraculix: Algebraic and Statistical Functions for Genetics

Description Usage Arguments Details Value Author(s) References Examples

The function implements the scan statistics method of Kabluchko and Spodarev (2009), Theorem 3.1.

scanning(pos, freq, file, tuningUnits, alpha = 0.1, coarsening = 1,
         minscans=0, maxscans = 0, sumscan = FALSE, perSNP = TRUE,
         colname , n, threshold, collect=!old.def, old.def=FALSE,
         max.intervals = length(alpha) * 100000,
         max.basepair.distance = 50000, exclude.negative.at.boundary = TRUE,
         maximum = TRUE, mean.freq, sd.freq, mean.n)

scan.statistics(file, tuningUnits, alpha=c(0.05, 0.01), repet=1000,
                coarsening = 1,
                minscans=0, maxscans=0, sumscan = FALSE, perSNP=TRUE, 
                colname, n, return.simu = FALSE,
                debug = FALSE, formula = FALSE, 
                old.def=FALSE,
                max.intervals = length(alpha) * 100000,
                max.basepair.distance = 50000,
                exclude.negative.at.boundary = TRUE,
                pos, freq)

`pos, freq`	alternatively to the `file` name, two vectors, `pos` and `freq` might be given.
`file`	filename or list. The rda file must contain the variables `pos`, `freq`, `colname`, and `n`. Or it is a list with the same named elements. If the extension of the filename is ‘bed’, the behaviour of the programme is different, see the details
`tuningUnits`	real number. The value 0 codes the case of Theorem 3.1 in Kabluchko and Spordarev (2009). A positive value codes the case of Theorem 2.1 (which is very much preferred). The case of Theorem 3.2 does not suit, hence is not coded. Good values for `tuningUnits` seem to be around 0.85. Note that first, the frequencies are standardized. Then `tuningUnits` * `mean`(n) / n is substracted.
`alpha`	level(s) of testing. The levels should decrease.
`coarsening`	integer. If the value is larger than 1 then the data are first windower'ed by `length=coarsening`. This is important to do if the data are fine scaled!
`repet`	The number of simulation to determine the threshold(s) for testing in `scan.statistics`; see also `formula`. Should be at least 100 better 1000.
`minscans,maxscans`	integers. The minimunm and maximum length of the window, respectively. If non-positive the window sizes are not restricted from below or above, respectively.
`sumscan`	logical. If `TRUE` the old style picture appears. Otherwise the relative number of significant intervals containing a certain point is shown.
`perSNP`	logical. If `TRUE` then the test is based on SNPs as units. If `FALSE` the test is based on basepairs (not programmed yet).
`colname`	the column of the data frame that gives the relative frequencies. The default name (i.e., if missing) is "HeterAB". Alternatively `colname` is a number indicating the respective column. In case the extension of the filename equals ‘bed’, the behaviour is different, see Details.
`n`	The number of individuals, the data are based on. Usually that number is determined automatically, but might be given for safety explicitely
`return.simu`	logical. to do
`debug`	logical or 2. If not `FALSE` important data are saved on the disk. If `debug == 2` pictures of each simulation are shown. [to do in more detail]
`threshold`	`scanning` counts the number of intervals found above the given `threshold`. `threshold` is an alternative to `alpha` and is used instead of `alpha` if both are given. This threshold is applied to the standardized frequency data. A value around 0.8 seems to be appropriate for Christian's data whereas values around 18 are appropriate for Amanda's data.
`collect`	`scanning` can be used in two ways. If `collect=FALSE` essentially only the scan statistic is determined. If `collect=TRUE` then also all the intervalls are determined that are considered to be significant at the given `alpha` levels.
`old.def`	logical. If `TRUE` all the tiny snippets that have not been agglutinated yet, are also returned. If `TRUE` it takes a lot of memory. Further, if `TRUE`, negative (modified) values are allowed at the borders of an interval. Finally, if `TRUE` the parameters `max.intervals`, `max.basepair.distance`, `exclude.negative.at.boundary` are not considered.
`max.intervals`	[only if `old.def=FALSE`] As the number of intervals is determined dynamically, the total number of significant intervals cannot be determined in advance. To economise a lot of copying, an upper threshold is given by the user. 100000 for each level should be large enough. If not, please contact the author.
`max.basepair.distance`	[only if `old.def=FALSE`] if a basepair distance is larger than `max.basepair.distance` then the significant areas are considered as two separate areas.
`exclude.negative.at.boundary`	logical. If `TRUE` negative values at boundaries are not allowed. I.e. each significant area starts and ends with a positive modified frequency.
`maximum`	logical. MISSING DOC
`mean.freq`	If given, `mean.freq` overwrites `mean(freq)`
`sd.freq`	If given, `sd.freq` overwrites `sd(freq)`
`mean.n`	If given, `mean.n` overwrites `mean(n)`
`formula`	if `formula=TRUE` then the formula of Kabluchko and Spodarev (2009) is used in `scan.statististics`. Otherwise, a `repet` number of simulations under the null hypothesis are performed to get the threshold right.

The ideas for the code are taken from Kabluchko and Spordarev (2009) although the values are not calculated from the respective theorems. Instead, values are obtained by simulation in a procedure similar to Bootstrapping.

In case the file is a bed-file, the following differences to the standard behaviour appears:

colname must be of the form c(pos=, freq=, n=) with default value c(pos=3, freq=4, n=5)
the sign of the frequency is changed
it is not checked whether the frequencies * n equals an integer number

scanning

returns invisibly a list that contains always

file, pos, freq, tuningUnits, alpha, n, maxscans, perSNP: the input data
above.threshold: the number of intervals showing a total sum larger than the given threshold.
threshold: corresponding to alpha, if not given explicitely
maximum: the maximum value reached scanning over all windows

if collect=TRUE then the list also contains

areas: matrix of three rows containing information of all the (overlapping) intervals where the sums exceeds the thresholds. Each interval is given by a column. First row: left end point of the interval. Second row: right end point of the interval. Third interval: maximum number of threshold that are passed.
values: the sums that correspond to the maxima in areas
significant.areas: list of matrices. For each threshold, all the overlapping intervals are joined that overlap, so that non-overlapping intervals are finally obtained.
Message: whether the null hypothesis is rejected at the lowest alpha level.

scan.statistics

returns invisibly a list containing all elements of scanning for collect=TRUE. Additionally, it contains

maxima: the maxima of repet simulated data if formula=FALSE

Martin Schlather, schlather@math.uni-mannheim.de

Kabluchko, Z. and Spodarev, E. (2009) Scan statistics of Levy noises and marked empirical processes. Adv. Appl. Probab. 41, 13-37.

 
if (interactive()) {
  n <- 30
  loci <- 9000 
  positions <- 25000:15000000
} else {
  n <- 3
  loci <- 900
  positions <- 2500:1500000
}
pos <- sort(sample(positions, loci))
freq <- rpois(loci, lambda=0.3) / n

alpha <- c(0.1, 0.05, 0.01)
s <- scan.statistics(n=n, pos=pos, freq=freq, repet=100,
                     tuningUnits=0.65, alpha=alpha)
str(s)