pcKeepCompDetect: Auto detection of a fitted 'pcKeepComp' param for filterFFT...
In nucleR: Nucleosome positioning package for R


This function tries to obtain the minimum number of components needed in an FFT filter to achieve, or get as close as possible to, a given correlation value. You usually don't need to call this function directly; filterFFT uses it by default.

pcKeepCompDetect(
  data,
  pc.min = 0.01,
  pc.max = 0.1,
  max.iter = 20,
  verbose = FALSE,
  cor.target = 0.98,
  cor.tol = 0.001,
  smpl.num = 25,
  smpl.min.size = 2^10,
  smpl.max.size = 2^14
)

`data`	Numeric vector to be filtered
`pc.min, pc.max`	Range of allowed values for `pcKeepComp` (minimum and maximum), in the range 0:1.
`max.iter`	Maximum number of iterations
`verbose`	Extra information (debug)
`cor.target`	Target correlation between the filtered and the original profiles. A value around 0.99 is recommended for Next Generation Sequencing data and around 0.7 for Tiling Arrays.
`cor.tol`	Tolerance allowed between the obtained correlation and the target one.
`smpl.num`	If `data` is a large vector, some samples from the vector will be used instead of the whole dataset. This parameter sets the number of samples to pick.
`smpl.min.size, smpl.max.size`	Minimum and maximum size of the samples. This is used for selection and sub-selection of ranges with meaningful values (i.e., different from 0 and NA). Power-of-2 values are recommended, although not mandatory.
`...`	Parameters to be passed to `autoPcKeepComp`

This function predicts a suitable pcKeepComp value for the filterFFT function: the recommended proportion of components (a value in the range 0:1) to keep in filterFFT in order to obtain a correlation equal or close to cor.target.

The search starts from the two given values pc.min and pc.max and uses linear interpolation to quickly reach a value whose correlation between the filtered and the original profiles is within cor.tol of cor.target.
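The interpolation search described above can be sketched in base R. This is a minimal illustration under stated assumptions, not nucleR's code: `fft_keep` and `interp_search` are hypothetical helpers, the filter simply zeroes all but the lowest-frequency components, and the update rule is the classic false-position scheme, which may differ from the one the package uses.

```r
# Illustrative low-pass FFT filter (an assumption, not nucleR's code):
# keep the k = round(n * pc) lowest frequencies and their mirrored
# counterparts, and zero everything in between.
fft_keep <- function(x, pc) {
  n <- length(x)
  spec <- fft(x)
  k <- max(1, round(n * pc))
  if (2 * k <= n) spec[(k + 1):(n - k + 1)] <- 0
  Re(fft(spec, inverse = TRUE)) / n
}

# Hypothetical search: linear interpolation between two bracketing points
# until the correlation is within `tol` of `target`.
interp_search <- function(x, target = 0.98, tol = 0.001,
                          pc_min = 0.01, pc_max = 0.1, max_iter = 20) {
  score <- function(pc) cor(x, fft_keep(x, pc))
  lo <- pc_min
  hi <- pc_max
  c_lo <- score(lo)
  c_hi <- score(hi)
  if (c_lo >= target) return(lo)  # smallest allowed value already suffices
  if (c_hi < target) return(hi)   # target not reachable inside the range
  pc <- hi
  for (i in seq_len(max_iter)) {
    # Point where the straight line through (lo, c_lo) and (hi, c_hi)
    # crosses the target correlation
    pc <- lo + (target - c_lo) * (hi - lo) / (c_hi - c_lo)
    c_pc <- score(pc)
    if (abs(c_pc - target) <= tol) break
    if (c_pc < target) {
      lo <- pc
      c_lo <- c_pc
    } else {
      hi <- pc
      c_hi <- c_pc
    }
  }
  pc
}

# Synthetic test signal: 500 cosines whose amplitude decays with frequency
n <- 4096
t <- 0:(n - 1)
x <- rowSums(sapply(1:500, function(j) cos(2 * pi * j * t / n + j^2) / sqrt(j)))

pc <- interp_search(x)
print(pc)
print(cor(x, fft_keep(x, pc)))
```

Because each step needs only one extra filtering pass, a search of this kind typically settles in a handful of iterations, which is consistent with the small default of max.iter.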

To allow quick detection without an exhaustive search, this function uses a subset of the data, randomly sampling regions longer than smpl.min.size that have meaningful coverage values (i.e., different from 0 or NA). If a sample of smpl.max.size cannot be obtained from a region (for example, because of flanking 0s), at least smpl.min.size values are used to check the correlation. The mean correlation across all sampled regions is used to assess the performance of a given pcKeepComp value.
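The sampling step can be pictured with the following base R sketch. It is only an approximation under assumptions: `sample_regions` and `mean_cor` are hypothetical names, a "region" is taken to be an uninterrupted run of non-zero, non-NA values, and `runmed` stands in for the real filter purely so the example runs on its own.

```r
# Hypothetical helper: draw `num` windows from runs of meaningful values
# that are at least `min_size` long, capping each window at `max_size`.
sample_regions <- function(x, num = 25, min_size = 2^10, max_size = 2^14) {
  ok <- !is.na(x) & x != 0
  runs <- rle(ok)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  sel <- which(runs$values & runs$lengths >= min_size)
  if (length(sel) == 0) return(list())
  picks <- sel[sample.int(length(sel), num, replace = TRUE)]
  lapply(picks, function(i) {
    size <- min(runs$lengths[i], max_size)
    # Random offset so the window stays inside the run
    offset <- sample.int(runs$lengths[i] - size + 1, 1) - 1
    first <- starts[i] + offset
    x[first:(first + size - 1)]
  })
}

# Mean correlation between each sampled region and its filtered version
mean_cor <- function(regions, filter_fun) {
  mean(vapply(regions, function(r) cor(r, filter_fun(r)), numeric(1)))
}

# Synthetic coverage with zero and NA gaps (made-up values)
set.seed(42)
cover <- c(rep(0, 500), runif(3000, 1, 10), rep(NA, 200),
           runif(800, 1, 10), rep(0, 100), runif(20000, 1, 10))

regions <- sample_regions(cover, num = 5)
print(lengths(regions))
# Running median used as a stand-in filter (assumption, not filterFFT)
print(mean_cor(regions, function(r) runmed(r, 11)))
```

In this toy input the 800-value run is never sampled because it is shorter than the minimum size, and windows taken from the 20000-value run are capped at the maximum size.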

If the number of meaningful bases in data is less than smpl.min.size * (smpl.num/2), the whole data vector is used instead of sampling.
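As a worked example of this threshold, the snippet below evaluates it for the default arguments. The helper `uses_sampling` is a hypothetical name introduced for illustration, and "meaningful" is assumed to mean non-zero and non-NA, as in the description above.

```r
# Threshold with the default arguments
smpl.num <- 25
smpl.min.size <- 2^10
threshold <- smpl.min.size * (smpl.num / 2)
print(threshold)  # 12800

# Hypothetical helper: TRUE when sampling would be used,
# FALSE when the whole vector would be used instead
uses_sampling <- function(x, num = 25, min_size = 2^10) {
  meaningful <- sum(!is.na(x) & x != 0)
  meaningful >= min_size * (num / 2)
}

print(uses_sampling(rep(1, 12800)))                # enough meaningful bases
print(uses_sampling(c(rep(0, 20000), rep(1, 100))))  # long but mostly empty
```

Note that the test counts meaningful bases rather than the total length, so a long vector that is mostly zeros can still fall below the threshold.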

Fitted pcKeepComp value

Oscar Flores oflores@mmb.pcb.ub.es, David Rosell david.rosell@irbbarcelona.org

# Load the package and the example dataset
library(nucleR)
data(nucleosome_htseq)
data <- as.vector(coverage.rpm(nucleosome_htseq)[[1]])

# Get recommended pcKeepComp value
pckeepcomp <- pcKeepCompDetect(data, cor.target=0.99)
print(pckeepcomp)

# Call filterFFT
f1 <- filterFFT(data, pcKeepComp=pckeepcomp)

# filterFFT can also run the detection itself in a single call
f2 <- filterFFT(data, pcKeepComp="auto", cor.target=0.99)

# Plot
library(ggplot2)
i <- 1:2000
plot_data <- rbind(
    data.frame(x=i, y=data[i], coverage="original"),
    data.frame(x=i, y=f1[i], coverage="two calls"),
    data.frame(x=i, y=f2[i], coverage="one call")
)
# qplot() is deprecated in recent ggplot2 releases; ggplot() is equivalent here
ggplot(plot_data, aes(x=x, y=y, color=coverage)) +
  geom_line() +
  labs(x="position", y="coverage")