overfitting: Overfitting diagnostic functions
In BayesPeak: Bayesian Analysis of ChIP-seq Data


Three functions that provide diagnostic plots and tools to mitigate the effects of overfitting.

plot.overfitdiag(x, whatX = "lambda1", whatY = "score",
	logX = TRUE, logY = FALSE,
	main = "Overfitting diagnostic", ...)
identify.overfitdiag(x, whatX = "lambda1", whatY = "score",
	logX = TRUE, logY = FALSE,
	main = "Overfitting diagnostic", ...)
region.overfitdiag(x, whatX = "lambda1", whatY = "score",
	logX = TRUE, logY = FALSE,
	main = "Overfitting diagnostic", ...)

`x`	Raw output from the `bayespeak` function.
`whatX, whatY`	Character. The quantities to plot on the X and Y axes. Common choices would be `"lambda1"`, `"score"`, `"calls"`. Any choice in `names(raw.output$QC)` is, in theory, acceptable (except for `"chr"` and `"status"`, which do not correspond to numeric quantities).
`logX, logY`	Logical. If TRUE, the quantity on the corresponding axis undergoes a log transformation before being plotted.
`main`	Title of plot (corresponds to `main` argument in `plot` function).
`...`	Further arguments. `plot.overfitdiag` passes these through to `plot`. `identify.overfitdiag` passes these through to `identify`. `region.overfitdiag` passes these through to `plot.overfitdiag`.

These three functions are used to investigate the prevalence of overfitting in a data set, and to aid selection of sensible criteria for performing overfitting corrections.

plot.overfitdiag provides a scatterplot of the key parameters associated with jobs. Please see section 9 of the vignette for a description of how to interpret this information.

identify.overfitdiag is used after a call to plot.overfitdiag, with the same arguments, to find out which job was plotted at a particular location. The interface is operated in the same manner as identify: left-click on the plot to label the job closest to that point, and right-click on the plot to end this process.

region.overfitdiag is used to define an overfit region on the plot, and to return the jobs in that region. The function is used in the same manner as locator: left-click on the plot to define the vertices of a polygon, and then right-click anywhere to close the polygon (there is no need to left-click on the first vertex again). The area selected will be filled in with red hatching. The function then returns the IDs of the jobs in the hatched area. Typically, this output will be used as the exclude.jobs argument in summarize.peaks.
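As a non-interactive illustration of the kind of selection region.overfitdiag performs, the sketch below tests which points of a scatterplot fall inside a polygon. It is not BayesPeak's implementation: the helper points_in_polygon, the toy data frame qc, and the choice to define the polygon in log-transformed coordinates are all assumptions made for illustration, using base R only.

```r
# Hypothetical helper (not part of BayesPeak): ray-casting point-in-polygon test.
# px, py: point coordinates; vx, vy: polygon vertices in drawing order.
# The polygon is closed implicitly (last vertex joins the first).
points_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- rep(FALSE, length(px))
  j <- n
  for (i in seq_len(n)) {
    # Does the edge from vertex j to vertex i straddle the horizontal
    # line through each point?
    straddles <- (vy[i] > py) != (vy[j] > py)
    # x-coordinate at which the edge crosses that horizontal line
    # (non-finite for horizontal edges, which never straddle)
    x_cross <- (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]
    # Each crossing to the right of the point flips its inside/outside state
    inside <- xor(inside, straddles & (px < x_cross))
    j <- i
  }
  inside
}

# Toy stand-in for per-job diagnostics (made-up values, not real output)
qc <- data.frame(job   = 1:5,
                 calls = c(10, 100, 1000, 5000, 20000),
                 score = c(0.9, 0.8, 0.5, 0.2, 0.1))

# Assumed convention: the polygon lives in the plotted (log calls, score) space
x <- log(qc$calls)
y <- qc$score
vx <- log(c(2000, 50000, 50000, 2000))
vy <- c(0, 0, 0.3, 0.3)

# Jobs with many calls and low scores fall inside the rectangle
sel <- qc$job[points_in_polygon(x, y, vx, vy)]
sel
```

Here sel plays the role of the integer vector of job IDs that would be passed on as exclude.jobs; in real use the vertices come from mouse clicks rather than being typed in.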

All three functions output to the active graphical device. In addition, identify.overfitdiag and region.overfitdiag return integer vectors corresponding to the jobs selected on the plot.

Jonathan Cairns

library(BayesPeak)
data(raw.output)

plot.overfitdiag(raw.output)

##recreate figures in vignette
plot.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "lambda1", logY = TRUE)
plot.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)

## Not run: 

##identify particular jobs in the plot
plot.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)
identify.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)

##define an overfit region
##left-click to define the polygon vertices, right-click to close the polygon
sel <- region.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)
output <- summarize.peaks(raw.output, exclude.jobs = sel)


## End(Not run)