Overfitting diagnostic functions

Description

Three functions that provide diagnostic plots and tools to mitigate the effects of overfitting.

Usage

plot.overfitdiag(x, whatX = "lambda1", whatY = "score",
	logX = TRUE, logY = FALSE,
	main = "Overfitting diagnostic", ...)
identify.overfitdiag(x, whatX = "lambda1", whatY = "score",
	logX = TRUE, logY = FALSE,
	main = "Overfitting diagnostic", ...)
region.overfitdiag(x, whatX = "lambda1", whatY = "score",
	logX = TRUE, logY = FALSE,
	main = "Overfitting diagnostic", ...)

Arguments

x

Raw output from the bayespeak function.

whatX, whatY

Character. The quantities to plot on the X and Y axes. Common choices would be "lambda1", "score", "calls". Any choice in names(raw.output$QC) is, in theory, acceptable (except for "chr" and "status", which do not correspond to numeric quantities).

logX, logY

Logical. If TRUE, the quantity on the corresponding axis undergoes a log transformation before being plotted.

main

Title of plot (corresponds to main argument in plot function).

...

Further arguments.

  • plot.overfitdiag passes these through to plot.

  • identify.overfitdiag passes these through to identify.

  • region.overfitdiag passes these through to plot.overfitdiag.
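As a rough illustration of which names are usable for whatX and whatY, the sketch below filters a QC-style table down to its numeric columns. The data frame qc is a hypothetical stand-in for raw.output$QC (its values and the "normal" status labels are invented for the example), so treat this as an assumption about the table's shape rather than a description of the package's actual output.

```r
# Hypothetical stand-in for raw.output$QC; the real table is produced by bayespeak().
qc <- data.frame(
  chr     = c("chr16", "chr16", "chr16"),
  calls   = c(12L, 3L, 40L),
  score   = c(0.8, 0.1, 2.5),
  lambda1 = c(5.2, 110.0, 7.9),
  status  = c("normal", "normal", "normal"),  # invented labels
  stringsAsFactors = FALSE
)

# Keep only columns holding numeric quantities; these are the
# candidates for whatX / whatY ("chr" and "status" drop out).
plottable <- names(qc)[vapply(qc, is.numeric, logical(1))]
print(plottable)
```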

Details

These three functions are used to investigate the prevalence of overfitting in a data set, and to aid selection of sensible criteria for performing overfitting corrections.

plot.overfitdiag provides a scatterplot of the key parameters associated with jobs. Please see section 9 of the vignette for a description of how to interpret this information.

identify.overfitdiag is used after a call to plot.overfitdiag, with the same arguments, to find out which job was plotted at a particular location. The interface works in the same manner as identify: left-click on the plot to label the job closest to that point, and right-click on the plot to end the process.
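The "closest job" lookup can be approximated without a graphics device. The sketch below is not the package's implementation: it uses a hypothetical jobs table and a made-up click position, and measures distance in log-transformed plot coordinates, whereas identify itself measures distance on the device.

```r
# Hypothetical per-job summaries (not real BayesPeak output).
jobs <- data.frame(
  job   = 1:4,
  calls = c(10, 100, 1000, 50),
  score = c(0.5, 2, 30, 1)
)

# Coordinates as they would appear with logX = TRUE, logY = TRUE.
x <- log10(jobs$calls)
y <- log10(jobs$score)

# A made-up click position, expressed on the same log scale.
click_x <- log10(90)
click_y <- log10(2.2)

# The job whose plotted point lies nearest to the click.
nearest <- jobs$job[which.min((x - click_x)^2 + (y - click_y)^2)]
print(nearest)
```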

region.overfitdiag is used to define an overfit region on the plot and return the jobs in that region. The function is used in the same manner as locator: left-click on the plot to define the vertices of a polygon, then right-click anywhere to close the polygon (there is no need to left-click on the first vertex again). The selected area is filled in with red hatching, and the function returns the IDs of the jobs in the hatched area. Typically, this output is passed as the exclude.jobs argument to summarize.peaks.
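For scripted or reproducible analyses, a similar selection can be made without clicking. The following is a minimal sketch, assuming a hypothetical jobs table and a hand-written polygon in log10 coordinates; point_in_polygon is an illustrative helper based on the standard ray-casting test, not a function from the package.

```r
# Illustrative helper (not part of BayesPeak): ray-casting point-in-polygon test.
# px, py: point coordinates; vx, vy: polygon vertices in order (closed implicitly).
point_in_polygon <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- logical(length(px))
  j <- n
  for (i in seq_len(n)) {
    # Does a horizontal ray from the point cross the edge between vertices j and i?
    crosses <- ((vy[i] > py) != (vy[j] > py)) &
      (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i])
    inside <- xor(inside, crosses)
    j <- i
  }
  inside
}

# Hypothetical per-job summaries (not real BayesPeak output).
jobs <- data.frame(
  job   = 1:5,
  calls = c(10, 20, 5000, 8000, 300),
  score = c(1, 2, 0.01, 0.02, 5)
)

# Hand-written "overfit region": many calls, low score, in log10 coordinates.
vx <- c(3, 4.5, 4.5, 3)
vy <- c(-2.5, -2.5, -1, -1)

sel <- jobs$job[point_in_polygon(log10(jobs$calls), log10(jobs$score), vx, vy)]
print(sel)
```

A vector like sel plays the same role as the return value of region.overfitdiag, so it could in principle be supplied as exclude.jobs to summarize.peaks, provided the job IDs match those in the raw output.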

Value

All three functions output to the active graphical device. In addition, identify.overfitdiag and region.overfitdiag return integer vectors corresponding to the jobs selected on the plot.

Author(s)

Jonathan Cairns

Examples

library(BayesPeak)
data(raw.output)

plot.overfitdiag(raw.output)

##recreate figures in vignette
plot.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "lambda1", logY = TRUE)
plot.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)

## Not run: 

##identify particular jobs in the plot
plot.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)
identify.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)

##define an overfit region
##left-click to define the polygon vertices, right-click to close the polygon
sel <- region.overfitdiag(raw.output, whatX="calls", logX = TRUE, whatY = "score", logY = TRUE)
output <- summarize.peaks(raw.output, exclude.jobs = sel)


## End(Not run)