Description Usage Arguments Details Value Box Sequence Object Attributes Author(s) See Also Examples
This is the primary function for the sdtoolkit
package. It organizes many subsidiary functions into an interactive session to aid the user in identifying policy-relevant scenarios. It is based on Friedman and Fisher's PRIM, but includes additional modifications and diagnostics to better suit the scenario discovery task.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | sdprim(x, y = NULL,
thresh = NULL,
peel.alpha = 0.1,
paste.alpha = 0.05,
mass.min = 0.001,
pasting = TRUE,
box.init = NULL,
coverage = TRUE,
outfile = "boxsum.txt",
csvfile = "primboxes.csv",
repro = TRUE,
nbump = 10,
dfrac = 0.5,
threshtype = ">",
trajplot_xlim = c(0,1),
trajplot_ylim = c(0,1),
peel_crit = 1)
|
x |
Matrix of input variables - all values must be numeric at present. (No non-binary categorical values allowed.) |
y |
Output vector, which may or may not be thresholded to be zeros and ones. If not thresholded, you must specify both |
thresh |
Numeric - if |
peel.alpha |
Peeling parameter for PRIM. Typically around .05 to .1. |
paste.alpha |
Pasting parameter for PRIM. |
mass.min |
Minimum fraction of original points that must remain to allow continued peeling. |
pasting |
Logical indicating whether or not to paste. |
box.init |
Matrix, providing specification of the initial box bounds, if for some reason you wanted to limit the area over which it searched. The format is 2 rows by |
coverage |
Logical, indicating whether or not to provide coverage-oriented statistics during plotting, rather than support-oriented statistics |
outfile |
Optional character, naming a text file where the box summary information will be copied. Set equal to |
csvfile |
Optional character, naming a default csv file where CARs-readable output will be sent. Regardless of what you put here, |
repro |
Logical, specifying whether or not to automatically generate reproducibility statistics by rerunning PRIM on random samplings of the dataset. The only reason to set to FALSE is if there is a prohibitively long run time. Another alternative is to lower the number of resamplings, |
nbump |
Integer - If |
dfrac |
Numeric between 0 and 1. If |
threshtype |
Required only if the output data is not already in a zero-one formate. A relational operator entered as a character (either “<”,“>”,“<=” or “>=”, describing how to values in |
trajplot_xlim |
x-limits on the peeling trajectory plot. Must be either a vector that can be passed as xlim argument to plot function, or NULL to make x limits narrowly around the data. |
trajplot_ylim |
y-limits on the peeling trajectory plot. Must be either a vector that can be passed as ylim argument to plot function, or NULL to make y limits narrowly around the data. |
peel_crit |
Either 1, 2, or 3, 1 is default – choose peels based on maximizing density in remaining box. 2 is normalize improvement by peeled box support, 3 is minimize mean in peeled box. See F&F 1999 Sec 14.1. |
Here we discuss several terminology issues that will be useful in understanding the output below, and then describe the interactive process used to identify scenarios.
This package is generally oriented around producing boxes, which are defined by a set of orthogonal restrictions on the input (uncertainty) space of a model. In a policy context, a box or set of related boxes can be interpreted as a scenario. See the references at the end of this entry for a more thorough explanation of the concept of analytically generated scenarios.
Two important measures of scenario adequacy are termed coverage and density. These are defined based on the binary output variable, which typically denotes some measure of “interestingness” of the input-output combination. Coverage is the ratio of interesting points (those with an output value of one) captured to the total interesting points in the dataset, and density is the fraction of captured points that are actually interesting (ones in the box to total points in the box). (These have analogues to Type I and Type II errors and other measures from information theory.)
The basic PRIM algorithm operates in two steps, first generating a trajectory of boxes (in coverage-density space) that provide a tradeoff frontier for scenarios, with coverage, density and the number of restricted dimensions generally in tension. From this trajectory, the user selects and further inspects one or more candidate boxes that appropriately balance the measures of interest. After identifying and possibly modifying a box, the data points contained within that box are removed from the input matrix, and a new trajectory is generated. This process (covering) can continue until the user is satisfied with the total coverage achieved, or until other conditions prohibit further covering. The resulting set of selected boxes is referred to as a box sequence, and is the primary product of the PRIM algorithm.
The sdprim
algorithm is very interactive, and the user will receive several prompts while running it. Note that during this process, at least on MS Windows versions, you will receive a “busy” hourglass even when you have an activated R console awaiting user input. Input can still (must) be entered in this state.
The first asks whether they would like the peeling trajectory displayed with dimension contours, with dominating points, or in the standard form.
Dimension contours represent coverage and density values achieved by boxes while holding the total number of restricted dimensions constant. They do not represent a complete or optimal contour derived by considering the frontier of all possible boxes of a given dimension, but rather simply display the coverage and density associated with boxes that could be created by dropping dimensions (in order of least importance) from other boxes in the current trajectory. That is, if there is a 3 dimensional box with restrictions x1 < 1, x8 > 5, x2 < 200, then the box defined by x1 < 1 and x8 >5 will be included in the contour for 2-d boxes. Often, each point on a dimension contour will be defined by different restrictions on the same set of dimensions, but this is not always true.
Dominating points are those coverage and density combinations which satisfy two conditions: They are associated with boxes generated by dropping dimensions from those boxes on the trajectory, and they lie “above and to the right” of the current trajectory. That is, they are more simply defined, and also extend the density coverage frontier. While a dominating point may not represent an overall ideal tradeoff between density, coverage and interpretability, it should be locally preferred to points nearby.
After selecting which display option they would like, the user is then asked to select candidate points from the trajectory displayed. Clicking on points displays a number for the box associated with that point. When clicking on points on the original trajectory (denoted by large filled circles), the number shown refers to the index number of exactly that box. When clicking on dominating points or points on contour lines, the number shown references the box on the trajectory from which the box represented by the point of interest was derived. That is, points not on the trajectory are all derived by taking some box on the trajectory and removing some dimension restrictions - and the number shown when you click on those points refers to the original full box. This should be born in mind later on.
After identifying candidate boxes, their statistics are displayed in the R console. The user then picks a box to consider in more detail. The program then shows various additional information about the box, and afterwards the user is given the opportunity to drop specific dimension restrictions. This completes the process of selecting a single box. The user is then given opportunity to continue covering, in which they repeat the process above on all the data not encompassed by the previously selected box.
A box sequence object, which is a list, each entry containing a box object (which, in an ideal more formalized world, would be a class). A box object contains a great deal of information about the particular box in question, including it's definition and associated statistics (described below). Additionally, the box sequence object contains two attributes, estats
and olap
which are ensemble statistics for the entire box sequence. While the structure of the output below may be of interest for advanced users, there is no need for the non-R user to be familiar with these outputs, as there are multiple functions for interpreting and displaying the output in a more friendly manner, such as seqinfo
and dimplot
.
y.mean |
Numeric, the mean of the points inside the box. If the output data is 0-1, then this is the density. Note that if this is something other than the first box in the sequence, the density is contingent on the covering process up to that point. |
box |
A 2 by d matrix giving the absolute bounds of the box, including those bounds that were not restricted. For unrestricted dimension ends, they are taken from the range of the data along that dimension. |
mass |
The number of points inside the box, expressed as a fraction of the (sub)dataset used to generate this box. The “full” mass can be found in the |
dimlist |
A list of three logical vectors ( |
morestats |
A matrix with one row per restricted box definition, which contains the “remove variable” statistics. The columns are as follows: Column number in input matrix, density, coverage, support. |
relcoverage |
Tracks the coverage of the entire trajectory from which this box was taken, with it normalized to the subdataset the box was taken from. |
pvvalist |
A matrix giving the quasi-p-values for each dimension restriction. |
freqmat |
A matrix giving the reproducibility statistics (assuming the |
index |
For the sake of reproducibility, this gives the index of the box in the trajectory from which it was selected. Note that, unless this is the first box in the sequence, the accuracy of this index is contingent on selection of the identical index, dimensions and restrictions for the previous boxes, and parameters for PRIM - ie, it refers to the trajectory that results after the previous boxes in the sequence have been selected. |
relbox |
A matrix of structure similar to |
The overall box sequence object returns three attributes as well. The list estats
contains “ensemble” statistics for the entire dataset, as follows:
Total coverage for the box sequence.
Total support for the box sequence.
Overall density for the space encompassed by the box sequence.
Total points in the dataset used.
Total number of interesting points in the dataset (after thresholding).
Total number of interesting points captured by the box sequence.
Total number of points captured by the box sequence.
Total dimensions in the input data.
The attribute olaps
contains a matrix that holds several pieces of information. The diagonals display the absolute coverage of box i (in position matrix[i,i]). The lower triangle displays the the overlap in total interesting points between boxes i and j. The upper triangle displays the total points in common. All are expressed as a fraction of total points in the dataset.
Lastly, the attribute data
contains the data used to generate the boxes, augmented by columns full of 0's and 1's. The first augmented column is named 'in.seq' and gives a 1 for any point that falls in any box in the box set, and the remaining augmented columns are titled 'in.BX', where 'X' is the box number in the box sequence. To access this data frame, use mynewdata <- attr(myboxes, "data")
.
Benjamin P. Bryant, bryant@prgs.edu
sd.start
for reading in and cleaning data, seqinfo
for viewing the output of sdprim
, and dimplot
for visualizing dimension restrictions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | #Load some example data to play with:
data(quakes)
#quakes is a 1000 by 5 dataset of earthquake information. This has no obvious
#policy significance, but we can use this built-in dataset to illustrate the use
#of PRIM.
#Here are the columns:
colnames(quakes)
#We will say magnitude is the output of interest, and call earthquakes greater
#5.0 'interesting.' We can then call sdprim two different ways.
#First, make an input matrix from columns 1,2,3 and 5
inputs <- quakes[,c(1:3,5)] #could also do quakes[,-4]
#Now put our unthresholded y vector:
yout <- quakes[,"mag"] #could also do quakes[,4]
#Now we can either call sdprim and threshold inside PRIM, like this:
## Not run: myboxes <- sdprim(x=inputs, y=yout, thresh=5.0, threshtype=">")
#Or we can first threshold yout:
ythresh <- 1*(yout>5.0)
#and then call sdprim without worrying about the thresholds:
## Not run: myboxes <- sdprim(x=inputs, y=ythresh)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.