Description Usage Arguments Value Note Author(s) See Also Examples
This function exports five R source code files containing low-level idioms of the language so that the Cubist model can be reused or idiom implemented without having the Cubist package itself. The function, thus, permits dependency free archival of a Cubist model based on the basic functionality of the language.
For idiom implementation, it is important that the concept of “neighbors” as a feature and argument to the prediction of Cubist not be used. The pursuit here, for a given row of the input matrix, is to isolate by name each applicable rule and then trigger a prediction for each rule and then pool the estimate by some method.
1 2 3 | cubist_asR_idioms(cubist_object, cubist_tag="C0", nut_digits=3,
path=".", prefix="", tidyrule.return=FALSE,
sample_str="", var.names=NULL, write.utility.file=TRUE, ...)
|
cubist_object |
Either a Cubist object (model) from the Cubist package on which the function |
cubist_tag |
A string thought of a being a small fragment of arbitrary text chosen by a user that identifies a particular Cubist model. This tag is critically important and useful for subroutines on export and the R files will have this tag in their names. In imagined use, it is expected that the tag would be a just one or two characters long; |
nut_digits |
The number of digits to pass to the |
path |
The directory path to which the various R files are to be written; |
prefix |
A string that is prefixed to the output file names of this function. This option is useful if the |
tidyrule.return |
A logical triggering the |
sample_str |
An arbitrary string of sample-size summary information or really any other content about the Cubist model that the user wants to give over to the idiom implementation by creating a variable |
var.names |
The comprehensive list of variable names given over to the Cubist fitting. These names need to be specified at least once for a given use of |
write.utility.file |
A logical to trigger the writing of the |
... |
Additional arguments to pass (currently not used); |
This function is used for its side effects written to the file system, but can be used to return without modification the results of the tidyrules::tidyRules()
function.
Basic description of the written files by this function follow in this section.
Utilities: The file cubist_utils.R
contains cubist_tag
independent logic that is used as an abstraction layer to access the various idioms specific to the Cubist rules, branches, and nuts. The function therein getVARIABLE_NAMES
returns a vector of variable names given over to Cubist. This vector could be used to testing external incoming input data frames as having all of the variables required. The function therein useBRANCH
is the abstraction layer to access the linear model at the end of each rule by evaluation of its function name; this material is the linear model as shown by tidyrules::tidyRules()
. Finally, the function therein getNUTS
is the abstraction layer to access the diagnostics about the rule as shown by tidyrules::tidyRules()
that include the support (number of observations) for the rule, mean response value for the rule, minimum response value for the rule, maximum response value for the rule, and error in units of the response for the linear model.
W.H. Asquith
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 | # We set the simulation sample size to 1,000; the number of committees is 3.
nsim <- 1000; committees <- 3
# The high-level design is to have the ability to tag a given Cubist object.
# The tag will be used as output files and subroutines and variables therein
# are written.
cubist_tag <- 9 # say our "ninth" Cubist object
set.seed(cubist_tag) # arbitrary here, but setting the seed for reproducibility
cubist_tag <- paste0("C",cubist_tag) # create the tag "C9" to be used
# simulate some X and Y data
X <- sort(runif(nsim, min=0, max=2))
Y <- 0.34*sin(2*pi*X) - .74*cos(2*pi*X) -
0.14*sin(4*pi*X) + .19*cos(4*pi*X) + rnorm(nsim, sd=0.2)
X <- data.frame(X=X, pi=pi) # the design that follows both by Cubist
# and by the idiom functions in this package need at least two columns
# even if the second column containing pi is never used.
# We can foreshadow some type of cross-validation, but here just use all
phi <- 1 # when the phi parameter is set to unity.
inSchool <- sample(1:nrow(X), floor(phi*nrow(X)))
Xs <- X[ inSchool,]; Xt <- X[-inSchool,]
Ys <- Y[ inSchool ]; Yt <- Y[-inSchool ]
# make the Cubist model
cubist_tree <- Cubist::cubist(x=Xs, y=Ys, committees=committees)
# user could inspect tidy_rules and see what is happening internally
# to functions of the Cubist.Idioms package, but we will let the package
# call the tidyRules internally in cubist_asR_idioms().
# tidy_rules <- tidyrules::tidyRules(cubist_tree)
# later a diagnostic plot is made, so get the Cubist predictions
# neighbors=0 is default but critical as the idioms do not support
cubist_pred <- predict(cubist_tree, X, neighbors=0) #
# construct a string as described above, here a colon-delimited,
# equal-sign keyed sequence of content that we want preserved in
# the exported Cubist idioms for access again when we build
# infrastructure to use the idioms, not otherwise in this demo
txt <- paste0("committees=",committees,":sample_size=",nsim)
tmpath <- tempdir() # temporary directory
message("temporary path = ",tmpath)
cubist_asR_idioms(cubist_tree, cubist_tag=cubist_tag,
path=tmpath, sample_str=txt)
files <- list.files(path=tmpath, pattern=paste0(cubist_tag,".+.R"))
files <- c(files, "cubist_utils.R")
print(files)
# [1] "C9cubist_cubes.R" "C9cubist_funcs.R" "C9cubist_nuts.R"
# [4] "C9cubist_rules.R" "cubist_utils.R"
# so four source files of idioms for the C9 tagged Cubist were
# made, and the cubist_utils.R is tag-independent and has a couple
# of accessor functions to work with the big picture
# now begin a minimal recipe on use of the idioms and hence breaking
# from a dependency of having to have Cubist library around to make
# the predictions. first, idiom files are sourced, here five files:
for(file in files) source(paste0(tmpath,"/",file))
# now some trickery, in case we had multiple tagged Cubist models
# loaded via the idiom export, lets dynamically construct an
# interface called CUBES() that for a row of input matrix will provide
# the applicable rules.
CUBES <- eval(parse(text=paste0("CUBES_",cubist_tag)))
# initialize the predictions, here we want to demonstrate two
# the use of mean() is basically how Cubist works if neighbors=0, but
# because we have errors, a weighted mean is also accessible.
# is.na()'s (if they hit) are trapped separately, so we do not use
# na.rm=TRUE for the mean() or weighted.mean()
Y_by_idioms_mean <- Y_by_idioms_wgt <- rep(NA, nrow(X))
# loop through the rows of the input matrix
for(i in 1:nrow(X)) {
x <- X[i,] # isolate a row
rl <- CUBES(x) # figure out which rules are applicable
rl <- rl[! is.na(rl)] # all rules for each committee are tested
if(length(rl) == 0) { # so we need to strip the NA's and if zero
Y_by_idioms_mean[i] <- NA # then trap the situation and make
Y_by_idioms_wgt[i] <- NA # the prediction NA
next
}
# the "branch" here is the actually linear function sitting at the
# end of the rule; the "nut" here are the diagnostics of the linear
# function at the end of the branch.
pr <- useBRANCH(x, rules=rl, cubes=CUBES) # vector of predictions
nt <- getNUTS( rules=rl, cubes=CUBES) # data frame of diagnostics
Y_by_idioms_mean[i] <- mean(pr) # compute the predicted mean
wgts <- 1/nt$error; wgts <- wgts/sum(wgts) # weighted mean
Y_by_idioms_wgt[i] <- sum(wgts*pr) # requires sum weights == 1
}
message("Last row in X matrix had these branches and nuts\n")
print(nt) # showing here the structure idea of the "nut"
# support mean min max error
# MODC9_COM01_BRA002 172 -0.566 -1.293 -0.048 0.174
# MODC9_COM02_BRA001 80 -0.573 -1.269 -0.117 0.168
# MODC9_COM03_BRA002 172 -0.566 -1.293 -0.048 0.174
# the nut shows us by row.names(), which rules and committees
# tested as applicable for the last row of the input, three here
plot(X$X,Y, type="p", lwd=0.4, col=grey(0.2),
xlab="X variable", ylab="Y variable",
xlim=c(0,2), ylim=c(-1,2))
lines(X$X, cubist_pred, col=3, lwd=7)
lines(X$X, Y_by_idioms_mean, col=2, lwd=3)
lines(X$X, Y_by_idioms_wgt, col="#2E5090", lwd=1.3, lty=2)
legend("topleft", c("Simulated data", "Cubist itself",
"Cubist idioms (mean)",
"Cubist idioms (weighted mean)"),
lty=c(0, 1, 1, 2), lwd=c(0, 7, 3, 1.3), bty="n",
pch=c(1, NA, NA, NA), pt.lwd=c(0.4, NA, NA, NA),
col=c(grey(0.2), 3, 2, "#2E5090"), cex=0.8) #
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.