View source: R/PGC_Bag_of_Prunes_v200829.R
MARSquake is a bagged and perturbed ensemble of completely overfitting MARS base learners. A data augmentation option can be activated to enhance the potential of perturbation when only a few features are available. As argued in the paper referenced below, proper randomization implicitly delivers the optimally stopped MARS. Hence, as with Random Forests, cross-validation is not necessary to avoid overfitting. However, tweaking some randomization hyperparameters may still marginally improve performance. Currently, the function only supports regression. For more details and thorough explanations, see Appendix A.2 of the paper.
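To make the "bagged and perturbed ensemble" idea concrete, here is a minimal base-R sketch. It is not the package's implementation: `bagged_ensemble` is a hypothetical name, `lm.fit` stands in for the overfitting MARS base learner that MARSquake builds with "earth", and the rounding rules for `mtry` and `sampling.rate` are assumptions. The default values are illustrative only.

```r
# Hypothetical sketch, NOT the package's code.
# lm.fit() is a stand-in for the overfitting MARS base learner.
bagged_ensemble <- function(y, X, X.new, B = 50, mtry = 0.5,
                            sampling.rate = 0.75, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  X.new <- as.matrix(X.new)
  n <- nrow(X)
  p <- ncol(X)
  n_sub <- max(2, floor(sampling.rate * n))  # rows per base learner (assumed rule)
  p_sub <- max(1, round(mtry * p))           # features per base learner (assumed rule)
  all_X <- rbind(X, X.new)
  preds <- matrix(NA_real_, nrow(all_X), B)
  for (b in seq_len(B)) {
    rows <- sample.int(n, n_sub)             # subsample observations
    cols <- sample.int(p, p_sub)             # random feature subset
    fit <- lm.fit(cbind(1, X[rows, cols, drop = FALSE]), y[rows])
    coefs <- fit$coefficients
    coefs[is.na(coefs)] <- 0                 # guard against rank deficiency
    preds[, b] <- cbind(1, all_X[, cols, drop = FALSE]) %*% coefs
  }
  # Average over ensemble members: training fitted values first, then test predictions
  rowMeans(preds)
}

set.seed(1)
X <- matrix(rnorm(100 * 4), 100, 4)
X.new <- matrix(rnorm(50 * 4), 50, 4)
y <- rowSums(X) + rnorm(100)
out <- bagged_ensemble(y, X, X.new, B = 20)
length(out)  # one value per training row plus one per test row
```

Averaging many randomized base learners is what replaces explicit pruning here; with a linear stand-in the effect is far weaker than with overfitting MARS learners, so the sketch only shows the mechanics.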
y | Training target.
X | Training features.
X.new | Features for test set prediction.
B | Number of ensemble members.
mtry | Fraction of randomly selected features considered for each base learner.
sampling.rate | Subsampling rate.
data.aug | Should the feature matrix be augmented with two noisy pseudo-carbon copies of X?
noise.level | Standard deviation of the Gaussian noise added to the copies of the continuous variables of X (when data.aug=TRUE). Note that the X's are standardized beforehand.
shuffle.rate | Fraction of observations shuffled for non-continuous regressors when data.aug=TRUE.
make.sure.it.overfits | If TRUE, partially forces "earth" to overfit (in-sample) when it is recalcitrant to do so (which happens when features are scarce).
mars.mtry.frac | Fraction of randomly selected features used as potential candidates at each step of the "earth" forward pass. Analogous to "mtry" in Random Forests.
degree | Option on base learners. See the "earth" package.
prune | Option on base learners. See the "earth" package. The whole point of the related paper is that there is little to no benefit in using anything other than "none". However, in data sets where perfect randomization seems unattainable, this could be worth exploring.
nk | Option on base learners. See the "earth" package. Keep NULL unless you really know what you are doing.
fix.seeds | Fixes the random seeds for replicability.
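The interplay of data.aug, noise.level and shuffle.rate can be sketched as follows. This is only an illustration of the description above, not the package's code: `augment_features` is a hypothetical helper, the rule used to detect non-continuous columns (at most two distinct values) is an assumption, and the values passed for noise.level and shuffle.rate are arbitrary rather than package defaults.

```r
# Hypothetical sketch of the augmentation described above; the package may differ.
augment_features <- function(X, noise.level, shuffle.rate, n.copies = 2) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Assumption: a column with more than 2 distinct values counts as continuous
  is_cont <- apply(X, 2, function(v) length(unique(v)) > 2)
  Xs <- X
  if (any(is_cont)) {
    Xs[, is_cont] <- scale(X[, is_cont, drop = FALSE])  # standardize beforehand
  }
  make_copy <- function() {
    Z <- Xs
    for (j in seq_len(ncol(Z))) {
      if (is_cont[j]) {
        # Continuous column: add Gaussian noise with sd = noise.level
        Z[, j] <- Z[, j] + rnorm(n, sd = noise.level)
      } else {
        # Non-continuous column: permute a fraction shuffle.rate of the rows
        idx <- sample.int(n, floor(shuffle.rate * n))
        if (length(idx) > 1) {
          Z[idx, j] <- Z[idx[sample.int(length(idx))], j]
        }
      }
    }
    Z
  }
  copies <- replicate(n.copies, make_copy(), simplify = FALSE)
  do.call(cbind, c(list(Xs), copies))  # standardized X followed by its noisy copies
}

set.seed(2)
X <- cbind(rnorm(40), rnorm(40), rbinom(40, 1, 0.5))
A <- augment_features(X, noise.level = 0.3, shuffle.rate = 0.2)
dim(A)  # same rows as X, three times as many columns
```

Widening the feature matrix this way gives the per-learner feature sampling more distinct candidates to draw from, which is the motivation given above for using the option when features are scarce.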
The function returns a vector binding (in this order) training set fitted values and test set predictions.
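Because both parts come back in a single vector, the caller has to split it using the number of training observations. A small sketch, where `split_marsquake_output` is a hypothetical helper (not part of the package) and the `output` vector is a made-up stand-in for a real `MARSquake()` result:

```r
# Hypothetical helper, not part of the package: split the returned vector
# into training fitted values and test set predictions.
split_marsquake_output <- function(output, n.train) {
  stopifnot(n.train >= 1, n.train < length(output))
  list(
    fitted    = output[seq_len(n.train)],
    predicted = output[(n.train + 1):length(output)]
  )
}

# Stand-in for a result with 3 training and 2 test observations
output <- c(1.1, 2.2, 3.3, 4.4, 5.5)
parts <- split_marsquake_output(output, n.train = 3)
parts$fitted     # first n.train elements
parts$predicted  # remaining elements
```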
Philippe Goulet Coulombe
Related paper is available at https://arxiv.org/abs/2008.07063.
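set.seed(200905)
K = 5
# Column 1 is the noise term; columns 2..K are the features
dat = matrix(rnorm(K*200), 200, K)
test = 101:200
train = 1:100
X = dat[train, 2:K]
X.new = dat[test, 2:K]
# Target: sum of the features plus noise
y = crossprod(t(X), rep(1, (K-1))) + dat[train, 1]
y.new = crossprod(t(X.new), rep(1, (K-1))) + dat[test, 1]
# Default settings
output = MARSquake(y, X, X.new)
# Benchmark: RMSE of predicting the training mean
benchmark = sqrt(mean((mean(y) - y.new)^2))
# Relative RMSE; output[test] selects elements 101:200, i.e. the test predictions
sqrt(mean((output[test] - y.new)^2))/benchmark
# Same comparison with data augmentation switched on
output = MARSquake(y, X, X.new, data.aug = TRUE)
sqrt(mean((output[test] - y.new)^2))/benchmark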