submodels | R Documentation
The original study by Gilmour proposed a method for interpreting the adjusted Cp statistic (\bar{C}_p) to identify the best regression model when dealing with a relatively large number of numerical regressors and a smaller number of observations.
(Gilmour, S. G. The interpretation of Mallows's Cp-statistic. The Statistician. 1996, 45(1):49-56. doi:10.2307/2348411)
The procedure consists of two main steps (two diagrams are available in the package vignette):
Initial Selection:
Adjusted \bar{C}_p values are calculated for all possible combinations of regressors.
The submodel with the minimum adjusted \bar{C}_p value is selected and labeled ModelMin.
Reduction and Testing (performed only in the function 'final_model'):
All submodels nested within ModelMin are considered.
Among them, the submodel with the lowest adjusted \bar{C}_p value is selected as a new candidate.
A hypothesis test is then performed where the null hypothesis states that ModelMin provides a better fit than the candidate submodel.
If the null hypothesis is not rejected, ModelMin is accepted as the 'final model'.
If the null hypothesis is rejected, the candidate submodel becomes the new ModelMin, and the process is repeated with submodels nested within this new model.
The procedure continues iteratively until a null hypothesis is not rejected or until the so-called trivial model (i.e., a model using only the arithmetic mean) is reached.
The second step is performed only by the function 'final_model'.
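The initial-selection step above can be sketched in plain R. This is a minimal illustration of the selection logic, not the package's internal code: it enumerates all regressor subsets of mtcars, computes Mallows's Cp against the full-model MSE, applies the adjustment formula defined below, and keeps the minimizer as ModelMin.

```r
dat <- data.frame(y = mtcars[, 1], mtcars[, -1])  # first column = dependent variable
n <- nrow(dat)                                    # number of observations
k <- ncol(dat) - 1                                # regressors in the full model
full <- lm(y ~ ., data = dat)
MSE <- sum(residuals(full)^2) / (n - k - 1)       # estimate of sigma^2 from the full model

regs <- names(dat)[-1]
best <- list(Cp_adj = Inf, vars = NULL)
for (m in seq_len(k)) {
  for (vars in combn(regs, m, simplify = FALSE)) {
    fit <- lm(reformulate(vars, response = "y"), data = dat)
    p <- m + 1                                    # regressors + intercept
    Cp <- sum(residuals(fit)^2) / MSE - n + 2 * p
    Cp_adj <- Cp - 2 * (k - p + 1) / (n - k - 3)  # Gilmour's adjustment
    if (Cp_adj < best$Cp_adj) best <- list(Cp_adj = Cp_adj, vars = vars)
  }
}
best$vars   # the regressors of ModelMin
```

The second (reduction-and-testing) step additionally requires Gilmour's hypothesis test over nested submodels and is not sketched here.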
The function returns all regression results for the full model and for ModelMin.
Additionally, it outputs \bar{C}_p values for all submodels, enabling users to easily generate an adjusted \bar{C}_p vs. p plot (where p is the number of regressors + 1), as recommended by Gilmour.
The input data should be provided as a data.frame, where the first column contains the numerical dependent variable, and the subsequent columns contain the regressors. The package also includes nine illustrative examples to demonstrate its functionality.
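A minimal sketch of the expected input layout; the values and column names below are made up purely for illustration.

```r
# Hypothetical data illustrating the required layout:
# column 1 = numeric dependent variable, columns 2+ = regressors.
d <- data.frame(
  y  = c(12.1, 9.8, 14.3, 11.0, 10.5, 13.2),
  x1 = c(1.2, 0.8, 1.9, 1.1, 0.9, 1.6),
  x2 = c(3.4, 2.1, 4.0, 3.0, 2.8, 3.7)
)
# MyResults <- submodels(d)   # run with the package loaded
```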
Adjusted \bar{C}_p
is calculated using the formula:
\bar{C}_p = C_p - \frac{2(k - p + 1)}{n - k - 3}
where k is the number of regressors in the full model, and p - 1 is the number of regressors in the submodel (thus, p includes the intercept term).
Further details and theoretical background can be found in Gilmour's original paper.
For example, in the case of the trivial model (i.e., a model with only an intercept), the adjusted statistic is:
\bar{C}_1 = C_1 - \frac{2(k - p + 1)}{n - k - 3} = \frac{var(y)(n-1)}{MSE} - n + 2 - \frac{2(k - p + 1)}{n - k - 3}
where p = 1 for the trivial model, the sample variance var(y) is used to calculate the sum of squares of the trivial model, and MSE is the full-model estimate of \sigma^2.
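The adjustment above can be written as a one-line helper. The function name adj_cp is ours for illustration, not a function exported by the package.

```r
# Hypothetical helper implementing Gilmour's adjustment:
# k = regressors in the full model, p = regressors in the submodel + 1,
# n = number of observations.
adj_cp <- function(Cp, k, p, n) Cp - 2 * (k - p + 1) / (n - k - 3)

adj_cp(Cp = 3.5, k = 9, p = 4, n = 26)   # 3.5 - 12/14 = 2.642857
```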
submodels(d)
d: a data.frame whose first column contains the numeric dependent variable and whose remaining columns contain the regressors.

The returned list contains:
- y: numeric vector of the dependent variable (the first column of "d")
- X: numeric matrix of regressors (the remaining columns of "d")
- n: number of observations (numeric)
- k: number of regressors (numeric)
- t: residual degrees of freedom, t = n - k - 1 (numeric)
- regressors: character vector of regressor names
- full_model: textual expression of the full model (string)
- regression results of the full model (a list computed with lm())
- MSE: mean squared error of the full model (numeric)
- submodels_number: number of all possible submodels (numeric)
- Cq_plus_1A: Cp value of model_min, the submodel with the minimal adjusted Cp (numeric)
- q: level of model_min, i.e. the number of regressors in model_min (numeric)
- submodels: matrix with four columns: row number, p, Cp, submodel
- model_min: textual expression of model_min (string)
- regression results of model_min (a list computed with lm())
# the standard well known data
d=mtcars ; d ; MyResults=submodels(d)
# the red points correspond to the starting submodel in the testing process
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# 8 tables are available for illustration of the functions 'submodels' and 'final_model'
# illustrative data from the original Gilmour paper
Gilmour9p;MyResults=submodels(Gilmour9p)
# the red points correspond to the starting submodel in the testing process
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# 12 regressors and 16 observations, simulated data without real meaning
# more submodels; the calculation takes about 5 seconds
# the null hypothesis is not rejected in the first test
T1 ; MyResults=submodels(T1)
# the red points correspond to the starting submodel in the testing process
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# the loop is finished by the Trivial model for data T2
T2 ; MyResults=submodels(T2)
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# Trivial is illustrative data in which the trivial model is model_min, without a testing process
Trivial ; MyResults=submodels(Trivial)
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# special illustrative data for more than two tests in the loop of the function 'final_model'
Modified_Gilmour9p ; MyResults=submodels(Modified_Gilmour9p)
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# number of visitors in parks
# citation: Stemberk Josef, Josef Dolejs, Petra Maresova, Kamil Kuca.
# Factors affecting the number of Visitors in National Parks
# in the Czech Republic, Germany and Austria.
# International Journal of Geo-Information. https://www.mdpi.com/2220-9964/7/3/124
# ISPRS Int. J. Geo-Inf. 2018, 7(3), 124; doi:10.3390/ijgi7030124
d=Parks5p ; rownames(d)= d[,1]; d=d[,-1]; d
MyResults=submodels(d)
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# number of patents in universities (see column names – regressors)
# citation: Perspective and Suitable Research Area
# in Public Research-Patent Analysis of the Czech Public Universities
# Education and Urban Society, 54(7), https://doi.org/10.1177/00131245211027362
Patents5p ; MyResults=submodels(Patents5p)
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1* min(yCp), 0.9* min(yCp))
YRange=c( ymin ,1.5*max(xp))
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="", ylim=YRange ,
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)
# illustrative econometric data from Eurostat for 5 variables in 17 countries in 2019
# columns: LifExp , HDP, Unempl, Obesity, APassangers
d= EU2019
rownames(d)= d[,1]; d=d[,-1]; d # the same data without the first column (country names)
MyResults=submodels(d)
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
# plot without y limits "ylim=c(ymin, 1.5*max(xp))"
plot(yCp ~ xp, xlab="Number of Parameters in Submodel",ylab="",
col=ifelse( round(yCp,4)== round(min(yCp),4), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste( bar(C) , "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ",bar(C),"p ~ p")), side=3, line=3, padj=1, cex=1.2)