Details

For convenience we have labelled the input variables 1 through 11, consistent with the notation used in Miller (2002). Only the first 11 variables were used in Miller's analyses. The best-fitting subset regression with these 11 variables uses only 3 inputs and has a residual sum of squares of 6.77, whereas forward selection produces a best fit with 3 inputs and a residual sum of squares of 21.19. Backward selection and stagewise methods produce similar results. It is remarkable that there is such a big difference. Note that the usual forward and backward selection algorithms may fail, since the linear regression using all 11 variables gives an essentially perfect fit.
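The failure mode described above can be sketched in base R with simulated data (a hypothetical toy example, not the Detroit data): a proxy variable with the highest marginal correlation enters first under forward selection, which then never reaches the much better subset found by exhaustive search.

```r
#Toy illustration: greedy forward selection vs exhaustive search.
#x3 is a noisy proxy for x1 + x2, while y is an exact function of x1, x2.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)
y  <- x1 + x2
d  <- data.frame(y, x1, x2, x3)
vars <- c("x1", "x2", "x3")
rss  <- function(v) deviance(lm(reformulate(v, "y"), data = d))

#Forward selection: x3 has the highest marginal correlation, so it enters
#first, and no size-2 model containing x3 can reach RSS = 0.
first <- vars[which.min(sapply(vars, rss))]
fwd2  <- min(sapply(setdiff(vars, first), function(v) rss(c(first, v))))

#Exhaustive search over all size-2 subsets finds {x1, x2}, with RSS near 0.
exh2  <- min(sapply(combn(vars, 2, simplify = FALSE), rss))
c(forward = fwd2, exhaustive = exh2)
```

The same greedy-vs-exhaustive gap, on a much larger scale, is what the residual sums of squares 6.77 and 21.19 quoted above reflect.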
Format

A data frame with 13 observations on the following 14 variables.
FTP.1
Full-time police per 100,000 population
UEMP.2
Percent unemployed in the population
MAN.3
Number of manufacturing workers in thousands
LIC.4
Number of handgun licences per 100,000 population
GR.5
Number of handgun registrations per 100,000 population
CLEAR.6
Percent homicides cleared by arrests
WM.7
Number of white males in the population
NMAN.8
Number of non-manufacturing workers in thousands
GOV.9
Number of government workers in thousands
HE.10
Average hourly earnings
WE.11
Average weekly earnings
ACC
Death rate in accidents per 100,000 population
ASR
Number of assaults per 100,000 population
HOM
Number of homicides per 100,000 population
Source

The data were originally collected and discussed by Fisher (1976), but the complete dataset first appeared in Gunst and Mason (1980, Appendix A). Miller (2002) discusses this dataset throughout his book. The data were obtained from StatLib.
http://lib.stat.cmu.edu/datasets/detroit
References

Fisher, J.C. (1976). Homicide in Detroit: The Role of Firearms. Criminology, 14, 387-400.
Gunst, R.F. and Mason, R.L. (1980). Regression analysis and its application: A data-oriented approach. Marcel Dekker.
Miller, A. J. (2002). Subset Selection in Regression. 2nd Ed. Chapman & Hall/CRC. Boca Raton.
Examples

#Detroit data example
library(bestglm)
data(Detroit)
#As in Miller (2002) columns 1-11 are used as inputs
p<-11
#For possible comparison with other algorithms such as LARS
# it is preferable to work with the scaled inputs.
#From Miller (2002, Table 3.14), we see that the
#best six inputs are: 1, 2, 4, 6, 7, 11
X<-as.data.frame(scale(Detroit[,c(1,2,4,6,7,11)]))
y<-Detroit[,ncol(Detroit)]
Xy<-cbind(X,HOM=y)
#Backward stepwise regression with BIC selects the full model
out <- lm(HOM~., data=Xy)
step(out, k=log(nrow(Xy)))
#
#Same story with exhaustive search algorithm
out<-bestglm(Xy, IC="BIC")
out
#But many coefficients have p-values that are quite large given
# the selection bias: variables 1, 6 and 7 are only significant at about 5%.
#We can use BICq to reduce the number of variables.
#The qTable lets us choose q for other possible models.
out$qTable
#This suggests we try q=0.05 or q=0.00005
bestglm(Xy,IC="BICq", t=0.05)
bestglm(Xy,IC="BICq", t=0.00005)
#It is interesting that the size-2 subset model is not itself a subset
# of the size-3 model. These results agree with
#Miller (2002, Table 3.14).
#
#Using delete-d CV with d=4 suggests variables 2,4,6,11
set.seed(1233211)
bestglm(Xy, IC="CV", CVArgs=list(Method="d", K=4, REP=50))
Loading required package: leaps
Start: AIC=-11.34
HOM ~ FTP.1 + UEMP.2 + LIC.4 + CLEAR.6 + WM.7 + WE.11
Df Sum of Sq RSS AIC
<none> 1.3659 -11.3357
- WM.7 1 1.2724 2.6383 -5.3427
- CLEAR.6 1 1.3876 2.7535 -4.7871
- FTP.1 1 1.4376 2.8035 -4.5533
- WE.11 1 8.1170 9.4830 11.2888
- UEMP.2 1 16.3112 17.6771 19.3849
- LIC.4 1 20.6368 22.0027 22.2305
Call:
lm(formula = HOM ~ FTP.1 + UEMP.2 + LIC.4 + CLEAR.6 + WM.7 +
WE.11, data = Xy)
Coefficients:
(Intercept) FTP.1 UEMP.2 LIC.4 CLEAR.6 WM.7
25.127 1.724 2.570 5.757 -2.329 -2.452
WE.11
6.084
BIC
BICq equivalent for q in (0.115398370069658, 1)
Best Model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.126923 0.1323333 189.875990 1.439772e-12
FTP.1 1.724110 0.6861084 2.512883 4.572467e-02
UEMP.2 2.569527 0.3035648 8.464511 1.485656e-04
LIC.4 5.757015 0.6046682 9.520948 7.657697e-05
CLEAR.6 -2.329338 0.9435019 -2.468822 4.853518e-02
WM.7 -2.452200 1.0372544 -2.364126 5.596776e-02
WE.11 6.083694 1.0188489 5.971144 9.892298e-04
LogL q1 q2 k
[1,] -35.832829 0.000000e+00 5.144759e-08 0
[2,] -17.767652 5.144759e-08 3.468452e-05 1
[3,] -6.215995 3.468452e-05 1.039797e-04 2
[4,] 4.237691 1.039797e-04 7.680569e-02 3
[5,] 8.006726 7.680569e-02 1.153984e-01 4
[6,] 14.645170 1.153984e-01 1.000000e+00 6
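The qTable above gives, for each log-likelihood and subset size k, the interval of q over which that subset minimizes BICq. A minimal sketch of the criterion itself, assuming the usual definition with a per-variable prior inclusion weight q (the formula is not stated on this page):

```r
#Assumed form of BICq; at q = 0.5 the last term vanishes and BICq
#reduces to the ordinary BIC.
bicq <- function(logL, k, n, q) {
  -2 * logL + k * log(n) - 2 * k * log(q / (1 - q))
}
bicq(logL = 4.237691, k = 3, n = 13, q = 0.5)  #equals plain BIC here
```

Smaller values of q penalize each additional variable more heavily, which is why q=0.00005 below selects a smaller model than q=0.05.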
BICq(q = 0.05)
BICq equivalent for q in (0.000103979673982901, 0.0768056921650394)
Best Model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.12692 0.2406075 104.43119 3.435051e-15
UEMP.2 3.38307 0.2601848 13.00257 3.876404e-07
LIC.4 8.20378 0.2802445 29.27365 3.090409e-10
WE.11 10.90084 0.2787164 39.11089 2.321501e-11
BICq(q = 5e-05)
BICq equivalent for q in (3.46845195655643e-05, 0.000103979673982901)
Best Model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.126923 0.5101048 49.258354 2.871539e-13
LIC.4 4.473245 0.6381795 7.009384 3.673796e-05
CLEAR.6 -13.386666 0.6381795 -20.976334 1.346067e-09
CVd(d = 4, REP = 50)
BICq equivalent for q in (0.076805692165038, 0.115398370069659)
Best Model:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.126923 0.1909731 131.573114 1.244969e-14
UEMP.2 2.571151 0.3840754 6.694391 1.535921e-04
LIC.4 7.270181 0.4337409 16.761574 1.624771e-07
CLEAR.6 -3.250371 1.2964006 -2.507227 3.652839e-02
WE.11 8.329213 1.0492726 7.938083 4.617821e-05
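The delete-d cross-validation used in the last call can be sketched in base R (on hypothetical simulated data, not the Detroit set): repeatedly delete a random set of d observations, refit on the remainder, and average the squared prediction error over the deleted points.

```r
#Delete-d CV sketch: d = 4 held-out observations, REP = 50 repetitions.
set.seed(1233211)
n <- 30; d <- 4; REP <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x, y)
cv_err <- replicate(REP, {
  out <- sample(n, d)                      #delete d observations at random
  fit <- lm(y ~ x, data = dat[-out, ])     #fit on the remaining n - d
  mean((dat$y[out] - predict(fit, dat[out, ]))^2)
})
mean(cv_err)                               #CV estimate of prediction error
```

In bestglm this estimate is computed for each candidate subset, and the subset minimizing it is reported.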