dr: Main function for dimension reduction regression


Description

This is the main function in the dr package. It creates objects of class dr to estimate the central (mean) subspace and perform tests concerning its dimension. Several helper functions that require a dr object can then be applied to the output from this function.

Usage

dr(formula, data, subset, group=NULL, na.action = na.fail, weights, ...)

dr.compute(x, y, weights, group=NULL, method = "sir", chi2approx="bx", ...)

Arguments

formula

a two-sided formula like y~x1+x2+x3, where the left-hand side is a vector or matrix of response variable(s) and the right-hand side variables are the predictors. While any legal formula in the Rogers-Wilkinson notation can appear, dimension reduction methods generally expect the predictors to be numeric, not factors, with no nesting. Full-rank models are recommended, although rank-deficient models are permitted.

The left-hand side of the formula will generally be a single vector, but it can also be a matrix, such as cbind(y1,y2)~x1+x2+x3, if the method is "save" or "sir". Both of these methods are based on slicing, and in the multivariate case the slices are determined by slicing on all the columns of the left-hand side variables, as in the sketch below.
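A minimal sketch of a multivariate response (the data frame mydata and the variables y1, y2, x1-x3 are hypothetical):

# Hedged sketch: hypothetical data; slicing is done jointly on both
# columns of the bivariate response.
fit <- dr(cbind(y1, y2) ~ x1 + x2 + x3, data = mydata, method = "save")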

data

an optional data frame containing the variables in the model. By default the variables are taken from the environment from which ‘dr’ is called.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

group

If used, this argument specifies a grouping variable so that dimension reduction is done separately for each distinct level. This is implemented only when method is one of "sir", "save", or "ire". This argument must be a one-sided formula. For example, ~Location would fit separately for each level of the variable Location. The formula ~A:B would fit separately for each combination of A and B, provided that both have been declared factors.
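A minimal sketch (mydata, y, x1-x3 and the factor Location are hypothetical):

# Hedged sketch: "sir" is fit separately within each level of the
# hypothetical factor Location.
fit <- dr(y ~ x1 + x2 + x3, data = mydata, group = ~Location, method = "sir")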

weights

an optional vector of weights to be used where appropriate. In the context of dimension reduction methods, weights are used to obtain elliptical symmetry, not constant variance.

na.action

a function that indicates what should happen when the data contain NAs. The default is 'na.fail', which will stop calculations. The option 'na.omit' is also permitted, but it may not work correctly when weights are used.

x

The design matrix. This will be computed from the formula by dr and then passed to dr.compute, or you can create it yourself.

y

The response vector or matrix.
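A minimal sketch of calling dr.compute directly, assuming the ais data supplied with the package:

# Hedged sketch: build the design matrix by hand instead of using
# the formula interface of dr.
data(ais)
X <- as.matrix(log(ais[, c("Wt", "Ht", "RCC")]))  # design matrix
fit <- dr.compute(X, ais$LBM, weights = rep(1, nrow(X)), method = "sir")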

method

This character string specifies the method of fitting. The options include "sir", "save", "phdy", "phdres" and "ire". Each method may have its own additional arguments, or its own defaults; see the details below for more information.

chi2approx

Several dr methods compute significance levels using statistics that are asymptotically distributed as a linear combination of χ^2(1) random variables. This keyword chooses the method of approximating that distribution: "bx", the default, uses the method suggested by Bentler and Xie (2000), while "wood" uses the method proposed by Wood (1989).
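A minimal sketch of selecting the Wood approximation, assuming the ais data:

# Hedged sketch: use Wood's (1989) approximation instead of the
# Bentler-Xie (2000) default when computing p-values.
data(ais)
fit <- dr(LBM ~ log(Wt) + log(Ht) + log(RCC), data = ais,
          method = "save", chi2approx = "wood")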

...

For dr, all additional arguments are passed to dr.compute. For dr.compute, additional arguments may be required by a particular dimension reduction method; for example, nslices is the number of slices used by "sir" and "save", and numdir is the maximum number of directions to compute, with a default of 4. Other methods may have other defaults.
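A minimal sketch of passing method-specific arguments, assuming the ais data:

# Hedged sketch: nslices and numdir are forwarded through "..." to
# the "sir" fitting method.
data(ais)
fit <- dr(LBM ~ log(Wt) + log(Ht) + log(RCC), data = ais,
          method = "sir", nslices = 8, numdir = 3)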

Details

The general regression problem studies F(y|x), the conditional distribution of a response y given a set of predictors x. This function provides methods for estimating the dimension and central subspace of a general regression problem. That is, we want to find a p by d matrix B of minimal rank d such that

F(y|x)=F(y|B'x)

Both the dimension d and the subspace R(B) are unknown. These methods make few assumptions. Many methods are based on the inverse distribution, F(x|y).

For the methods "sir", "save", "phdy" and "phdres", a kernel matrix M is estimated such that the column space of M should be close to the central subspace R(B). The eigenvectors corresponding to the d largest eigenvalues of M provide an estimate of R(B).
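A conceptual sketch of this eigen-decomposition step, using a toy kernel matrix rather than the package internals:

# Conceptual sketch only: M is a placeholder symmetric matrix
# standing in for the estimated kernel matrix.
set.seed(1)
M <- crossprod(matrix(rnorm(40), 10, 4))  # toy 4 x 4 kernel matrix
ev <- eigen(M, symmetric = TRUE)          # eigenvalues in decreasing order
d <- 2                                    # hypothesized dimension
B.hat <- ev$vectors[, 1:d]                # estimated basis for R(B)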

For the method "ire", subspaces are estimated by minimizing an objective function.

Categorical predictors can be included using the group argument with the methods "sir", "save" and "ire", using the ideas from Chiaromonte, Cook and Li (2002).

The primary output from this function is (1) a set of vectors whose span estimates R(B), and (2) various tests concerning the dimension d.

Weights can be used, essentially to specify the relative frequency of each case in the data. Empirical weights that make the contours of the weighted sample closer to elliptical can be computed using dr.weights. This will usually result in zero weight for some cases. The function will set zero estimated weights to missing.
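A minimal sketch of the reweighting workflow, assuming the ais data and that dr.weights accepts a formula and data frame:

# Hedged sketch: estimate weights that move the predictors toward
# elliptical contours, then refit with them; any zero weights are
# set to missing by dr.
data(ais)
w <- dr.weights(LBM ~ log(Wt) + log(Ht) + log(RCC), data = ais)
fit <- dr(LBM ~ log(Wt) + log(Ht) + log(RCC), data = ais, weights = w)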

Value

dr returns an object that inherits from dr (the name of the type is the value of the method argument), with attributes:

x

The design matrix

y

The response vector

weights

The weights used, normalized to add to n.

qr

QR factorization of x.

cases

Number of cases used.

call

The initial call to dr.

M

A matrix that depends on the method of computing. The column space of M should be close to the central subspace.

evalues

The eigenvalues of M (or squared singular values if M is not symmetric).

evectors

The eigenvectors of M (or of M'M if M is not square and symmetric) ordered according to the eigenvalues.

chi2approx

Value of the input argument of this name.

numdir

The maximum number of directions to be found. The output value of numdir may be smaller than the input value.

slice.info

Output from 'sir.slice', used by "sir" and "save".

method

The dimension reduction method used.

terms

Same as the terms attribute in lm or glm; needed to make update work correctly.

A

If method="save", then A is a three dimensional array needed to compute test statistics.

Author(s)

Sanford Weisberg, <sandy@stat.umn.edu>.

References

Bentler, P. M. and Xie, J. (2000), Corrections to test statistics in principal Hessian directions. Statistics and Probability Letters, 47, 381-389. Approximate p-values.

Cook, R. D. (1998). Regression Graphics. New York: Wiley. This book provides the basic results for dimension reduction methods, including detailed discussion of the methods "sir", "phdy" and "phdres".

Cook, R. D. (2004). Testing predictor contributions in sufficient dimension reduction. Annals of Statistics, 32, 1062-1092. Introduced marginal coordinate tests.

Cook, R. D. and Nachtsheim, C. (1994), Reweighting to achieve elliptically contoured predictors in regression. Journal of the American Statistical Association, 89, 592–599. Describes the weighting scheme used by dr.weights.

Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach, Journal of the American Statistical Association, 100, 410-428. The "ire" method is described in this paper.

Cook, R. D. and Weisberg, S. (1999). Applied Regression Including Computing and Graphics, New York: Wiley, http://www.stat.umn.edu/arc. The program arc described in this book also computes most of the dimension reduction methods described here.

Chiaromonte, F., Cook, R. D. and Li, B. (2002). Sufficient dimension reduction in regressions with categorical predictors. Annals of Statistics, 30, 475-497. Introduced grouping, or conditioning on factors.

Shao, Y., Cook, R. D. and Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika. Describes the tests used for "save".

Wen, X. and Cook, R. D. (2007). Optimal sufficient dimension reduction in regressions with categorical predictors, Journal of Statistical Planning and Inference. This paper extends the "ire" method to grouping.

Wood, A. T. A. (1989) An F approximation to the distribution of a linear combination of chi-squared variables. Communications in Statistics: Simulation and Computation, 18, 1439-1456. Approximations for p-values.

Examples

data(ais)
# The default fitting method is "sir".
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) +
  log(Hc) + log(Ferr), data = ais)
# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))
# Refit again, using save with 10 slices; the default is max(8, ncol+3).
summary(s2 <- update(s1, nslices = 10, method = "save"))
# Refit using phdres; output is similar for phdy, but the tests for
# phdy are not justifiable.
summary(s3 <- update(s1, method = "phdres"))
# Fit using ire.
summary(s4 <- update(s1, method = "ire"))
# Fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)

Example output

Loading required package: MASS

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + 
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)

Method:
sir with 11 slices, n = 202.

Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15 

Estimated Basis Vectors for Central Subspace:
               Dir1       Dir2     Dir3      Dir4
log(SSF)  -0.143177  0.0476079 -0.02815  0.003785
log(Wt)    0.879504  0.1425841  0.23303 -0.094970
log(Hg)    0.195963 -0.6318503  0.24483 -0.509424
log(Ht)    0.058923  0.1100757 -0.87893  0.217803
log(WCC)   0.007276  0.0029772 -0.05309  0.043056
log(RCC)   0.167736 -0.3924936 -0.19711 -0.213689
log(Hc)   -0.368652  0.6418658 -0.26373  0.796849
log(Ferr)  0.002697 -0.0002593  0.03492  0.039116

              Dir1   Dir2    Dir3    Dir4
Eigenvalues 0.9572 0.2275 0.09368 0.07319
R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864

Large-sample Marginal Dimension Tests:
              Stat df p.value
0D vs >= 1D 284.78 80 0.00000
1D vs >= 2D  91.43 63 0.01113
2D vs >= 3D  45.48 48 0.57690
3D vs >= 4D  26.55 35 0.84694

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + 
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc, 
    nslices = 10, method = "save")

Method:
save with 10 slices, n = 202.

Slice Sizes:
21 21 20 20 20 25 24 22 20 9 

Estimated Basis Vectors for Central Subspace:
               Dir1     Dir2     Dir3     Dir4
log(SSF)   0.127709 -0.00907  0.01018 -0.06144
log(Wt)   -0.905004 -0.07107 -0.15734  0.25774
log(Hg)   -0.056187  0.50674 -0.34064 -0.38087
log(Ht)    0.399868  0.36613  0.68439 -0.54216
log(WCC)   0.032608  0.02733  0.02277  0.03474
log(RCC)  -0.008463  0.15137 -0.24136 -0.47219
log(Hc)   -0.021630 -0.76164  0.57591  0.51526
log(Ferr)  0.002116 -0.01670  0.01631 -0.03360

              Dir1   Dir2   Dir3   Dir4
Eigenvalues 0.9389 0.6611 0.5129 0.4653
R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989

Large-sample Marginal Dimension Tests:
             Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D 378.3     324      0.02012       0.1071
1D vs >= 2D 279.6     252      0.11214       0.3116
2D vs >= 3D 179.9     189      0.67101       0.5160
3D vs >= 4D 134.3     135      0.50176       0.2786

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + 
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc, 
    method = "phdres")

Method:
phdres, n = 202.

Estimated Basis Vectors for Central Subspace:
              Dir1     Dir2      Dir3      Dir4
log(SSF)  -0.03675 -0.23340 -0.001928 -0.006563
log(Wt)    0.59536  0.03252  0.238599 -0.025140
log(Hg)   -0.36061 -0.47699  0.014747  0.596972
log(Ht)    0.21613 -0.08133 -0.959780  0.038954
log(WCC)   0.02948 -0.07203 -0.065847  0.047897
log(RCC)  -0.29816 -0.13669  0.123638  0.166642
log(Hc)    0.61429  0.82846 -0.044891 -0.781899
log(Ferr) -0.01761  0.01068  0.005824  0.001390

              Dir1    Dir2   Dir3    Dir4
Eigenvalues 2.8583 -1.4478 0.9612 -0.5621
R^2(OLS|dr) 0.8774  0.9444 0.9643  0.9891

Large-sample Marginal Dimension Tests:
              Stat df Normal theory Indep. test General theory
0D vs >= 1D 223.67 36     0.000e+00           0      0.0005181
1D vs >= 2D  69.64 28     2.091e-05          NA      0.0067417
2D vs >= 3D  30.12 21     8.970e-02          NA      0.2343969
3D vs >= 4D  12.70 15     6.257e-01          NA      0.2655593

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + 
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc, 
    method = "ire")

Method:
ire with 11  slices, n = 202.

Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15 

Large-sample Marginal Dimension Tests:
              Test df   p.value iter
0D vs > 0D 1952.00 80 0.000e+00    0
1D vs > 1D  121.23 63 1.468e-05    4
2D vs > 2D   58.17 48 1.493e-01    9
3D vs > 3D   30.17 35 7.005e-01   13
4D vs > 4D   12.93 24 9.673e-01    8


Solutions for various dimensions:

 Dimension = 1 
               Dir1
log(SSF)   0.155332
log(Wt)   -0.949820
log(Hg)   -0.040287
log(Ht)   -0.128578
log(WCC)  -0.028183
log(RCC)  -0.111974
log(Hc)    0.205481
log(Ferr) -0.002108

 Dimension = 2 
                Dir1     Dir2
log(SSF)   0.1427314  0.05767
log(Wt)   -0.8859754  0.12268
log(Hg)   -0.1720583 -0.66227
log(Ht)   -0.0656570  0.34654
log(WCC)  -0.0115689  0.03252
log(RCC)  -0.1459691 -0.34423
log(Hc)    0.3732552  0.55020
log(Ferr)  0.0003757 -0.02589

 Dimension = 3 
              Dir1      Dir2      Dir3
log(SSF)   0.14426  0.049529  0.022355
log(Wt)   -0.91380 -0.088235 -0.330387
log(Hg)   -0.09735 -0.476748 -0.156659
log(Ht)   -0.18321  0.531988  0.780883
log(WCC)  -0.01482  0.017182  0.024080
log(RCC)  -0.15448 -0.189613  0.146398
log(Hc)    0.27753  0.665731  0.483662
log(Ferr) -0.00331 -0.003886  0.009095

 Dimension = 4 
               Dir1      Dir2      Dir3      Dir4
log(SSF)   0.148037  0.045075  0.005911  0.009285
log(Wt)   -0.922296 -0.088770 -0.151055 -0.314343
log(Hg)   -0.109033 -0.485958 -0.421073  0.238307
log(Ht)   -0.166533  0.559115  0.293499  0.798519
log(WCC)  -0.018888  0.023363  0.038669  0.024506
log(RCC)  -0.128822 -0.173308 -0.228054  0.420809
log(Hc)    0.266220  0.640846  0.811151 -0.164810
log(Ferr) -0.002935 -0.007616  0.047248 -0.042851
