svystatB | R Documentation |
Computes estimates, standard errors and confidence intervals for Multiple Regression Coefficients in subpopulations.
svystatB(design, model, by = NULL,
vartype = c("se", "cv", "cvpct", "var"),
conf.int = FALSE, conf.lev = 0.95, deff = FALSE,
na.rm = FALSE)
## S3 method for class 'svystatB'
coef(object, ...)
## S3 method for class 'svystatB'
SE(object, ...)
## S3 method for class 'svystatB'
VAR(object, ...)
## S3 method for class 'svystatB'
cv(object, ...)
## S3 method for class 'svystatB'
deff(object, ...)
## S3 method for class 'svystatB'
confint(object, ...)
## S3 method for class 'svystatB'
summary(object, ...)
design |
Object of class |
model |
Formula specifying the linear model whose coefficients have to be estimated. |
by |
Formula specifying the variables that define the "estimation domains" (see ‘Details’). If |
vartype |
|
conf.int |
Compute confidence intervals for the estimates? The default is
|
conf.lev |
Probability specifying the desired confidence level: the default value is |
deff |
Should the design effect be computed? The default is |
na.rm |
Should missing values (if any) be removed from the variables of interest? The default is
|
object |
An object of class |
... |
Additional arguments to |
This function computes weighted estimates for Multiple Regression Coefficients using suitable weights depending on the class of design
: calibrated weights for class cal.analytic
and direct weights otherwise. Standard errors are calculated using the Taylor linearization technique.
The mandatory argument model
identifies the regression model whose population coefficients have to be estimated (for details on model specification, see e.g. lm
). The design
variables referenced by model
should be numeric
or factor
(variables of other types - e.g. character
- will need to be converted in advance, e.g. using function des.addvars
).
The optional argument by
specifies the variables that define the "estimation domains", that is the subpopulations for which the estimates are to be calculated. If by=NULL
(the default option), the estimates produced by svystatB
refer to the whole population. Estimation domains must be defined by a formula: for example the statement by=~B1:B2
selects as estimation domains the subpopulations determined by crossing the modalities of variables B1
and B2
. Notice that a formula like by=~B1+B2
will be automatically translated into the factor-crossing formula by=~B1:B2
: if you need to compute estimates for domains B1
and B2
separately, you have to call svystatQ
twice. The design
variables referenced by by
(if any) should be of type factor
, otherwise they will be coerced. Note that, to prevent obvious collinearity issues, the variables referenced by argument by
must not appear in the input model
formula: otherwise, the program will stop and print an error message.
The conf.int
argument allows to request the confidence intervals for the estimates. By default conf.int=FALSE
, that is the confidence intervals are not provided.
Whenever confidence intervals are requested (i.e. conf.int=TRUE
), the desired confidence level can be specified by means of the conf.lev
argument. The conf.lev
value must represent a probability (0<=conf.lev<=1
) and its default is chosen to be 0.95
.
The optional argument deff
allows to request the design effect [Kish 1995] for the estimates. By default deff=FALSE
, that is the design effect is not provided. The design effect of an estimator is defined as the ratio between the variance of the estimator under the actual sampling design and the variance that would be obtained for an 'equivalent' estimator under a hypothetical simple random sampling without replacement of the same size. To obtain an estimate of the design effect comparing to simple random sampling “with replacement”, one must use deff="replace"
.
For nonlinear estimators, the design effect is estimated on the linearized version of the estimator (that is for the estimator of the total of the linearized variable, aka "Woodruff transform").
When dealing with domain estimation, the design effects referring to a given subpopulation are currently computed by taking the ratios between the actual variance estimates and those that would have been obtained if a simple random sampling were carried out within that subpopulation. This is the same as the srssubpop
option for Stata's function estat
.
Missing values (NA
) in model variables should be avoided. If na.rm=FALSE
(the default) they generate an error. If na.rm=TRUE
, observations containing NAs in model variables are dropped, and estimates get computed on non missing values only. This implicitly assumes that missing values hit interest variables completely at random: should this not be the case, computed estimates would be biased.
The summary
method invoked on regression coefficients (say b
) estimated via svystatB
, gives p-values and significance codes for the component-wise test b = 0
. Such values are computed assuming that the distribution of the regression coefficients estimators is normal (which is asymptotically true for large scale surveys). This assumption has the advantage of overcoming the problem of choosing the "right" statistic and assessing its "right" number of degrees of freedom when using data from a complex survey (see e.g. [Korn, Graubard 1990]).
An object inheriting from the data.frame
class, whose detailed structure depends on input parameters' values. In special cases (see Section ‘Collinearity, Aliasing and Impacts in Domain Estimation’), a list object.
Function svystatB
overcomes problems arising from exact collinearity between model variables via ‘aliasing’ (see the ‘Examples’ Section). Put simply, aliasing discards redundant (i.e. collinear) regressors, yielding exact estimates and standard errors for non-aliased regression coefficients (namely, the same results that would be obtained with a reduced - no collinearity - model). Note that for the way aliasing works, the order of the terms in the linear model formula definitely matters.
Collinearity between variables may manifest itself in subsets of the sample, and with different patterns across subsets. In domain estimation, this phenomenon can have an impact on the structure of svystatB
's output. In fact, owing to aliasing, the estimable regression coefficients - for the same input linear model - can be different across domains. In such cases, the domain estimates produced by svystatB
can no longer be stored in a data.frame
, and the output object will instead be a list (see the ‘Examples’ Section). Note that for these svystatB
's return objects (whose class
is svystatB.by.list
) no variance extractors
are currently available.
Note also that, for the reasons above, the usage of svystatB
via function svystat
is restricted: it is not allowed to specify svystat
's arguments by
and group
when kind == "B"
.
Diego Zardetto
Sarndal, C.E., Swensson, B., Wretman, J. (1992) “Model Assisted Survey Sampling”, Springer Verlag.
Kish, L. (1995). “Methods for design effects”. Journal of Official Statistics, Vol. 11, pp. 55-77.
European Commission, Eurostat, (2013). “Handbook on precision requirements and variance estimation for ESS households surveys: 2013 edition”, Publications Office. doi: 10.2785/13579
Korn, E.L., Graubard, B.I. (1990) “Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics”. The American Statistician, 44, 270-276.
Estimators of Totals and Means svystatTM
, Ratios between Totals svystatR
, Shares svystatS
, Ratios between Shares svystatSR
, Quantiles svystatQ
, Complex Analytic Functions of Totals and/or Means svystatL
, and all of the above svystat
.
######################################################
# A simple regression model with a single predictor. #
# Let's compare the estimated regression coefficient #
# to its true value computed on the sampling frame. #
######################################################
# Load sbs data:
data(sbs)
# Create a design object:
sbsdes<-e.svydesign(data=sbs,ids=~id,strata=~strata,weights=~weight,
fpc=~fpc)
# The population scatterplot of y vs emp.num reveals a linear
# behaviour:
plot(sbs.frame$emp.num,sbs.frame$y,
col=rgb(50,205,50,100,maxColorValue=255),pch=16)
# Compute the population fit of the linear regression
# model y~emp.num-1 (no intercept):
pop.fit<-lm(y~emp.num-1,data=sbs.frame)
abline(pop.fit,col="red",lwd=2,lty=2)
# The obtained population R-squared is quite significant
# (greater than 0.7):
pop.R2<-summary(pop.fit)$r.squared
pop.R2
# The population regression coefficient is:
B<-coef(pop.fit)
B
# Now let's estimate B on the basis of the sbs sample and
# let's build a 95% confidence interval for the obtained estimate:
svystatB(sbsdes,y~emp.num-1,conf.int=TRUE)
# Thus, the confidence interval covers the true value of B.
# Notice that using ReGenesees Complex Estimators function
# svystatL, you would have obtained exactly the same results:
sbsdes<-des.addvars(sbsdes,y4emp.num=y*emp.num,
emp.num.sq=emp.num^2)
svystatL(sbsdes,expression(y4emp.num/emp.num.sq),
conf.int=TRUE)
##################################
# A multiple regression example. #
##################################
# Let's estimate the coefficients of a model describing
# value added (variable va.imp2) as a linear function
# of number of employees by region and of nace.macro:
b <- svystatB(sbsdes,va.imp2~emp.num:region+nace.macro,vartype="cvpct")
b
# To obtain p-values and significance codes for the
# component-wise test t=0, you can exploit the
# summary method:
summary(b)
# Notice that estimators normality is assumed.
##########################################
# Obtaining domain means via regression. #
##########################################
# The domain mean of a numeric variable can be thought
# as a regression coefficient. Suppose you need the
# average number of employees by macro-sector, you can
# do as follows:
svystatB(sbsdes,emp.num~nace.macro-1)
# ...which, indeed, gives exactly the same results of:
svystatTM(sbsdes,y=~emp.num,by=~nace.macro,estimator="Mean")
##########################
# Handling collinearity. #
##########################
# Function svystatB overcomes problems arising from exact
# collinearity between model variables via 'aliasing'.
# To understand how aliasing works, let's build a manifestly
# redundant linear model:
svystatB(sbsdes,y~emp.num+I(2*emp.num)+I(3*va.imp2)+va.imp2-1)
# The obtained warning message shows that order definitely matters
# in aliasing, indeed:
svystatB(sbsdes,y~emp.num+I(2*emp.num)+va.imp2+I(3*va.imp2)-1)
# Notice also that aliasing gives exact estimates and standard errors
# for non-aliased regression coefficients (i.e. the same results that
# would be obtained with a reduced - no collinearity - model):
svystatB(sbsdes,y~emp.num+va.imp2-1)
###############################################
# Handling missing values in model variables. #
###############################################
# Load fpcdat:
data(fpcdat)
# Now, let's introduce some NAs in survey data:
fpcdat$y[c(1,3)]<-NA
fpcdat$x[c(3,5)]<-NA
# Create a design object:
fpcdes<-e.svydesign(data=fpcdat,ids=~psu+ssu,strata=~stratum,weights=~w,
fpc=~fpc1+fpc2)
# Let's estimate regression coefficients of model z~y+x
# na.rm=FALSE (the default) leads to an error:
## Not run:
svystatB(fpcdes,z~y+x)
## End(Not run)
# whereas na.rm=TRUE simply drops all the cases
# with missing data in model variables:
svystatB(fpcdes,z~y+x,na.rm=TRUE)
##################################
# Handling non positive weights. #
##################################
# Non positive direct weights are not allowed, anyway some
# calibrated weights can sometimes turn out to be <= 0. The
# corrisponding observations would be dropped by svystatB.
# Prepare a template for population totals:
pop<-pop.template(fpcdes,~z+pl.domain-1)
# Fill it with silly values in order to obtain some negative g-weights:
pop[1,]<-c(20000,90,10,90)
# Calibrate:
fpccal<-e.calibrate(fpcdes,pop)
# We got 2 negative calibrated weights:
g.range(fpccal)
sum(weights(fpccal)<=0)
# Now, let's estimate regression coefficients of model z~y+x
# and pay attantion to the warnings:
svystatB(fpccal,z~y+x,na.rm=TRUE)
####################################################################
# Domain estimates of simple and multiple regression coefficients. #
####################################################################
# Estimate the coefficients of the simple regression y ~ emp.num by domains
# obtained crossing region and nace.macro:
bb <- svystatB(sbsdes, model= va.imp2 ~emp.num, by= ~region:nace.macro)
bb
# Obtain p-values and significance codes via the summary method:
summary(bb)
# You have yet another method to estimate domain means of numeric variables.
# Suppose you need the average number of employees by macro-sector, you can
# do as follows:
svystatB(sbsdes, model= emp.num ~1, by= ~nace.macro)
# ...which gives exactly the same results of:
svystatB(sbsdes, model= emp.num ~nace.macro -1)
# ...and, of course, of:
svystatTM(sbsdes, y= ~emp.num, by= ~nace.macro, estimator= "Mean")
# One multiple regression example:
svystatB(sbsdes, model = y ~ va.imp2 + emp.num, by = ~region:nace.macro)
# A case of differential aliasing across domain (note the warning messages).
# A list-like object is returned:
svystatB(sbsdes, va.imp2 ~emp.num:emp.cl + nace.macro, by= ~region:public)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.