The function ellsae_big implements the ELL method for small area estimation
by Elbers, C., Lanjouw, J. O. and Lanjouw, P. (2003), which is used to impute
a missing variable from a smaller survey data set into a census. The
imputation is based on a linear model and bootstrap samples. ellsae_big
provides the same functionality as ellsae, but accepts a potential speed
penalty in exchange for the ability to work with much larger data sets that
are not restricted by RAM size.
ellsae_big(model, weights = NULL, survey, census, location_survey,
  n_boot = 250L, seed, welfare.function, transfy, transfy_inv,
  output = "default", cores_c = "auto", cores_r = 1,
  quantiles = c(0, 0.25, 0.5, 0.75, 1), clustermeans, location_census,
  save_boot = F)
model |
a model that describes the relationship between the response and
the explanatory variables. Input must be a linear model that can be
processed by |
survey |
data set that includes the response variable of interest. It is used to estimate the linear model. Input will be coerced to a data.table |
census |
data set in which the variable of interest is missing and is to be imputed |
location_survey |
string with the name of the variable in the survey data set that contains information about the cluster (= location) of an observation |
n_boot |
integer indicating the size of the bootstrap sample |
seed |
integer; a seed can be set to obtain reproducible results |
welfare.function |
function that transforms the bootstrapped variable of interest to obtain some welfare estimate |
transfy |
function to transform the response y in the model |
transfy_inv |
inverse function of |
output |
character string or character vector. Either "default", "all", or a vector with one or more of the following elements: c("summary", "yboot", "model_fit", "bootsample", "survey", "census") |
cores_c |
either a string, "auto", or an integer value indicating the number of cores to be used for the estimation in C++. |
cores_r |
either a string, "auto", or an integer value indicating the number of cores to be used for the estimation in R. |
quantiles |
vector of requested quantiles for the |
clustermeans |
character vector with names of variables present in both data sets. The mean of those variables in the census will be computed by location and added to the survey data set before estimation of the linear model. This may enhance the precision of the estimates |
location_census |
string with the name of the variable in the census data
set that contains information about the cluster (= location) of an
observation. Only needed if |
save_boot |
logical value. TRUE saves the bootstrap sample as BootstrapSampleELLsae-DATE.csv in the current working directory. |
weights=NULL |
weights that can be used for fitting the model |
The function takes the survey data set and uses the argument model to
estimate a linear model with lm(). If the argument clustermeans is specified,
the means of the given variables are calculated from the census data and
merged with the survey data by cluster location. These new explanatory
variables are also used in the estimation of the linear model. Rows with NAs
are omitted from the computation.
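The clustermeans step can be pictured with a few lines of base R. This is only a sketch of the idea, not the package's internal code (which works on data.tables); the toy data and the names loc, age and age_mean are made up for illustration.

```r
# Toy data; all object and column names are illustrative assumptions
survey <- data.frame(loc = c("a", "a", "b"),
                     age = c(30, 40, 50),
                     y   = c(10, 12, 9),
                     stringsAsFactors = FALSE)
census <- data.frame(loc = c("a", "a", "b", "b"),
                     age = c(20, 40, 60, 80),
                     stringsAsFactors = FALSE)

clustermeans <- "age"

# Mean of each requested variable in the census, computed by location
means <- aggregate(census[clustermeans],
                   by = list(loc = census$loc),
                   FUN = mean, na.rm = TRUE)
names(means)[-1] <- paste0(clustermeans, "_mean")

# Attach the census means to the survey rows of the same location;
# age_mean could then enter the linear model as an extra regressor
survey_aug <- merge(survey, means, by = "loc", all.x = TRUE)
survey_aug
```

Each survey row ends up carrying the census mean of its own location, so the new column varies between locations but not within them.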
The user may choose to transform the response variable with a function,
transfy, before the model is estimated. This function is applied directly to
the entire vector of the response variable, i.e. transfy(response), so it
needs to be able to take a vector as input. For transformations like log, exp
and sqrt this simply yields an element-wise transformation. For more complex
transformations, you may want to use sapply inside your function to ensure
element-wise transformation. This also applies to transfy_inv and
welfare.function, which need to be able to take a matrix as input. In many
cases a transformation like transfy could also be achieved by altering the
specified model appropriately, but using transfy and transfy_inv is the
recommended usage.
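As a sketch of the vector and matrix requirement, the hypothetical helper below handles only a single number, and a wrapper makes it safe to pass as transfy. The function names are invented for this illustration. Restoring the dim attribute matters for functions that receive a matrix, because sapply and vapply return a plain vector.

```r
# Hypothetical transformation that only accepts one number at a time
shift_log_one <- function(v) {
  if (v <= 0) stop("value must be positive")
  log(v + 1)
}

# Wrapper that applies it element-wise and keeps the shape of the input,
# so it works for a vector (transfy) as well as for a matrix
transfy_example <- function(x) {
  out <- vapply(x, shift_log_one, numeric(1))
  dim(out) <- dim(x)  # NULL for vectors, so vectors stay vectors
  out
}

# The inverse is already vectorized and keeps matrix dimensions on its own
transfy_inv_example <- function(x) exp(x) - 1

m <- matrix(c(1, 2, 3, 4), nrow = 2)
transfy_inv_example(transfy_example(m))
```

The round trip returns the original matrix, which is the property transfy and transfy_inv need to satisfy together.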
From the regression, location effects are calculated as the mean by location
of the regression residuals. Individual random error terms are then obtained
as the difference between the regression residuals and the location effects.
The bootstrapped response variables are generated using three sources of
randomness. The betas obtained from lm() are replaced by draws from a
multivariate normal distribution. In addition, random location effects and
residuals are drawn with replacement. Internally the sample is a matrix,
bootstrap, in which each column holds the bootstrap samples for one individual
observation in the census data set.
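The decomposition and the three sources of randomness can be sketched in base R as follows. This is an illustration under assumptions, not the package's C++ implementation: the toy data, the variable names, and the choice to draw one location effect per census location in each replicate are all assumptions made for the example.

```r
set.seed(1)

# Toy survey and census; every name here is illustrative
survey <- data.frame(loc = rep(c("a", "b", "c"), each = 20), x = rnorm(60),
                     stringsAsFactors = FALSE)
survey$y <- 1 + 2 * survey$x + rep(rnorm(3), each = 20) + rnorm(60)
census <- data.frame(loc = rep(c("a", "b", "c"), each = 5), x = rnorm(15),
                     stringsAsFactors = FALSE)

fit <- lm(y ~ x, data = survey)
res <- unname(residuals(fit))

# Location effect = mean residual per location; individual error = remainder
loc_effect <- as.numeric(tapply(res, survey$loc, mean))
ind_error  <- res - ave(res, survey$loc)

X_census <- model.matrix(~ x, data = census)
V_chol   <- chol(vcov(fit))          # upper triangular R with t(R) %*% R = V
locs     <- unique(census$loc)
n_boot   <- 100

# Rows = bootstrap replicates, columns = census observations
boot <- matrix(NA_real_, nrow = n_boot, ncol = nrow(census))
for (b in seq_len(n_boot)) {
  # 1. coefficients drawn from N(beta_hat, V)
  beta_b <- coef(fit) + drop(t(V_chol) %*% rnorm(length(coef(fit))))
  # 2. location effects drawn with replacement (assumed: one per location)
  eta_b <- setNames(sample(loc_effect, length(locs), replace = TRUE), locs)
  # 3. individual errors drawn with replacement
  eps_b <- sample(ind_error, nrow(census), replace = TRUE)
  boot[b, ] <- drop(X_census %*% beta_b) + eta_b[census$loc] + eps_b
}
dim(boot)
```

By construction the individual errors average to zero within each location, and each column of boot collects all replicates for one census observation.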
If transfy_inv was specified, the bootstrap sample is transformed back. This
function is applied directly to the matrix of bootstrap samples, i.e.
transfy_inv(bootstrap).
If a welfare function was specified, it is used to transform the bootstrap
sample. It is applied directly to the matrix of bootstrap samples, i.e.
welfare.function(bootstrap). Unlike in ellsae, the bootstrap samples that
belong to one observation are arranged column-wise in the internally stored
matrix.
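Two welfare functions that accept a matrix could look like the sketch below. The poverty line and the function names are hypothetical, and it is an assumption of this example that a welfare function should return a matrix of the same shape as its input.

```r
# Hypothetical poverty line, in the units of the (back-transformed) response
poverty_line <- 50

# Head-count indicator: 1 if below the line, 0 otherwise; keeps matrix shape
headcount <- function(y) (y < poverty_line) * 1

# Poverty gap: relative shortfall below the line, 0 for the non-poor.
# pmax() takes its attributes (including dim) from its first argument.
poverty_gap <- function(y) pmax((poverty_line - y) / poverty_line, 0)

# Small stand-in for the bootstrap matrix: 2 replicates, 3 observations
boot <- matrix(c(40, 60, 25, 75, 50, 10), nrow = 2)
headcount(boot)
poverty_gap(boot)
```

Both functions rely only on vectorized arithmetic, so no sapply wrapper is needed.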
cores_c specifies the number of cores to use for the calculation in C++. As
this parallelization incurs little overhead, it should in most cases be left
at "auto".
cores_r specifies the number of cores to be used for calculations in R. The
method of parallelization is the one implemented in the package foreach.
Parallelization comes with a significant overhead; the default is therefore
1. "auto" invokes nb_cores and creates clusters according to the number of
physical CPUs available.
To obtain reproducible results, a seed can be specified. Simply running
set.seed() in R beforehand does not work. Providing a seed will not
permanently alter the seed in R.
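How the package isolates its seed is not described here. For R-level code, one common pattern for using a seed without permanently changing the session's random number state is sketched below; the helper name with_local_seed is hypothetical, and the sketch does not cover random numbers generated in C++.

```r
# Hypothetical helper: evaluate `expr` under a fixed seed, then restore the
# random number state the session had before the call
with_local_seed <- function(seed, expr) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) {
    old <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  expr  # lazily evaluated here, i.e. after set.seed()
}

set.seed(42)
state_before <- .Random.seed
a <- with_local_seed(1234, runif(3))
b <- with_local_seed(1234, runif(3))
identical(a, b)                         # same seed, same draws
identical(state_before, .Random.seed)   # session state untouched
```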
ellsae_big returns a list. By default, this list includes a matrix with basic
summary statistics as specified in quantiles, a vector with the means of the
bootstrap samples for every observation, and the lm object obtained from the
linear model estimation. In addition, the user can request the full
file-backed matrix (FBM) of bootstrap samples and updated data.tables of the
survey and census data sets with residuals, location effects and clustermeans
added. The FBM can be subsetted with [i, j] just like a regular matrix.
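The relationship between the bootstrap matrix and the default outputs can be sketched with an ordinary matrix standing in for the FBM. The exact statistics the package reports are not spelled out above, so the summary computed here (mean and quantiles of the per-observation means) is an assumption for illustration only.

```r
set.seed(1)
quantiles <- c(0, 0.25, 0.5, 0.75, 1)

# Stand-in for the bootstrap matrix: 250 replicates (rows) for
# 8 census observations (columns)
boot <- matrix(rexp(250 * 8), nrow = 250, ncol = 8)

# One mean per census observation, taken over its column of replicates
yboot <- colMeans(boot)

# Assumed form of the summary: mean plus the requested quantiles
summary_stats <- c(mean = mean(yboot), quantile(yboot, probs = quantiles))
summary_stats

# Matrix-style subsetting: first two replicates of the third observation
boot[1:2, 3]
```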
Elbers, C., Lanjouw, J. O. and Lanjouw, P. (2003). Micro-Level Estimation of Poverty and Inequality. In: Econometrica 71.1, pp. 355-364, Jan 2003
Guadarrama Sanz, M., Molina, I., and Rao, J.N.K. (2016). A comparison of small area estimation methods for poverty mapping. In: 17 (Mar. 2016), 41-66 and 156 and 158.
Other small area estimation methods can also be found in the package
## Not run: 
# Generate a sample survey and census data from the provided brazil data set
brazil <- ELLsae::brazil
helper <- sample(x = 1:nrow(brazil), size = nrow(brazil)/5, replace = FALSE)
helper <- sort(helper)
survey <- brazil[helper,]
census <- brazil[-helper,]
model.example <- hh_inc ~ geo2_br + age + sex + computer + trash
ELLsae::ellsae_big(model = model.example,
                   survey = survey,
                   census = census,
                   location_survey = "geo2_br",
                   n_boot = 250L,
                   seed = 1234,
                   transfy = log,
                   transfy_inv = exp,
                   output = "all",
                   cores_c = "auto",
                   cores_r = 1,
                   quantiles = c(0, 0.25, 0.5, 0.75, 1),
                   clustermeans = "age",
                   location_census = "geo2_br",
                   save_boot = FALSE)
## End(Not run)