ellsae: ellsae
In nikosbosse/ELLsae: A Small Area Estimation Approach


The function ellsae implements the "ELL method" for small area estimation by Elbers, C., Lanjouw, J. O. and Lanjouw, P. (2003), which is used to impute a missing variable from a smaller survey data set into a census. The imputation is based on a linear model and bootstrap samples.

ellsae(model, weights = NULL, survey, census, location_survey,
  n_boot = 250L, seed, welfare.function, transfy, transfy_inv,
  output = "default", cores = "auto", quantiles = c(0, 0.25, 0.5,
  0.75, 1), clustermeans, location_census, save_boot = F)

`model`	a model that describes the relationship between the response and the explanatory variables. Input must be a linear model that can be processed by `lm()`
`survey`	data.table with the response variable of interest included. Will be used to estimate the linear model. Input will be coerced to a data.table
`census`	data.table where the variable of interest is missing and shall be imputed
`location_survey`	string with the name of the variable in the survey data set that contains information about the cluster (= location) of an observation
`n_boot`	integer indicating the size of the bootstrap sample
`seed`	integer, seed can be set to obtain reproducible results
`welfare.function`	function that transforms the bootstrapped variable of interest to obtain some welfare estimate
`transfy`	function to transform the response y in the model
`transfy_inv`	inverse function of `transfy` for back-transformation
`output`	character string or character vector. Either "default", "all", or a vector with one or more of the following elements: c("summary", "yboot", "model_fit", "bootsample", "survey", "census")
`cores`	either a string, "auto", or an integer value indicating the number of cores to be used for the estimation.
`quantiles`	vector of requested quantiles for the `summaryboot` output defined as decimals between 0 and 1.
`clustermeans`	character vector with names of variables present in both data sets. The mean of those variables in the census will be computed by location and added to the survey data set before estimation of the linear model. This may enhance precision of the estimates
`location_census`	string with the name of the variable in the census data set that contains information about the cluster (= location) of an observation. Only needed if `clustermeans` is specified.
`save_boot`	logical value. TRUE saves the bootstrap sample as BootstrapSampleELLsae-DATE.csv in the current working directory.
`weights`	weights that can be used for fitting the model. Defaults to NULL.

The function takes the survey data set and uses the argument model to fit a linear model with lm(). If the argument clustermeans is specified, the means of the given variables are calculated from the census data by location and merged with the survey data by cluster location. These new explanatory variables are then also used in the estimation of the linear model. Rows containing NA values are omitted from the computation.
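The clustermeans step can be pictured with a small base R sketch. It is only an illustration of the idea, not the package's internal code: the toy data, the column names loc, age and y, and the suffix age_mean are all hypothetical.

```r
# Toy data; all column names here are hypothetical
census <- data.frame(loc = c("a", "a", "b", "b", "b"),
                     age = c(30, 40, 20, 25, 30),
                     stringsAsFactors = FALSE)
survey <- data.frame(loc = c("a", "b", "b"),
                     age = c(35, 22, 28),
                     y   = c(100, 80, 90),
                     stringsAsFactors = FALSE)

# Mean of `age` per location, computed on the census
means <- aggregate(age ~ loc, data = census, FUN = mean)
names(means)[names(means) == "age"] <- "age_mean"

# Attach the census means to every survey row of the same location
survey_aug <- merge(survey, means, by = "loc", all.x = TRUE, sort = FALSE)

# The location mean can now enter the model as an extra regressor
fit <- lm(y ~ age + age_mean, data = survey_aug)
```

With only three survey rows the fit itself is meaningless; the point is that age_mean is constant within a location and comes from the census, not the survey.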

The user may choose to transform the response variable with a function, transfy, before the model is estimated. This function is applied directly to the entire response vector, i.e. transfy(response), so it must accept a vector as input. For transformations like log, exp, or sqrt this simply yields an element-wise transformation. For more complex transformations, you may want to use sapply inside your function to ensure element-wise application. The same applies to transfy_inv and welfare.function, which must accept a matrix as input. In many cases the effect of transfy could also be achieved by altering the specified model, but using transfy and transfy_inv is the recommended approach.
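As a minimal sketch of the element-wise wrapping described above, the following uses a made-up piecewise transformation that is not vectorised on its own. The names piecewise_scalar, elementwise, transfy_example and transfy_inv_example are hypothetical, and vapply is used as a stricter relative of sapply. Restoring the dim attribute matters because the inverse is applied to a matrix.

```r
# A scalar-only transformation: `if` handles a single value, not a vector
piecewise_scalar     <- function(v) if (v > 1) log(v) else v - 1
piecewise_inv_scalar <- function(z) if (z > 0) exp(z) else z + 1

# Turn a scalar function into one that works on vectors and matrices
elementwise <- function(f) {
  function(x) {
    out <- vapply(x, f, numeric(1))  # apply f to each element
    dim(out) <- dim(x)               # keep matrix shape (NULL for vectors)
    out
  }
}

transfy_example     <- elementwise(piecewise_scalar)
transfy_inv_example <- elementwise(piecewise_inv_scalar)

y <- c(0.5, 1, exp(2))
z <- transfy_example(y)              # transformed response vector

# The inverse has to cope with a matrix of bootstrap samples
m    <- matrix(z, nrow = 3, ncol = 2)
back <- transfy_inv_example(m)
```

Functions built this way could then be passed as transfy and transfy_inv.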

From the regression, location effects are calculated as the mean of the regression residuals by location. Individual random error terms are then obtained as the difference between the regression residuals and the location effects. The bootstrapped response variables are generated using three sources of randomness: the betas obtained from lm() are replaced by draws from a multivariate normal distribution, and random location effects and individual residuals are each drawn with replacement. Internally the sample is stored as a matrix, bootstrap, in which each row holds the bootstrap samples for one individual observation in the census data set.
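The three sources of randomness can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's C++ implementation: the coefficient draws are assumed to be centred at coef(fit) with covariance vcov(fit), one location effect is assumed to be drawn per census location and replication, and the toy data and variable names are hypothetical.

```r
set.seed(1)

# Toy survey and census with three locations (hypothetical data)
n_per_loc <- 20
survey <- data.frame(loc = rep(c("a", "b", "c"), each = n_per_loc),
                     x   = rnorm(3 * n_per_loc),
                     stringsAsFactors = FALSE)
survey$y <- 1 + 2 * survey$x + rep(c(-0.5, 0, 0.5), each = n_per_loc) +
  rnorm(nrow(survey), sd = 0.3)
census <- data.frame(loc = rep(c("a", "b", "c"), times = c(3, 2, 4)),
                     x   = rnorm(9),
                     stringsAsFactors = FALSE)

# Split the residuals into location effects and individual errors
fit        <- lm(y ~ x, data = survey)
res        <- residuals(fit)
loc_effect <- tapply(res, survey$loc, mean)
indiv_err  <- as.numeric(res - loc_effect[survey$loc])

n_boot <- 5
X <- model.matrix(~ x, data = census)

# 1) Coefficients: multivariate normal draws via a Cholesky factor
R     <- chol(vcov(fit))                        # t(R) %*% R equals vcov(fit)
z     <- matrix(rnorm(length(coef(fit)) * n_boot), ncol = n_boot)
betas <- coef(fit) + t(R) %*% z                 # one column per replication

# 2) Location effects: resampled with replacement, one per census location
locs <- unique(census$loc)
eta  <- matrix(sample(as.numeric(loc_effect), length(locs) * n_boot,
                      replace = TRUE),
               nrow = length(locs), dimnames = list(locs, NULL))

# 3) Individual errors: resampled with replacement for every cell
eps <- matrix(sample(indiv_err, nrow(census) * n_boot, replace = TRUE),
              nrow = nrow(census))

# Rows are census observations, columns are bootstrap replications
bootstrap <- X %*% betas + eta[census$loc, , drop = FALSE] + eps
```

Each row of bootstrap then holds n_boot simulated responses for one census observation, matching the row-wise layout described above.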

If transfy_inv was specified, the bootstrap sample is transformed back. This function will be directly applied to the matrix of bootstrap samples, i.e. transfy_inv(bootstrap).

If a welfare function was specified, it will be used to transform the bootstrap sample. It will be directly applied to the matrix of bootstrap samples, i.e. welfare.function(bootstrap). Bootstrap samples that belong to one observation are arranged row-wise.
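As an illustration of a function that accepts the whole matrix, the sketch below uses a simple poverty indicator. The poverty line of 100 and the names poverty_line and welfare_example are hypothetical choices for this example, not part of the package.

```r
# Two census observations (rows), three bootstrap replications (columns)
bootstrap <- matrix(c( 50, 120,  80,
                      200,  90, 300), nrow = 2, byrow = TRUE)

poverty_line <- 100  # hypothetical threshold

# Vectorised over the matrix: 1 if below the poverty line, 0 otherwise
welfare_example <- function(m) (m < poverty_line) * 1

w <- welfare_example(bootstrap)

# Averaging each row gives the share of replications in which
# that observation falls below the line
poverty_share <- rowMeans(w)
```

Because the comparison and the multiplication are both element-wise, the result keeps the row-per-observation layout of the input.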

cores specifies the number of cores to use for the calculation. As parallelization is done in C++ and incurs little overhead, this should in most cases be left at "auto".

To obtain reproducible results, a seed can be specified. Simply running set.seed() in R does not work. Providing a seed will not permanently alter the seed in R.
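The idea of a seed that leaves the caller's random number state untouched can be illustrated with a generic base R pattern. This is not how ellsae handles its seed internally; it is only a sketch of the behaviour, and with_local_seed is a hypothetical helper.

```r
# Run `f` under a fixed seed, then restore the caller's RNG state
with_local_seed <- function(seed, f) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) old <- get(".Random.seed", envir = globalenv())
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  f()
}

set.seed(42)
a1 <- with_local_seed(1234, function() runif(1))
after_call <- runif(1)   # continues the stream started by set.seed(42)

set.seed(42)
reference <- runif(1)    # same stream without the seeded call in between
a2 <- with_local_seed(1234, function() runif(1))
```

Here a1 and a2 agree because both were drawn under the same local seed, while after_call and reference agree because the helper restored the outer state.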

ellsae returns a list. By default, this list includes a matrix with basic summary statistics as specified in quantiles, a vector with the means of the bootstrap samples for every observation, and the lm object obtained from the linear model estimation. In addition, the user can request the full matrix of bootstrap samples and updated data.tables of the survey and census data sets with residuals, location effects, and clustermeans added.

Elbers, C., Lanjouw, J. O. and Lanjouw, P. (2003). Micro-Level Estimation of Poverty and Inequality. In: Econometrica 71.1, pp. 355-364, Jan 2003

Guadarrama Sanz, M., Molina, I., and Rao, J.N.K. (2016). A comparison of small area estimation methods for poverty mapping. In: 17 (Mar. 2016), 41-66 and 156 and 158.

If issues with memory allocation occur, one can also use ellsae_big instead. Other small area estimation methods can be found in the package sae.

## Not run: 
# Generate a sample survey and census data from the provided brazil data set
brazil <-  ELLsae::brazil
helper <- sample(x = 1:nrow(brazil), size = nrow(brazil)/5, replace = FALSE)
helper <- sort(helper)
survey <- brazil[helper,]
census <- brazil[-helper,]
model.example <- hh_inc ~ geo2_br + age + sex + computer + trash

ELLsae::ellsae(model = model.example,
               survey = survey,
               census = census,
               location_survey = "geo2_br",
               n_boot = 250L,
               seed = 1234,
               transfy = log,
               transfy_inv = exp,
               output = "all",
               cores = "auto",
               quantiles = c(0, 0.25, 0.5, 0.75, 1),
               clustermeans = "age",
               location_census = "geo2_br",
               save_boot = FALSE)

## End(Not run)