ellsae_big: ellsae_big
In nikosbosse/ELLsae: A Small Area Estimation Approach


The function ellsae_big implements the "ELL method" for small area estimation by Elbers, C., Lanjouw, J. O. and Lanjouw, P. (2003), which is used to impute a missing variable from a smaller survey data set into a census. The imputation is based on a linear model and bootstrap samples. ellsae_big provides the same functionality as ellsae, but accepts a potential speed penalty in exchange for the ability to work with much larger data sets that are not restricted by RAM size.

ellsae_big(model, weights = NULL, survey, census, location_survey,
  n_boot = 250L, seed, welfare.function, transfy, transfy_inv,
  output = "default", cores_c = "auto", cores_r = 1,
  quantiles = c(0, 0.25, 0.5, 0.75, 1), clustermeans, location_census,
  save_boot = F)

`model`	a model that describes the relationship between the response and the explanatory variables. Input must be a linear model that can be processed by `lm()`
`survey`	data set with the response variable of interest included. Will be used to estimate the linear model. Input will be coerced to a data.table
`census`	dataset where the variable of interest is missing and shall be imputed
`location_survey`	string with the name of the variable in the survey data set that contains information about the cluster (= location) of an observation
`n_boot`	integer indicating the size of the bootstrap sample
`seed`	integer, seed can be set to obtain reproducible results
`welfare.function`	function that transforms the bootstrapped variable of interest to obtain some welfare estimate
`transfy`	function to transform the response y in the model
`transfy_inv`	inverse function of `transfy` for backtransformation
`output`	character string or character vector. Either "default", "all", or a vector with one or more of the following elements: c("summary", "yboot", "model_fit", "bootsample", "survey", "census")
`cores_c`	either a string, "auto", or an integer value indicating the number of cores to be used for the estimation in C++.
`cores_r`	either a string, "auto", or an integer value indicating the number of cores to be used for the estimation in R.
`quantiles`	vector of requested quantiles for the `summaryboot` output defined as decimals between 0 and 1.
`clustermeans`	character vector with names of variables present in both data sets. The mean of those variables in the census will be computed by location and added to the survey data set before estimation of the linear model. This may enhance the precision of the estimates
`location_census`	string with the name of the variable in the census data set that contains information about the cluster (= location) of an observation. Only needed if `clustermeans` are computed.
`save_boot`	logical value. TRUE saves the bootstrap sample as BootstrapSampleELLsae-DATE.csv in the current working directory.
`weights`	weights that can be used for fitting the model. Defaults to NULL

The function takes the survey data set and uses the argument model to estimate a linear model with lm(). If the argument clustermeans is specified, the means of the given variables are calculated from the census data by cluster location and merged into the survey data. These new explanatory variables are also used in the estimation of the linear model. Rows containing NA values are omitted from the computation.
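The cluster-mean step described above can be sketched in base R. This is only an illustration under assumed toy data: the data frames, the column names and the suffix `_meanCensus` are hypothetical, and ellsae_big itself works on data.table objects internally, so its actual implementation may differ.

```r
# Toy data; all names here are hypothetical and only for illustration.
census <- data.frame(
  loc = c("a", "a", "b", "b", "b"),
  age = c(20, 40, 30, 50, 70)
)
survey <- data.frame(
  loc    = c("a", "a", "a", "b", "b", "b"),
  age    = c(25, 45, 33, 35, 60, 52),
  income = c(100, 180, 120, 150, 300, 260)
)

# Mean of the chosen variable in the census, computed by location.
cluster_means <- aggregate(age ~ loc, data = census, FUN = mean)
names(cluster_means)[names(cluster_means) == "age"] <- "age_meanCensus"

# Merge the census means into the survey by location.
survey_aug <- merge(survey, cluster_means, by = "loc", all.x = TRUE)

# Drop rows with missing values, then fit the model with the cluster mean
# as an additional explanatory variable.
survey_aug <- na.omit(survey_aug)
fit <- lm(income ~ age + age_meanCensus, data = survey_aug)
```

Every survey row in location "a" receives the census mean age of that location, and likewise for "b".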

The user may choose to transform the response variable with a function, transfy, before the model is estimated. The function is applied directly to the entire response vector, i.e. transfy(response), so it must accept a vector as input. For transformations like log, exp or sqrt this simply yields an element-wise transformation. For more complex transformations, you may want to use sapply inside your function to ensure element-wise application. The same applies to transfy_inv and welfare.function, which must accept a matrix as input. In many cases a transformation like transfy could also be achieved by altering the specified model appropriately, but using transfy and transfy_inv is recommended.
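A small sketch of what "able to take a vector (or matrix) as input" means in practice. The functions `cap_scalar`, `cap_vec` and `cap_mat` are hypothetical examples and not part of ELLsae; they only illustrate wrapping a scalar-only rule in sapply.

```r
y <- c(100, 250, 400)

# Vectorised functions such as log() and exp() work element-wise as they are.
transfy     <- log
transfy_inv <- exp

# A hypothetical scalar-only rule: `if` handles a single value only,
# so it is wrapped in sapply() to apply it element by element.
cap_scalar <- function(v) if (v > 300) 300 else v
cap_vec    <- function(x) sapply(x, cap_scalar)

transfy(y)
cap_vec(y)

# Functions that receive a matrix: sapply() drops the dimensions,
# so they are restored before returning.
cap_mat <- function(m) {
  out <- sapply(m, cap_scalar)
  dim(out) <- dim(m)
  out
}
```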

From the regression, location effects are calculated as the mean of the regression residuals by location. Individual random error terms are then obtained as the difference between the regression residuals and the location effects. The bootstrapped response variables are generated using three sources of randomness. The betas obtained from lm() are replaced by draws from a multivariate normal distribution. In addition, random location effects and residuals are drawn with replacement. Internally, the sample is stored as a matrix, bootstrap, in which each column holds the bootstrap samples for one individual observation in the census data set.
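The following base R sketch walks through these steps on simulated toy data. It is a simplified illustration, not the package's C++ implementation: all object names are hypothetical, the multivariate normal draws are built from a Cholesky factor of vcov(fit), and it is an assumption of this sketch that one location effect is drawn per census location in each replication.

```r
set.seed(1)

# Toy survey and census; all names are hypothetical.
n_s <- 60
survey <- data.frame(loc = rep(c("a", "b", "c"), each = 20), x = rnorm(n_s))
survey$y <- 1 + 2 * survey$x + rnorm(n_s)
census <- data.frame(loc = rep(c("a", "b", "c"), each = 4), x = rnorm(12))

fit <- lm(y ~ x, data = survey)

# Location effects: mean of the regression residuals by location.
res     <- residuals(fit)
loc_eff <- sapply(split(res, survey$loc), mean)
# Individual error terms: residual minus the location effect.
ind_err <- unname(res - loc_eff[as.character(survey$loc)])

n_boot <- 5
X_c <- model.matrix(~ x, data = census)

# Source 1: coefficients drawn from a multivariate normal around the estimates.
R_chol <- chol(vcov(fit))  # upper triangular, t(R_chol) %*% R_chol = vcov(fit)
Z      <- matrix(rnorm(n_boot * length(coef(fit))), nrow = n_boot)
betas  <- sweep(Z %*% R_chol, 2, coef(fit), "+")

# Rows are bootstrap replications, columns are census observations.
boot     <- matrix(NA_real_, nrow = n_boot, ncol = nrow(census))
cens_loc <- as.character(census$loc)
for (b in seq_len(n_boot)) {
  # Source 2: one location effect per census location, drawn with replacement.
  eta <- sample(loc_eff, size = length(unique(cens_loc)), replace = TRUE)
  names(eta) <- unique(cens_loc)
  # Source 3: individual errors drawn with replacement.
  eps <- sample(ind_err, size = nrow(census), replace = TRUE)
  boot[b, ] <- as.vector(X_c %*% betas[b, ]) + eta[cens_loc] + eps
}
```

In this layout `boot[, j]` contains all bootstrap draws for census observation `j`, which matches the column-wise arrangement described above.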

If transfy_inv was specified, the bootstrap sample is transformed back. This function will be directly applied to the matrix of bootstrap samples, i.e. transfy_inv(bootstrap).

If a welfare function was specified, it will be used to transform the bootstrap sample. It will be directly applied to the matrix of bootstrap samples, i.e. welfare.function(bootstrap). In contrast to ellsae, bootstrap samples that belong to one observation are arranged column-wise in the internally stored matrix.
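As an illustration of a welfare function that accepts a matrix, the sketch below uses a poverty indicator. The poverty line of 100 and the toy matrix are made-up values; only the requirement that the function takes the whole bootstrap matrix comes from the description above.

```r
# Toy bootstrap matrix: 4 replications (rows) for 3 census observations (columns).
boot <- matrix(c(120,  80, 300,
                  90, 110, 250,
                 150,  70, 310,
                  95,  60, 280), nrow = 4, byrow = TRUE)

# Hypothetical welfare function: 1 if the value lies below the poverty line, else 0.
poverty_line <- 100
welfare.function <- function(m) (m < poverty_line) * 1

w <- welfare.function(boot)

# Per observation, the share of replications below the poverty line.
colMeans(w)
# → 0.50 0.75 0.00
```

Because the comparison is vectorised, the function keeps the matrix dimensions and needs no sapply wrapper.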

cores_c specifies the number of cores to use for the calculation in C++. As this parallelization is done in C++ and incurs little overhead, it should in most cases be left at "auto".

cores_r specifies the number of cores to use for calculations in R. The method of parallelization is the one implemented in the package foreach. Parallelization in R comes with a significant overhead, so the default is 1. "auto" invokes nb_cores and creates clusters according to the number of physical CPUs available.

To obtain reproducible results, a seed can be specified. Simply running set.seed() in R does not work. Providing a seed will not permanently alter the seed in R.
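How a function can use a seed without permanently changing the RNG state of the R session can be sketched as follows. The helper `with_local_seed` is hypothetical and not part of ELLsae; how ellsae_big handles the seed internally is not documented here, so this only demonstrates the general idea of saving and restoring `.Random.seed`.

```r
# Hypothetical helper: run `fun` under a fixed seed and restore the
# caller's RNG state afterwards.
with_local_seed <- function(seed, fun) {
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (has_seed) old_seed <- get(".Random.seed", envir = globalenv())
  on.exit({
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  fun()
}

# Same seed, same draws; the session's own random stream is left untouched.
a <- with_local_seed(1234, function() rnorm(3))
b <- with_local_seed(1234, function() rnorm(3))
identical(a, b)
```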

ellsae_big returns a list. By default, this list includes a matrix with basic summary statistics as specified in quantiles, a vector with the means of the bootstrap samples for every observation, and the lm object obtained from the linear model estimation. In addition, the user can request the full file-backed matrix (FBM) of bootstrap samples, and updated data.tables of the survey and census data sets with residuals, location effects and clustermeans added. The FBM can be subsetted with [i,j] just like a regular matrix.

Elbers, C., Lanjouw, J. O. and Lanjouw, P. (2003). Micro-Level Estimation of Poverty and Inequality. In: Econometrica 71.1, pp. 355-364, Jan 2003

Guadarrama Sanz, M., Molina, I., and Rao, J.N.K. (2016). A comparison of small area estimation methods for poverty mapping. In: 17 (Mar. 2016), 41-66 and 156 and 158.

Other small area estimation methods can also be found in the package

## Not run: 
# Generate a sample survey and census data from the provided brazil data set
brazil <-  ELLsae::brazil
helper <- sample(x = 1:nrow(brazil), size = nrow(brazil)/5, replace = FALSE)
helper <- sort(helper)
survey <- brazil[helper,]
census <- brazil[-helper,]
model.example <- hh_inc ~ geo2_br + age + sex + computer + trash

ELLsae::ellsae_big(model = model.example,
                   survey = survey,
                   census = census,
                   location_survey = "geo2_br",
                   n_boot = 250L,
                   seed = 1234,
                   transfy = log,
                   transfy_inv = exp,
                   output = "all",
                   cores_c = "auto",
                   cores_r = 1,
                   quantiles = c(0, 0.25, 0.5, 0.75, 1),
                   clustermeans = "age",
                   location_census = "geo2_br",
                   save_boot = FALSE)

## End(Not run)