View source: R/SAEforest_model.R
SAEforest_model | R Documentation |
This function enables the use of Mixed Effects Random Forests (MERFs)
for applications of Small Area Estimation (SAE). Unit-level survey data on a target and
auxiliary covariates is required to produce reliable estimates of various
disaggregated economic and inequality indicators. Option meanOnly
saves computational
time for users that are only interested in the estimation of domain-specific means using
unit-level and aggregated auxiliary data. Predefined indicators include
the mean, median, quantiles (10%, 25%, 75% and 90%
), the head count ratio, the poverty gap,
the Gini-coefficient and the quintile share ratio. The MERF algorithm is an algorithmic procedure
reminiscent of an EM-algorithm (see Details). Overall, the function serves as a coherent framework
for the estimation of point estimates and if requested uncertainty estimates for the indicators.
Methodological details are found in Krennmair & Schmid (2022) and Krennmair et al. (2022b).
The following examples showcase further potential applications.
SAEforest_model(
Y,
X,
dName,
smp_data,
pop_data,
MSE = "none",
meanOnly = TRUE,
aggData = FALSE,
smearing = TRUE,
popnsize = NULL,
importance = "impurity",
OOsample_obs = 25,
ADDsamp_obs = 0,
w_min = 3,
B = 100,
B_adj = 100,
B_MC = 100,
threshold = NULL,
custom_indicator = NULL,
initialRandomEffects = 0,
ErrorTolerance = 1e-04,
MaxIterations = 25,
aggregate_to = NULL,
na.rm = TRUE,
...
)
Y |
Continuous input value of target variable. |
X |
Matrix or data.frame of predictive covariates. |
dName |
Character specifying the name of the domain identifier, for which random intercepts are modeled. |
smp_data |
data.frame of survey sample data including the specified elements of |
pop_data |
data.frame of unit-level population covariate data |
MSE |
Character input specifying the type of uncertainty estimates. Available options are:
(i) "none" if only point estimates are requested,
(ii) "nonparametric" following the MSE bootstrap procedure proposed by Krennmair & Schmid (2022) or by Krennmair et al.
(2022a) if |
meanOnly |
Logical. Calculating domain-level means only. Defaults to |
aggData |
Logical input indicating whether aggregated covariate information or unit-level covariate information
is used for domain-level means. Defaults to |
smearing |
Logical input indicating whether a smearing based approach or a Monte Carlo (MC) version for
point estimates should be obtained to estimate (nonlinear) indicators. MC should be used if computational constraints prohibit a
smearing approach. For theoretical details see Krennmair et al (2022b). Defaults to |
popnsize |
data.frame, comprising information of population size of domains.
Only needed if |
importance |
Variable importance mode processed by the
random forest from ranger. Must be 'none', 'impurity', 'impurity_corrected' or
'permutation'. A concept of variable importance is needed for the production of
generic plots |
OOsample_obs |
Number of out-of-sample observations taken from the closest area for potentially unsampled
areas. Only needed if |
ADDsamp_obs |
Number of out-of-sample observations taken from the closest area if first iteration for the
calculation of calibration weights fails. Only needed if |
w_min |
Minimal number of covariates from which informative weights are calculated.
Only needed if |
B |
Number of bootstrap replications for MSE estimation procedures. Defaults to 100. |
B_adj |
Number of bootstrap replications for the adjustment of residual variance proposed by Mendez and Lohr (2001). Defaults to 100. |
B_MC |
Number of bootstrap populations in the MC version for point estimates of (nonlinear) indicators. Defaults to 100. |
threshold |
Set a custom threshold for indicators, such as the head count ratio. The threshold
can be a known numeric value or function of |
custom_indicator |
A list of additional functions containing the indicators to be
calculated. These functions must only depend on the target variable |
initialRandomEffects |
Numeric value or vector of initial estimates of random effects. Defaults to 0. |
ErrorTolerance |
Numeric value to monitor the MERF algorithm's convergence. Defaults to 1e-04. |
MaxIterations |
Numeric value specifying the maximal amount of iterations for the MERF algorithm. Defaults to 25. |
aggregate_to |
Character containing the name of a variable from
population data that indicates the target domain level for which the
results are to be displayed. Only available if |
na.rm |
Logical. Whether missing values should be removed. Defaults to |
... |
Additional parameters are directly passed to the random forest ranger.
Most important parameters are for instance |
MERFs combine advantages of regression forests (such as implicit model-selection and robustness properties) with the ability to model hierarchical dependencies.
The MERF algorithm iteratively optimizes two separate steps: a) the random forest function, assuming the random effects term to be correct and b) estimates the random effects part, assuming the OOB-predictions from the forest to be correct. Overall convergence of the algorithm is monitored by log-likelihood of a joint model of both components. For further details see Krennmair and Schmid (2022).
Users that are only interested in the estimation of domain-level means should set meanOnly = TRUE
.
The MERF requires covariate micro-data. This function, however also
allows for the use of aggregated covariate information, by setting aggData = TRUE
. Aggregated
covariate information is adaptively incorporated through calibration-weights based on empirical likelihood
for the estimation of area-level means. See methodological details in Krennmair et al. (2022a)
For the estimation of (nonlinear) poverty indicators and/or quantiles, we need information on the
area-specific cumulative distribution function (CDF) of Y
. Krennmair et al. (2022b) propose a smearing
approach originated by Duan (1983). Alternatively, Monte-Carlo methods are used to simulate the domain-specific
CDF of Y
.
For the estimation of the MSE, the bootstrap population is built based on a bias-corrected residual variance as discussed in Krennmair and Schmid (2022). The bootstrap bias correction follows Mendez and Lohr (2011).
Note that the MERFmodel
object is a composition of elements from a random forest of class
ranger
and a random effects model of class merMod
. Thus, all generic functions are
applicable to corresponding objects. For further details on generic functions see ranger
and lmer
as well as the examples below.
An object of class SAEforest
includes point estimates for disaggregated indicators
as well as information on the MERF-model. Optionally corresponding MSE estimates are returned.
Several generic functions have methods for the returned object of class SAEforest
. For a full
list and explanation of components and possibilities for objects of class SAEforest
,
see SAEforestObject
.
Duan, N. (1983). Smearing Estimate: A Nonparametric Retransformation Method. Journal of the American Statistical Association, 78(383), 605–610.
Krennmair, P., & Schmid, T. (2022). Flexible Domain Prediction Using Mixed Effects Random Forests. Journal of Royal Statistical Society: Series C (Applied Statistics) (forthcoming).
Krennmair, P., & Würz, N. & Schmid, T. (2022a). Analysing Opportunity Cost of Care Work using Mixed Effects Random Forests under Aggregated Census Data.
Krennmair, P., & Schmid, T & Tzavidis, Nikos. (2022b). The Estimation of Poverty Indicators Using Mixed Effects Random Forests. Working Paper.
Mendez, G., & Lohr, S. (2011). Estimating residual variance in random forest regression. Computational Statistics & Data Analysis, 55 (11), 2937–2950.
SAEforestObject
, ranger
, lmer
# Loading data
data("eusilcA_pop")
data("eusilcA_smp")
income <- eusilcA_smp$eqIncome
X_covar <- eusilcA_smp[,-c(1, 16, 17, 18)]
# Example 1:
# Calculating point estimates and discussing basic generic functions
model1 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
smp_data = eusilcA_smp, pop_data = eusilcA_pop)
# SAEforest generics:
summary(model1)
# Example 2:
# Calculating point + MSE estimates for aggregated covariate data and passing
# arguments to the random forest.
# Note that B is unrealistically low to improve example speed
# remove factor for gender
X_covar <- X_covar[,-1]
model2 <- SAEforest_model(Y = income, X = X_covar, dName = "district",
smp_data = eusilcA_smp, pop_data = eusilcA_popAgg,
MSE = "nonparametric", popnsize = popNsize,B = 5, mtry = 5,
num.trees = 100, aggData = TRUE)
# SAEforest generics:
summary(model2)
summarize_indicators(model2, MSE = TRUE, CV = TRUE)
# Example 3:
# Calculating point + MSE estimates and passing arguments to the forest.
# Two additional custom indicators and the threshold is defined as a custom function of Y.
# Note that B is unrealistically low to improve example speed.
model3 <- SAEforest_model(Y = income, X = X_covar, dName = "district", smp_data = eusilcA_smp,
pop_data = eusilcA_pop, meanOnly = FALSE, MSE = "nonparametric",
B = 5, mtry = 5, num.trees = 100, threshold = function(Y){0.5 *
median(Y)}, custom_indicator = list(my_max = function(Y,
threshold){max(Y)}, mean40 = function(Y, threshold){
mean(Y[Y<=quantile(Y,0.4)])}), smearing = FALSE)
# SAEforest generics:
summary(model3)
summarize_indicators(model3, MSE = FALSE, CV = TRUE, indicator = c("Gini", "my_max", "mean40"))
# Example 4:
# Calculating point + MSE estimates with random effect on the district level
# while the output is at state level.
# Note that B is unrealistically low to improve example speed
model4 <- SAEforest_model(Y = income, X = X_covar, dName = "district", smp_data = eusilcA_smp,
pop_data = eusilcA_pop, meanOnly = FALSE, MSE = "wild", B = 5,
num.trees = 50, aggregate_to = "state")
summary(model4)
summarize_indicators(model4, CV = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.