# strata.LH: Generalized Lavallee-Hidiroglou Method of Strata Construction In stratification: Univariate Stratification of Survey Populations

## Description

This function aims at constructing optimal strata with a generalized Lavallee-Hidiroglou (1998) method. The function uses Kozak's (2004) algorithm by default, but it can also apply Sethi's (1963) algorithm (argument algo="Sethi"). The function can take into account a discrepancy between the stratification variable X and the survey variable Y. It can consider a loglinear model with mortality between the variables (Baillargeon and Rivest, 2009). With Kozak's algorithm, two additional models are implemented: an heteroscedastic linear model and a random replacement model as in Rivest (2002). The determination of the optimal boundaries also incorporates, if desired, an anticipated non-response, a take-all stratum for the large units, a take-none stratum for the small units, and a certainty stratum to ensure that some specific units are in the sample.

Sethi's algorithm is not used by default because it can be numerically unstable, especially with a take-none stratum. Better results were obtained with Kozak's algorithm in our numerical experiments.

## Usage

 1 2 3 4 5 6 strata.LH(x, n = NULL, CV = NULL, Ls = 3, certain = NULL, alloc = list(q1 = 0.5, q2 = 0, q3 = 0.5), takenone = 0, bias.penalty = 1, takeall = 0, rh = rep(1, Ls), model = c("none", "loglinear", "linear", "random"), model.control = list(), initbh = NULL, algo = c("Kozak", "Sethi"), algo.control = list())

## Arguments

 x A vector containing the values of the stratification variable X for every unit in the population. n A numeric: the target sample size. It has no default value. The argument n or the argument CV must be input. CV A numeric: the target coefficient of variation or relative root mean squared error if takenone=1. It has no default value. The argument CV or the argument n must be input. Ls A numeric: the number of sampled strata (take-none and certain strata are not counted in Ls). The default is 3. certain A vector giving the position, in the vector x, of the units that must be included in the sample (see stratification-package). By default certain is NULL, which means that no units are chosen a priori to be in the sample. alloc A list specifying the allocation scheme. The list must contain 3 numerics for the 3 exponents q1, q2 and q3 in the general allocation scheme (see stratification-package). The default is Neyman allocation (q1=q3=0.5 and q2=0) takenone A numeric: the number of take-none strata (0 or 1). The default is 0, i.e. no take-none stratum is included. bias.penalty A numeric between 0 and 1 giving the penalty for the bias in the anticipated mean squared error (MSE) of the survey estimator (see stratification-package). This argument is relevant only if takenone=1. The default is 1. takeall A numeric: the number of take-all strata (one of {0, 1, ..., Ls-1}). The default is 0, i.e. no take-all stratum is included. rh A vector giving the anticipated response rates in each of the Ls sampled strata. A single number can be given if the rates do not vary among strata. The default is 1 in each stratum. model A character string identifying the model used to describe the discrepancy between the stratification variable X and the survey variable Y. It can be "none" if one assumes Y=X, "loglinear" for the loglinear model with mortality, "linear" for the heteroscedastic linear model or "random" for the random replacement model (see stratification-package for a description of these models). The default is "none", so the original Lavallee-Hidiroglou (1998) method of strata construction is used. The last two models "linear" and "random" are only available with Kozak's algorithm. model.control A list of model parameters (see stratification-package). The default values of the parameters correspond to the model Y=X. initbh A vector of initial stratum boundaries (see Details). algo A character string identifying which optimization algorithm is to be used. It can be "Kozak" for Kozak's (2004) algorithm or "Sethi" for Sethi's (1963) algorithm (see Details). The default is "Kozak" because it performs better and offers more options than Sethi's algorithm. algo.control A list of parameters to control the optimization algorithm (see Details).

## Details

CALCULATION OF THE OPTIMIZATION CRITERIA

Kozak's and Sethi's algorithms aim at finding optimal stratum boundaries. This optimality refers to a criteria :
the total sample size n if a target CV was given
the RRMSE if a target n was given (not available for Sethi's algorithm).
This criteria, no matter if it is n or the RRMSE, is a function of the stratum sample sizes, among others. But these stratum sample sizes can be non-integer (nhnonint obtained directly from applying the allocation rule) or integer (nh obtained from rounding the nhnonint), as mentioned in stratification-package. The optimization criteria does not vary much depending on the sample size used, but it is not exactly the same. Let's call opti.nh the criteria calculated with nh and opti.nhnonint the criteria calculated with nhnonint. Optimizing based on opti.nh vs opti.nhnonint might not give the same result. In fact, according to our numerical tests, it is a little better to optimize based on opti.nh. This is logical since the sample sizes used in practice, to apply a stratified design, are the integer one. Therefore, as of version 2.1-0 of the package, Kozak's algorithm uses by default the integer sample sizes nh to calculate the optimization criteria (but it can also use opti.nhnonint by setting idopti="nhnonint" in algo.control). However, Sethi's algorithm works with derivatives to perform optimization. Therefore, it does not treat the optimization as a discrete problem as Kozak's algorithm does. It can only work with a real optimization criteria, i.e. the criteria opti.nhnonint calculated with non-integer sample size. This is one more reason to favour Kozak's algorithm over Sethi's one.

INITIAL STRATUM BOUNDARIES

Both Kozak's and Sethi's algorithm are iterative and they need starting values. It can be given by the user with the initbh argument. If not given, the default initial boundaries are obtained with the function strata.cumrootf (cumulative root frequency method) for Kozak's algorithm. For Sethi's algorithm, the default value of initbh is the set of Ls-1 equidistant points along the range of the X-values (arithmetic starting points of Gunning and Horgan (2007)). Let's define d=(max(X)-min(X))/Ls, the arithmetic boundaries are bh = min(X) + h x d for h=1,…,Ls-1.

The default initial boundaries are not the same for the two algorithms because Kozak's algorithm uses by default the criteria opti.nh whereas Sethi's algorithm can only use opti.nhnonint. In our numerical experiments with the criteria opti.nh, the cumulative root frequency boundaries performed a little better than geometric, arithmetic or robust boundaries (for a definition of theses boundaries, see details about the trymany algorithm parameter). On the other hand, the arithmetic boundaries had good performances (Baillargeon and Rivest 2009) with the criteria opti.nhnonint. Let's note however that we never found, in any of our numerical experiments, initial boundaries clearly better than others.

The length of the vector initbh can be Ls-1 or L-1, where L is the total number of strata. When takenone=0, L=Ls and it does not make any difference. But for a stratified design with a take-none stratum, it means that one can give as initbh argument the vector of initial boundaries (b1, b2, ..., bL-1) or (b2, ..., bL-1). In the second option, the upper boundary of the take-none stratum is not given. In that case, it is set by default to the first percentile of x. If this first percentile is equal to the minimum value of x, this initial boundary would lead to an empty take-none stratum. In that case, the initial boundary of the take-none stratum is rather set to the second smallest value of x.

Sometimes, the specified initial boundaries are not suitable for Kozak's algorithm. As written below, this algorithm verifies at each iteration that the sampled strata contains at least minNh units and that they have positive sample sizes nh. If the initial boundaries do not meet these conditions, the initial optimization criteria is not comparable to the criteria for boundaries respecting the conditions. Therefore, in such a situation, a warning is produced and the algorithm is not run.

ALGORITHMS

Sethi: The formulas implemented for Sethi's algorithm are presented in Baillargeon, Rivest and Ferland (2007). As mentioned previously, this algorithm is available for a target CV only. Moreover, it works with derivatives to perform optimization. Therefore, it can only work with a real optimization criteria (opti.nhnonint).

Kozak: Kozak's algorithm is described in Kozak (2004). It starts at the initial boundaries initbh and at each iteration, it chooses a stratum boundary at random and a random modification for this boundary among the 2*maxstep possible alternatives. If this modification reduces the optimization criteria, creates sampled strata containing at least minNh units and leads to positive nh, it is accepted. Otherwise, the boundaries are not changed. If the boundaries remain unchanged for maxstill consecutive iterations, the algorithm has reached convergence and it stops.
Kozak's (2004) used the condition nh >= 2 instead of nh > 0. We chose to modify this requirement because we noticed that more severe conditions sometimes prevented the algorithm from selecting a path leading to the optimal solution. According to our numerical experiments, if one sets a posteriori to 2 the nh's that are equal to 1 at the end of the algorithm, the new sample size is smaller than or equal to the one obtained with the condition nh >= 2.
If no initbh argument is given, the algorithm is run with three different sets of initial boundaries (see details of the algorithm parameter trymany). If initbh is given, it uses these initial boundaries. Bot no matter how many initial boundaries are tried, the algorithm is run rep times with each. Running the algorithm many times helps having more stable results. Since the algorithm is random, two runs of the algorithm, event from the same starting point, can lead to different results. Setting rep to a large value and trying more than one set of initial boundaries increases the calculation time, but also increases the chance of finding the global minimum. also helps in finding the global minimum.

The optimization of stratum boundaries given a target CV or a target n is a discrete problem. The number of possible sets of boundaries is choose(Nu-1,L-1) when takenone=0, and choose(Nu,L-1) when takenone=1, where Nu stands for the number of unique values in the x-vector from which units in the certainty stratum, if any, heve been removed. When this number is not too large, trying them all is numerically possible. That's what the function strata.LH do, for Kozak's algorithm, when the number of possible solutions is lower than the minsol parameter in algo.control. This complete enumeration of all possible cases ensure that the global minimum is reached.

Note: Since version 2.2-0 of the package stratification, the modified Kozak's algorithm, which is a non-random version of Kozak's algorithm, is no more available because it never performed as well as the original Kozak's algorithm in our numerical experiments.

ALGORITHM PARAMETERS

The algo.control argument is a list to supply any of the following parameters which control the algorithm. Sethi's algorithm only uses the first argument maxiter.

maxiter

A numeric: the maximal number of iterations. The default is 500 for Sethi's algorithm and 10 000 for Kozak's. It is only used to prevent from infinite loops in case of non-convergence.

minsol

A numeric for Kozak's algorithm only: if the number of possible sets of boundaries is lower than minsol, a complete enumeration of the solutions is performed instead of running the algorithm. Every set of boundaries is tried, which ensures that the global minimum is reached. The default value of minsol is 10 000. This parameter has to take a value between 2 and 2 000 000.

idopti

A character string for Kozak's algorithm only: this argument determines which stratum sample sizes are going to be used to calculate the optimization criteria. It can take the value "nh" (criteria calculated with the integer sample sizes nh) or "nhnonint" (criteria calculated with the non-integer sample sizes nhnonint). The default value is "nh" since it gives slightly better results than idopti="nhnonint" and also because the integer sample sizes are the ones used in practice. When a complete enumeration is performed, the criteria is automatically calculated with integer stratum sample sizes. Note: Prior to version 2.1-0 of the package stratification, only the option idopti="nhnonint" was available (in fact the algorithm parameter idopti did not exist).

minNh

A numeric for Kozak's algorithm only: the minimum number of units required in each sampled stratum (no restriction is put on the take-none stratum, if included). minNh must be greater or equal to 2, which is the default.

maxstep

A numeric for Kozak's algorithm only: the maximal step for boundary modification (see the algorithm description above). The default is Nu/10, rounded up and truncated to 100 (Nu is the number of unique values in the x-vector from which units in the certainty stratum, if any, heve been removed). This default value is, for most populations, much larger than the initial suggestion by Kozak (2004) of a integer no bigger than 5. Our numerical experiments showed us that a maxstep value of about 3 (package stratification initial default value) is adequate when the optimization criteria is calculated with non-integer stratum sample sizes (opti.nhnonint). However, when this criteria is calculated with integer stratum sample sizes (opti.nh), the algorithm never do worst and sometimes reaches an even lower optimization criteria with a larger maxstep. The parameter maxstill must be increased consequently (see below). The downside of large maxstep and maxstill values is that many iterations are needed for the algorithm to converge, making it longer to run. We implemented a solution to this problem. This solution is based on the fact that when the algorithm is close to the solution, it can only accept boundaries modifications with small steps. The large steps are useful during the first iterations of the algorithm to let the boundaries move freely, but it becomes useless when convergence is close. Therefore, the values of maxstep and maxstill are brought down to 3 and 50, respectively, when new boundaries are accepted under these circumstances : the step in the boundary modification is lower than 3 in absolute value, more than 50 iterations without change happened before this change, the relative change in the optimization criteria is small. This rule is arbitrary, but it seems to work well so far. It gives results as good as without changing the values of maxstep and maxstill, but it runs much faster because it needs less iterations to converge.

maxstill

A numeric for Kozak's algorithm only: the maximal number of iterations without a change in the boundaries (see the algorithm description above). The default is maxstep*10, bounded between 50 and 500. It depends on the value of maxstep because we increased the default value of this parameter by making it a function of the number of unique values in the x-vector (see above). The upper bound of 500 ensures that the algorithm does not have to make too much iterations (which take time) to converge.

rep

A numeric for Kozak's algorithm only: the number of repetitions of the algorithm (see the algorithm description above). The default is 5. Since version 2.2-0 of the package stratification, the option rep="change" has been replaced by the option trymany=TRUE.

trymany

A logical for Kozak's algorithm only: if trymany=TRUE, three sets of initial boundaries are tried instead of one. These boundaries are cumulative root frequency (Dalenius and Hodges, 1959), geometric (Gunning and Horgan, 2004) and robust ones. Robust boundaries were created in order to have boundaries respecting as often as possible the conditions imposed on boundaries in Kozak's algorithm: strata containing at least minNh units and positive nh. Robust boundaries give an empty take-none stratum if such a stratum is requested, take-all strata as small as possible, and take-some strata with approximately the same number of unique X-values. When trymany=TRUE, the same values for the other algorithm parameters (the one given by the user or otherwise the default) are used for every set of initial boundaries. If trymany=FALSE, only the given or default (see above) initial boundaries are used. If a user give an initbh argument, but also select trymany=TRUE, the later is ignored and only the initbh initial boundaries are used.

CONVERGENCE

It is possible for the algorithm not to converge. In this case, a warning is printed. The only possible cause of non-convergence for Kozak's algorithm is to reach the maximum number of iterations before the stopping rule has been met. For the algorithm to converge, the argument maxiter has to be increased. On the other hand, non-convergence of Sethi's algorithm has several possible causes:

• a division by zero caused by an empty stratum can occur ;

• a division by zero caused by a 0 stratum variance can occur ;

• the square root of a negative number (negative discriminant) can occur ;

• the maximum number of iterations can be reached (often because the algorithm is caught in a loop of non-optimal sets of boundaries).

If a non-convergence happens, the user can try to change the initial boundaries or the model parameters. The user can also choose to work with Kozak's algorithm which should converge given an appropriate maximum number of iterations.

Let's note that even if the algorithm converges, it is not guaranteed that it has reached a global minimum. Local minimum can occur both with Sethi's and Kozak's algorithms (Rivest and Baillargeon, 2009).

## Value

 bh A vector of the L-1 optimal stratum boundaries found by the algorithm. Nh A vector of length L containing the population sizes Nh, i.e. the number of units in each stratum. nh A vector of length L containing the sample sizes nh, i.e. the number of units to sample in each stratum. See stratification-package for information about the rounding used to get these integer values. n The total sample size (sum(nh)). nhnonint A vector of length L containing the non-integer values of the sample sizes, obtained directly from applying the allocation rule (see stratification-package). certain.info A vector giving statistics for the certainty stratum (see stratification-package). It contains Nc, the number of units chosen a priori to be in the sample, and meanc, the anticipated mean of Y for these units. opti.criteria The final value of the criteria to optimize : either n if a target CV was given or the RRMSE if a target n was given. If the algorithm parameter idopti takes the value "nh" (the default), the stratum sample sizes used for the calculation of this criteria are the integer ones (nh). If the algorithm parameter idopti takes the value "nhnonint", the non-integer stratum sample sizes (nhnonint) are used. In other words, idopti determines here if opti.criteria is opti.nh or opti.nhnonint. meanh A vector of length L containing the anticipated means of Y in each stratum. varh A vector of length L containing the anticipated variances of Y in each stratum. mean A numeric: the anticipated global mean value of Y. RMSE A numeric: the root mean squared error (or standard error if takenone=0) of the anticipated global mean of Y. This is defined as the squared root of: (bias.penalty x bias of the mean)^2 + variance of the mean. RRMSE A numeric: the anticipated relative root mean squared error (or coefficient of variation if takenone=0) for the mean of Y, i.e. RMSE divided by mean. relativebias A numeric: the anticipated relative bias of the estimator, i.e. (bias.penalty x bias of the mean) divided by mean. If takenone=0, this numeric is zero. propbiasMSE A numeric: the proportion of the MSE attributable to the bias of the estimator, i.e. (bias.penalty x bias of the mean)^2 divided by the MSE of the mean. If takenone=0, this numeric is zero. stratumID A factor, having the same length as the input x, which values are either 1, 2, ..., L or "certain". The value "certain" is given to units a priori chosen to be in the sample. This factor identifies, for each observation, the stratum to which it has been assigned. takeall The number of take-all strata in the final solution. Note: It is possible that n_h=N_h for non take-all strata because the condition for an automatic addition of a take-all stratum is n_h>N_h. call The function call (object of class "call"). date A character string that contains the system date and time when the function ended. args A list of all the argument values input to the function or set by default. iter.detail Only if an iterative algorithm is run (thus not int the output when a complete enumeration is performed): a data frame giving, for each iteration of the algorithm: "bh": the stratum boundaries; "opti.nh": the value of the criteria to optimize calculated with integer stratum sample sizes (for Kozak's algorithm only); "opti.nhnonint": the value of the criteria to optimize calculated with non-integer stratum sample sizes; "takeall": the number of take-all strata for the corresponding boundaries, after making, if needed, an adjustment for non requested take-all strata; "step": the step in the boundary modification (for Kozak's algorithm only); "iter": the iteration identification number; "run": the run identification number (for Kozak's algorithm only) because when trymany=TRUE or rep>1 the algorithm is run more than once. For Kozak's algorithm, a row is added to iter.detail only for accepted boundaries, i.e. boundaries decreasing the optimization criteria while respecting the conditions on N_h and n_h (see Details for the algorithm description). niter Only if an iterative algorithm is run: the total number of iterations before convergence of the algorithm. For Kozak's algorithm, this number includes all the iterations, even the ones with discarded boundaries. niter is a vector if the algorithm was run more than once and if more than one run lead to the output solution. converge Only if an iterative algorithm is run: a logical indicating if the algorithm has converged (see Details). run.detail Only for Kozak's algorithm when it is run more than once: a data frame giving the final solution for every run of the algorithm. It contains, for each run: "bh": the stratum boundaries at convergence; "opti.nh": the value of the criteria to optimize calculated with integer stratum sample sizes; "opti.nhnonint": the value of the criteria to optimize calculated with non-integer stratum sample sizes; "takeall": the number of take-all strata for the corresponding boundaries, after making, if needed, an adjustment for non requested take-all strata; "niter": the number of iterations of the algorithm; "initbh.type": the type of initial boundaries requested ("initbh" for user-given boundaries, "cumrootf", "geo" or robust); "ibh": the initial boundaries used to run the algorithm; "rep": the repetition number (because the algorithm is run rep times for each set of initial boundaries tried). When trymany=TRUE, if a set of initial boundaries do not meet the conditions on N_h and n_h (see Details for the algorithm description), the algorithm is not run with these boundaries and no rows are added to run.detail. run.min Only for Kozak's algorithm when it is run more than once: the identification number of every algorithm run leading to the optimal plan, i.e. the rows of run.detail containing the proposed solution. sol.detail Only for Kozak's algorithm when the number of possible sets of boundaries is lower than minsol: a data frame giving information for the possible solutions fulfilling the conditions Nh >= minNh and nh > 0. It contains: "bh": the stratum boundaries; "Nh": the number of units in each stratum; "opti.nh": the value of the criteria to optimize calculated with integer stratum sample sizes; "opti.nhnonint": the value of the criteria to optimize calculated with non-integer stratum sample sizes; "takeall": the number of take-all strata for the corresponding boundaries, after making, if needed, an adjustment for non requested take-all strata. sol.min Only for Kozak's algorithm when the number of possible sets of boundaries is lower than minsol: the rows of sol.detail containing the optimal solution. nsol Only for Kozak's algorithm : the number of possible sets of boundaries (see Details).

## Author(s)

Sophie Baillargeon Sophie.Baillargeon@mat.ulaval.ca and
Louis-Paul Rivest Louis-Paul.Rivest@mat.ulaval.ca

## References

Baillargeon, S., Rivest, L.-P., Ferland, M. (2007). Stratification en enquetes entreprises : Une revue et quelques avancees. Proceedings of the Survey Methods Section, 2007 SSC Annual Meeting.

Baillargeon, S. and Rivest, L.-P. (2009). A general algorithm for univariate stratification. International Stratification Review, 77(3), 331-344.

Baillargeon, S. and Rivest L.-P. (2011). The construction of stratified designs in R with the package stratification. Survey Methodology, 37(1), 53-65.

Dalenius, T. and Hodges, J.L., Jr. (1959). Minimum variance stratification. Journal of the American Statistical Association, 54, 88-101.

Gunning, P. and Horgan, J.M. (2004). A new algorithm for the construction of stratum boundaries in skewed populations. Survey Methodology, 30(2), 159-166.

Gunning, P. and Horgan, J.M. (2007). Improving the Lavallee and Hidiroglou algorithm for stratification of skewed populations. Journal of Statistical Computation and Simulation, 77(4), 277-291.

Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, 6(5), 797-806.

Lavallee, P. and Hidiroglou, M.A. (1988). On the stratification of skewed populations. Survey Methodology, 14, 33-43.

Rivest, L.-P. (1999). Stratum jumpers: Can we avoid them? Proceedings of the Section on Survey Methods Research of the American Statistical Association, 64-72.

Rivest, L.-P. (2002). A generalization of the Lavallee and Hidiroglou algorithm for stratification in business surveys. Survey Methodology, 28(2), 191-198.

Sethi, V. K. (1963). A note on optimum stratification of populations for estimating the population means. The Australian Journal of Statistics, 5, 20-33.