R Package 'stratifyR

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Type Package

Title Optimal Stratification of Univariate Populations

Version 1.0-3

Date 2021-11-23

Author Karuna G. Reddy, M.G.M. Khan

Maintainer Karuna G. Reddy \<karuna.reddy@usp.ac.fj>

Description This implements the stratification of univariate populations under stratified sampling designs using the method of Khan et al. (2002) https://doi.org/10.1177/0008068320020518, Khan et al. (2008) https://www150.statcan.gc.ca/n1/pub/12-001-x/2008002/article/10761-eng.pdf and Khan et al. (2015) https://doi.org/10.1080/02664763.2015.1018674. It determines the Optimum Strata Boundaries (OSB) and Optimum Sample Sizes (OSS) for the study variable, $y$, using the best-fit frequency distribution of a survey variable (if data is available) or a hypothetical distribution (if data is not available). The method formulates the problem of determining the OSB as mathematical programming problem which is solved by using a dynamic programming technique. If a dataset of the population is available to the surveyor, the method estimates its best-fit distribution and determines the OSB and OSS under Neyman allocation directly. When the dataset is not available, stratification is made based on the assumption that the values of the study variable, $y$, are available as hypothetical realizations of proxy values of $y$ from recent surveys. Thus, it requires certain distributional assumptions about the study variable. At present, it handles stratification for the populations where the study variable follows a continuous distribution, namely, Pareto, Triangular, Right-triangular, Weibull, Gamma, Exponential, Uniform, Normal, Log-normal and Cauchy distributions. In this vignette, the two major functionalities in the package are applied to a number of real and simulated populations and to some hypothetical populations.

License GPL (>= 2)

LazyData yes

NeedsCompilation yes

Depends R (>= 3.4.4), MASS, fitdistrplus, actuar, triangle, mc2d, zipfR

NeedsCompilation yes

Repository CRAN

Date/Publication 2021-11-23 10:00:00

Introduction

\url{} The main aim of stratification is to produce estimators with a small variance when a population characteristic $(y)$ is under study. A simple method that can be used to create strata for this population, if $y$ itself is the stratification variable. The ideal situation is that the distribution of such a study variable is known and the OSB can be determined by placing boundaries on the range of this distribution at suitable cut-points. This problem of determining the OSB, when both the estimation and stratification variables are the same, was first discussed by @dalenius1950problem. He provided equations for the determination of stratum boundaries that minimize the variance of population estimates under optimal allocation. @Dalenius1957 futher proposed a solution to the problem by taking equal intervals of the cumulative square root of frequency scale of the stratification variable.

One of the many kinds of stratification methods that has been proposed in the literature is due to @Buhler1975. They formulated the problem of determining OSB as an optimization problem and developed a computational technique to solve the problem by using Dynamic Programming (DP). A good review of this method can be found in @khan2008determining. The procedure is also applied by @Khan2002, @khan2003theory, @Khan2005, @khan2008determining, @Khan2009, @khan2015determining, @khan2015designing for determining OSB for many different distributions. With the known frequency function of the study variable, they considered the problem of finding OSB as an equivalent problem of determining Optimum Strata Width (OSW), which is formulated as a Mathematical Programming Problem (MPP) and solved by using DP technique. They applied the technique to several univariate populations where the study variables followed different probability distributions. In this package, a univariate stratification technique is developed, which is based on the probability distribution assumed by the stratification variable as discussed by the authors above.

As implemented in this package, there are many advantages of using the method. The real advantage of the stratifyR package is that when the dataset of the study variable is not available, which may occur in practice, the package is still able to construct OSB and OSS based on the distributional assumption on the data. Moreover, the population survey can be a costly and time-consuming affair, hence, this approach also has the advantage of determining OSB for the study variable that is not available prior to conducting the survey. This, of course, requires certain distributional or parametric assumptions on the study variables, which can easily be obtained from recent or past surveys.

Other advantages of the method are that it leads to substantial gains in the precision of the estimates over other available methods. Results reveal that the variances get smaller with increasing number of strata $(L)$, and they get much smaller at a much faster rate than other available methods. Once the OSB have been determined, the Optimum Sample Sizes (OSS) can be easily calculated for each stratum using Neyman allocation (@Neyman1934). The algorithm may be a little slow, however, it does provide very efficient results which is probably the most important objective of our survey estimation efforts.

The Package in a Glance

The stratifyR package implements the DP technique (from various literature by Khan et al) as a stratification procedure for univariate populations, when the stratification variable follows a continuous probability distribution, namely: uniform, triangular, right-triangular, pareto, exponential, normal, log-normal, cauchy, weibull and gamma. The package is able to determine the OSB and OSS from real data and as well as hypothetical situations when the dataset is not available. The latter requires some assumptions about the distribution and its initial value, range, parameter values, fixed sample size, etc.

There are two major functions which basically solve the two types of stratification problems: strata.data() which carries out univariate stratification for those univariate populations where dataset is available and strata.distr() which performs stratification when dataset is not available prior to conducting the survey.

In the former case, data on the study variable, number of strata $(h)$, fixed sample size $(n)$ and population size $(N)$ are used as the input arguments to the strata.data() function in the package. In the latter case, strata.distr() function is called which requires the distribution to be assumed, its parameters, the inital value and the estimated range of the distribution, fixed sample $(n)$ and population sizes $(N)$. When executed, both the functions output the OSB and OSS, amongst other quantities such as stratum weight $(W_{h})$, stratum variance $(S_{h}^{2})$, stratum objective function values $(W_{h}S_{h})$, stratum sample sizes $(n_{h})$, stratum population sizes $(N_{h})$ and stratum sampling fraction $(f_{h})$.

The following sections show the general formulation of the problem of stratification, the DP solution procedure, the concept of Neyman allocation as the method of determining sample size and the overview of package functionalities. The package is then applied to numerous survey populations with real and simulated data to illustrate its application.

General Formulation of the Univariate Stratification Problem \label{formuni}

Let the target population of the variable under study be stratified into $L$ strata where the estimation of the mean of the study variable $(y)$ is of interest. If a simple random sample of size $n_{h}$ is to be drawn from $h^{th}$ stratum with sample mean $\bar{y}{h}$, then the stratified sample mean, $\bar{y}{st}$, is given by \begin{eqnarray} \bar{y}{st}=\sum{h=1}^{L}W_{h}\bar{y}{h}, \tag{1} \end{eqnarray} where $W{h}$ (stratum weight) is the proportion of the population contained in the $h^{th}$ stratum.

When the finite population correction factors are ignored, under the Neyman (1934) allocation, the variance of $\bar{y}{st}$ is given by \begin{eqnarray} Var(\bar{y}{st})=\dfrac{\left(\sum_{h=1}^{L}W_{h}S_{h}\right)^{2}}{n}, \tag{2} \end{eqnarray} where $S_{h}^{2}$ is the stratum variance for the study variable in the $h^{th}$ (where $h=1, 2, ..., L$) stratum and $n$ is the preassigned total sample size.

Let $f(y);\;a\leq y\leq b$ be the frequency function of the study variable, $y$, on which OSB are to be constructed. If the population mean of this study variable is estimated under Neyman allocation, then the problem of determining OSB is to cut up the range, $d=b-a$, at $(L-1)$ intermediate points $a=y_{0} \leq y_{1} \leq y_{2} \leq, ..., \leq y_{L-1} \leq y_{L}=b\,$ such that (2) is minimum. The lower and upper bounds of the study variable are denoted by $a$ and $b$ respectively and the cut-points $y_{0}, y_{1}, y_{2}, ..., y_{L-1}$ are the OSB.

For a fixed sample size $n$, minimizing the expression of the right hand side of equation (2) is equivalent to minimizing \begin{eqnarray} \sum_{h=1}^{L} W_{h}S_{h} \tag{3} \end{eqnarray}

If $f(y)$ is known and integrable, the stratum weight $(W_{h})$, stratum variance $(S_{h}^{2})$ and stratum mean $(\mu_{h})$ can be obtained as a function of the boundary points $y_{h}$ and $y_{h-1}$ by using the following expressions: \begin{eqnarray}\label{stratumweight} W_{h} = \int_{y_{h-1}}^{y_{h}} f(y)dy \tag{4} \end{eqnarray}

\begin{eqnarray} S_{h}^{2}= \dfrac{1}{W_{h}}\int_{y_{h-1}}^{y_{h}}y^{2}f(y)dy - \mu_{h}^{2} \tag{5} \end{eqnarray}

\begin{eqnarray} \textrm{where}\;\;\;\mu_{h}=\dfrac{1}{W_{h}}\int_{y_{h-1}}^{y_{h}}yf(y)dy \tag{6} \end{eqnarray}

where $(y_{h-1}, y_{h})$ are the boundaries of $h^{th}$ stratum.

Thus, the objective function in (3) could be expressed as a function of boundary points $y_{h}$ and $y_{h-1}$ only. We further define \begin{eqnarray} l_{h}=y_{h}-y_{h-1}; \; h=1,2, ..., L \tag{7} \end{eqnarray}

where $l_{h}\geq 0$ denotes the range or width of the $h^{th}$ stratum. Then, the range of the distribution, $d = b - a$, is expressed as a function of stratum width as: \begin{eqnarray}\label{215} \sum_{h=1}^{L}l_{h}=\sum_{h=1}^{L}(y_{h}-y_{h-1}) = b-a = y_{L} - y_{0} = d \tag{8} \end{eqnarray} The $h^{th}$ stratification point $y_{h};\;h=1,2,..., L\,$ is then expressed as $y_{h}=y_{h-1}+l_{h}$ and from (8), the problem can be treated as an equivalent problem of determining Optimum Strata Widths (OSW), $l_{1}, l_{2}, ..., l_{L}$. Due to the special nature of functions, the problem may be treated as a function of $l_{h}$ alone and can be expressed as: \begin{eqnarray}\label{genMPP} &\textrm{Minimize}& \;\;\; \sum_{h=1}^{L}\phi_{h}(l_{h}),\nonumber \ &\textrm{subject to}&\;\;\;\; \sum_{h=1}^{L}l_{h} = d,\nonumber \ &\textrm{and}& \;\;\;\; l_{h} \geq 0; \;\; h = 1,2,...,L. \tag{9} \end{eqnarray}

Dynamic Programming Technique as a Solution Procedure

Many multistage decision problems can be formulated as a Mathematical Programming Problem (MPP). The Dynamic Programming technique, developed by @RBellman1957, is a computational method which is well suited for solving MPPs that may be treated as a multistage decision problem. The DP technique determines the optimum solution of a multistage problem by decomposing it into stages, each stage comprising of a single stage. The advantage of the decomposition is that the optimization process at each stage involves one variable only, which simplifies the computational task by dealing with all variables simultaneously.

The solution to an MPP is achieved in a sequential manner starting from one stage problem, moving onto a two stage problem, to a three stage problem and so on until finally all stages are included. The solution for $n$ stages is obtained by adding the $n^{th}$ stage to the solution of $n - 1$ stages.

The basic concept of DP technique is contained in the principle of optimality proclaimed by @RBellman1957, which implies that given the initial state of a system, an optimal policy for the subsequent stages does not depend upon the policy adopted at the preceding stages. It determines the optimum solution of a multi-variable problem by decomposing it into stages, each stage compromising a single variable sub-problem. A dynamic programming model is basically a recursive equation which links the different stages of the problem in a manner which guarantees that each stage's optimal feasible solution is also optimum and feasible for the entire problem.

Consider the following sub-problem of (9) for first $k(<L)$ strata:

\begin{eqnarray} &\text{Minimize}& \;\;\;\; \sum_{h=1}^{k}\phi_{h}(l_{h}),\nonumber \ &\text{subject to}&\;\;\;\;\; \sum_{h=1}^{k}l_{h} = d_{k},\nonumber \ &\text{and}& \;\;\;\;\; l_{h} \geq 0; \;\; h = 1,2,...,k.\tag{10} \end{eqnarray}

where $d_{k}< d$ is the total width available for division into \textit{$k$} strata or the state value at stage $k$. Note that $d_{k} = d$ for $k = L$.

The transformation functions are given by:

\begin{eqnarray} d_{k}\;\;&=&\;\;l_{1}+l_{2}+...+l_{k},\ d_{k-1}\;\;&=&\;\;l_{1}+l_{2}+...+l_{k-1}\;=\;d_{k}-l_{k},\ d_{k-2}\;\;&=&\;\;l_{1}+l_{2}+...+l_{k-2}\;=\;d_{k-1}-l_{k-1},\ &\vdots&\;\;\;\;\;\;\;\;\;\;\;\;\vdots\ d_{2}\;\;&=&\;\;l_{1}+l_{2}\;=\;d_{3}-l_{3},\ d_{1}\;\;&=&\;\;l_{1}\;=\;d_{2}-l_{2}. \end{eqnarray}

Let $\Phi_{k}(d_{k})$ denote the minimum value of the objective function of MPP (10), that is,

\begin{equation} \Phi_{k}(d_{k}) = \text{min}\left[ \sum_{h=1}^{k}\phi_{h}(l_{h})\bigg|\sum_{h=1}^{k}l_{h}=d_{k}, \,\textrm{and}\,l_{h}\;\geq 0;\,h=1,2,...,k \;\textrm{and}\; 1\leq k \leq L \right].\nonumber \end{equation}

With the above definition of $\Phi_{k}(d_{k})$, (10) is equivalent to finding $\Phi_{L}(d)$ recursively by finding $\Phi_{k}(d_{k})$ for $k = 1, 2, ..., L$ and $0 \leq d_{k} \leq d.$

We can write:

\begin{equation} \Phi_{k}(d_{k}) = \text{min}\left[\phi_{k}(l_{k})+ \sum_{h=1}^{k-1}\phi_{h}(l_{h})\bigg|\sum_{h=1}^{k-1}l_{h}=d_{k}-l_{k}, \;\textrm{and}\;l_{h}\;\geq\;0;\;h=1,2,...,k\right].\nonumber \end{equation}

For a fixed value of $l_{k}$; $0 \leq l_{k} \leq d_{k}$,

\begin{equation} \Phi_{k}(d_{k}) = \phi_{k}(l_{k})+ \text{min}\left[ \sum_{h=1}^{k-1}\phi_{h}(l_{h})\bigg|\sum_{h=1}^{k-1}l_{h}=d_{k}-l_{k}, \;\textrm{and}\;l_{h}\;\geq 0;\;h=1,2,...k-1\;\textrm{and}\;\; 1\leq k \leq L\right].\nonumber \end{equation}

Using the Bellman's principle of optimality, a forward recursive equation can be written as:

\begin{equation} \Phi_{k}(d_{k}) = {\text{min} \atop 0 \leq l_{k} \leq d_{k}}\left[\phi_{k}(l_{k}) + \Phi_{k-1}(d_{k}-l_{k})\right],\;\;k\;\geq\;2.\tag{11} \end{equation}

For the first stage, that is, for $k=1$:

\begin{equation} \Phi_{1}(d_{1}) = \phi_{1}(d_{1})\; \Longrightarrow\;\;l_{1}^{*}=d_{1},\tag{12} \end{equation}

where $l_{1}^{}=d_{1}$ is the optimum width of the first stratum. The relations (11) and (12) are solved recursively for each $k=1, 2, ..., L\,$ and $0 \leq d_{k} \leq d$, and $\Phi_{L}(d)$ is obtained. From $\Phi_{L}(d)$ the optimum width of $L^{th}$ stratum, $l_{L}^{}$, is obtained. From $\Phi_{L-1}(d - l_{L}^{})$ the optimum width of $(L - 1)^{th}$ stratum, $l_{L-1}^{}$, is obtained and so on until $l_{1}^{*}$ is obtained.

The above algorithm of the Dynamic Programming solution procedure to solve the MPP given in (9) is summarized with the following steps:

  1. Start at $k=1$. Set $\Phi_{0}(d_{0})=0$.
  2. Calculate $\Phi_{1}(d_{1})$, the minimum value of RHS of $(12)$ for $l_{1}=d_{1},\; 0\leq l_{1}\leq d_{1}$, and $0\leq d_{1}\leq d$.
  3. Record $\Phi_{1}(d_{1})$ and $l_{1}$.
  4. For $k=2$, express the state variable as $d_{k-1}=d_{k}-l_{k}$.
  5. Set $\Phi_{k}(d_{k})=0$ if $l_{k}>d_{k}$, where $0\leq d_{k}\leq d$.
  6. Calculate $\Phi_{k}(d_{k})$, the minimum value of RHS of $(11)$ for $l_{k};\;0\leq l_{k}\leq d_{k}$.
  7. Record $\Phi_{k}(d_{k})$ and $l_{k}$.
  8. For $k\geq3, ..., L$, go to Step 4.
  9. At $k=L, \; \Phi_{L}(d)$ is obtained and hence the optimum value $l_{L}^{*}$ of $l_{L}$ is obtained.
  10. At $k=L-1$, using the backward calculation for $d_{L-1}=d-l_{L}^{}$, read the value of $\Phi_{L-1}(d_{L-1})$ and hence the optimum value $l_{L-1}^{}$ of $l_{L-1}$.
  11. Repeat Step $(10)$ until the optimum value $l_{1}^{*}$ of $l_{1}$ is obtained from $\Phi_{1}(d_{1})$.

Optimum Sample Sizes Using Neyman Allocation

When OSB $(y_{h}, y_{h-1})$ have been determined, the Optimum Sample Sizes (OSS) $n_{h}; h=1,2,...,L\,$ that minimizes the variance of the estimate can easily be computed. The sample size $n_{h}$ are obtained for a fixed total sample of size $n$ under the Neyman allocation for $h=1,2,...,L\,$ and given as follows:

\begin{equation}\label{Ney_alloc} n_{h} = n\,\frac{W_{h}S_{h}}{\sum_{h=1}^{L}W_{h}S_{h}} \tag{13} \end{equation}

where $W_{h}$ and $S_{h}$ are derived in terms of the optimum boundary points $(y_{h}, y_{h-1})$.

In Neyman allocation, the total sample size is proportional to the stratum size multiplied by the standard deviation of the stratum. If the variances are specified correctly, Neyman allocation will give an estimator with smaller variance compared to proportional allocation (@lohr2009sampling).

In equation (13), it is also worth noting that the OSB $(y_{h}, y_{h-1})$ are so obtained that $n_{h}$ must satisfy the restrictions:

\begin{equation}\label{res} 1\leq n_{h}\leq N_{h}, \tag{14} \end{equation}

where $N_{h}=NW_{h}$. Thus, the restriction (14) indicates that the $h^ {th}$ stratum must form with at least one unit and also avoid the problem of over-sampling.

Overview of Package Functionalities

For the numerical illustrations and application of the package, some of the real datasets such as 'sugarcane' of @khan2015designing, 'anaemia' of @Reddy2014; 'hies' and 'math' data are provided in the stratifyR. The 'quakes' and 'Boston' data provided in the datasets package in R are also used for illustration purposes. The stratifyR package is also tested on some published and commonly-used datasets such as 'UScities' and 'UScolleges' data from @Cochran1961, 'Debtors' data of @Gunning2004, 'REV84' variable for 'Swedish municipalities' data from @sardnal1992model and 'MRTS' variable from 'Statistics Canada Monthly Retail Trade Survey' together with 'SHS' data collected in 'Statistics Canada Survey of Household Spending'. For those distributions where real data is not found in literature, data is simulated to demonstrate the application of the package in this documentaion.

For a user, there are two different routes available in the package and these are basically dependent on the type of stratification problem to be solved. It could either be a data-based (i.e., when the dataset of the stratification variable is available) or a distribution-based (i.e., when dataset is not available) stratification problem. Whether stratification is based on data or not, the idea is that the problem is formulated as an MPP using the estimated (with available data) or assumed (with unavailable data) distribution of the data set. There are numerous functions created in the package, however, there are only a few major functions that are used by the two different types of problems being studied in univariate stratification.

If it is a data-based problem, the function used is strata.data() and the user has to provide as input arguments: the data, the number of strata ($L$) and the fixed sample size ($n$). For the distribution-based problem, the function used is strata.distr() and the user has to provide the name of the assumed distribution, number of strata ($L$), possible range of data (distance), fixed sample size ($n$) and the population size ($N$). The following two subsections delve a little deeper into the workings surrounding the two functions: strata.data() and strata.distr().

The Function strata.data()

\noindent\rule{16.5cm}{0.5pt} \begin{center} \textsf{strata.data} \hskip 4cm \textit{Univariate stratification of Suvey Populations Based on Data} \end{center} \noindent\rule{16.5cm}{0.5pt}

\textbf{Description}

This function computes the OSB, OSS, and other important quantities from univariate survey populations by employing the methodology proposed by @Khan2002 @khan2003theory, @Khan2005, @khan2008determining, @Khan2009, @Nand2009, @khan2015determining, @khan2015designing, @reddy2018optimum and @reddy2019optimal. Their method uses the estimated distribution of the data and formulates the problem of determining OSB as a Mathematical Programming Problem which is an optimization problem that is solved by the DP technque. The OSB obtained are optimal solutions that are used to calculate the OSS under Neyman allocation. The usage and arguments are as follows:

\textbf{Usage}

    strata.data(data, h, n, cost=FALSE, ch=NULL)

\textbf{Arguments}

    data - A vector: data containing every unit of the survey population
    h - A numeric: number of strata to be sampled. The default is 2
    n - A numeric: fixed total sample size
    cost - A logical: stratum cost. Default cost=FALSE. 
    ch - A numeric: denotes a vector of stratum costs. Default ch=NULL.

\textbf{Algorithm}

\begin{enumerate} \item strata.data() is invoked, which is like the main function which belongs to class "strata" which provides a sequence of codes that lead to the final results. The arguments that need to be specified during the invocation are: data, $h$ (number of strata) and $n$ (fixed total sample size). One can also provide cost and stratum costs if it involves a stratification problem with sampling cost. The function call is normally stored in an object, which is of class "strata". \item It creates a new environment called $my_env$ and stores the data, $h$ and $n$. Data is also scaled here by dividing with the maximum value of the data. Important quantities like maximum value for real data, initial value, final value and distance (range) for scaled data are stored in $my_env$. \item The get.dist() function takes in data and quantities stored in $my_env$ as arguments. From a set of ten different distributions (unif',Triangular', right-Triangular',gamma', weibull',norm', lnorm',exp', pareto' andcauchy'), it chooses the best-fit distribution by looking at the lowest AIC. Parameter estimates for the best-fit distribution together with the smallest AIC are returned as a list. \item When create.mat() function is called, it creates a 2D matrices from a set of defined constants to store values of the objective function. \item The data.optim() function then computes 3dp and 6dp solutions respectively for different values of the objective function values at different incremental progressions of the $y$ value on the range of the dataset. data.optim() has data.root() in it to calculate the objective function values which are stored in the 2D-matrices. \item The data.alloc() function computes the sample sizes by using Neyman allocation. The OSB obtained in the previous steps are used to calculate the stratum weights and stratum population sizes from the data - these are used to obtain the stratum sample sizes. \item The summary.strata() function defines the method for the "strata" class that has been created in the constructor function strata.data(). The function extracts all individual objects from the "strata" class object and combines them into dataframes before writing the formatted outputs to the console. \end{enumerate}

\textbf{Application}

To show the application of the strata.data() function, an example of the command used and its output from the package is given below. The problem uses the 'mag' variable from the 'quakes' data (with a population of $N=1000$) available from the datasets package in R. To construct a 2-strata solution with a fixed sample size of $n=300$, we use the following codes:

data(quakes)
head(quakes)
mag <- quakes$mag
length(mag)
hist(mag) #to see the distribution
library(stratifyR)
res <- strata.data(mag, h = 2, n=300) # a 2-strata solution
summary(res)

All calculations have been rounded off to 2 decimal places, hence, the individual stratum solutions provided in the tables may not always add up to the totals.

The Function strata.distr()

\noindent\rule{16.5cm}{0.5pt} \begin{center} \textsf{strata.distr} \hskip 2cm \textit{Univariate Stratification of Suvey Populations Based on Distributional Assumptions} \end{center} \noindent\rule{16.5cm}{0.5pt}

\textbf{Description}

This function is also used to compute the OSB, OSS, and other important quantities from univariate survey populations by employing the methodology proposed by Khan et al given earlier. Its algorithm is quite similar to that of the strata.data(), however, its functionality is applied to the case where the dataset of the population is not available and the distributonal assumptions of the study variable are strictly required. Another caveat for such distribution-based stratification is that the distr.alloc() function uses the probability density functions of the assumed distributions and integration rules presented by equations (4)-(6) to calculate the stratum sample sizes. It must be noted that this function works on ideal distributions that assumes the parameters chosen by the user. The usage and arguments are as follows:

\textbf{Usage}

    strata.distr(h, initval = NULL, dist = NULL, 
                 distr = c("pareto", "triangle", "rtriangle", "weibull", "gamma", 
                 "exp", "unif","norm", "lnorm", "cauchy"), params = c(shape=0, 
                 scale=0, rate=0, gamma=0, location=0, mean=0, sd=0, meanlog=0, 
                 sdlog=0, min=0, max=0, mode=0), n, N, cost=FALSE, ch=NULL)

\textbf{Arguments}

    h - A numeric: number of strata to be sampled
    initval - A numeric: initial value of the assumed distribution
    dist - A numeric: distance or range of the assumed distribution
    distr - A character: the assumed distribution of the hypothetical population 
    params - A list: parameters of the assumed distribution 
    n - A numeric: fixed total sample size  
    N - A numeric: fixed population size
    cost - A logical: stratum cost. Default cost=FALSE. 
    ch - A numeric: denotes a vector of stratum costs. Default ch=NULL.

\textbf{Algorithm}

The algorithm for strata.distr() is quite similar to the strata.data() for the contruction of OSB. It is only at the sample allocation (OSS) stage that the two are different. strata.distr() is the main function and once the OSB have been computed, this function uses the distr.alloc() function which uses the numerical integration rules for different distibutions (i.e., the equations (4)-(6)) to compute the OSS.

\textbf{Application}

To show the application of the strata.distr() function, let us construct a 2-strata solution assuming that the dataset of the stratification variable is not available but its distribution and estimated parameters are. Let us consider the 'depth' variable from the 'quakes' dataset from the datasets package, which has a Triangular distribution with parameters $min=39.99998, max=680, mode=39.99999$. It starts at an initial value of $40$ and has a distance (range) of $640$ with a fixed sample size of $n=300$ from a population of $N=1000$ seismic events. Thus, we use the following commands:

data(quakes) 
depth <- quakes$depth
hist(depth) #see distribution
min(depth); max(depth); d=max(depth)-min(depth);d #min, max and range of data 
# the 2-strata solution is
res <- strata.distr(h=2, initval=40, dist=640, distr = "triangle",
             params = c(min=39.99998, max=680, mode=39.99999), n=300, N=1000)
summary(res)

Application to Numerous Survey Populations

As discussed earlier, the stratifyR package is able to handle ten continuous distributions that are quite commonly-used in real-life situations. This section presents a brief overview of these distributions and the application of the proposed method of stratification using real or simulated data which follows a particular distribution. Examples for hypothetical distributions are also presented. For the sake of brevity, the mathematical formulations of the problem of determining the OSB and the DP solution procedure are presented only for the Pareto Type II variable. For all other distributions, only the examples are presented to illustrate the application of the package.

Stratification for a Survey Variable with Pareto Type II Distribution

Let the study variable \textit{y} follow the Pareto Type II distribution on the domain of [$0, +\infty$], its two-parameter probability density function with a state space $y\geq 0$ is given by: \begin{equation} f(y; s,a)=\dfrac{a\,s^{a}}{(y+s)^{a+1}}, \;\;\;\;\;a,s > 0 \tag{15} \end{equation}

where $\alpha > 0$ is the shape parameter and $s>0$ is the scale parameter of the distribution.

MPP Formulation for Pareto Type II Distribution

If the study variable $y$ follows Pareto II (or Lomax) distribution (i.e., $y\sim P(a, s)$) with density function given by (15). By using equations (4)-(6), the formulated MPP given in (10) could be expressed as: \begin{eqnarray} \textrm{Minimize} \;\;\; \sum_{h=1}^{L} \textit{SQRT} && \Biggl{as^{2a} \left[\dfrac{(y_{h-1}+l_{h}+s)^{a}-(y_{h-1}+s)^{a}}{(y_{h-1}+s)^{a}(y_{h-1}+l_{h}+s)^{a}}\right] \nonumber \[10pt] && \times\left[\dfrac{(y_{h-1}+l_{h}+s)^{2-a}}{2-a} - \dfrac{2s(y_{h-1}+l_{h}+s)^{1-a}}{1-a} \right. \nonumber\[10pt] && - \dfrac{s^2(y_{h-1}+l_{h}+s)^{-a}}{a} -\dfrac{(y_{h-1}+s)^{2-a}}{2-a}\nonumber\[10pt] && + \left. \dfrac{2s(y_{h-1}+s)^{1-a}}{1-a} + \dfrac{s^2(y_{h-1}+s)^{-a}}{a}\right] \nonumber\[10pt] && \times \dfrac{s^{2a}}{(1-a)^{2}}\left[\dfrac{a(y_{h-1}+l_{h})+s}{(y_{h-1}+l_{h}+s)^{a}} - \dfrac{ay_{h-1}+s}{(y_{h-1}+s)^{a}} \right]^{2}\Biggr}\nonumber\[15pt] \textrm{subject to} && \;\;\;\; \sum_{h=1}^{L}l_{h} = d,\nonumber \ \textrm{and} && \;\;\;\; l_{h} \geq 0; \;\; h = 1,2,...,L. \tag{16} \end{eqnarray}

where $d=y_{L}-y_{0}$, $a$ and $s$ are parameters of the Pareto Type II distribution.

DP Solution for Pareto Type II Distribution

To solve the MPP (16) using the DP technique as a solution procedure, we apply the algorithm, that is, the solution proceure using Dynamic Progrmming technique discussed earlier in Section 4. After substituting the quatity $y_{h-1}=y_{0} + d_{h} - l_{h}$, the recurrence relations (11) and (12) are reduced to:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ as^{2a}\left[\dfrac{(d_{1}+y_{0}+s)^{a}-(y_{0}+s)^{a}}{(y_{0}+s)^{a}(d_{1}+y_{0}+s)^{a}}\right] \nonumber\[10pt] &&\times\left[\dfrac{(d_{1}+y_{0}+s)^{2-a}}{2-a} - \dfrac{2s(d_{1}+y_{0}+s)^{1-a}}{1-a} \right. \nonumber\[10pt] &&- \dfrac{s^2(d_{1}+y_{0}+s)^{-a}}{a} -\dfrac{(y_{0}+s)^{2-a}}{2-a}\nonumber\[10pt] &&+\left. \dfrac{2s(y_{0}+s)^{1-a}}{1-a} + \dfrac{s^2(y_{0}+s)^{-a}}{a}\right] \nonumber\[10pt] &&\times \dfrac{s^{2a}}{(1-a)^{2}}\left[\dfrac{a(d_{1}+y_{0})+s}{(d_{1}+y_{0}+s)^{a}} - \dfrac{ay_{0}+s}{(y_{0}+s)^{a}} \right]^{2}\Biggr} \tag{17} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ as^{2a}\left[\dfrac{(d_{k}+y_{0}+s)^{a}-(d_{k}+l_{k}+y_{0}+s)^{a}}{(d_{k}+l_{k}+y_{0}+s)^{a}(d_{k}+y_{0}+s)^{a}}\right] \nonumber\[10pt] &&\times\left[\dfrac{(d_{k}+y_{0}+s)^{2-a}}{2-a} - \dfrac{2s(d_{k}+y_{0}+s)^{1-a}}{1-a} \right. \nonumber\[10pt] &&- \dfrac{s^2(d_{k}+y_{0}+s)^{-a}}{a} -\dfrac{(d_{k}+l_{k}+y_{0}+s)^{2-a}}{2-a}\nonumber\[10pt] &&+\left. \dfrac{2s(d_{k}+l_{k}+y_{0}+s)^{1-a}}{1-a} + \dfrac{s^2(d_{k}+l_{k}+y_{0}+s)^{-a}}{a}\right] \nonumber\[10pt] &&\times \dfrac{s^{2a}}{(1-a)^{2}}\left[\dfrac{a(d_{k}+y_{0})+s}{(d_{k}+y_{0}+s)^{a}}- \dfrac{a(d_{k}+l_{k}+y_{0})+s}{(d_{k}+l_{k}+y_{0}+s)^{a}} \right]^{2} \Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr}. \tag{18} \end{eqnarray}

The recurrence relations (17) and (18) are solved using the DP technique to determine the OSB.

A Numerical Example for Pareto Type II Distribution

A dataset for a univariate population of size $N=5000$ with the study variable that follows Pareto Type II distribution ($pareto_data$) was simulated using parameters $shape=5$ and $scale=8$ to demonstrate the application of the strata.data() function to determine the OSB and other quantites. The data exhibits a 2-parameter Pareto Type II distribution with the MLE estimates of the parameters as $shape=5.026907$ and $scale=8.191676$. The minimum and maximum values in the simulated data are $[y_{0}, y_{L}] = [0.0002193, 38.56871]$, which implies that $d=38.56849$.

To construct the OSB (a 2-strata solution, i.e., $h = 2$) for the $pareto_data$ with a fixed total sample size of $500$, we use the following codes:

set.seed(8235411)
pareto_data <- rpareto(5000, shape=5, scale=8)
head(pareto_data)
hist(pareto_data, breaks=100)
min(pareto_data); max(pareto_data); d=max(pareto_data)-min(pareto_data);d
fit <- fitdist(pareto_data, "pareto", start = list(shape = 1, scale = 500))
fit
res <- strata.data(pareto_data, h = 2, n=500) # a 2-strata solution
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Pareto Type II population. Let us assume from past knowledge that the study variable in the population follows Pareto Type II distribution with the given attributes such as the $shape=5.05, scale=8.20, initial\,value=0.15$ and $distance=38.55$. Then, if a sample of size $n=500$ is drawn from the population of size $N=5000$, we can execute the following command to obtain the results:

res <- strata.distr(h=2, initval=0.15, dist=38.55, distr = "pareto",
             params = c(shape=5.05, scale=8.20), n=500, N=5000)
summary(res)

Stratification for a Survey Variable with Triangular Distribution

Let the study variable \textit{y} follow Triangular distribution on the domain of [$a, b$], its three-parameter probability density function with a state space $y\geq 0$ is given by: \begin{equation} f(y) = \begin{cases} \dfrac{2(y-a)}{(b-a)(c-a)}; & y \in [a, c]\[10pt] \dfrac{2(b-y)}{(b-a)(b-c)}; & y \in (c,b] \ \end{cases} \tag{19} \end{equation}

where $a$ is the location parameter, $b$ is the scale parameter and $c$ is the shape parameter of the distribution.

Then, formulating the problem as an MPP and solving the recurrence relations as discussed above for Pareto II variate, the OSB are obtained.

DP Solution for Triangular Distribution

To solve the MPP formulated for Triangular distribution (19), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \sum_{h=1}^{\lambda_{1}} \dfrac{d_{1}^{2}\sqrt{d_{1}^{2}+6(y_{0}-a)d_{1}+6(y_{0}-a)^{2}}}{3\sqrt{2}(b-a)(c-a)} + \sum_{h=\lambda_{2}}^{L}\dfrac{d_{1}^{2}\sqrt{6(b-y_{0})^{2}-6(b-y_{0})d_{1}+d_{1}^{2}}}{3\sqrt{2}(b-a)(b-c)}\Biggr} \tag{20} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{\sum_{h=1}^{\lambda_{1}} \dfrac{l_{k}^{2}\sqrt{l_{k}^{2}+6a_{k}l_{k}+6a_{k}^{2}}}{3\sqrt{2}(b-a)(c-a)} +\sum_{h=\lambda_{2}}^{L}\dfrac{l_{k}^{2}\sqrt{6b_{k}^{2}-6b_{k}l_{k}+l_{k}^{2}}}{3\sqrt{2}(b-a)(b-c)}\Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr} \tag{21} \end{eqnarray}

where $a_{k}=d_{k}-l_{k}+y_{0}-a$ and $b_{k}=b-d_{k}+l_{k}-y_{0}+a$.

Substituting the values of $a$, $b$, $c$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Triangular Distribution

Data on Mathematics marks of first year students in a University in Fiji, with a size of $N=354$ called 'math' data is used to demonstrate the application of the stratifyR package on a Triangular population. In this example, the variable `final_marks' is used - it exhibits 3-parameter Triangular distribution with the parameters: $min = 6.205204$, $max=98.47165$ and $mode=53.999996$. The minimum and maximum values in the 'math' data are $[y_{0}, y_{L}] = [7, 97]$, which implies that $d=90$.

To construct the OSB ($h = 2$) for the `final_marks' data with a fixed total sample size of $150$, we use the following code:

data(math)
final_marks <- math$final_marks
hist(final_marks)
res <- strata.data(final_marks, h = 2, n=150) # a 2-strata solution
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Triangular population. Based on the assumption from past knowledge that the population follows Triangular distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

data(math)
final_marks <- math$final_marks
eps = 1e-8; a <- min(final_marks); b <- max(final_marks); c=b-a# and with mode=54
a;b;c
#find out the estimated parameters
fit <- fitdist(final_marks, distr = "triang", method="mle", lower=c(0,0),
               start = list(min = a-eps, max = b+eps, mode = 54))
fit
# 2-strata solution
res <- strata.distr(h=2, initval=7, dist=90, distr = "triangle",
      params = c(min=6.205364, max=98.469797, mode=53.999994), n=150, N=352)
summary(res)

Stratification for a Survey Variable with Right-Triangular Distribution

If a study variable \textit{y} follows the Right-Triangular distribution on the domain of [$a, b$], its two-parameter probability density function is given by:

\begin{equation} f(y) = \begin{cases} \dfrac{2(b-y)}{(b-a)^2}; & y \in [a, b]\ 0; & otherwise \ \end{cases} \tag{22} \end{equation}

where $a$ is the location parameter and $b$ is the scale parameter of the distribution.

Note that the Right-Triangular distribution is a special case of the Triangular distribution discussed in Section 7.2 where the parameters $a=c$, i.e., minimium value is equal to the mode. Thus, the density function (22) is a two-parameter distribution.

DP Solution for Right-Triangular Distribution

To solve the MPP formulated for Right-Triangular distribution (22), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \left[\dfrac{d_{1}(2(y_{0}-a)-d_{1})}{(b-a)^{2}}\right]^{2}\times\left[\dfrac{d_{1}^{2}(d_{1}^{2}-6(y_{0}-a)d_{1}+6a_{h}^{2})}{18(2(y_{0}-a)-d_{1})^{2}}\right]\Biggr} \tag{23} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{\left[\dfrac{l_{k}(2a_{k}-l_k)}{(b-a)^{2}}\right]^{2} \times\left[\dfrac{l_{k}^{2}(l_{k}^{2}-6a_{k}l_{k}+6a_{k}^{2})}{18(2a_{k}-l_{k})^{2}}\right]\Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr} \tag{24} \end{eqnarray}

where $a_{k}=b-d_{k}+l_{k}-y_{0}$.

Substituting the values of $a$, $b$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Right-Triangular Distribution

A data following Right-Triangular distribution, of size $N=5000$, was simulated to demonstrate the application of the stratifyR package. The simulated data takes in three parameters where $a=c$ to indicate that it's a Right-Triangular distribution. Upon fitting the data, its best-fit distribution exhibits a three-parameter Triangular distribution with the parameters $min = 1.987776$, $max=7.935599$ and $mode=2.026685$. Note that because the $min$ and $mode$ parameters are very close to each and not exactly the same, it is treated as a Triangular distribution even thought the data simulated was for a Right-Triangular distribution. The minimum and maximum values in the simulated data are $[y_{0}, y_{L}] = [2.000052, 7.83871]$ with $d=5.838658$.

To construct the OSB ($h = 2$) for the Right-Triangular data with a fixed total sample size of $500$, we use the following code:

#Generate RT data
set.seed(12546)
data <- rtriangle(n=1000, a=2, b=8, c=2) #right-triangular since a=c
hist(data)
res <- strata.data(data, h = 2, n=500) # a 2-strata solution
summary(res)

We see in the above example that the simulated data is a Right-Triangular distribution, however, when data is fitted using MLE method in the package, it turns out that it best-fits Triangular distribution because the $min$ is not exactly equal to $mode$.

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Right-Triangular population. Based on the assumption from past knowledge that the population follows Right-Triangular distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

res <- strata.distr(h=2, initval=1.007202, dist=0.992781, distr = "rtriangle",
         params = c(min=2, max=10, mode=2), n=500, N=1000)
summary(res)

The results show that this fits a two-paramter Right-Triangular distribution. Do note that in the above command, one has to specify all three parameters $min=2, max=10$ and $mode=2$, even for a Right-Triangular distribution, where $min=mode$.

Stratification for a Survey Variable with Weibull Distribution

If the study variable \textit{y} follows the Weibull distribution on the interval $[y_{0}, y_{L}]$, its two-parameter probability density function with a state space $y\geq 0$ is given by: \begin{equation} f(y; \theta, r)=\dfrac{r}{\theta}\left(\dfrac{y}{\theta}\right)^{r-1}e^{-\left(y/\theta\right)^{r}}, \;\;\;\;\;y\geq 0 \tag{25} \end{equation}

where $r > 0$ is the shape parameter and $\theta > 0$ is the scale parameter of the distribution.

DP Solution for Weibull Distribution

To solve the MPP formulated for Weibull distribution (25), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \theta^{2}\,\Gamma\left(1+\dfrac{2}{r} \right)\left[e^{-\left(\frac{y_{0}}{\theta}\right)^{r}}-e^{-\left(\frac{d_{1}+y_{0}}{\theta}\right)^{r}}\right] \nonumber \[10pt] &&\times \left[Q\left(1+\dfrac{2}{r},\left(\dfrac{y_{0}}{\theta}\right)^{r}\right)- Q\left(1+\dfrac{2}{r},\left(\dfrac{d_{1}+y_{0}}{\theta}\right)^{r}\right)\right]\nonumber\[10pt] &&-\theta^{2}\left[\Gamma\left(1+\dfrac{1}{r}\right)\left [Q\left(1+\dfrac{1}{r},\left(\dfrac{y_{0}}{\theta}\right)^{r}\right)\right.\right.\nonumber\[10pt] &&-Q\left.\left.\left(1+\dfrac{1}{r},\left(\dfrac{d_{1}+y_{0}}{\theta}\right)^{r}\right)\right]\right]^{2}\Biggr} \tag{26} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \theta^{2}\,\Gamma\left(1+\dfrac{2}{r} \right)\left[e^{-\left(\frac{d_{k}-l_{k}+y_{0}}{\theta}\right)^{r}}-e^{-\left(\frac{d_{k}+y_{0}}{\theta}\right)^{r}}\right]\nonumber\[10pt] &&\times \left[Q\left(1+\dfrac{2}{r},\left(\dfrac{d_{k}-l_{k}+y_{0}}{\theta}\right)^{r}\right)- Q\left(1+\dfrac{2}{r},\left(\dfrac{d_{k}+y_{0}}{\theta}\right)^{r}\right)\right]\nonumber\[15pt] &&-\theta^{2}\left[\Gamma\left(1+\dfrac{1}{r}\right)\left [Q\left(1+\dfrac{1}{r},\left(\dfrac{d_{k}-l_{k}+y_{0}}{\theta}\right)^{r}\right)\right.\right.\nonumber\[15pt] &&-Q\left.\left.\left(1+\dfrac{1}{r},\left(\dfrac{d_{k}+y_{0}}{\theta}\right)^{r}\right)\right]\right]^{2}\Biggr}+\Phi_{k-1}(d_{k}-l_{k})\Biggr}. \tag{27} \end{eqnarray}

The recurrence relations (26) and (27) are solved using the DP technique to determine the OSB.

A Numerical Example for Weibull Distribution

A health data of size $N=724$, called 'anaemia' data (@Reddy2014), is used to demonstrate the application of the stratifyR package on Weibull population. The 'anaemia' data comes from the National Nutritional Survey on the ``Micronutrient Status of Women in Fiji" and has many variables such as level of Iron, Folate, Zinc, etc. In this example, the variable Iron is used since it exhibits a 2-parameter Weibull distribution with the shape and scale parameters as $r = 2.144586$ and $\theta = 13.790744$ respectively. The minimum and maximum values are $[y_{0}, y_{L}] = [1.5, 34.7]$, which implies that $d=33.2$.

To construct the OSB ($h = 2$) for the Iron data with a fixed total sample size of $500$, we use the following codes:

data(anaemia) #using the anaemia data
Iron <- anaemia$Iron
hist(Iron)
res <- strata.data(Iron, h = 2, n=500) # a 2-strata solution
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Weibull population. Based on the assumption from past knowledge that the population follows Weibull distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

res <- strata.distr(h=2, initval=2.9, dist=55.9, distr = "weibull",
       params = c(shape=2.144586, scale=13.790744), n=500, N=5000)
summary(res)

Stratification for a Survey Variable with Gamma Distribution

If the study variable \textit{y} follows the Gamma distribution (i.e., $y\sim \Gamma(r, \theta)$) on the interval $[y_{0}, y_{L}]$, it has the following two-parameter probability density function:

\begin{equation} f(y;r,\theta)=\dfrac{1}{\theta^{r}\Gamma(r)}\,y^{r-1}e^{-\dfrac{y}{\theta}},\;\;\;\;\;\;\;y>0;\;\;r,\theta>0, \tag{28} \end{equation}

where \textit{r} is a shape parameter and $\theta$ is the scale parameter and $\Gamma(r)$ is a Gamma function defined by

\begin{equation}\label{222} \Gamma(r)=\int_{0}^{\infty}t^{r-1}e^{-t}\,dt,\;\;\;\;\;\; r > 0. \tag{29} \end{equation}

The function in equation $(\ref{222})$ is also defined by an upper incomplete gamma function $\Gamma(r,x)$ and a lower incomplete gamma function $\gamma(r,x)$, respectively, as follows:

\begin{eqnarray} \Gamma(r,y)=\int_{y}^{\infty}t^{r-1}e^{-t}\,dt;\label{223} \tag{30}\ \gamma(r,y)=\int_{0}^{y}t^{r-1}e^{-t}\,dt.\label{224} \tag{31} \end{eqnarray}

There also exist regularized/normalised incomplete Gamma functions which give a value restricted between $0$ and $1$ and can be stated as:

\begin{eqnarray} Q(r,y)=\dfrac{1}{\Gamma(r)}\int_{y}^{\infty}t^{r-1}e^{-t}\,dt,\;\;\;\;\;\;r,y>0;\;\;\Gamma(r)\neq0;\label{225} \tag{32}\ P(r,y)=\dfrac{1}{\Gamma(r)}\int_{0}^{y}t^{r-1}e^{-t}\,dt,\;\;\;\;\;\;r,y>0;\;\;\Gamma(r)\neq0,\label{226} \tag{33} \end{eqnarray}

where $Q(r,y)$ denotes the Upper Regularized Incomplete Gamma function while $P(r,y)$ denotes regularized Lower Incomplete Gamma function (@abramowitz1972handbook, Chapter 6). Note that $Q(r,y)=1-P(r,y)$.

DP Solution for Gamma Distribution

To solve the MPP formulated for Gamma distribution (28), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \theta^{2}\,r(r+1)\left[ Q\left(r,\dfrac{y_{0}}{\theta}\right)-Q\left(r,\dfrac{d_{1}+y_{0}}{\theta}\right)\right]\nonumber\[10pt] &&\times\left[Q\left(r+2,\dfrac{y_{0}}{\theta}\right)-Q\left(r+2,\dfrac{d_{1}+y_{0}}{\theta}\right) \right]\nonumber\[10pt] &&-\theta^{2}r^{2}\left[ Q\left(r+1,\dfrac{y_{0}}{\theta}\right)-Q\left(r+1,\dfrac{d_{1}+y_{0}}{\theta}\right)\right]^{2}\Biggr} \tag{34} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \theta^{2}\,r(r+1)\left[ Q\left(r,\dfrac{d_{k}-l_{k}+y_{0}}{\theta}\right.\right)\nonumber\[10pt] &&-\left.Q\left(r,\dfrac{d_{k}+y_{0}}{\theta}\right)\right] \times\left[ Q\left(r+2,\dfrac{d_{k}-l_{k}+y_{0}}{\theta}\right.\right)\nonumber\[10pt] &&-\left.Q\left(r+2,\dfrac{d_{k}+y_{0}}{\theta}\right) \right]-\theta^{2}r^{2}\times\left[ Q\left(r+1,\dfrac{d_{k}-l_{k}+y_{0}}{\theta}\right)\right.\nonumber\[10pt] &&-\left.Q\left(r+1,\dfrac{d_{k}+y_{0}}{\theta}\right)\right]^{2}\Biggr}+\Phi_{k-1}(d_{k}-l_{k})\Biggr}. \tag{35} \end{eqnarray}

The recurrence relations (34) and (35) are solved using the DP technique to determine the OSB.

A Numerical Example for Gamma Distribution

Again, the health data of size $N=724$, derived from the 'National Nutritional Survey' on the 'Micronutrient Status of Women in Fiji' is used to demonstrate the application of the stratifyR package on Gamma population. In this example, the variable Folate is used since it exhibits a 2-parameter Gamma distribution with the shape and scale parameters as $r = 6.9922$ and $\theta = 2.5785$ respectively. The minimum and maximum values are $[y_{0}, y_{L}] = [4.9, 45.4]$, which implies that $d=40.5$.

To construct the OSB ($h = 2$) for the Folate data with a fixed total sample size of $500$, we use the following codes:

data(anaemia)
Folate <- anaemia$Folate
hist(Folate)
res <- strata.data(Folate, h = 2, n=500) # a 2-strata solution
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Gamma population. Based on the assumption from past knowledge that the population follows Gamma distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

res <- strata.distr(h=2, initval=0.5, dist=50, distr = "gamma",
       params = c(shape=3.835768, rate=0.340328), n=500, N=12000)
summary(res)

Stratification for a Survey Variable with Exponential Distribution

If the study variable \textit{y} follows the Exponential distribution on the interval $[y_{0}, y_{L}]$, its one-parameter probability density function is given by:

\begin{equation} f(y; \lambda)= \begin{cases} \lambda\mathrm{e}^{-\lambda y}; & y > 0\ \;\;\;\;0; & elsewhere \ \end{cases} \tag{36} \end{equation}

where $\lambda$ is the continuous rate parameter (or the inverse scale parameter).

DP Solution for Exponential Distribution

To solve the MPP formulated for Exponential distribution (36), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \left(e^{-\lambda y_{0}}\right)^{2}\left[\dfrac{1}{\lambda^2}\left(1-e^{-\lambda l_{k}}\right)^{2}-l_{k}^{2}e^{-\lambda l_{k}}\right]\Biggr} \tag{37} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \left(e^{-\lambda (d_{k}-l_{k}+y_{0})}\right)^{2}\left[\dfrac{1}{\lambda^2}\left(1-e^{-\lambda l_{k}}\right)^{2}-l_{k}^{2}e^{-\lambda l_{k}}\right]\Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr} \tag{38} \end{eqnarray}

Substituting the values of $\lambda$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Exponential Distribution

A data following Exponential distribution (herein called $Exp$ data), of size $N=10000$ was simulated to demonstrate the application of the stratifyR package on an Exponential population. The data exhibits an Exponential distribution with the parameter $rate = 1.359205$. The minimum and maximum values in the simulated data are $[y_{0}, y_{L}] = [5.747904e-05, 8.016871]$, which implies that $d=8.016814$.

To construct the OSB ($h = 2$) for the data that follows Exponential distribution with a fixed total sample size of $500$, we use the following codes:

set.seed(28951)
data <- rexp(5000, rate = 1.36)
hist(data)
res <- strata.data(data, h = 2, n=500) # a 2-strata solution
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Exponential population. Based on the assumption from past knowledge that the population follows Exponential distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

set.seed(28951)
data <- rexp(5000, rate = 1.36)
min(data); max(data); d=max(data)-min(data);d
fit <- fitdist(data, distr="exp", method="mle")
fit
res <- strata.distr(h=2, initval=5.748e-05, dist=8.017, distr = "exp", 
             params = c(rate=1.36), n=500, N=5000) #a 2-strata solution
summary(res)

Stratification for a Survey Variable with Uniform Distribution

If the study variable \textit{y} follows the Uniform distribution on the interval $[y_{0}, y_{L}]$, its two-parameter probability density function is given by:

\begin{equation} f(y; a,b)= \begin{cases} \dfrac{1}{b-a}; & y > 0\ \;\;\;\;0; & otherwise \ \end{cases} \tag{39} \end{equation}

where $a$ and $b$ are the continuous boundary parameters, i.e., $a$ and $b$ are the minimum and maximum values respectively.

DP Solution for Uniform Distribution

To solve the MPP formulated for Uniform distribution (39), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \sum\limits_{h=1}^{L} \dfrac{d_{1}^{2}}{2\sqrt{3}(b-a)}\Biggr} \tag{40} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \sum\limits_{h=1}^{L} \dfrac{l_{k}^{2}}{2\sqrt{3}(b-a)}\Biggr}+ \Phi_{k-1}(d_{k}-l_{k})\Biggr} \tag{41} \end{eqnarray}

Substituting the values of $a$, $b$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Uniform Distribution

A data following Uniform distribution of size $N=5000$ was simulated to demonstrate the application of the stratifyR package on a Uniform population. The data for Uniform distribution is simulated with the parameters $min = 2$ and $max=15$. When fitted, the minimum and maximum values in the simulated data are $[y_{0}, y_{L}] = [2.006522, 14.99764]$, which implies that $d=12.99112$.

To construct the OSB ($h = 2$) for the Uniformly-distributed data with a fixed total sample size of $450$, we use the following codes:

set.seed(15669)
data <- runif(5000, min = 2, max = 15)
hist(data)
res <- strata.data(data, h = 2, n=450) # a 2-strata solution
summary(res)

Note that the above results indicate that the best-fit distribution is Triangular and not Uniform. This is because when stratifyR package assesses the data to ascertain the best-fit distribution, it is not able to calculate its AIC value for Uniform distribution. Hence, the distribution that gives the lowest AIC is taken to be the best-fit distribution. Since AIC for Uniform distribution is not available, the next best-fit distribution (Triangular) is be chosen. The results, you would find, are still quite accurate!

Similarly, we can apply the strata.distr() function for the Uniform distribution where the arguments can be assumed from past knowledge. To construct the OSB for $h = 2$ for a hypothetical variable that follows Uniform distribution (with parameters $min=3$ and $max=15$) with a fixed total sample size of $450$ from a population of $5000$, the following command is used:

# For a hypothetical uniform distribution, it does give a result
res <- strata.distr(h=2, initval=3, dist=12, distr = "unif",
                 params = c(min=3, max=15), n=450, N=5000)
summary(res)

Stratification for a Survey Variable with Normal Distribution

If the study variable \textit{y} follows the Normal distribution on the interval $[y_{0}, y_{L}]$, it has the following two-parameter probability density function: \begin{equation} f(y;\mu, \sigma)=\dfrac{1}{\sigma\sqrt{2\pi}}\,exp\left{-\dfrac{1}{2}\left(\dfrac{y-\mu}{\sigma}\right)^{2}\right},\;\;\;\;\;\ -\infty < y < \infty \tag{42} \end{equation}

where $\sigma > 0$ is a scale parameter and $\mu$ is the location parameter.

The following definitions of error function are worth noting since they are needed to simplify the integrations used to derive the stratum weight, mean and variance due to normal distribution. \begin{eqnarray} erf(z) = \dfrac{2}{\sqrt{\pi}}\int_{0}^{z}exp\left{-y^{2}\right}\,dy \tag{43} \end{eqnarray} It can also be written as \begin{eqnarray}\label{erf} \dfrac{1}{\sqrt{2\pi}}\int_{0}^{z}exp\left{-\dfrac{1}{2}y^{2}\right}\,dy = \dfrac{1}{2}\,erf\left(\dfrac{z}{\sqrt{2}}\right) \tag{44} \end{eqnarray}

DP Solution for Normal Distribution

To solve the MPP formulated for Normal distribution (42), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{ \dfrac{\sigma^{2}}{2\sqrt{2\pi}} \left[erf\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma\sqrt{2}}\right) - erf\left(\dfrac{y_{0}-\mu}{\sigma\sqrt{2}}\right)\right]\nonumber \[10pt] && \times\left[\left(\dfrac{y_{0}-\mu}{\sigma}\right)exp\left(-\left(\dfrac{y_{0}-\mu}{\sigma\sqrt{2}}\right)^{2}\right)- \left(\dfrac{d_{1}+y_{0}-\mu}{\sigma}\right)exp\left(-\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma\sqrt{2}}\right)^{2}\right)\right]\nonumber\[10pt] && + \dfrac{\sigma^{2}}{4}\left[erf\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma\sqrt{2}}\right) - erf\left(\dfrac{y_{0}-\mu}{\sigma\sqrt{2}}\right)\right]^{2}\nonumber\[10pt] && - \dfrac{\sigma^{2}}{2\pi}\left[exp\left(-\left(\dfrac{y_{0}-\mu}{\sigma\sqrt{2}}\right)^{2}\right)- exp\left(-\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma\sqrt{2}}\right)^{2}\right)\right]^{2}\Biggr} \tag{45} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \dfrac{\sigma^{2}}{2\sqrt{2\pi}} \left[ erf\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma\sqrt{2}}\right) - erf\left(\dfrac{(d_{k}-l_{k}+y_{0})-\mu}{\sigma\sqrt{2}}\right) \right]\nonumber \[10pt] && \times\left[\left(\dfrac{(d_{k}-l_{k}+y_{0})-\mu}{\sigma}\right)exp\left(-\left(\dfrac{(d_{k}-l_{k}+y_{0})-\mu}{\sigma\sqrt{2}}\right)^{2}\right)\right. \nonumber\[10pt] && - \left. \left(\dfrac{d_{k}+y_{0}-\mu}{\sigma}\right)exp\left(-\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma\sqrt{2}}\right)^{2}\right)\right]\nonumber\[10pt] && + \dfrac{\sigma^{2}}{4}\left[erf\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma\sqrt{2}}\right) - erf\left(\dfrac{(d_{k}-l_{k}+y_{0})-\mu}{\sigma\sqrt{2}}\right)\right]^{2}\nonumber\[10pt] && - \dfrac{\sigma^{2}}{2\pi}\left[exp\left(-\left(\dfrac{(d_{k}-l_{k}+y_{0})-\mu}{\sigma\sqrt{2}}\right)^{2}\right)-exp\left(-\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma\sqrt{2}}\right)^{2}\right)\right]^{2}\Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr} \tag{46} \end{eqnarray}

Substituting the values of $\mu$, $\sigma$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Normal Distribution

A data following Normal distribution (herein called $Norm$ data), of size $N=5000$ was simulated to demonstrate the application of the stratifyR package on a Normal population. The data exhibits an Normal distribution with the parameters $mean = 16.010776$ and $sd=1.662357$. The minimum and maximum values in the simulated data are $[y_{0}, y_{L}] = [9.923816, 22.51267]$, which implies that $d=10.62118$.

To construct the OSB for $h = 2$ using the $Norm$ data with a fixed total sample size of $580$, the command below can be used:

set.seed(89821)
data <- rnorm(5000, mean = 16, sd = 1.65)
hist(data)
res <- strata.data(data, h = 2, n=500) #construct a 2-strata solution
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Normal population. Based on the assumption from past knowledge that the population follows Normal distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

set.seed(89821)
data <- rnorm(5000, mean = 16, sd = 1.65)
min(data); max(data); d=max(data)-min(data);d
fit <- fitdist(data, distr="norm", method="mle")
fit
res <- strata.distr(h=2, initval=9.923816, dist=12.58885, distr = "norm",
             params = c(mean=16.010776, sd=1.662357), n=500, N=5000)
summary(res)

Stratification for a Survey Variable with Log-Normal Distribution

If the study variable \textit{y} follows the Log-normal distribution on the interval $[y_{0}, y_{L}]$, it has the following two-parameter probability density function:

\begin{equation} f(y;\mu,\sigma)=\dfrac{1}{y\sigma\sqrt{2\pi}}\,exp\left{-\dfrac{1}{2}\left(\dfrac{ln(y)-\mu}{\sigma}\right)^{2}\right},\;\;\;\;\;\ y>0 \tag{47} \end{equation}

where $\sigma > 0$ is a scale parameter and $\mu$ is the location parameter.

DP Solution for Log-Normal Distribution

To solve the MPP formulated for Log-Normal distribution (47), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{\frac{1}{4}exp\left(2\mu+2\sigma^{2}\right)\left[ erf\left(\frac{ln(d_{1}+y_{0})-\mu-2\sigma^{2}}{\sigma\sqrt{2}}\right)\right.\nonumber \[10pt] && \left. - erf\left(\frac{ln(y_{0})-\mu-2\sigma^{2}}{\sigma\sqrt{2}}\right) \right] \left[ erf\left(\frac{ln(d_{1}+y_{0})-\mu}{\sigma\sqrt{2}}\right) - erf\left(\frac{ln(y_{0})-\mu}{\sigma\sqrt{2}}\right)\right]\nonumber \[10pt] && -\frac{1}{4}exp\left(2\mu+\sigma^{2}\right)\left[erf\left(\frac{ln(d_{1}+y_{0})-\mu-\sigma^{2}}{\sigma\sqrt{2}}\right) - erf\left(\frac{ln(y_{0})-\mu-\sigma^{2}}{\sigma\sqrt{2}}\right) \right]^{2}\Biggr} \tag{48} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \frac{1}{4}exp\left(2\mu+2\sigma^{2}\right)\left[erf\left(\frac{ln(d_{k}+y_{0})-\mu-2\sigma^{2}}{\sigma\sqrt{2}}\right)\right.\nonumber \[10pt] && - \left. erf\left(\frac{ln(d_{k}-l_{k}+y_{0})-\mu-2\sigma^{2}}{\sigma\sqrt{2}}\right) \right] \left[ erf\left(\frac{ln(d_{k}+y_{0})-\mu}{\sigma\sqrt{2}}\right)\right.\nonumber \[10pt] && -\left. erf\left(\frac{ln(d_{k}-l_{k}+y_{0})-\mu}{\sigma\sqrt{2}}\right) \right]-\frac{1}{4}exp\left(2\mu+\sigma^{2}\right)\nonumber \[10pt] &&\times \left[ erf\left(\frac{ln(d_{k}+y_{0})-\mu-\sigma^{2}}{\sigma\sqrt{2}}\right) - erf\left(\frac{ln(d_{k}-l_{k}+y_{0})-\mu-\sigma^{2}}{\sigma\sqrt{2}}\right) \right]^{2}\Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr}. \tag{49} \end{eqnarray}

Substituting the values of $\mu$, $\sigma$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Log-Normal Distribution

The 'hies' data of size $N=3566$ is used to demonstrate the application of the stratifyR package on Log-normal population. The 'hies' data comes from the HIES survey conducted in Fiji in the year 2010. The data contains only two aspects of the survey, namely Income and Expenditure. In this example, the variable Expenditure is used since it exhibits a 2-parameter Log-normal distribution with the shape and scale parameters as $meanlog = 9.2804934$ and $sdlog = 0.6917842$ respectively. The minimum and maximum values are $[y_{0}, y_{L}] = [991.24, 136539.1]$, which implies that $d=135547.8$.

To construct the OSB ($h = 2$) for the Iron data with a fixed total sample size of $500$, we use the following codes:

data(hies)
Expenditure <- hies$Expenditure
head(Expenditure);length(Expenditure)
hist(Expenditure)
min(Expenditure); max(Expenditure); d=max(Expenditure)-min(Expenditure);d
fit <- fitdist(Expenditure, distr="lnorm", method="mle")
fit
res <- strata.data(Expenditure, h = 2, n=500) 
summary(res)

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Log-normal population. Based on the assumption from past knowledge that the population follows Log-normal distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

res <- strata.distr(h=2, initval=10, dist=188, distr = "lnorm",
             params = c(meanlog=3.23, sdlog=0.65), n=500, N=1588)
summary(res)

Stratification for a Survey Variable with Cauchy Distribution

If the study variable \textit{y} follows the Cauchy distribution on the interval $[y_{0}, y_{L}]$, its two-parameter probability density function is given by:

\begin{eqnarray} f(y; \mu, \sigma)=\dfrac{1}{{\pi}\sigma\left[1+\left(\dfrac{y-\mu}{\sigma}\right)^2\right]} \;\;\;\;-\infty < y < +\infty \tag{50} \end{eqnarray}

where $\mu$ is the location parameter which specifies the location of the peak of the distribution and $\sigma$ is the scale parameter, which specifies the half-width at half maximum of the distribution.

DP Solution for Cauchy Distribution

To solve the MPP formulated for Cauchy distribution (50), we apply the algorithm using the DP technique discussed in Section 4. The recurrence relations used to determine the OSB are given by:

For the first stage, $k=1$, at $l_{1}^{*}=d_{1}$:

\begin{eqnarray} \Phi_{1}d_{1}&=& \textrm{Sqrt}\Biggl{\dfrac{1}{\pi^{2}}\left[\tan^{-1}\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma}\right)-\tan^{-1}\left(\dfrac{y_{0}-\mu}{\sigma}\right)\right]\nonumber\[10pt] &&\times\left[\mu\sigma\ln\left(\left(d_{1}+y_{0}-\mu\right)^2+\sigma^2\right)+\left(\mu^2-\sigma^2\right)\tan^{-1}\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma}\right)\right.\nonumber\[10pt] &&+\sigma\left(d_{1}+y_{0}\right)-\mu\sigma\ln\left(\left(y_{0}-\mu\right)^2+\sigma^2\right)\nonumber\[10pt] &&-\left.\left(\mu^2-\sigma^2\right)\tan^{-1}\left(\dfrac{y_{0}-\mu}{\sigma}\right)-\sigma y_{0}\right]\nonumber\[10pt] &&-\left[\dfrac{1}{4\pi^{2}}\left[\sigma\ln\left(\left(d_{1}+y_{0}-\mu\right)^2+\sigma^2\right)+2\mu\tan^{-1}\left(\dfrac{d_{1}+y_{0}-\mu}{\sigma}\right)\right.\right.\nonumber\[10pt] &&-\left.\left.\sigma\ln\left(\left(y_{0}-\mu\right)^2+\sigma^2\right)-2\mu\tan^{-1}\left(\dfrac{y_{0}-\mu}{\sigma}\right)\right]\right]^{2}\Biggr} \tag{51} \end{eqnarray}

And for the stages $k\geq2$:

\begin{eqnarray} \Phi_{k}d_{k}&=&\min_{0\leq l_{k}\leq d_{k}}\Biggl{ \textrm{Sqrt}\Biggl{ \dfrac{1}{\pi^{2}}\left[\tan^{-1}\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma}\right)-\tan^{-1}\left(\dfrac{d_{k}-l_{k}+y_{0}-\mu}{\sigma}\right)\right]\nonumber\[10pt] &&\times\left[\mu\sigma\ln\left(\left(d_{k}+y_{0}-\mu\right)^2+\sigma^2\right)+\left(\mu^2-\sigma^2\right)\tan^{-1}\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma}\right)\right.\nonumber\[10pt] &&+\sigma\left(d_{k}+y_{0}\right)-\mu\sigma\ln\left(\left(d_{k}-l_{k}+y_{0}-\mu\right)^2+\sigma^2\right)\nonumber\[10pt] &&-\left.\left(\mu^2-\sigma^2\right)\tan^{-1}\left(\dfrac{d_{k}-l_{k}+y_{0}-\mu}{\sigma}\right)-\sigma y_{h-1}\right]\nonumber\[10pt] &&-\left[\dfrac{1}{4\pi^{2}}\left[\sigma\ln\left(\left(d_{k}+y_{0}-\mu\right)^2+\sigma^2\right)+2\mu\tan^{-1}\left(\dfrac{d_{k}+y_{0}-\mu}{\sigma}\right)\right.\right.\nonumber\[10pt] &&-\left.\left.\sigma\ln\left(\left(d_{k}-l_{k}+y_{0}-\mu\right)^2+\sigma^2\right)-2\mu\tan^{-1}\left(\dfrac{d_{k}-l_{k}+y_{0}-\mu}{\sigma}\right)\right]\right]^{2} \Biggr} + \Phi_{k-1}(d_{k}-l_{k})\Biggr}. \tag{52} \end{eqnarray}

Substituting the values of $\mu$, $\sigma$, $y_{0}$ and $d$, the OSW ($l_{h}^{}$) and the OSB ($y_{h}^{}=y_{h-1}^{}-l_{h}^{}$) are obtained by executing the strata.data() function.

A Numerical Example for Cauchy Distribution

The Boston data from the MASS package is used to demonstrate the application for Cauchy population. In this example, the variable 'black' is used since it exhibits a 2-parameter Cauchy distribution with the location and scale parameters as $location = 393.864307$ and $scale = 4.710457$ respectively. The minimum and maximum values are $[y_{0}, y_{L}] = [0.32, 396.9]$, which implies that $d=396.58$.

data(Boston) #Housing Values in Suburbs of Boston
black = Boston$black
hist(black)
min(black); max(black); d=max(black)-min(black);d
fit <- fitdist(black, distr="cauchy", method="mle")
fit
res <- strata.data(black, h = 2, n=500)
summary(res)

Please note that for Cauchy distributions, one might get the error messages "simpleError in optim()...". You can ignore this error message which simply indicates that when the data is fitted to all ten distributions, it gives errors while computing the parameters of those distributions which are not defined in the negative region. This happens with Cauchy distribution because it is defined on both positive and negative regions.

Similarly, in order to find the OSB and other quantities, we can apply the strata.distr() function to a Cauchy population. Based on the assumption from past knowledge that the population follows Cauchy distribution with the given attributes such as the initial value, distance, parameters, etc., we can execute the following command to obtain the results:

#for a cauchy distribution with initial value of x0=-1, d=2 and 
#location and scale parameters 0 and 1 respectively
res <- strata.distr(h=2, initval=-1, dist=2, distr = "cauchy",
             params = c(location=0, scale=1), n=500, N=5000)
summary(res)

References



Try the stratifyR package in your browser

Any scripts or data that you put into this service are public.

stratifyR documentation built on Dec. 11, 2021, 9:25 a.m.