SkewedTwoPointSample: Create a sample with skewed two or three point distribution

View source: R/00.Miscellaneous.R

Description

Create a sample with a skewed two- or three-point distribution with a pre-specified positive mean and non-negative variance. The purpose is to create a two-point or three-point distribution that takes on only positive values and has a specified sample mean (m) and sample variance (v). By definition of the sample mean, when the variance is positive we need at least one value below the mean (p1) and at least one value above the mean (p2). As this function is intended to reproduce a distribution of weights, the lower value must be positive (p1 > 0). This is achieved by placing more subjects at p1 than at p2, so that p1 is closer to the mean (distance d1 = m - p1) than p2 is (distance d2 = p2 - m). The algorithm places progressively more subjects at p1 until it achieves p1 > 0. See also the Details.

Usage

SkewedTwoPointSample(n, m, v, max_r = 10^3)

Arguments

n

Required number of individuals in the sample. If 0, a data frame with zero rows is returned.

m

Required positive sample mean. As this is used to reconstruct a weight distribution, it has to be positive. NaN created by 0/0 is allowed if n = 0.

v

Required non-negative sample variance. As the weights can be the same for all individuals, zero is permitted.

max_r

Maximum for the ratio of the number of individuals below the mean to the number of individuals above the mean. The algorithm starts at r = 1 and proceeds until the desired result is obtained, max_r is reached, or an error is encountered.

Details

For a given sample size n, sample mean m, and sample variance v, we need to find values p1 and p2 satisfying 0 < p1 < m < p2 that result in a sample with the desired n, m, and v. The desired sum of squared deviations from the mean is (n-1)v.

The algorithm proceeds as follows. Initialize r = 1. Obtain the maximum integer K such that K(r+1) <= n. Let n0 = n - K(r+1); these n0 remainder subjects are placed at the sample mean m. Let n1 = Kr; these n1 subjects are placed at p1. Let n2 = K; these n2 subjects are placed at p2. Notice that n1 / n2 = Kr / K = r. Because of this group size ratio, to maintain the mean m, the distances of p1 and p2 from the mean must have the inverse ratio, that is, d1 / d2 = 1 / r and d2 = r d1. When r = 1, we have the same number of subjects at p1 and p2, so d1 = d2 and the mean m is at the midpoint between p1 and p2. As r increases, the mean m moves progressively closer to p1.

Because the mean is at m, the total sum-of-squares contribution of these (n1 + n2) subjects is n1 d1^2 + n2 d2^2 = n1 d1^2 + n2 (r d1)^2. This expression, with d1 unknown, is equated with the desired sum of squares (n-1)v and solved for d1, which gives d1^2 = (n-1)v / (n1 + r^2 n2). The positive root is the solution we want for d1 (it is a function of r). Importantly, we can ignore the n0 subjects placed at the mean because they contribute 0 to the sum of squares.

This d1 gives a negative p1 = m - d1 if d1 > m. If this is the case, r is incremented by 1 and the algorithm is rerun. This continues until it finds the smallest r that gives the desired all-positive two- or three-point sample distribution, max_r is reached, or the sample size n is exhausted before finding r (the sample size is too small to achieve the combination of a very small mean m and a high variance v).
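The search over r described above can be sketched in a few lines of R. This is a hypothetical re-implementation written only from the description on this page, not the package source. The function name skewed_two_point_sketch, the n >= 2 check, the v = 0 shortcut, and the error messages are all assumptions, and the n = 0 case handled by the real function is left out.

```r
# Illustrative sketch only; NOT the code in R/00.Miscellaneous.R.
skewed_two_point_sketch <- function(n, m, v, max_r = 10^3) {
  # Assumption: the n = 0 case is not handled in this sketch.
  stopifnot(n >= 2, m > 0, v >= 0)
  if (v == 0) {
    # No spread: every subject sits at the mean (assumed behavior).
    return(data.frame(W = m, count = n))
  }
  ss <- (n - 1) * v                # desired sum of squared deviations
  for (r in seq_len(max_r)) {
    K <- n %/% (r + 1)             # largest K with K * (r + 1) <= n
    if (K < 1) stop("n is too small for this m and v")
    n1 <- K * r                    # subjects placed at p1 (below the mean)
    n2 <- K                        # subjects placed at p2 (above the mean)
    n0 <- n - n1 - n2              # remainder subjects placed at the mean
    d1 <- sqrt(ss / (n1 + r^2 * n2))
    if (d1 < m) {                  # guarantees p1 = m - d1 > 0
      out <- data.frame(W = c(m - d1, m, m + r * d1),
                        count = c(n1, n0, n2))
      return(out[out$count > 0, ]) # drop the middle row when n0 = 0
    }
  }
  stop("max_r reached without finding p1 > 0")
}

# n = 10, m = 1, v = 4: r = 1, 2, 3 all give d1 > m; r = 4 is the first
# ratio that works, with 8 subjects at p1 and 2 subjects at p2.
skewed_two_point_sketch(n = 10, m = 1, v = 4)
```

For this input the loop stops at r = 4, where K = 2, n0 = 0, and d1^2 = 36 / (8 + 16 * 2) = 0.9, so the result is a two-point distribution with no subjects at the mean.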

Value

A data frame with a W column containing the values and a count column containing the number of subjects having each value of W.
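One way to read a result in this W/count layout is to expand it back to one value per subject and recompute the sample statistics. The sketch below does this for a hand-built table, so it does not call the package. The helper name summarize_weight_table is hypothetical, and the table values are an assumed example chosen to give n = 10, m = 1, and v = 4.

```r
# Hypothetical helper: recover n, the sample mean, and the sample variance
# from a (W, count) table by repeating each value W 'count' times.
summarize_weight_table <- function(df) {
  x <- rep(df$W, times = df$count)
  c(n = length(x), mean = mean(x), var = var(x))
}

# Hand-built two-point table: 8 subjects at p1 = 1 - d1 and
# 2 subjects at p2 = 1 + 4 * d1, with d1 = sqrt(0.9).
d1 <- sqrt(0.9)
tab <- data.frame(W = c(1 - d1, 1 + 4 * d1), count = c(8, 2))
summarize_weight_table(tab)
```

Up to floating-point error, the summary should report n = 10, a mean of 1, and a variance of 4, since var() uses the n - 1 denominator assumed in the Details.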

Author(s)

Kazuki Yoshida


kaz-yos/distributed documentation built on May 27, 2019, 4:50 a.m.