Estimating hidden population size using RDS data

Description

posteriorsize computes the posterior distribution of the population size based on data collected by Respondent Driven Sampling. The approach approximates the RDS via the Sequential Sampling model of Gile (2008). As such, it is referred to as the Sequential Sampling - Population Size Estimate (SS-PSE). It uses the order of selection of the sample to provide information on the distribution of network sizes over the population members.
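To see why the order of selection is informative, the base-R sketch below simulates sequential sampling without replacement with probability proportional to degree, and compares the mean degree of the first and second halves of each sample. It is only an illustration of the sampling model, not of the SS-PSE fit itself; the degree probabilities, the seed, and the helper one_run are hypothetical choices made for this sketch.

```r
set.seed(2024)
K <- 10     # maximum degree (illustrative)
N0 <- 200   # population size (illustrative)
n <- 100    # sample size (illustrative)

# Hypothetical degree probabilities, decreasing in the degree.
probs <- (1:K)^(-1.5)
probs <- probs / sum(probs)

# Hypothetical helper: one simulated population and one sequential sample.
one_run <- function() {
  pop <- sample(1:K, size = N0, replace = TRUE, prob = probs)
  # Draw without replacement, with probability proportional to degree.
  s <- sample(pop, size = n, replace = FALSE, prob = pop)
  c(first = mean(s[1:(n / 2)]), last = mean(s[(n / 2 + 1):n]))
}

# Average the two half-sample means over repeated simulations.
avg <- rowMeans(replicate(200, one_run()))
print(round(avg, 2))
```

High-degree members tend to be drawn early and are then depleted, so the first half of the sample has a larger average degree than the second half. That drift is the kind of signal the order of selection carries about the unsampled population.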

Usage

posteriorsize(s, median.prior.size = NULL, interval = 10, burnin = 5000,
  maxN = NULL, K = max(s, na.rm = TRUE), samplesize = 1000,
  quartiles.prior.size = NULL, mean.prior.size = NULL,
  mode.prior.size = NULL, priorsizedistribution = c("beta", "flat",
  "nbinom", "pln", "supplied"), effective.prior.df = 1,
  sd.prior.size = NULL, mode.prior.sample.proportion = NULL, alpha = NULL,
  degreedistribution = c("cmp", "nbinom", "pln"), mean.prior.degree = NULL,
  sd.prior.degree = NULL, max.sd.prior.degree = 4, df.mean.prior = 1,
  df.sd.prior = 3, Np = 0, nk = NULL, n = length(s), muproposal = 0.1,
  sigmaproposal = 0.15, burnintheta = 500, parallel = 1,
  parallel.type = "MPI", seed = NULL, maxbeta = 120, dispersion = 0,
  supplied = list(maxN = maxN), verbose = TRUE)

Arguments

s

vector of integers; the vector of degrees from the RDS in the order they are recorded.

median.prior.size

scalar; A hyperparameter being the median of the prior distribution on the population size.

interval

count; the number of proposals between sampled statistics.

burnin

count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number.

maxN

integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution.

K

count; the maximum degree for an individual. The default shown in Usage is max(s, na.rm = TRUE), the largest recorded degree.

samplesize

count; the number of Monte-Carlo samples to draw to compute the posterior. This is the number returned by the Metropolis-Hastings algorithm. The default is 1000.

quartiles.prior.size

vector of length 2; A pair of hyperparameters being the lower and upper quartiles of the prior distribution on the population size. For example,
quartiles.prior.size=c(1000,4000) corresponds to a prior where the lower quartile (25%) is 1000 and the upper (75%) is 4000.

mean.prior.size

scalar; A hyperparameter being the mean of the prior distribution on the population size.

mode.prior.size

scalar; A hyperparameter being the mode of the prior distribution on the population size.

priorsizedistribution

character; the type of parametric distribution to use for the prior on population size. The options are beta (for a Beta prior on the sample proportion, i.e. n/N), flat (uniform), nbinom (Negative-Binomial), pln (Poisson-log-normal), and supplied (a prior passed through the supplied argument). The default is beta.

effective.prior.df

scalar; A hyperparameter being the effective number of samples worth of information represented in the prior distribution on the population size. By default this is 1, but it can be greater (or less!) to allow for different levels of uncertainty.

sd.prior.size

scalar; A hyperparameter being the standard deviation of the prior distribution on the population size.

mode.prior.sample.proportion

scalar; A hyperparameter being the mode of the prior distribution on the sample proportion n/N.

alpha

scalar; A hyperparameter being the first parameter of the beta prior model for the sample proportion. By default this is NULL, meaning that 1 is chosen. It can be any value of at least 1 to allow for different levels of uncertainty.

degreedistribution

character; the parametric distribution to use for the individual network sizes (i.e., degrees). The options are cmp, nbinom, and pln. These correspond to the Conway-Maxwell-Poisson, Negative-Binomial, and Poisson-log-normal. The default is cmp.

mean.prior.degree

scalar; A hyper parameter being the mean degree for the prior distribution for a randomly chosen person. The prior has this mean.

sd.prior.degree

scalar; A hyper parameter being the standard deviation of the degree for a randomly chosen person. The prior has this standard deviation.

max.sd.prior.degree

scalar; The maximum allowed value of sd.prior.degree. If the passed or computed value is higher, it is reduced to this value. This is done for numerical stability reasons.

df.mean.prior

scalar; A hyper parameter being the degrees-of-freedom of the prior for the mean. This gives the equivalent sample size that would contain the same amount of information inherent in the prior.

df.sd.prior

scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation.

Np

integer; The overall degree distribution is a mixture of the Np rates for 1:Np and a parametric degree distribution model truncated below Np. Thus the model fits the proportions of the population with degree 1:Np each with a separate parameter. This should adjust for a lack of fit of the parametric degree distribution model at lower degrees, although it also changes the model away from the parametric degree distribution model.

nk

vector; the vector of counts for the number of people in the sample with degree k. This is usually computed from s automatically as tabulate(s,nbins=K) and not usually specified by the user.

n

count; the sample size. This is usually computed from s automatically as length(s) and not usually specified by the user.

muproposal

scalar; The standard deviation of the proposal distribution for the mean degree.

sigmaproposal

scalar; The standard deviation of the proposal distribution for the standard deviation of the degree.

burnintheta

count; the number of proposals in the Metropolis-Hastings sub-step for the degree distribution parameters (θ) before any MCMC sampling is done. It typically is set to a modestly large number.

parallel

count; the number of parallel processes to run for the Monte-Carlo sample. This uses PVM or MPI. The default is 1, that is, no parallel processing.

parallel.type

character; the type of parallel processing to use. The options are "PVM" or "MPI". This requires the corresponding type to be installed.

seed

integer; seed for the random number generator. Defaults to NULL to use whatever the state of the random number generator is at the time of the call.

maxbeta

scalar; The maximum allowed value of the beta parameter. If the implied or computed value is higher, it is reduced to this value. This is done for numerical stability reasons.

dispersion

scalar; dispersion to use in the reported network size compared to the actual network size.

supplied

list; if supplied, a list with components maxN and sample. Here sample is a matrix with a column named N containing a sample from a prior distribution for the population size, and maxN specifies the maximum value of the population size, a priori.

verbose

logical; if this is TRUE, the program will print out additional information, including goodness of fit statistics.
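As a rough sketch of how a few of the arguments above relate to one another, the base-R snippet below tabulates nk from a made-up degree vector, as the nk entry describes, and counts the proposals implied by burnin, interval and samplesize. The degree values are hypothetical, and the proposal count assumes that one draw is kept every interval proposals after burn-in; the package's internal accounting may differ.

```r
# Hypothetical recorded degrees, in the order they were sampled.
s <- c(3, 1, 4, 1, 5, 2, 1)

# Maximum degree, using the default shown in Usage.
K <- max(s, na.rm = TRUE)

# Number of sampled people with each degree 1..K, as described for nk.
nk <- tabulate(s, nbins = K)

# Sample size, matching the default n = length(s).
n <- length(s)

# Assumed bookkeeping: burn-in proposals plus interval proposals per kept draw.
burnin <- 5000
interval <- 10
samplesize <- 1000
total_proposals <- burnin + interval * samplesize

print(nk)
print(total_proposals)
```

Under that assumption, raising interval reduces the correlation between kept draws at the cost of proportionally more proposals, while burnin is a fixed up-front cost.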

Value

posteriorsize returns a list consisting of the following elements:

pop

vector; The final posterior draw for the degrees of the population. The first n are the sample in sequence and the remainder are non-sequenced.

K

count; the maximum degree for an individual. This is usually calculated as twice the maximum observed degree.

n

count; the sample size.

samplesize

count; the number of Monte-Carlo samples to draw to compute the posterior. This is the number returned by the Metropolis-Hastings algorithm. The default is 1000.

burnin

count; the number of proposals before any MCMC sampling is done. It typically is set to a fairly large number.

interval

count; the number of proposals between sampled statistics.

mu

scalar; The hyper parameter mean.prior.degree being the mean degree for the prior distribution for a randomly chosen person. The prior has this mean.

sigma

scalar; The hyper parameter sd.prior.degree being the standard deviation of the degree for a randomly chosen person. The prior has this standard deviation.

df.mean.prior

scalar; A hyper parameter being the degrees-of-freedom of the prior for the mean. This gives the equivalent sample size that would contain the same amount of information inherent in the prior.

df.sd.prior

scalar; A hyper parameter being the degrees-of-freedom of the prior for the standard deviation. This gives the equivalent sample size that would contain the same amount of information inherent in the prior for the standard deviation.

Np

integer; The overall degree distribution is a mixture of the 1:Np rates and a parametric degree distribution model truncated below Np. Thus the model fits the proportions of the population with degree 1:Np each with a separate parameter. This should adjust for a lack of fit of the parametric degree distribution model at lower degrees, although it also changes the model away from the parametric degree distribution model.

muproposal

scalar; The standard deviation of the proposal distribution for the mean degree.

sigmaproposal

scalar; The standard deviation of the proposal distribution for the standard deviation of the degree.

N

vector of length 5; summary statistics for the posterior population size.

MAP

maximum a posteriori value of N

Mean AP

mean a posteriori value of N

Median AP

median a posteriori value of N

P025

the 2.5th percentile of the (posterior) distribution of N. That is, the lower point of a 95% probability interval.

P975

the 97.5th percentile of the (posterior) distribution of N. That is, the upper point of a 95% probability interval.

maxN

integer; maximum possible population size. By default this is determined from an upper quantile of the prior distribution.

sample

matrix of dimension samplesize × 10 of summary statistics from the posterior. This is also an object of class mcmc, so it can be plotted and summarized via the mcmc.diagnostics function in the ergm package (and also the coda package). The statistics are:

N

population size.

mu

scalar; The mean degree for the prior distribution for a randomly chosen person. The prior has this mean.

sigma

scalar; The standard deviation of the degree for a randomly chosen person. The prior has this standard deviation.

degree1

scalar; the number of nodes of degree 1 in the population (it is assumed all nodes have degree 1 or more).

lambda

scalar; This is only present for the cmp model. It is the λ parameter in the standard parametrization of the Conway-Maxwell-Poisson model for the degree distribution.

nu

scalar; This is only present for the cmp model. It is the ν parameter in the standard parametrization of the Conway-Maxwell-Poisson model for the degree distribution.

lpriorm

vector; the vector of (log) prior probabilities on each value of m=N-n - that is, the number of unobserved members of the population. The values are n:(length(lpriorm)-1+n).

burnintheta

count; the number of proposals in the Metropolis-Hastings sub-step for the degree distribution parameters (θ) before any MCMC sampling is done. It typically is set to a modestly large number.

verbose

logical; if this is TRUE, the program printed out additional information, including goodness of fit statistics.

predictive.degree.count

vector; a vector of length the maximum degree (K) (by default K=2*max(sample degree)). The kth entry is the posterior predictive number of persons with degree k. That is, it is the posterior predictive distribution of the number of people with each degree in the population.

predictive.degree

vector; a vector of length the maximum degree (K) (by default K=2*max(sample degree)). The kth entry is the posterior predictive proportion of persons with degree k. That is, it is the posterior predictive distribution of the proportion of people with each degree in the population.

MAP

vector of length 6 of MAP estimates corresponding to the output sample. These are:

N

population size.

mu

scalar; The mean degree for the prior distribution for a randomly chosen person. The prior has this mean.

sigma

scalar; The standard deviation of the degree for a randomly chosen person. The prior has this standard deviation.

degree1

scalar; the number of nodes of degree 1 in the population (it is assumed all nodes have degree 1 or more).

lambda

scalar; This is only present for the cmp model. It is the λ parameter in the standard parametrization of the Conway-Maxwell-Poisson model for the degree distribution.

nu

scalar; This is only present for the cmp model. It is the ν parameter in the standard parametrization of the Conway-Maxwell-Poisson model for the degree distribution.

mode.prior.sample.proportion

scalar; A hyperparameter being the mode of the prior distribution on the sample proportion n/N.

median.prior.size

scalar; A hyperparameter being the median of the prior distribution on the population size.

mode.prior.size

scalar; A hyperparameter being the mode of the prior distribution on the population size.

mean.prior.size

scalar; A hyperparameter being the mean of the prior distribution on the population size.

quartiles.prior.size

vector of length 2; A pair of hyperparameters being the lower and upper quartiles of the prior distribution on the population size.

degreedistribution

character; the parametric distribution to use for the individual network sizes (i.e., degrees). The options are cmp, nbinom, and pln. These correspond to the Conway-Maxwell-Poisson, Negative-Binomial, and Poisson-log-normal. The default is cmp.

priorsizedistribution

character; the type of parametric distribution to use for the prior on population size. The options are beta (for a Beta prior on the sample proportion, i.e. n/N), nbinom (Negative-Binomial), pln (Poisson-log-normal), flat (uniform), and continuous (the continuous version of the Beta prior on the sample proportion). The default is beta.
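The five entries of the N element summarize the posterior draws of the population size. The base-R sketch below shows one way such a summary could be computed from a vector of draws. The draws here are simulated stand-ins rather than package output, and taking the MAP as the peak of a kernel density estimate is an assumption of this sketch, not necessarily how the package computes it.

```r
set.seed(42)

# Stand-in for the N column of the sample matrix: simulated draws,
# not real output from posteriorsize.
N_draws <- 100 + rnbinom(1000, size = 5, mu = 300)

# Kernel density estimate; its peak serves as an approximate posterior mode.
dens <- density(N_draws)
map_N <- dens$x[which.max(dens$y)]

# Five-number summary using the element names listed above.
N_summary <- c(
  MAP = map_N,
  "Mean AP" = mean(N_draws),
  "Median AP" = median(N_draws),
  P025 = unname(quantile(N_draws, 0.025)),
  P975 = unname(quantile(N_draws, 0.975))
)
print(round(N_summary))
```

P025 and P975 together give an equal-tailed 95% probability interval, which need not coincide with a highest posterior density interval when the posterior is skewed.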

Details on priors

The best way to specify the prior is via the hyperparameter mode.prior.size, which specifies the mode of the prior distribution on the population size. Alternatively, you can specify median.prior.size, the median of the prior distribution on the population size; mean.prior.sample.proportion, the mean of the prior distribution on the proportion of the population in the sample; or mode.prior.sample.proportion, the mode of the prior distribution on that proportion. Finally, you can specify quartiles.prior.size as a vector of length 2 giving the lower and upper quartiles of the prior distribution on the population size.
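A prior on the sample proportion n/N implies a prior on N, which is why the size and proportion hyperparameters are interchangeable ways of describing the same belief. The sketch below works this out for a Beta prior using base R only. The sample size, the second Beta parameter and the helper prior_N_quantile are hypothetical, and N is treated as continuous for simplicity, so the numbers are indicative rather than what the package would compute.

```r
n <- 100     # sample size (hypothetical)
alpha <- 1   # first Beta parameter (the documented default)
beta <- 3    # second Beta parameter (hypothetical value)

# With p = n / N ~ Beta(alpha, beta), N = n / p is decreasing in p,
# so quantile q of N equals n divided by quantile (1 - q) of p.
prior_N_quantile <- function(q) n / qbeta(1 - q, alpha, beta)

# Implied quartiles and median of the prior on the population size.
quartiles_N <- prior_N_quantile(c(0.25, 0.75))
median_N <- prior_N_quantile(0.5)

print(round(c(lower = quartiles_N[1], median = median_N, upper = quartiles_N[2])))
```

Reading the relationship in the other direction, a stated median or pair of quartiles for N pins down the Beta parameters, which is one way to sanity-check whether a chosen prior matches what is believed about the population.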

References

Gile, Krista J. (2008) Inference from Partially-Observed Network Data, Ph.D. Thesis, Department of Statistics, University of Washington.

Gile, Krista J. and Handcock, Mark S. (2010) Respondent-Driven Sampling: An Assessment of Current Methodology, Sociological Methodology 40, 285-327.

Gile, Krista J. and Handcock, Mark S. (2014) sspse: Estimating Hidden Population Size using Respondent Driven Sampling Data, R package, Los Angeles, CA. Version 0.5, http://hpmrg.org.

Handcock, Mark S. (2003) degreenet: Models for Skewed Count Distributions Relevant to Networks, Statnet Project, Seattle, WA. Version 1.2, http://statnetproject.org.

Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2014) Estimating Hidden Population Size using Respondent-Driven Sampling Data, Electronic Journal of Statistics, 8, 1, 1491-1521.

Handcock, Mark S., Gile, Krista J. and Mar, Corinne M. (2015) Estimating the Size of Populations at High Risk for HIV using Respondent-Driven Sampling Data, Biometrics.

See Also

network, statnet, degreenet

Examples

## Not run: 
N0 <- 200
n <- 100
K <- 10

# Create probabilities for a Waring distribution
# with scaling parameter 3 and mean 5, but truncated at K=10.
probs <- c(0.33333333,0.19047619,0.11904762,0.07936508,0.05555556,
           0.04040404,0.03030303,0.02331002,0.01831502,0.01465201)
probs <- probs / sum(probs)

# Look at the degree distribution for the prior
# Plot these if you want
# plot(x=1:K,y=probs,type="l")
# points(x=1:K,y=probs)
#
# Create a sample
#
set.seed(1)
pop <- sample(1:K, size = N0, replace = TRUE, prob = probs)
s <- sample(pop, size = n, replace = FALSE, prob = pop)

out <- posteriorsize(s = s, interval = 10)
plot(out, HPD.level = 0.9, data = pop[s])
summary(out, HPD.level = 0.9)
# Let's look at some MCMC diagnostics
plot(out, HPD.level = 0.9, mcmc = TRUE)

## End(Not run)