In networkreporting: Tools for using Network Reporting Estimators

Analyzing respondent-driven sampling (RDS) data using the networkreporting package

Introduction

The networkreporting package has tools for analyzing data that were collected using respondent-driven sampling (RDS).

This introduction will assume that you already have the networkreporting package installed. If you don't, please refer to the introductory vignette ("getting started") for instructions on how to do this.

Review of respondent driven sampling

TODO - eventually, this will be a very brief review of RDS, just enough to explain the two bootstrap options as they get used with generalized scale-up

In the descriptions below, we will need some notation to refer to the waves at each stage

TODO FINISH...

Preparing data

library(networkreporting)
library(plyr)
library(ggplot2) # we'll use qplot from ggplot2 for plots
theme_set(theme_minimal())

data.dir <- file.path("~", "Dropbox", "dennis_matt", "data", "processed")

# eventually, a sample RDS dataset will be included with the
# package. for now, this will only work on a few machines
# and only after gnsum-curitiba-goc-prep.r has been run
load(file.path(data.dir, "gnsum-curitiba-goc-data.RData"))

We obtain a sample of the hidden population using respondent-driven sampling by starting with a set of seeds and asking each of them to refer us to hidden population members she is connected to. The process then repeats, wave after wave, until we are finished. The data that result can be thought of as a tree, also called a chain, starting from each seed respondent.

Although we usually think of respondent-driven sampling data as being composed of these chains, the dataset itself typically arrives as a rectangular array. The make.chain function allows us to take one of these rectangular arrays and turn it into a linked chain structure. (If you are not familiar with linked data structures, it suffices to understand that the rest of the respondent-driven sampling functions we will want to use require the data to have been formatted in this way.)

The main requirement of make.chain is that the rows of the dataset have ids that allow us to determine whether or not one row is the parent of another. You can pass in your own function which, given two ids, returns TRUE if the second id is the parent of the first and FALSE otherwise. The default is to assume that we are using the id scheme from the Curitiba study [TODO CITE].

TODO -- more detail on how ids work, etc; need to be able to parse the ids in a way that describes the tree... TODO -- chain summary function?

## build the nomination chains based on the
## ids in the data
seed.ids <- c("1", "2", "3", "4", "5")
chains <- llply(seed.ids,
                ## NB: make.chain is not currently exported, so we
                ## have to specify the package namespace explicitly
                networkreporting:::make.chain,
                survey.data)

TODO - describe the trait we want to measure and the mixing setup

Variance estimates from bootstrap of Salganik [TODO CITE]

The bootstrap of Salganik [TODO CITE] accounts for the non-independent nature of RDS observations by characterizing each respondent in terms of a small number of categorical states. For example, we might make a note of whether each respondent is male or female and whether or not each respondent has a large or small number of network connections. Essentially, this models the referral process as a Markov Chain, where the states given by the respondents' traits and transitions given by one respondent referring another, mod the respondents' traits. [TODO EXPLAIN MUCH BETTER; CITE DESCRIPTION IN WEIR ET AL]

The next step, then, is to specify the traits that we think are most salient in driving the referral process, and to then estimate the mixing model based on those traits:

## pick a set of traits and estimate the mixing
## info for them
these.traits <- c("netsize.bss.big")

mm <- networkreporting:::estimate.mixing(nonseed.survey.data, parent.data, these.traits)

We also need the empirical distribution of respondent degrees for each trait in the data. We can compute this with estimate.degree.distns. Note that we specify that we want to retain total.alters and total.aware from the original data; we will need these to compute our estimates once we have the bootstrap resamples.

dd <- networkreporting:::estimate.degree.distns(survey.data,
                                                d.hat.vals="netsize.bss",
                                                traits=these.traits,
                                                keep.vars=c("total.alters",
                                                            "total.aware"))

Finally, we take the actual bootstrap resamples using rds.mc.boot.draws.

num.reps <- 100
boot.mc.dat <- networkreporting:::rds.mc.boot.draws(chains,
                                                    mm,
                                                    dd,
                                                    num.reps=num.reps)

Now that we have the bootstrap resamples, we can produce an estimate for each one.

## use the RDS-II estimator on each bootstrap resample

## get estimates of the total number of alters and the total number of
## alters aware that the ego is a heavy drug user for each bootstrap rep
boot.mc.total.ests <- ldply(boot.mc.dat,
                            rdsII.estimator,
                            ## now we use the name of the resampled degree
                            ## (degree) instead of the original one
                            d.hat.vals="degree",
                            y.vals="total.alters",
                            missing="complete.obs")

boot.mc.aware.ests <- ldply(boot.mc.dat,
                            rdsII.estimator,
                            d.hat.vals="degree",
                            y.vals="total.aware",
                            missing="complete.obs")

boot.mc.ests <- data.frame(estimate=boot.mc.aware.ests[,1] / boot.mc.total.ests[,1],
                           estimator="mc")

Variance estimates form bootstrap of Weir et al [TODO CITE]

Alternatively, we can obtain bootstrap resamples using the method of Weir et al [TODO CITE]. In this method, instead of randomly choosing a seed for each bootstrap resample, the original seeds used in the actual data collection are considered fixed.

The structure of each chain in the original data is preserved, but within each bootstrap resample, the traits of the referrals are drawn from a Markov Model in the same way as they were above. For example, suppose we have two traits, 0 and 1. The steps are:

We can obtain bootstrap resamples using thie technique by calling rds.chain.boot.draws.

boot.chain.dat <- networkreporting:::rds.chain.boot.draws(chains,
                                                          mm,
                                                          dd,
                                                          num.reps=num.reps)

And we next produce an estimate for each one.

## use the RDS-II estimator on each bootstrap resample

## get estimates of the total number of alters and the total number of
## alters aware that the ego is a heavy drug user for each bootstrap rep
boot.chain.total.ests <- ldply(boot.chain.dat,
                               rdsII.estimator,
                               ## now we use the name of the resampled degree
                               ## (degree) instead of the original one
                               d.hat.vals="degree",
                               y.vals="total.alters",
                               missing="complete.obs")

boot.chain.aware.ests <- ldply(boot.chain.dat,
                               rdsII.estimator,
                               d.hat.vals="degree",
                               y.vals="total.aware",
                               missing="complete.obs")

boot.chain.ests <- data.frame(estimate=boot.chain.aware.ests[,1] / boot.chain.total.ests[,1],
                              estimator="chain")

Now we can compare the estimated visibility using the two bootstraps.

both.ests <- rbind(boot.chain.ests, boot.mc.ests)

comp.plot <- ggplot(both.ests) +
             geom_density(aes(x=estimate, color=estimator, group=estimator))