Introduction to samplingin

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(samplingin)
library(dplyr)
library(magrittr)
set.seed(114)

Population DataFrame

We'll use the dataset pop_dt. It tabulates Indonesia's population by regency/city and gender, based on the results of the 2020 population census published by BPS-Statistics Indonesia (https://sensus.bps.go.id/main/index/sp2020).

dim(pop_dt)

pop_dt %>% head()

Allocation DataFrame

The allocation dataset is alokasi_dt, which holds the sample allocation for each province.

dim(alokasi_dt)

alokasi_dt

Simple Random Sampling (SRS)

A simple random sample is a randomly selected subset of a population. In this sampling method, each member of the population has an exactly equal chance of being selected.

The following is the syntax for simple random sampling. Use the parameter method = 'srs'.

dtSampling_srs = doSampling(
  pop     = pop_dt,
  alloc   = alokasi_dt,
  nsample = "n_primary",
  seed    = 7891,
  method  = "srs",
  ident   = c("kdprov"),
  type    = "U"
)
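To make the idea concrete, here is a minimal base-R sketch of simple random sampling within each province. It does not use samplingin and is not a description of how doSampling works internally; the objects toy_pop and toy_alloc are hypothetical toy data invented for this illustration.

```r
# Hypothetical toy population: 3 provinces with 5, 4, and 6 units.
toy_pop <- data.frame(
  kdprov = rep(c("11", "12", "13"), times = c(5, 4, 6)),
  unit   = 1:15
)

# Hypothetical allocation: number of units to draw per province.
toy_alloc <- c("11" = 2, "12" = 1, "13" = 3)

set.seed(1)

# Within each province, draw the allocated number of rows without
# replacement, so every unit in the province has the same chance.
picked <- unlist(lapply(names(toy_alloc), function(p) {
  rows <- which(toy_pop$kdprov == p)
  rows[sample.int(length(rows), toy_alloc[[p]])]
}))

toy_sample <- toy_pop[sort(picked), ]

# Number of sampled units per province; it matches toy_alloc.
table(toy_sample$kdprov)
```

Indexing rows through sample.int avoids the well-known pitfall of sample() when a group contains a single numeric element.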

Displaying the primary sampling result

Population Sampled

head(dtSampling_srs$pop)

Units Sampled

head(dtSampling_srs$sampledf)

dtSampling_srs$sampledf %>% nrow

Sampling Details

head(dtSampling_srs$details)

Systematic Random Sampling

Systematic random sampling selects samples at a fixed, preset interval. Using the population and allocation data introduced above, we carry out systematic random sampling with the doSampling function from the samplingin package. Use the parameter method = 'systematic'.
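As a rough illustration of the interval idea, the following base-R sketch picks n positions out of N ordered units using a random start and a constant interval. The helper systematic_idx is a hypothetical name introduced here; it is a textbook-style sketch and may differ from the rule that doSampling applies.

```r
# Hypothetical helper: positions selected by systematic sampling.
#   N     number of units in the ordered population
#   n     number of units to select (n <= N)
#   start random start strictly between 0 and 1
systematic_idx <- function(N, n, start = runif(1)) {
  k   <- N / n                       # sampling interval (may be fractional)
  pos <- (start + 0:(n - 1)) * k     # equally spaced points in (0, N)
  ceiling(pos)                       # map each point to a unit position
}

# With a fixed start of 0.5 the interval of 4 is easy to see.
systematic_idx(N = 20, n = 5, start = 0.5)
#> [1]  2  6 10 14 18
```

Because the interval k is at least 1 whenever n <= N, no position is selected twice.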

Primary Units Sampling

The following is the syntax for sampling the primary units

dtSampling_u = doSampling(
  pop     = pop_dt,
  alloc   = alokasi_dt,
  nsample = "n_primary",
  seed    = 2,
  method  = "systematic",
  ident   = c("kdprov"),
  type    = "U"
)

Displaying the primary sampling result

Population Sampled

head(dtSampling_u$pop)

Units Sampled

head(dtSampling_u$sampledf)

dtSampling_u$sampledf %>% nrow

Sampling Details

head(dtSampling_u$details)

Secondary Units Sampling

To sample secondary units, we use the population returned by the prior sampling step, in which the selected primary units have been flagged. Add is_secondary = TRUE to the doSampling call.

alokasi_dt_p = alokasi_dt %>% 
  mutate(n_secondary = 2*n_primary)

dtSampling_p = doSampling(
  pop     = dtSampling_u$pop,
  alloc   = alokasi_dt_p,
  nsample = "n_secondary",
  seed    = 243,
  method  = "systematic",
  ident   = c("kdprov"),
  type    = "P",
  is_secondary = TRUE
)

The output shows that 2 units of the allocation have not yet been selected. To view the allocations that still have a shortfall, filter the sampling details on n_deficit:

dtSampling_p$details %>% 
  filter(n_deficit>0)

Displaying the secondary sampling result

Population Sampled

head(dtSampling_p$pop)

Flags for primary and secondary units

dtSampling_p$pop %>% count(flags)

Units Sampled

head(dtSampling_p$sampledf)

dtSampling_p$sampledf %>% nrow

Sampling Details

head(dtSampling_p$details)

PPS Systematic Sampling

PPS (probability proportional to size) systematic sampling draws from a finite population in which a size measure is available for each unit before sampling, and the probability of selecting a unit is proportional to its size. Larger units therefore have a greater chance of being selected. We use the doSampling function with method = 'pps' and auxVar = 'Total' as the auxiliary (size) variable.

dtSampling_pps = doSampling(
  pop     = pop_dt,
  alloc   = alokasi_dt,
  nsample = "n_primary",
  seed    = 321,
  method  = "pps",
  auxVar  = "Total",
  ident   = c("kdprov"),
  type    = "U"
)
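The mechanics of PPS systematic selection can be sketched in base R by laying the units end to end on a cumulated-size scale and stepping along it at a constant interval. The helper pps_systematic_idx and the size vector below are hypothetical; this is a generic sketch under those assumptions, not the internal code of doSampling.

```r
# Hypothetical helper: unit positions selected by PPS systematic sampling.
#   size  positive size measure of each unit (the role played by auxVar)
#   n     number of selections
#   start random start strictly between 0 and 1
pps_systematic_idx <- function(size, n, start = runif(1)) {
  cum <- cumsum(size)
  k   <- sum(size) / n                 # interval on the cumulated-size scale
  pts <- (start + 0:(n - 1)) * k       # selection points in (0, sum(size))
  # Unit i covers the range (cum[i - 1], cum[i]]; find the unit for each point.
  findInterval(pts, c(0, cum), left.open = TRUE)
}

size <- c(10, 50, 20, 120)             # toy sizes, total 200
pps_systematic_idx(size, n = 2, start = 0.25)
#> [1] 2 4
```

Here the selection points are 25 and 125, which fall in the ranges of units 2 and 4. Note that in this plain version a unit whose size exceeds the interval can be hit more than once.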

Displaying the PPS sampling result

Population Sampled

head(dtSampling_pps$pop)

Units Sampled

head(dtSampling_pps$sampledf)

dtSampling_pps$sampledf %>% nrow

Sampling Details

head(dtSampling_pps$details)

Sampling using Stratification

For stratified sampling, the doSampling function takes an additional parameter called strata. The strata variable must be present in both the population and the allocation data. For example, we add a strata variable, strata_kabkot, to the pop_dt data; it distinguishes districts (strata_kabkot = 1) from cities (strata_kabkot = 2).

pop_dt_strata = pop_dt %>% 
  mutate(
    strata_kabkot = ifelse(substr(kdkab,1,1)=='7', 2, 1)
  )

alokasi_dt_strata = pop_dt_strata %>% 
  group_by(kdprov,strata_kabkot) %>% 
  summarise(
    jml_kabkota = n()
  ) %>% 
  ungroup %>% 
  left_join(
    alokasi_dt %>% 
      select(kdprov,n_primary) %>% 
      rename(n_alloc = n_primary),
    by = "kdprov"
  )

alokasi_dt_strata = alokasi_dt_strata %>%
  get_allocation(n_alloc = "n_alloc", group = c("kdprov"), pop_var = "jml_kabkota")

dtSampling_strata = doSampling(
  pop     = pop_dt_strata,
  alloc   = alokasi_dt_strata,
  nsample = "n_primary",
  seed    = 3512,
  method  = "systematic",
  strata  = "strata_kabkot",
  ident   = c("kdprov"),
  type    = "U"
)

Displaying the sampling result with stratification

Population Sampled

head(dtSampling_strata$pop)

Units Sampled

head(dtSampling_strata$sampledf)

dtSampling_strata$sampledf %>% nrow

dtSampling_strata$sampledf %>% count(strata_kabkot)

Sampling Details

head(dtSampling_strata$details)

Sampling with Implicit Stratification

Implicit stratification is sometimes used so that the selected sample is spread across the range of a certain variable. For instance, to obtain a sample that is distributed according to total population, add the parameter implicitby = 'Total' when sampling.

dtSampling_implicit = doSampling(
  pop        = pop_dt_strata,
  alloc      = alokasi_dt_strata,
  nsample    = "n_primary",
  seed       = 3512,
  method     = "systematic",
  strata     = "strata_kabkot",
  implicitby = "Total",
  ident      = c("kdprov"),
  type       = "U"
)
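The effect of implicit stratification can be illustrated in base R: sort the frame by the chosen variable, then select systematically, so that the selections are spread over the whole range of that variable. The toy data frame below is invented for this sketch, and the ordering rule used by doSampling may differ in detail.

```r
# Hypothetical toy frame with a size variable named Total.
toy <- data.frame(
  id    = 1:12,
  Total = c(40, 5, 90, 15, 60, 25, 75, 10, 55, 30, 85, 20)
)

# Implicit stratification: order the frame by the chosen variable first.
ordered <- toy[order(toy$Total), ]

n     <- 4
k     <- nrow(ordered) / n            # sampling interval, here 3
start <- 0.5                          # fixed start to keep the example reproducible
idx   <- ceiling((start + 0:(n - 1)) * k)

# The selected units cover small, medium, and large values of Total.
ordered[idx, "Total"]
#> [1] 10 25 55 85
```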

Displaying the sampling result with implicit stratification

Population Sampled

head(dtSampling_implicit$pop)

Units Sampled

head(dtSampling_implicit$sampledf)

dtSampling_implicit$sampledf %>% nrow

dtSampling_implicit$sampledf %>% count(strata_kabkot)

Sampling Details

head(dtSampling_implicit$details)

Sampling with Predetermined Random Number

Sometimes the random numbers for sampling have been determined beforehand. The samplingin package accommodates this through the parameter predetermined_rn, which takes the name of the variable that stores the predetermined random numbers. For example, if the random numbers are stored in the allocation data frame under the variable name arand, we add predetermined_rn = 'arand'.

set.seed(988)
alokasi_dt_arand = alokasi_dt_strata %>%
  mutate(arand = runif(n(),0,1))

alokasi_dt_arand %>% as.data.frame %>% head(10)

dtSampling_prn = doSampling(
  pop        = pop_dt_strata,
  alloc      = alokasi_dt_arand,
  nsample    = "n_primary",
  seed       = 974,
  method     = "systematic",
  strata     = "strata_kabkot",
  predetermined_rn = "arand",
  ident      = c("kdprov"),
  type       = "U"
)
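One way to picture the role of a predetermined random number is as a fixed random start for systematic selection: once it is supplied, the result no longer depends on the seed. The sketch below assumes that interpretation; the helper select_with_rn is hypothetical, and how doSampling actually consumes the value should be checked in the package documentation.

```r
# Hypothetical helper: systematic selection driven by a supplied
# random number rn (strictly between 0 and 1) instead of a fresh draw.
select_with_rn <- function(N, n, rn) {
  k <- N / n                           # sampling interval
  ceiling((rn + 0:(n - 1)) * k)        # positions of the selected units
}

set.seed(1)
first  <- select_with_rn(N = 10, n = 2, rn = 0.37)
set.seed(999)
second <- select_with_rn(N = 10, n = 2, rn = 0.37)

first
#> [1] 2 7
identical(first, second)               # the seed plays no role here
#> [1] TRUE
```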

Displaying the sampling result with predetermined random number

Population Sampled

head(dtSampling_prn$pop)

Units Sampled

head(dtSampling_prn$sampledf)

dtSampling_prn$sampledf %>% nrow

Sampling Details

head(dtSampling_prn$details)

Allocating Predetermined Allocations to Lower Levels

One of the supporting functions in the samplingin package is get_allocation. It distributes a sample allocation to lower levels by proportional allocation based on the square root of a specified variable.

For example, sample allocations available at the province level can be distributed to lower levels, such as districts/cities, in proportion to the square root of the total population (Total).
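The arithmetic behind square-root proportional allocation can be sketched in base R as follows. The helper sqrt_allocate is a hypothetical name, and the largest-remainder rounding it uses is an assumption made so that the parts add up to the total; get_allocation may round differently.

```r
# Hypothetical helper: split total_n across units in proportion to sqrt(size).
sqrt_allocate <- function(total_n, size) {
  w     <- sqrt(size) / sum(sqrt(size))   # square-root weights, summing to 1
  exact <- total_n * w                    # unrounded allocation
  base  <- floor(exact)                   # integer part for every unit
  left  <- total_n - sum(base)            # units still to be handed out
  # Assumed rounding rule: give the leftover units to the largest remainders.
  bump  <- order(exact - base, decreasing = TRUE)[seq_len(left)]
  base[bump] <- base[bump] + 1
  base
}

# Sizes 100, 400, 900 have square roots 10, 20, 30, so the weights are 1/6, 1/3, 1/2.
sqrt_allocate(10, c(100, 400, 900))
#> [1] 2 3 5
```

Compared with allocation proportional to size itself, the square root dampens the share of very large units, so smaller units receive relatively more of the sample.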

set.seed(242)
alokasi_prov = alokasi_dt %>%
  select(-jml_kabkota, -n_primary) %>%
  mutate(init_alloc = as.integer(runif(n(), 100, 200))) %>%
  as.data.frame()

alokasi_prov %>% head(10)

alokasi_prov %>% 
  summarise(sum(init_alloc))

alokasi_kab = pop_dt %>%
  left_join(alokasi_prov) %>%
  get_allocation(n_alloc = "init_alloc", group = c("kdprov"), pop_var = "Total") %>%
  as.data.frame()

alokasi_kab %>% head(10)

alokasi_kab %>% summarise(sum(n_primary))

alokasi_kab %>% 
  group_by(kdprov) %>% 
  summarise(sum(n_primary))

# Check that the lower-level allocations sum back to the province-level allocations

all.equal(
  alokasi_prov, alokasi_kab %>% 
  group_by(kdprov) %>% 
  summarise(init_alloc=sum(n_primary)) %>% 
  ungroup() %>% 
  as.data.frame()
)



samplingin documentation built on Sept. 28, 2024, 1:07 a.m.