```r
knitr::opts_chunk$set(
  eval = FALSE,
  echo = TRUE,
  message = FALSE,
  warning = FALSE,
  purl = FALSE
)
```
This vignette provides a complete workflow for downloading Brazil's quarterly PNADC (Pesquisa Nacional por Amostra de Domicílios Contínua) microdata and preparing it for mensalization. The workflow covers three steps:

1. Downloading quarterly microdata from IBGE with the PNADcIBGE package
2. Stacking the quarters into a single dataset
3. Identifying reference periods and calibrating weights with the PNADCperiods package

If you already have PNADC data and want to learn the package API first, see Get Started. For algorithm details, see How PNADCperiods Works.
```r
# Install packages if needed
install.packages(c("PNADCperiods", "PNADcIBGE", "fst", "data.table"))

# Load packages
library(PNADcIBGE)
library(data.table)
library(fst)
library(PNADCperiods)
```
PNADC is Brazil's primary household survey for labor market statistics, conducted by IBGE. The survey uses a rotating panel design in which each household is interviewed five times over 15 months. Each quarterly release contains approximately 500,000 observations.
Why stack multiple quarters? The mensalization algorithm identifies reference months by tracking households across their panel interviews. With a single quarter, the determination rate is only ~70%. By stacking multiple quarters, the algorithm leverages the rotating panel structure to achieve over 97% determination.
| Quarters Stacked | Month % | Fortnight % | Week % |
|------------------|---------|-------------|--------|
| 1 (single quarter) | ~70% | ~7% | ~2% |
| 8 (2 years) | ~94% | ~9% | ~3% |
| 20 (5 years) | ~95% | ~8% | ~3% |
| 55+ (full history) | ~97% | ~9% | ~3% |
For most applications, we recommend stacking at least 2 years (8 quarters) of data.
```r
# Set your data directory (adjust path as needed)
data_dir <- "path/to/your/pnadc_data/"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
```
Create a grid of year-quarter combinations. This example uses 2020-2024, which provides a good balance between data size and determination rate:
```r
# Define quarters to download (2020-2024 example)
editions <- expand.grid(
  year = 2020:2024,
  quarter = 1:4
)

# If downloading recent years, filter out quarters not yet available:
# editions <- editions[!(editions$year == 2025 & editions$quarter > 3), ]
```
The download loop fetches each quarter from IBGE and saves it in FST format for fast loading:
```r
for (i in seq_len(nrow(editions))) {
  year_i <- editions$year[i]
  quarter_i <- editions$quarter[i]
  filename <- paste0("pnadc_", year_i, "-", quarter_i, "q.fst")
  cat("Downloading:", year_i, "Q", quarter_i, "\n")

  # Download from IBGE
  pnadc_quarter <- get_pnadc(
    year = year_i,
    quarter = quarter_i,
    labels = FALSE,   # IMPORTANT: use numeric codes, not labels
    deflator = FALSE,
    design = FALSE,
    savedir = data_dir
  )

  # Save in FST format (fast serialization)
  write_fst(pnadc_quarter, file.path(data_dir, filename))

  # Clean up temporary files created by PNADcIBGE
  temp_files <- list.files(data_dir, pattern = "\\.(zip|sas|txt)$", full.names = TRUE)
  file.remove(temp_files)

  rm(pnadc_quarter)
  gc()
}
```
**Important:** Always use `labels = FALSE` when downloading. The mensalization algorithm requires numeric codes for the birthday variables (V2008, V20081, V20082); labeled factors will cause errors.
Stack all quarterly files into a single dataset. To save memory, only load the columns needed for mensalization:
```r
# Columns needed for mensalization
cols_needed <- c(
  # Time and identifiers
  "Ano", "Trimestre", "UPA", "V1008", "V1014",
  # Birthday variables (for reference period algorithm)
  "V2008", "V20081", "V20082", "V2009",
  # Weight and stratification (for weight calibration)
  "V1028", "UF", "posest", "posest_sxi"
)

# Stack all quarters
files <- list.files(data_dir, pattern = "pnadc_.*\\.fst$", full.names = TRUE)
pnadc_stacked <- rbindlist(lapply(files, function(f) {
  cat("Loading:", basename(f), "\n")
  read_fst(f, columns = cols_needed, as.data.table = TRUE)
}))

cat("Total observations:", format(nrow(pnadc_stacked), big.mark = ","), "\n")
```
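As a quick sanity check (a sketch assuming `pnadc_stacked` from the step above), counting observations per quarter confirms that every edition loaded:

```r
# One row per year-quarter; gaps here mean a missing or failed download
pnadc_stacked[, .N, by = .(Ano, Trimestre)][order(Ano, Trimestre)]
```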
Build the crosswalk (identify reference periods) and calibrate weights:
```r
# Step 1: Build crosswalk (identify reference periods)
crosswalk <- pnadc_identify_periods(pnadc_stacked, verbose = TRUE)

# Check determination rates
crosswalk[, .(
  month_rate = mean(determined_month),
  fortnight_rate = mean(determined_fortnight),
  week_rate = mean(determined_week)
)]

# Step 2: Apply crosswalk and calibrate weights
result <- pnadc_apply_periods(
  data = pnadc_stacked,
  crosswalk = crosswalk,
  weight_var = "V1028",
  anchor = "quarter",
  calibrate = TRUE,
  calibration_unit = "month",
  verbose = TRUE
)
```
The verbose output shows progress and determination rates for each phase (month, fortnight, week). With 20 quarters stacked (2020-2024), expect ~95% month determination.
The result contains all original columns plus reference period indicators and calibrated weights:
```r
# Key new columns
names(result)[grep("ref_|determined_|weight_", names(result))]

# Distribution of reference months within quarters
result[, .N, by = ref_month_in_quarter][order(ref_month_in_quarter)]
```
Key output columns:
| Column | Description |
|--------|-------------|
| `ref_month_in_quarter` | Position within the quarter (1, 2, or 3; NA if indeterminate) |
| `ref_month_yyyymm` | Reference month as a YYYYMM integer (e.g., 202301) |
| `determined_month` | Logical flag (TRUE if the month was determined) |
| `weight_monthly` | Calibrated monthly weight (if `calibrate = TRUE`) |
The distribution is approximately equal across months 1, 2, and 3 (each around 31-32%), with the remaining observations having NA for indeterminate cases.
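This can be checked directly (a quick sketch, assuming `result` from the previous step), computing shares rather than counts:

```r
# Share of observations by reference month position (NA = indeterminate)
result[, .(share = .N / nrow(result)), by = ref_month_in_quarter][
  order(ref_month_in_quarter)]
```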
Save the mensalized data for future use:
```r
write_fst(result, file.path(data_dir, "pnadc_mensalized.fst"))
```
To compute monthly estimates, filter to determined observations and aggregate by `ref_month_yyyymm`:
```r
# Monthly unemployment rate
monthly_unemployment <- result[determined_month == TRUE, .(
  unemployment_rate = sum((VD4002 == 2) * weight_monthly, na.rm = TRUE) /
    sum((VD4001 == 1) * weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]

# Monthly population
monthly_pop <- result[determined_month == TRUE, .(
  population = sum(weight_monthly, na.rm = TRUE)
), by = ref_month_yyyymm]
```
For more analysis examples, see Applied Examples.
**Selective column loading:** Only load the columns you need with `read_fst(..., columns = ...)`. This dramatically reduces memory usage.
**Process in batches:** For very large analyses, process one year at a time and combine results.
**Use FST format:** FST is much faster than CSV or RDS for large datasets. A typical quarter loads in seconds rather than minutes.
**Clean up regularly:** Use `rm()` and `gc()` to free memory after processing each quarter.
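The batching tip above can be sketched as follows. This is a minimal outline, assuming the `data_dir`, `cols_needed`, and per-quarter FST files from the earlier steps; replace the placeholder per-year summary with your own analysis:

```r
# Process one year at a time so only one year of microdata is in memory
yearly_results <- lapply(2020:2024, function(y) {
  files_y <- list.files(data_dir, pattern = paste0("^pnadc_", y, ".*\\.fst$"),
                        full.names = TRUE)
  dt <- rbindlist(lapply(files_y, read_fst,
                         columns = cols_needed, as.data.table = TRUE))
  out <- dt[, .N, by = .(Ano, Trimestre)]  # placeholder per-year summary
  rm(dt); gc()                             # free memory before the next year
  out
})
batch_summary <- rbindlist(yearly_results)
```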
| Period | Quarters | Observations | FST Size |
|--------|----------|--------------|----------|
| 2020-2024 | 20 | ~8.9M | ~5 GB |
| 2012-2025 | 55 | ~29M | ~15 GB |
For the best determination rate and longitudinal analysis, download all available quarters:
```r
# Download all available data (2012-present)
editions_full <- expand.grid(
  year = 2012:2025,
  quarter = 1:4
)
editions_full <- editions_full[!(editions_full$year == 2025 & editions_full$quarter > 3), ]

# Use the same download and stacking workflow as above
```
The full history provides approximately 29 million observations and achieves the highest possible determination rate (~97% month).
**"Column not found" errors:** Ensure you used `labels = FALSE` when downloading. The algorithm requires numeric codes.
**Download failures:** IBGE servers can be slow or unavailable. The PNADcIBGE package will retry automatically, but you may need to restart interrupted downloads.
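One way to restart cleanly (a sketch, reusing the `editions` grid and `data_dir` from the download step) is to skip quarters whose FST file already exists:

```r
# Resume an interrupted download: skip quarters already saved to disk
for (i in seq_len(nrow(editions))) {
  filename <- paste0("pnadc_", editions$year[i], "-", editions$quarter[i], "q.fst")
  if (file.exists(file.path(data_dir, filename))) {
    cat("Already downloaded, skipping:", filename, "\n")
    next
  }
  # ... run the get_pnadc()/write_fst() steps from the download loop above ...
}
```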
**Memory errors:** Try processing fewer quarters at a time, or use a machine with more RAM.
**SIDRA API errors:** Weight calibration requires internet access to the SIDRA API. If it fails, try again later, or use `calibrate = FALSE` for reference period identification without weight calibration.
Working with annual PNADC data? Annual data (visit-specific microdata with comprehensive income variables) requires a different workflow. See Monthly Poverty Analysis with Annual PNADC Data for details on using `pnadc_apply_periods()` with `anchor = "year"`.